ECHO: Towards Emotionally Appropriate and Contextually Aware Interactive Head Generation

1University of Exeter    2Meituan    3Nankai University    4Tongji University

Abstract


In natural face-to-face interaction, participants seamlessly alternate between speaking and listening, producing facial behaviors (FBs) that are finely informed by long-range context and naturally exhibit contextual appropriateness and emotional rationality. Interactive Head Generation (IHG) aims to synthesize lifelike avatar head videos that emulate such capabilities. Existing IHG methods typically condition on dual-track signals (i.e., the human user's behaviors and the avatar's pre-defined audio) within a short temporal window, jointly driving the generation of the avatar's audio-aligned lip articulation and non-verbal FBs. However, two main challenges persist in these methods: (i) the reliance on short-clip behavioral cues without long-range contextual modeling leads them to produce facial behaviors lacking contextual appropriateness; and (ii) the entangled, role-agnostic fusion of dual-track signals empirically introduces cross-signal interference, potentially compromising lip-region synchronization during speaking. To this end, we propose ECHO, a novel IHG framework comprising two key components: a Long-range Contextual Understanding (LCU) component that captures both behavior-grounded dynamics and linguistic-driven affective semantics to promote the contextual appropriateness and emotional rationality of synthesized avatar FBs; and a block-wise Spatial-aware Decoupled Cross-attention Modulation (SDCM) module that preserves self-audio-driven lip articulation while adaptively integrating the user's contextual behavioral cues for non-lip facial regions, complemented by our two-stage training paradigm, to jointly enhance lip synchronization and visual fidelity. Extensive experiments demonstrate the effectiveness of the proposed components and ECHO's superior IHG performance.

Methodology


ECHO overview

Overview of ECHO and pipeline comparison with existing IHG approaches. (a) Approaches that rely exclusively on dual-track audio signals for avatar FB generation; (b) approaches that condition avatar FB generation on short-range-only context modeling with role-agnostic fusion; (c) our ECHO, which performs long-range contextual understanding and employs a region-wise decoupled cross-attention mechanism, achieving contextually appropriate and precisely lip-synchronized avatar FBs.
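To make the long-range contextual understanding step concrete, the sketch below shows one plausible way to pool a long window of user behavior features together with dialogue-derived affective embeddings into a small set of context tokens via learned queries. The module name, dimensions, and query-based pooling strategy are illustrative assumptions for this page, not the paper's released implementation.

# Hypothetical sketch of LCU-style long-range context pooling (assumed design).
import torch
import torch.nn as nn

class LongRangeContextPooler(nn.Module):
    def __init__(self, behavior_dim=512, affect_dim=256, d_model=512,
                 num_context_tokens=16, num_heads=8, num_layers=2):
        super().__init__()
        self.behavior_proj = nn.Linear(behavior_dim, d_model)
        self.affect_proj = nn.Linear(affect_dim, d_model)
        # Learned queries that summarize a long window into a few context tokens.
        self.context_queries = nn.Parameter(torch.randn(num_context_tokens, d_model))
        layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
        self.pooler = nn.TransformerDecoder(layer, num_layers)

    def forward(self, user_behavior_feats, affect_embeds):
        # user_behavior_feats: (B, T_long, behavior_dim) long-window user behavior features
        # affect_embeds:       (B, T_text, affect_dim)   dialogue-derived affective embeddings
        memory = torch.cat([self.behavior_proj(user_behavior_feats),
                            self.affect_proj(affect_embeds)], dim=1)
        queries = self.context_queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        # Cross-attend the learned queries over the long-range memory to obtain
        # compact context tokens that later condition the generator.
        return self.pooler(queries, memory)  # (B, num_context_tokens, d_model)

The compact token set keeps the conditioning cost independent of the raw context length, which is one reasonable way such long-range cues could be fed to the downstream generator.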

ECHO pipeline

The pipeline of our proposed ECHO framework. ECHO performs long-range contextual understanding grounded in user behavioral dynamics and dyadic linguistic cues to promote the contextual appropriateness of the generated avatar facial behaviors. The SDCM module is introduced to achieve faithful self-audio-driven lip articulation alongside appropriate non-lip facial dynamics.
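As a rough illustration of the region-wise decoupled conditioning idea, the sketch below routes lip-region tokens to the avatar's own audio features and non-lip tokens to pooled user/context features through two separate cross-attention paths. The class name, the binary lip mask, and all dimensions are hypothetical; this is a simplified stand-in, not the exact block-wise, spatial-aware SDCM design used in the paper.

# Hypothetical sketch of region-wise decoupled cross-attention (assumed design).
import torch
import torch.nn as nn

class DecoupledRegionCrossAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.context_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, face_tokens, audio_feats, context_tokens, lip_mask):
        # face_tokens:    (B, N, d_model) spatial tokens of the avatar face
        # audio_feats:    (B, T_a, d_model) avatar's own speech features
        # context_tokens: (B, T_c, d_model) pooled long-range user/context features
        # lip_mask:       (B, N) boolean, True where a token lies in the lip region
        x = self.norm(face_tokens)
        # Lip tokens are conditioned only on the avatar's own audio ...
        lip_out, _ = self.audio_attn(x, audio_feats, audio_feats)
        # ... while non-lip tokens are conditioned on user/context cues.
        ctx_out, _ = self.context_attn(x, context_tokens, context_tokens)
        # Route each spatial token to exactly one conditioning stream so that
        # contextual cues cannot interfere with audio-driven lip articulation.
        mask = lip_mask.unsqueeze(-1).to(face_tokens.dtype)
        return face_tokens + mask * lip_out + (1.0 - mask) * ctx_out

Separating the two attention paths in this way is one simple mechanism that matches the paper's stated goal of avoiding cross-signal interference in the lip region.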

Experimental Results


Comparison with State-of-the-Art Approaches

We present video-based qualitative comparisons to demonstrate ECHO's advantages across diverse conversational contexts, encompassing seamless interaction, listening, and speaking scenarios on the RealTalk, HDTF, and Seamless Interaction test sets.

We present additional quantitative and qualitative comparisons on the core Interactive Head Generation (IHG) task, which comprehensively evaluates both speaking and listening capabilities under dyadic interactive settings.

Quantitative results

Quantitative comparison with state-of-the-art IHG approaches. The results demonstrate ECHO's superior performance in reactiveness, motion diversity, visual quality, and lip synchronization under dyadic interaction settings.

Qualitative comparison

Qualitative comparison with open-sourced SOTA methods on interactive head generation. The two dyadic scenarios above, covering active listening and speaking, show that our proposed ECHO generates avatar heads with more context-consistent and emotionally appropriate facial behaviors and achieves lip articulation with better audio-visual alignment.

Ablation Study

To clearly attribute performance gains to individual designs, we report four component-level ablations and, for each, jointly analyze quantitative metrics and qualitative results. (a) Top-left: ablation on low-level perception encoding (LPE), which leverages long-range visual cues; (b) bottom-left: ablation on high-level behavior-grounded context understanding (HBCU); (c) top-right: ablation on the SDCM module; (d) bottom-right: ablation on linguistic-driven affective understanding (LAU).

Ablation study