{"ID":3004958,"CreatedAt":"2026-06-03T03:09:48.883664427Z","UpdatedAt":"2026-06-04T19:14:31.964469513Z","DeletedAt":null,"paper_url":"https://arxiv.org/abs/2606.03132","arxiv_id":"2606.03132","title":"DMT-CBT: Longitudinal Therapeutic State Modeling for CBT Counseling","abstract":"Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this formulation fundamentally mismatches the nature of real psychotherapy. In clinical CBT, therapy is a longitudinal process in which therapists continuously infer, update, and intervene on evolving therapeutic states across sessions. Realistic CBT further involves multimodal inference and delayed cross-session intervention effects, requiring models to capture longitudinal therapeutic state evolution under partial observability. We propose DMT-CBT, a framework for Dynamic Modeling of evolving Therapeutic states in CBT counseling. DMT-CBT maintains structured therapeutic states across sessions while incorporating multimodal behavioral grounding and tool-augmented intervention to support adaptive therapeutic reasoning. Based on this framework, we construct DMTCorpus, a synthetic multi-session multimodal CBT counseling dataset featuring evolving therapeutic states, image-grounded client behaviors, and cross-session intervention continuity. Experimental results show that DMT-CBT improves counseling fidelity and therapeutic alliance, produces more favorable longitudinal affective trajectories, and preserves therapeutic states more faithfully than post-hoc extraction approaches.","short_abstract":"Large language models (LLMs) have shown growing potential for Cognitive Behavioral Therapy (CBT) counseling. However, most existing approaches still formulate counseling as a local response generation problem, focusing on empathetic replies within short, text-only, or single-session interactions. We argue that this for...","url_abs":"https://arxiv.org/abs/2606.03132","url_pdf":"https://arxiv.org/pdf/2606.03132v1","authors":"[\"Chang Liu\",\"Shuyi Zhang\",\"Changsheng Ma\",\"Yongfeng Tao\",\"Minqiang Yang\",\"Bin Hu\"]","published":"2026-06-02T04:18:25Z","proceeding":"cs.CL","tasks":"[\"cs.CL\"]","methods":"[\"Large Language Model\",\"Language Model\"]","has_code":false}