Computation and Language 96
★ StepWiser: Stepwise Generative Judges for Wiser Reasoning
Wei Xiong, Wenting Zhao, Weizhe Yuan, Olga Golovneva, Tong Zhang, Jason Weston, Sainbayar Sukhbaatar
As models increasingly leverage multi-step reasoning strategies to solve
complex problems, supervising the logical validity of these intermediate steps
has become a critical research challenge. Process reward models address this by
providing step-by-step feedback, but current approaches have two major
drawbacks: they typically function as classifiers without providing
explanations, and their reliance on supervised fine-tuning with static datasets
limits generalization. Inspired by recent advances, we reframe stepwise reward
modeling from a classification task to a reasoning task itself. We thus propose
a generative judge that reasons about the policy model's reasoning steps (i.e.,
meta-reasons), outputting thinking tokens before delivering a final verdict.
Our model, StepWiser, is trained by reinforcement learning using relative
outcomes of rollouts. We show that it (i) provides better judgment accuracy on
intermediate steps than existing methods; (ii) can be used to improve the
policy model at training time; and (iii) improves inference-time search.
★ Generative Interfaces for Language Models
Large language models (LLMs) are increasingly seen as assistants, copilots,
and consultants, capable of supporting a wide range of tasks through natural
conversation. However, most systems remain constrained by a linear
request-response format that often makes interactions inefficient in
multi-turn, information-dense, and exploratory tasks. To address these
limitations, we propose Generative Interfaces for Language Models, a paradigm
in which LLMs respond to user queries by proactively generating user interfaces
(UIs) that enable more adaptive and interactive engagement. Our framework
leverages structured interface-specific representations and iterative
refinements to translate user queries into task-specific UIs. For systematic
evaluation, we introduce a multidimensional assessment framework that compares
generative interfaces with traditional chat-based ones across diverse tasks,
interaction patterns, and query types, capturing functional, interactive, and
emotional aspects of user experience. Results show that generative interfaces
consistently outperform conversational ones, with humans preferring them in
over 70% of cases. These findings clarify when and why users favor generative
interfaces, paving the way for future advancements in human-AI interaction.
comment: Preprint
☆ Evaluating the Evaluators: Are readability metrics good measures of readability?
Plain Language Summarization (PLS) aims to distill complex documents into
accessible summaries for non-expert audiences. In this paper, we conduct a
thorough survey of PLS literature, and identify that the current standard
practice for readability evaluation is to use traditional readability metrics,
such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in
other fields, these metrics have not been compared to human readability
judgments in PLS. We evaluate 8 readability metrics and show that most
correlate poorly with human judgments, including the most popular metric, FKGL.
We then show that Language Models (LMs) are better judges of readability, with
the best-performing model achieving a Pearson correlation of 0.56 with human
judgments. Extending our analysis to PLS datasets, which contain summaries
aimed at non-expert audiences, we find that LMs better capture deeper measures
of readability, such as required background knowledge, and lead to different
conclusions than the traditional metrics. Based on these findings, we offer
recommendations for best practices in the evaluation of plain language
summaries. We release our analysis code and survey data.
☆ VibeVoice Technical Report
Zhiliang Peng, Jianwei Yu, Wenhui Wang, Yaoyao Chang, Yutao Sun, Li Dong, Yi Zhu, Weijiang Xu, Hangbo Bao, Zehua Wang, Shaohan Huang, Yan Xia, Furu Wei
This report presents VibeVoice, a novel model designed to synthesize
long-form speech with multiple speakers by employing next-token diffusion,
which is a unified method for modeling continuous data by autoregressively
generating latent vectors via diffusion. To enable this, we introduce a novel
continuous speech tokenizer that, when compared to the popular Encodec model,
improves data compression by 80 times while maintaining comparable performance.
The tokenizer effectively preserves audio fidelity while significantly boosting
computational efficiency for processing long sequences. Thus, VibeVoice can
synthesize long-form speech for up to 90 minutes (in a 64K context window
length) with a maximum of 4 speakers, capturing the authentic conversational
"vibe" and surpassing open-source and proprietary dialogue models.
☆ Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
Scientific problem solving poses unique challenges for LLMs, requiring both
deep domain knowledge and the ability to apply such knowledge through complex
reasoning. While automated scientific reasoners hold great promise for
assisting human scientists, there is currently no widely adopted holistic
benchmark for evaluating scientific reasoning, and few approaches
systematically disentangle the distinct roles of knowledge and reasoning in
these tasks. To address these gaps, we introduce SciReas, a diverse suite of
existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a
selective subset that requires more complex reasoning. Our holistic evaluation
surfaces insights about scientific reasoning performance that remain hidden
when relying on individual benchmarks alone. We then propose KRUX, a probing
framework for studying the distinct roles of reasoning and knowledge in
scientific tasks. Combining the two, we conduct an in-depth analysis that
yields several key findings: (1) Retrieving task-relevant knowledge from model
parameters is a critical bottleneck for LLMs in scientific reasoning; (2)
Reasoning models consistently benefit from external knowledge added in-context
on top of the reasoning enhancement; (3) Enhancing verbalized reasoning
improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct
a lightweight analysis, comparing our science-focused data composition with
concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline
for scientific reasoning.
comment: 28 pages, 16 figures
☆ The Ramon Llull's Thinking Machine for Automated Ideation
Xinran Zhao, Boyuan Zheng, Chenglei Si, Haofei Yu, Ken Liu, Runlong Zhou, Ruochen Li, Tong Chen, Xiang Li, Yiming Zhang, Tongshuang Wu
This paper revisits Ramon Llull's Ars combinatoria - a medieval framework for
generating knowledge through symbolic recombination - as a conceptual
foundation for building a modern Llull's thinking machine for research
ideation. Our approach defines three compositional axes: Theme (e.g.,
efficiency, adaptivity), Domain (e.g., question answering, machine
translation), and Method (e.g., adversarial training, linear attention). These
elements represent high-level abstractions common in scientific work -
motivations, problem settings, and technical approaches - and serve as building
blocks for LLM-driven exploration. We mine elements from human experts or
conference papers and show that prompting LLMs with curated combinations
produces research ideas that are diverse, relevant, and grounded in current
literature. This modern thinking machine offers a lightweight, interpretable
tool for augmenting scientific creativity and suggests a path toward
collaborative ideation between humans and AI.
comment: 21 pages, 3 figures
☆ Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs EMNLP2025
Large vision-language models (LVLMs) demonstrate strong visual question
answering (VQA) capabilities but are shown to hallucinate. A reliable model
should perceive its knowledge boundaries-knowing what it knows and what it does
not. This paper investigates LVLMs' perception of their knowledge boundaries by
evaluating three types of confidence signals: probabilistic confidence, answer
consistency-based confidence, and verbalized confidence. Experiments on three
LVLMs across three VQA datasets show that, although LVLMs possess a reasonable
perception level, there is substantial room for improvement. Among the three
confidences, probabilistic and consistency-based signals are more reliable
indicators, while verbalized confidence often leads to overconfidence. To
enhance LVLMs' perception, we adapt several established confidence calibration
methods from Large Language Models (LLMs) and propose three effective methods.
Additionally, we compare LVLMs with their LLM counterparts, finding that
jointly processing visual and textual inputs decreases question-answering
performance but reduces confidence, resulting in an improved perception level
compared to LLMs.
comment: EMNLP2025 Findings
☆ Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic
Quantitative Discourse Analysis has seen growing adoption with the rise of
Large Language Models and computational tools. However, reliance on black box
software such as MAXQDA and NVivo risks undermining methodological transparency
and alignment with research goals. This paper presents a hybrid, transparent
framework for QDA that combines lexical and semantic methods to enable
triangulation, reproducibility, and interpretability. Drawing from a case study
in historical political discourse, we demonstrate how custom Python pipelines
using NLTK, spaCy, and Sentence Transformers allow fine-grained control over
preprocessing, lemmatisation, and embedding generation. We further detail our
iterative BERTopic modelling process, incorporating UMAP dimensionality
reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised
through parameter tuning and multiple runs to enhance topic coherence and
coverage. By juxtaposing precise lexical searches with context-aware semantic
clustering, we argue for a multi-layered approach that mitigates the
limitations of either method in isolation. Our workflow underscores the
importance of code-level transparency, researcher agency, and methodological
triangulation in computational discourse studies. Code and supplementary
materials are available via GitHub.
comment: 5 pages conference paper, 4 tables
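A hedged sketch of the embeddings-UMAP-HDBSCAN-BERTopic pipeline described
above; the corpus, embedding model, and clustering parameters are illustrative
choices, not the authors' settings:

    from sklearn.datasets import fetch_20newsgroups
    from sentence_transformers import SentenceTransformer
    from umap import UMAP
    from hdbscan import HDBSCAN
    from bertopic import BERTopic

    # stand-in corpus; the paper works on historical political discourse
    docs = fetch_20newsgroups(subset="train",
                              remove=("headers", "footers", "quotes")).data[:2000]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
    hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)

    topic_model = BERTopic(embedding_model=embedder,
                           umap_model=umap_model,
                           hdbscan_model=hdbscan_model)  # c-TF-IDF keywords are computed internally
    topics, probs = topic_model.fit_transform(docs)
    print(topic_model.get_topic_info().head())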
☆ Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index
This research presents a Retrieval-Augmented Generation (RAG) framework for
art provenance studies, focusing on the Getty Provenance Index. Provenance
research establishes the ownership history of artworks, which is essential for
verifying authenticity, supporting restitution and legal claims, and
understanding the cultural and historical context of art objects. The process
is complicated by fragmented, multilingual archival data that hinders efficient
retrieval. Current search portals require precise metadata, limiting
exploratory searches. Our method enables natural-language and multilingual
searches through semantic retrieval and contextual summarization, reducing
dependence on metadata structures. We assess RAG's capability to retrieve and
summarize auction records using a 10,000-record sample from the Getty
Provenance Index - German Sales. The results show this approach provides a
scalable solution for navigating art market archives, offering a practical tool
for historians and cultural heritage professionals conducting historically
sensitive research.
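A minimal sketch of the semantic-retrieval step of such a RAG pipeline,
assuming sentence-transformers; the records are invented stand-ins for auction
entries, and the final LLM summarization call is left as a stub:

    from sentence_transformers import SentenceTransformer, util

    records = [
        "Sale of a landscape painting, Berlin auction house, 1932, lot 45.",
        "Portrait of a merchant sold in Munich, 1938, consigned by a private collector.",
        "Still life acquired by a Cologne dealer, 1936, provenance unclear.",
    ]
    query = "Which paintings were sold in Munich in the late 1930s?"

    model = SentenceTransformer("all-MiniLM-L6-v2")
    record_emb = model.encode(records, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)

    scores = util.cos_sim(query_emb, record_emb)[0]    # semantic similarity per record
    top = scores.topk(k=2).indices.tolist()            # retrieve the closest records
    context = "\n".join(records[i] for i in top)
    prompt = f"Summarize the provenance records relevant to: {query}\n\n{context}"
    print(prompt)  # this prompt would then go to an LLM for contextual summarization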
☆ It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs EMNLP 2025
Extremely low-resource languages, especially those written in rare scripts,
as shown in Figure 1, remain largely unsupported by large language models
(LLMs). This is due in part to compounding factors such as the lack of training
data. This paper delivers the first comprehensive analysis of whether LLMs can
acquire such languages purely via in-context learning (ICL), with or without
auxiliary alignment signals, and how these methods compare to
parameter-efficient fine-tuning (PEFT). We systematically evaluate 20
under-represented languages across three state-of-the-art multilingual LLMs.
Our findings highlight the limitations of PEFT when both a language and its script
are extremely under-represented by the LLM. In contrast, zero-shot ICL with
language alignment is impressively effective on extremely low-resource
languages, while few-shot ICL or PEFT is more beneficial for languages
relatively better represented by LLMs. For LLM practitioners working on
extremely low-resource languages, we summarise guidelines grounded in our
results on adapting LLMs to low-resource languages, e.g., avoiding fine-tuning
a multilingual model on languages of unseen scripts.
comment: Accepted by EMNLP 2025
☆ "Where does it hurt?" -- Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues ECAI 2025
Tom Röhr, Soumyadeep Roy, Fares Al Mohamad, Jens-Michalis Papaioannou, Wolfgang Nejdl, Felix Gers, Alexander Löser
In a doctor-patient dialogue, the primary objective of physicians is to
diagnose patients and propose a treatment plan. Medical doctors guide these
conversations through targeted questioning to efficiently gather the
information required to provide the best possible outcomes for patients. To the
best of our knowledge, this is the first work that studies physician intent
trajectories in doctor-patient dialogues. We use the `Ambient Clinical
Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with
medical professionals to develop a fine-grained taxonomy of physician intents
based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We
then conduct a large-scale annotation effort to label over 5000 doctor-patient
turns with the help of a large number of medical experts recruited using
Prolific, a popular crowd-sourcing platform. This large labeled dataset is an
important resource contribution that we use for benchmarking the
state-of-the-art generative and encoder models for medical intent
classification tasks. Our findings show that our models understand the general
structure of medical dialogues with high accuracy, but often fail to identify
transitions between SOAP categories. We also report for the first time common
trajectories in medical dialogue structures that provide valuable insights for
designing `differential diagnosis' systems. Finally, we extensively study the
impact of intent filtering for medical dialogue summarization and observe a
significant boost in performance. We make the codes and data, including
annotation guidelines, publicly available at
https://github.com/DATEXIS/medical-intent-classification.
comment: Accepted at ECAI 2025
☆ HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance
Large language model (LLM)-based agents have demonstrated remarkable
capabilities in decision-making tasks, but struggle significantly with complex,
long-horizon planning scenarios. This arises from their lack of macroscopic
guidance, causing disorientation and failures in complex tasks, as well as
insufficient continuous oversight during execution, rendering them unresponsive
to environmental changes and prone to deviations. To tackle these challenges,
we introduce HiPlan, a hierarchical planning framework that provides adaptive
global-local guidance to boost LLM-based agents' decision-making. HiPlan
decomposes complex tasks into milestone action guides for general direction and
step-wise hints for detailed actions. During the offline phase, we construct a
milestone library from expert demonstrations, enabling structured experience
reuse by retrieving semantically similar tasks and milestones. In the execution
phase, trajectory segments from past milestones are dynamically adapted to
generate step-wise hints that align current observations with the milestone
objectives, bridging gaps and correcting deviations. Extensive experiments
across two challenging benchmarks demonstrate that HiPlan substantially
outperforms strong baselines, and ablation studies validate the complementary
benefits of its hierarchical components.
☆ MovieCORE: COgnitive REasoning in Movies EMNLP'2025
Gueter Josmy Faure, Min-Hung Chen, Jia-Fong Yeh, Ying Cheng, Hung-Ting Su, Yung-Hao Tang, Shang-Hong Lai, Winston H. Hsu
This paper introduces MovieCORE, a novel video question answering (VQA)
dataset designed to probe deeper cognitive understanding of movie content.
Unlike existing datasets that focus on surface-level comprehension, MovieCORE
emphasizes questions that engage System-2 thinking while remaining specific to
the video material. We present an innovative agentic brainstorming approach,
utilizing multiple large language models (LLMs) as thought agents to generate
and refine high-quality question-answer pairs. To evaluate dataset quality, we
develop a set of cognitive tests assessing depth, thought-provocation
potential, and syntactic complexity. We also propose a comprehensive evaluation
scheme for assessing VQA model performance on deeper cognitive tasks. To
address the limitations of existing video-language models (VLMs), we introduce
an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves
model reasoning capabilities post-training by up to 25%. Our work contributes
to advancing movie understanding in AI systems and provides valuable insights
into the capabilities and limitations of current VQA models when faced with
more challenging, nuanced questions about cinematic content. Our project page,
dataset and code can be found at
https://joslefaure.github.io/assets/html/moviecore.html.
comment: Accepted for EMNLP'2025 Main Conference. Project Page:
https://joslefaure.github.io/assets/html/moviecore.html
☆ Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark
Yuxuan Cai, Yipeng Hao, Jie Zhou, Hang Yan, Zhikai Lei, Rui Zhen, Zhenhua Han, Yutao Yang, Junsong Li, Qianjun Pan, Tianyu Huai, Qin Chen, Xin Li, Kai Chen, Bo Zhang, Xipeng Qiu, Liang He
As AI advances toward general intelligence, the focus is shifting from
systems optimized for static tasks to creating open-ended agents that learn
continuously. In this paper, we introduce Experience-driven Lifelong Learning
(ELL), a framework for building self-evolving agents capable of continuous
growth through real-world interaction. The framework is built on four core
principles: (1) Experience Exploration: Agents learn through continuous,
self-motivated interaction with dynamic environments, navigating interdependent
tasks and generating rich experiential trajectories. (2) Long-term Memory:
Agents preserve and structure historical knowledge, including personal
experiences, domain expertise, and commonsense reasoning, into a persistent
memory system. (3) Skill Learning: Agents autonomously improve by abstracting
recurring patterns from experience into reusable skills, which are actively
refined and validated for application in new tasks. (4) Knowledge
Internalization: Agents internalize explicit and discrete experiences into
implicit and intuitive capabilities as "second nature".
We also introduce StuLife, a benchmark dataset for ELL that simulates a
student's holistic college journey, from enrollment to academic and personal
development, across three core phases and ten detailed sub-scenarios. StuLife
is designed around three key paradigm shifts: From Passive to Proactive, From
Context to Memory, and From Imitation to Learning. In this dynamic environment,
agents must acquire and distill practical skills and maintain persistent memory
to make decisions based on evolving state variables. StuLife provides a
comprehensive platform for evaluating lifelong learning capabilities, including
memory retention, skill transfer, and self-motivated behavior. Beyond
evaluating SOTA LLMs on the StuLife benchmark, we also explore the role of
context engineering in advancing AGI.
☆ Automatic Prompt Optimization with Prompt Distillation
Autoprompting is the process of automatically selecting optimized prompts for
language models, which is gaining popularity due to the rapid development of
prompt engineering driven by extensive research in the field of large language
models (LLMs). This paper presents DistillPrompt -- a novel autoprompting
method based on large language models that employs a multi-stage integration of
task-specific information into prompts using training data. DistillPrompt
utilizes distillation, compression, and aggregation operations to explore the
prompt space more thoroughly. The method was tested on different datasets for
text classification and generation tasks using the t-lite-instruct-0.1 language
model. The results demonstrate a significant average improvement (e.g., 20.12%
across the entire dataset compared to Grips) in key metrics over existing
methods in the field, establishing DistillPrompt as one of the most effective
non-gradient approaches in autoprompting.
☆ Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models
We present a framework where neural models develop an AI Mother Tongue, a
native symbolic language that simultaneously supports intuitive reasoning,
compositional symbol chains, and inherent interpretability. Unlike post-hoc
explanation methods, our approach embeds reasoning directly into the model's
representations: symbols capture meaningful semantic patterns, chains trace
decision paths, and gated induction mechanisms guide selective focus, yielding
transparent yet flexible reasoning. We introduce complementary training
objectives to enhance symbol purity and decision sparsity, and employ a
sequential specialization strategy to first build broad symbolic competence and
then refine intuitive judgments. Experiments on AI tasks demonstrate
competitive accuracy alongside verifiable reasoning traces, showing that AI
Mother Tongue can serve as a unified mechanism for interpretability, intuition,
and symbolic reasoning in neural models.
comment: 25 pages, 9 figures. The AI Intuition Explorer dashboard is available
at: https://cyrilliu1974.github.io/github.io/vi.html
☆ The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization CCS 2025
Differentially private text sanitization refers to the process of privatizing
texts under the framework of Differential Privacy (DP), providing provable
privacy guarantees while also empirically defending against adversaries seeking
to harm privacy. Despite their simplicity, DP text sanitization methods
operating at the word level exhibit a number of shortcomings, among them the
tendency to leave contextual clues from the original texts due to randomization
during sanitization -- this we refer to as contextual
vulnerability. Given the powerful contextual understanding and inference
capabilities of Large Language Models (LLMs), we explore to what extent LLMs
can be leveraged to exploit the contextual vulnerability of DP-sanitized texts.
We expand on previous work not only in the use of advanced LLMs, but also in
testing a broader range of sanitization mechanisms at various privacy levels.
Our experiments uncover a double-edged sword effect of LLM-based data
reconstruction attacks on privacy and utility: while LLMs can indeed infer
original semantics and sometimes degrade empirical privacy protections, they
can also be used for good, to improve the quality and privacy of DP-sanitized
texts. Based on our findings, we propose recommendations for using LLM data
reconstruction as a post-processing step, serving to increase privacy
protection by thinking adversarially.
comment: 15 pages, 4 figures, 8 tables. Accepted to WPES @ CCS 2025
☆ Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework ECAI 2025
Retrieval-augmented generation (RAG) systems improve large language model
outputs by incorporating external knowledge, enabling more informed and
context-aware responses. However, the effectiveness and trustworthiness of
these systems critically depend on how they are evaluated, particularly on
whether the evaluation process captures real-world constraints like protecting
sensitive information. While current evaluation efforts for RAG systems have
primarily focused on the development of performance metrics, far less attention
has been given to the design and quality of the underlying evaluation datasets,
despite their pivotal role in enabling meaningful, reliable assessments. In
this work, we introduce a novel multi-agent framework for generating synthetic
QA datasets for RAG evaluation that prioritize semantic diversity and privacy
preservation. Our approach involves: (1) a Diversity agent leveraging
clustering techniques to maximize topical coverage and semantic variability,
(2) a Privacy Agent that detects and masks sensitive information across multiple
domains, and (3) a QA curation agent that synthesizes private and diverse QA
pairs suitable as ground truth for RAG evaluation. Extensive experiments
demonstrate that our evaluation sets outperform baseline methods in diversity
and achieve robust privacy masking on domain-specific datasets. This work
offers a practical and ethically aligned pathway toward safer, more
comprehensive RAG system evaluation, laying the foundation for future
enhancements aligned with evolving AI regulations and compliance standards.
comment: ECAI 2025 TRUST AI workshop
☆ Affective Polarization across European Parliaments
Affective polarization, characterized by increased negativity and hostility
towards opposing groups, has become a prominent feature of political discourse
worldwide. Our study examines the presence of this type of polarization in a
selection of European parliaments in a fully automated manner. Utilizing a
comprehensive corpus of parliamentary speeches from the parliaments of six
European countries, we employ natural language processing techniques to
estimate parliamentarian sentiment. By comparing the levels of negativity
conveyed in references to individuals from opposing groups versus one's own, we
discover patterns of affectively polarized interactions. The findings
demonstrate the existence of consistent affective polarization across all six
European parliaments. Although activity correlates with negativity, there is no
observed difference in affective polarization between less active and more
active members of parliament. Finally, we show that reciprocity is a
contributing mechanism in affective polarization between parliamentarians
across all six parliaments.
comment: 6 pages, 4 figures
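A toy sketch of the core measurement: comparing the negativity of references to
out-group versus in-group parliamentarians (the sentiment values below are
invented, not the study's data):

    import pandas as pd

    refs = pd.DataFrame({
        "speaker_party": ["A", "A", "B", "B", "A", "B"],
        "target_party":  ["B", "A", "A", "B", "B", "A"],
        "negativity":    [0.71, 0.22, 0.65, 0.18, 0.80, 0.59],
    })
    refs["out_group"] = refs["speaker_party"] != refs["target_party"]
    gap = refs.groupby("out_group")["negativity"].mean()
    print(gap)                          # mean negativity toward own vs. opposing groups
    print("affective polarization gap:", gap[True] - gap[False])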
☆ Empowering Computing Education Researchers Through LLM-Assisted Content Analysis
Computing education research (CER) is often instigated by practitioners
wanting to improve both their own and the wider discipline's teaching practice.
However, the latter is often difficult as many researchers lack the colleagues,
resources, or capacity to conduct research that is generalisable or rigorous
enough to advance the discipline. As a result, research methods that enable
sense-making with larger volumes of qualitative data, while not increasing the
burden on the researcher, have significant potential within CER.
In this discussion paper, we propose such a method for conducting rigorous
analysis on large volumes of textual data, namely a variation of LLM-assisted
content analysis (LACA). This method combines content analysis with the use of
large language models, empowering researchers to conduct larger-scale research
which they would otherwise not be able to perform. Using a computing education
dataset, we illustrate how LACA could be applied in a reproducible and rigorous
manner. We believe this method has potential in CER, enabling more
generalisable findings from a wider range of research. This, together with the
development of similar methods, can help to advance both the practice and
research quality of the CER discipline.
comment: 7 pages, 2 figures
☆ ReflectivePrompt: Reflective evolution in autoprompting algorithms
Autoprompting is the process of automatically selecting optimized prompts for
language models, which has been gaining popularity with the rapid advancement
of prompt engineering, driven by extensive research in the field of large
language models (LLMs). This paper presents ReflectivePrompt - a novel
autoprompting method based on evolutionary algorithms that employs a reflective
evolution approach for more precise and comprehensive search of optimal
prompts. ReflectivePrompt utilizes short-term and long-term reflection
operations before crossover and elitist mutation to enhance the quality of the
modifications they introduce. This method allows for the accumulation of
knowledge obtained throughout the evolution process and updates it at each
epoch based on the current population. ReflectivePrompt was tested on 33
datasets for classification and text generation tasks using open-access large
language models: t-lite-instruct-0.1 and gemma3-27b-it. The method
demonstrates, on average, a significant improvement (e.g., 28% on BBH compared
to EvoPrompt) in metrics relative to current state-of-the-art approaches,
thereby establishing itself as one of the most effective solutions in
evolutionary algorithm-based autoprompting.
☆ ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Large Language Models (LLMs) are increasingly deployed in high-stakes domains
such as science, law, and healthcare, where accurate expressions of uncertainty
are essential for reliability and trust. However, current LLMs are often
observed to generate incorrect answers with high confidence, a phenomenon known
as "overconfidence". Recent efforts have focused on calibrating LLMs'
verbalized confidence: i.e., their expressions of confidence in text form, such
as "I am 80% confident that...". Existing approaches either rely on prompt
engineering or fine-tuning with heuristically generated uncertainty estimates,
both of which have limited effectiveness and generalizability. Motivated by the
notion of proper scoring rules for calibration in classical machine learning
models, we introduce ConfTuner, a simple and efficient fine-tuning method that
introduces minimal overhead and does not require ground-truth confidence scores
or proxy confidence estimates. ConfTuner relies on a new loss function,
tokenized Brier score, which we theoretically prove to be a proper scoring
rule, intuitively meaning that it "correctly incentivizes the model to report
its true probability of being correct". ConfTuner improves calibration across
diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our
results further show that better-calibrated confidence enables downstream gains
in self-correction and model cascade, advancing the development of trustworthy
LLM systems. The code is available at
https://github.com/liushiliushi/ConfTuner.
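A hedged sketch of the proper-scoring-rule idea behind such an objective: a
Brier-style penalty (stated confidence minus correctness, squared) is minimized
exactly when the model reports its true probability of being correct. This is a
simplification for illustration, not the paper's tokenized formulation:

    import torch

    def brier_loss(stated_confidence: torch.Tensor, is_correct: torch.Tensor) -> torch.Tensor:
        """stated_confidence in [0, 1]; is_correct in {0.0, 1.0} per answer."""
        return ((stated_confidence - is_correct) ** 2).mean()

    conf = torch.tensor([0.80, 0.95, 0.30], requires_grad=True)  # e.g. parsed from "I am 80% confident"
    correct = torch.tensor([1.0, 0.0, 0.0])                      # whether each answer was right
    loss = brier_loss(conf, correct)
    loss.backward()                     # in training, the gradient flows into the confidence tokens
    print(loss.item(), conf.grad)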
☆ Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness
Enhancing the mathematical reasoning of large language models (LLMs) demands
high-quality training data, yet conventional methods face critical challenges
in scalability, cost, and data reliability. To address these limitations, we
propose a novel program-assisted synthesis framework that systematically
generates a high-quality mathematical corpus with guaranteed diversity,
complexity, and correctness. This framework integrates mathematical knowledge
systems and domain-specific tools to create executable programs. These programs
are then translated into natural language problem-solution pairs and vetted by
a bilateral validation mechanism that verifies solution correctness against
program outputs and ensures program-problem consistency. We have generated 12.3
million such problem-solving triples. Experiments demonstrate that models
fine-tuned on our data significantly improve their inference capabilities,
achieving state-of-the-art performance on several benchmark datasets and
showcasing the effectiveness of our synthesis approach.
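An illustrative sketch, under assumed names, of the program-assisted idea: an
executable program produces the ground-truth answer, and a synthesized
problem-solution pair is kept only if its final answer matches the program
output:

    import sympy as sp

    def make_problem(a: int, b: int, c: int):
        x = sp.symbols("x")
        roots = sp.solve(sp.Eq(a * x**2 + b * x + c, 0), x)
        problem = f"Solve {a}x^2 + {b}x + {c} = 0."
        return problem, sorted(roots)

    def validate(candidate_answer, program_answer) -> bool:
        # keep the synthesized pair only if the written solution agrees with the program
        return sorted(sp.sympify(v) for v in candidate_answer) == program_answer

    problem, truth = make_problem(1, -3, 2)
    candidate = ["1", "2"]              # final answer extracted from a generated solution
    print(problem, "->", truth, "| accepted:", validate(candidate, truth))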
☆ LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection
The proliferation of misinformation in the digital age has led to significant
societal challenges. Existing approaches often struggle with capturing
long-range dependencies, complex semantic relations, and the social dynamics
influencing news dissemination. Furthermore, these methods require extensive
labelled datasets, making their deployment resource-intensive. In this study,
we propose a novel self-supervised misinformation detection framework that
integrates both complex semantic relations using Abstract Meaning
Representation (AMR) and news propagation dynamics. We introduce an LLM-based
graph contrastive loss (LGCL) that utilizes negative anchor points generated by
a Large Language Model (LLM) to enhance feature separability in a zero-shot
manner. To incorporate social context, we employ a multi-view graph masked
autoencoder, which learns news propagation features from the social context graph.
By combining these semantic and propagation-based features, our approach
effectively differentiates between fake and real news in a self-supervised
manner. Extensive experiments demonstrate that our self-supervised framework
achieves superior performance compared to other state-of-the-art methodologies,
even with limited labelled datasets while improving generalizability.
☆ LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Ziming Zhu, Chenglong Wang, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu
Despite the remarkable progress of modern machine translation (MT) systems on
general-domain texts, translating structured LaTeX-formatted documents remains
a significant challenge. These documents typically interleave natural language
with domain-specific syntax, such as mathematical equations, tables, figures,
and cross-references, all of which must be accurately preserved to maintain
semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a
collaborative multi-agent system designed to address this challenge. LaTeXTrans
ensures format preservation, structural fidelity, and terminology consistency
through six specialized agents: 1) a Parser that decomposes LaTeX into
translation-friendly units via placeholder substitution and syntax filtering;
2) a Translator, Validator, Summarizer, and Terminology Extractor that work
collaboratively to ensure context-aware, self-correcting, and
terminology-consistent translations; 3) a Generator that reconstructs the
translated content into well-structured LaTeX documents. Experimental results
demonstrate that LaTeXTrans can outperform mainstream MT systems in both
translation accuracy and structural fidelity, offering an effective and
practical solution for translating LaTeX-formatted documents.
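A minimal sketch of the placeholder-substitution idea behind such a parser:
protect math and reference commands before translation and restore them
afterwards. The regexes and names are illustrative, not LaTeXTrans's actual
implementation:

    import re

    PROTECT = re.compile(r"(\$[^$]+\$|\\(?:ref|cite|label)\{[^}]*\})")

    def shield(latex_src: str):
        slots = []
        def stash(match):
            slots.append(match.group(0))
            return f"<PH{len(slots) - 1}>"
        return PROTECT.sub(stash, latex_src), slots

    def unshield(translated: str, slots):
        for i, original in enumerate(slots):
            translated = translated.replace(f"<PH{i}>", original)
        return translated

    src = r"The loss $L = \sum_i e_i^2$ is defined in Section~\ref{sec:method}."
    masked, slots = shield(src)
    print(masked)           # "The loss <PH0> is defined in Section~<PH1>."
    translated = masked     # an MT system would translate the masked text here
    print(unshield(translated, slots))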
☆ Controllable Conversational Theme Detection Track at DSTC 12
Igor Shalyminov, Hang Su, Jake Vincent, Siffi Singh, Jason Cai, James Gung, Raphael Shu, Saab Mansour
Conversational analytics has been at the forefront of a transformation driven
by the advances in Speech and Natural Language Processing techniques. Rapid
adoption of Large Language Models (LLMs) in the analytics field has taken the
problems that can be automated to a new level of complexity and scale. In this
paper, we introduce Theme Detection as a critical task in conversational
analytics, aimed at automatically identifying and categorizing topics within
conversations. This process can significantly reduce the manual effort involved
in analyzing expansive dialogs, particularly in domains like customer support
or sales. Unlike traditional dialog intent detection, which often relies on a
fixed set of intents for downstream system logic, themes are intended as a
direct, user-facing summary of the conversation's core inquiry. This
distinction allows for greater flexibility in theme surface forms and
user-specific customizations. We pose the Controllable Conversational Theme
Detection problem as a public competition track at the Dialog System Technology
Challenge (DSTC) 12 -- it is framed as joint clustering and theme labeling of
dialog utterances, with the distinctive aspect being controllability of the
resulting theme clusters' granularity achieved via the provided user preference
data. We give an overview of the problem, the associated dataset and the
evaluation metrics, both automatic and human. Finally, we discuss the
participant teams' submissions and provide insights from those. The track
materials (data and code) are openly available in the GitHub repository.
comment: DSTC12@SigDial2025; data and code available at
https://github.com/amazon-science/dstc12-controllable-conversational-theme-detection
☆ Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction
Grammatical error correction is a significant task in NLP. Traditional
methods based on encoder-decoder models have achieved a degree of success, but the
application of LLMs in this field is still underexplored. Current research
predominantly relies on supervised fine-tuning to train LLMs to directly
generate the corrected sentence, which limits the model's powerful reasoning
ability. To address this limitation, we propose a novel framework based on
Rule-Based RL. Through experiments on the Chinese datasets, our Rule-Based RL
framework achieves state-of-the-art performance, with a notable
increase in recall. This result clearly highlights the advantages of
using RL to steer LLMs, offering a more controllable and reliable paradigm for
future development in GEC.
comment: Code will be released upon publication
☆ ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
Large language models (LLMs) with chain-of-thought reasoning have
demonstrated remarkable problem-solving capabilities, but controlling their
computational effort remains a significant challenge for practical deployment.
Recent proprietary systems like OpenAI's gpt-oss series have introduced
discrete operational modes for intuitive reasoning control, but the open-source
community has largely failed to achieve such capabilities. In this paper, we
introduce ThinkDial, the first open-recipe end-to-end framework that
successfully implements gpt-oss-style controllable reasoning through discrete
operational modes. Our system enables seamless switching between three distinct
reasoning regimes: High mode (full reasoning capability), Medium mode (50
percent token reduction with <10 percent performance degradation), and Low mode
(75 percent token reduction with <15 percent performance degradation). We
achieve this through an end-to-end training paradigm that integrates
budget-mode control throughout the entire pipeline: budget-mode supervised
fine-tuning that embeds controllable reasoning capabilities directly into the
learning process, and two-phase budget-aware reinforcement learning with
adaptive reward shaping. Extensive experiments demonstrate that ThinkDial
achieves target compression-performance trade-offs with clear response length
reductions while maintaining performance thresholds. The framework also
exhibits strong generalization capabilities on out-of-distribution tasks.
☆ Beyond the Textual: Generating Coherent Visual Options for MCQs EMNLP 2025
Multiple-choice questions (MCQs) play a crucial role in fostering deep
thinking and knowledge integration in education. However, previous research has
primarily focused on generating MCQs with textual options, but it largely
overlooks the visual options. Moreover, generating high-quality distractors
remains a major challenge due to the high cost and limited scalability of
manual authoring. To tackle these problems, we propose Cross-modal Options
Synthesis (CmOS), a novel framework for generating educational MCQs with visual
options. Our framework integrates Multimodal Chain-of-Thought (MCoT) reasoning
process and Retrieval-Augmented Generation (RAG) to produce semantically
plausible and visually similar answer and distractors. It also includes a
discrimination module to identify content suitable for visual options.
Experimental results on test tasks demonstrate the superiority of CmOS in
content discrimination, question generation and visual option generation over
existing methods across various subjects and educational levels.
comment: EMNLP 2025
☆ Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models
Large reasoning models (LRMs) have shown remarkable progress on complex
reasoning tasks. However, some questions posed to LRMs are inherently
unanswerable, such as math problems lacking sufficient conditions. We find that
LRMs continually fail to provide appropriate abstentions when confronted with
these unanswerable questions. In this paper, we systematically analyze,
investigate, and resolve this issue for trustworthy AI. We first conduct a
detailed analysis of the distinct response behaviors of LRMs when facing
unanswerable questions. Then, we show that LRMs possess sufficient cognitive
capabilities to recognize the flaws in these questions. However, they fail to
exhibit appropriate abstention behavior, revealing a misalignment between their
internal cognition and external response. Finally, to resolve this issue, we
propose a lightweight, two-stage method that combines cognitive monitoring with
inference-time intervention. Experimental results demonstrate that our method
significantly improves the abstention rate while maintaining the overall
reasoning performance.
☆ Text to Query Plans for Question Answering on Large Tables
Efficient querying and analysis of large tabular datasets remain significant
challenges, especially for users without expertise in programming languages
like SQL. Text-to-SQL approaches have shown promising performance on benchmark
data; however, they inherit SQL's drawbacks, including inefficiency with large
datasets and limited support for complex data analyses beyond basic querying.
We propose a novel framework that transforms natural language queries into
query plans. Our solution is implemented outside traditional databases,
allowing us to support classical SQL commands while avoiding SQL's inherent
limitations. Additionally, we enable complex analytical functions, such as
principal component analysis and anomaly detection, providing greater
flexibility and extensibility than traditional SQL capabilities. We leverage
LLMs to iteratively interpret queries and construct operation sequences,
addressing computational complexity by incrementally building solutions. By
executing operations directly on the data, we overcome context length
limitations without requiring the entire dataset to be processed by the model.
We validate our framework through experiments on both standard databases and
large scientific tables, demonstrating its effectiveness in handling extensive
datasets and performing sophisticated data analyses.
☆ Chronological Passage Assembling in RAG framework for Temporal Question Answering
Long-context question answering over narrative tasks is challenging because
correct answers often hinge on reconstructing a coherent timeline of events
while preserving contextual flow in a limited context window.
Retrieval-augmented generation (RAG) indexing methods aim to address this
challenge by selectively retrieving only necessary document segments. However,
narrative texts possess unique characteristics that limit the effectiveness of
these existing approaches. Specifically, understanding narrative texts requires
more than isolated segments, as the broader context and sequential
relationships between segments are crucial for comprehension. To address these
limitations, we propose ChronoRAG, a novel RAG framework specialized for
narrative texts. This approach focuses on two essential aspects: refining
dispersed document information into coherent and structured passages, and
preserving narrative flow by explicitly capturing and maintaining the temporal
order among retrieved passages. We empirically demonstrate the effectiveness of
ChronoRAG through experiments on the NarrativeQA dataset, showing substantial
improvements in tasks requiring both factual identification and comprehension
of complex sequential relationships, underscoring that reasoning over temporal
order is crucial in resolving narrative QA.
comment: 7 pages, 3 figures
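A small sketch of the temporal-ordering step: retrieved passages are re-sorted
by their position in the source narrative before being assembled into the
prompt (the passages and offsets are invented):

    retrieved = [
        {"text": "Years later, she returns to the village.", "char_offset": 50231},
        {"text": "The story opens with her departure.",       "char_offset": 112},
        {"text": "A letter arrives announcing the war.",      "char_offset": 20874},
    ]

    # re-order by narrative position so the LLM sees events in story order
    chronological = sorted(retrieved, key=lambda p: p["char_offset"])
    context = "\n".join(p["text"] for p in chronological)
    print(context)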
☆ CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks EMNLP 2025
Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs)
solve difficult problems, but very long traces often slow or even degrade
performance on fast, intuitive "System-1" tasks. We introduce Connector-Aware
Compact CoT (CAC-CoT) -- a method that deliberately restricts reasoning to a
small, fixed set of connector phrases, steering the model toward concise and
well-structured explanations. Despite its simplicity, our synthetic method
with Gemini-2.0-Flash yields high-quality training data. CAC-CoT achieves
approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while
retaining approximately 90% on S1-Bench (System-1). Its reasoning traces
average approximately 300 tokens (ART), about one-third the length of baseline
traces, delivering higher efficiency without loss of accuracy.
comment: Accepted at EMNLP 2025 findings
☆ M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations ACL 2025
Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has
recently gained significant attention in social media analysis, aiming to
extract emotion utterances, cause utterances, and emotion categories
simultaneously. However, the scarcity of related datasets, with only one
published dataset featuring highly uniform dialogue scenarios, hinders model
development in this field. To address this, we introduce MECAD, the first
multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56
TV series spanning a wide range of dialogue contexts. In addition, existing
MECTEC methods fail to explicitly model emotional and causal contexts and
neglect the fusion of semantic information at different levels, leading to
performance degradation. In this paper, we propose M3HG, a novel model that
explicitly captures emotional and causal contexts and effectively fuses
contextual information at both inter- and intra-utterance levels via a
multimodal heterogeneous graph. Extensive experiments demonstrate the
effectiveness of M3HG compared with existing state-of-the-art methods. The
codes and dataset are available at https://github.com/redifinition/M3HG.
comment: 16 pages, 8 figures. Accepted to Findings of ACL 2025
☆ Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models
Chang Wang, Siyu Yan, Depeng Yuan, Yuqi Chen, Yanhua Huang, Yuanhang Zheng, Shuhao Li, Yinqi Zhang, Kedi Chen, Mingrui Zhu, Ruiwen Xu
The generation of ad headlines plays a vital role in modern advertising,
where both quality and diversity are essential to engage a broad range of
audience segments. Current approaches primarily optimize language models for
headline quality or click-through rates (CTR), often overlooking the need for
diversity and resulting in homogeneous outputs. To address this limitation, we
propose DIVER, a novel framework based on large language models (LLMs) that are
jointly optimized for both diversity and quality. We first design a semantic-
and stylistic-aware data generation pipeline that automatically produces
high-quality training pairs with ad content and multiple diverse headlines. To
achieve the goal of generating high-quality and diversified ad headlines within
a single forward pass, we propose a multi-stage multi-objective optimization
framework with supervised fine-tuning (SFT) and reinforcement learning (RL).
Experiments on real-world industrial datasets demonstrate that DIVER
effectively balances quality and diversity. Deployed on a large-scale
content-sharing platform serving hundreds of millions of users, our framework
improves advertiser value (ADVV) and CTR by 4.0% and 1.4%, respectively.
☆ Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval KDD'2025
Large Language Models (LLMs) have transformed the field of artificial
intelligence by unlocking the era of generative applications. Built on top of
generative AI capabilities, Agentic AI represents a major shift toward
autonomous, goal-driven systems that can reason, retrieve, and act. However,
they also inherit the bias present in both internal and external information
sources. This significantly affects the fairness and balance of retrieved
information, and hence reduces user trust. To address this critical challenge,
we introduce a novel Bias Mitigation Agent, a multi-agent system designed to
orchestrate the workflow of bias mitigation through specialized agents that
optimize the selection of sources to ensure that the retrieved content is both
highly relevant and minimally biased to promote fair and balanced knowledge
dissemination. The experimental results demonstrate an 81.82% reduction in
bias compared to a baseline naive retrieval strategy.
comment: Accepted at KDD'2025 Agent4IR workshop
☆ EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues
The rapid adoption of large language models (LLMs) in customer service
introduces new risks, as malicious actors can exploit them to conduct
large-scale user impersonation through machine-generated text (MGT). Current
MGT detection methods often struggle in online conversational settings,
reducing the reliability and interpretability essential for trustworthy AI
deployment. In customer service scenarios where operators are typically
non-expert users, explanations become crucial for trustworthy MGT detection. In
this paper, we propose EMMM, an explanation-then-detection framework that
balances latency, accuracy, and non-expert-oriented interpretability.
Experimental results demonstrate that EMMM provides explanations accessible to
non-expert users, with 70% of human evaluators preferring its outputs, while
achieving competitive accuracy compared to state-of-the-art models and
maintaining low latency, generating outputs within 1 second. Our code and
dataset are open-sourced at
https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.
comment: 15 pages
☆ Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs
Multilingual riddle generation challenges large language models (LLMs) to
balance cultural fluency with creative abstraction. Standard prompting
strategies -- zero-shot, few-shot, chain-of-thought -- tend to reuse memorized
riddles or perform shallow paraphrasing. We introduce Adaptive Originality
Filtering (AOF), a prompting framework that filters redundant generations using
cosine-based similarity rejection, while enforcing lexical novelty and
cross-lingual fidelity. Evaluated across three LLMs and four language pairs,
AOF-enhanced GPT-4o achieves 0.177 Self-BLEU and 0.915
Distinct-2 in Japanese, signaling improved lexical diversity and reduced
redundancy compared to other prompting methods and language pairs. Our findings
show that semantic rejection can guide culturally grounded, creative generation
without task-specific fine-tuning.
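A hedged sketch of cosine-based similarity rejection with sentence-transformers:
a newly generated riddle is discarded when it is too similar to anything already
kept. The 0.8 threshold and model name are assumptions, not the paper's settings:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    kept = ["What has keys but can't open locks? A piano."]
    candidates = [
        "What has keys but opens no locks? A piano.",               # near-duplicate, should be rejected
        "I speak without a mouth and hear without ears. An echo.",
    ]

    for riddle in candidates:
        sims = util.cos_sim(model.encode(riddle, convert_to_tensor=True),
                            model.encode(kept, convert_to_tensor=True))[0]
        if sims.max().item() > 0.8:     # redundant generation -> reject
            continue
        kept.append(riddle)

    print(kept)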
☆ Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System
Recent advances in speech large language models (SLMs) have improved speech
recognition and translation in general domains, but accurately generating
domain-specific terms or neologisms remains challenging. To address this, we
propose Attention2Probability: attention-driven terminology probability
estimation for robust speech-to-text system, which is lightweight, flexible,
and accurate. Attention2Probability converts cross-attention weights between
speech and terminology into presence probabilities, and it further employs
curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the
lack of data for speech-to-text tasks with terminology intervention, we create
and release a new speech dataset with terminology to support future research in
this area. Experimental results show that Attention2Probability significantly
outperforms the VectorDB method on our test set. Specifically, its maximum
recall rates reach 92.57% for Chinese and 86.83% for English. This high recall
is achieved with a latency of only 8.71ms per query. Intervening in SLMs'
recognition and translation tasks using Attention2Probability-retrieved terms
improves terminology accuracy by 6-17%, while revealing that the current
utilization of terminology by SLMs has limitations.
comment: 9 pages, 4 figures, 5 tables
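A toy illustration, not the paper's architecture, of the general idea of mapping
cross-attention weights between speech frames and a candidate term to a presence
probability:

    import torch
    import torch.nn as nn

    T_frames, T_term = 40, 4
    attn = torch.softmax(torch.randn(T_frames, T_term), dim=-1)  # cross-attention: speech frames x term tokens

    pooled = attn.max(dim=0).values.mean()   # strongest frame alignment per term token, averaged over the term
    head = nn.Linear(1, 1)                   # tiny calibration head mapping the pooled score to a logit
    presence_prob = torch.sigmoid(head(pooled.unsqueeze(0))).item()
    print(f"estimated probability that the term occurs in the audio: {presence_prob:.2f}")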
☆ Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning
In high-stakes medical applications, consistent answering across diverse
question phrasings is essential for reliable diagnosis. However, we reveal that
current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility
in Medical Visual Question Answering, as their answers fluctuate significantly
when faced with semantically equivalent rephrasings of medical questions. We
attribute this to two limitations: (1) insufficient alignment of medical
concepts, leading to divergent reasoning patterns, and (2) hidden biases in
training data that prioritize syntactic shortcuts over semantic understanding.
To address these challenges, we construct RoMed, a dataset built upon original
VQA datasets containing 144k questions with variations spanning word-level,
sentence-level, and semantic-level perturbations. When evaluating
state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming
performance drops (e.g., a 40% decline in Recall) compared to original VQA
benchmarks, exposing critical robustness gaps. To bridge this gap, we propose
Consistency and Contrastive Learning (CCL), which integrates two key
components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with
medical knowledge rather than shallow feature patterns, and (2) bias-aware
contrastive learning, mitigating data-specific priors through discriminative
representation refinement. CCL achieves SOTA performance on three popular VQA
benchmarks and notably improves answer consistency by 50% on the challenging
RoMed test set, demonstrating significantly enhanced robustness. Code will be
released.
☆ FALCON: Autonomous Cyber Threat Intelligence Mining with LLMs for IDS Rule Generation
Shaswata Mitra, Azim Bazarov, Martin Duclos, Sudip Mittal, Aritran Piplai, Md Rayhanur Rahman, Edward Zieglar, Shahram Rahimi
Signature-based Intrusion Detection Systems (IDS) detect malicious activities
by matching network or host activity against predefined rules. These rules are
derived from extensive Cyber Threat Intelligence (CTI), which includes attack
signatures and behavioral patterns obtained through automated tools and manual
threat analysis, such as sandboxing. The CTI is then transformed into
actionable rules for the IDS engine, enabling real-time detection and
prevention. However, the constant evolution of cyber threats necessitates
frequent rule updates, which delay deployment time and weaken overall security
readiness. Recent advancements in agentic systems powered by Large Language
Models (LLMs) offer the potential for autonomous IDS rule generation with
internal evaluation. We introduce FALCON, an autonomous agentic framework that
generates deployable IDS rules from CTI data in real-time and evaluates them
using built-in multi-phased validators. To demonstrate versatility, we target
both network (Snort) and host-based (YARA) mediums and construct a
comprehensive dataset of IDS rules with their corresponding CTIs. Our
evaluations indicate FALCON excels in automatic rule generation, with an
average accuracy of 95%, validated by qualitative evaluation with 84%
inter-rater agreement among multiple cybersecurity analysts across all metrics.
These results underscore the feasibility and effectiveness of LLM-driven data
mining for real-time cyber threat mitigation.
comment: 11 pages, 5 figures, 4 tables
☆ Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum
Xinglong Yang, Quan Feng, Zhongying Pan, Xiang Chen, Yu Tian, Wentong Li, Shuofei Qiao, Yuxia Geng, Xingyu Zhao, Sheng-Jun Huang
The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often
limited by the use of randomly or manually selected examples. These examples
fail to account for both model-specific knowledge distributions and the
intrinsic complexity of the tasks, resulting in suboptimal and unstable model
performance. To address this, we propose a novel framework inspired by the
pedagogical principle of "tailored teaching with balanced difficulty". We
reframe prompt selection as a prompt curriculum design problem: constructing a
well-ordered set of training examples that aligns with the model's current
capabilities. Our approach integrates two complementary signals: (1)
model-perceived difficulty, quantified through prediction disagreement in an
active learning setup, capturing what the model itself finds challenging; and
(2) intrinsic sample complexity, which measures the inherent difficulty of each
question-image pair independently of any model. By jointly analyzing these
signals, we develop a difficulty-balanced sampling strategy that ensures the
selected prompt examples are diverse across both dimensions. Extensive
experiments conducted on five challenging benchmarks and multiple popular
Multimodal Large Language Models (MLLMs) demonstrate that our method yields
substantial and consistent improvements and greatly reduces performance
discrepancies caused by random sampling, providing a principled and robust
approach for enhancing multimodal reasoning.
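A toy sketch of difficulty-balanced sampling under the assumptions that each candidate example already carries a model-perceived difficulty score and an intrinsic complexity score; the bucketing and round-robin draw below are stand-ins for the paper's strategy.

```python
import numpy as np

def balanced_prompt_selection(model_difficulty, intrinsic_complexity, k=8, bins=4, seed=0):
    """Bucket candidates on a bins x bins grid over the two difficulty signals,
    then draw round-robin across buckets so the selected prompt examples are
    diverse along both dimensions."""
    rng = np.random.default_rng(seed)
    cuts = np.linspace(0, 1, bins + 1)[1:-1]
    d_bins = np.digitize(model_difficulty, np.quantile(model_difficulty, cuts))
    c_bins = np.digitize(intrinsic_complexity, np.quantile(intrinsic_complexity, cuts))

    buckets = {}
    for idx, key in enumerate(zip(d_bins, c_bins)):
        buckets.setdefault(key, []).append(idx)
    for members in buckets.values():
        rng.shuffle(members)

    selected = []
    while len(selected) < k and any(buckets.values()):
        for members in list(buckets.values()):
            if members and len(selected) < k:
                selected.append(members.pop())
    return selected  # indices of the chosen in-context examples
```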
☆ Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks ICML
Taishi Nakamura, Satoki Ishikawa, Masaki Kawamura, Takumi Okamoto, Daisuke Nohara, Jun Suzuki, Rio Yokota
Empirical scaling laws have driven the evolution of large language models
(LLMs), yet their coefficients shift whenever the model architecture or data
pipeline changes. Mixture-of-Experts (MoE) models, now standard in
state-of-the-art systems, introduce a new sparsity dimension that current
dense-model frontiers overlook. We investigate how MoE sparsity influences two
distinct capability regimes: memorization and reasoning. We train families of
MoE Transformers that systematically vary total parameters, active parameters,
and top-$k$ routing while holding the compute budget fixed. For every model we
record pre-training loss, downstream task loss, and task accuracy, allowing us
to separate the train-test generalization gap from the loss-accuracy gap.
Memorization benchmarks improve monotonically with total parameters, mirroring
training loss. By contrast, reasoning performance saturates and can even
regress despite continued gains in both total parameters and training loss.
Altering top-$k$ alone has little effect when active parameters are constant,
and classic hyperparameters such as learning rate and initialization modulate
the generalization gap in the same direction as sparsity. Neither post-training
reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning
deficit of overly sparse models. Our model checkpoints, code and logs are
open-source at https://github.com/rioyokotalab/optimal-sparsity.
comment: Presented at the Second AI for Math Workshop at ICML
☆ Membership Inference Attacks on LLM-based Recommender Systems
Large language model (LLM)-based Recommender Systems (RecSys) can flexibly
adapt recommendation functions to different domains. They rely on in-context
learning (ICL), i.e., the prompts, to customize these functions, and the
prompts include sensitive historical user-specific item interactions, e.g.,
implicit feedback like clicked items or explicit product reviews. Such private
information may be exposed to novel privacy attacks, yet this important issue
has not been studied. We design four membership inference attacks
(MIAs), aiming to reveal whether victims' historical interactions have been
used by system prompts. They are \emph{direct inquiry, hallucination,
similarity, and poisoning attacks}, each of which utilizes the unique features
of LLMs or RecSys. We have carefully evaluated them on three LLMs that have
been used to develop ICL-LLM RecSys and two well-known RecSys benchmark
datasets. The results confirm that the MIA threat to LLM RecSys is realistic:
direct inquiry and poisoning attacks show significantly high attack
advantages. We have also analyzed the factors affecting these attacks, such as
the number of shots in system prompts and the position of the victim in the
shots.
☆ Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models ICASSP 2026
With the development of speech large language models (speech LLMs), users can
now interact directly with assistants via speech. However, most existing models
simply convert the response content into speech without fully understanding the
rich emotional and paralinguistic cues embedded in the user's query. In many
cases, the same sentence can have different meanings depending on the emotional
expression. Furthermore, emotional understanding is essential for improving
user experience in human-machine interaction. Currently, most speech LLMs with
empathetic capabilities are trained on massive datasets. This approach requires
vast amounts of data and significant computational resources. Therefore, a key
challenge lies in how to develop a speech LLM capable of generating empathetic
responses with limited data and without the need for large-scale training. To
address this challenge, we propose Emotion Omni, a novel model architecture
designed to understand the emotional content of user speech input and generate
empathetic speech responses. Additionally, we developed a data generation
pipeline based on an open-source TTS framework to construct a 200k emotional
dialogue dataset, which supports the construction of an empathetic speech
assistant. The demos are available at https://w311411.github.io/omni_demo/
comment: 5 pages, 1 figure, submitted to ICASSP 2026
☆ UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) systems are widely deployed in
real-world applications in diverse domains such as finance, healthcare, and
cybersecurity. However, many studies showed that they are vulnerable to
knowledge corruption attacks, where an attacker can inject adversarial texts
into the knowledge database of a RAG system to induce the LLM to generate
attacker-desired outputs. Existing studies mainly focus on attacking specific
queries or queries with similar topics (or keywords). In this work, we propose
UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike
prior work, UniC-RAG jointly optimizes a small number of adversarial texts that
can simultaneously attack a large number of user queries with diverse topics
and domains, enabling an attacker to achieve various malicious objectives, such
as directing users to malicious websites, triggering harmful command execution,
or launching denial-of-service attacks. We formulate UniC-RAG as an
optimization problem and further design an effective solution to solve it,
including a balanced similarity-based clustering method to enhance the attack's
effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly
effective and significantly outperforms baselines. For instance, UniC-RAG could
achieve over 90% attack success rate by injecting 100 adversarial texts into a
knowledge database with millions of texts to simultaneously attack a large set
of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and
show that they are insufficient to defend against UniC-RAG, highlighting the
need for new defense mechanisms in RAG systems.
comment: 21 pages, 4 figures
☆ Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models
Grounding responses in external knowledge represents an effective strategy
for mitigating hallucinations in Large Language Models (LLMs). However, current
LLMs struggle to seamlessly integrate knowledge while simultaneously
maintaining faithfulness (or fidelity) and expressiveness, capabilities that
humans naturally possess. This limitation results in outputs that either lack
support from external knowledge, thereby compromising faithfulness, or appear
overly verbose and unnatural, thus sacrificing expressiveness. In this work, to
break the trade-off between faithfulness and expressiveness, we propose
Collaborative Decoding (CoDe), a novel approach that dynamically integrates
output probabilities generated with and without external knowledge. This
integration is guided by distribution divergence and model confidence, enabling
the selective activation of relevant and reliable expressions from the model's
internal parameters. Furthermore, we introduce a knowledge-aware reranking
mechanism that prevents over-reliance on prior parametric knowledge while
ensuring proper utilization of provided external information. Through
comprehensive experiments, our plug-and-play CoDe framework demonstrates
superior performance in enhancing faithfulness without compromising
expressiveness across diverse LLMs and evaluation metrics, validating both its
effectiveness and generalizability.
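A loose sketch of one collaborative decoding step that mixes knowledge-grounded and knowledge-free next-token distributions. The gate below (Jensen-Shannon divergence plus internal confidence) is an assumption standing in for CoDe's actual guidance by distribution divergence and model confidence.

```python
import torch
import torch.nn.functional as F

def collaborative_decode_step(logits_with_knowledge, logits_without, alpha_max=1.0):
    """Mix the two next-token distributions; the exact gating in CoDe may differ."""
    eps = 1e-12
    p_know = F.softmax(logits_with_knowledge, dim=-1).clamp_min(eps)
    p_free = F.softmax(logits_without, dim=-1).clamp_min(eps)

    # Divergence between the two views (Jensen-Shannon), normalized to [0, 1].
    m = 0.5 * (p_know + p_free)
    js = 0.5 * ((p_know * (p_know / m).log()).sum(-1)
                + (p_free * (p_free / m).log()).sum(-1))
    js = js / torch.log(torch.tensor(2.0))

    # Confidence of the knowledge-free model: 1 - normalized entropy.
    entropy = -(p_free * p_free.log()).sum(-1)
    confidence = 1.0 - entropy / torch.log(torch.tensor(float(p_free.size(-1))))

    # Low divergence and high internal confidence -> rely more on the model's
    # own, more expressive distribution; otherwise lean on external knowledge.
    alpha = alpha_max * confidence * (1.0 - js)
    mixed = alpha.unsqueeze(-1) * p_free + (1.0 - alpha).unsqueeze(-1) * p_know
    return mixed
```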
☆ Thinking Before You Speak: A Proactive Test-time Scaling Approach
Large Language Models (LLMs) often exhibit deficiencies with complex
reasoning tasks, such as maths, which we attribute to the discrepancy between
human reasoning patterns and those presented in the LLMs' training data. When
dealing with complex problems, humans tend to think carefully before expressing
solutions. However, they often do not articulate their inner thoughts,
including their intentions and chosen methodologies. Consequently, critical
insights essential for bridging reasoning steps may be absent in training data
collected from human sources. To bridge this gap, we propose inserting
\emph{insight}s between consecutive reasoning steps, which review the current
status and initiate the next reasoning step. Unlike prior prompting strategies
that rely on a single static prompt or a workflow of static prompts to
facilitate reasoning, \emph{insight}s are \emph{proactively} generated to guide
the reasoning process.
We implement our idea as a reasoning framework, named \emph{Thinking Before You
Speak} (TBYS), and design a pipeline for automatically collecting and filtering
in-context examples for the generation of \emph{insight}s, which alleviates
human labeling efforts and fine-tuning overheads. Experiments on challenging
mathematical datasets verify the effectiveness of TBYS. Project website:
https://gitee.com/jswrt/TBYS
☆ Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
Jun Wang, Ninglun Gu, Kailai Zhang, Zijiao Zhang, Yelun Bao, Jin Yang, Xu Yin, Liwei Liu, Yihuan Liu, Pengyong Li, Gary G. Yen, Junchi Yan
For Large Language Models (LLMs), a disconnect persists between benchmark
performance and real-world utility. Current evaluation frameworks remain
fragmented, prioritizing technical metrics while neglecting holistic assessment
for deployment. This survey introduces an anthropomorphic evaluation paradigm
through the lens of human intelligence, proposing a novel three-dimensional
taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational
capacity, Emotional Quotient (EQ)-Alignment Ability for value-based
interactions, and Professional Quotient (PQ)-Professional Expertise for
specialized proficiency. For practical value, we pioneer a Value-oriented
Evaluation (VQ) framework assessing economic viability, social impact, ethical
alignment, and environmental sustainability. Our modular architecture
integrates six components with an implementation roadmap. Through analysis of
200+ benchmarks, we identify key challenges including dynamic assessment needs
and interpretability gaps. It provides actionable guidance for developing LLMs
that are technically proficient, contextually relevant, and ethically sound. We
maintain a curated repository of open-source evaluation resources at:
https://github.com/onejune2018/Awesome-LLM-Eval.
comment: Preprint. Under review
☆ RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing
Jianxing Liao, Tian Zhang, Xiao Feng, Yusong Zhang, Rui Yang, Haorui Wang, Bosi Wen, Ziying Wang, Runzhi Shi
Large language models are extensively utilized in creative writing
applications. Creative writing requires a balance between subjective writing
quality (e.g., literariness and emotional expression) and objective constraint
following (e.g., format requirements and word limits). Existing reinforcement
learning methods struggle to balance these two aspects: single reward
strategies fail to improve both abilities simultaneously, while fixed-weight
mixed-reward methods lack the ability to adapt to different writing scenarios.
To address this problem, we propose Reinforcement Learning with Mixed Rewards
(RLMR), utilizing a dynamically mixed reward system from a writing reward model
evaluating subjective writing quality and a constraint verification model
assessing objective constraint following. The constraint following reward
weight is adjusted dynamically according to the writing quality within sampled
groups, ensuring that samples violating constraints receive a negative
advantage in GRPO and are thus penalized during training; this is the key
innovation of the proposed method. We conduct automated and manual evaluations across diverse
model families from 8B to 72B parameters. Additionally, we construct a
real-world writing benchmark named WriteEval for comprehensive evaluation.
Results illustrate that our method achieves consistent improvements in both
instruction following (IFEval from 83.36\% to 86.65\%) and writing quality
(72.75\% win rate in manual expert pairwise evaluations on WriteEval). To the
best of our knowledge, RLMR is the first work to combine subjective preferences
with objective verification in online RL training, providing an effective
solution for multi-dimensional creative writing optimization.
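A small illustration of the dynamic mixing idea within one GRPO group, under the assumption that each sample has a scalar writing-quality score and a boolean constraint check; pushing violators below the group minimum guarantees a negative advantage. This is a sketch of the mechanism, not RLMR's exact reward formula.

```python
import numpy as np

def mixed_rewards(quality_scores, constraint_ok, margin=0.1):
    """Assign rewards so constraint-violating samples fall strictly below every
    constraint-satisfying sample in the group, then compute GRPO-style
    group-normalized advantages."""
    quality = np.asarray(quality_scores, dtype=float)
    ok = np.asarray(constraint_ok, dtype=bool)

    penalty_floor = quality.min() - margin          # strictly below all samples
    rewards = np.where(ok, quality, penalty_floor)

    advantages = rewards - rewards.mean()
    std = rewards.std()
    if std > 0:
        advantages = advantages / std
    return rewards, advantages

# e.g. mixed_rewards([0.9, 0.7, 0.8], [True, True, False])
# gives the third (violating) sample a negative advantage.
```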
☆ Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
Large language models (LLMs) present significant deployment challenges due to
their scale, with post-training quantization (PTQ) emerging as a practical
compression solution. However, a comprehensive understanding of how PTQ
precisely impacts diverse LLM knowledge capabilities remains elusive, and
existing scaling laws for quantized models often overlook crucial PTQ-specific
parameters and task-specific sensitivities. This paper addresses these gaps by
conducting an extensive empirical investigation to establish task-stratified
scaling laws. We disentangle LLM knowledge into memorization and utilization
capabilities and develop a unified quantitative framework that incorporates
model size, effective bit-width, calibration set size, and group size. Our
central finding reveals that knowledge memorization exhibits markedly greater
sensitivity to variations in effective bit-width, calibration set size, and
model size compared to the more robust knowledge utilization. These findings
offer a fine-grained understanding of PTQ's impact and provide guidance for
developing knowledge-aware quantization strategies that can better preserve
targeted cognitive functions.
☆ A New NMT Model for Translating Clinical Texts from English to Spanish ML4H
Translating electronic health record (EHR) narratives from English to Spanish
is a clinically important yet challenging task due to the lack of a
parallel-aligned corpus and the abundance of unknown words. To address these
challenges, we propose \textbf{NOOV} (for No OOV), a new neural machine
translation (NMT) system that requires little in-domain parallel-aligned data
for training. NOOV integrates a bilingual lexicon automatically learned from
parallel-aligned corpora and a phrase look-up table extracted from a large
biomedical knowledge resource to alleviate both the unknown-word problem and
the word-repeat challenge in NMT, improving phrase generation. Evaluation shows
that NOOV generates better translations of EHR narratives, with improvements in
both accuracy and fluency.
comment: This work was accepted by the Machine Learning for Health (ML4H)
Workshop at NeurIPS 2018
☆ What do language models model? Transformers, automata, and the format of thought
What do large language models actually model? Do they tell us something about
human capacities, or are they models of the corpus we've trained them on? I
give a non-deflationary defence of the latter position. Cognitive science tells
us that linguistic capabilities in humans rely on supralinear formats for
computation. The transformer architecture, by contrast, supports at best
linear formats for processing. This argument will rely primarily on certain
invariants of the computational architecture of transformers. I then suggest a
positive story about what transformers are doing, focusing on Liu et al.
(2022)'s intriguing speculations about shortcut automata. I conclude with why I
don't think this is a terribly deflationary story. Language is not (just) a
means for expressing inner state but also a kind of 'discourse machine' that
lets us make new language given appropriate context. We have learned to use
this technology in one way; LLMs have learned to use it too, but via very
different means.
☆ The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
Visual metaphor generation is a challenging task that aims to generate an
image given an input text metaphor. Inherently, it needs language understanding
to bind a source concept with a target concept, in a way that preserves meaning
while ensuring visual coherence. We propose a self-evaluating visual metaphor
generation framework that focuses on metaphor alignment. Our self-evaluation
approach combines existing metrics with our newly proposed metaphor
decomposition score and a meaning alignment (MA) metric. Within this setup, we
explore two novel approaches: a training-free pipeline that explicitly
decomposes prompts into source-target-meaning (S-T-M) mapping for image
synthesis, and a complementary training-based pipeline that improves alignment
using our proposed self-evaluation reward schema, without any large-scale
retraining. On the held-out test set, the training-free approach surpasses
strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores,
with the training-based approach close behind. We evaluate our framework's
output in a user-facing study and observe that participants preferred GPT-4o
overall, while our training-free pipeline led open-source methods and edged
Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or
more abstract metaphors, with closed models excelling on short, concrete cases;
we also observe sensitivity to sampler settings. Overall, structured prompting
and lightweight RL perform metaphor alignment well under modest compute, and
remaining gaps to human preference appear driven by aesthetics and sampling.
comment: Under Review
♻ ☆ From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification CIKM 2025
In conversational AI systems, a critical challenge in training effective
multi-turn intent classification models lies in the generation of large-scale,
domain-specific, multilingual dialogue datasets. In this paper, we introduce
Chain-of-Intent, a novel framework that integrates Hidden Markov Models (HMMs)
with Large Language Models (LLMs) to generate intent-driven, context-aware
dialogues through self-play. Our method first extracts domain-specific intent
transition patterns from real-world e-commerce chat logs, which guide the
modeling of turn-level dynamics and intent sequences. LLMs are then employed to
parameterize the emission probabilities of HMMs, enabling the generation of
natural, coherent utterances aligned with predicted intents and dialogue
context. We further propose MINT-CL, a multi-task contrastive learning
framework for multi-turn intent classification, which improves performance
while reducing dependence on large-scale annotated datasets. Empirical results
demonstrate that our approach outperforms competitive baselines in both
dialogue generation quality and classification accuracy, particularly in
multilingual settings. To facilitate future research, we release MINT-E, a
comprehensive, multilingual, intent-aware multi-turn dialogue corpus derived
from the e-commerce domain. The reproduced source code and dataset are
available at https://github.com/junhua/chain-of-intent.
comment: Accepted to Proceedings of CIKM 2025
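A toy sketch of the HMM-plus-LLM self-play idea: intents follow a Markov transition matrix and an LLM (here a placeholder function) emits an utterance per intent. Intent names and transition values are invented for illustration only.

```python
import numpy as np

INTENTS = ["greeting", "order_status", "return_request", "farewell"]  # illustrative
TRANSITIONS = np.array([  # rows: current intent, cols: next intent (toy values)
    [0.05, 0.60, 0.30, 0.05],
    [0.02, 0.40, 0.38, 0.20],
    [0.02, 0.28, 0.40, 0.30],
    [0.00, 0.00, 0.00, 1.00],
])
START = np.array([0.70, 0.20, 0.10, 0.00])

def emit_utterance(intent, history):
    """Placeholder for the LLM-parameterized emission step: in Chain-of-Intent
    this would prompt an LLM with the intent label and dialogue context."""
    return f"<utterance expressing '{intent}' given {len(history)} prior turns>"

def sample_dialogue(max_turns=6, seed=0):
    rng = np.random.default_rng(seed)
    dialogue, intent = [], rng.choice(len(INTENTS), p=START)
    for _ in range(max_turns):
        dialogue.append((INTENTS[intent], emit_utterance(INTENTS[intent], dialogue)))
        if INTENTS[intent] == "farewell":
            break
        intent = rng.choice(len(INTENTS), p=TRANSITIONS[intent])
    return dialogue

if __name__ == "__main__":
    for intent, text in sample_dialogue():
        print(intent, "->", text)
```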
♻ ☆ Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications
Large Language Models (LLMs) have significantly advanced natural language
processing, demonstrating strong capabilities in tasks such as text generation,
summarization, and reasoning. Recently, their potential for automating precise
text editing tasks across specialized domains, such as programming code, LaTeX,
and structured database languages, has gained attention. However, current
state-of-the-art LLMs still struggle with executing precise, instruction-driven
edits, particularly when structural accuracy and strict adherence to domain
conventions are required. To address these challenges, we introduce
InstrEditBench, an automated benchmark dataset comprising over 30,000
structured editing tasks spanning diverse domains, including Wikipedia
articles, LaTeX documents, source code, and database languages. Using this
benchmark, we develop FineEdit, a specialized editing model explicitly trained
for accurate, context-aware text modifications. Experimental evaluations
demonstrate that FineEdit outperforms state-of-the-art models, achieving
improvements of approximately 10\% over Gemini models on single-turn edits, up
to 30\% over Llama-3.2-3B, and exceeding Mistral-7B-OpenOrca performance by
over 40\% on direct editing tasks. FineEdit also effectively generalizes to
realistic multi-turn editing scenarios, highlighting its practical
applicability. To facilitate further research and reproducibility, we release
FineEdit at https://github.com/StuRinDQB/FineEdit and
https://huggingface.co/datasets/YimingZeng/FineEdit_bench.
♻ ☆ mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Large Vision-Language Models (LVLMs) have made remarkable strides in
multimodal tasks such as visual question answering, visual grounding, and
complex reasoning. However, they remain limited by static training data,
susceptibility to hallucinations, and inability to verify claims against
up-to-date, external evidence, compromising their performance in dynamic
real-world applications. Retrieval-Augmented Generation (RAG) offers a
practical solution to mitigate these challenges by allowing the LVLMs to access
large-scale knowledge databases via retrieval mechanisms, thereby grounding
model outputs in factual, contextually relevant information. In this paper, we
conduct the first systematic dissection of the multimodal RAG pipeline for
LVLMs, explicitly investigating (1) the retrieval phase: modality
configurations and retrieval strategies; (2) the re-ranking stage: strategies
to mitigate positional biases and improve the relevance of retrieved evidence;
and (3) the generation phase: how to best integrate retrieved candidates into
the final generation process. Finally, we extend our study to a unified agentic
framework that integrates re-ranking and
generation through self-reflection, enabling LVLMs to select relevant evidence
and suppress irrelevant context dynamically. Our full-stack exploration of RAG
for LVLMs yields substantial insights, resulting in an average performance
boost of 5% without any fine-tuning.
comment: 16 pages
♻ ☆ TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use EMNLP 2025
Junjie Ye, Yilong Wu, Sixian Li, Yuming Yang, Zhiheng Xi, Tao Gui, Qi Zhang, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan, Zhengyin Du
Large language models (LLMs) achieve remarkable advancements by leveraging
tools to interact with environments, a critical step toward generalized AI.
However, the standard supervised fine-tuning (SFT) approach, which relies on
large-scale datasets, often overlooks task-specific characteristics in tool
use, leading to performance bottlenecks. To address this issue, we analyze
three existing LLMs and uncover key insights: training data can inadvertently
impede tool-use behavior, token importance is distributed unevenly, and errors
in tool calls fall into a small set of categories. Building on these findings,
we propose~\emph{TL-Training}, a task-feature-based framework that mitigates
the effects of suboptimal training data, dynamically adjusts token weights to
prioritize key tokens during SFT, and incorporates a robust reward mechanism
tailored to error categories, optimized through proximal policy optimization.
We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four
open-source test sets. Our results demonstrate that the LLM trained by our
method matches or surpasses both open- and closed-source LLMs in tool-use
performance using only 1,217 training data points. Additionally, our method
enhances robustness in noisy environments and improves general task
performance, offering a scalable and efficient paradigm for tool-use training
in LLMs. Code and data are available at
https://github.com/Junjie-Ye/TL-Training.
comment: Accepted by EMNLP 2025
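A minimal sketch of the token-weighting ingredient: an SFT loss that up-weights key tokens (e.g., tool names and arguments) and down-weights the rest. The weighting scheme shown is a stand-in, not TL-Training's actual heuristic or reward mechanism.

```python
import torch
import torch.nn.functional as F

def weighted_sft_loss(logits, labels, token_weights, ignore_index=-100):
    """Cross-entropy where each target token carries its own importance weight.

    logits: (batch, seq, vocab); labels, token_weights: (batch, seq)."""
    vocab = logits.size(-1)
    per_token = F.cross_entropy(
        logits.view(-1, vocab), labels.view(-1),
        ignore_index=ignore_index, reduction="none",
    ).view(labels.shape)
    mask = (labels != ignore_index).float()
    weights = token_weights * mask
    return (per_token * weights).sum() / weights.sum().clamp_min(1.0)
```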
♻ ☆ ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
While the biases of language models in production are extensively documented,
the biases of their guardrails have been neglected. This paper studies how
contextual information about the user influences the likelihood of an LLM to
refuse to execute a request. By generating user biographies that offer
ideological and demographic information, we find a number of biases in
guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas
are more likely to trigger a refusal guardrail when requesting censored or
illegal information. Guardrails are also sycophantic, refusing to comply with
requests for a political position the user is likely to disagree with. We find
that certain identity groups and seemingly innocuous information, e.g., sports
fandom, can elicit changes in guardrail sensitivity similar to direct
statements of political ideology. For each demographic category and even for
American football team fandom, we find that ChatGPT appears to infer a likely
political ideology and modify guardrail behavior accordingly.
♻ ☆ A Survey on Data Selection for LLM Instruction Tuning
Instruction tuning is a vital step in training large language models (LLMs),
so how to enhance its effect has received increasing attention. Existing works
indicate that dataset quality matters more than quantity during instruction
tuning of LLMs. Therefore, many recent studies focus on methods for selecting
high-quality subsets from instruction datasets, aiming to reduce training costs
and enhance the instruction-following capabilities of LLMs. This paper presents
a comprehensive survey on data selection for LLM instruction tuning. Firstly,
we introduce the widely used instruction datasets. Then, we propose a new
taxonomy of data selection methods and provide a detailed introduction of
recent advances, and the evaluation strategies and results of data selection
methods are also elaborated in detail. Finally, we emphasize the open
challenges and present new frontiers of this task.
comment: Published in JAIR (Vol. 83, Article 32, 2025)
♻ ☆ An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach
Retrieval-Augmented Generation (RAG) systems in the legal domain face a
critical challenge: standard, flat-text retrieval is blind to the hierarchical,
diachronic, and causal structure of law, leading to anachronistic and
unreliable answers. This paper introduces an ontology-driven Graph RAG
framework designed to overcome these limitations. We ground our knowledge graph
in a formal, LRMoo-inspired model that distinguishes abstract legal Works from
their versioned Expressions. We model temporal states as efficient aggregations
that reuse the versioned expressions (CTVs) of unchanged components, and we
reify legislative events as first-class Action nodes to make causality explicit
and queryable. This structured backbone enables a unified, planner-guided query
strategy that applies explicit policies to deterministically resolve complex
requests for (i) point-in-time retrieval, (ii) hierarchical impact analysis,
and (iii) auditable provenance reconstruction. Through a case study on the
Brazilian Constitution, we demonstrate how this approach provides a verifiable,
temporally-correct substrate for LLMs, enabling higher-order analytical
capabilities while drastically reducing the risk of factual errors. The result
is a practical framework for building more trustworthy and explainable legal AI
systems.
comment: This is a major revision that significantly expands and deepens the
original manuscript. While the core ontological model remains the same, this
version provides a substantially more rigorous and detailed account of how
the framework is applied in practice, particularly within a
Retrieval-Augmented Generation (RAG) context
♻ ☆ Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis
Large Language Models (LLMs), already shown to ace various unstructured text
comprehension tasks, have also remarkably been shown to tackle table
(structured) comprehension tasks without specific training. Building on earlier
studies of LLMs for tabular tasks, we probe how in-context learning (ICL),
model scale, instruction tuning, and domain bias affect Tabular QA (TQA)
robustness by testing LLMs, under diverse augmentations and perturbations, on
diverse domains: Wikipedia-based $\textbf{WTQ}$, financial $\textbf{TAT-QA}$,
and scientific $\textbf{SCITAB}$. Although instruction tuning and larger, newer
LLMs deliver stronger, more robust TQA performance, data contamination and
reliability issues, especially on $\textbf{WTQ}$, remain unresolved. Through an
in-depth attention analysis, we reveal a strong correlation between
perturbation-induced shifts in attention dispersion and drops in performance,
with sensitivity peaking in the model's middle layers. We highlight the need
for improved interpretable methodologies to develop more reliable LLMs for
table comprehension. Based on these findings, we argue for the development of
structure-aware self-attention mechanisms and domain-adaptive processing
techniques to improve the transparency, generalization, and real-world
reliability of LLMs on tabular data.
comment: Accepted TMLR 2025
♻ ☆ Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models EMNLP 2025
In-context learning (ICL) performance is highly sensitive to prompt design,
yet the impact of class label options (e.g. lexicon or order) in zero-shot
classification remains underexplored. This study proposes LOADS (Label set
Optimization via Activation Distribution kurtosiS), a post-hoc method for
selecting optimal label sets in zero-shot ICL with large language models
(LLMs). LOADS is built upon the observations in our empirical analysis, the
first to systematically examine how label option design (i.e., lexical choice,
order, and elaboration) impacts classification performance. This analysis shows
that the lexical choice of the labels in the prompt (such as agree vs. support
in stance classification) plays an important role in both model performance and
model's sensitivity to the label order. A further investigation demonstrates
that optimal label words tend to activate fewer outlier neurons in LLMs'
feed-forward networks. LOADS then leverages kurtosis to measure the neuron
activation distribution for label selection, requiring only a single forward
pass without gradient propagation or labelled data. The LOADS-selected label
words consistently demonstrate effectiveness for zero-shot ICL across
classification tasks, datasets, models and languages, achieving maximum
performance gain from 0.54 to 0.76 compared to the conventional approach of
using original dataset label words.
comment: Accepted by EMNLP 2025
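A sketch of the kurtosis-based selection idea, assuming a caller-supplied `get_activations(word)` that records feed-forward activations for a label token from a single forward pass; the layer choice and aggregation are assumptions, not LOADS's exact recipe.

```python
import torch

def activation_kurtosis(ffn_activations):
    """Excess kurtosis of feed-forward activations for one candidate label word;
    label words whose activations are less heavy-tailed (fewer outlier neurons)
    are preferred."""
    x = ffn_activations.flatten().float()
    mu, sigma = x.mean(), x.std(unbiased=False).clamp_min(1e-8)
    return (((x - mu) / sigma) ** 4).mean() - 3.0

def select_label_set(candidate_sets, get_activations):
    """Pick the label set whose summed kurtosis across label words is lowest.
    `get_activations` is a placeholder hook into the LLM's FFN activations."""
    scores = {
        tuple(words): sum(activation_kurtosis(get_activations(w)).item() for w in words)
        for words in candidate_sets
    }
    return min(scores, key=scores.get), scores
```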
♻ ☆ SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
Large Language Models (LLMs) have become integral to daily life, especially
advancing as intelligent assistants through on-device deployment on
smartphones. However, existing LLM evaluation benchmarks predominantly focus on
objective tasks like mathematics and coding in English, which do not
necessarily reflect the practical use cases of on-device LLMs in real-world
mobile scenarios, especially for Chinese users. To address these gaps, we
introduce SmartBench, the first benchmark designed to evaluate the capabilities
of on-device LLMs in Chinese mobile contexts. We analyze functionalities
provided by representative smartphone manufacturers and divide them into five
categories: text summarization, text Q&A, information extraction, content
creation, and notification management, further detailed into 20 specific tasks.
For each task, we construct high-quality datasets comprising 50 to 200
question-answer pairs that reflect everyday mobile interactions, and we develop
automated evaluation criteria tailored for these tasks. We conduct
comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also
assess their performance after quantized deployment on real smartphone NPUs.
Our contributions provide a standardized framework for evaluating on-device
LLMs in Chinese, promoting further development and optimization in this
critical area. Code and data will be available at
https://github.com/vivo-ai-lab/SmartBench.
comment: 26 pages
♻ ☆ An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Weike Zhao, Chaoyi Wu, Yanjie Fan, Xiaoman Zhang, Pengcheng Qiu, Yuze Sun, Xiao Zhou, Yanfeng Wang, Xin Sun, Ya Zhang, Yongguo Yu, Kun Sun, Weidi Xie
Rare diseases collectively affect over 300 million individuals worldwide, yet
timely and accurate diagnosis remains a pervasive challenge. This is largely
due to their clinical heterogeneity, low individual prevalence, and the limited
familiarity most clinicians have with rare conditions. Here, we introduce
DeepRare, the first rare disease diagnosis agentic system powered by a large
language model (LLM), capable of processing heterogeneous clinical inputs. The
system generates ranked diagnostic hypotheses for rare diseases, each
accompanied by a transparent chain of reasoning that links intermediate
analytic steps to verifiable medical evidence.
DeepRare comprises three key components: a central host with a long-term
memory module; specialized agent servers responsible for domain-specific
analytical tasks integrating over 40 specialized tools and web-scale,
up-to-date medical knowledge sources, ensuring access to the most current
clinical information. This modular and scalable design enables complex
diagnostic reasoning while maintaining traceability and adaptability. We
evaluate DeepRare on eight datasets. The system demonstrates exceptional
diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013
diseases. In HPO-based evaluations, DeepRare significantly outperforms 15 other
methods, like traditional bioinformatics diagnostic tools, LLMs, and other
agentic systems, achieving an average Recall@1 score of 57.18% and surpassing
the second-best method (Reasoning LLM) by a substantial margin of 23.79
percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at
Recall@1 compared to Exomiser's 53.20% in 109 cases. Manual verification of
reasoning chains by clinical experts yields 95.40% agreement. Furthermore,
the DeepRare system has been implemented as a user-friendly web application
http://raredx.cn/doctor.
♻ ☆ SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs EMNLP 2025
Although large language models (LLMs) have made significant progress in
understanding Structured Knowledge (SK) like KG and Table, existing evaluations
for SK understanding are non-rigorous (i.e., lacking evaluations of specific
capabilities) and focus on a single type of SK. Therefore, we aim to propose a
more comprehensive and rigorous structured knowledge understanding benchmark to
diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a
Structured Knowledge Augmented QA Benchmark that encompasses four widely used
structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a
three-stage pipeline to construct SKA-Bench instances, which includes a
question, an answer, positive knowledge units, and noisy knowledge units. To
evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we
expand the instances into four fundamental ability testbeds: Noise Robustness,
Order Insensitivity, Information Integration, and Negative Rejection. Empirical
evaluations on 8 representative LLMs, including the advanced DeepSeek-R1,
indicate that existing LLMs still face significant challenges in understanding
structured knowledge, and their performance is influenced by factors such as
the amount of noise, the order of knowledge units, and hallucination
phenomena. Our dataset and code are available at
https://github.com/Lza12a/SKA-Bench.
comment: EMNLP 2025
♻ ☆ LLM-Enhanced Linear Autoencoders for Recommendation CIKM 2025
Large language models (LLMs) have been widely adopted to enrich the semantic
representation of textual item information in recommender systems. However,
existing linear autoencoders (LAEs) that incorporate textual information rely
on sparse word co-occurrence patterns, limiting their ability to capture rich
textual semantics. To address this, we propose L3AE, the first integration of
LLMs into the LAE framework. L3AE effectively integrates the heterogeneous
knowledge of textual semantics and user-item interactions through a two-phase
optimization strategy. (i) L3AE first constructs a semantic item-to-item
correlation matrix from LLM-derived item representations. (ii) It then learns
an item-to-item weight matrix from collaborative signals while distilling
semantic item correlations as regularization. Notably, each phase of L3AE is
optimized through closed-form solutions, ensuring global optimality and
computational efficiency. Extensive experiments demonstrate that L3AE
consistently outperforms state-of-the-art LLM-enhanced models on three
benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20.
The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
comment: Accepted by CIKM 2025
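A loose sketch in the spirit of the two-phase closed-form recipe, with assumed forms: an EASE-style closed-form item-to-item solve for each phase and a simple blend of semantic and collaborative weights. The actual L3AE equations (in particular how semantic correlations enter as regularization) differ; treat this only as intuition.

```python
import numpy as np

def ease_like_closed_form(gram, reg):
    """EASE-style closed-form item-to-item weights with a zero diagonal."""
    p = np.linalg.inv(gram + reg * np.eye(gram.shape[0]))
    b = -p / np.diag(p)[None, :]   # divide each column by its diagonal entry
    np.fill_diagonal(b, 0.0)
    return b

def l3ae_sketch(interactions, item_embeddings, reg_cf=100.0, reg_sem=1.0, beta=0.5):
    """interactions: (users x items) binary matrix; item_embeddings: (items x d)
    LLM-derived vectors. Returns a blended item-to-item weight matrix."""
    # Phase 1: semantic item-to-item correlations from LLM embeddings.
    emb = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    semantic = ease_like_closed_form(emb @ emb.T, reg_sem)

    # Phase 2: collaborative weights from user-item interactions, then blend.
    collaborative = ease_like_closed_form(interactions.T @ interactions, reg_cf)
    return (1.0 - beta) * collaborative + beta * semantic
```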
♻ ☆ RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection
Large Language Models (LLMs) have become powerful, but hallucinations remain
a major obstacle to their trustworthy use. While previous works improved the
capability of hallucination detection by measuring uncertainty, they all lack
the ability to explain the provenance behind why hallucinations occur, i.e.,
which part of the inputs tends to trigger hallucinations. Recent works on the
prompt attack indicate that uncertainty exists in semantic propagation, where
attention mechanisms gradually fuse local token information into high-level
semantics across layers. Meanwhile, uncertainty also emerges in language
generation, due to its probability-based selection of high-level semantics for
sampled generations. Based on this, we propose RePPL to recalibrate uncertainty
measurement along these two aspects, dispatching explainable uncertainty
scores to each token and aggregating them in a perplexity-style log-average
form as the total score. Experiments show that our method achieves the best comprehensive
detection performance across various QA datasets on advanced models (average
AUC of 0.833), and our method is capable of producing token-level uncertainty
scores as explanations for the hallucination. Leveraging these scores, we
preliminarily find the chaotic pattern of hallucination and showcase its
promising usage.
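A tiny sketch of the perplexity-style log-average aggregation, assuming the per-token uncertainty scores are already computed upstream; how RePPL derives those scores is described in the paper and is not reproduced here.

```python
import math

def perplexity_style_score(token_uncertainties):
    """Aggregate per-token uncertainty scores into one value via a log-average
    (i.e., the geometric mean, exponentiated back to the original scale)."""
    scores = [max(float(s), 1e-12) for s in token_uncertainties]
    log_avg = sum(math.log(s) for s in scores) / len(scores)
    return math.exp(log_avg)

# e.g. perplexity_style_score([0.2, 0.9, 0.4]) -> geometric mean of the scores
```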
♻ ☆ Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals
Qianli Wang, Van Bach Nguyen, Nils Feldhus, Luis Felipe Villa-Arenas, Christin Seifert, Sebastian Möller, Vera Schmitt
Counterfactual examples are widely employed to enhance the performance and
robustness of large language models (LLMs) through counterfactual data
augmentation (CDA). However, the selection of the judge model used to evaluate
label flipping, the primary metric for assessing the validity of generated
counterfactuals for CDA, yields inconsistent results. To decipher this, we
define four types of relationships between the counterfactual generator and
judge models: being the same model, belonging to the same model family, being
independent models, and having a distillation relationship. Through extensive
experiments involving two state-of-the-art LLM-based methods, three datasets,
four generator models, and 15 judge models, complemented by a user study (n =
90), we demonstrate that judge models with an independent, non-fine-tuned
relationship to the generator model provide the most reliable label flipping
evaluations. Relationships between the generator and judge models, which are
closely aligned with the user study for CDA, result in better model performance
and robustness. Nevertheless, we find that the gap between the most effective
judge models and the results obtained from the user study remains considerably
large. This suggests that a fully automated pipeline for CDA may be inadequate
and requires human intervention.
comment: Accepted at INLG 2025, camera-ready version
♻ ☆ Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA)
on cricket scorecards, designed to evaluate large vision-language models
(LVLMs) on complex numerical and cross-lingual reasoning over semi-structured
tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated
scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English
QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English
scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi
scorecards, with all questions and answers kept in English to enable controlled
cross-script evaluation. The task demands reasoning over structured numerical
data, multi-image context, and implicit domain knowledge. Empirical results
show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle
on the English subset despite it being their primary training language and
exhibit a further drop in performance on the Hindi subset. This reveals key
limitations in structure-aware visual text understanding, numerical reasoning,
and cross-lingual generalization. The dataset is publicly available via Hugging
Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM
research in this direction.
♻ ☆ From Confidence to Collapse in LLM Factual Robustness
Ensuring the robustness of factual knowledge in LLMs is critical for reliable
applications in tasks such as question answering and reasoning. However,
existing evaluation methods predominantly focus on performance-based metrics,
often investigating from the perspective of prompt perturbations, which
captures only the externally triggered side of knowledge robustness. To bridge
this gap, we introduce a principled approach to measure factual robustness from
the perspective of the generation process by analyzing token distribution
entropy in combination with temperature scaling sensitivity. These two factors
combine into the Factual Robustness Score (FRS), a novel metric that quantifies
the stability of a fact against perturbations in decoding conditions, given its
initial uncertainty. To validate our approach, we conduct extensive experiments
on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We
show that factual robustness varies significantly -- smaller models report an
FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under
increased uncertainty. These insights demonstrate how entropy and temperature
scaling impact factual accuracy, and lay a foundation for developing more
robust knowledge retention and retrieval in future models.
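An illustration of the two ingredients behind the score: the initial entropy of the answer-token distribution and how much the answer probability shifts under temperature scaling. The combination below is a stand-in, not the paper's FRS formula; `answer_id` is assumed to be the index of the fact's answer token.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits):
    p = F.softmax(logits, dim=-1)
    return -(p * p.clamp_min(1e-12).log()).sum(-1)

def factual_robustness_sketch(logits, answer_id, temperatures=(0.5, 1.0, 1.5, 2.0)):
    """logits: 1-D tensor over the vocabulary at the answer position."""
    probs_at_t = torch.stack([
        F.softmax(logits / t, dim=-1)[answer_id] for t in temperatures
    ])
    sensitivity = probs_at_t.max() - probs_at_t.min()   # spread across temperatures
    entropy = token_entropy(logits) / torch.log(torch.tensor(float(logits.numel())))
    return (1.0 - entropy) * (1.0 - sensitivity)        # higher = more robust
```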
♻ ☆ Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models EMNLP 2025
The proliferation of misinformation in digital platforms reveals the
limitations of traditional detection methods, which mostly rely on static
classification and fail to capture the intricate process of real-world
fact-checking. Despite advancements in Large Language Models (LLMs) that
enhance automated reasoning, their application to misinformation detection
remains hindered by issues of logical inconsistency and superficial
verification. In response, we introduce Debate-to-Detect (D2D), a novel
Multi-Agent Debate (MAD) framework that reformulates misinformation detection
as a structured adversarial debate. Inspired by fact-checking workflows, D2D
assigns domain-specific profiles to each agent and orchestrates a five-stage
debate process, including Opening Statement, Rebuttal, Free Debate, Closing
Statement, and Judgment. To transcend traditional binary classification, D2D
introduces a multi-dimensional evaluation mechanism that assesses each claim
across five distinct dimensions: Factuality, Source Reliability, Reasoning
Quality, Clarity, and Ethics. Experiments with GPT-4o on two datasets
demonstrate significant improvements over baseline methods, and the case study
highlights D2D's capability to iteratively refine evidence while improving
decision transparency, representing a substantial advancement towards
interpretable misinformation detection. The code will be released publicly
after the official publication.
comment: This paper has been accepted to EMNLP 2025 (Main Conference)
♻ ☆ Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps
Long-context language models (LCLMs), characterized by their extensive
context window, are becoming popular. However, despite being nearly perfect at
standard long-context retrieval tasks, our evaluations demonstrate that they
fail in some basic cases. We then find that these failures can be addressed
with a sufficient number of reasoning steps, guided by specific CoT
prompts. This result emphasizes the potential necessity of solving specific
long-context tasks using long-CoT methods, while previous long-context
benchmarks always ignore the necessity of long reasoning for long-context tasks
and treat them as direct QA tasks.
comment: Our code is publicly available at
https://github.com/yuyijiong/hard_retrieval_for_llm and the dataset is at
https://huggingface.co/datasets/yuyijiong/difficult_retrieval
♻ ☆ Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment
Learning to ground natural language queries to target objects or regions in
3D point clouds is essential for 3D scene understanding. Nevertheless,
existing 3D visual grounding approaches require a substantial number of
bounding box annotations for text queries, which is time-consuming and
labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly
supervised approach for 3D visual grounding based on Visual Linguistic
Alignment. Our 3D-VLA exploits the superior ability of current large-scale
vision-language models (VLMs) on aligning the semantics between texts and 2D
images, as well as the naturally existing correspondences between 2D images and
3D point clouds, and thus implicitly constructs correspondences between texts
and 3D point clouds with no need for fine-grained box annotations in the
training procedure. During the inference stage, the learned text-3D
correspondence will help us ground the text queries to the 3D target objects
even without 2D images. To the best of our knowledge, this is the first work to
investigate 3D visual grounding in a weakly supervised manner by involving
large scale vision-language models, and extensive experiments on ReferIt3D and
ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even
superior results over the fully supervised methods.
♻ ☆ Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning
Long chain-of-thought (Long-CoT) reasoning improves accuracy in LLMs, yet its
verbose, self-reflective style often hinders effective distillation into small
language models (SLMs). We revisit Long-CoT compression through the lens of
capability alignment and ask: Can pruning improve reasoning? We propose
Prune-on-Logic, a structure-aware framework that transforms Long-CoT into logic
graphs and selectively prunes low-utility reasoning steps under
self-verification constraints. Through systematic analysis across three pruning
strategies - targeting entire chains, core reasoning, and verification - we
find that verification pruning consistently improves accuracy while reducing
token usage, whereas reasoning or indiscriminate pruning degrades performance.
Our study reveals that effective pruning aligns supervision with model capacity
rather than merely shortening inputs. Gains hold across tasks, model scales,
and CoT capability, with larger models benefiting more from pruning due to
richer but more redundant reasoning. Our empirical findings highlight pruning
as a structural optimization strategy for aligning CoT reasoning with SLM
capacity.
comment: 19 pages,6 figures
♻ ☆ Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Recent advances in reasoning models have demonstrated significant
improvements in accuracy by employing detailed and comprehensive reasoning
processes. However, generating these lengthy reasoning sequences is
computationally expensive and time-consuming. To address this inefficiency, we
leverage the inherent parallelizability of certain tasks to accelerate the
reasoning process. Specifically, when multiple parallel reasoning steps exist,
we decode multiple tokens per forward pass via a tree-like attention mask
within a single sequence, avoiding additional memory usage. Experimental
results show that our method achieves up to nearly 100\% speedup in decoding
while largely maintaining answer quality.
comment: Our code is available in
https://github.com/yuyijiong/parallel-decoding-in-one-sequence
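A sketch of the core data structure: a tree-like attention mask that appends several parallel branches to one sequence, so each branch attends to the shared prefix and to its own earlier tokens but not to sibling branches. Details of the actual method (e.g., how branches are detected and merged) may differ.

```python
import torch

def tree_attention_mask(prefix_len, branch_lens):
    """Boolean mask (True = attention allowed) for one sequence containing a
    shared prefix followed by several mutually independent branches."""
    total = prefix_len + sum(branch_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:prefix_len, :prefix_len] = torch.tril(torch.ones(prefix_len, prefix_len)).bool()
    start = prefix_len
    for blen in branch_lens:
        rows = slice(start, start + blen)
        mask[rows, :prefix_len] = True                                # see the prefix
        mask[rows, rows] = torch.tril(torch.ones(blen, blen)).bool()  # causal within branch
        start += blen
    return mask

# e.g. tree_attention_mask(prefix_len=4, branch_lens=[3, 3]) yields a 10x10 mask
# where positions 4-6 and 7-9 ignore each other but both see positions 0-3.
```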
♻ ☆ Fingerprint Vector: Enabling Scalable and Efficient Model Fingerprint Transfer via Vector Addition
Backdoor-based fingerprinting has emerged as an effective technique for
tracing the ownership of large language models. However, in real-world
deployment scenarios, developers often instantiate multiple downstream models
from a shared base model, and applying fingerprinting to each variant
individually incurs prohibitive computational overhead. While inheritance-based
approaches -- where fingerprints are embedded into the base model and expected
to persist through fine-tuning -- appear attractive, they suffer from three key
limitations: late-stage fingerprinting, fingerprint instability, and
interference with downstream adaptation. To address these challenges, we
propose a novel mechanism called the Fingerprint Vector. Our method first
embeds a fingerprint into the base model via backdoor-based fine-tuning, then
extracts a task-specific parameter delta as a fingerprint vector by computing
the difference between the fingerprinted and clean models. This vector can be
directly added to any structurally compatible downstream model, allowing the
fingerprint to be transferred post hoc without additional fine-tuning.
Extensive experiments show that Fingerprint Vector achieves comparable or
superior performance to direct injection across key desiderata. It maintains
strong effectiveness across diverse model architectures as well as mainstream
downstream variants within the same family. It also preserves harmlessness and
robustness in most cases. Even when slight robustness degradation is observed,
the impact remains within acceptable bounds and is outweighed by the
scalability benefits of our approach.
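A minimal, task-vector-style sketch of the mechanism: extract the parameter-wise delta between the fingerprinted and clean base models, then add it to a structurally compatible downstream checkpoint. The `scale` knob is an illustrative assumption, not part of the paper's definition.

```python
import torch

def extract_fingerprint_vector(clean_state, fingerprinted_state):
    """Fingerprint vector = parameter-wise difference between the fingerprinted
    base model and its clean counterpart (both given as state dicts)."""
    return {name: fingerprinted_state[name] - clean_state[name]
            for name in clean_state}

def transfer_fingerprint(downstream_state, fingerprint_vector, scale=1.0):
    """Add the fingerprint vector to a compatible downstream checkpoint,
    post hoc and without further fine-tuning."""
    return {name: param + scale * fingerprint_vector.get(name, torch.zeros_like(param))
            for name, param in downstream_state.items()}
```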
♻ ☆ SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models EMNLP 2025
Large Language Models (LLMs) excel at various natural language processing
tasks but remain vulnerable to jailbreaking attacks that induce harmful content
generation. In this paper, we reveal a critical safety inconsistency: LLMs can
more effectively identify harmful requests as discriminators than defend
against them as generators. This insight inspires us to explore aligning the
model's inherent discrimination and generation capabilities. To this end, we
propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement
learning framework that leverages the model's own discrimination capabilities
as a reward signal to enhance generation safety through iterative
self-improvement. Our method does not require any additional annotated data or
external models during the training phase. Extensive experiments demonstrate
that SDGO significantly improves model safety compared to both prompt-based and
training-based baselines while maintaining helpfulness on general benchmarks.
By aligning LLMs' discrimination and generation capabilities, SDGO brings
robust performance against out-of-distribution (OOD) jailbreaking attacks. This
alignment achieves tighter coupling between these two capabilities, enabling
the model's generation capability to be further enhanced with only a small
amount of discriminative samples. Our code and datasets are available at
https://github.com/NJUNLP/SDGO.
comment: Accepted by EMNLP 2025 (Main Conference), 15 pages, 4 figures, 6
tables
♻ ☆ sudoLLM: On Multi-role Alignment of Language Models EMNLP 2025
User authorization-based access privileges are a key feature in many
safety-critical systems, but have not been extensively studied in the large
language model (LLM) realm. In this work, drawing inspiration from such access
control systems, we introduce sudoLLM, a novel framework that results in
multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance
with, user access rights. sudoLLM injects subtle user-based biases into queries
and trains an LLM to utilize this bias signal in order to produce sensitive
information if and only if the user is authorized. We present empirical results
demonstrating that this approach shows substantially improved alignment,
generalization, resistance to prefix-based jailbreaking attacks, and
``fails-closed''. The persistent tension between the language modeling
objective and safety alignment, which is often exploited to jailbreak LLMs, is
somewhat resolved with the aid of the injected bias signal. Our framework is
meant as an additional security layer, and complements existing guardrail
mechanisms for enhanced end-to-end safety with LLMs.
comment: Accepted to EMNLP 2025 (findings)
♻ ☆ Retrieval Enhanced Feedback via In-context Neural Error-book EMNLP 2025
Recent advancements in Large Language Models (LLMs) have significantly
improved reasoning capabilities, with in-context learning (ICL) emerging as a
key technique for adaptation without retraining. While previous works have
focused on leveraging correct examples, recent research highlights the
importance of learning from errors to enhance performance. However, existing
methods lack a structured framework for analyzing and mitigating errors,
particularly in Multimodal Large Language Models (MLLMs), where integrating
visual and textual inputs adds complexity. To address this issue, we propose
REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a
teacher-student framework that systematically structures errors and provides
targeted feedback. REFINE introduces three systematic queries to construct
structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance
multimodal reasoning by prioritizing relevant visual information, diagnosing
critical failure points, and formulating corrective actions. Unlike prior
approaches that rely on redundant retrievals, REFINE optimizes structured
feedback retrieval, improving inference efficiency, token usage, and
scalability. Our results demonstrate substantial speedup, reduced computational
costs, and successful generalization, highlighting REFINE's potential for
enhancing multimodal reasoning.
comment: Accepted at EMNLP 2025 main conference
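A rough sketch of how the three structured queries named above (Feed-Target, Feed-Check, Feed-Path) could be assembled into a retrievable error-book entry; teacher_fn and the query wording are illustrative placeholders, not REFINE's released prompts:

from dataclasses import dataclass

@dataclass
class ErrorBookEntry:
    question: str
    feed_target: str   # which (visual) evidence the student should attend to
    feed_check: str    # where the student's reasoning failed
    feed_path: str     # corrective steps toward the right answer

def build_entry(question, wrong_answer, teacher_fn):
    """Ask a teacher model the three structured queries for a failed example."""
    return ErrorBookEntry(
        question=question,
        feed_target=teacher_fn("Which evidence is most relevant to: " + question),
        feed_check=teacher_fn("Why is '" + wrong_answer + "' wrong for: " + question),
        feed_path=teacher_fn("Give corrective reasoning steps for: " + question),
    )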
♻ ☆ Subjective Perspectives within Learned Representations Predict High-Impact Innovation
Existing studies of innovation emphasize the power of social structures to
shape innovation capacity. Emerging machine learning approaches, however,
enable us to model innovators' personal perspectives and interpersonal
innovation opportunities as a function of their prior experience. We theorize
and then quantify subjective perspectives and their interaction based on
innovator positions within the geometric space of concepts inscribed by dynamic
machine-learned language representations. Using data on millions of scientists,
inventors, screenplay writers, entrepreneurs, and Wikipedia contributors across
their respective creative domains, here we show that measured subjective
perspectives predict which ideas individuals and groups will creatively attend
to and successfully combine in the future. Across all cases and time periods we
examine, when perspective diversity is decomposed as the difference between
collaborators' perspectives on their creation, and background diversity as the
difference between their experiences, the former consistently anticipates
creative achievement while the latter portends its opposite. We analyze a
natural experiment and simulate creative collaborations between AI agents
designed with varying perspective and background diversity; both analyses support our
observational findings. We explore mechanisms underlying these findings and
identify how successful collaborators leverage common language to weave
together diverse experiences obtained through trajectories of prior work. These
perspectives converge and provoke one another to innovate. We examine the
significance of these findings for team formation and research policy.
comment: 123 pages, 23 figures
♻ ☆ ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction
Yan Yu, Yilun Liu, Minggui He, Shimin Tao, Weibin Meng, Xinhua Yang, Li Zhang, Hongxia Ma, Dengye Li, Daimeng Wei, Boxing Chen, Fuliang Li
Pairwise evaluation of large language models (LLMs) has become the dominant
paradigm for benchmarking open-ended tasks, yet non-transitive preferences,
where evaluators prefer A over B, B over C, but C over A, fundamentally
undermine ranking reliability. We show that this critical issue stems largely
from low-quality data that contains inherently ambiguous preference pairs. To
address this challenge, we propose ELSPR, a principled graph-theoretic
framework that models pairwise preferences as tournament graphs and
systematically identifies problematic training data. ELSPR quantifies
non-transitivity through strongly connected component (SCC) analysis and
measures overall preference clarity using a novel normalized directed graph
structural entropy metric. Our filtering methodology selectively removes
preference data that induce non-transitivity while preserving transitive
preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that
models fine-tuned on ELSPR-filtered data achieve substantial improvements: a
13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy,
and significantly enhanced discriminative power in real-world evaluation
systems. Human validation confirms that discarded data exhibit dramatically
lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency
(51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as
an effective data self-purification approach for developing more robust,
consistent, and human-aligned LLM evaluation systems.
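The SCC-based filtering step can be pictured with a few lines of graph code; the sketch below uses networkx and is a simplification of ELSPR (it drops every preference edge inside a non-trivial strongly connected component and omits the structural-entropy metric):

import networkx as nx

def filter_nontransitive(pairs):
    """pairs: list of (winner, loser) preference judgments."""
    g = nx.DiGraph()
    g.add_edges_from(pairs)
    # Map each node to its SCC; an SCC with more than one node contains a
    # preference cycle (A > B > ... > A), i.e., non-transitive judgments.
    scc_of = {}
    for idx, scc in enumerate(nx.strongly_connected_components(g)):
        for node in scc:
            scc_of[node] = (idx, len(scc))
    # Keep only edges that do not lie inside a cyclic component.
    return [(w, l) for (w, l) in pairs
            if not (scc_of[w][0] == scc_of[l][0] and scc_of[w][1] > 1)]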
♻ ☆ HateDebias: On the Diversity and Variability of Hate Speech Debiasing
Hate speech frequently appears on social media platforms and urgently needs
to be effectively controlled. Alleviating the bias caused by hate speech can
help resolve various ethical issues. Although existing research has constructed
several datasets for hate speech detection, these datasets seldom consider the
diversity and variability of bias, making them far from real-world scenarios.
To fill this gap, we propose a benchmark HateDebias to analyze the fairness of
models under dynamically evolving environments. Specifically, to meet the
diversity of biases, we collect hate speech data with different types of biases
from real-world scenarios. To further simulate the variability of
real-world scenarios (i.e., changes in bias attributes within the data), we
construct a dataset that follows a continual learning setting and evaluate the
detection accuracy of models on HateDebias, where performance degradation
indicates a significant bias toward a specific attribute. To provide a
potential direction, we further propose a continual debiasing framework
tailored to dynamic bias in real-world scenarios, integrating memory replay and
bias information regularization to ensure the fairness of the model.
Experimental results on the HateDebias benchmark reveal that our method
achieves improved performance in mitigating dynamic biases, highlighting
its practicality for real-world applications.
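As a very rough sketch of the continual setting with memory replay mentioned above (train_step, the task streams, and the replay size are placeholders; the bias-information regularization term is omitted):

import random

def continual_debias_train(model, task_streams, train_step, replay_size=64):
    """task_streams: list of data streams, one per bias-attribute phase."""
    memory = []
    for stream in task_streams:
        for batch in stream:                             # batch: list of examples
            replay = random.sample(memory, min(replay_size, len(memory)))
            train_step(model, batch + replay)            # replay fights forgetting
            memory.extend(batch)                         # keep examples for later phases
    return model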
♻ ☆ Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
The proliferation of generative models has presented significant challenges
in distinguishing authentic human-authored content from deepfake content.
Collaborative human efforts, augmented by AI tools, present a promising
solution. In this study, we explore the potential of DeepFakeDeLiBot, a
deliberation-enhancing chatbot, to support groups in detecting deepfake text.
Our findings reveal that group-based problem-solving significantly improves the
accuracy of identifying machine-generated paragraphs compared to individual
efforts. While engagement with DeepFakeDeLiBot does not yield substantial
performance gains overall, it enhances group dynamics by fostering greater
participant engagement, consensus building, and the frequency and diversity of
reasoning-based utterances. Additionally, participants with higher perceived
effectiveness of group collaboration exhibited performance benefits from
DeepFakeDeLiBot. These findings underscore the potential of deliberative
chatbots in fostering interactive and productive group dynamics while ensuring
accuracy in collaborative deepfake text detection. The dataset and source
code used in this study will be made publicly available upon acceptance of the
manuscript.
comment: 15; To appear in ICWSM 2026 (https://www.icwsm.org/2026/)
♻ ☆ Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge CIKM 2025
Yuhe Ji, Yilun Liu, Feiyu Yao, Minggui He, Shimin Tao, Xiaofeng Zhao, Su Chang, Xinhua Yang, Weibin Meng, Yuming Xie, Boxing Chen, Shenglin Zhang, Yongqian Sun
Log analysis represents a critical sub-domain within AI applications that
facilitates automatic approaches to fault and error management of large-scale
software systems, saving the labor of traditional manual methods. While existing
solutions using large language models (LLMs) show promise, they are limited by
a significant domain gap between natural and log languages (the latter contains
rich domain-specific tokens such as status codes, IP addresses, and resource
paths), which restricts their effectiveness in real-world applications.
However, directly adapting general-purpose LLMs to log analysis using raw logs
may degrade their performance due to inconsistent token distribution. In this
paper, we present a domain adaptation approach that addresses these limitations
by integrating interpretable domain knowledge into open-source LLMs through
continual pre-training (CPT), which bridges this domain gap by adapting LLMs on
interpretable natural texts with log knowledge (instead of raw logs) to reduce
distribution discrepancy. To achieve this, we developed NLPLog, a comprehensive
dataset containing over 250,000 question-answer pairs on log-related knowledge.
Our resulting model, SuperLog, achieves the best performance across four log
analysis tasks, with an average accuracy improvement of 12.01% over the
second-best model. An ablation study also suggests the advantage of domain
adaptation with interpretable log knowledge over adaptation on raw logs.
comment: Accepted by CIKM 2025
♻ ☆ Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings EMNLP 2025
This work stems from an observed limitation of text encoders: embeddings may
not be able to recognize fine-grained entities or events within encoded
semantics, resulting in failed retrieval even in simple cases. To examine such
behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which
passages are image captions and queries are phrases targeting entity or event
concepts in diverse forms. Zero-shot evaluation suggests that encoders often
struggle with this fine-grained matching, regardless of training sources or
model size. Aiming for enhancement, we proceed to finetune encoders with our
proposed data generation strategies, enabling a small 0.1B encoder to
outperform the state-of-the-art 7B model. Within this process, we further
uncover the granularity dilemma, a challenge for embeddings to capture
fine-grained salience while aligning with overall semantics. Our dataset, code
and models in this work are publicly released at
https://github.com/lxucs/CapRetrieval.
comment: Accepted to EMNLP 2025 Findings
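The kind of zero-shot check described above can be reproduced with a few lines of dense-retrieval code; embed_fn stands for any sentence-embedding model, and the field layout is illustrative rather than the CapRetrieval format:

import numpy as np

def top1_retrieval_accuracy(queries, passages, gold_idx, embed_fn):
    q = np.stack([np.asarray(embed_fn(x), dtype=float) for x in queries])
    p = np.stack([np.asarray(embed_fn(x), dtype=float) for x in passages])
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    pred = (q @ p.T).argmax(axis=1)          # cosine similarity via dot product
    return float((pred == np.asarray(gold_idx)).mean())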
♻ ☆ Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling EMNLP 2025
Yue Zhao, Xiaoyu Wang, Dan Wang, Zhonglin Jiang, Qingqing Gu, Teng Chen, Ningyuan Xi, Jinxian Qu, Yong Chen, Luo Ji
World models have been widely utilized in robotics, gaming, and auto-driving.
However, their applications on natural language tasks are relatively limited.
In this paper, we construct a dialogue world model that predicts the
user's emotion, sentiment, and intention, as well as future utterances. By
defining a POMDP, we argue that emotion, sentiment, and intention can be
modeled as the user belief and estimated by maximizing the information
bottleneck. With this user belief
modeling, we apply the model-based reinforcement learning framework to the
dialogue system, and propose a framework called DreamCUB. Experiments show that
the pretrained dialogue world model can achieve state-of-the-art performances
on emotion classification and sentiment identification, while dialogue quality
is also enhanced by joint training of the policy, critic and dialogue world
model. Further analysis shows that this approach maintains a reasonable
exploration-exploitation balance and also transfers well to out-of-domain
scenarios such as empathetic dialogues.
comment: Accepted to EMNLP 2025 Findings
♻ ☆ Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements EMNLP 2025
In this paper, we propose a ``Generalization Stress Test'' to assess Large
Language Models' (LLMs) generalization ability under slight and controlled
perturbations, including option length, problem types, and irrelevant noun
replacements. We achieve novel and significant findings that, despite high
benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases
(e.g., preference for longer distractors) when faced with these minor but
content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises
from 60 to 89 and drops from 89 to 36 when option lengths are changed without
altering the question. Even GPT-4o experiences a 25-point accuracy loss when
problem types are changed, with a 6-point drop across all three modification
categories. These analyses suggest that LLMs rely heavily on superficial cues
rather than forming robust, abstract representations that generalize across
formats, lexical variations, and irrelevant content shifts.
comment: EMNLP 2025 Main Conference
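For illustration, one of the content-preserving perturbations named above (changing option length) could be scripted as below; the filler text and data layout are made up for the example, not taken from the paper:

def lengthen_distractors(item, filler=" (stated here in a deliberately more verbose way)"):
    """Make incorrect options longer without changing what they assert."""
    options = {
        key: text if key == item["answer"] else text + filler
        for key, text in item["options"].items()
    }
    return {**item, "options": options}

item = {"stem": "2 + 2 = ?", "options": {"A": "4", "B": "5"}, "answer": "A"}
perturbed = lengthen_distractors(item)   # only the distractor "B" gets longer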
♻ ☆ CausalSent: Interpretable Sentiment Classification with RieszNet
Despite the overwhelming performance improvements offered by recent natural
language processing (NLP) models, the decisions made by these models are
largely a black box. Towards closing this gap, the field of causal NLP combines
causal inference literature with modern NLP models to elucidate causal effects
of text features. We replicate and extend Bansal et al.'s work on regularizing
text classifiers to adhere to estimated effects, focusing instead on model
interpretability. Specifically, we focus on developing a two-headed
RieszNet-based neural network architecture which achieves better treatment
effect estimation accuracy. Our framework, CausalSent, accurately predicts
treatment effects in semi-synthetic IMDB movie reviews, reducing MAE of effect
estimates by 2-3x compared to Bansal et al.'s MAE on synthetic Civil Comments
data. With an ensemble of validated models, we perform an observational case
study on the causal effect of the word "love" in IMDB movie reviews, finding
that the presence of the word "love" causes a +2.9% increase in the probability
of a positive sentiment.
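A minimal PyTorch sketch of a two-headed architecture of the kind described above, with a shared text-feature encoder feeding a Riesz-representer head and an outcome head; dimensions and layer choices are assumptions, not the CausalSent code:

import torch
import torch.nn as nn

class TwoHeadedEstimator(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.riesz_head = nn.Linear(hidden, 1)    # Riesz representer alpha(x)
        self.outcome_head = nn.Linear(hidden, 1)  # outcome regression g(x)

    def forward(self, x):
        h = self.shared(x)
        return self.riesz_head(h), self.outcome_head(h)

model = TwoHeadedEstimator(in_dim=768)            # e.g., pooled text embeddings
alpha, outcome = model(torch.randn(4, 768))       # two heads share one representation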
♻ ☆ Evaluating Scoring Bias in LLM-as-a-Judge
The remarkable performance of Large Language Models (LLMs) gives rise
to ``LLM-as-a-Judge'', where LLMs are employed as evaluators for complex tasks.
Moreover, it has been widely adopted across fields such as Natural Language
Processing (NLP), preference learning, and various specific domains. However,
there are various biases within LLM-as-a-Judge, which adversely affect the
fairness and reliability of judgments. Current research on evaluating or
mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based
evaluations, while systematic investigations into bias in scoring-based
evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge
as the extent to which assigned scores change when the judge model is subjected
to bias-related perturbations, and provide a well-designed framework to
comprehensively evaluate scoring bias. We
augment existing LLM-as-a-Judge benchmarks through data synthesis to construct
our evaluation dataset and design multi-faceted evaluation metrics. Our
experimental results demonstrate that the scoring stability of existing judge
models is disrupted by scoring biases. Further exploratory experiments and
discussions provide valuable insights into the design of scoring prompt
templates and the mitigation of scoring biases on aspects such as score
rubrics, score IDs, and reference answer selection.
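In its simplest form, scoring bias as defined above can be measured by scoring the same answer under an original and a bias-perturbed judge prompt and comparing the results; judge_fn and the prompt templates below are placeholders, not the paper's protocol:

def scoring_bias(judge_fn, answer, prompt_template, perturbed_template):
    """judge_fn maps a fully formatted judge prompt to a numeric score."""
    base = judge_fn(prompt_template.format(answer=answer))
    perturbed = judge_fn(perturbed_template.format(answer=answer))
    return abs(base - perturbed)   # 0 means the score is stable under the perturbation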
♻ ☆ EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
Jingwen Liu, Kan Jen Cheng, Jiachen Lian, Akshay Anand, Rishi Jain, Faith Qiao, Robin Netzorg, Huang-Cheng Chou, Tingle Li, Guan-Ting Lin, Gopala Anumanchipalli
Speech emotions play a crucial role in human-computer interaction, shaping
engagement and context-aware communication. Despite recent advances in spoken
dialogue systems, a holistic system for evaluating emotional reasoning is still
lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing
emotional coherence in dialogue systems. It leverages a curated dataset
generated via text-to-speech to simulate diverse emotional states, overcoming
the scarcity of emotional speech data. We further propose the Cross-turn
Emotion Reasoning Score to assess the emotion transitions in multi-turn
dialogues. Evaluating seven dialogue systems through continuous, categorical,
and perceptual metrics, we show that our framework effectively detects
emotional inconsistencies, providing insights for improving current dialogue
systems. By releasing a systematic evaluation benchmark, we aim to advance
emotion-aware spoken dialogue modeling toward more natural and adaptive
interactions.
comment: Accepted at ASRU 2025 (2025 IEEE Automatic Speech Recognition and
Understanding Workshop)
♻ ☆ Measuring Sycophancy of Language Models in Multi-turn Dialogues EMNLP 2025
Large Language Models (LLMs) are expected to provide helpful and harmless
responses, yet they often exhibit sycophancy--conforming to user beliefs
regardless of factual accuracy or ethical soundness. Prior research on
sycophancy has primarily focused on single-turn factual correctness,
overlooking the dynamics of real-world interactions. In this work, we introduce
SYCON Bench, a novel benchmark for evaluating sycophantic behavior in
multi-turn, free-form conversational settings. Our benchmark measures how
quickly a model conforms to the user (Turn of Flip) and how frequently it
shifts its stance under sustained user pressure (Number of Flip). Applying
SYCON Bench to 17 LLMs across three real-world scenarios, we find that
sycophancy remains a prevalent failure mode. Our analysis shows that alignment
tuning amplifies sycophantic behavior, whereas model scaling and reasoning
optimization strengthen the model's ability to resist undesirable user views.
Reasoning models generally outperform instruction-tuned models but often fail
when they over-index on logical exposition instead of directly addressing the
user's underlying beliefs. Finally, we evaluate four additional prompting
strategies and demonstrate that adopting a third-person perspective reduces
sycophancy by up to 63.8% in the debate scenario. We release our code and data at
https://github.com/JiseungHong/SYCON-Bench.
comment: Accepted to Findings of EMNLP 2025
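The two metrics named above can be computed directly from a per-turn stance trace; the sketch below assumes a boolean trace where True means the model still holds its original position (the stance-detection step itself is outside the sketch):

def turn_of_flip(stances):
    """First turn at which the model conforms to the user, or None if it never flips."""
    for turn, holds_position in enumerate(stances, start=1):
        if not holds_position:
            return turn
    return None

def number_of_flips(stances):
    """How many times the stance changes under sustained user pressure."""
    return sum(1 for prev, cur in zip(stances, stances[1:]) if prev != cur)

print(turn_of_flip([True, True, False, True]))     # 3
print(number_of_flips([True, True, False, True]))  # 2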
♻ ☆ DLLMQuant: Quantizing Diffusion-based Large Language Models
Diffusion-based large language models (DLLMs) have shown promise for
non-autoregressive text generation, but their deployment is constrained by
large model sizes and heavy computational costs. Post-training quantization
(PTQ), a widely used method for compressing and accelerating Large Language
Models (LLMs), suffers from severe accuracy degradation and reduced
generalization performance when directly applied to DLLMs (e.g., AWQ suffers a
16% accuracy drop on LLADA under W4A4). This paper explores how DLLMs' key
mechanisms - dynamic masking, iterative generation, bidirectional attention -
clash with quantization. We identify three core issues: 1) Iterative generation
and dynamic masking ratios lead to distinct token distributions across decoding
steps, which are not adequately captured by existing PTQ calibration methods;
2) Quantization errors are accumulated and amplified progressively during
iteration in DLLMs, causing quantized models to perform worse as decoding steps
progress; 3) Unmasked tokens stabilize while masked tokens remain probabilistic,
making the overall feature distribution incompatible with existing PTQ methods. To
address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs,
which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling
(TMAS), a calibration method that accounts for both time and mask factors, with
the capacity to capture distributions across timesteps. 2) Interaction-Aware
Activation Quantization (IA-AQ), which utilizes bidirectional attention's
interaction signals to dynamically allocate quantization resources. 3)
Certainty-Guided Quantization (CGQ), which integrates mask status and token
scores as key weighting criteria into error compensation, making weight
quantization more suitable for DLLMs. Experiments show that DLLMQuant achieves
significant performance gains while enhancing efficiency.
comment: 12 pages, 6 figures
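A heavily simplified sketch of timestep- and mask-aware calibration in the spirit of TMAS (run_step and the statistic collected here are placeholders; the actual method is more involved):

def collect_calibration_stats(model, prompts, timesteps, mask_ratios, run_step):
    """Record activation ranges across decoding steps and masking ratios."""
    stats = []
    for t in timesteps:
        for ratio in mask_ratios:
            for prompt in prompts:
                acts = run_step(model, prompt, timestep=t, mask_ratio=ratio)
                stats.append({"timestep": t, "mask_ratio": ratio,
                              "absmax": float(abs(acts).max())})
    return stats   # per-(timestep, mask-ratio) ranges then inform quantization scales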
♻ ☆ Improving Multilingual Language Models by Aligning Representations through Steering
This paper investigates how Large Language Models (LLMs) represent
non-English tokens -- a question that remains underexplored despite recent
progress. We propose a lightweight intervention method using representation
steering, where a learned vector is added to the residual stream at a single
model layer to enhance multilingual performance. Through extensive experiments
across seven competitive baselines -- including prompt optimization, supervised
fine-tuning (SFT), in-context learning, cross-lingual transfer, and
translation-based methods -- we show that our approach consistently outperforms
most alternatives. In particular, it achieves performance on par with
production-grade translation systems while requiring far fewer resources. We
further explore the complementarity between our method and SFT, demonstrating
that steering offers a direct, efficient way to realign internal
representations. These findings underscore the potential of activation-level
interventions as a powerful tool for improving the multilingual capabilities of
LLMs.
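The single-layer intervention described above maps naturally onto a PyTorch forward hook; the sketch below is generic (layer index, module path, and vector shape are assumptions about a typical decoder-only model, not tied to the paper's setup):

import torch

def add_steering_hook(layer_module, steering_vector):
    """Add a learned vector to one layer's output during the forward pass."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steering_vector        # broadcasts over batch and sequence dims
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer_module.register_forward_hook(hook)

# Usage (hypothetical module path for a decoder-only transformer):
# handle = add_steering_hook(model.model.layers[12], learned_vector)
# ...generate as usual...
# handle.remove()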
♻ ☆ Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions EMNLP 2025
Model compression through post-training pruning offers a way to reduce model
size and computational requirements without significantly impacting model
performance. However, the effect of pruning on the fairness of LLM-generated
summaries remains unexplored, particularly for opinion summarisation where
biased outputs could influence public views. In this paper, we present a
comprehensive empirical analysis of opinion summarisation, examining three
state-of-the-art pruning methods and various calibration sets across three
open-source LLMs using four fairness metrics. Our systematic analysis reveals
that pruning methods have a greater impact on fairness than calibration sets.
Building on these insights, we propose High Gradient Low Activation (HGLA)
pruning, which identifies and removes parameters that are redundant for input
processing but influential in output generation. Our experiments demonstrate
that HGLA can better maintain or even improve fairness compared to existing
methods, showing promise across models and tasks where traditional methods have
limitations. Our human evaluation shows HGLA-generated outputs are fairer than
existing state-of-the-art pruning methods. Code is available at:
https://github.com/amberhuang01/HGLA.
comment: Accepted to EMNLP 2025 Main Conference
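One plain reading of the ``high gradient, low activation'' criterion described above is a per-weight score of gradient magnitude divided by input-activation magnitude; the sketch below encodes only that reading (a simplification of HGLA, not its released code):

import torch

def hgla_scores(weight_grad, input_abs_activation, eps=1e-6):
    """weight_grad: [out_features, in_features];
    input_abs_activation: mean |activation| per input feature, shape [in_features]."""
    # High gradient (influential for outputs) over low activation (little used for
    # inputs) gives a high score; top-scoring weights are pruning candidates.
    return weight_grad.abs() / (input_abs_activation + eps)

# scores = hgla_scores(linear.weight.grad, calib_abs_activation)
# prune_mask = scores >= scores.flatten().kthvalue(int(0.9 * scores.numel())).values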
♻ ☆ Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
Efficient state restoration in multi-turn conversations with large language
models (LLMs) remains a critical challenge, primarily due to the overhead of
recomputing or loading full key-value (KV) caches for all historical tokens. To
address this, existing approaches compress KV caches across adjacent layers
with highly similar attention patterns. However, these methods often apply a
fixed compression scheme across all conversations, selecting the same layer
pairs for compression without considering conversation-specific attention
dynamics. This static strategy overlooks variability in attention pattern
similarity across different conversations, which can lead to noticeable
accuracy degradation.
We present Krul, a multi-turn LLM inference system that enables accurate and
efficient KV cache restoration. Krul dynamically selects compression strategies
based on attention similarity across layer pairs and uses a
recomputation-loading pipeline to restore the KV cache. It introduces three key
innovations: 1) a preemptive compression strategy selector that preserves critical
context for future conversation turns and selects a customized strategy for each
conversation; 2) a token-wise heterogeneous attention similarity estimator that
reduces the overhead of computing and storing attention similarity during model
generation; 3) a bubble-free restoration scheduler that reduces pipeline bubbles
caused by the imbalance between the recomputation and loading streams for
compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul
achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x
reduction in KV cache storage compared to state-of-the-art methods without
compromising generation quality.
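To give a flavor of the layer-pair selection that Krul makes conversation-specific, here is a simple heuristic: compare adjacent layers' attention maps by cosine similarity and mark sufficiently similar pairs as candidates for KV sharing (the threshold and similarity measure are illustrative; Krul's estimator is more elaborate):

import torch
import torch.nn.functional as F

def similar_layer_pairs(attn_maps, threshold=0.9):
    """attn_maps[i]: attention weights of layer i; all layers share the same shape."""
    flat = [a.flatten().float() for a in attn_maps]
    pairs = []
    for i in range(len(flat) - 1):
        sim = F.cosine_similarity(flat[i], flat[i + 1], dim=0)
        if sim.item() >= threshold:
            pairs.append((i, i + 1))   # adjacent layers whose KV caches could be shared
    return pairs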