MyArxiv
Computation and Language 96
StepWiser: Stepwise Generative Judges for Wiser Reasoning
As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task to a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show it provides (i) better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
Generative Interfaces for Language Models
Large language models (LLMs) are increasingly seen as assistants, copilots, and consultants, capable of supporting a wide range of tasks through natural conversation. However, most systems remain constrained by a linear request-response format that often makes interactions inefficient in multi-turn, information-dense, and exploratory tasks. To address these limitations, we propose Generative Interfaces for Language Models, a paradigm in which LLMs respond to user queries by proactively generating user interfaces (UIs) that enable more adaptive and interactive engagement. Our framework leverages structured interface-specific representations and iterative refinements to translate user queries into task-specific UIs. For systematic evaluation, we introduce a multidimensional assessment framework that compares generative interfaces with traditional chat-based ones across diverse tasks, interaction patterns, and query types, capturing functional, interactive, and emotional aspects of user experience. Results show that generative interfaces consistently outperform conversational ones, with humans preferring them in over 70% of cases. These findings clarify when and why users favor generative interfaces, paving the way for future advancements in human-AI interaction.
comment: Preprint
☆ Evaluating the Evaluators: Are readability metrics good measures of readability?
Plain Language Summarization (PLS) aims to distill complex documents into accessible summaries for non-expert audiences. In this paper, we conduct a thorough survey of PLS literature, and identify that the current standard practice for readability evaluation is to use traditional readability metrics, such as Flesch-Kincaid Grade Level (FKGL). However, despite proven utility in other fields, these metrics have not been compared to human readability judgments in PLS. We evaluate 8 readability metrics and show that most correlate poorly with human judgments, including the most popular metric, FKGL. We then show that Language Models (LMs) are better judges of readability, with the best-performing model achieving a Pearson correlation of 0.56 with human judgments. Extending our analysis to PLS datasets, which contain summaries aimed at non-expert audiences, we find that LMs better capture deeper measures of readability, such as required background knowledge, and lead to different conclusions than the traditional metrics. Based on these findings, we offer recommendations for best practices in the evaluation of plain language summaries. We release our analysis code and survey data.
☆ VibeVoice Technical Report
This report presents VibeVoice, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion, which is a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, when compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VibeVoice can synthesize long-form speech for up to 90 minutes (in a 64K context window length) with a maximum of 4 speakers, capturing the authentic conversational ``vibe'' and surpassing open-source and proprietary dialogue models.
☆ Demystifying Scientific Problem-Solving in LLMs by Probing Knowledge and Reasoning
Scientific problem solving poses unique challenges for LLMs, requiring both deep domain knowledge and the ability to apply such knowledge through complex reasoning. While automated scientific reasoners hold great promise for assisting human scientists, there is currently no widely adopted holistic benchmark for evaluating scientific reasoning, and few approaches systematically disentangle the distinct roles of knowledge and reasoning in these tasks. To address these gaps, we introduce SciReas, a diverse suite of existing benchmarks for scientific reasoning tasks, and SciReas-Pro, a selective subset that requires more complex reasoning. Our holistic evaluation surfaces insights about scientific reasoning performance that remain hidden when relying on individual benchmarks alone. We then propose KRUX, a probing framework for studying the distinct roles of reasoning and knowledge in scientific tasks. Combining the two, we conduct an in-depth analysis that yields several key findings: (1) Retrieving task-relevant knowledge from model parameters is a critical bottleneck for LLMs in scientific reasoning; (2) Reasoning models consistently benefit from external knowledge added in-context on top of the reasoning enhancement; (3) Enhancing verbalized reasoning improves LLMs' ability to surface task-relevant knowledge. Finally, we conduct a lightweight analysis, comparing our science-focused data composition with concurrent efforts on long CoT SFT, and release SciLit01, a strong 8B baseline for scientific reasoning.
comment: 28 pages, 16 figures
☆ The Ramon Llull's Thinking Machine for Automated Ideation
This paper revisits Ramon Llull's Ars combinatoria - a medieval framework for generating knowledge through symbolic recombination - as a conceptual foundation for building a modern Llull's thinking machine for research ideation. Our approach defines three compositional axes: Theme (e.g., efficiency, adaptivity), Domain (e.g., question answering, machine translation), and Method (e.g., adversarial training, linear attention). These elements represent high-level abstractions common in scientific work - motivations, problem settings, and technical approaches - and serve as building blocks for LLM-driven exploration. We mine elements from human experts or conference papers and show that prompting LLMs with curated combinations produces research ideas that are diverse, relevant, and grounded in current literature. This modern thinking machine offers a lightweight, interpretable tool for augmenting scientific creativity and suggests a path toward collaborative ideation between humans and AI.
comment: 21 pages, 3 figures
☆ Do LVLMs Know What They Know? A Systematic Study of Knowledge Boundary Perception in LVLMs EMNLP2025
Large vision-language models (LVLMs) demonstrate strong visual question answering (VQA) capabilities but are shown to hallucinate. A reliable model should perceive its knowledge boundaries-knowing what it knows and what it does not. This paper investigates LVLMs' perception of their knowledge boundaries by evaluating three types of confidence signals: probabilistic confidence, answer consistency-based confidence, and verbalized confidence. Experiments on three LVLMs across three VQA datasets show that, although LVLMs possess a reasonable perception level, there is substantial room for improvement. Among the three confidences, probabilistic and consistency-based signals are more reliable indicators, while verbalized confidence often leads to overconfidence. To enhance LVLMs' perception, we adapt several established confidence calibration methods from Large Language Models (LLMs) and propose three effective methods. Additionally, we compare LVLMs with their LLM counterparts, finding that jointly processing visual and textual inputs decreases question-answering performance but reduces confidence, resulting in an improved perception level compared to LLMs.
comment: EMNLP2025 Findings
☆ Beyond the Black Box: Integrating Lexical and Semantic Methods in Quantitative Discourse Analysis with BERTopic
Quantitative Discourse Analysis has seen growing adoption with the rise of Large Language Models and computational tools. However, reliance on black box software such as MAXQDA and NVivo risks undermining methodological transparency and alignment with research goals. This paper presents a hybrid, transparent framework for QDA that combines lexical and semantic methods to enable triangulation, reproducibility, and interpretability. Drawing from a case study in historical political discourse, we demonstrate how custom Python pipelines using NLTK, spaCy, and Sentence Transformers allow fine-grained control over preprocessing, lemmatisation, and embedding generation. We further detail our iterative BERTopic modelling process, incorporating UMAP dimensionality reduction, HDBSCAN clustering, and c-TF-IDF keyword extraction, optimised through parameter tuning and multiple runs to enhance topic coherence and coverage. By juxtaposing precise lexical searches with context-aware semantic clustering, we argue for a multi-layered approach that mitigates the limitations of either method in isolation. Our workflow underscores the importance of code-level transparency, researcher agency, and methodological triangulation in computational discourse studies. Code and supplementary materials are available via GitHub.
comment: 5 pages conference paper, 4 tables
☆ Retrieval-Augmented Generation for Natural Language Art Provenance Searches in the Getty Provenance Index
This research presents a Retrieval-Augmented Generation (RAG) framework for art provenance studies, focusing on the Getty Provenance Index. Provenance research establishes the ownership history of artworks, which is essential for verifying authenticity, supporting restitution and legal claims, and understanding the cultural and historical context of art objects. The process is complicated by fragmented, multilingual archival data that hinders efficient retrieval. Current search portals require precise metadata, limiting exploratory searches. Our method enables natural-language and multilingual searches through semantic retrieval and contextual summarization, reducing dependence on metadata structures. We assess RAG's capability to retrieve and summarize auction records using a 10,000-record sample from the Getty Provenance Index - German Sales. The results show this approach provides a scalable solution for navigating art market archives, offering a practical tool for historians and cultural heritage professionals conducting historically sensitive research.
☆ It's All About In-Context Learning! Teaching Extremely Low-Resource Languages to LLMs EMNLP 2025
Extremely low-resource languages, especially those written in rare scripts, as shown in Figure 1, remain largely unsupported by large language models (LLMs). This is due in part to compounding factors such as the lack of training data. This paper delivers the first comprehensive analysis of whether LLMs can acquire such languages purely via in-context learning (ICL), with or without auxiliary alignment signals, and how these methods compare to parameter-efficient fine-tuning (PEFT). We systematically evaluate 20 under-represented languages across three state-of-the-art multilingual LLMs. Our findings highlight the limitation of PEFT when both language and its script are extremely under-represented by the LLM. In contrast, zero-shot ICL with language alignment is impressively effective on extremely low-resource languages, while few-shot ICL or PEFT is more beneficial for languages relatively better represented by LLMs. For LLM practitioners working on extremely low-resource languages, we summarise guidelines grounded by our results on adapting LLMs to low-resource languages, e.g., avoiding fine-tuning a multilingual model on languages of unseen scripts.
comment: Accepted by EMNLP 2025
☆ "Where does it hurt?" -- Dataset and Study on Physician Intent Trajectories in Doctor Patient Dialogues ECAI 2025
In a doctor-patient dialogue, the primary objective of physicians is to diagnose patients and propose a treatment plan. Medical doctors guide these conversations through targeted questioning to efficiently gather the information required to provide the best possible outcomes for patients. To the best of our knowledge, this is the first work that studies physician intent trajectories in doctor-patient dialogues. We use the `Ambient Clinical Intelligence Benchmark' (Aci-bench) dataset for our study. We collaborate with medical professionals to develop a fine-grained taxonomy of physician intents based on the SOAP framework (Subjective, Objective, Assessment, and Plan). We then conduct a large-scale annotation effort to label over 5000 doctor-patient turns with the help of a large number of medical experts recruited using Prolific, a popular crowd-sourcing platform. This large labeled dataset is an important resource contribution that we use for benchmarking the state-of-the-art generative and encoder models for medical intent classification tasks. Our findings show that our models understand the general structure of medical dialogues with high accuracy, but often fail to identify transitions between SOAP categories. We also report for the first time common trajectories in medical dialogue structures that provide valuable insights for designing `differential diagnosis' systems. Finally, we extensively study the impact of intent filtering for medical dialogue summarization and observe a significant boost in performance. We make the codes and data, including annotation guidelines, publicly available at https://github.com/DATEXIS/medical-intent-classification.
comment: Accepted at ECAI 2025
☆ HiPlan: Hierarchical Planning for LLM-Based Agents with Adaptive Global-Local Guidance
Large language model (LLM)-based agents have demonstrated remarkable capabilities in decision-making tasks, but struggle significantly with complex, long-horizon planning scenarios. This arises from their lack of macroscopic guidance, causing disorientation and failures in complex tasks, as well as insufficient continuous oversight during execution, rendering them unresponsive to environmental changes and prone to deviations. To tackle these challenges, we introduce HiPlan, a hierarchical planning framework that provides adaptive global-local guidance to boost LLM-based agents'decision-making. HiPlan decomposes complex tasks into milestone action guides for general direction and step-wise hints for detailed actions. During the offline phase, we construct a milestone library from expert demonstrations, enabling structured experience reuse by retrieving semantically similar tasks and milestones. In the execution phase, trajectory segments from past milestones are dynamically adapted to generate step-wise hints that align current observations with the milestone objectives, bridging gaps and correcting deviations. Extensive experiments across two challenging benchmarks demonstrate that HiPlan substantially outperforms strong baselines, and ablation studies validate the complementary benefits of its hierarchical components.
☆ MovieCORE: COgnitive REasoning in Movies EMNLP'2025
This paper introduces MovieCORE, a novel video question answering (VQA) dataset designed to probe deeper cognitive understanding of movie content. Unlike existing datasets that focus on surface-level comprehension, MovieCORE emphasizes questions that engage System-2 thinking while remaining specific to the video material. We present an innovative agentic brainstorming approach, utilizing multiple large language models (LLMs) as thought agents to generate and refine high-quality question-answer pairs. To evaluate dataset quality, we develop a set of cognitive tests assessing depth, thought-provocation potential, and syntactic complexity. We also propose a comprehensive evaluation scheme for assessing VQA model performance on deeper cognitive tasks. To address the limitations of existing video-language models (VLMs), we introduce an agentic enhancement module, Agentic Choice Enhancement (ACE), which improves model reasoning capabilities post-training by up to 25%. Our work contributes to advancing movie understanding in AI systems and provides valuable insights into the capabilities and limitations of current VQA models when faced with more challenging, nuanced questions about cinematic content. Our project page, dataset and code can be found at https://joslefaure.github.io/assets/html/moviecore.html.
comment: Accepted for EMNLP'2025 Main Conference. Project Page: https://joslefaure.github.io/assets/html/moviecore.html
☆ Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark
As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to creating open-ended agents that learn continuously. In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through continuous, self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as "second nature". We also introduce StuLife, a benchmark dataset for ELL that simulates a student's holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm shifts: From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables. StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and self-motivated behavior. Beyond evaluating SOTA LLMs on the StuLife benchmark, we also explore the role of context engineering in advancing AGI.
☆ Automatic Prompt Optimization with Prompt Distillation
Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt -- a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.
☆ Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models
We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model's representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.
comment: 25 pages, 9 figures. The AI Intuition Explorer dashboard is available at: https://cyrilliu1974.github.io/github.io/vi.html
☆ The Double-edged Sword of LLM-based Data Reconstruction: Understanding and Mitigating Contextual Vulnerability in Word-level Differential Privacy Text Sanitization CCS 2025
Differentially private text sanitization refers to the process of privatizing texts under the framework of Differential Privacy (DP), providing provable privacy guarantees while also empirically defending against adversaries seeking to harm privacy. Despite their simplicity, DP text sanitization methods operating at the word level exhibit a number of shortcomings, among them the tendency to leave contextual clues from the original texts due to randomization during sanitization $\unicode{x2013}$ this we refer to as $\textit{contextual vulnerability}$. Given the powerful contextual understanding and inference capabilities of Large Language Models (LLMs), we explore to what extent LLMs can be leveraged to exploit the contextual vulnerability of DP-sanitized texts. We expand on previous work not only in the use of advanced LLMs, but also in testing a broader range of sanitization mechanisms at various privacy levels. Our experiments uncover a double-edged sword effect of LLM-based data reconstruction attacks on privacy and utility: while LLMs can indeed infer original semantics and sometimes degrade empirical privacy protections, they can also be used for good, to improve the quality and privacy of DP-sanitized texts. Based on our findings, we propose recommendations for using LLM data reconstruction as a post-processing step, serving to increase privacy protection by thinking adversarially.
comment: 15 pages, 4 figures, 8 tables. Accepted to WPES @ CCS 2025
☆ Diverse And Private Synthetic Datasets Generation for RAG evaluation: A multi-agent framework ECAI 2025
Retrieval-augmented generation (RAG) systems improve large language model outputs by incorporating external knowledge, enabling more informed and context-aware responses. However, the effectiveness and trustworthiness of these systems critically depends on how they are evaluated, particularly on whether the evaluation process captures real-world constraints like protecting sensitive information. While current evaluation efforts for RAG systems have primarily focused on the development of performance metrics, far less attention has been given to the design and quality of the underlying evaluation datasets, despite their pivotal role in enabling meaningful, reliable assessments. In this work, we introduce a novel multi-agent framework for generating synthetic QA datasets for RAG evaluation that prioritize semantic diversity and privacy preservation. Our approach involves: (1) a Diversity agent leveraging clustering techniques to maximize topical coverage and semantic variability, (2) a Privacy Agent that detects and mask sensitive information across multiple domains and (3) a QA curation agent that synthesizes private and diverse QA pairs suitable as ground truth for RAG evaluation. Extensive experiments demonstrate that our evaluation sets outperform baseline methods in diversity and achieve robust privacy masking on domain-specific datasets. This work offers a practical and ethically aligned pathway toward safer, more comprehensive RAG system evaluation, laying the foundation for future enhancements aligned with evolving AI regulations and compliance standards.
comment: ECAI 2025 TRUST AI workshop
☆ Affective Polarization across European Parliaments
Affective polarization, characterized by increased negativity and hostility towards opposing groups, has become a prominent feature of political discourse worldwide. Our study examines the presence of this type of polarization in a selection of European parliaments in a fully automated manner. Utilizing a comprehensive corpus of parliamentary speeches from the parliaments of six European countries, we employ natural language processing techniques to estimate parliamentarian sentiment. By comparing the levels of negativity conveyed in references to individuals from opposing groups versus one's own, we discover patterns of affectively polarized interactions. The findings demonstrate the existence of consistent affective polarization across all six European parliaments. Although activity correlates with negativity, there is no observed difference in affective polarization between less active and more active members of parliament. Finally, we show that reciprocity is a contributing mechanism in affective polarization between parliamentarians across all six parliaments.
comment: 6 pages, 4 figures
☆ Empowering Computing Education Researchers Through LLM-Assisted Content Analysis
Computing education research (CER) is often instigated by practitioners wanting to improve both their own and the wider discipline's teaching practice. However, the latter is often difficult as many researchers lack the colleagues, resources, or capacity to conduct research that is generalisable or rigorous enough to advance the discipline. As a result, research methods that enable sense-making with larger volumes of qualitative data, while not increasing the burden on the researcher, have significant potential within CER. In this discussion paper, we propose such a method for conducting rigorous analysis on large volumes of textual data, namely a variation of LLM-assisted content analysis (LACA). This method combines content analysis with the use of large language models, empowering researchers to conduct larger-scale research which they would otherwise not be able to perform. Using a computing education dataset, we illustrate how LACA could be applied in a reproducible and rigorous manner. We believe this method has potential in CER, enabling more generalisable findings from a wider range of research. This, together with the development of similar methods, can help to advance both the practice and research quality of the CER discipline.
comment: 7 pages, 2 figures
☆ ReflectivePrompt: Reflective evolution in autoprompting algorithms
Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt - a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.
☆ ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare, where accurate expressions of uncertainty are essential for reliability and trust. However, current LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence". Recent efforts have focused on calibrating LLMs' verbalized confidence: i.e., their expressions of confidence in text form, such as "I am 80% confident that...". Existing approaches either rely on prompt engineering or fine-tuning with heuristically generated uncertainty estimates, both of which have limited effectiveness and generalizability. Motivated by the notion of proper scoring rules for calibration in classical machine learning models, we introduce ConfTuner, a simple and efficient fine-tuning method that introduces minimal overhead and does not require ground-truth confidence scores or proxy confidence estimates. ConfTuner relies on a new loss function, tokenized Brier score, which we theoretically prove to be a proper scoring rule, intuitively meaning that it "correctly incentivizes the model to report its true probability of being correct". ConfTuner improves calibration across diverse reasoning tasks and generalizes to black-box models such as GPT-4o. Our results further show that better-calibrated confidence enables downstream gains in self-correction and model cascade, advancing the development of trustworthy LLM systems. The code is available at https://github.com/liushiliushi/ConfTuner.
☆ Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness
Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose a novel program-assisted synthesis framework that systematically generates a high-quality mathematical corpus with guaranteed diversity, complexity, and correctness. This framework integrates mathematical knowledge systems and domain-specific tools to create executable programs. These programs are then translated into natural language problem-solution pairs and vetted by a bilateral validation mechanism that verifies solution correctness against program outputs and ensures program-problem consistency. We have generated 12.3 million such problem-solving triples. Experiments demonstrate that models fine-tuned on our data significantly improve their inference capabilities, achieving state-of-the-art performance on several benchmark datasets and showcasing the effectiveness of our synthesis approach.
☆ LLM-based Contrastive Self-Supervised AMR Learning with Masked Graph Autoencoders for Fake News Detection
The proliferation of misinformation in the digital age has led to significant societal challenges. Existing approaches often struggle with capturing long-range dependencies, complex semantic relations, and the social dynamics influencing news dissemination. Furthermore, these methods require extensive labelled datasets, making their deployment resource-intensive. In this study, we propose a novel self-supervised misinformation detection framework that integrates both complex semantic relations using Abstract Meaning Representation (AMR) and news propagation dynamics. We introduce an LLM-based graph contrastive loss (LGCL) that utilizes negative anchor points generated by a Large Language Model (LLM) to enhance feature separability in a zero-shot manner. To incorporate social context, we employ a multi view graph masked autoencoder, which learns news propagation features from social context graph. By combining these semantic and propagation-based features, our approach effectively differentiates between fake and real news in a self-supervised manner. Extensive experiments demonstrate that our self-supervised framework achieves superior performance compared to other state-of-the-art methodologies, even with limited labelled datasets while improving generalizability.
☆ LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination
Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. Experimental results demonstrate that LaTeXTrans can outperform mainstream MT systems in both translation accuracy and structural fidelity, offering an effective and practical solution for translating LaTeX-formatted documents.
☆ Controllable Conversational Theme Detection Track at DSTC 12
Conversational analytics has been on the forefront of transformation driven by the advances in Speech and Natural Language Processing techniques. Rapid adoption of Large Language Models (LLMs) in the analytics field has taken the problems that can be automated to a new level of complexity and scale. In this paper, we introduce Theme Detection as a critical task in conversational analytics, aimed at automatically identifying and categorizing topics within conversations. This process can significantly reduce the manual effort involved in analyzing expansive dialogs, particularly in domains like customer support or sales. Unlike traditional dialog intent detection, which often relies on a fixed set of intents for downstream system logic, themes are intended as a direct, user-facing summary of the conversation's core inquiry. This distinction allows for greater flexibility in theme surface forms and user-specific customizations. We pose Controllable Conversational Theme Detection problem as a public competition track at Dialog System Technology Challenge (DSTC) 12 -- it is framed as joint clustering and theme labeling of dialog utterances, with the distinctive aspect being controllability of the resulting theme clusters' granularity achieved via the provided user preference data. We give an overview of the problem, the associated dataset and the evaluation metrics, both automatic and human. Finally, we discuss the participant teams' submissions and provide insights from those. The track materials (data and code) are openly available in the GitHub repository.
comment: DSTC12@SigDial2025; data and code available at https://github.com/amazon-science/dstc12-controllable-conversational-theme-detection
☆ Harnessing Rule-Based Reinforcement Learning for Enhanced Grammatical Error Correction
Grammatical error correction is a significant task in NLP. Traditional methods based on encoder-decoder models have achieved certain success, but the application of LLMs in this field is still underexplored. Current research predominantly relies on supervised fine-tuning to train LLMs to directly generate the corrected sentence, which limits the model's powerful reasoning ability. To address this limitation, we propose a novel framework based on Rule-Based RL. Through experiments on the Chinese datasets, our Rule-Based RL framework achieves \textbf{state-of-the-art }performance, with a notable increase in \textbf{recall}. This result clearly highlights the advantages of using RL to steer LLMs, offering a more controllable and reliable paradigm for future development in GEC.
comment: Code will be released upon publication
☆ ThinkDial: An Open Recipe for Controlling Reasoning Effort in Large Language Models
Large language models (LLMs) with chain-of-thought reasoning have demonstrated remarkable problem-solving capabilities, but controlling their computational effort remains a significant challenge for practical deployment. Recent proprietary systems like OpenAI's gpt-oss series have introduced discrete operational modes for intuitive reasoning control, but the open-source community has largely failed to achieve such capabilities. In this paper, we introduce ThinkDial, the first open-recipe end-to-end framework that successfully implements gpt-oss-style controllable reasoning through discrete operational modes. Our system enables seamless switching between three distinct reasoning regimes: High mode (full reasoning capability), Medium mode (50 percent token reduction with <10 percent performance degradation), and Low mode (75 percent token reduction with <15 percent performance degradation). We achieve this through an end-to-end training paradigm that integrates budget-mode control throughout the entire pipeline: budget-mode supervised fine-tuning that embeds controllable reasoning capabilities directly into the learning process, and two-phase budget-aware reinforcement learning with adaptive reward shaping. Extensive experiments demonstrate that ThinkDial achieves target compression-performance trade-offs with clear response length reductions while maintaining performance thresholds. The framework also exhibits strong generalization capabilities on out-of-distribution tasks.
☆ Beyond the Textual: Generating Coherent Visual Options for MCQs EMNLP 2025
Multiple-choice questions (MCQs) play a crucial role in fostering deep thinking and knowledge integration in education. However, previous research has primarily focused on generating MCQs with textual options, but it largely overlooks the visual options. Moreover, generating high-quality distractors remains a major challenge due to the high cost and limited scalability of manual authoring. To tackle these problems, we propose a Cross-modal Options Synthesis (CmOS), a novel framework for generating educational MCQs with visual options. Our framework integrates Multimodal Chain-of-Thought (MCoT) reasoning process and Retrieval-Augmented Generation (RAG) to produce semantically plausible and visually similar answer and distractors. It also includes a discrimination module to identify content suitable for visual options. Experimental results on test tasks demonstrate the superiority of CmOS in content discrimination, question generation and visual option generation over existing methods across various subjects and educational levels.
comment: EMNLP 2025
☆ Answering the Unanswerable Is to Err Knowingly: Analyzing and Mitigating Abstention Failures in Large Reasoning Models
Large reasoning models (LRMs) have shown remarkable progress on complex reasoning tasks. However, some questions posed to LRMs are inherently unanswerable, such as math problems lacking sufficient conditions. We find that LRMs continually fail to provide appropriate abstentions when confronted with these unanswerable questions. In this paper, we systematically analyze, investigate, and resolve this issue for trustworthy AI. We first conduct a detailed analysis of the distinct response behaviors of LRMs when facing unanswerable questions. Then, we show that LRMs possess sufficient cognitive capabilities to recognize the flaws in these questions. However, they fail to exhibit appropriate abstention behavior, revealing a misalignment between their internal cognition and external response. Finally, to resolve this issue, we propose a lightweight, two-stage method that combines cognitive monitoring with inference-time intervention. Experimental results demonstrate that our method significantly improves the abstention rate while maintaining the overall reasoning performance.
☆ Text to Query Plans for Question Answering on Large Tables
Efficient querying and analysis of large tabular datasets remain significant challenges, especially for users without expertise in programming languages like SQL. Text-to-SQL approaches have shown promising performance on benchmark data; however, they inherit SQL's drawbacks, including inefficiency with large datasets and limited support for complex data analyses beyond basic querying. We propose a novel framework that transforms natural language queries into query plans. Our solution is implemented outside traditional databases, allowing us to support classical SQL commands while avoiding SQL's inherent limitations. Additionally, we enable complex analytical functions, such as principal component analysis and anomaly detection, providing greater flexibility and extensibility than traditional SQL capabilities. We leverage LLMs to iteratively interpret queries and construct operation sequences, addressing computational complexity by incrementally building solutions. By executing operations directly on the data, we overcome context length limitations without requiring the entire dataset to be processed by the model. We validate our framework through experiments on both standard databases and large scientific tables, demonstrating its effectiveness in handling extensive datasets and performing sophisticated data analyses.
☆ Chronological Passage Assembling in RAG framework for Temporal Question Answering
Long-context question answering over narrative tasks is challenging because correct answers often hinge on reconstructing a coherent timeline of events while preserving contextual flow in a limited context window. Retrieval-augmented generation (RAG) indexing methods aim to address this challenge by selectively retrieving only necessary document segments. However, narrative texts possess unique characteristics that limit the effectiveness of these existing approaches. Specifically, understanding narrative texts requires more than isolated segments, as the broader context and sequential relationships between segments are crucial for comprehension. To address these limitations, we propose ChronoRAG, a novel RAG framework specialized for narrative texts. This approach focuses on two essential aspects: refining dispersed document information into coherent and structured passages, and preserving narrative flow by explicitly capturing and maintaining the temporal order among retrieved passages. We empirically demonstrate the effectiveness of ChronoRAG through experiments on the NarrativeQA dataset, showing substantial improvements in tasks requiring both factual identification and comprehension of complex sequential relationships, underscoring that reasoning over temporal order is crucial in resolving narrative QA.
comment: 7 pages, 3 figures
☆ CAC-CoT: Connector-Aware Compact Chain-of-Thought for Efficient Reasoning Data Synthesis Across Dual-System Cognitive Tasks EMNLP 2025
Long chain-of-thought (CoT) prompting helps Large Language Models (LLMs) solve difficult problems, but very long traces often slow or even degrade performance on fast, intuitive "System-1" tasks. We introduce Connector-Aware Compact CoT (CAC-CoT) -- a method that deliberately restricts reasoning to a small, fixed set of connector phrases, steering the model toward concise and well -- structured explanations. Despite its simplicity, our synthetic method with Gemini-2.0-Flash yields a high-quality training quality. CAC-CoT achieves approximately 85% on GSM8K and approximately 40% on GPQA (System-2) while retaining approximately 90% on S1-Bench (System-1). Its reasoning traces average approximately 300 tokens(ART), about one-third the length of baseline traces, delivering higher efficiency without loss of accuracy.
comment: Accepted at EMNLP 2025 findings
☆ M3HG: Multimodal, Multi-scale, and Multi-type Node Heterogeneous Graph for Emotion Cause Triplet Extraction in Conversations ACL 2025
Emotion Cause Triplet Extraction in Multimodal Conversations (MECTEC) has recently gained significant attention in social media analysis, aiming to extract emotion utterances, cause utterances, and emotion categories simultaneously. However, the scarcity of related datasets, with only one published dataset featuring highly uniform dialogue scenarios, hinders model development in this field. To address this, we introduce MECAD, the first multimodal, multi-scenario MECTEC dataset, comprising 989 conversations from 56 TV series spanning a wide range of dialogue contexts. In addition, existing MECTEC methods fail to explicitly model emotional and causal contexts and neglect the fusion of semantic information at different levels, leading to performance degradation. In this paper, we propose M3HG, a novel model that explicitly captures emotional and causal contexts and effectively fuses contextual information at both inter- and intra-utterance levels via a multimodal heterogeneous graph. Extensive experiments demonstrate the effectiveness of M3HG compared with existing state-of-the-art methods. The codes and dataset are available at https://github.com/redifinition/M3HG.
comment: 16 pages, 8 figures. Accepted to Findings of ACL 2025
☆ Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models
The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%.
☆ Bias Mitigation Agent: Optimizing Source Selection for Fair and Balanced Knowledge Retrieval KDD'2025
Large Language Models (LLMs) have transformed the field of artificial intelligence by unlocking the era of generative applications. Built on top of generative AI capabilities, Agentic AI represents a major shift toward autonomous, goal-driven systems that can reason, retrieve, and act. However, they also inherit the bias present in both internal and external information sources. This significantly affects the fairness and balance of retrieved information, and hence reduces user trust. To address this critical challenge, we introduce a novel Bias Mitigation Agent, a multi-agent system designed to orchestrate the workflow of bias mitigation through specialized agents that optimize the selection of sources to ensure that the retrieved content is both highly relevant and minimally biased to promote fair and balanced knowledge dissemination. The experimental results demonstrate an 81.82\% reduction in bias compared to a baseline naive retrieval strategy.
comment: Accepted at KDD'2025 Agent4IR workshop
☆ EMMM, Explain Me My Model! Explainable Machine Generated Text Detection in Dialogues
The rapid adoption of large language models (LLMs) in customer service introduces new risks, as malicious actors can exploit them to conduct large-scale user impersonation through machine-generated text (MGT). Current MGT detection methods often struggle in online conversational settings, reducing the reliability and interpretability essential for trustworthy AI deployment. In customer service scenarios where operators are typically non-expert users, explanation become crucial for trustworthy MGT detection. In this paper, we propose EMMM, an explanation-then-detection framework that balances latency, accuracy, and non-expert-oriented interpretability. Experimental results demonstrate that EMMM provides explanations accessible to non-expert users, with 70\% of human evaluators preferring its outputs, while achieving competitive accuracy compared to state-of-the-art models and maintaining low latency, generating outputs within 1 second. Our code and dataset are open-sourced at https://github.com/AngieYYF/EMMM-explainable-chatbot-detection.
comment: 15 pages
☆ Filtering for Creativity: Adaptive Prompting for Multilingual Riddle Generation in LLMs
Multilingual riddle generation challenges large language models (LLMs) to balance cultural fluency with creative abstraction. Standard prompting strategies -- zero-shot, few-shot, chain-of-thought -- tend to reuse memorized riddles or perform shallow paraphrasing. We introduce Adaptive Originality Filtering (AOF), a prompting framework that filters redundant generations using cosine-based similarity rejection, while enforcing lexical novelty and cross-lingual fidelity. Evaluated across three LLMs and four language pairs, AOF-enhanced GPT-4o achieves \texttt{0.177} Self-BLEU and \texttt{0.915} Distinct-2 in Japanese, signaling improved lexical diversity and reduced redundancy compared to other prompting methods and language pairs. Our findings show that semantic rejection can guide culturally grounded, creative generation without task-specific fine-tuning.
☆ Attention2Probability: Attention-Driven Terminology Probability Estimation for Robust Speech-to-Text System
Recent advances in speech large language models (SLMs) have improved speech recognition and translation in general domains, but accurately generating domain-specific terms or neologisms remains challenging. To address this, we propose Attention2Probability: attention-driven terminology probability estimation for robust speech-to-text system, which is lightweight, flexible, and accurate. Attention2Probability converts cross-attention weights between speech and terminology into presence probabilities, and it further employs curriculum learning to enhance retrieval accuracy. Furthermore, to tackle the lack of data for speech-to-text tasks with terminology intervention, we create and release a new speech dataset with terminology to support future research in this area. Experimental results show that Attention2Probability significantly outperforms the VectorDB method on our test set. Specifically, its maximum recall rates reach 92.57% for Chinese and 86.83% for English. This high recall is achieved with a latency of only 8.71ms per query. Intervening in SLMs' recognition and translation tasks using Attention2Probability-retrieved terms improves terminology accuracy by 6-17%, while revealing that the current utilization of terminology by SLMs has limitations.
comment: 9 pages, 4 figures, 5 tables
☆ Knowing or Guessing? Robust Medical Visual Question Answering via Joint Consistency and Contrastive Learning
In high-stakes medical applications, consistent answering across diverse question phrasings is essential for reliable diagnosis. However, we reveal that current Medical Vision-Language Models (Med-VLMs) exhibit concerning fragility in Medical Visual Question Answering, as their answers fluctuate significantly when faced with semantically equivalent rephrasings of medical questions. We attribute this to two limitations: (1) insufficient alignment of medical concepts, leading to divergent reasoning patterns, and (2) hidden biases in training data that prioritize syntactic shortcuts over semantic understanding. To address these challenges, we construct RoMed, a dataset built upon original VQA datasets containing 144k questions with variations spanning word-level, sentence-level, and semantic-level perturbations. When evaluating state-of-the-art (SOTA) models like LLaVA-Med on RoMed, we observe alarming performance drops (e.g., a 40\% decline in Recall) compared to original VQA benchmarks, exposing critical robustness gaps. To bridge this gap, we propose Consistency and Contrastive Learning (CCL), which integrates two key components: (1) knowledge-anchored consistency learning, aligning Med-VLMs with medical knowledge rather than shallow feature patterns, and (2) bias-aware contrastive learning, mitigating data-specific priors through discriminative representation refinement. CCL achieves SOTA performance on three popular VQA benchmarks and notably improves answer consistency by 50\% on the challenging RoMed test set, demonstrating significantly enhanced robustness. Code will be released.
☆ FALCON: Autonomous Cyber Threat Intelligence Mining with LLMs for IDS Rule Generation
Signature-based Intrusion Detection Systems (IDS) detect malicious activities by matching network or host activity against predefined rules. These rules are derived from extensive Cyber Threat Intelligence (CTI), which includes attack signatures and behavioral patterns obtained through automated tools and manual threat analysis, such as sandboxing. The CTI is then transformed into actionable rules for the IDS engine, enabling real-time detection and prevention. However, the constant evolution of cyber threats necessitates frequent rule updates, which delay deployment time and weaken overall security readiness. Recent advancements in agentic systems powered by Large Language Models (LLMs) offer the potential for autonomous IDS rule generation with internal evaluation. We introduce FALCON, an autonomous agentic framework that generates deployable IDS rules from CTI data in real-time and evaluates them using built-in multi-phased validators. To demonstrate versatility, we target both network (Snort) and host-based (YARA) mediums and construct a comprehensive dataset of IDS rules with their corresponding CTIs. Our evaluations indicate FALCON excels in automatic rule generation, with an average of 95% accuracy validated by qualitative evaluation with 84% inter-rater agreement among multiple cybersecurity analysts across all metrics. These results underscore the feasibility and effectiveness of LLM-driven data mining for real-time cyber threat mitigation.
comment: 11 pages, 5 figures, 4 tables
☆ Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum
The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of "tailored teaching with balanced difficulty". We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model's current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
☆ Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks ICML
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
comment: Presented at the Second AI for Math Workshop at ICML
☆ Membership Inference Attacks on LLM-based Recommender Systems
Large language models (LLMs) based Recommender Systems (RecSys) can flexibly adapt recommendation systems to different domains. It utilizes in-context learning (ICL), i.e., the prompts, to customize the recommendation functions, which include sensitive historical user-specific item interactions, e.g., implicit feedback like clicked items or explicit product reviews. Such private information may be exposed to novel privacy attack. However, no study has been done on this important issue. We design four membership inference attacks (MIAs), aiming to reveal whether victims' historical interactions have been used by system prompts. They are \emph{direct inquiry, hallucination, similarity, and poisoning attacks}, each of which utilizes the unique features of LLMs or RecSys. We have carefully evaluated them on three LLMs that have been used to develop ICL-LLM RecSys and two well-known RecSys benchmark datasets. The results confirm that the MIA threat on LLM RecSys is realistic: direct inquiry and poisoning attacks showing significantly high attack advantages. We have also analyzed the factors affecting these attacks, such as the number of shots in system prompts and the position of the victim in the shots.
☆ Emotion Omni: Enabling Empathetic Speech Response Generation through Large Language Models ICASSP 2026
With the development of speech large language models (speech LLMs), users can now interact directly with assistants via speech. However, most existing models simply convert the response content into speech without fully understanding the rich emotional and paralinguistic cues embedded in the user's query. In many cases, the same sentence can have different meanings depending on the emotional expression. Furthermore, emotional understanding is essential for improving user experience in human-machine interaction. Currently, most speech LLMs with empathetic capabilities are trained on massive datasets. This approach requires vast amounts of data and significant computational resources. Therefore, a key challenge lies in how to develop a speech LLM capable of generating empathetic responses with limited data and without the need for large-scale training. To address this challenge, we propose Emotion Omni, a novel model architecture designed to understand the emotional content of user speech input and generate empathetic speech responses. Additionally, we developed a data generation pipeline based on an open-source TTS framework to construct a 200k emotional dialogue dataset, which supports the construction of an empathetic speech assistant. The demos are available at https://w311411.github.io/omni_demo/
comment: 5 pages, 1 figure, submitted to ICASSP 2026
☆ UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) systems are widely deployed in real-world applications in diverse domains such as finance, healthcare, and cybersecurity. However, many studies showed that they are vulnerable to knowledge corruption attacks, where an attacker can inject adversarial texts into the knowledge database of a RAG system to induce the LLM to generate attacker-desired outputs. Existing studies mainly focus on attacking specific queries or queries with similar topics (or keywords). In this work, we propose UniC-RAG, a universal knowledge corruption attack against RAG systems. Unlike prior work, UniC-RAG jointly optimizes a small number of adversarial texts that can simultaneously attack a large number of user queries with diverse topics and domains, enabling an attacker to achieve various malicious objectives, such as directing users to malicious websites, triggering harmful command execution, or launching denial-of-service attacks. We formulate UniC-RAG as an optimization problem and further design an effective solution to solve it, including a balanced similarity-based clustering method to enhance the attack's effectiveness. Our extensive evaluations demonstrate that UniC-RAG is highly effective and significantly outperforms baselines. For instance, UniC-RAG could achieve over 90% attack success rate by injecting 100 adversarial texts into a knowledge database with millions of texts to simultaneously attack a large set of user queries (e.g., 2,000). Additionally, we evaluate existing defenses and show that they are insufficient to defend against UniC-RAG, highlighting the need for new defense mechanisms in RAG systems.
comment: 21 pages, 4 figures
☆ Breaking the Trade-Off Between Faithfulness and Expressiveness for Large Language Models
Grounding responses in external knowledge represents an effective strategy for mitigating hallucinations in Large Language Models (LLMs). However, current LLMs struggle to seamlessly integrate knowledge while simultaneously maintaining faithfulness (or fidelity) and expressiveness, capabilities that humans naturally possess. This limitation results in outputs that either lack support from external knowledge, thereby compromising faithfulness, or appear overly verbose and unnatural, thus sacrificing expressiveness. In this work, to break the trade-off between faithfulness and expressiveness, we propose Collaborative Decoding (CoDe), a novel approach that dynamically integrates output probabilities generated with and without external knowledge. This integration is guided by distribution divergence and model confidence, enabling the selective activation of relevant and reliable expressions from the model's internal parameters. Furthermore, we introduce a knowledge-aware reranking mechanism that prevents over-reliance on prior parametric knowledge while ensuring proper utilization of provided external information. Through comprehensive experiments, our plug-and-play CoDe framework demonstrates superior performance in enhancing faithfulness without compromising expressiveness across diverse LLMs and evaluation metrics, validating both its effectiveness and generalizability.
☆ Thinking Before You Speak: A Proactive Test-time Scaling Approach
Large Language Models (LLMs) often exhibit deficiencies with complex reasoning tasks, such as maths, which we attribute to the discrepancy between human reasoning patterns and those presented in the LLMs' training data. When dealing with complex problems, humans tend to think carefully before expressing solutions. However, they often do not articulate their inner thoughts, including their intentions and chosen methodologies. Consequently, critical insights essential for bridging reasoning steps may be absent in training data collected from human sources. To bridge this gap, we proposes inserting \emph{insight}s between consecutive reasoning steps, which review the status and initiate the next reasoning steps. Unlike prior prompting strategies that rely on a single or a workflow of static prompts to facilitate reasoning, \emph{insight}s are \emph{proactively} generated to guide reasoning processes. We implement our idea as a reasoning framework, named \emph{Thinking Before You Speak} (TBYS), and design a pipeline for automatically collecting and filtering in-context examples for the generation of \emph{insight}s, which alleviates human labeling efforts and fine-tuning overheads. Experiments on challenging mathematical datasets verify the effectiveness of TBYS. Project website: https://gitee.com/jswrt/TBYS
☆ Beyond Benchmark: LLMs Evaluation with an Anthropomorphic and Value-oriented Roadmap
For Large Language Models (LLMs), a disconnect persists between benchmark performance and real-world utility. Current evaluation frameworks remain fragmented, prioritizing technical metrics while neglecting holistic assessment for deployment. This survey introduces an anthropomorphic evaluation paradigm through the lens of human intelligence, proposing a novel three-dimensional taxonomy: Intelligence Quotient (IQ)-General Intelligence for foundational capacity, Emotional Quotient (EQ)-Alignment Ability for value-based interactions, and Professional Quotient (PQ)-Professional Expertise for specialized proficiency. For practical value, we pioneer a Value-oriented Evaluation (VQ) framework assessing economic viability, social impact, ethical alignment, and environmental sustainability. Our modular architecture integrates six components with an implementation roadmap. Through analysis of 200+ benchmarks, we identify key challenges including dynamic assessment needs and interpretability gaps. It provides actionable guidance for developing LLMs that are technically proficient, contextually relevant, and ethically sound. We maintain a curated repository of open-source evaluation resources at: https://github.com/onejune2018/Awesome-LLM-Eval.
comment: Preprint. Under review
☆ RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing
Large language models are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing reinforcement learning methods struggle to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints get negative advantage in GRPO and thus penalized during training, which is the key innovation of this proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36\% to 86.65\%) and writing quality (72.75\% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.
☆ Scaling Laws for Task-Stratified Knowledge in Post-Training Quantized Large Language Models
Large language models (LLMs) present significant deployment challenges due to their scale, with post-training quantization (PTQ) emerging as a practical compression solution. However, a comprehensive understanding of how PTQ precisely impacts diverse LLM knowledge capabilities remains elusive, and existing scaling laws for quantized models often overlook crucial PTQ-specific parameters and task-specific sensitivities. This paper addresses these gaps by conducting an extensive empirical investigation to establish task-stratified scaling laws. We disentangle LLM knowledge into memorization and utilization capabilities and develop a unified quantitative framework that incorporates model size, effective bit-width, calibration set size, and group size. Our central finding reveals that knowledge memorization exhibits markedly greater sensitivity to variations in effective bit-width, calibration set size, and model size compared to the more robust knowledge utilization. These findings offer a fine-grained understanding of PTQ's impact and provide guidance for developing knowledge-aware quantization strategies that can better preserve targeted cognitive functions.
☆ A New NMT Model for Translating Clinical Texts from English to Spanish ML4H
Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of a parallel-aligned corpus and the abundant unknown words contained. To address such challenges, we propose \textbf{NOOV} (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource, to alleviate both the unknown word problem and the word-repeat challenge in NMT, enhancing better phrase generation of NMT systems. Evaluation shows that NOOV is able to generate better translation of EHR with improvement in both accuracy and fluency.
comment: This work was accepted by the Machine Learning for Health (ML4H) Workshop at NeurIPS 2018
☆ What do language models model? Transformers, automata, and the format of thought
What do large language models actually model? Do they tell us something about human capacities, or are they models of the corpus we've trained them on? I give a non-deflationary defence of the latter position. Cognitive science tells us that linguistic capabilities in humans rely supralinear formats for computation. The transformer architecture, by contrast, supports at best a linear formats for processing. This argument will rely primarily on certain invariants of the computational architecture of transformers. I then suggest a positive story about what transformers are doing, focusing on Liu et al. (2022)'s intriguing speculations about shortcut automata. I conclude with why I don't think this is a terribly deflationary story. Language is not (just) a means for expressing inner state but also a kind of 'discourse machine' that lets us make new language given appropriate context. We have learned to use this technology in one way; LLMs have also learned to use it too, but via very different means.
☆ The Mind's Eye: A Multi-Faceted Reward Framework for Guiding Visual Metaphor Generation
Visual metaphor generation is a challenging task that aims to generate an image given an input text metaphor. Inherently, it needs language understanding to bind a source concept with a target concept, in a way that preserves meaning while ensuring visual coherence. We propose a self-evaluating visual metaphor generation framework that focuses on metaphor alignment. Our self-evaluation approach combines existing metrics with our newly proposed metaphor decomposition score and a meaning alignment (MA) metric. Within this setup, we explore two novel approaches: a training-free pipeline that explicitly decomposes prompts into source-target-meaning (S-T-M) mapping for image synthesis, and a complementary training-based pipeline that improves alignment using our proposed self-evaluation reward schema, without any large-scale retraining. On the held-out test set, the training-free approach surpasses strong closed baselines (GPT-4o, Imagen) on decomposition, CLIP, and MA scores, with the training-based approach close behind. We evaluate our framework output using a user-facing study, and observed that participants preferred GPT-4o overall, while our training-free pipeline led open-source methods and edged Imagen on abstract metaphors. Our analyses show S-T-M prompting helps longer or more abstract metaphors, with closed models excelling on short, concrete cases; we also observe sensitivity to sampler settings. Overall, structured prompting and lightweight RL perform metaphor alignment well under modest compute, and remaining gaps to human preference appear driven by aesthetics and sampling.
comment: Under Review
♻ ☆ From Intents to Conversations: Generating Intent-Driven Dialogues with Contrastive Learning for Multi-Turn Classification CIKM 2025
In conversational AI systems, a critical challenge in training effective multi-turn intent classification models lies in the generation of large-scale, domain-specific, multilingual dialogue datasets. In this paper, we introduce Chain-of-Intent, a novel framework that integrates Hidden Markov Models (HMMs) with Large Language Models (LLMs) to generate intent-driven, context-aware dialogues through self-play. Our method first extracts domain-specific intent transition patterns from real-world e-commerce chat logs, which guide the modeling of turn-level dynamics and intent sequences. LLMs are then employed to parameterize the emission probabilities of HMMs, enabling the generation of natural, coherent utterances aligned with predicted intents and dialogue context. We further propose MINT-CL, a multi-task contrastive learning framework for multi-turn intent classification, which improves performance while reducing dependence on large-scale annotated datasets. Empirical results demonstrate that our approach outperforms competitive baselines in both dialogue generation quality and classification accuracy, particularly in multilingual settings. To facilitate future research, we release MINT-E, a comprehensive, multilingual, intent-aware multi-turn dialogue corpus derived from the e-commerce domain. The reproduced source code and dataset are available at https://github.com/junhua/chain-of-intent.
comment: Accepted to Proceedings of CIKM 2025
♻ ☆ Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications
Large Language Models (LLMs) have significantly advanced natural language processing, demonstrating strong capabilities in tasks such as text generation, summarization, and reasoning. Recently, their potential for automating precise text editing tasks across specialized domains, such as programming code, LaTeX, and structured database languages, has gained attention. However, current state-of-the-art LLMs still struggle with executing precise, instruction-driven edits, particularly when structural accuracy and strict adherence to domain conventions are required. To address these challenges, we introduce InstrEditBench, an automated benchmark dataset comprising over 30,000 structured editing tasks spanning diverse domains, including Wikipedia articles, LaTeX documents, source code, and database languages. Using this benchmark, we develop FineEdit, a specialized editing model explicitly trained for accurate, context-aware text modifications. Experimental evaluations demonstrate that FineEdit outperforms state-of-the-art models, achieving improvements of approximately 10\% over Gemini models on single-turn edits, up to 30\% over Llama-3.2-3B, and exceeding Mistral-7B-OpenOrca performance by over 40\% on direct editing tasks. FineEdit also effectively generalizes to realistic multi-turn editing scenarios, highlighting its practical applicability. To facilitate further research and reproducibility, we release FineEdit at https://github.com/StuRinDQB/FineEdit} and https://huggingface.co/datasets/YimingZeng/FineEdit_bench.
♻ ☆ mRAG: Elucidating the Design Space of Multi-modal Retrieval-Augmented Generation
Large Vision-Language Models (LVLMs) have made remarkable strides in multimodal tasks such as visual question answering, visual grounding, and complex reasoning. However, they remain limited by static training data, susceptibility to hallucinations, and inability to verify claims against up-to-date, external evidence, compromising their performance in dynamic real-world applications. Retrieval-Augmented Generation (RAG) offers a practical solution to mitigate these challenges by allowing the LVLMs to access large-scale knowledge databases via retrieval mechanisms, thereby grounding model outputs in factual, contextually relevant information. Here in this paper, we conduct the first systematic dissection of the multimodal RAG pipeline for LVLMs, explicitly investigating (1) the retrieval phase: on the modality configurations and retrieval strategies, (2) the re-ranking stage: on strategies to mitigate positional biases and improve the relevance of retrieved evidence, and (3) the generation phase: we further investigate how to best integrate retrieved candidates into the final generation process. Finally, we extend to explore a unified agentic framework that integrates re-ranking and generation through self-reflection, enabling LVLMs to select relevant evidence and suppress irrelevant context dynamically. Our full-stack exploration of RAG for LVLMs yields substantial insights, resulting in an average performance boost of 5% without any fine-tuning.
comment: 16 pages
♻ ☆ TL-Training: A Task-Feature-Based Framework for Training Large Language Models in Tool Use EMNLP 2025
Large language models (LLMs) achieve remarkable advancements by leveraging tools to interact with environments, a critical step toward generalized AI. However, the standard supervised fine-tuning (SFT) approach, which relies on large-scale datasets, often overlooks task-specific characteristics in tool use, leading to performance bottlenecks. To address this issue, we analyze three existing LLMs and uncover key insights: training data can inadvertently impede tool-use behavior, token importance is distributed unevenly, and errors in tool calls fall into a small set of categories. Building on these findings, we propose~\emph{TL-Training}, a task-feature-based framework that mitigates the effects of suboptimal training data, dynamically adjusts token weights to prioritize key tokens during SFT, and incorporates a robust reward mechanism tailored to error categories, optimized through proximal policy optimization. We validate TL-Training by training CodeLLaMA-2-7B and evaluating it on four open-source test sets. Our results demonstrate that the LLM trained by our method matches or surpasses both open- and closed-source LLMs in tool-use performance using only 1,217 training data points. Additionally, our method enhances robustness in noisy environments and improves general task performance, offering a scalable and efficient paradigm for tool-use training in LLMs. Code and data are available at https://github.com/Junjie-Ye/TL-Training.
comment: Accepted by EMNLP 2025
♻ ☆ ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
While the biases of language models in production are extensively documented, the biases of their guardrails have been neglected. This paper studies how contextual information about the user influences the likelihood of an LLM to refuse to execute a request. By generating user biographies that offer ideological and demographic information, we find a number of biases in guardrail sensitivity on GPT-3.5. Younger, female, and Asian-American personas are more likely to trigger a refusal guardrail when requesting censored or illegal information. Guardrails are also sycophantic, refusing to comply with requests for a political position the user is likely to disagree with. We find that certain identity groups and seemingly innocuous information, e.g., sports fandom, can elicit changes in guardrail sensitivity similar to direct statements of political ideology. For each demographic category and even for American football team fandom, we find that ChatGPT appears to infer a likely political ideology and modify guardrail behavior accordingly.
♻ ☆ A Survey on Data Selection for LLM Instruction Tuning
Instruction tuning is a vital step of training large language models (LLMs), so how to enhance the effect of instruction tuning has received increased attention. Existing works indicate that the quality of the dataset is more crucial than the quantity during instruction tuning of LLMs. Therefore, recently a lot of studies focus on exploring the methods of selecting high-quality subset from instruction datasets, aiming to reduce training costs and enhance the instruction-following capabilities of LLMs. This paper presents a comprehensive survey on data selection for LLM instruction tuning. Firstly, we introduce the wildly used instruction datasets. Then, we propose a new taxonomy of the data selection methods and provide a detailed introduction of recent advances, and the evaluation strategies and results of data selection methods are also elaborated in detail. Finally, we emphasize the open challenges and present new frontiers of this task.
comment: Published in JAIR (Vol. 83, Article 32, 2025)
♻ ☆ An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach
Retrieval-Augmented Generation (RAG) systems in the legal domain face a critical challenge: standard, flat-text retrieval is blind to the hierarchical, diachronic, and causal structure of law, leading to anachronistic and unreliable answers. This paper introduces an ontology-driven Graph RAG framework designed to overcome these limitations. We ground our knowledge graph in a formal, LRMoo-inspired model that distinguishes abstract legal Works from their versioned Expressions. We model temporal states as efficient aggregations that reuse the versioned expressions (CTVs) of unchanged components, and we reify legislative events as first-class Action nodes to make causality explicit and queryable. This structured backbone enables a unified, planner-guided query strategy that applies explicit policies to deterministically resolve complex requests for (i) point-in-time retrieval, (ii) hierarchical impact analysis, and (iii) auditable provenance reconstruction. Through a case study on the Brazilian Constitution, we demonstrate how this approach provides a verifiable, temporally-correct substrate for LLMs, enabling higher-order analytical capabilities while drastically reducing the risk of factual errors. The result is a practical framework for building more trustworthy and explainable legal AI systems.
comment: This is a major revision that significantly expands and deepens the original manuscript. While the core ontological model remains the same, this version provides a substantially more rigorous and detailed account of how the framework is applied in practice, particularly within a Retrieval-Augmented Generation (RAG) context
♻ ☆ Exploring the Robustness of Language Models for Tabular Question Answering via Attention Analysis
Large Language Models (LLMs), already shown to ace various unstructured text comprehension tasks, have also remarkably been shown to tackle table (structured) comprehension tasks without specific training. Building on earlier studies of LLMs for tabular tasks, we probe how in-context learning (ICL), model scale, instruction tuning, and domain bias affect Tabular QA (TQA) robustness by testing LLMs, under diverse augmentations and perturbations, on diverse domains: Wikipedia-based $\textbf{WTQ}$, financial $\textbf{TAT-QA}$, and scientific $\textbf{SCITAB}$. Although instruction tuning and larger, newer LLMs deliver stronger, more robust TQA performance, data contamination and reliability issues, especially on $\textbf{WTQ}$, remain unresolved. Through an in-depth attention analysis, we reveal a strong correlation between perturbation-induced shifts in attention dispersion and the drops in performance, with sensitivity peaking in the model's middle layers. We highlight the need for improved interpretable methodologies to develop more reliable LLMs for table comprehension. Through an in-depth attention analysis, we reveal a strong correlation between perturbation-induced shifts in attention dispersion and performance drops, with sensitivity peaking in the model's middle layers. Based on these findings, we argue for the development of structure-aware self-attention mechanisms and domain-adaptive processing techniques to improve the transparency, generalization, and real-world reliability of LLMs on tabular data.
comment: Accepted TMLR 2025
♻ ☆ Label Set Optimization via Activation Distribution Kurtosis for Zero-shot Classification with Generative Models EMNLP 2025
In-context learning (ICL) performance is highly sensitive to prompt design, yet the impact of class label options (e.g. lexicon or order) in zero-shot classification remains underexplored. This study proposes LOADS (Label set Optimization via Activation Distribution kurtosiS), a post-hoc method for selecting optimal label sets in zero-shot ICL with large language models (LLMs). LOADS is built upon the observations in our empirical analysis, the first to systematically examine how label option design (i.e., lexical choice, order, and elaboration) impacts classification performance. This analysis shows that the lexical choice of the labels in the prompt (such as agree vs. support in stance classification) plays an important role in both model performance and model's sensitivity to the label order. A further investigation demonstrates that optimal label words tend to activate fewer outlier neurons in LLMs' feed-forward networks. LOADS then leverages kurtosis to measure the neuron activation distribution for label selection, requiring only a single forward pass without gradient propagation or labelled data. The LOADS-selected label words consistently demonstrate effectiveness for zero-shot ICL across classification tasks, datasets, models and languages, achieving maximum performance gain from 0.54 to 0.76 compared to the conventional approach of using original dataset label words.
comment: Accepted by EMNLP 2025
♻ ☆ SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
comment: 26 pages
♻ ☆ An Agentic System for Rare Disease Diagnosis with Traceable Reasoning
Rare diseases collectively affect over 300 million individuals worldwide, yet timely and accurate diagnosis remains a pervasive challenge. This is largely due to their clinical heterogeneity, low individual prevalence, and the limited familiarity most clinicians have with rare conditions. Here, we introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM), capable of processing heterogeneous clinical inputs. The system generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning that links intermediate analytic steps to verifiable medical evidence. DeepRare comprises three key components: a central host with a long-term memory module; specialized agent servers responsible for domain-specific analytical tasks integrating over 40 specialized tools and web-scale, up-to-date medical knowledge sources, ensuring access to the most current clinical information. This modular and scalable design enables complex diagnostic reasoning while maintaining traceability and adaptability. We evaluate DeepRare on eight datasets. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases. In HPO-based evaluations, DeepRare significantly outperforms other 15 methods, like traditional bioinformatics diagnostic tools, LLMs, and other agentic systems, achieving an average Recall@1 score of 57.18% and surpassing the second-best method (Reasoning LLM) by a substantial margin of 23.79 percentage points. For multi-modal input scenarios, DeepRare achieves 70.60% at Recall@1 compared to Exomiser's 53.20% in 109 cases. Manual verification of reasoning chains by clinical experts achieves 95.40% agreements. Furthermore, the DeepRare system has been implemented as a user-friendly web application http://raredx.cn/doctor.
♻ ☆ SKA-Bench: A Fine-Grained Benchmark for Evaluating Structured Knowledge Understanding of LLMs EMNLP 2025
Although large language models (LLMs) have made significant progress in understanding Structured Knowledge (SK) like KG and Table, existing evaluations for SK understanding are non-rigorous (i.e., lacking evaluations of specific capabilities) and focus on a single type of SK. Therefore, we aim to propose a more comprehensive and rigorous structured knowledge understanding benchmark to diagnose the shortcomings of LLMs. In this paper, we introduce SKA-Bench, a Structured Knowledge Augmented QA Benchmark that encompasses four widely used structured knowledge forms: KG, Table, KG+Text, and Table+Text. We utilize a three-stage pipeline to construct SKA-Bench instances, which includes a question, an answer, positive knowledge units, and noisy knowledge units. To evaluate the SK understanding capabilities of LLMs in a fine-grained manner, we expand the instances into four fundamental ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection. Empirical evaluations on 8 representative LLMs, including the advanced DeepSeek-R1, indicate that existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination phenomenon. Our dataset and code are available at https://github.com/Lza12a/SKA-Bench.
comment: EMNLP 2025
♻ ☆ LLM-Enhanced Linear Autoencoders for Recommendation CIKM 2025
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
comment: Accepted by CIKM 2025
♻ ☆ RePPL: Recalibrating Perplexity by Uncertainty in Semantic Propagation and Language Generation for Explainable QA Hallucination Detection
Large Language Models (LLMs) have become powerful, but hallucinations remain a vital obstacle to their trustworthy use. While previous works improved the capability of hallucination detection by measuring uncertainty, they all lack the ability to explain the provenance behind why hallucinations occur, i.e., which part of the inputs tends to trigger hallucinations. Recent works on the prompt attack indicate that uncertainty exists in semantic propagation, where attention mechanisms gradually fuse local token information into high-level semantics across layers. Meanwhile, uncertainty also emerges in language generation, due to its probability-based selection of high-level semantics for sampled generations. Based on that, we propose RePPL to recalibrate uncertainty measurement by these two aspects, which dispatches explainable uncertainty scores to each token and aggregates in Perplexity-style Log-Average form as total score. Experiments show that our method achieves the best comprehensive detection performance across various QA datasets on advanced models (average AUC of 0.833), and our method is capable of producing token-level uncertainty scores as explanations for the hallucination. Leveraging these scores, we preliminarily find the chaotic pattern of hallucination and showcase its promising usage.
♻ ☆ Truth or Twist? Optimal Model Selection for Reliable Label Flipping Evaluation in LLM-based Counterfactuals
Counterfactual examples are widely employed to enhance the performance and robustness of large language models (LLMs) through counterfactual data augmentation (CDA). However, the selection of the judge model used to evaluate label flipping, the primary metric for assessing the validity of generated counterfactuals for CDA, yields inconsistent results. To decipher this, we define four types of relationships between the counterfactual generator and judge models: being the same model, belonging to the same model family, being independent models, and having an distillation relationship. Through extensive experiments involving two state-of-the-art LLM-based methods, three datasets, four generator models, and 15 judge models, complemented by a user study (n = 90), we demonstrate that judge models with an independent, non-fine-tuned relationship to the generator model provide the most reliable label flipping evaluations. Relationships between the generator and judge models, which are closely aligned with the user study for CDA, result in better model performance and robustness. Nevertheless, we find that the gap between the most effective judge models and the results obtained from the user study remains considerably large. This suggests that a fully automated pipeline for CDA may be inadequate and requires human intervention.
comment: Accepted at INLG 2025, camera-ready version
♻ ☆ Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.
♻ ☆ From Confidence to Collapse in LLM Factual Robustness
Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
♻ ☆ Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models EMNLP 2025
The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two datasets demonstrate significant improvements over baseline methods, and the case study highlight D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards interpretable misinformation detection. The code will be released publicly after the official publication.
comment: This paper has been accepted to EMNLP 2025 (Main Conference)
♻ ☆ Long-context Language Models Fail in Basic Retrieval Tasks Without Sufficient Reasoning Steps
Long-context language models (LCLMs), characterized by their extensive context window, are becoming popular. However, despite the fact that they are nearly perfect at standard long-context retrieval tasks, our evaluations demonstrate they fail in some basic cases. Later, we find they can be well addressed with a sufficient number of reasoning steps, guided by specific CoT prompts. This result emphasizes the potential necessity of solving specific long-context tasks using long-CoT methods, while previous long-context benchmarks always ignore the necessity of long reasoning for long-context tasks and treat them as direct QA tasks.
comment: Our code is publicly available at https://github.com/yuyijiong/hard_retrieval_for_llm and the datasets is at https://huggingface.co/datasets/yuyijiong/difficult_retrieval
♻ ☆ Weakly-Supervised 3D Visual Grounding based on Visual Language Alignment
Learning to ground natural language queries to target objects or regions in 3D point clouds is quite essential for 3D scene understanding. Nevertheless, existing 3D visual grounding approaches require a substantial number of bounding box annotations for text queries, which is time-consuming and labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly supervised approach for 3D visual grounding based on Visual Linguistic Alignment. Our 3D-VLA exploits the superior ability of current large-scale vision-language models (VLMs) on aligning the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds with no need for fine-grained box annotations in the training procedure. During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images. To the best of our knowledge, this is the first work to investigate 3D visual grounding in a weakly supervised manner by involving large scale vision-language models, and extensive experiments on ReferIt3D and ScanRefer datasets demonstrate that our 3D-VLA achieves comparable and even superior results over the fully supervised methods.
♻ ☆ Can Pruning Improve Reasoning? Revisiting Long-CoT Compression with Capability in Mind for Better Reasoning
Long chain-of-thought (Long-CoT) reasoning improves accuracy in LLMs, yet its verbose, self-reflective style often hinders effective distillation into small language models (SLMs). We revisit Long-CoT compression through the lens of capability alignment and ask: Can pruning improve reasoning? We propose Prune-on-Logic, a structure-aware framework that transforms Long-CoT into logic graphs and selectively prunes low-utility reasoning steps under self-verification constraints. Through systematic analysis across three pruning strategies - targeting entire chains, core reasoning, and verification - we find that verification pruning consistently improves accuracy while reducing token usage, whereas reasoning or indiscriminate pruning degrades performance. Our study reveals that effective pruning aligns supervision with model capacity rather than merely shortening inputs. Gains hold across tasks, model scales, and CoT capability, with larger models benefiting more from pruning due to richer but more redundant reasoning. Our empirical findings highlight pruning as a structural optimization strategy for aligning CoT reasoning with SLM capacity.
comment: 19 pages,6 figures
♻ ☆ Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Recent advances in reasoning models have demonstrated significant improvements in accuracy by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning steps exist, we decode multiple tokens per forward pass via a tree-like attention mask within a single sequence, avoiding additional memory usage. Experimental results show that our method achieves up to nearly 100\% speedup in decoding while basically maintaining the answer quality.
comment: Our code is available in https://github.com/yuyijiong/parallel-decoding-in-one-sequence
♻ ☆ Fingerprint Vector: Enabling Scalable and Efficient Model Fingerprint Transfer via Vector Addition
Backdoor-based fingerprinting has emerged as an effective technique for tracing the ownership of large language models. However, in real-world deployment scenarios, developers often instantiate multiple downstream models from a shared base model, and applying fingerprinting to each variant individually incurs prohibitive computational overhead. While inheritance-based approaches -- where fingerprints are embedded into the base model and expected to persist through fine-tuning -- appear attractive, they suffer from three key limitations: late-stage fingerprinting, fingerprint instability, and interference with downstream adaptation. To address these challenges, we propose a novel mechanism called the Fingerprint Vector. Our method first embeds a fingerprint into the base model via backdoor-based fine-tuning, then extracts a task-specific parameter delta as a fingerprint vector by computing the difference between the fingerprinted and clean models. This vector can be directly added to any structurally compatible downstream model, allowing the fingerprint to be transferred post hoc without additional fine-tuning. Extensive experiments show that Fingerprint Vector achieves comparable or superior performance to direct injection across key desiderata. It maintains strong effectiveness across diverse model architectures as well as mainstream downstream variants within the same family. It also preserves harmlessness and robustness in most cases. Even when slight robustness degradation is observed, the impact remains within acceptable bounds and is outweighed by the scalability benefits of our approach.
♻ ☆ SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models EMNLP 2025
Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
comment: Accepted by EMNLP 2025 (Main Conference), 15 pages, 4 figures, 6 tables
♻ ☆ sudoLLM: On Multi-role Alignment of Language Models EMNLP 2025
User authorization-based access privileges are a key feature in many safety-critical systems, but have not been extensively studied in the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, resistance to prefix-based jailbreaking attacks, and ``fails-closed''. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.
comment: Accepted to EMNLP 2025 (findings)
♻ ☆ Retrieval Enhanced Feedback via In-context Neural Error-book EMNLP 2025
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE's potential for enhancing multimodal reasoning.
comment: Accepted at EMNLP 2025 main conference
♻ ☆ Subjective Perspectives within Learned Representations Predict High-Impact Innovation
Existing studies of innovation emphasize the power of social structures to shape innovation capacity. Emerging machine learning approaches, however, enable us to model innovators' personal perspectives and interpersonal innovation opportunities as a function of their prior experience. We theorize and then quantify subjective perspectives and their interaction based on innovator positions within the geometric space of concepts inscribed by dynamic machine-learned language representations. Using data on millions of scientists, inventors, screenplay writers, entrepreneurs, and Wikipedia contributors across their respective creative domains, here we show that measured subjective perspectives predict which ideas individuals and groups will creatively attend to and successfully combine in the future. Across all cases and time periods we examine, when perspective diversity is decomposed as the difference between collaborators' perspectives on their creation, and background diversity as the difference between their experiences, the former consistently anticipates creative achievement while the latter portends its opposite. We analyze a natural experiment and simulate creative collaborations between AI agents designed with various perspective and background diversity, which support our observational findings. We explore mechanisms underlying these findings and identify how successful collaborators leverage common language to weave together diverse experiences obtained through trajectories of prior work. These perspectives converge and provoke one another to innovate. We examine the significance of these findings for team formation and research policy.
comment: 123 pages, 23 figures
♻ ☆ ELSPR: Evaluator LLM Training Data Self-Purification on Non-Transitive Preferences via Tournament Graph Reconstruction
Pairwise evaluation of large language models (LLMs) has become the dominant paradigm for benchmarking open-ended tasks, yet non-transitive preferences, where evaluators prefer A over B, B over C, but C over A, fundamentally undermine ranking reliability. We show that this critical issue stems largely from low-quality data that contains inherently ambiguous preference pairs. To address this challenge, we propose ELSPR, a principled graph-theoretic framework that models pairwise preferences as tournament graphs and systematically identifies problematic training data. ELSPR quantifies non-transitivity through strongly connected components (SCCs) analysis and measures overall preference clarity using a novel normalized directed graph structural entropy metric. Our filtering methodology selectively removes preference data that induce non-transitivity while preserving transitive preferences. Extensive experiments on the AlpacaEval benchmark demonstrate that models fine-tuned on ELSPR-filtered data achieve substantial improvements: a 13.8% reduction in non-transitivity, a 0.088 decrease in structural entropy, and significantly enhanced discriminative power in real-world evaluation systems. Human validation confirms that discarded data exhibit dramatically lower inter-annotator agreement (34.4% vs. 52.6%) and model-human consistency (51.2% vs. 80.6%) compared to cleaned data. These findings establish ELSPR as an effective data self-purification approach for developing more robust, consistent, and human-aligned LLM evaluation systems.
♻ ☆ HateDebias: On the Diversity and Variability of Hate Speech Debiasing
Hate speech frequently appears on social media platforms and urgently needs to be effectively controlled. Alleviating the bias caused by hate speech can help resolve various ethical issues. Although existing research has constructed several datasets for hate speech detection, these datasets seldom consider the diversity and variability of bias, making them far from real-world scenarios. To fill this gap, we propose a benchmark HateDebias to analyze the fairness of models under dynamically evolving environments. Specifically, to meet the diversity of biases, we collect hate speech data with different types of biases from real-world scenarios. To further simulate the variability in the real-world scenarios(i.e., the changing of bias attributes in datasets), we construct a dataset to follow the continuous learning setting and evaluate the detection accuracy of models on the HateDebias, where performance degradation indicates a significant bias toward a specific attribute. To provide a potential direction, we further propose a continual debiasing framework tailored to dynamic bias in real-world scenarios, integrating memory replay and bias information regularization to ensure the fairness of the model. Experiment results on the HateDebias benchmark reveal that our methods achieve improved performance in mitigating dynamic biases in real-world scenarios, highlighting the practicality in real-world applications.
♻ ☆ Collaborative Evaluation of Deepfake Text with Deliberation-Enhancing Dialogue Systems
The proliferation of generative models has presented significant challenges in distinguishing authentic human-authored content from deepfake content. Collaborative human efforts, augmented by AI tools, present a promising solution. In this study, we explore the potential of DeepFakeDeLiBot, a deliberation-enhancing chatbot, to support groups in detecting deepfake text. Our findings reveal that group-based problem-solving significantly improves the accuracy of identifying machine-generated paragraphs compared to individual efforts. While engagement with DeepFakeDeLiBot does not yield substantial performance gains overall, it enhances group dynamics by fostering greater participant engagement, consensus building, and the frequency and diversity of reasoning-based utterances. Additionally, participants with higher perceived effectiveness of group collaboration exhibited performance benefits from DeepFakeDeLiBot. These findings underscore the potential of deliberative chatbots in fostering interactive and productive group dynamics while ensuring accuracy in collaborative deepfake text detection. \textit{Dataset and source code used in this study will be made publicly available upon acceptance of the manuscript.
comment: 15; To appear in ICWSM 2026 (https://www.icwsm.org/2026/)
♻ ☆ Adapting Large Language Models to Log Analysis with Interpretable Domain Knowledge CIKM 2025
Log analysis represents a critical sub-domain within AI applications that facilitates automatic approaches to fault and error management of large-scaled software systems, saving labors of traditional manual methods. While existing solutions using large language models (LLMs) show promise, they are limited by a significant domain gap between natural and log languages (the latter contains rich domain-specific tokens such as status codes, IP addresses, resource pathes), which restricts their effectiveness in real-world applications. However, directly adapting general-purpose LLMs to log analysis using raw logs may degrade their performance due to inconsistent token distribution. In this paper, we present a domain adaptation approach that addresses these limitations by integrating interpretable domain knowledge into open-source LLMs through continual pre-training (CPT), which bridges this domain gap by adapting LLMs on interpretable natural texts with log knowledge (instead of raw logs) to reduce distribution discrepancy. To achieve this, we developed NLPLog, a comprehensive dataset containing over 250,000 question-answer pairs on log-related knowledge. Our resulting model, SuperLog, achieves the best performance across four log analysis tasks, with an average accuracy improvement of 12.01% over the second-best model. Ablation study also suggests advantages of domain adaption using interpretable log knowledge over using raw logs.
comment: Accepted by CIKM 2025
♻ ☆ Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings EMNLP 2025
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with these fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
comment: Accepted to EMNLP 2025 Findings
♻ ☆ Dream to Chat: Model-based Reinforcement Learning on Dialogues with User Belief Modeling EMNLP 2025
World models have been widely utilized in robotics, gaming, and auto-driving. However, their applications on natural language tasks are relatively limited. In this paper, we construct the dialogue world model, which could predict the user's emotion, sentiment, and intention, and future utterances. By defining a POMDP, we argue emotion, sentiment and intention can be modeled as the user belief and solved by maximizing the information bottleneck. By this user belief modeling, we apply the model-based reinforcement learning framework to the dialogue system, and propose a framework called DreamCUB. Experiments show that the pretrained dialogue world model can achieve state-of-the-art performances on emotion classification and sentiment identification, while dialogue quality is also enhanced by joint training of the policy, critic and dialogue world model. Further analysis shows that this manner holds a reasonable exploration-exploitation balance and also transfers well to out-of-domain scenarios such as empathetic dialogues.
comment: Accepted to EMNLP 2025 Findings
♻ ☆ Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements EMNLP 2025
In this paper, we propose a ``Generalization Stress Test" to assess Large Language Models' (LLMs) generalization ability under slight and controlled perturbations, including option length, problem types, and irrelevant noun replacements. We achieve novel and significant findings that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts.
comment: EMNLP 2025 Main Conference
♻ ☆ CausalSent: Interpretable Sentiment Classification with RieszNet
Despite the overwhelming performance improvements offered by recent natural language processing (NLP) models, the decisions made by these models are largely a black box. Towards closing this gap, the field of causal NLP combines causal inference literature with modern NLP models to elucidate causal effects of text features. We replicate and extend Bansal et al's work on regularizing text classifiers to adhere to estimated effects, focusing instead on model interpretability. Specifically, we focus on developing a two-headed RieszNet-based neural network architecture which achieves better treatment effect estimation accuracy. Our framework, CausalSent, accurately predicts treatment effects in semi-synthetic IMDB movie reviews, reducing MAE of effect estimates by 2-3x compared to Bansal et al's MAE on synthetic Civil Comments data. With an ensemble of validated models, we perform an observational case study on the causal effect of the word "love" in IMDB movie reviews, finding that the presence of the word "love" causes a +2.9% increase in the probability of a positive sentiment.
♻ ☆ Evaluating Scoring Bias in LLM-as-a-Judge
The remarkable performance of Large Language Models (LLMs) gives rise to``LLM-as-a-Judge'', where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments. Current research on evaluating or mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based evaluations, while systematic investigations into bias in scoring-based evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge as the scores differ when scoring judge models are bias-related perturbed, and provide a well-designed framework to comprehensively evaluate scoring bias. We augment existing LLM-as-a-Judge benchmarks through data synthesis to construct our evaluation dataset and design multi-faceted evaluation metrics. Our experimental results demonstrate that the scoring stability of existing judge models is disrupted by scoring biases. Further exploratory experiments and discussions provide valuable insights into the design of scoring prompt templates and the mitigation of scoring biases on aspects such as score rubrics, score IDs, and reference answer selection.
♻ ☆ EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.
comment: Accepted at (ASRU 2025) 2025 IEEE Automatic Speech Recognition and Understanding Workshop
♻ ☆ Measuring Sycophancy of Language Models in Multi-turn Dialogues EMNLP 2025
Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy--conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
comment: Accepted to Findings of EMNLP 2025
♻ ☆ DLLMQuant: Quantizing Diffusion-based Large Language Models
Diffusion-based large language models (DLLMs) have shown promise for non-autoregressive text generation, but their deployment is constrained by large model sizes and heavy computational costs. Post-training quantization (PTQ), a widely used method for compressing and accelerating Large Language Models (LLMs), suffers from severe accuracy degradation and reduced generalization performance when directly applied to DLLMs (e.g., AWQ suffers a 16% accuracy drop on LLADA under W4A4). This paper explores how DLLMs' key mechanisms - dynamic masking, iterative generation, bidirectional attention - clash with quantization. We identify three core issues: 1) Iterative generation and dynamic masking ratios lead to distinct token distributions across decoding steps, which are not adequately captured by existing PTQ calibration methods; 2) Quantization errors are accumulated and amplified progressively during iteration in DLLMs, causing quantized models to perform worse as decoding steps progress; 3) Unmasked tokens stabilize while masked remain probabilistic, making overall feature distribution incompatible with existing PTQ methods. To address these issues, we propose DLLMQuant, a PTQ framework tailored for DLLMs, which incorporates three novel techniques: 1) Temporal-Mask Adaptive Sampling (TMAS), a calibration method that accounts for both time and mask factors, with the capacity to capture distributions across timesteps. 2) Interaction-Aware Activation Quantization (IA-AQ), which utilizes bidirectional attention's interaction signals to dynamically allocate quantization resources. 3) Certainty-Guided Quantization (CGQ), which integrates mask status and token scores as key weighting criteria into error compensation, making weight quantization more suitable for DLLMs. Experiments show that DLLMQuant achieves significant performance gains while enhancing efficiency.
comment: 12 pages, 6 figures
♻ ☆ Improving Multilingual Language Models by Aligning Representations through Steering
This paper investigates how Large Language Models (LLMs) represent non-English tokens -- a question that remains underexplored despite recent progress. We propose a lightweight intervention method using representation steering, where a learned vector is added to the residual stream at a single model layer to enhance multilingual performance. Through extensive experiments across seven competitive baselines -- including prompt optimization, supervised fine-tuning (SFT), in-context learning, cross-lingual transfer, and translation-based methods-we show that our approach consistently outperforms most alternatives. In particular, it achieves performance on par with production-grade translation systems while requiring far fewer resources. We further explore the complementarity between our method and SFT, demonstrating that steering offers a direct, efficient way to realign internal representations. These findings underscore the potential of activation-level interventions as a powerful tool for improving the multilingual capabilities of LLMs.
♻ ☆ Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions EMNLP 2025
Model compression through post-training pruning offers a way to reduce model size and computational requirements without significantly impacting model performance. However, the effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarisation where biased outputs could influence public views.In this paper, we present a comprehensive empirical analysis of opinion summarisation, examining three state-of-the-art pruning methods and various calibration sets across three open-source LLMs using four fairness metrics. Our systematic analysis reveals that pruning methods have a greater impact on fairness than calibration sets. Building on these insights, we propose High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation. Our experiments demonstrate that HGLA can better maintain or even improve fairness compared to existing methods, showing promise across models and tasks where traditional methods have limitations. Our human evaluation shows HGLA-generated outputs are fairer than existing state-of-the-art pruning methods. Code is available at: https://github.com/amberhuang01/HGLA.
comment: Accepted to EMNLP 2025 Main Conference
♻ ☆ Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector to preserve critical context for future conversation turns and selects a customized strategy for the conversation; 2) a token-wise heterogeneous attention similarity estimator to mitigate the attention similarity computation and storage overhead during model generation; 3) a bubble-free restoration scheduler to reduce potential bubbles brought by the imbalance of recomputing and loading stream due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
Information Retrieval 8
☆ Optimization of Latent-Space Compression using Game-Theoretic Techniques for Transformer-Based Vector Search
Vector similarity search plays a pivotal role in modern information retrieval systems, especially when powered by transformer-based embeddings. However, the scalability and efficiency of such systems are often hindered by the high dimensionality of latent representations. In this paper, we propose a novel game-theoretic framework for optimizing latent-space compression to enhance both the efficiency and semantic utility of vector search. By modeling the compression strategy as a zero-sum game between retrieval accuracy and storage efficiency, we derive a latent transformation that preserves semantic similarity while reducing redundancy. We benchmark our method against FAISS, a widely-used vector search library, and demonstrate that our approach achieves a significantly higher average similarity (0.9981 vs. 0.5517) and utility (0.8873 vs. 0.5194), albeit with a modest increase in query time. This trade-off highlights the practical value of game-theoretic latent compression in high-utility, transformer-based search applications. The proposed system can be seamlessly integrated into existing LLM pipelines to yield more semantically accurate and computationally efficient retrieval.
☆ Taming the One-Epoch Phenomenon in Online Recommendation System by Two-stage Contrastive ID Pre-training RecSys'24
ID-based embeddings are widely used in web-scale online recommendation systems. However, their susceptibility to overfitting, particularly due to the long-tail nature of data distributions, often limits training to a single epoch, a phenomenon known as the "one-epoch problem." This challenge has driven research efforts to optimize performance within the first epoch by enhancing convergence speed or feature sparsity. In this study, we introduce a novel two-stage training strategy that incorporates a pre-training phase using a minimal model with contrastive loss, enabling broader data coverage for the embedding system. Our offline experiments demonstrate that multi-epoch training during the pre-training phase does not lead to overfitting, and the resulting embeddings improve online generalization when fine-tuned for more complex downstream recommendation tasks. We deployed the proposed system in live traffic at Pinterest, achieving significant site-wide engagement gains.
comment: Published at RecSys'24, see https://dl.acm.org/doi/10.1145/3640457.3688053
☆ Membership Inference Attacks on LLM-based Recommender Systems
Large language models (LLMs) based Recommender Systems (RecSys) can flexibly adapt recommendation systems to different domains. It utilizes in-context learning (ICL), i.e., the prompts, to customize the recommendation functions, which include sensitive historical user-specific item interactions, e.g., implicit feedback like clicked items or explicit product reviews. Such private information may be exposed to novel privacy attack. However, no study has been done on this important issue. We design four membership inference attacks (MIAs), aiming to reveal whether victims' historical interactions have been used by system prompts. They are \emph{direct inquiry, hallucination, similarity, and poisoning attacks}, each of which utilizes the unique features of LLMs or RecSys. We have carefully evaluated them on three LLMs that have been used to develop ICL-LLM RecSys and two well-known RecSys benchmark datasets. The results confirm that the MIA threat on LLM RecSys is realistic: direct inquiry and poisoning attacks showing significantly high attack advantages. We have also analyzed the factors affecting these attacks, such as the number of shots in system prompts and the position of the victim in the shots.
☆ Extracting Information from Scientific Literature via Visual Table Question Answering Models
This study explores three approaches to processing table data in scientific papers to enhance extractive question answering and develop a software tool for the systematic review process. The methods evaluated include: (1) Optical Character Recognition (OCR) for extracting information from documents, (2) Pre-trained models for document visual question answering, and (3) Table detection and structure recognition to extract and merge key information from tables with textual content to answer extractive questions. In exploratory experiments, we augmented ten sample test documents containing tables and relevant content against RF- EMF-related scientific papers with seven predefined extractive question-answer pairs. The results indicate that approaches preserving table structure outperform the others, particularly in representing and organizing table content. Accurately recognizing specific notations and symbols within the documents emerged as a critical factor for improved results. Our study concludes that preserving the structural integrity of tables is essential for enhancing the accuracy and reliability of extractive question answering in scientific documents.
comment: Accepted at ACM International Conference on Research in Adaptive and Convergent Systems, November 5-8, 2024, Pompei, Italy
♻ ☆ An Ontology-Driven Graph RAG for Legal Norms: A Hierarchical, Temporal, and Deterministic Approach
Retrieval-Augmented Generation (RAG) systems in the legal domain face a critical challenge: standard, flat-text retrieval is blind to the hierarchical, diachronic, and causal structure of law, leading to anachronistic and unreliable answers. This paper introduces an ontology-driven Graph RAG framework designed to overcome these limitations. We ground our knowledge graph in a formal, LRMoo-inspired model that distinguishes abstract legal Works from their versioned Expressions. We model temporal states as efficient aggregations that reuse the versioned expressions (CTVs) of unchanged components, and we reify legislative events as first-class Action nodes to make causality explicit and queryable. This structured backbone enables a unified, planner-guided query strategy that applies explicit policies to deterministically resolve complex requests for (i) point-in-time retrieval, (ii) hierarchical impact analysis, and (iii) auditable provenance reconstruction. Through a case study on the Brazilian Constitution, we demonstrate how this approach provides a verifiable, temporally-correct substrate for LLMs, enabling higher-order analytical capabilities while drastically reducing the risk of factual errors. The result is a practical framework for building more trustworthy and explainable legal AI systems.
comment: This is a major revision that significantly expands and deepens the original manuscript. While the core ontological model remains the same, this version provides a substantially more rigorous and detailed account of how the framework is applied in practice, particularly within a Retrieval-Augmented Generation (RAG) context
♻ ☆ LLM-Enhanced Linear Autoencoders for Recommendation CIKM 2025
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
comment: Accepted by CIKM 2025
♻ ☆ PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation
Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel -- unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.
comment: 9 pages, 4 figures, conference
♻ ☆ ST-Raptor: LLM-Powered Semi-Structured Table Question Answering SIGMOD 2026
Semi-structured tables, widely used in real-world applications (e.g., financial reports, medical records, transactional orders), often involve flexible and complex layouts (e.g., hierarchical headers and merged cells). These tables generally rely on human analysts to interpret table layouts and answer relevant natural language questions, which is costly and inefficient. To automate the procedure, existing methods face significant challenges. First, methods like NL2SQL require converting semi-structured tables into structured ones, which often causes substantial information loss. Second, methods like NL2Code and multi-modal LLM QA struggle to understand the complex layouts of semi-structured tables and cannot accurately answer corresponding questions. To this end, we propose ST-Raptor, a tree-based framework for semi-structured table question answering using large language models. First, we introduce the Hierarchical Orthogonal Tree (HO-Tree), a structural model that captures complex semi-structured table layouts, along with an effective algorithm for constructing the tree. Second, we define a set of basic tree operations to guide LLMs in executing common QA tasks. Given a user question, ST-Raptor decomposes it into simpler sub-questions, generates corresponding tree operation pipelines, and conducts operation-table alignment for accurate pipeline execution. Third, we incorporate a two-stage verification mechanism: forward validation checks the correctness of execution steps, while backward validation evaluates answer reliability by reconstructing queries from predicted answers. To benchmark the performance, we present SSTQA, a dataset of 764 questions over 102 real-world semi-structured tables. Experiments show that ST-Raptor outperforms nine baselines by up to 20% in answer accuracy. The code is available at https://github.com/weAIDB/ST-Raptor.
comment: Extension of our SIGMOD 2026 paper. Please refer to source code available at: https://github.com/weAIDB/ST-Raptor
Machine Learning 150
☆ Predicting the Order of Upcoming Tokens Improves Language Modeling
Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We argue that MTP's exact future token prediction is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction
☆ Understanding Tool-Integrated Reasoning
We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, leveraging a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally-intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool usage behavior with early code invocation and much more interactive turns with ASPO. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
☆ Planning-Query-Guided Model Generation for Model-Based Deformable Object Manipulation
Efficient planning in high-dimensional spaces, such as those involving deformable objects, requires computationally tractable yet sufficiently expressive dynamics models. This paper introduces a method that automatically generates task-specific, spatially adaptive dynamics models by learning which regions of the object require high-resolution modeling to achieve good task performance for a given planning query. Task performance depends on the complex interplay between the dynamics model, world dynamics, control, and task requirements. Our proposed diffusion-based model generator predicts per-region model resolutions based on start and goal pointclouds that define the planning query. To efficiently collect the data for learning this mapping, a two-stage process optimizes resolution using predictive dynamics as a prior before directly optimizing using closed-loop performance. On a tree-manipulation task, our method doubles planning speed with only a small decrease in task performance over using a full-resolution model. This approach informs a path towards using previous planning and control data to generate computationally efficient yet sufficiently expressive dynamics models for new tasks.
comment: 9 pages, 7 figures
☆ Emotions as Ambiguity-aware Ordinal Representations
Emotions are inherently ambiguous and dynamic phenomena, yet existing continuous emotion recognition approaches either ignore their ambiguity or treat ambiguity as an independent and static variable over time. Motivated by this gap in the literature, in this paper we introduce \emph{ambiguity-aware ordinal} emotion representations, a novel framework that captures both the ambiguity present in emotion annotation and the inherent temporal dynamics of emotional traces. Specifically, we propose approaches that model emotion ambiguity through its rate of change. We evaluate our framework on two affective corpora -- RECOLA and GameVibe -- testing our proposed approaches on both bounded (arousal, valence) and unbounded (engagement) continuous traces. Our results demonstrate that ordinal representations outperform conventional ambiguity-aware models on unbounded labels, achieving the highest Concordance Correlation Coefficient (CCC) and Signed Differential Agreement (SDA) scores, highlighting their effectiveness in modeling the traces' dynamics. For bounded traces, ordinal representations excel in SDA, revealing their superior ability to capture relative changes of annotated emotion traces.
☆ Get Global Guarantees: On the Probabilistic Nature of Perturbation Robustness
In safety-critical deep learning applications, robustness measures the ability of neural models that handle imperceptible perturbations in input data, which may lead to potential safety hazards. Existing pre-deployment robustness assessment methods typically suffer from significant trade-offs between computational cost and measurement precision, limiting their practical utility. To address these limitations, this paper conducts a comprehensive comparative analysis of existing robustness definitions and associated assessment methodologies. We propose tower robustness to evaluate robustness, which is a novel, practical metric based on hypothesis testing to quantitatively evaluate probabilistic robustness, enabling more rigorous and efficient pre-deployment assessments. Our extensive comparative evaluation illustrates the advantages and applicability of our proposed approach, thereby advancing the systematic understanding and enhancement of model robustness in safety-critical deep learning applications.
☆ Leveraging Evolutionary Surrogate-Assisted Prescription in Multi-Objective Chlorination Control Systems
This short, written report introduces the idea of Evolutionary Surrogate-Assisted Prescription (ESP) and presents preliminary results on its potential use in training real-world agents as a part of the 1st AI for Drinking Water Chlorination Challenge at IJCAI-2025. This work was done by a team from Project Resilience, an organization interested in bridging AI to real-world problems.
☆ From Tabula Rasa to Emergent Abilities: Discovering Robot Skills via Real-World Unsupervised Quality-Diversity
Autonomous skill discovery aims to enable robots to acquire diverse behaviors without explicit supervision. Learning such behaviors directly on physical hardware remains challenging due to safety and data efficiency constraints. Existing methods, including Quality-Diversity Actor-Critic (QDAC), require manually defined skill spaces and carefully tuned heuristics, limiting real-world applicability. We propose Unsupervised Real-world Skill Acquisition (URSA), an extension of QDAC that enables robots to autonomously discover and master diverse, high-performing skills directly in the real world. We demonstrate that URSA successfully discovers diverse locomotion skills on a Unitree A1 quadruped in both simulation and the real world. Our approach supports both heuristic-driven skill discovery and fully unsupervised settings. We also show that the learned skill repertoire can be reused for downstream tasks such as real-world damage adaptation, where URSA outperforms all baselines in 5 out of 9 simulated and 3 out of 5 real-world damage scenarios. Our results establish a new framework for real-world robot learning that enables continuous skill discovery with limited human intervention, representing a significant step toward more autonomous and adaptable robotic systems. Demonstration videos are available at http://adaptive-intelligent-robotics.github.io/URSA .
comment: Accepted at CoRL 2025
☆ Few-Shot Connectivity-Aware Text Line Segmentation in Historical Documents
A foundational task for the digital analysis of documents is text line segmentation. However, automating this process with deep learning models is challenging because it requires large, annotated datasets that are often unavailable for historical documents. Additionally, the annotation process is a labor- and cost-intensive task that requires expert knowledge, which makes few-shot learning a promising direction for reducing data requirements. In this work, we demonstrate that small and simple architectures, coupled with a topology-aware loss function, are more accurate and data-efficient than more complex alternatives. We pair a lightweight UNet++ with a connectivity-aware loss, initially developed for neuron morphology, which explicitly penalizes structural errors like line fragmentation and unintended line merges. To increase our limited data, we train on small patches extracted from a mere three annotated pages per manuscript. Our methodology significantly improves upon the current state-of-the-art on the U-DIADS-TL dataset, with a 200% increase in Recognition Accuracy and a 75% increase in Line Intersection over Union. Our method also achieves an F-Measure score on par with or even exceeding that of the competition winner of the DIVA-HisDB baseline detection task, all while requiring only three annotated pages, exemplifying the efficacy of our approach. Our implementation is publicly available at: https://github.com/RafaelSterzinger/acpr_few_shot_hist.
comment: 15 pages, accepted at ACPR2025
☆ Playstyle and Artificial Intelligence: An Initial Blueprint Through the Lens of Video Games
Contemporary artificial intelligence (AI) development largely centers on rational decision-making, valued for its measurability and suitability for objective evaluation. Yet in real-world contexts, an intelligent agent's decisions are shaped not only by logic but also by deeper influences such as beliefs, values, and preferences. The diversity of human decision-making styles emerges from these differences, highlighting that "style" is an essential but often overlooked dimension of intelligence. This dissertation introduces playstyle as an alternative lens for observing and analyzing the decision-making behavior of intelligent agents, and examines its foundational meaning and historical context from a philosophical perspective. By analyzing how beliefs and values drive intentions and actions, we construct a two-tier framework for style formation: the external interaction loop with the environment and the internal cognitive loop of deliberation. On this basis, we formalize style-related characteristics and propose measurable indicators such as style capacity, style popularity, and evolutionary dynamics. The study focuses on three core research directions: (1) Defining and measuring playstyle, proposing a general playstyle metric based on discretized state spaces, and extending it to quantify strategic diversity and competitive balance; (2) Expressing and generating playstyle, exploring how reinforcement learning and imitation learning can be used to train agents exhibiting specific stylistic tendencies, and introducing a novel approach for human-like style learning and modeling; and (3) Practical applications, analyzing the potential of these techniques in domains such as game design and interactive entertainment. Finally, the dissertation outlines future extensions, including the role of style as a core element in building artificial general intelligence (AGI).
comment: PhD Dissertation, National Yang Ming Chiao Tung University, 2025. This is the public version without Chinese abstract or postscript
☆ Saddle Hierarchy in Dense Associative Memory
Dense associative memory (DAM) models have been attracting renewed attention since they were shown to be robust to adversarial examples and closely related to state-of-the-art machine learning paradigms, such as the attention mechanisms in transformers and generative diffusion models. We study a DAM built upon a three-layer Boltzmann machine with Potts hidden units, which represent data clusters and classes. Through a statistical mechanics analysis, we derive saddle-point equations that characterize both the stationary points of DAMs trained on real data and the fixed points of DAMs trained on synthetic data within a teacher-student framework. Based on these results, we propose a novel regularization scheme that makes training significantly more stable. Moreover, we show empirically that our DAM learns interpretable solutions to both supervised and unsupervised classification problems. Pushing our theoretical analysis further, we find that the weights learned by relatively small DAMs correspond to unstable saddle points in larger DAMs. We implement a network-growing algorithm that leverages this saddle-point hierarchy to drastically reduce the computational cost of training dense associative memory.
comment: 55 pages, 10 figures
☆ Echoes of the past: A unified perspective on fading memory and echo states
Recurrent neural networks (RNNs) have become increasingly popular in information processing tasks involving time series and temporal data. A fundamental property of RNNs is their ability to create reliable input/output responses, often linked to how the network handles its memory of the information it processed. Various notions have been proposed to conceptualize the behavior of memory in RNNs, including steady states, echo states, state forgetting, input forgetting, and fading memory. Although these notions are often used interchangeably, their precise relationships remain unclear. This work aims to unify these notions in a common language, derive new implications and equivalences between them, and provide alternative proofs to some existing results. By clarifying the relationships between these concepts, this research contributes to a deeper understanding of RNNs and their temporal information processing capabilities.
☆ A Bag of Tricks for Efficient Implicit Neural Point Clouds
Implicit Neural Point Cloud (INPC) is a recent hybrid representation that combines the expressiveness of neural fields with the efficiency of point-based rendering, achieving state-of-the-art image quality in novel view synthesis. However, as with other high-quality approaches that query neural networks during rendering, the practical usability of INPC is limited by comparatively slow rendering. In this work, we present a collection of optimizations that significantly improve both the training and inference performance of INPC without sacrificing visual fidelity. The most significant modifications are an improved rasterizer implementation, more effective sampling techniques, and the incorporation of pre-training for the convolutional neural network used for hole-filling. Furthermore, we demonstrate that points can be modeled as small Gaussians during inference to further improve quality in extrapolated, e.g., close-up views of the scene. We design our implementations to be broadly applicable beyond INPC and systematically evaluate each modification in a series of experiments. Our optimized INPC pipeline achieves up to 25% faster training, 2x faster rendering, and 20% reduced VRAM usage paired with slight image quality improvements.
comment: Project page: https://fhahlbohm.github.io/inpc_v2/
☆ Active Query Selection for Crowd-Based Reinforcement Learning
Preference-based reinforcement learning has gained prominence as a strategy for training agents in environments where the reward signal is difficult to specify or misaligned with human intent. However, its effectiveness is often limited by the high cost and low availability of reliable human input, especially in domains where expert feedback is scarce or errors are costly. To address this, we propose a novel framework that combines two complementary strategies: probabilistic crowd modelling to handle noisy, multi-annotator feedback, and active learning to prioritize feedback on the most informative agent actions. We extend the Advise algorithm to support multiple trainers, estimate their reliability online, and incorporate entropy-based query selection to guide feedback requests. We evaluate our approach in a set of environments that span both synthetic and real-world-inspired settings, including 2D games (Taxi, Pacman, Frozen Lake) and a blood glucose control task for Type 1 Diabetes using the clinically approved UVA/Padova simulator. Our preliminary results demonstrate that agents trained with feedback on uncertain trajectories exhibit faster learning in most tasks, and we outperform the baselines for the blood glucose control task.
comment: 7 pages, 4 figures, 2 tables plus appendices
☆ Random forest-based out-of-distribution detection for robust lung cancer segmentation
Accurate detection and segmentation of cancerous lesions from computed tomography (CT) scans is essential for automated treatment planning and cancer treatment response assessment. Transformer-based models with self-supervised pretraining can produce reliably accurate segmentation from in-distribution (ID) data but degrade when applied to out-of-distribution (OOD) datasets. We address this challenge with RF-Deep, a random forest classifier that utilizes deep features from a pretrained transformer encoder of the segmentation model to detect OOD scans and enhance segmentation reliability. The segmentation model comprises a Swin Transformer encoder, pretrained with masked image modeling (SimMIM) on 10,432 unlabeled 3D CT scans covering cancerous and non-cancerous conditions, with a convolution decoder, trained to segment lung cancers in 317 3D scans. Independent testing was performed on 603 3D CT public datasets that included one ID dataset and four OOD datasets comprising chest CTs with pulmonary embolism (PE) and COVID-19, and abdominal CTs with kidney cancers and healthy volunteers. RF-Deep detected OOD cases with a FPR95 of 18.26%, 27.66%, and less than 0.1% on PE, COVID-19, and abdominal CTs, consistently outperforming established OOD approaches. The RF-Deep classifier provides a simple and effective approach to enhance reliability of cancer segmentation in ID and OOD scenarios.
☆ Composition and Alignment of Diffusion Models using Constrained Learning
Diffusion models have become prevalent in generative modeling due to their ability to sample from complex distributions. To improve the quality of generated samples and their compliance with user requirements, two commonly used methods are: (i) Alignment, which involves fine-tuning a diffusion model to align it with a reward; and (ii) Composition, which combines several pre-trained diffusion models, each emphasizing a desirable attribute in the generated outputs. However, trade-offs often arise when optimizing for multiple rewards or combining multiple models, as they can often represent competing properties. Existing methods cannot guarantee that the resulting model faithfully generates samples with all the desired properties. To address this gap, we propose a constrained optimization framework that unifies alignment and composition of diffusion models by enforcing that the aligned model satisfies reward constraints and/or remains close to (potentially multiple) pre-trained models. We provide a theoretical characterization of the solutions to the constrained alignment and composition problems and develop a Lagrangian-based primal-dual training algorithm to approximate these solutions. Empirically, we demonstrate the effectiveness and merits of our proposed approach in image generation, applying it to alignment and composition, and show that our aligned or composed model satisfies constraints effectively, and improves on the equally-weighted approach. Our implementation can be found at https://github.com/shervinkhalafi/constrained_comp_align.
☆ APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
Large language models (LLMs) have revolutionized AI applications, yet their enormous computational demands severely limit deployment and real-time performance. Quantization methods can help reduce computational costs, however, attaining the extreme efficiency associated with ultra-low-bit quantized LLMs at arbitrary precision presents challenges on GPUs. This is primarily due to the limited support for GPU Tensor Cores, inefficient memory management, and inflexible kernel optimizations. To tackle these challenges, we propose a comprehensive acceleration scheme for arbitrary precision LLMs, namely APT-LLM. Firstly, we introduce a novel data format, bipolar-INT, which allows for efficient and lossless conversion with signed INT, while also being more conducive to parallel computation. We also develop a matrix multiplication (MatMul) method allowing for arbitrary precision by dismantling and reassembling matrices at the bit level. This method provides flexible precision and optimizes the utilization of GPU Tensor Cores. In addition, we propose a memory management system focused on data recovery, which strategically employs fast shared memory to substantially increase kernel execution speed and reduce memory access latency. Finally, we develop a kernel mapping method that dynamically selects the optimal configurable hyperparameters of kernels for varying matrix sizes, enabling optimal performance across different LLM architectures and precision settings. In LLM inference, APT-LLM achieves up to a 3.99$\times$ speedup compared to FP16 baselines and a 2.16$\times$ speedup over NVIDIA CUTLASS INT4 acceleration on RTX 3090. On RTX 4090 and H800, APT-LLM achieves up to 2.44$\times$ speedup over FP16 and 1.65$\times$ speedup over CUTLASS integer baselines.
comment: To appear in the IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)
☆ Universal Dynamics with Globally Controlled Analog Quantum Simulators
Analog quantum simulators with global control fields have emerged as powerful platforms for exploring complex quantum phenomena. Recent breakthroughs, such as the coherent control of thousands of atoms, highlight the growing potential for quantum applications at scale. Despite these advances, a fundamental theoretical question remains unresolved: to what extent can such systems realize universal quantum dynamics under global control? Here we establish a necessary and sufficient condition for universal quantum computation using only global pulse control, proving that a broad class of analog quantum simulators is, in fact, universal. We further extend this framework to fermionic and bosonic systems, including modern platforms such as ultracold atoms in optical superlattices. Crucially, to connect the theoretical possibility with experimental reality, we introduce a new control technique into the experiment - direct quantum optimal control. This method enables the synthesis of complex effective Hamiltonians and allows us to incorporate realistic hardware constraints. To show its practical power, we experimentally engineer three-body interactions outside the blockade regime and demonstrate topological dynamics on a Rydberg atom array. Using the new control framework, we overcome key experimental challenges, including hardware limitations and atom position fluctuations in the non-blockade regime, by identifying smooth, short-duration pulses that achieve high-fidelity dynamics. Experimental measurements reveal dynamical signatures of symmetry-protected-topological edge modes, confirming both the expressivity and feasibility of our approach. Our work opens a new avenue for quantum simulation beyond native hardware Hamiltonians, enabling the engineering of effective multi-body interactions and advancing the frontier of quantum information processing with globally-controlled analog platforms.
comment: 12 pages, 5 figures
☆ CARMA: Collocation-Aware Resource Manager with GPU Memory Estimator
Studies conducted on enterprise-scale infrastructure have shown that GPUs -- the core computational resource for deep learning (DL) training -- are often significantly underutilized. DL task collocation on GPUs is an opportunity to address this challenge. However, it may result in (1) out-of-memory crashes for the subsequently arriving task and (2) slowdowns for all tasks sharing the GPU due to resource interference. The former challenge poses a threat to robustness, while the latter affects the quality of service and energy efficiency. We propose CARMA, a server-scale task-level collocation-aware resource management system that handles both collocation challenges. CARMA encompasses GPUMemNet, a novel ML-based GPU memory estimator framework for DL training tasks, to minimize out-of-memory errors and introduces collocation policies that cap GPU utilization to minimize interference. Furthermore, CARMA introduces a recovery method to ensure robust restart of tasks that crash. Our evaluation on traces modeled after real-world DL training task traces shows that CARMA increases the GPU utilization over time by 39.3\%, decreases the end-to-end execution time by $\sim$26.7\%, and reduces the GPU energy use by $\sim$14.2\%.
☆ Attackers Strike Back? Not Anymore -- An Ensemble of RL Defenders Awakens for APT Detection
Advanced Persistent Threats (APTs) represent a growing menace to modern digital infrastructure. Unlike traditional cyberattacks, APTs are stealthy, adaptive, and long-lasting, often bypassing signature-based detection systems. This paper introduces a novel framework for APT detection that unites deep learning, reinforcement learning (RL), and active learning into a cohesive, adaptive defense system. Our system combines auto-encoders for latent behavioral encoding with a multi-agent ensemble of RL-based defenders, each trained to distinguish between benign and malicious process behaviors. We identify a critical challenge in existing detection systems: their static nature and inability to adapt to evolving attack strategies. To this end, our architecture includes multiple RL agents (Q-Learning, PPO, DQN, adversarial defenders), each analyzing latent vectors generated by an auto-encoder. When any agent is uncertain about its decision, the system triggers an active learning loop to simulate expert feedback, thus refining decision boundaries. An ensemble voting mechanism, weighted by each agent's performance, ensures robust final predictions.
☆ Dynamic Triangulation-Based Graph Rewiring for Graph Neural Networks CIKM 2025
Graph Neural Networks (GNNs) have emerged as the leading paradigm for learning over graph-structured data. However, their performance is limited by issues inherent to graph topology, most notably oversquashing and oversmoothing. Recent advances in graph rewiring aim to mitigate these limitations by modifying the graph topology to promote more effective information propagation. In this work, we introduce TRIGON, a novel framework that constructs enriched, non-planar triangulations by learning to select relevant triangles from multiple graph views. By jointly optimizing triangle selection and downstream classification performance, our method produces a rewired graph with markedly improved structural properties such as reduced diameter, increased spectral gap, and lower effective resistance compared to existing rewiring methods. Empirical results demonstrate that TRIGON outperforms state-of-the-art approaches on node classification tasks across a range of homophilic and heterophilic benchmarks.
comment: Accepted to CIKM 2025
☆ Learning Binary Sampling Patterns for Single-Pixel Imaging using Bilevel Optimisation
Single-Pixel Imaging enables reconstructing objects using a single detector through sequential illuminations with structured light patterns. We propose a bilevel optimisation method for learning task-specific, binary illumination patterns, optimised for applications like single-pixel fluorescence microscopy. We address the non-differentiable nature of binary pattern optimisation using the Straight-Through Estimator and leveraging a Total Deep Variation regulariser in the bilevel formulation. We demonstrate our method on the CytoImageNet microscopy dataset and show that learned patterns achieve superior reconstruction performance compared to baseline methods, especially in highly undersampled regimes.
☆ Tackling Federated Unlearning as a Parameter Estimation Problem
Privacy regulations require the erasure of data from deep learning models. This is a significant challenge that is amplified in Federated Learning, where data remains on clients, making full retraining or coordinated updates often infeasible. This work introduces an efficient Federated Unlearning framework based on information theory, modeling leakage as a parameter estimation problem. Our method uses second-order Hessian information to identify and selectively reset only the parameters most sensitive to the data being forgotten, followed by minimal federated retraining. This model-agnostic approach supports categorical and client unlearning without requiring server access to raw client data after initial information aggregation. Evaluations on benchmark datasets demonstrate strong privacy (MIA success near random, categorical knowledge erased) and high performance (Normalized Accuracy against re-trained benchmarks of $\approx$ 0.9), while aiming for increased efficiency over complete retraining. Furthermore, in a targeted backdoor attack scenario, our framework effectively neutralizes the malicious trigger, restoring model integrity. This offers a practical solution for data forgetting in FL.
comment: 18 pages, 1 figure
☆ Automated discovery of finite volume schemes using Graph Neural Networks
Graph Neural Networks (GNNs) have deeply modified the landscape of numerical simulations by demonstrating strong capabilities in approximating solutions of physical systems. However, their ability to extrapolate beyond their training domain (\textit{e.g.} larger or structurally different graphs) remains uncertain. In this work, we establish that GNNs can serve purposes beyond their traditional role, and be exploited to generate numerical schemes, in conjunction with symbolic regression. First, we show numerically and theoretically that a GNN trained on a dataset consisting solely of two-node graphs can extrapolate a first-order Finite Volume (FV) scheme for the heat equation on out-of-distribution, unstructured meshes. Specifically, if a GNN achieves a loss $\varepsilon$ on such a dataset, it implements the FV scheme with an error of $\mathcal{O}(\varepsilon)$. Using symbolic regression, we show that the network effectively rediscovers the exact analytical formulation of the standard first-order FV scheme. We then extend this approach to an unsupervised context: the GNN recovers the first-order FV scheme using only a residual loss similar to Physics-Informed Neural Networks (PINNs) with no access to ground-truth data. Finally, we push the methodology further by considering higher-order schemes: we train (i) a 2-hop and (ii) a 2-layers GNN using the same PINN loss, that autonomously discover (i) a second-order correction term to the initial scheme using a 2-hop stencil, and (ii) the classic second-order midpoint scheme. These findings follows a recent paradigm in scientific computing: GNNs are not only strong approximators, but can be active contributors to the development of novel numerical methods.
☆ Breaking the Black Box: Inherently Interpretable Physics-Informed Machine Learning for Imbalanced Seismic Data
Ground motion models (GMMs) predict how strongly the ground will shake during an earthquake. They are essential for structural analysis, seismic design, and seismic risk assessment studies. Traditional machine learning (ML) approaches are popular to develop GMMs, due to large earthquake databases worldwide. However, they operate as "black boxes," which are hard to interpret and trust, limiting their use in high-stake decisions. Additionally, these databases suffer from significant data imbalances: fewer large, critically damaging records near the fault compared to abundant, less severely damaging distant records. These two limitations are addressed in this work by developing a transparent ML architecture using the HazBinLoss function. Each input (e.g., magnitude, distance, their interaction term, etc.) is processed separately and added linearly to obtain the output, resulting in exact contribution of each term. The HazBinLoss function assigns higher weights to critical near-field large magnitude records and lower weights to less-critical far-field smaller magnitude records, during training to prevent underprediction of the most damaging scenarios. Our model captures known seismological principles and achieves comparable performance with established GMMs while maintaining transparency. This framework enables broader adoption of ML-based approaches for risk assessment studies and disaster planning.
comment: 19 pages, 9 Figures and 2 Tables
☆ GReAT: leveraging geometric artery data to improve wall shear stress assessment MICCAI 2025
Leveraging big data for patient care is promising in many medical fields such as cardiovascular health. For example, hemodynamic biomarkers like wall shear stress could be assessed from patient-specific medical images via machine learning algorithms, bypassing the need for time-intensive computational fluid simulation. However, it is extremely challenging to amass large-enough datasets to effectively train such models. We could address this data scarcity by means of self-supervised pre-training and foundations models given large datasets of geometric artery models. In the context of coronary arteries, leveraging learned representations to improve hemodynamic biomarker assessment has not yet been well studied. In this work, we address this gap by investigating whether a large dataset (8449 shapes) consisting of geometric models of 3D blood vessels can benefit wall shear stress assessment in coronary artery models from a small-scale clinical trial (49 patients). We create a self-supervised target for the 3D blood vessels by computing the heat kernel signature, a quantity obtained via Laplacian eigenvectors, which captures the very essence of the shapes. We show how geometric representations learned from this datasets can boost segmentation of coronary arteries into regions of low, mid and high (time-averaged) wall shear stress even when trained on limited data.
comment: (MICCAI 2025) Workshop on Shape in Medical Imaging (ShapeMI)
☆ When recalling in-context, Transformers are not SSMs
Despite the advantageous subquadratic complexity of modern recurrent deep learning models -- such as state-space models (SSMs) -- recent studies have highlighted their potential shortcomings compared to transformers on reasoning and memorization tasks. In this paper, we dive deeper into one of such benchmarks: associative recall (AR), which has been shown to correlate well with language modeling performance, and inspect in detail the effects of scaling and optimization issues in recently proposed token mixing strategies. We first demonstrate that, unlike standard transformers, the choice of learning rate plays a critical role in the performance of modern recurrent models: an issue that can severely affect reported performance in previous works and suggests further research is needed to stabilize training. Next, we show that recurrent and attention-based models exhibit contrasting benefits when scaling in width as opposed to depth, with attention being notably unable to solve AR when limited to a single layer. We then further inspect 1-layer transformers, revealing that despite their poor performance, their training dynamics surprisingly resemble the formation of induction heads, a phenomenon previously observed only in their 2-layer counterparts. Finally, through architectural ablations, we study how components affects Transformer and Mamba's performance and optimization stability.
☆ GRADSTOP: Early Stopping of Gradient Descent via Posterior Sampling
Machine learning models are often learned by minimising a loss function on the training data using a gradient descent algorithm. These models often suffer from overfitting, leading to a decline in predictive performance on unseen data. A standard solution is early stopping using a hold-out validation set, which halts the minimisation when the validation loss stops decreasing. However, this hold-out set reduces the data available for training. This paper presents {\sc gradstop}, a novel stochastic early stopping method that only uses information in the gradients, which are produced by the gradient descent algorithm ``for free.'' Our main contributions are that we estimate the Bayesian posterior by the gradient information, define the early stopping problem as drawing sample from this posterior, and use the approximated posterior to obtain a stopping criterion. Our empirical evaluation shows that {\sc gradstop} achieves a small loss on test data and compares favourably to a validation-set-based stopping criterion. By leveraging the entire dataset for training, our method is particularly advantageous in data-limited settings, such as transfer learning. It can be incorporated as an optional feature in gradient descent libraries with only a small computational overhead. The source code is available at https://github.com/edahelsinki/gradstop.
☆ Metric Matters: A Formal Evaluation of Similarity Measures in Active Learning for Cyber Threat Intelligence
Advanced Persistent Threats (APTs) pose a severe challenge to cyber defense due to their stealthy behavior and the extreme class imbalance inherent in detection datasets. To address these issues, we propose a novel active learning-based anomaly detection framework that leverages similarity search to iteratively refine the decision space. Built upon an Attention-Based Autoencoder, our approach uses feature-space similarity to identify normal-like and anomaly-like instances, thereby enhancing model robustness with minimal oracle supervision. Crucially, we perform a formal evaluation of various similarity measures to understand their influence on sample selection and anomaly ranking effectiveness. Through experiments on diverse datasets, including DARPA Transparent Computing APT traces, we demonstrate that the choice of similarity metric significantly impacts model convergence, anomaly detection accuracy, and label efficiency. Our results offer actionable insights for selecting similarity functions in active learning pipelines tailored for threat intelligence and cyber defense.
☆ Working My Way Back to You: Resource-Centric Next-Activity Prediction
Predictive Process Monitoring (PPM) aims to train models that forecast upcoming events in process executions. These predictions support early bottleneck detection, improved scheduling, proactive interventions, and timely communication with stakeholders. While existing research adopts a control-flow perspective, we investigate next-activity prediction from a resource-centric viewpoint, which offers additional benefits such as improved work organization, workload balancing, and capacity forecasting. Although resource information has been shown to enhance tasks such as process performance analysis, its role in next-activity prediction remains unexplored. In this study, we evaluate four prediction models and three encoding strategies across four real-life datasets. Compared to the baseline, our results show that LightGBM and Transformer models perform best with an encoding based on 2-gram activity transitions, while Random Forest benefits most from an encoding that combines 2-gram transitions and activity repetition features. This combined encoding also achieves the highest average accuracy. This resource-centric approach could enable smarter resource allocation, strategic workforce planning, and personalized employee support by analyzing individual behavior rather than case-level progression. The findings underscore the potential of resource-centric next-activity prediction, opening up new venues for research on PPM.
☆ Learning with springs and sticks
Learning is a physical process. Here, we aim to study a simple dynamical system composed of springs and sticks capable of arbitrarily approximating any continuous function. The main idea of our work is to use the sticks to mimic a piecewise-linear approximation of the given function, use the potential energy of springs to encode a desired mean squared error loss function, and converge to a minimum-energy configuration via dissipation. We apply the proposed simulation system to regression tasks and show that its performance is comparable to that of multi-layer perceptrons. In addition, we study the thermodynamic properties of the system and find a relation between the free energy change of the system and its ability to learn an underlying data distribution. We empirically find a \emph{thermodynamic learning barrier} for the system caused by the fluctuations of the environment, whereby the system cannot learn if its change in free energy hits such a barrier. We believe this simple model can help us better understand learning systems from a physical point of view.
comment: 13 pages, 6 figures
☆ STDiff: A State Transition Diffusion Framework for Time Series Imputation in Industrial Systems
Most deep learning methods for imputing missing values treat the task as completing patterns within a fixed time window. This assumption often fails in industrial systems, where dynamics are driven by control actions, are highly non-stationary, and can experience long, uninterrupted gaps. We propose STDiff, which reframes imputation as learning how the system evolves from one state to the next. STDiff uses a conditional denoising diffusion model with a causal bias aligned to control theory, generating missing values step-by-step based on the most recent known state and relevant control or environmental inputs. On a public wastewater treatment dataset with simulated missing blocks, STDiff consistently achieves the lowest errors, with its advantage increasing for longer gaps. On a raw industrial dataset with substantial real gaps, it produces trajectories that remain dynamically plausible, in contrast to window-based models that tend to flatten or over-smooth. These results support dynamics-aware, explicitly conditioned imputation as a robust approach for industrial time series, and we discuss computational trade-offs and extensions to broader domains.
☆ FedProtoKD: Dual Knowledge Distillation with Adaptive Class-wise Prototype Margin for Heterogeneous Federated Learning
Heterogeneous Federated Learning (HFL) has gained attention for its ability to accommodate diverse models and heterogeneous data across clients. Prototype-based HFL methods emerge as a promising solution to address statistical heterogeneity and privacy challenges, paving the way for new advancements in HFL research. This method focuses on sharing only class-representative prototypes among heterogeneous clients. However, these prototypes are often aggregated on the server using weighted averaging, leading to sub-optimal global knowledge; these cause the shrinking of aggregated prototypes, which negatively affects the model performance in scenarios when models are heterogeneous and data distributions are extremely non-IID. We propose FedProtoKD in a Heterogeneous Federated Learning setting, using an enhanced dual-knowledge distillation mechanism to improve the system performance with clients' logits and prototype feature representation. We aim to resolve the prototype margin-shrinking problem using a contrastive learning-based trainable server prototype by leveraging a class-wise adaptive prototype margin. Furthermore, we assess the importance of public samples using the closeness of the sample's prototype to its class representative prototypes, which enhances learning performance. FedProtoKD achieved average improvements of 1.13% up to 34.13% accuracy across various settings and significantly outperforms existing state-of-the-art HFL methods.
comment: 12 pages, 6 figures
☆ Is attention truly all we need? An empirical study of asset pricing in pretrained RNN sparse and global attention models
This study investigates the pretrained RNN attention models with the mainstream attention mechanisms such as additive attention, Luong's three attentions, global self-attention (Self-att) and sliding window sparse attention (Sparse-att) for the empirical asset pricing research on top 420 large-cap US stocks. This is the first paper on the large-scale state-of-the-art (SOTA) attention mechanisms applied in the asset pricing context. They overcome the limitations of the traditional machine learning (ML) based asset pricing, such as mis-capturing the temporal dependency and short memory. Moreover, the enforced causal masks in the attention mechanisms address the future data leaking issue ignored by the more advanced attention-based models, such as the classic Transformer. The proposed attention models also consider the temporal sparsity characteristic of asset pricing data and mitigate potential overfitting issues by deploying the simplified model structures. This provides some insights for future empirical economic research. All models are examined in three periods, which cover pre-COVID-19 (mild uptrend), COVID-19 (steep uptrend with a large drawdown) and one year post-COVID-19 (sideways movement with high fluctuations), for testing the stability of these models under extreme market conditions. The study finds that in value-weighted portfolio back testing, Model Self-att and Model Sparse-att exhibit great capabilities in deriving the absolute returns and hedging downside risks, while they achieve an annualized Sortino ratio of 2.0 and 1.80 respectively in the period with COVID-19. And Model Sparse-att performs more stably than Model Self-att from the perspective of absolute portfolio returns with respect to the size of stocks' market capitalization.
comment: 55 pages including appendix, 21 figures and 5 tables
☆ Automatic Prompt Optimization with Prompt Distillation
Autoprompting is the process of automatically selecting optimized prompts for language models, which is gaining popularity due to the rapid development of prompt engineering driven by extensive research in the field of large language models (LLMs). This paper presents DistillPrompt -- a novel autoprompting method based on large language models that employs a multi-stage integration of task-specific information into prompts using training data. DistillPrompt utilizes distillation, compression, and aggregation operations to explore the prompt space more thoroughly. The method was tested on different datasets for text classification and generation tasks using the t-lite-instruct-0.1 language model. The results demonstrate a significant average improvement (e.g., 20.12% across the entire dataset compared to Grips) in key metrics over existing methods in the field, establishing DistillPrompt as one of the most effective non-gradient approaches in autoprompting.
☆ Interpretable by AI Mother Tongue: Native Symbolic Reasoning in Neural Models
We present a framework where neural models develop an AI Mother Tongue, a native symbolic language that simultaneously supports intuitive reasoning, compositional symbol chains, and inherent interpretability. Unlike post-hoc explanation methods, our approach embeds reasoning directly into the model's representations: symbols capture meaningful semantic patterns, chains trace decision paths, and gated induction mechanisms guide selective focus, yielding transparent yet flexible reasoning. We introduce complementary training objectives to enhance symbol purity and decision sparsity, and employ a sequential specialization strategy to first build broad symbolic competence and then refine intuitive judgments. Experiments on AI tasks demonstrate competitive accuracy alongside verifiable reasoning traces, showing that AI Mother Tongue can serve as a unified mechanism for interpretability, intuition, and symbolic reasoning in neural models.
comment: 25 pages, 9 figures. The AI Intuition Explorer dashboard is available at: https://cyrilliu1974.github.io/github.io/vi.html
☆ PAX-TS: Model-agnostic multi-granular explanations for time series forecasting via localized perturbations
Time series forecasting has seen considerable improvement during the last years, with transformer models and large language models driving advancements of the state of the art. Modern forecasting models are generally opaque and do not provide explanations for their forecasts, while well-known post-hoc explainability methods like LIME are not suitable for the forecasting context. We propose PAX-TS, a model-agnostic post-hoc algorithm to explain time series forecasting models and their forecasts. Our method is based on localized input perturbations and results in multi-granular explanations. Further, it is able to characterize cross-channel correlations for multivariate time series forecasts. We clearly outline the algorithmic procedure behind PAX-TS, demonstrate it on a benchmark with 7 algorithms and 10 diverse datasets, compare it with two other state-of-the-art explanation algorithms, and present the different explanation types of the method. We found that the explanations of high-performing and low-performing algorithms differ on the same datasets, highlighting that the explanations of PAX-TS effectively capture a model's behavior. Based on time step correlation matrices resulting from the benchmark, we identify 6 classes of patterns that repeatedly occur across different datasets and algorithms. We found that the patterns are indicators of performance, with noticeable differences in forecasting error between the classes. Lastly, we outline a multivariate example where PAX-TS demonstrates how the forecasting model takes cross-channel correlations into account. With PAX-TS, time series forecasting models' mechanisms can be illustrated in different levels of detail, and its explanations can be used to answer practical questions on forecasts.
☆ USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
comment: Project page: https://bytedance.github.io/USO/ Code and model: https://github.com/bytedance/USO
☆ Enhancing compact convolutional transformers with super attention
In this paper, we propose a vision model that adopts token mixing, sequence-pooling, and convolutional tokenizers to achieve state-of-the-art performance and efficient inference in fixed context-length tasks. In the CIFAR100 benchmark, our model significantly improves the baseline of the top 1% and top 5% validation accuracy from 36.50% to 46.29% and 66.33% to 76.31%, while being more efficient than the Scaled Dot Product Attention (SDPA) transformers when the context length is less than the embedding dimension and only 60% the size. In addition, the architecture demonstrates high training stability and does not rely on techniques such as data augmentation like mixup, positional embeddings, or learning rate scheduling. We make our code available on Github.
comment: 9 pages, 4 figures
☆ On the Generalisation of Koopman Representations for Chaotic System Control
This paper investigates the generalisability of Koopman-based representations for chaotic dynamical systems, focusing on their transferability across prediction and control tasks. Using the Lorenz system as a testbed, we propose a three-stage methodology: learning Koopman embeddings through autoencoding, pre-training a transformer on next-state prediction, and fine-tuning for safety-critical control. Our results show that Koopman embeddings outperform both standard and physics-informed PCA baselines, achieving accurate and data-efficient performance. Notably, fixing the pre-trained transformer weights during fine-tuning leads to no performance degradation, indicating that the learned representations capture reusable dynamical structure rather than task-specific patterns. These findings support the use of Koopman embeddings as a foundation for multi-task learning in physics-informed machine learning. A project page is available at https://kikisprdx.github.io/.
comment: 18 pages, 4 figures
☆ Estimating Conditional Covariance between labels for Multilabel Data
Multilabel data should be analysed for label dependence before applying multilabel models. Independence between multilabel data labels cannot be measured directly from the label values due to their dependence on the set of covariates $\vec{x}$, but can be measured by examining the conditional label covariance using a multivariate Probit model. Unfortunately, the multivariate Probit model provides an estimate of its copula covariance, and so might not be reliable in estimating constant covariance and dependent covariance. In this article, we compare three models (Multivariate Probit, Multivariate Bernoulli and Staged Logit) for estimating the constant and dependent multilabel conditional label covariance. We provide an experiment that allows us to observe each model's measurement of conditional covariance. We found that all models measure constant and dependent covariance equally well, depending on the strength of the covariance, but the models all falsely detect that dependent covariance is present for data where constant covariance is present. Of the three models, the Multivariate Probit model had the lowest error rate.
☆ Energy-Based Flow Matching for Generating 3D Molecular Structure
Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules' constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design. Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution. In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to \textit{iteratively} map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as the empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein docking as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.
comment: Accepted to the International Conference on Machine Learning (2025)
☆ The GINN framework: a stochastic QED correspondence for stability and chaos in deep neural networks
The development of a Euclidean stochastic field-theoretic approach that maps deep neural networks (DNNs) to quantum electrodynamics (QED) with local U(1) symmetry is presented. Neural activations and weights are represented by fermionic matter and gauge fields, with a fictitious Langevin time enabling covariant gauge fixing. This mapping identifies the gauge parameter with kernel design choices in wide DNNs, relating stability thresholds to gauge-dependent amplification factors. Finite-width fluctuations correspond to loop corrections in QED. As a proof of concept, we validate the theoretical predictions through numerical simulations of standard multilayer perceptrons and, in parallel, propose a gauge-invariant neural network (GINN) implementation using magnitude--phase parameterization of weights. Finally, a double-copy replica approach is shown to unify the computation of the largest Lyapunov exponent in stochastic QED and wide DNNs.
comment: 18 pages, 3 figures, 1 table
☆ HierCVAE: Hierarchical Attention-Driven Conditional Variational Autoencoders for Multi-Scale Temporal Modeling
Temporal modeling in complex systems requires capturing dependencies across multiple time scales while managing inherent uncertainties. We propose HierCVAE, a novel architecture that integrates hierarchical attention mechanisms with conditional variational autoencoders to address these challenges. HierCVAE employs a three-tier attention structure (local, global, cross-temporal) combined with multi-modal condition encoding to capture temporal, statistical, and trend information. The approach incorporates ResFormer blocks in the latent space and provides explicit uncertainty quantification via prediction heads. Through evaluations on energy consumption datasets, HierCVAE demonstrates a 15-40% improvement in prediction accuracy and superior uncertainty calibration compared to state-of-the-art methods, excelling in long-term forecasting and complex multi-variate dependencies.
comment: 10 pages, 6 figures
☆ Forecasting Probability Distributions of Financial Returns with Deep Neural Networks
This study evaluates deep neural networks for forecasting probability distributions of financial returns. 1D convolutional neural networks (CNN) and Long Short-Term Memory (LSTM) architectures are used to forecast parameters of three probability distributions: Normal, Student's t, and skewed Student's t. Using custom negative log-likelihood loss functions, distribution parameters are optimized directly. The models are tested on six major equity indices (S\&P 500, BOVESPA, DAX, WIG, Nikkei 225, and KOSPI) using probabilistic evaluation metrics including Log Predictive Score (LPS), Continuous Ranked Probability Score (CRPS), and Probability Integral Transform (PIT). Results show that deep learning models provide accurate distributional forecasts and perform competitively with classical GARCH models for Value-at-Risk estimation. The LSTM with skewed Student's t distribution performs best across multiple evaluation criteria, capturing both heavy tails and asymmetry in financial returns. This work shows that deep neural networks are viable alternatives to traditional econometric models for financial risk assessment and portfolio management.
comment: 12 pages, 4 figures, 5 tables
☆ Generalization Bound for a General Class of Neural Ordinary Differential Equations
Neural ordinary differential equations (neural ODEs) are a popular type of deep learning model that operate with continuous-depth architectures. To assess how well such models perform on unseen data, it is crucial to understand their generalization error bounds. Previous research primarily focused on the linear case for the dynamics function in neural ODEs - Marion, P. (2023), or provided bounds for Neural Controlled ODEs that depend on the sampling interval Bleistein et al. (2023). In this work, we analyze a broader class of neural ODEs where the dynamics function is a general nonlinear function, either time dependent or time independent, and is Lipschitz continuous with respect to the state variables. We showed that under this Lipschitz condition, the solutions to neural ODEs have solutions with bounded variations. Based on this observation, we establish generalization bounds for both time-dependent and time-independent cases and investigate how overparameterization and domain constraints influence these bounds. To our knowledge, this is the first derivation of generalization bounds for neural ODEs with general nonlinear dynamics.
comment: 23 pages, 4 figures
☆ HOTSPOT-YOLO: A Lightweight Deep Learning Attention-Driven Model for Detecting Thermal Anomalies in Drone-Based Solar Photovoltaic Inspections
Thermal anomaly detection in solar photovoltaic (PV) systems is essential for ensuring operational efficiency and reducing maintenance costs. In this study, we developed and named HOTSPOT-YOLO, a lightweight artificial intelligence (AI) model that integrates an efficient convolutional neural network backbone and attention mechanisms to improve object detection. This model is specifically designed for drone-based thermal inspections of PV systems, addressing the unique challenges of detecting small and subtle thermal anomalies, such as hotspots and defective modules, while maintaining real-time performance. Experimental results demonstrate a mean average precision of 90.8%, reflecting a significant improvement over baseline object detection models. With a reduced computational load and robustness under diverse environmental conditions, HOTSPOT-YOLO offers a scalable and reliable solution for large-scale PV inspections. This work highlights the integration of advanced AI techniques with practical engineering applications, revolutionizing automated fault detection in renewable energy systems.
☆ Enhancing Model Privacy in Federated Learning with Random Masking and Quantization
Experimental results across various models and tasks demonstrate that our approach not only maintains strong model performance in federated learning settings but also achieves enhanced protection of model parameters compared to baseline methods.
☆ Distance-informed Neural Processes
We propose the Distance-informed Neural Process (DNP), a novel variant of Neural Processes that improves uncertainty estimation by combining global and distance-aware local latent structures. Standard Neural Processes (NPs) often rely on a global latent variable and struggle with uncertainty calibration and capturing local data dependencies. DNP addresses these limitations by introducing a global latent variable to model task-level variations and a local latent variable to capture input similarity within a distance-preserving latent space. This is achieved through bi-Lipschitz regularization, which bounds distortions in input relationships and encourages the preservation of relative distances in the latent space. This modeling approach allows DNP to produce better-calibrated uncertainty estimates and more effectively distinguish in- from out-of-distribution data. Empirical results demonstrate that DNP achieves strong predictive performance and improved uncertainty calibration across regression and classification tasks.
comment: 22 pages
☆ Sparse minimum Redundancy Maximum Relevance for feature selection
We propose a feature screening method that integrates both feature-feature and feature-target relationships. Inactive features are identified via a penalized minimum Redundancy Maximum Relevance (mRMR) procedure, which is the continuous version of the classic mRMR penalized by a non-convex regularizer, and where the parameters estimated as zero coefficients represent the set of inactive features. We establish the conditions under which zero coefficients are correctly identified to guarantee accurate recovery of inactive features. We introduce a multi-stage procedure based on the knockoff filter enabling the penalized mRMR to discard inactive features while controlling the false discovery rate (FDR). Our method performs comparably to HSIC-LASSO but is more conservative in the number of selected features. It only requires setting an FDR threshold, rather than specifying the number of features to retain. The effectiveness of the method is illustrated through simulations and real-world datasets. The code to reproduce this work is available on the following GitHub: https://github.com/PeterJackNaylor/SmRMR.
☆ Interpretable Decision-Making for End-to-End Autonomous Driving ICCV 2025
Trustworthy AI is mandatory for the broad deployment of autonomous vehicles. Although end-to-end approaches derive control commands directly from raw data, interpreting these decisions remains challenging, especially in complex urban scenarios. This is mainly attributed to very deep neural networks with non-linear decision boundaries, making it challenging to grasp the logic behind AI-driven decisions. This paper presents a method to enhance interpretability while optimizing control commands in autonomous driving. To address this, we propose loss functions that promote the interpretability of our model by generating sparse and localized feature maps. The feature activations allow us to explain which image regions contribute to the predicted control command. We conduct comprehensive ablation studies on the feature extraction step and validate our method on the CARLA benchmarks. We also demonstrate that our approach improves interpretability, which correlates with reducing infractions, yielding a safer, high-performance driving model. Notably, our monocular, non-ensemble model surpasses the top-performing approaches from the CARLA Leaderboard by achieving lower infraction scores and the highest route completion rate, all while ensuring interpretability.
comment: Accepted to the ICCV 2025 2nd Workshop on the Challenge Of Out-of-Label Hazards in Autonomous Driving (2COOOL)
☆ pyFAST: A Modular PyTorch Framework for Time Series Modeling with Multi-source and Sparse Data
Modern time series analysis demands frameworks that are flexible, efficient, and extensible. However, many existing Python libraries exhibit limitations in modularity and in their native support for irregular, multi-source, or sparse data. We introduce pyFAST, a research-oriented PyTorch framework that explicitly decouples data processing from model computation, fostering a cleaner separation of concerns and facilitating rapid experimentation. Its data engine is engineered for complex scenarios, supporting multi-source loading, protein sequence handling, efficient sequence- and patch-level padding, dynamic normalization, and mask-based modeling for both imputation and forecasting. pyFAST integrates LLM-inspired architectures for the alignment-free fusion of sparse data sources and offers native sparse metrics, specialized loss functions, and flexible exogenous data fusion. Training utilities include batch-based streaming aggregation for evaluation and device synergy to maximize computational efficiency. A comprehensive suite of classical and deep learning models (Linears, CNNs, RNNs, Transformers, and GNNs) is provided within a modular architecture that encourages extension. Released under the MIT license at GitHub, pyFAST provides a compact yet powerful platform for advancing time series research and applications.
☆ HAEPO: History-Aggregated Exploratory Policy Optimization
Exploration is essential in modern learning, from reinforcement learning environments with small neural policies to large language models (LLMs). Existing work, such as DPO, leverages full sequence log-likelihoods to capture an entire trajectory of the model's decisions, while methods like GRPO aggregate per-token ratios into a trajectory-level update. However, both often limit exploration on long-horizon tasks. We introduce History-Aggregated Exploratory Policy Optimization (HAEPO), a history-aware exploratory loss to combat these shortcomings. HAEPO compresses each trajectory into the sum of its logarithmic probabilities (a cumulative logarithmic likelihood), and applies a Plackett-Luce softmax across trajectories to obtain normalized weights proportional to their returns, thus encouraging broader exploration. We add entropy regularization to stabilize the aggressive updates to prevent premature collapse and a soft KL penalty relative to a frozen copy of the previous (reference) policy. Empirically, HAEPO converges fast, explores thoroughly, aligns closely with true rewards, and demonstrates robust learning behavior better or at par with PPO, GRPO, and DPO across diverse tasks. Thus, HAEPO provides a stable and interpretable framework by explicitly leveraging full-trajectory history while balancing exploration and stability.
comment: Under review
☆ Optimization of Latent-Space Compression using Game-Theoretic Techniques for Transformer-Based Vector Search
Vector similarity search plays a pivotal role in modern information retrieval systems, especially when powered by transformer-based embeddings. However, the scalability and efficiency of such systems are often hindered by the high dimensionality of latent representations. In this paper, we propose a novel game-theoretic framework for optimizing latent-space compression to enhance both the efficiency and semantic utility of vector search. By modeling the compression strategy as a zero-sum game between retrieval accuracy and storage efficiency, we derive a latent transformation that preserves semantic similarity while reducing redundancy. We benchmark our method against FAISS, a widely-used vector search library, and demonstrate that our approach achieves a significantly higher average similarity (0.9981 vs. 0.5517) and utility (0.8873 vs. 0.5194), albeit with a modest increase in query time. This trade-off highlights the practical value of game-theoretic latent compression in high-utility, transformer-based search applications. The proposed system can be seamlessly integrated into existing LLM pipelines to yield more semantically accurate and computationally efficient retrieval.
☆ MOCHA: Discovering Multi-Order Dynamic Causality in Temporal Point Processes
Discovering complex causal dependencies in temporal point processes (TPPs) is critical for modeling real-world event sequences. Existing methods typically rely on static or first-order causal structures, overlooking the multi-order and time-varying nature of causal relationships. In this paper, we propose MOCHA, a novel framework for discovering multi-order dynamic causality in TPPs. MOCHA characterizes multi-order influences as multi-hop causal paths over a latent time-evolving graph. To model such dynamics, we introduce a time-varying directed acyclic graph (DAG) with learnable structural weights, where acyclicity and sparsity constraints are enforced to ensure structural validity. We design an end-to-end differentiable framework that jointly models causal discovery and TPP dynamics, enabling accurate event prediction and revealing interpretable structures. Extensive experiments on real-world datasets demonstrate that MOCHA not only achieves state-of-the-art performance in event prediction, but also reveals meaningful and interpretable causal structures.
☆ ReflectivePrompt: Reflective evolution in autoprompting algorithms
Autoprompting is the process of automatically selecting optimized prompts for language models, which has been gaining popularity with the rapid advancement of prompt engineering, driven by extensive research in the field of large language models (LLMs). This paper presents ReflectivePrompt - a novel autoprompting method based on evolutionary algorithms that employs a reflective evolution approach for more precise and comprehensive search of optimal prompts. ReflectivePrompt utilizes short-term and long-term reflection operations before crossover and elitist mutation to enhance the quality of the modifications they introduce. This method allows for the accumulation of knowledge obtained throughout the evolution process and updates it at each epoch based on the current population. ReflectivePrompt was tested on 33 datasets for classification and text generation tasks using open-access large language models: t-lite-instruct-0.1 and gemma3-27b-it. The method demonstrates, on average, a significant improvement (e.g., 28% on BBH compared to EvoPrompt) in metrics relative to current state-of-the-art approaches, thereby establishing itself as one of the most effective solutions in evolutionary algorithm-based autoprompting.
☆ C-Flat++: Towards a More Efficient and Powerful Framework for Continual Learning
Balancing sensitivity to new tasks and stability for retaining past knowledge is crucial in continual learning (CL). Recently, sharpness-aware minimization has proven effective in transfer learning and has also been adopted in continual learning (CL) to improve memory retention and learning efficiency. However, relying on zeroth-order sharpness alone may favor sharper minima over flatter ones in certain settings, leading to less robust and potentially suboptimal solutions. In this paper, we propose \textbf{C}ontinual \textbf{Flat}ness (\textbf{C-Flat}), a method that promotes flatter loss landscapes tailored for CL. C-Flat offers plug-and-play compatibility, enabling easy integration with minimal modifications to the code pipeline. Besides, we present a general framework that integrates C-Flat into all major CL paradigms and conduct comprehensive comparisons with loss-minima optimizers and flat-minima-based CL methods. Our results show that C-Flat consistently improves performance across a wide range of settings. In addition, we introduce C-Flat++, an efficient yet effective framework that leverages selective flatness-driven promotion, significantly reducing the update cost required by C-Flat. Extensive experiments across multiple CL methods, datasets, and scenarios demonstrate the effectiveness and efficiency of our proposed approaches. Code is available at https://github.com/WanNaa/C-Flat.
☆ Recycling History: Efficient Recommendations from Contextual Dueling Bandits
The contextual duelling bandit problem models adaptive recommender systems, where the algorithm presents a set of items to the user, and the user's choice reveals their preference. This setup is well suited for implicit choices users make when navigating a content platform, but does not capture other possible comparison queries. Motivated by the fact that users provide more reliable feedback after consuming items, we propose a new bandit model that can be described as follows. The algorithm recommends one item per time step; after consuming that item, the user is asked to compare it with another item chosen from the user's consumption history. Importantly, in our model, this comparison item can be chosen without incurring any additional regret, potentially leading to better performance. However, the regret analysis is challenging because of the temporal dependency in the user's history. To overcome this challenge, we first show that the algorithm can construct informative queries provided the history is rich, i.e., satisfies a certain diversity condition. We then show that a short initial random exploration phase is sufficient for the algorithm to accumulate a rich history with high probability. This result, proven via matrix concentration bounds, yields $O(\sqrt{T})$ regret guarantees. Additionally, our simulations show that reusing past items for comparisons can lead to significantly lower regret than only comparing between simultaneously recommended items.
comment: 16 pages, 3 figures
☆ DRMD: Deep Reinforcement Learning for Malware Detection under Concept Drift
Malware detection in real-world settings must deal with evolving threats, limited labeling budgets, and uncertain predictions. Traditional classifiers, without additional mechanisms, struggle to maintain performance under concept drift in malware domains, as their supervised learning formulation cannot optimize when to defer decisions to manual labeling and adaptation. Modern malware detection pipelines combine classifiers with monthly active learning (AL) and rejection mechanisms to mitigate the impact of concept drift. In this work, we develop a novel formulation of malware detection as a one-step Markov Decision Process and train a deep reinforcement learning (DRL) agent, simultaneously optimizing sample classification performance and rejecting high-risk samples for manual labeling. We evaluated the joint detection and drift mitigation policy learned by the DRL-based Malware Detection (DRMD) agent through time-aware evaluations on Android malware datasets subject to realistic drift requiring multi-year performance stability. The policies learned under these conditions achieve a higher Area Under Time (AUT) performance compared to standard classification approaches used in the domain, showing improved resilience to concept drift. Specifically, the DRMD agent achieved a $5.18\pm5.44$, $14.49\pm12.86$, and $10.06\pm10.81$ average AUT performance improvement for the classification only, classification with rejection, and classification with rejection and AL settings, respectively. Our results demonstrate for the first time that DRL can facilitate effective malware detection and improved resiliency to concept drift in the dynamic environment of the Android malware domain.
comment: 10 pages
☆ SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation
Recent studies have shown that Machine Learning (ML) models can exhibit bias in real-world scenarios, posing significant challenges in ethically sensitive domains such as healthcare. Such bias can negatively affect model fairness, model generalization abilities and further risks amplifying social discrimination. There is a need to remove biases from trained models. Existing debiasing approaches often necessitate access to original training data and need extensive model retraining; they also typically exhibit trade-offs between model fairness and discriminative performance. To address these challenges, we propose Soft-Mask Weight Fine-Tuning (SWiFT), a debiasing framework that efficiently improves fairness while preserving discriminative performance with much less debiasing costs. Notably, SWiFT requires only a small external dataset and only a few epochs of model fine-tuning. The idea behind SWiFT is to first find the relative, and yet distinct, contributions of model parameters to both bias and predictive performance. Then, a two-step fine-tuning process updates each parameter with different gradient flows defined by its contribution. Extensive experiments with three bias sensitive attributes (gender, skin tone, and age) across four dermatological and two chest X-ray datasets demonstrate that SWiFT can consistently reduce model bias while achieving competitive or even superior diagnostic accuracy under common fairness and accuracy metrics, compared to the state-of-the-art. Specifically, we demonstrate improved model generalization ability as evidenced by superior performance on several out-of-distribution (OOD) datasets.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:015
☆ Learning Real-World Acrobatic Flight from Human Preferences
Preference-based reinforcement learning (PbRL) enables agents to learn control policies without requiring manually designed reward functions, making it well-suited for tasks where objectives are difficult to formalize or inherently subjective. Acrobatic flight poses a particularly challenging problem due to its complex dynamics, rapid movements, and the importance of precise execution. In this work, we explore the use of PbRL for agile drone control, focusing on the execution of dynamic maneuvers such as powerloops. Building on Preference-based Proximal Policy Optimization (Preference PPO), we propose Reward Ensemble under Confidence (REC), an extension to the reward learning objective that improves preference modeling and learning stability. Our method achieves 88.4% of the shaped reward performance, compared to 55.2% with standard Preference PPO. We train policies in simulation and successfully transfer them to real-world drones, demonstrating multiple acrobatic maneuvers where human preferences emphasize stylistic qualities of motion. Furthermore, we demonstrate the applicability of our probabilistic reward model in a representative MuJoCo environment for continuous control. Finally, we highlight the limitations of manually designed rewards, observing only 60.7% agreement with human preferences. These results underscore the effectiveness of PbRL in capturing complex, human-centered objectives across both physical and simulated domains.
comment: 8 pages, 7 figures
☆ Temperature-Aware Recurrent Neural Operator for Temperature-Dependent Anisotropic Plasticity in HCP Materials
Neural network surrogate models for constitutive laws in computational mechanics have been in use for some time. In plasticity, these models often rely on gated recurrent units (GRUs) or long short-term memory (LSTM) cells, which excel at capturing path-dependent phenomena. However, they suffer from long training times and time-resolution-dependent predictions that extrapolate poorly. Moreover, most existing surrogates for macro- or mesoscopic plasticity handle only relatively simple material behavior. To overcome these limitations, we introduce the Temperature-Aware Recurrent Neural Operator (TRNO), a time-resolution-independent neural architecture. We apply the TRNO to model the temperature-dependent plastic response of polycrystalline magnesium, which shows strong plastic anisotropy and thermal sensitivity. The TRNO achieves high predictive accuracy and generalizes effectively across diverse loading cases, temperatures, and time resolutions. It also outperforms conventional GRU and LSTM models in training efficiency and predictive performance. Finally, we demonstrate multiscale simulations with the TRNO, yielding a speedup of at least three orders of magnitude over traditional constitutive models.
☆ PseudoMapTrainer: Learning Online Mapping without HD Maps ICCV 2025
Online mapping models show remarkable results in predicting vectorized maps from multi-view camera images only. However, all existing approaches still rely on ground-truth high-definition maps during training, which are expensive to obtain and often not geographically diverse enough for reliable generalization. In this work, we propose PseudoMapTrainer, a novel approach to online mapping that uses pseudo-labels generated from unlabeled sensor data. We derive those pseudo-labels by reconstructing the road surface from multi-camera imagery using Gaussian splatting and semantics of a pre-trained 2D segmentation network. In addition, we introduce a mask-aware assignment algorithm and loss function to handle partially masked pseudo-labels, allowing for the first time the training of online mapping models without any ground-truth maps. Furthermore, our pseudo-labels can be effectively used to pre-train an online model in a semi-supervised manner to leverage large-scale unlabeled crowdsourced data. The code is available at github.com/boschresearch/PseudoMapTrainer.
comment: Accepted at ICCV 2025
☆ Federated Learning with Heterogeneous and Private Label Sets
Although common in real-world applications, heterogeneous client label sets are rarely investigated in federated learning (FL). Furthermore, in the cases they are, clients are assumed to be willing to share their entire label sets with other clients. Federated learning with private label sets, shared only with the central server, adds further constraints on learning algorithms and is, in general, a more difficult problem to solve. In this work, we study the effects of label set heterogeneity on model performance, comparing the public and private label settings -- when the union of label sets in the federation is known to clients and when it is not. We apply classical methods for the classifier combination problem to FL using centralized tuning, adapt common FL methods to the private label set setting, and discuss the justification of both approaches under practical assumptions. Our experiments show that reducing the number of labels available to each client harms the performance of all methods substantially. Centralized tuning of client models for representational alignment can help remedy this, but often at the cost of higher variance. Throughout, our proposed adaptations of standard FL methods perform well, showing similar performance in the private label setting as the standard methods achieve in the public setting. This shows that clients can enjoy increased privacy at little cost to model accuracy.
☆ Efficient Best-of-Both-Worlds Algorithms for Contextual Combinatorial Semi-Bandits
We introduce the first best-of-both-worlds algorithm for contextual combinatorial semi-bandits that simultaneously guarantees $\widetilde{\mathcal{O}}(\sqrt{T})$ regret in the adversarial regime and $\widetilde{\mathcal{O}}(\ln T)$ regret in the corrupted stochastic regime. Our approach builds on the Follow-the-Regularized-Leader (FTRL) framework equipped with a Shannon entropy regularizer, yielding a flexible method that admits efficient implementations. Beyond regret bounds, we tackle the practical bottleneck in FTRL (or, equivalently, Online Stochastic Mirror Descent) arising from the high-dimensional projection step encountered in each round of interaction. By leveraging the Karush-Kuhn-Tucker conditions, we transform the $K$-dimensional convex projection problem into a single-variable root-finding problem, dramatically accelerating each round. Empirical evaluations demonstrate that this combined strategy not only attains the attractive regret bounds of best-of-both-worlds algorithms but also delivers substantial per-round speed-ups, making it well-suited for large-scale, real-time applications.
☆ Predicting Drug-Drug Interactions Using Heterogeneous Graph Neural Networks: HGNN-DDI
Drug-drug interactions (DDIs) are a major concern in clinical practice, as they can lead to reduced therapeutic efficacy or severe adverse effects. Traditional computational approaches often struggle to capture the complex relationships among drugs, targets, and biological entities. In this work, we propose HGNN-DDI, a heterogeneous graph neural network model designed to predict potential DDIs by integrating multiple drug-related data sources. HGNN-DDI leverages graph representation learning to model heterogeneous biomedical networks, enabling effective information propagation across diverse node and edge types. Experimental results on benchmark DDI datasets demonstrate that HGNN-DDI outperforms state-of-the-art baselines in prediction accuracy and robustness, highlighting its potential to support safer drug development and precision medicine.
comment: 12 pages, 5 figures. Published in Applied and Computational Engineering, Vol. 79, pp. 77-89, July 25, 2024. Licensed under CC BY 4.0
☆ Governance-as-a-Service: A Multi-Agent Framework for AI System Compliance and Policy Enforcement
As AI systems evolve into distributed ecosystems with autonomous execution, asynchronous reasoning, and multi-agent coordination, the absence of scalable, decoupled governance poses a structural risk. Existing oversight mechanisms are reactive, brittle, and embedded within agent architectures, making them non-auditable and hard to generalize across heterogeneous deployments. We introduce Governance-as-a-Service (GaaS): a modular, policy-driven enforcement layer that regulates agent outputs at runtime without altering model internals or requiring agent cooperation. GaaS employs declarative rules and a Trust Factor mechanism that scores agents based on compliance and severity-weighted violations. It enables coercive, normative, and adaptive interventions, supporting graduated enforcement and dynamic trust modulation. To evaluate GaaS, we conduct three simulation regimes with open-source models (LLaMA3, Qwen3, DeepSeek-R1) across content generation and financial decision-making. In the baseline, agents act without governance; in the second, GaaS enforces policies; in the third, adversarial agents probe robustness. All actions are intercepted, evaluated, and logged for analysis. Results show that GaaS reliably blocks or redirects high-risk behaviors while preserving throughput. Trust scores track rule adherence, isolating and penalizing untrustworthy components in multi-agent systems. By positioning governance as a runtime service akin to compute or storage, GaaS establishes infrastructure-level alignment for interoperable agent ecosystems. It does not teach agents ethics; it enforces them.
☆ UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory access, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under same computation and parameters but significantly low memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models up to 2.5B activated parameters from 120B total parameters, and establish that activation density has greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
☆ Constraint Matters: Multi-Modal Representation for Reducing Mixed-Integer Linear programming
Model reduction, which aims to learn a simpler model of the original mixed integer linear programming (MILP), can solve large-scale MILP problems much faster. Most existing model reduction methods are based on variable reduction, which predicts a solution value for a subset of variables. From a dual perspective, constraint reduction that transforms a subset of inequality constraints into equalities can also reduce the complexity of MILP, but has been largely ignored. Therefore, this paper proposes a novel constraint-based model reduction approach for the MILP. Constraint-based MILP reduction has two challenges: 1) which inequality constraints are critical such that reducing them can accelerate MILP solving while preserving feasibility, and 2) how to predict these critical constraints efficiently. To identify critical constraints, we first label these tight-constraints at the optimal solution as potential critical constraints and design a heuristic rule to select a subset of critical tight-constraints. To learn the critical tight-constraints, we propose a multi-modal representation technique that leverages information from both instance-level and abstract-level MILP formulations. The experimental results show that, compared to the state-of-the-art methods, our method improves the quality of the solution by over 50\% and reduces the computation time by 17.47\%.
☆ Stability and Generalization for Bellman Residuals
Offline reinforcement learning and offline inverse reinforcement learning aim to recover near-optimal value functions or reward models from a fixed batch of logged trajectories, yet current practice still struggles to enforce Bellman consistency. Bellman residual minimization (BRM) has emerged as an attractive remedy, as a globally convergent stochastic gradient descent-ascent based method for BRM has been recently discovered. However, its statistical behavior in the offline setting remains largely unexplored. In this paper, we close this statistical gap. Our analysis introduces a single Lyapunov potential that couples SGDA runs on neighbouring datasets and yields an O(1/n) on-average argument-stability bound-doubling the best known sample-complexity exponent for convex-concave saddle problems. The same stability constant translates into the O(1/n) excess risk bound for BRM, without variance reduction, extra regularization, or restrictive independence assumptions on minibatch sampling. The results hold for standard neural-network parameterizations and minibatch SGD.
☆ Beyond Quality: Unlocking Diversity in Ad Headline Generation with Large Language Models
The generation of ad headlines plays a vital role in modern advertising, where both quality and diversity are essential to engage a broad range of audience segments. Current approaches primarily optimize language models for headline quality or click-through rates (CTR), often overlooking the need for diversity and resulting in homogeneous outputs. To address this limitation, we propose DIVER, a novel framework based on large language models (LLMs) that are jointly optimized for both diversity and quality. We first design a semantic- and stylistic-aware data generation pipeline that automatically produces high-quality training pairs with ad content and multiple diverse headlines. To achieve the goal of generating high-quality and diversified ad headlines within a single forward pass, we propose a multi-stage multi-objective optimization framework with supervised fine-tuning (SFT) and reinforcement learning (RL). Experiments on real-world industrial datasets demonstrate that DIVER effectively balances quality and diversity. Deployed on a large-scale content-sharing platform serving hundreds of millions of users, our framework improves advertiser value (ADVV) and CTR by 4.0% and 1.4%.
☆ FLAegis: A Two-Layer Defense Framework for Federated Learning Against Poisoning Attacks
Federated Learning (FL) has become a powerful technique for training Machine Learning (ML) models in a decentralized manner, preserving the privacy of the training datasets involved. However, the decentralized nature of FL limits the visibility of the training process, relying heavily on the honesty of participating clients. This assumption opens the door to malicious third parties, known as Byzantine clients, which can poison the training process by submitting false model updates. Such malicious clients may engage in poisoning attacks, manipulating either the dataset or the model parameters to induce misclassification. In response, this study introduces FLAegis, a two-stage defensive framework designed to identify Byzantine clients and improve the robustness of FL systems. Our approach leverages symbolic time series transformation (SAX) to amplify the differences between benign and malicious models, and spectral clustering, which enables accurate detection of adversarial behavior. Furthermore, we incorporate a robust FFT-based aggregation function as a final layer to mitigate the impact of those Byzantine clients that manage to evade prior defenses. We rigorously evaluate our method against five poisoning attacks, ranging from simple label flipping to adaptive optimization-based strategies. Notably, our approach outperforms state-of-the-art defenses in both detection precision and final model accuracy, maintaining consistently high performance even under strong adversarial conditions.
comment: 15 pages, 5 tables, and 5 figures
☆ Rethinking Caching for LLM Serving Systems: Beyond Traditional Heuristics
Serving Large Language Models (LLMs) at scale requires meeting strict Service Level Objectives (SLOs) under severe computational and memory constraints. Nevertheless, traditional caching strategies fall short: exact-matching and prefix caches neglect query semantics, while state-of-the-art semantic caches remain confined to traditional intuitions, offering little conceptual departure. Building on this, we present SISO, a semantic caching system that redefines efficiency for LLM serving. SISO introduces centroid-based caching to maximize coverage with minimal memory, locality-aware replacement to preserve high-value entries, and dynamic thresholding to balance accuracy and latency under varying workloads. Across diverse datasets, SISO delivers up to 1.71$\times$ higher hit ratios and consistently stronger SLO attainment compared to state-of-the-art systems.
☆ Beyond Tokens: Enhancing RTL Quality Estimation via Structural Graph Learning
Estimating the quality of register transfer level (RTL) designs is crucial in the electronic design automation (EDA) workflow, as it enables instant feedback on key metrics like area and delay without the need for time-consuming logic synthesis. While recent approaches have leveraged large language models (LLMs) to derive embeddings from RTL code and achieved promising results, they overlook the structural semantics essential for accurate quality estimation. In contrast, the control data flow graph (CDFG) view exposes the design's structural characteristics more explicitly, offering richer cues for representation learning. In this work, we introduce a novel structure-aware graph self-supervised learning framework, StructRTL, for improved RTL design quality estimation. By learning structure-informed representations from CDFGs, our method significantly outperforms prior art on various quality estimation tasks. To further boost performance, we incorporate a knowledge distillation strategy that transfers low-level insights from post-mapping netlists into the CDFG predictor. Experiments show that our approach establishes new state-of-the-art results, demonstrating the effectiveness of combining structural learning with cross-stage supervision.
☆ Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection
Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO dataset to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.
comment: 10 pages
☆ Natural Image Classification via Quasi-Cyclic Graph Ensembles and Random-Bond Ising Models at the Nishimori Temperature
We present a unified framework combining statistical physics, coding theory, and algebraic topology for efficient multi-class image classification. High-dimensional feature vectors from a frozen MobileNetV2 backbone are interpreted as spins on a sparse Multi-Edge Type quasi-cyclic LDPC (MET-QC-LDPC) graph, forming a Random-Bond Ising Model (RBIM). We operate this RBIM at its Nishimori temperature, $\beta_N$, where the smallest eigenvalue of the Bethe-Hessian matrix vanishes, maximizing class separability. Our theoretical contribution establishes a correspondence between local trapping sets in the code's graph and topological invariants (Betti numbers, bordism classes) of the feature manifold. A practical algorithm estimates $\beta_N$ efficiently with a quadratic interpolant and Newton correction, achieving a six-fold speed-up over bisection. Guided by topology, we design spherical and toroidal MET-QC-LDPC graph ensembles, using permanent bounds to suppress harmful trapping sets. This compresses 1280-dimensional features to 32 or 64 dimensions for ImageNet-10 and -100 subsets. Despite massive compression (40x fewer parameters), we achieve 98.7% accuracy on ImageNet-10 and 82.7% on ImageNet-100, demonstrating that topology-guided graph design yields highly efficient, physics-inspired embeddings with state-of-the-art performance.
comment: 27 pages, 8 figures, 2 tables, was presented at the 9th International Conference 'Deep Learning on Computational Physics (DLCP2025)', and is currently under review for the Moscow University Physics Bulletin, Physics series
☆ Skill-Aligned Fairness in Multi-Agent Learning for Collaboration in Healthcare
Fairness in multi-agent reinforcement learning (MARL) is often framed as a workload balance problem, overlooking agent expertise and the structured coordination required in real-world domains. In healthcare, equitable task allocation requires workload balance or expertise alignment to prevent burnout and overuse of highly skilled agents. Workload balance refers to distributing an approximately equal number of subtasks or equalised effort across healthcare workers, regardless of their expertise. We make two contributions to address this problem. First, we propose FairSkillMARL, a framework that defines fairness as the dual objective of workload balance and skill-task alignment. Second, we introduce MARLHospital, a customizable healthcare-inspired environment for modeling team compositions and energy-constrained scheduling impacts on fairness, as no existing simulators are well-suited for this problem. We conducted experiments to compare FairSkillMARL in conjunction with four standard MARL methods, and against two state-of-the-art fairness metrics. Our results suggest that fairness based solely on equal workload might lead to task-skill mismatches and highlight the need for more robust metrics that capture skill-task misalignment. Our work provides tools and a foundation for studying fairness in heterogeneous multi-agent systems where aligning effort with expertise is critical.
☆ Data-Driven Discovery and Formulation Refines the Quasi-Steady Model of Flapping-Wing Aerodynamics
Insects control unsteady aerodynamic forces on flapping wings to navigate complex environments. While understanding these forces is vital for biology, physics, and engineering, existing evaluation methods face trade-offs: high-fidelity simulations are computationally or experimentally expensive and lack explanatory power, whereas theoretical models based on quasi-steady assumptions offer insights but exhibit low accuracy. To overcome these limitations and thus enhance the accuracy of quasi-steady aerodynamic models, we applied a data-driven approach involving discovery and formulation of previously overlooked critical mechanisms. Through selection from 5,000 candidate kinematic functions, we identified mathematical expressions for three key additional mechanisms -- the effect of advance ratio, effect of spanwise kinematic velocity, and rotational Wagner effect -- which had been qualitatively recognized but were not formulated. Incorporating these mechanisms considerably reduced the prediction errors of the quasi-steady model using the computational fluid dynamics results as the ground truth, both in hawkmoth forward flight (at high Reynolds numbers) and fruit fly maneuvers (at low Reynolds numbers). The data-driven quasi-steady model enables rapid aerodynamic analysis, serving as a practical tool for understanding evolutionary adaptations in insect flight and developing bio-inspired flying robots.
comment: 27 pages, 13 figures
☆ Taming the One-Epoch Phenomenon in Online Recommendation System by Two-stage Contrastive ID Pre-training RecSys'24
ID-based embeddings are widely used in web-scale online recommendation systems. However, their susceptibility to overfitting, particularly due to the long-tail nature of data distributions, often limits training to a single epoch, a phenomenon known as the "one-epoch problem." This challenge has driven research efforts to optimize performance within the first epoch by enhancing convergence speed or feature sparsity. In this study, we introduce a novel two-stage training strategy that incorporates a pre-training phase using a minimal model with contrastive loss, enabling broader data coverage for the embedding system. Our offline experiments demonstrate that multi-epoch training during the pre-training phase does not lead to overfitting, and the resulting embeddings improve online generalization when fine-tuned for more complex downstream recommendation tasks. We deployed the proposed system in live traffic at Pinterest, achieving significant site-wide engagement gains.
comment: Published at RecSys'24, see https://dl.acm.org/doi/10.1145/3640457.3688053
☆ End to End Autoencoder MLP Framework for Sepsis Prediction
Sepsis is a life threatening condition that requires timely detection in intensive care settings. Traditional machine learning approaches, including Naive Bayes, Support Vector Machine (SVM), Random Forest, and XGBoost, often rely on manual feature engineering and struggle with irregular, incomplete time-series data commonly present in electronic health records. We introduce an end-to-end deep learning framework integrating an unsupervised autoencoder for automatic feature extraction with a multilayer perceptron classifier for binary sepsis risk prediction. To enhance clinical applicability, we implement a customized down sampling strategy that extracts high information density segments during training and a non-overlapping dynamic sliding window mechanism for real-time inference. Preprocessed time series data are represented as fixed dimension vectors with explicit missingness indicators, mitigating bias and noise. We validate our approach on three ICU cohorts. Our end-to-end model achieves accuracies of 74.6 percent, 80.6 percent, and 93.5 percent, respectively, consistently outperforming traditional machine learning baselines. These results demonstrate the framework's superior robustness, generalizability, and clinical utility for early sepsis detection across heterogeneous ICU environments.
☆ FALCON: Autonomous Cyber Threat Intelligence Mining with LLMs for IDS Rule Generation
Signature-based Intrusion Detection Systems (IDS) detect malicious activities by matching network or host activity against predefined rules. These rules are derived from extensive Cyber Threat Intelligence (CTI), which includes attack signatures and behavioral patterns obtained through automated tools and manual threat analysis, such as sandboxing. The CTI is then transformed into actionable rules for the IDS engine, enabling real-time detection and prevention. However, the constant evolution of cyber threats necessitates frequent rule updates, which delay deployment time and weaken overall security readiness. Recent advancements in agentic systems powered by Large Language Models (LLMs) offer the potential for autonomous IDS rule generation with internal evaluation. We introduce FALCON, an autonomous agentic framework that generates deployable IDS rules from CTI data in real-time and evaluates them using built-in multi-phased validators. To demonstrate versatility, we target both network (Snort) and host-based (YARA) mediums and construct a comprehensive dataset of IDS rules with their corresponding CTIs. Our evaluations indicate FALCON excels in automatic rule generation, with an average of 95% accuracy validated by qualitative evaluation with 84% inter-rater agreement among multiple cybersecurity analysts across all metrics. These results underscore the feasibility and effectiveness of LLM-driven data mining for real-time cyber threat mitigation.
comment: 11 pages, 5 figures, 4 tables
☆ Utilizing Training Data to Improve LLM Reasoning for Tabular Understanding
Automated tabular understanding and reasoning are essential tasks for data scientists. Recently, Large language models (LLMs) have become increasingly prevalent in tabular reasoning tasks. Previous work focuses on (1) finetuning LLMs using labeled data or (2) Training-free prompting LLM agents using chain-of-thought (CoT). Finetuning offers dataset-specific learning at the cost of generalizability. Training-free prompting is highly generalizable but does not take full advantage of training data. In this paper, we propose a novel prompting-based reasoning approach, Learn then Retrieve: LRTab, which integrates the benefits of both by retrieving relevant information learned from training data. We first use prompting to obtain CoT responses over the training data. For incorrect CoTs, we prompt the LLM to predict Prompt Conditions to avoid the error, learning insights from the data. We validate the effectiveness of Prompt Conditions using validation data. Finally, at inference time, we retrieve the most relevant Prompt Conditions for additional context for table understanding. We provide comprehensive experiments on WikiTQ and Tabfact, showing that LRTab is interpretable, cost-efficient, and can outperform previous baselines in tabular reasoning.
☆ Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks ICML
Empirical scaling laws have driven the evolution of large language models (LLMs), yet their coefficients shift whenever the model architecture or data pipeline changes. Mixture-of-Experts (MoE) models, now standard in state-of-the-art systems, introduce a new sparsity dimension that current dense-model frontiers overlook. We investigate how MoE sparsity influences two distinct capability regimes: memorization and reasoning. We train families of MoE Transformers that systematically vary total parameters, active parameters, and top-$k$ routing while holding the compute budget fixed. For every model we record pre-training loss, downstream task loss, and task accuracy, allowing us to separate the train-test generalization gap from the loss-accuracy gap. Memorization benchmarks improve monotonically with total parameters, mirroring training loss. By contrast, reasoning performance saturates and can even regress despite continued gains in both total parameters and training loss. Altering top-$k$ alone has little effect when active parameters are constant, and classic hyperparameters such as learning rate and initialization modulate the generalization gap in the same direction as sparsity. Neither post-training reinforcement learning (GRPO) nor extra test-time compute rescues the reasoning deficit of overly sparse models. Our model checkpoints, code and logs are open-source at https://github.com/rioyokotalab/optimal-sparsity.
comment: Presented at the Second AI for Math Workshop at ICML
☆ Auditing Approximate Machine Unlearning for Differentially Private Models ICDM2025
Approximate machine unlearning aims to remove the effect of specific data from trained models to ensure individuals' privacy. Existing methods focus on the removed records and assume the retained ones are unaffected. However, recent studies on the \emph{privacy onion effect} indicate this assumption might be incorrect. Especially when the model is differentially private, no study has explored whether the retained ones still meet the differential privacy (DP) criterion under existing machine unlearning methods. This paper takes a holistic approach to auditing both unlearned and retained samples' privacy risks after applying approximate unlearning algorithms. We propose the privacy criteria for unlearned and retained samples, respectively, based on the perspectives of DP and membership inference attacks (MIAs). To make the auditing process more practical, we also develop an efficient MIA, A-LiRA, utilizing data augmentation to reduce the cost of shadow model training. Our experimental findings indicate that existing approximate machine unlearning algorithms may inadvertently compromise the privacy of retained samples for differentially private models, and we need differentially private unlearning algorithms. For reproducibility, we have pubished our code: https://anonymous.4open.science/r/Auditing-machine-unlearning-CB10/README.md
comment: Accepted by ICDM2025,10pages
☆ Membership Inference Attacks on LLM-based Recommender Systems
Large language models (LLMs) based Recommender Systems (RecSys) can flexibly adapt recommendation systems to different domains. It utilizes in-context learning (ICL), i.e., the prompts, to customize the recommendation functions, which include sensitive historical user-specific item interactions, e.g., implicit feedback like clicked items or explicit product reviews. Such private information may be exposed to novel privacy attack. However, no study has been done on this important issue. We design four membership inference attacks (MIAs), aiming to reveal whether victims' historical interactions have been used by system prompts. They are \emph{direct inquiry, hallucination, similarity, and poisoning attacks}, each of which utilizes the unique features of LLMs or RecSys. We have carefully evaluated them on three LLMs that have been used to develop ICL-LLM RecSys and two well-known RecSys benchmark datasets. The results confirm that the MIA threat on LLM RecSys is realistic: direct inquiry and poisoning attacks showing significantly high attack advantages. We have also analyzed the factors affecting these attacks, such as the number of shots in system prompts and the position of the victim in the shots.
☆ FFT-MoE: Efficient Federated Fine-Tuning for Foundation Models via Large-scale Sparse MoE under Heterogeneous Edge
As FMs drive progress toward Artificial General Intelligence (AGI), fine-tuning them under privacy and resource constraints has become increasingly critical particularly when highquality training data resides on distributed edge devices. Federated Learning (FL) offers a compelling solution through Federated Fine-Tuning (FFT), which enables collaborative model adaptation without sharing raw data. Recent approaches incorporate Parameter-Efficient Fine-Tuning (PEFT) techniques such as Low Rank Adaptation (LoRA) to reduce computational overhead. However, LoRA-based FFT faces two major limitations in heterogeneous FL environments: structural incompatibility across clients with varying LoRA configurations and limited adaptability to non-IID data distributions, which hinders convergence and generalization. To address these challenges, we propose FFT MoE, a novel FFT framework that replaces LoRA with sparse Mixture of Experts (MoE) adapters. Each client trains a lightweight gating network to selectively activate a personalized subset of experts, enabling fine-grained adaptation to local resource budgets while preserving aggregation compatibility. To further combat the expert load imbalance caused by device and data heterogeneity, we introduce a heterogeneity-aware auxiliary loss that dynamically regularizes the routing distribution to ensure expert diversity and balanced utilization. Extensive experiments spanning both IID and non-IID conditions demonstrate that FFT MoE consistently outperforms state of the art FFT baselines in generalization performance and training efficiency.
comment: 9 pages, 6 figures
☆ The Sound of Risk: A Multimodal Physics-Informed Acoustic Model for Forecasting Market Volatility and Enhancing Market Interpretability
Information asymmetry in financial markets, often amplified by strategically crafted corporate narratives, undermines the effectiveness of conventional textual analysis. We propose a novel multimodal framework for financial risk assessment that integrates textual sentiment with paralinguistic cues derived from executive vocal tract dynamics in earnings calls. Central to this framework is the Physics-Informed Acoustic Model (PIAM), which applies nonlinear acoustics to robustly extract emotional signatures from raw teleconference sound subject to distortions such as signal clipping. Both acoustic and textual emotional states are projected onto an interpretable three-dimensional Affective State Label (ASL) space-Tension, Stability, and Arousal. Using a dataset of 1,795 earnings calls (approximately 1,800 hours), we construct features capturing dynamic shifts in executive affect between scripted presentation and spontaneous Q&A exchanges. Our key finding reveals a pronounced divergence in predictive capacity: while multimodal features do not forecast directional stock returns, they explain up to 43.8% of the out-of-sample variance in 30-day realized volatility. Importantly, volatility predictions are strongly driven by emotional dynamics during executive transitions from scripted to spontaneous speech, particularly reduced textual stability and heightened acoustic instability from CFOs, and significant arousal variability from CEOs. An ablation study confirms that our multimodal approach substantially outperforms a financials-only baseline, underscoring the complementary contributions of acoustic and textual modalities. By decoding latent markers of uncertainty from verifiable biometric signals, our methodology provides investors and regulators a powerful tool for enhancing market interpretability and identifying hidden corporate uncertainty.
comment: 9 pages, 6 figures
☆ Biologically Disentangled Multi-Omic Modeling Reveals Mechanistic Insights into Pan-Cancer Immunotherapy Resistance
Immune checkpoint inhibitors (ICIs) have transformed cancer treatment, yet patient responses remain highly variable, and the biological mechanisms underlying resistance are poorly understood. While machine learning models hold promise for predicting responses to ICIs, most existing methods lack interpretability and do not effectively leverage the biological structure inherent to multi-omics data. Here, we introduce the Biologically Disentangled Variational Autoencoder (BDVAE), a deep generative model that integrates transcriptomic and genomic data through modality- and pathway-specific encoders. Unlike existing rigid, pathway-informed models, BDVAE employs a modular encoder architecture combined with variational inference to learn biologically meaningful latent features associated with immune, genomic, and metabolic processes. Applied to a pan-cancer cohort of 366 patients across four cancer types treated with ICIs, BDVAE accurately predicts treatment response (AUC-ROC = 0.94 on unseen test data) and uncovers critical resistance mechanisms, including immune suppression, metabolic shifts, and neuronal signaling. Importantly, BDVAE reveals that resistance spans a continuous biological spectrum rather than strictly binary states, reflecting gradations of tumor dysfunction. Several latent features correlate with survival outcomes and known clinical subtypes, demonstrating BDVAE's capability to generate interpretable, clinically relevant insights. These findings underscore the value of biologically structured machine learning in elucidating complex resistance patterns and guiding precision immunotherapy strategies.
♻ ☆ Cohort-Aware Agents for Individualized Lung Cancer Risk Prediction Using a Retrieval-Augmented Model Selection Framework
Accurate lung cancer risk prediction remains challenging due to substantial variability across patient populations and clinical settings -- no single model performs best for all cohorts. To address this, we propose a personalized lung cancer risk prediction agent that dynamically selects the most appropriate model for each patient by combining cohort-specific knowledge with modern retrieval and reasoning techniques. Given a patient's CT scan and structured metadata -- including demographic, clinical, and nodule-level features -- the agent first performs cohort retrieval using FAISS-based similarity search across nine diverse real-world cohorts to identify the most relevant patient population from a multi-institutional database. Second, a Large Language Model (LLM) is prompted with the retrieved cohort and its associated performance metrics to recommend the optimal prediction algorithm from a pool of eight representative models, including classical linear risk models (e.g., Mayo, Brock), temporally-aware models (e.g., TD-VIT, DLSTM), and multi-modal computer vision-based approaches (e.g., Liao, Sybil, DLS, DLI). This two-stage agent pipeline -- retrieval via FAISS and reasoning via LLM -- enables dynamic, cohort-aware risk prediction personalized to each patient's profile. Building on this architecture, the agent supports flexible and cohort-driven model selection across diverse clinical populations, offering a practical path toward individualized risk assessment in real-world lung cancer screening.
♻ ☆ Emergent time-keeping mechanisms in a deep reinforcement learning agent performing an interval timing task
Drawing parallels between Deep Artificial Neural Networks (DNNs) and biological systems can aid in understanding complex biological mechanisms that are difficult to disentangle. Temporal processing, an extensively researched topic, is one such example that lacks a coherent understanding of its underlying mechanisms. In this study, we investigate temporal processing in a Deep Reinforcement Learning (DRL) agent performing an interval timing task and explore potential biological counterparts to its emergent behavior. The agent was successfully trained to perform a duration production task, which involved marking successive occurrences of a target interval while viewing a video sequence. Analysis of the agent's internal states revealed oscillatory neural activations, a ubiquitous pattern in biological systems. Interestingly, the agent's actions were predominantly influenced by neurons exhibiting these oscillations with high amplitudes and frequencies corresponding to the target interval. Parallels are drawn between the agent's time-keeping strategy and the Striatal Beat Frequency (SBF) model, a biologically plausible model of interval timing. Furthermore, the agent maintained its oscillatory representations and task performance when tested on different video sequences (including a blank video). Thus, once learned, the agent internalized its time-keeping mechanism and showed minimal reliance on its environment to perform the timing task. A hypothesis about the resemblance between this emergent behavior and certain aspects of the evolution of biological processes like circadian rhythms, has been discussed. This study aims to contribute to recent research efforts of utilizing DNNs to understand biological systems, with a particular emphasis on temporal processing.
comment: Accepted at 2025 Artificial Life Conference
♻ ☆ Distribution free M-estimation
The basic question of delineating those statistical problems that are solvable without making any assumptions on the underlying data distribution has long animated statistics and learning theory. This paper characterizes when a convex M-estimation or stochastic optimization problem is solvable in such an assumption-free setting, providing a precise dividing line between solvable and unsolvable problems. The conditions we identify show, perhaps surprisingly, that Lipschitz continuity of the loss being minimized is not necessary for distribution free minimization, and they are also distinct from classical characterizations of learnability in machine learning.
comment: 45 pages
♻ ☆ Local Learning Rules for Out-of-Equilibrium Physical Generative Models
We show that the out-of-equilibrium driving protocol of score-based generative models (SGMs) can be learned via local learning rules. The gradient with respect to the parameters of the driving protocol is computed directly from force measurements or from observed system dynamics. As a demonstration, we implement an SGM in a network of driven, nonlinear, overdamped oscillators coupled to a thermal bath. We first apply it to the problem of sampling from a mixture of two Gaussians in 2D. Finally, we train a 12x12 oscillator network on the MNIST dataset to generate images of handwritten digits 0 and 1.
comment: 6 pages, 2 figures
♻ ☆ Investigating the Robustness of Extreme Precipitation Super-Resolution Across Climates
The coarse spatial resolution of gridded climate models, such as general circulation models, limits their direct use in projecting socially relevant variables like extreme precipitation. Most downscaling methods estimate the conditional distributions of extremes by generating large ensembles, complicating the assessment of robustness under distributional shifts, such as those induced by climate change. To better understand and potentially improve robustness, we propose super-resolving the parameters of the target variable's probability distribution directly using analytically tractable mappings. Within a perfect-model framework over Switzerland, we demonstrate that vector generalized linear and additive models can super-resolve the generalized extreme value distribution of summer hourly precipitation extremes from coarse precipitation fields and topography. We introduce the notion of a "robustness gap", defined as the difference in predictive error between present-trained and future-trained models, and use it to diagnose how model structure affects the generalization of each quantile to a pseudo-global warming scenario. By evaluating multiple model configurations, we also identify an upper limit on the super-resolution factor based on the spatial auto- and cross-correlation of precipitation and elevation, beyond which coarse precipitation loses predictive value. Our framework is broadly applicable to variables governed by parametric distributions and offers a model-agnostic diagnostic for understanding when and why empirical downscaling generalizes to climate change and extremes.
comment: 40+3 pages, 9 figures, 1 table, submitted to WCE
♻ ☆ Quantum Graph Attention Network: A Novel Quantum Multi-Head Attention Mechanism for Graph Learning
We propose the Quantum Graph Attention Network (QGAT), a hybrid graph neural network that integrates variational quantum circuits into the attention mechanism. At its core, QGAT employs strongly entangling quantum circuits with amplitude-encoded node features to enable expressive nonlinear interactions. Distinct from classical multi-head attention that separately computes each head, QGAT leverages a single quantum circuit to simultaneously generate multiple attention coefficients. This quantum parallelism facilitates parameter sharing across heads, substantially reducing computational overhead and model complexity. Classical projection weights and quantum circuit parameters are optimized jointly in an end-to-end manner, ensuring flexible adaptation to learning tasks. Empirical results demonstrate QGAT's effectiveness in capturing complex structural dependencies and improved generalization in inductive scenarios, highlighting its potential for scalable quantum-enhanced learning across domains such as chemistry, biology, and network analysis. Furthermore, experiments confirm that quantum embedding enhances robustness against feature and structural noise, suggesting advantages in handling real-world noisy data. The modularity of QGAT also ensures straightforward integration into existing architectures, allowing it to easily augment classical attention-based models.
♻ ☆ Leveraging Multi-facet Paths for Heterogeneous Graph Representation Learning
Recent advancements in graph neural networks (GNNs) and heterogeneous GNNs (HGNNs) have advanced node embeddings and relationship learning for various tasks. However, existing methods often rely on domain-specific predefined meta-paths, which are coarse-grained and focus solely on aspects like node type, limiting their ability to capture complex interactions. We introduce MF2Vec, a model that uses multi-faceted (fine-grained) paths instead of predefined meta-paths. MF2Vec extracts paths via random walks and generates multi-faceted vectors, ignoring predefined schemas. This method learns diverse aspects of nodes and their relationships, constructs a homogeneous network, and creates node embeddings for classification, link prediction, and clustering. Extensive experiments show that MF2Vec outperforms existing methods, offering a more flexible and comprehensive framework for analyzing complex networks. The code is available at https://anonymous.4open.science/r/MF2Vec-6ABC.
♻ ☆ StagFormer: Time Staggering Transformer Decoding for RunningLayers In Parallel
Decoding in a Transformer based language model is inherently sequential as a token's embedding needs to pass through all the layers in the network before the generation of the next token can begin. In this work, we propose a new architecture StagFormer (Staggered Transformer), which staggers execution along the sequence axis and thereby enables parallelizing the decoding process along the depth of the model. We achieve this by breaking the dependency of the token representation at time step $i$ in layer $l$ upon the representations of tokens until time step $i$ from layer $l-1$. Instead, we stagger the execution and only allow a dependency on token representations until time step $i-1$. The later sections of the Transformer still get access to the "rich" representations from the prior section but only from those token positions which are one time step behind. StagFormer allows for different sections of the model to be executed in parallel yielding a potential speedup in decoding while being quality neutral in our simulations. We also explore many natural extensions of this idea. We present how weight-sharing across the different sections being staggered can be more practical in settings with limited memory. We explore the efficacy of using a bounded window attention to pass information from one section to another which helps drive further latency gains for some applications. We also explore the scalability of the staggering idea over more than 2 sections of the Transformer. Finally, we show how one can approximate a recurrent model during inference using weight-sharing. This variant can lead to substantial gains in quality for short generations while being neutral in its latency impact.
♻ ☆ Stochastic Nonlinear Control via Finite-dimensional Spectral Dynamic Embedding
This paper proposes an approach, Spectral Dynamics Embedding Control (SDEC), to optimal control for nonlinear stochastic systems. This method reveals an infinite-dimensional feature representation induced by the system's nonlinear stochastic dynamics, enabling a linear representation of the state-action value function. For practical implementation, this representation is approximated using finite-dimensional truncations, specifically via two prominent kernel approximation methods: random feature truncation and Nystrom approximation. To characterize the effectiveness of these approximations, we provide an in-depth theoretical analysis to characterize the approximation error arising from the finite-dimension truncation and statistical error due to finite-sample approximation in both policy evaluation and policy optimization. Empirically, our algorithm performs favorably against existing stochastic control algorithms on several benchmark problems.
comment: Accepted by the IEEE Transactions on Automatic Control
♻ ☆ MCI-GRU: Stock Prediction Model Based on Multi-Head Cross-Attention and Improved GRU
As financial markets grow increasingly complex in the big data era, accurate stock prediction has become more critical. Traditional time series models, such as GRUs, have been widely used but often struggle to capture the intricate nonlinear dynamics of markets, particularly in the flexible selection and effective utilization of key historical information. Recently, methods like Graph Neural Networks and Reinforcement Learning have shown promise in stock prediction but require high data quality and quantity, and they tend to exhibit instability when dealing with data sparsity and noise. Moreover, the training and inference processes for these models are typically complex and computationally expensive, limiting their broad deployment in practical applications. Existing approaches also generally struggle to capture unobservable latent market states effectively, such as market sentiment and expectations, microstructural factors, and participant behavior patterns, leading to an inadequate understanding of market dynamics and subsequently impact prediction accuracy. To address these challenges, this paper proposes a stock prediction model, MCI-GRU, based on a multi-head cross-attention mechanism and an improved GRU. First, we enhance the GRU model by replacing the reset gate with an attention mechanism, thereby increasing the model's flexibility in selecting and utilizing historical information. Second, we design a multi-head cross-attention mechanism for learning unobservable latent market state representations, which are further enriched through interactions with both temporal features and cross-sectional features. Finally, extensive experiments on four main stock markets show that the proposed method outperforms SOTA techniques across multiple metrics. Additionally, its successful application in real-world fund management operations confirms its effectiveness and practicality.
♻ ☆ A Consolidated Volatility Prediction with Back Propagation Neural Network and Genetic Algorithm ICML 2024
This paper provides a unique approach with AI algorithms to predict emerging stock markets volatility. Traditionally, stock volatility is derived from historical volatility,Monte Carlo simulation and implied volatility as well. In this paper, the writer designs a consolidated model with back-propagation neural network and genetic algorithm to predict future volatility of emerging stock markets and found that the results are quite accurate with low errors.
comment: 6 pages, 7 figures, 1 table, The paper will be published by IEEE on conference: 2024 3rd International Conference on Image Processing, Computer Vision and Machine Learning (ICICML 2024) (V4)
♻ ☆ Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding ECAI 25
Empowering safe exploration of reinforcement learning (RL) agents during training is a critical challenge towards their deployment in many real-world scenarios. When prior knowledge of the domain or task is unavailable, training RL agents in unknown, black-box environments presents an even greater safety risk. We introduce ADVICE (Adaptive Shielding with a Contrastive Autoencoder), a novel post-shielding technique that distinguishes safe and unsafe features of state-action pairs during training, and uses this knowledge to protect the RL agent from executing actions that yield likely hazardous outcomes. Our comprehensive experimental evaluation against state-of-the-art safe RL exploration techniques shows that ADVICE significantly reduces safety violations (approx 50%) during training, with a competitive outcome reward compared to other techniques.
comment: To be published in ECAI 25
♻ ☆ KNN and K-means in Gini Prametric Spaces
This paper introduces enhancements to the K-means and K-nearest neighbors (KNN) algorithms based on the concept of Gini prametric spaces, instead of traditional metric spaces. Unlike standard distance metrics, Gini prametrics incorporate both value-based and rank-based measures, offering robustness to noise and outliers. The main contributions include: (1) a Gini prametric that captures rank information alongside value distances; (2) a Gini K-means algorithm that is provably convergent and resilient to noisy data; and (3) a Gini KNN method that performs competitively with state-of-the-art approaches like Hassanat's distance in noisy environments. Experimental evaluations on 16 UCI datasets demonstrate the superior performance and efficiency of the Gini-based algorithms in clustering and classification tasks. This work opens new directions for rank-based prametrics in machine learning and statistical analysis.
♻ ☆ Steerable Scene Generation with Post Training and Inference-Time Search
Training robots in simulation requires diverse 3D scenes that reflect the specific challenges of downstream tasks. However, scenes that satisfy strict task requirements, such as high-clutter environments with plausible spatial arrangement, are rare and costly to curate manually. Instead, we generate large-scale scene data using procedural models that approximate realistic environments for robotic manipulation, and adapt it to task-specific goals. We do this by training a unified diffusion-based generative model that predicts which objects to place from a fixed asset library, along with their SE(3) poses. This model serves as a flexible scene prior that can be adapted using reinforcement learning-based post training, conditional generation, or inference-time search, steering generation toward downstream objectives even when they differ from the original data distribution. Our method enables goal-directed scene synthesis that respects physical feasibility and scales across scene types. We introduce a novel MCTS-based inference-time search strategy for diffusion models, enforce feasibility via projection and simulation, and release a dataset of over 44 million SE(3) scenes spanning five diverse environments. Website with videos, code, data, and model weights: https://steerable-scene-generation.github.io/
comment: Project website: https://steerable-scene-generation.github.io/
♻ ☆ SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
comment: 26 pages
♻ ☆ Deep vectorised operators for pulsatile hemodynamics estimation in coronary arteries from a steady-state prior
Cardiovascular hemodynamic fields provide valuable medical decision markers for coronary artery disease. Computational fluid dynamics (CFD) is the gold standard for accurate, non-invasive evaluation of these quantities in silico. In this work, we propose a time-efficient surrogate model, powered by machine learning, for the estimation of pulsatile hemodynamics based on steady-state priors. We introduce deep vectorised operators, a modelling framework for discretisation-independent learning on infinite-dimensional function spaces. The underlying neural architecture is a neural field conditioned on hemodynamic boundary conditions. Importantly, we show how relaxing the requirement of point-wise action to permutation-equivariance leads to a family of models that can be parametrised by message passing and self-attention layers. We evaluate our approach on a dataset of 74 stenotic coronary arteries extracted from coronary computed tomography angiography (CCTA) with patient-specific pulsatile CFD simulations as ground truth. We show that our model produces accurate estimates of the pulsatile velocity and pressure (approximation disparity 0.368 $\pm$ 0.079) while being agnostic ($p < 0.05$ in a one-way ANOVA test) to re-sampling of the source domain, i.e. discretisation-independent. This shows that deep vectorised operators are a powerful modelling tool for cardiovascular hemodynamics estimation in coronary arteries and beyond.
comment: Published in "Computer Methods and Programs in Biomedicine"
♻ ☆ CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
We propose CAD-Assistant, a general-purpose CAD agent for AI-assisted design. Our approach is based on a powerful Vision and Large Language Model (VLLM) as a planner and a tool-augmentation paradigm using CAD-specific tools. CAD-Assistant addresses multimodal user queries by generating actions that are iteratively executed on a Python interpreter equipped with the FreeCAD software, accessed via its Python API. Our framework is able to assess the impact of generated CAD commands on geometry and adapts subsequent actions based on the evolving state of the CAD design. We consider a wide range of CAD-specific tools including a sketch image parameterizer, rendering modules, a 2D cross-section generator, and other specialized routines. CAD-Assistant is evaluated on multiple CAD benchmarks, where it outperforms VLLM baselines and supervised task-specific methods. Beyond existing benchmarks, we qualitatively demonstrate the potential of tool-augmented VLLMs as general-purpose CAD solvers across diverse workflows.
♻ ☆ PinnDE: Physics-Informed Neural Networks for Solving Differential Equations
In recent years the study of deep learning for solving differential equations has grown substantially. The use of physics-informed neural networks (PINNs) and deep operator networks (DeepONets) have emerged as two of the most useful approaches in approximating differential equation solutions using machine learning. Here, we introduce PinnDE, an open-source Python library for solving differential equations with both PINNs and DeepONets. We give a brief review of both PINNs and DeepONets, introduce PinnDE along with the structure and usage of the package, and present worked examples to show PinnDE's effectiveness in approximating solutions of systems of differential equations with both PINNs and DeepONets.
♻ ☆ Large Language Model Aided QoS Prediction for Service Recommendation
Large language models (LLMs) have seen rapid improvement in the recent years, and have been used in a wider range of applications. After being trained on large text corpus, LLMs obtain the capability of extracting rich features from textual data. Such capability is potentially useful for the web service recommendation task, where the web users and services have intrinsic attributes that can be described using natural language sentences and are useful for recommendation. In this paper, we explore the possibility and practicality of using LLMs for web service recommendation. We propose the large language model aided QoS prediction (llmQoS) model, which use LLMs to extract useful information from attributes of web users and services via descriptive sentences. This information is then used in combination with the QoS values of historical interactions of users and services, to predict QoS values for any given user-service pair. On the WSDream dataset, llmQoS is shown to overcome the data sparsity issue inherent to the QoS prediction problem, and outperforms comparable baseline models consistently.
♻ ☆ Ensembles of Neural Surrogates for Parametric Sensitivity in Ocean Modeling
Accurate simulations of the oceans are crucial in understanding the Earth system. Despite their efficiency, simulations at lower resolutions must rely on various uncertain parameterizations to account for unresolved processes. However, model sensitivity to parameterizations is difficult to quantify, making it challenging to tune these parameterizations to reproduce observations. Deep learning surrogates have shown promise for efficient computation of the parametric sensitivities in the form of partial derivatives, but their reliability is difficult to evaluate without ground truth derivatives. In this work, we leverage large-scale hyperparameter search and ensemble learning to improve both forward predictions, autoregressive rollout, and backward adjoint sensitivity estimation. Particularly, the ensemble method provides epistemic uncertainty of function value predictions and their derivatives, providing improved reliability of the neural surrogates in decision making.
comment: 12 pages, 7 figures
♻ ☆ General Intelligence Requires Reward-based Pretraining
Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly -- the hallmarks of artificial general intelligence (AGI) -- remains fragile. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLM's reasoning overfits to the training data and is limited in its transferability. We hypothesize that the core issue underlying such limited transferability is the coupling of reasoning and knowledge in LLMs. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) pretaining to reason using RL from scratch as an alternative to the widely used next-token prediction pretraining, (2) using a curriculum of synthetic tasks to ease the learning of a reasoning prior for RL that can then be transferred to natural language tasks, and (3) learning more generalizable reasoning functions using a small context window to reduce exploiting spurious correlations between tokens. Such a reasoning system coupled with a trained retrieval system and a large external memory bank as a knowledge store can overcome several limitations of existing architectures at learning to reason in novel scenarios.
comment: https://improbableai.notion.site/General-Intelligence-Requires-Reward-Based-Pretraining-2023b66e4cf580d3ab44c7860b75d25f?pvs=73
♻ ☆ UniGenX: a unified generative foundation model that couples sequence, structure and function to accelerate scientific design across proteins, molecules and materials
Function in natural systems arises from one-dimensional sequences forming three-dimensional structures with specific properties. However, current generative models suffer from critical limitations: training objectives seldom target function directly, discrete sequences and continuous coordinates are optimized in isolation, and conformational ensembles are under-modeled. We present UniGenX, a unified generative foundation model that addresses these gaps by co-generating sequences and coordinates under direct functional and property objectives across proteins, molecules, and materials. UniGenX represents heterogeneous inputs as a mixed stream of symbolic and numeric tokens, where a decoder-only autoregressive transformer provides global context and a conditional diffusion head generates numeric fields steered by task-specific tokens. Besides the new high SOTAs on structure prediction tasks, the model demonstrates state-of-the-art or competitive performance for the function-aware generation across domains: in materials, it achieves "conflicted" multi-property conditional generation, yielding 436 crystal candidates meeting triple constraints, including 11 with novel compositions; in chemistry, it sets new benchmarks on five property targets and conformer ensemble generation on GEOM; and in biology, it improves success in modeling protein induced fit (RMSD < 2 {\AA}) by over 23-fold and enhances EC-conditioned enzyme design. Ablation studies and cross-domain transfer substantiate the benefits of joint discrete-continuous training, establishing UniGenX as a significant advance from prediction to controllable, function-aware generation.
♻ ☆ Finite-Width Neural Tangent Kernels from Feynman Diagrams
Neural tangent kernels (NTKs) are a powerful tool for analyzing deep, non-linear neural networks. In the infinite-width limit, NTKs can easily be computed for most common architectures, yielding full analytic control over the training dynamics. However, at infinite width, important properties of training such as NTK evolution or feature learning are absent. Nevertheless, finite width effects can be included by computing corrections to the Gaussian statistics at infinite width. We introduce Feynman diagrams for computing finite-width corrections to NTK statistics. These dramatically simplify the necessary algebraic manipulations and enable the computation of layer-wise recursive relations for arbitrary statistics involving preactivations, NTKs and certain higher-derivative tensors (dNTK and ddNTK) required to predict the training dynamics at leading order. We demonstrate the feasibility of our framework by extending stability results for deep networks from preactivations to NTKs and proving the absence of finite-width corrections for scale-invariant nonlinearities such as ReLU on the diagonal of the Gram matrix of the NTK. We validate our results with numerical experiments.
comment: 11 pages + appendices
♻ ☆ Seal Your Backdoor with Variational Defense ICCV 2025
We propose VIBE, a model-agnostic framework that trains classifiers resilient to backdoor attacks. The key concept behind our approach is to treat malicious inputs and corrupted labels from the training dataset as observed random variables, while the actual clean labels are latent. VIBE then recovers the corresponding latent clean label posterior through variational inference. The resulting training procedure follows the expectation-maximization (EM) algorithm. The E-step infers the clean pseudolabels by solving an entropy-regularized optimal transport problem, while the M-step updates the classifier parameters via gradient descent. Being modular, VIBE can seamlessly integrate with recent advancements in self-supervised representation learning, which enhance its ability to resist backdoor attacks. We experimentally validate the method effectiveness against contemporary backdoor attacks on standard datasets, a large-scale setup with 1$k$ classes, and a dataset poisoned with multiple attacks. VIBE consistently outperforms previous defenses across all tested scenarios.
comment: Accepted to ICCV 2025
♻ ☆ LLM-Enhanced Linear Autoencoders for Recommendation CIKM 2025
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
comment: Accepted by CIKM 2025
♻ ☆ Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.
♻ ☆ Learning the Simplest Neural ODE
Since the advent of the ``Neural Ordinary Differential Equation (Neural ODE)'' paper, learning ODEs with deep learning has been applied to system identification, time-series forecasting, and related areas. Exploiting the diffeomorphic nature of ODE solution maps, neural ODEs has also enabled their use in generative modeling. Despite the rich potential to incorporate various kinds of physical information, training Neural ODEs remains challenging in practice. This study demonstrates, through the simplest one-dimensional linear model, why training Neural ODEs is difficult. We then propose a new stabilization method and provide an analytical convergence analysis. The insights and techniques presented here serve as a concise tutorial for researchers beginning work on Neural ODEs.
comment: Accepted SICE FES 2025
♻ ☆ Pessimistic Iterative Planning with RNNs for Robust POMDPs ECAI 2025
Robust POMDPs extend classical POMDPs to incorporate model uncertainty using so-called uncertainty sets on the transition and observation functions, effectively defining ranges of probabilities. Policies for robust POMDPs must be (1) memory-based to account for partial observability and (2) robust against model uncertainty to account for the worst-case probability instances from the uncertainty sets. To compute such robust memory-based policies, we propose the pessimistic iterative planning (PIP) framework, which alternates between (1) selecting pessimistic POMDPs via worst-case probability instances from the uncertainty sets, and (2) computing finite-state controllers (FSCs) for these pessimistic POMDPs. Within PIP, we propose the rFSCNet algorithm, which optimizes a recurrent neural network to compute the FSCs. The empirical evaluation shows that rFSCNet can compute better-performing robust policies than several baselines and a state-of-the-art robust POMDP solver.
comment: Accepted for presentation at ECAI 2025
♻ ☆ Keep your distance: learning dispersed embeddings on $\mathbb{S}_m$
Learning well-separated features in high-dimensional spaces, such as text or image embeddings, is crucial for many machine learning applications. Achieving such separation can be effectively accomplished through the dispersion of embeddings, where unrelated vectors are pushed apart as much as possible. By constraining features to be on a hypersphere, we can connect dispersion to well-studied problems in mathematics and physics, where optimal solutions are known for limited low-dimensional cases. However, in representation learning we typically deal with a large number of features in high-dimensional space, and moreover, dispersion is usually traded off with some other task-oriented training objective, making existing theoretical and numerical solutions inapplicable. Therefore, it is common to rely on gradient-based methods to encourage dispersion, usually by minimizing some function of the pairwise distances. In this work, we first give an overview of existing methods from disconnected literature, making new connections and highlighting similarities. Next, we introduce some new angles. We propose to reinterpret pairwise dispersion using a maximum mean discrepancy (MMD) motivation. We then propose an online variant of the celebrated Lloyd's algorithm, of K-Means fame, as an effective alternative regularizer for dispersion on generic domains. Finally, we derive a novel dispersion method that directly exploits properties of the hypersphere. Our experiments show the importance of dispersion in image classification and natural language processing tasks, and how algorithms exhibit different trade-offs in different regimes.
♻ ☆ From Taylor Series to Fourier Synthesis: The Periodic Linear Unit
The dominant paradigm in modern neural networks relies on simple, monotonically-increasing activation functions like ReLU. While effective, this paradigm necessitates large, massively-parameterized models to approximate complex functions. In this paper, we introduce the Periodic Linear Unit (PLU), a learnable sine-wave based activation with periodic non-monotonicity. PLU is designed for maximum expressive power and numerical stability, achieved through its formulation and a paired innovation we term Repulsive Reparameterization, which prevents the activation from collapsing into a non-expressive linear function. We demonstrate that a minimal MLP with only two PLU neurons can solve the spiral classification task, a feat impossible for equivalent networks using standard activations. This suggests a paradigm shift from networks as piecewise Taylor-like approximators to powerful Fourier-like function synthesizers, achieving exponential gains in parameter efficiency by placing intelligence in the neuron itself.
comment: 15 pages, 5 figures, for associated raw example files and the code repository, see https://github.com/bill13579/plu_activation
♻ ☆ Emerging Semantic Segmentation from Positive and Negative Coarse Label Learning
Large annotated datasets are vital for training segmentation models, but pixel-level labeling is time-consuming, error-prone, and often requires scarce expert annotators, especially in medical imaging. In contrast, coarse annotations are quicker, cheaper, and easier to produce, even by non-experts. In this paper, we propose to use coarse drawings from both positive (target) and negative (background) classes in the image, even with noisy pixels, to train a convolutional neural network (CNN) for semantic segmentation. We present a method for learning the true segmentation label distributions from purely noisy coarse annotations using two coupled CNNs. The separation of the two CNNs is achieved by high fidelity with the characters of the noisy training annotations. We propose to add a complementary label learning that encourages estimating negative label distribution. To illustrate the properties of our method, we first use a toy segmentation dataset based on MNIST. We then present the quantitative results of experiments using publicly available datasets: Cityscapes dataset for multi-class segmentation, and retinal images for medical applications. In all experiments, our method outperforms state-of-the-art methods, particularly in the cases where the ratio of coarse annotations is small compared to the given dense annotations.
♻ ☆ Generalization, Expressivity, and Universality of Graph Neural Networks on Attributed Graphs
We analyze the universality and generalization of graph neural networks (GNNs) on attributed graphs, i.e., with node attributes. To this end, we propose pseudometrics over the space of all attributed graphs that describe the fine-grained expressivity of GNNs. Namely, GNNs are both Lipschitz continuous with respect to our pseudometrics and can separate attributed graphs that are distant in the metric. Moreover, we prove that the space of all attributed graphs is relatively compact with respect to our metrics. Based on these properties, we prove a universal approximation theorem for GNNs and generalization bounds for GNNs on any data distribution of attributed graphs. The proposed metrics compute the similarity between the structures of attributed graphs via a hierarchical optimal transport between computation trees. Our work extends and unites previous approaches which either derived theory only for graphs with no attributes, derived compact metrics under which GNNs are continuous but without separation power, or derived metrics under which GNNs are continuous and separate points but the space of graphs is not relatively compact, which prevents universal approximation and generalization analysis.
♻ ☆ Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. This work is still in progress, and we will release the code, the models, and the datasets soon.
comment: This work is still in progress
♻ ☆ Noise-based reward-modulated learning
Biological neural systems efficiently learn from delayed rewards despite relying on noisy synaptic transmission and lacking centralized optimization mechanisms. In contrast, artificial neural networks trained with reinforcement learning typically rely on backpropagation (BP), which limits their use in resource-constrained systems or with non-differentiable components. While noise-based alternatives, like reward-modulated Hebbian learning (RMHL), provide a biologically grounded framework for credit assignment, they struggle with temporal delays and hierarchical processing -key challenges in real-world learning. In this work, we derive a novel noise-based learning rule to address these challenges. Drawing inspiration from biological neural circuits, our method uses reward prediction errors as its optimization target to generate increasingly advantageous behavior, and incorporates an eligibility trace to facilitate retrospective credit assignment. Its formulation relies on local information, aligning with biological constraints and enabling neuromorphic implementation. Experimental validation on reinforcement tasks (immediate and delayed rewards) shows our approach significantly outperforms RMHL and achieves performance comparable to BP, although with slower convergence due to its noise-driven updates. While tested on simple architectures, the results highlight the potential of noise-driven, brain-inspired learning for low-power adaptive systems, particularly in scenarios where energy efficiency and biological plausibility are a priority. These findings also offer mechanistic insights into how dopamine-like signals and synaptic stochasticity may jointly enable learning in biological networks, bridging computational models with neurobiological principles.
♻ ☆ Prototype-Guided Diffusion: Visual Conditioning without External Memory
Diffusion models have emerged as a leading framework for high-quality image generation, offering stable training and strong performance across diverse domains. However, they remain computationally intensive, particularly during the iterative denoising process. Latent-space models like Stable Diffusion alleviate some of this cost by operating in compressed representations, though at the expense of fine-grained detail. More recent approaches such as Retrieval-Augmented Diffusion Models (RDM) address efficiency by conditioning denoising on similar examples retrieved from large external memory banks. While effective, these methods introduce drawbacks: they require costly storage and retrieval infrastructure, depend on static vision-language models like CLIP for similarity, and lack adaptability during training. We propose the Prototype Diffusion Model (PDM), a method that integrates prototype learning directly into the diffusion process for efficient and adaptive visual conditioning - without external memory. Instead of retrieving reference samples, PDM constructs a dynamic set of compact visual prototypes from clean image features using contrastive learning. These prototypes guide the denoising steps by aligning noisy representations with semantically relevant visual patterns, enabling efficient generation with strong semantic grounding. Experiments show that PDM maintains high generation quality while reducing computational and storage overhead, offering a scalable alternative to retrieval-based conditioning in diffusion models.
♻ ☆ Overcoming label shift with target-aware federated learning
Federated learning enables multiple actors to collaboratively train models without sharing private data. Existing algorithms are successful and well-justified in this task when the intended target domain, where the trained model will be used, shares data distribution with the aggregate of clients, but this is often violated in practice. A common reason is label shift -- that the label distributions differ between clients and the target domain. We demonstrate empirically that this can significantly degrade performance. To address this problem, we propose FedPALS, a principled and practical model aggregation scheme that adapts to label shifts to improve performance in the target domain by leveraging knowledge of label distributions at the central server. Our approach ensures unbiased updates under federated stochastic gradient descent which yields robust generalization across clients with diverse, label-shifted data. Extensive experiments on image classification tasks demonstrate that FedPALS consistently outperforms baselines by aligning model aggregation with the target domain. Our findings reveal that conventional federated learning methods suffer severely in cases of extreme label sparsity on clients, highlighting the critical need for target-aware aggregation as offered by FedPALS.
♻ ☆ Uni-AIMS: AI-Powered Microscopy Image Analysis
This paper presents a systematic solution for the intelligent recognition and automatic analysis of microscopy images. We developed a data engine that generates high-quality annotated datasets through a combination of the collection of diverse microscopy images from experiments, synthetic data generation and a human-in-the-loop annotation process. To address the unique challenges of microscopy images, we propose a segmentation model capable of robustly detecting both small and large objects. The model effectively identifies and separates thousands of closely situated targets, even in cluttered visual environments. Furthermore, our solution supports the precise automatic recognition of image scale bars, an essential feature in quantitative microscopic analysis. Building upon these components, we have constructed a comprehensive intelligent analysis platform and validated its effectiveness and practicality in real-world applications. This study not only advances automatic recognition in microscopy imaging but also ensures scalability and generalizability across multiple application domains, offering a powerful tool for automated microscopic analysis in interdisciplinary research. A online application is made available for researchers to access and evaluate the proposed automated analysis service.
♻ ☆ PointFix: Learning to Fix Domain Bias for Robust Online Stereo Adaptation ECCV 2022
Online stereo adaptation tackles the domain shift problem, caused by different environments between synthetic (training) and real (test) datasets, to promptly adapt stereo models in dynamic real-world applications such as autonomous driving. However, previous methods often fail to counteract particular regions related to dynamic objects with more severe environmental changes. To mitigate this issue, we propose to incorporate an auxiliary point-selective network into a meta-learning framework, called PointFix, to provide a robust initialization of stereo models for online stereo adaptation. In a nutshell, our auxiliary network learns to fix local variants intensively by effectively back-propagating local information through the meta-gradient for the robust initialization of the baseline model. This network is model-agnostic, so can be used in any kind of architectures in a plug-and-play manner. We conduct extensive experiments to verify the effectiveness of our method under three adaptation settings such as short-, mid-, and long-term sequences. Experimental results show that the proper initialization of the base stereo model by the auxiliary network enables our learning paradigm to achieve state-of-the-art performance at inference.
comment: Accepted to ECCV 2022
♻ ☆ Incremental Multi-Scene Modeling via Continual Neural Graphics Primitives
Neural radiance fields (NeRF) have revolutionized photorealistic rendering of novel views for 3D scenes. Despite their growing popularity and efficiency as 3D resources, NeRFs face scalability challenges due to the need for separate models per scene and the cumulative increase in training time for multiple scenes. The potential for incrementally encoding multiple 3D scenes into a single NeRF model remains largely unexplored. To address this, we introduce Continual-Neural Graphics Primitives (C-NGP), a novel continual learning framework that integrates multiple scenes incrementally into a single neural radiance field. Using a generative replay approach, C-NGP adapts to new scenes without requiring access to old data. We demonstrate that C-NGP can accommodate multiple scenes without increasing the parameter count, producing high-quality novel-view renderings on synthetic and real datasets. Notably, C-NGP models all $8$ scenes from the Real-LLFF dataset together, with only a $2.2\%$ drop in PSNR compared to vanilla NeRF, which models each scene independently. Further, C-NGP allows multiple style edits in the same network.
♻ ☆ Human Vision Constrained Super-Resolution
Modern deep-learning super-resolution (SR) techniques process images and videos independently of the underlying content and viewing conditions. However, the sensitivity of the human visual system (HVS) to image details changes depending on the underlying image characteristics, such as spatial frequency, luminance, color, contrast, or motion; as well viewing condition aspects such as ambient lighting and distance to the display. This observation suggests that computational resources spent on up-sampling images/videos may be wasted whenever a viewer cannot resolve the synthesized details i.e the resolution of details exceeds the resolving capability of human vision. Motivated by this observation, we propose a human vision inspired and architecture-agnostic approach for controlling SR techniques to deliver visually optimal results while limiting computational complexity. Its core is an explicit Human Visual Processing Framework (HVPF) that dynamically and locally guides SR methods according to human sensitivity to specific image details and viewing conditions. We demonstrate the application of our framework in combination with network branching to improve the computational efficiency of SR methods. Quantitative and qualitative evaluations, including user studies, demonstrate the effectiveness of our approach in reducing FLOPS by factors of 2$\times$ and greater, without sacrificing perceived quality.
♻ ☆ PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation
Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel -- unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.
comment: 9 pages, 4 figures, conference
♻ ☆ Generative Feature Imputing -- A Technique for Error-resilient Semantic Communication
Semantic communication (SemCom) has emerged as a promising paradigm for achieving unprecedented communication efficiency in sixth-generation (6G) networks by leveraging artificial intelligence (AI) to extract and transmit the underlying meanings of source data. However, deploying SemCom over digital systems presents new challenges, particularly in ensuring robustness against transmission errors that may distort semantically critical content. To address this issue, this paper proposes a novel framework, termed generative feature imputing, which comprises three key techniques. First, we introduce a spatial error concentration packetization strategy that spatially concentrates feature distortions by encoding feature elements based on their channel mappings, a property crucial for both the effectiveness and reduced complexity of the subsequent techniques. Second, building on this strategy, we propose a generative feature imputing method that utilizes a diffusion model to efficiently reconstruct missing features caused by packet losses. Finally, we develop a semantic-aware power allocation scheme that enables unequal error protection by allocating transmission power according to the semantic importance of each packet. Experimental results demonstrate that the proposed framework outperforms conventional approaches, such as Deep Joint Source-Channel Coding (DJSCC) and JPEG2000, under block fading conditions, achieving higher semantic accuracy and lower Learned Perceptual Image Patch Similarity (LPIPS) scores.
♻ ☆ Deep Generative Methods and Tire Architecture Design
As deep generative models proliferate across the AI landscape, industrial practitioners still face critical yet unanswered questions about which deep generative models best suit complex manufacturing design tasks. This work addresses this question through a complete study of five representative models (Variational Autoencoder, Generative Adversarial Network, multimodal Variational Autoencoder, Denoising Diffusion Probabilistic Model, and Multinomial Diffusion Model) on industrial tire architecture generation. Our evaluation spans three key industrial scenarios: (i) unconditional generation of complete multi-component designs, (ii) component-conditioned generation (reconstructing architectures from partial observations), and (iii) dimension-constrained generation (creating designs that satisfy specific dimensional requirements). To enable discrete diffusion models to handle conditional scenarios, we introduce categorical inpainting, a mask-aware reverse diffusion process that preserves known labels without requiring additional training. Our evaluation employs geometry-aware metrics specifically calibrated for industrial requirements, quantifying spatial coherence, component interaction, structural connectivity, and perceptual fidelity. Our findings reveal that diffusion models achieve the strongest overall performance; a masking-trained VAE nonetheless outperforms the multimodal variant MMVAE\textsuperscript{+} on nearly all component-conditioned metrics, and within the diffusion family MDM leads in-distribution whereas DDPM generalises better to out-of-distribution dimensional constraints.
♻ ☆ Multi-Component VAE with Gaussian Markov Random Field
Multi-component datasets with intricate dependencies, like industrial assemblies or multi-modal imaging, challenge current generative modeling techniques. Existing Multi-component Variational AutoEncoders typically rely on simplified aggregation strategies, neglecting critical nuances and consequently compromising structural coherence across generated components. To explicitly address this gap, we introduce the Gaussian Markov Random Field Multi-Component Variational AutoEncoder , a novel generative framework embedding Gaussian Markov Random Fields into both prior and posterior distributions. This design choice explicitly models cross-component relationships, enabling richer representation and faithful reproduction of complex interactions. Empirically, our GMRF MCVAE achieves state-of-the-art performance on a synthetic Copula dataset specifically constructed to evaluate intricate component relationships, demonstrates competitive results on the PolyMNIST benchmark, and significantly enhances structural coherence on the real-world BIKED dataset. Our results indicate that the GMRF MCVAE is especially suited for practical applications demanding robust and realistic modeling of multi-component coherence
♻ ☆ Fingerprint Vector: Enabling Scalable and Efficient Model Fingerprint Transfer via Vector Addition
Backdoor-based fingerprinting has emerged as an effective technique for tracing the ownership of large language models. However, in real-world deployment scenarios, developers often instantiate multiple downstream models from a shared base model, and applying fingerprinting to each variant individually incurs prohibitive computational overhead. While inheritance-based approaches -- where fingerprints are embedded into the base model and expected to persist through fine-tuning -- appear attractive, they suffer from three key limitations: late-stage fingerprinting, fingerprint instability, and interference with downstream adaptation. To address these challenges, we propose a novel mechanism called the Fingerprint Vector. Our method first embeds a fingerprint into the base model via backdoor-based fine-tuning, then extracts a task-specific parameter delta as a fingerprint vector by computing the difference between the fingerprinted and clean models. This vector can be directly added to any structurally compatible downstream model, allowing the fingerprint to be transferred post hoc without additional fine-tuning. Extensive experiments show that Fingerprint Vector achieves comparable or superior performance to direct injection across key desiderata. It maintains strong effectiveness across diverse model architectures as well as mainstream downstream variants within the same family. It also preserves harmlessness and robustness in most cases. Even when slight robustness degradation is observed, the impact remains within acceptable bounds and is outweighed by the scalability benefits of our approach.
♻ ☆ Beyond Discriminant Patterns: On the Robustness of Decision Rule Ensembles ICDM 2025
Local decision rules are commonly understood to be more explainable, due to the local nature of the patterns involved. With numerical optimization methods such as gradient boosting, ensembles of local decision rules can gain good predictive performance on data involving global structure. Meanwhile, machine learning models are being increasingly used to solve problems in high-stake domains including healthcare and finance. Here, there is an emerging consensus regarding the need for practitioners to understand whether and how those models could perform robustly in the deployment environments, in the presence of distributional shifts. Past research on local decision rules has focused mainly on maximizing discriminant patterns, without due consideration of robustness against distributional shifts. In order to fill this gap, we propose a new method to learn and ensemble local decision rules, that are robust both in the training and deployment environments. Specifically, we propose to leverage causal knowledge by regarding the distributional shifts in subpopulations and deployment environments as the results of interventions on the underlying system. We propose two regularization terms based on causal knowledge to search for optimal and stable rules. Experiments on both synthetic and benchmark datasets show that our method is effective and robust against distributional shifts in multiple environments.
comment: Accepted by ICDM 2025
♻ ☆ Secure Reinforcement Learning via Shuffle Privacy Model
Reinforcement learning (RL) is a powerful tool for sequential decision-making, but its application is often hindered by privacy concerns arising from its interaction data. This challenge is particularly acute in advanced Cyber-Physical Systems (CPS), where learning from operational and user data can expose systems to privacy inference attacks. Existing differential privacy (DP) models for RL are often inadequate: the centralized model requires a fully trusted server, creating a single point of failure risk, while the local model incurs significant performance degradation that is unsuitable for many control applications. This paper addresses this gap by leveraging the emerging shuffle model of privacy, an intermediate trust model that provides strong privacy guarantees without a centralized trust assumption. We present Shuffle Differentially Private Policy Elimination (SDP-PE), the first generic policy elimination-based algorithm for episodic RL under the shuffle model. Our method introduces a novel exponential batching schedule and a ``forgetting'' mechanism to balance the competing demands of privacy and learning performance. Our analysis shows that SDP-PE achieves a near-optimal regret bound, demonstrating a superior privacy-regret trade-off that significantly outperforms the local model. This work establishes the viability of the shuffle model for secure data-driven control in advanced CPS.
♻ ☆ Data Requirement Goal Modeling for Machine Learning Systems
Machine Learning (ML) has been integrated into various software and systems. Two main components are essential for training an ML model: the training data and the ML algorithm. Given the critical role of data in ML system development, it has become increasingly important to assess the quality of data attributes and ensure that the data meets specific requirements before its utilization. This work proposes an approach to guide non-experts in identifying data requirements for ML systems using goal modeling. In this approach, we first develop the Data Requirement Goal Model (DRGM) by surveying the white literature to identify and categorize the issues and challenges faced by data scientists and requirement engineers working on ML-related projects. An initial DRGM was built to accommodate common tasks that would generalize across projects. Then, based on insights from both white and gray literature, a customization mechanism is built to help adjust the tasks, KPIs, and goals' importance of different elements within the DRGM. The generated model can aid its users in evaluating different datasets using GRL evaluation strategies. We then validate the approach through two illustrative examples based on real-world projects. The results from the illustrative examples demonstrate that the data requirements identified by the proposed approach align with the requirements of real-world projects, demonstrating the practicality and effectiveness of the proposed framework. The proposed dataset selection customization mechanism and the proposed DRGM are helpful in guiding non-experts in identifying the data requirements for machine learning systems tailored to a specific ML problem. This approach also aids in evaluating different dataset alternatives to choose the optimum dataset for the problem. For future work, we recommend implementing tool support to generate the DRGM based on a chatbot interface.
♻ ☆ Comparison of Data Reduction Criteria for Online Gaussian Processes
Gaussian Processes (GPs) are widely used for regression and system identification due to their flexibility and ability to quantify uncertainty. However, their computational complexity limits their applicability to small datasets. Moreover in a streaming scenario, more and more datapoints accumulate which is intractable even for Sparse GPs. Online GPs aim to alleviate this problem by e.g. defining a maximum budget of datapoints and removing redundant datapoints. This work provides a unified comparison of several reduction criteria, analyzing both their computational complexity and reduction behavior. The criteria are evaluated on benchmark functions and real-world datasets, including dynamic system identification tasks. Additionally, acceptance criteria are proposed to further filter out redundant datapoints. This work yields practical guidelines for choosing a suitable criterion for an online GP algorithm.
comment: 24 pages
♻ ☆ Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model ECAI 2025
Recent advancements in computational chemistry have increasingly focused on synthesizing molecules based on textual instructions. Integrating graph generation with these instructions is complex, leading most current methods to use molecular sequences with pre-trained large language models. In response to this challenge, we propose a novel framework, named $\textbf{UTGDiff (Unified Text-Graph Diffusion Model)}$, which utilizes language models for discrete graph diffusion to generate molecular graphs from instructions. UTGDiff features a unified text-graph transformer as the denoising network, derived from pre-trained language models and minimally modified to process graph data through attention bias. Our experimental results demonstrate that UTGDiff consistently outperforms sequence-based baselines in tasks involving instruction-based molecule generation and editing, achieving superior performance with fewer parameters given an equivalent level of pretraining corpus. Our code is availble at https://github.com/ran1812/UTGDiff.
comment: ECAI 2025
♻ ☆ Uncertainty-Calibrated Test-Time Model Adaptation without Forgetting
Test-time adaptation (TTA) seeks to tackle potential distribution shifts between training and test data by adapting a given model w.r.t. any test sample. Although recent TTA has shown promising performance, we still face two key challenges: 1) prior methods perform backpropagation for each test sample, resulting in unbearable optimization costs to many applications; 2) while existing TTA can significantly improve the test performance on out-of-distribution data, they often suffer from severe performance degradation on in-distribution data after TTA (known as forgetting). To this end, we have proposed an Efficient Anti-Forgetting Test-Time Adaptation (EATA) method which develops an active sample selection criterion to identify reliable and non-redundant samples for test-time entropy minimization. To alleviate forgetting, EATA introduces a Fisher regularizer estimated from test samples to constrain important model parameters from drastic changes. However, in EATA, the adopted entropy loss consistently assigns higher confidence to predictions even for samples that are underlying uncertain, leading to overconfident predictions. To tackle this, we further propose EATA with Calibration (EATA-C) to separately exploit the reducible model uncertainty and the inherent data uncertainty for calibrated TTA. Specifically, we measure the model uncertainty by the divergence between predictions from the full network and its sub-networks, on which we propose a divergence loss to encourage consistent predictions instead of overconfident ones. To further recalibrate prediction confidence, we utilize the disagreement among predicted labels as an indicator of the data uncertainty, and then devise a min-max entropy regularizer to selectively increase and decrease prediction confidence for different samples. Experiments on image classification and semantic segmentation verify the effectiveness of our methods.
comment: 21 pages, 18 figures, 13 tables
♻ ☆ Breaking Data Silos: Towards Open and Scalable Mobility Foundation Models via Generative Continual Learning
Foundation models have revolutionized fields such as natural language processing and computer vision by enabling general-purpose learning across diverse tasks and datasets. However, building analogous models for human mobility remains challenging due to the privacy-sensitive nature of mobility data and the resulting data silos across institutions. To bridge this gap, we propose MoveGCL, a scalable and privacy-preserving framework for training mobility foundation models via generative continual learning. Without sharing raw data, MoveGCL enables decentralized and progressive model evolution by replaying synthetic trajectories generated from a frozen teacher model, and reinforces knowledge retention through a tailored distillation strategy that mitigates catastrophic forgetting. To address the heterogeneity of mobility patterns, MoveGCL incorporates a Mixture-of-Experts Transformer with a mobility-aware expert routing mechanism, and employs a layer-wise progressive adaptation strategy to stabilize continual updates. Experiments on six real-world urban datasets demonstrate that MoveGCL achieves performance comparable to joint training and significantly outperforms federated learning baselines, while offering strong privacy protection. MoveGCL marks a crucial step toward unlocking foundation models for mobility, offering a practical blueprint for open, scalable, and privacy-preserving model development in the era of foundation models. To facilitate reproducibility and future research, we have released the code and models at https://github.com/tsinghua-fib-lab/MoveGCL.
comment: The 33rd ACM International Conference on Advances in Geographic Information Systems
♻ ☆ Can LLMs Handle WebShell Detection? Overcoming Detection Challenges with Behavioral Function-Aware Framework
WebShell attacks, where malicious scripts are injected into web servers, pose a significant cybersecurity threat. Traditional ML and DL methods are often hampered by challenges such as the need for extensive training data, catastrophic forgetting, and poor generalization. Recently, Large Language Models have emerged as powerful alternatives for code-related tasks, but their potential in WebShell detection remains underexplored. In this paper, we make two contributions: (1) a comprehensive evaluation of seven LLMs, including GPT-4, LLaMA 3.1 70B, and Qwen 2.5 variants, benchmarked against traditional sequence- and graph-based methods using a dataset of 26.59K PHP scripts, and (2) the Behavioral Function-Aware Detection (BFAD) framework, designed to address the specific challenges of applying LLMs to this domain. Our framework integrates three components: a Critical Function Filter that isolates malicious PHP function calls, a Context-Aware Code Extraction strategy that captures the most behaviorally indicative code segments, and Weighted Behavioral Function Profiling that enhances in-context learning by prioritizing the most relevant demonstrations based on discriminative function-level profiles. Our results show that, stemming from their distinct analytical strategies, larger LLMs achieve near-perfect precision but lower recall, while smaller models exhibit the opposite trade-off. However, all baseline models lag behind previous SOTA methods. With the application of BFAD, the performance of all LLMs improves significantly, yielding an average F1 score increase of 13.82%. Notably, larger models now outperform SOTA benchmarks, while smaller models such as Qwen-2.5-Coder-3B achieve performance competitive with traditional methods. This work is the first to explore the feasibility and limitations of LLMs for WebShell detection and provides solutions to address the challenges in this task.
comment: Published as a conference paper at COLM 2025
♻ ☆ Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery NeurIPS 2025
Bold claims about AI's role in science-from "AGI will cure all diseases" to promises of radically accelerated discovery-raise a central epistemic question: do large language models (LLMs) truly generate new knowledge, or do they merely remix memorized fragments? We propose unlearning-as-ablation as a falsifiable probe of constructive scientific discovery. The idea is to systematically remove a target result together with its forget-closure (supporting lemmas, paraphrases, and multi-hop entailments) and then evaluate whether the model can re-derive the result from only permitted axioms and tools. Success would indicate generative capability beyond recall; failure would expose current limits. Unlike prevailing motivations for unlearning-privacy, copyright, or safety-our framing repositions it as an epistemic probe for AI-for-Science. We outline a minimal pilot in mathematics and algorithms to illustrate feasibility, and sketch how the same approach could later be extended to domains such as physics or chemistry. This is a position paper: our contribution is conceptual and methodological, not empirical. We aim to stimulate discussion on how principled ablation tests could help distinguish models that reconstruct knowledge from those that merely retrieve it, and how such probes might guide the next generation of AI-for-Science benchmarks.
comment: 6 pages. NeurIPS 2025 AI4Science Workshop submission
♻ ☆ DiffBlender: Composable and Versatile Multimodal Text-to-Image Diffusion Models
In this study, we aim to enhance the capabilities of diffusion-based text-to-image (T2I) generation models by integrating diverse modalities beyond textual descriptions within a unified framework. To this end, we categorize widely used conditional inputs into three modality types: structure, layout, and attribute. We propose a multimodal T2I diffusion model, which is capable of processing all three modalities within a single architecture without modifying the parameters of the pre-trained diffusion model, as only a small subset of components is updated. Our approach sets new benchmarks in multimodal generation through extensive quantitative and qualitative comparisons with existing conditional generation methods. We demonstrate that DiffBlender effectively integrates multiple sources of information and supports diverse applications in detailed image synthesis. The code and demo are available at https://github.com/sungnyun/diffblender.
comment: Expert Systems with Applications 2025
♻ ☆ CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
We introduce CMPhysBench, designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics, as a novel Benchmark. CMPhysBench is composed of more than 520 graduate-level meticulously curated questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, strongly correlated systems, etc. To ensure a deep understanding of the problem-solving process,we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground-truth. Our results show that even the best models, Grok-4, reach only 36 average SEED score and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code anddataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
comment: 29 pages, 7 figures
♻ ☆ Multi-timescale time encoding for CNN prediction of Fenna-Matthews-Olson energy-transfer dynamics
Machine learning simulations of open quantum dynamics often rely on recursive predictors that accumulate error. We develop a non-recursive convolutional neural networks (CNNs) that maps system parameters and a redundant time encoding directly to excitation-energy-transfer populations in the Fenna-Matthews-Olson complex. The encoding-modified logistic plus $\tanh$ functions-normalizes time and resolves fast, transitional, and quasi-steady regimes, while physics-informed labels enforce population conservation and inter-site consistency. Trained only on $0\sim 7 ps$ reference trajectories generated with a Lindblad model in QuTiP, the network accurately predicts $0\sim100 ps$ dynamics across a range of reorganization energies, bath rates, and temperatures. Beyond $20 ps$, the absolute relative error remains below 0.05, demonstrating stable long-time extrapolation. By avoiding step-by-step recursion, the method suppresses error accumulation and generalizes across timescales. These results show that redundant time encoding enables data-efficient inference of long-time quantum dissipative dynamics in realistic pigment-protein complexes, and may aid the data-driven design of light-harvesting materials.
comment: 8 pages, 5figures
♻ ☆ Retrieval Enhanced Feedback via In-context Neural Error-book EMNLP 2025
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE's potential for enhancing multimodal reasoning.
comment: Accepted at EMNLP 2025 main conference
♻ ☆ Apple Intelligence Foundation Language Models: Tech Report 2025
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and services: i a 3B-parameter on-device model optimized for Apple silicon through architectural innovations such as KV-cache sharing and 2-bit quantization-aware training; and ii a scalable server model built on a novel Parallel-Track Mixture-of-Experts PT-MoE transformer that combines track parallelism, mixture-of-experts sparse computation, and interleaved global-local attention to deliver high quality with competitive cost on Apple's Private Cloud Compute platform. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling, licensed corpora, and high-quality synthetic data, then further refined with supervised fine-tuning and reinforcement learning on a new asynchronous platform. The resulting models support several additional languages while understanding images and executing tool calls. In public benchmarks and human evaluations, both the server model and the on-device model match or surpass comparably sized open baselines. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning, allowing developers to integrate these capabilities with a few lines of code. The latest advancements in Apple Intelligence models are grounded in our Responsible AI approach with safeguards like content filtering and locale-specific evaluation, as well as our commitment to protecting our users' privacy with innovations like Private Cloud Compute.
♻ ☆ Evaluating DNA function understanding in genomic language models using evolutionarily implausible sequences
Genomic language models (gLMs) hold promise for generating novel, functional DNA sequences for synthetic biology. However, realizing this potential requires models to go beyond evolutionary plausibility and understand how DNA sequence encodes gene expression and regulation. We introduce a benchmark called Nullsettes, which assesses how well models can predict in silico loss-of-function (LOF) mutations, in synthetic expression cassettes with little evolutionary precedent. Testing 12 state-of-the-art gLMs, we find that most fail to consistently detect these strong LOF mutations. All models show a sharp drop in predictive accuracy as the likelihood assigned to the original (nonmutant) sequence decreases, suggesting that gLMs rely heavily on pattern-matching to their evolutionary prior rather than on any mechanistic understanding of gene expression. Our findings highlight fundamental limitations in how gLMs generalize to engineered, non-natural sequences, and underscore the need for benchmarks and modeling strategies that prioritize functional understanding.
comment: 19 pages, 5 figures
♻ ☆ Spectra-to-Structure and Structure-to-Spectra Inference Across the Periodic Table
X-ray Absorption Spectroscopy (XAS) is a powerful technique for probing local atomic environments, yet its interpretation remains limited by the need for expert-driven analysis, computationally expensive simulations, and element-specific heuristics. Recent advances in machine learning have shown promise for accelerating XAS interpretation, but many existing models are narrowly focused on specific elements, edge types, or spectral regimes. In this work, we present XAStruct, a learning-based system capable of both predicting XAS spectra from crystal structures and inferring local structural descriptors from XAS input. XAStruct is trained on a large-scale dataset spanning over 70 elements across the periodic table, enabling generalization to a wide variety of chemistries and bonding environments. The framework includes the first machine learning approach for predicting neighbor atom types directly from XAS spectra, as well as a generalizable regression model for mean nearest-neighbor distance that requires no element-specific tuning. By combining deep neural networks for complex structure property mappings with efficient baseline models for simpler tasks, XAStruct offers a scalable and extensible solution for data-driven XAS analysis and local structure inference. The source code will be released upon paper acceptance.
♻ ☆ Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings EMNLP 2025
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with these fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
comment: Accepted to EMNLP 2025 Findings
♻ ☆ Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements EMNLP 2025
In this paper, we propose a ``Generalization Stress Test" to assess Large Language Models' (LLMs) generalization ability under slight and controlled perturbations, including option length, problem types, and irrelevant noun replacements. We achieve novel and significant findings that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B's MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and irrelevant content shifts.
comment: EMNLP 2025 Main Conference
Multimedia 4
☆ Lossless 4:2:0 Screen Content Coding Using Luma-Guided Soft Context Formation
The soft context formation coder is a pixel-wise state-of-the-art lossless screen content coder using pattern matching and color palette coding in combination with arithmetic coding. It achieves excellent compression performance on screen content images in RGB 4:4:4 format with few distinct colors. In contrast to many other lossless compression methods, it codes entire color pixels at once, i.e., all color components of one pixel are coded together. Consequently, it does not natively support image formats with downsampled chroma, such as YCbCr 4:2:0, which is an often used chroma format in video compression. In this paper, we extend the soft context formation coding capabilities to 4:2:0 image compression, by successively coding Y and CbCr planes based on an analysis of normalized mutual information between image planes. Additionally, we propose an enhancement to the chroma prediction based on the luminance plane. Furthermore, we propose to transmit side-information about occurring luma-chroma combinations to improve chroma probability distribution modelling. Averaged over a large screen content image dataset, our proposed method outperforms HEVC-SCC, with HEVC-SCC needing 5.66% more bitrate compared to our method.
comment: 5 pages, 4 figures, 3 tables, accepted to EUSIPCO 2025
☆ AniME: Adaptive Multi-Agent Planning for Long Animation Generation
We present AniME, a director-oriented multi-agent system for automated long-form anime production, covering the full workflow from a story to the final video. The director agent keeps a global memory for the whole workflow, and coordinates several downstream specialized agents. By integrating customized Model Context Protocol (MCP) with downstream model instruction, the specialized agent adaptively selects control conditions for diverse sub-tasks. AniME produces cinematic animation with consistent characters and synchronized audio visual elements, offering a scalable solution for AI-driven anime creation.
comment: 2 pages, Technical Report
☆ Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion
Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature fusion, a novel AVSR framework that adaptively reweights audio and visual features based on token-level acoustic corruption scores. Using an audio-visual feature fusion-based router, our method down-weights unreliable audio tokens and reinforces visual cues through gated cross-attention in each decoder layer. This enables the model to pivot toward the visual modality when audio quality deteriorates. Experiments on LRS3 demonstrate that our approach achieves an 16.51-42.67% relative reduction in word error rate compared to AV-HuBERT. Ablation studies confirm that both the router and gating mechanism contribute to improved robustness under real-world acoustic noise.
comment: Accepted to IEEE ASRU 2025
☆ Tailored Teaching with Balanced Difficulty: Elevating Reasoning in Multimodal Chain-of-Thought via Prompt Curriculum
The effectiveness of Multimodal Chain-of-Thought (MCoT) prompting is often limited by the use of randomly or manually selected examples. These examples fail to account for both model-specific knowledge distributions and the intrinsic complexity of the tasks, resulting in suboptimal and unstable model performance. To address this, we propose a novel framework inspired by the pedagogical principle of "tailored teaching with balanced difficulty". We reframe prompt selection as a prompt curriculum design problem: constructing a well ordered set of training examples that align with the model's current capabilities. Our approach integrates two complementary signals: (1) model-perceived difficulty, quantified through prediction disagreement in an active learning setup, capturing what the model itself finds challenging; and (2) intrinsic sample complexity, which measures the inherent difficulty of each question-image pair independently of any model. By jointly analyzing these signals, we develop a difficulty-balanced sampling strategy that ensures the selected prompt examples are diverse across both dimensions. Extensive experiments conducted on five challenging benchmarks and multiple popular Multimodal Large Language Models (MLLMs) demonstrate that our method yields substantial and consistent improvements and greatly reduces performance discrepancies caused by random sampling, providing a principled and robust approach for enhancing multimodal reasoning.
Computation and Language 130
☆ MIRAGE: Scaling Test-Time Inference with Parallel Graph-Retrieval-Augmented Reasoning Chains AAAI 2026
Large reasoning models (LRMs) have shown significant progress in test-time scaling through chain-of-thought prompting. Current approaches like search-o1 integrate retrieval augmented generation (RAG) into multi-step reasoning processes but rely on a single, linear reasoning chain while incorporating unstructured textual information in a flat, context-agnostic manner. As a result, these approaches can lead to error accumulation throughout the reasoning chain, which significantly limits its effectiveness in medical question-answering (QA) tasks where both accuracy and traceability are critical requirements. To address these challenges, we propose MIRAGE (Multi-chain Inference with Retrieval-Augmented Graph Exploration), a novel test-time scalable reasoning framework that performs dynamic multi-chain inference over structured medical knowledge graphs. Specifically, MIRAGE 1) decomposes complex queries into entity-grounded sub-questions, 2) executes parallel inference chains, 3) retrieves evidence adaptively via neighbor expansion and multi-hop traversal, and 4) integrates answers using cross-chain verification to resolve contradictions. Experiments on three medical QA benchmarks (GenMedGPT-5k, CMCQA, and ExplainCPE) show that MIRAGE consistently outperforms GPT-4o, Tree-of-Thought variants, and other retrieval-augmented baselines in both automatic and human evaluations. Additionally, MIRAGE improves interpretability by generating explicit reasoning chains that trace each factual claim to concrete chains within the knowledge graph, making it well-suited for complex medical reasoning scenarios. The code will be available for further research.
comment: 10 pages, 8 figures (including tables), plus appendix. Submitted to AAAI 2026
☆ From BERT to LLMs: Comparing and Understanding Chinese Classifier Prediction in Language Models
Classifiers are an important and defining feature of the Chinese language, and their correct prediction is key to numerous educational applications. Yet, whether the most popular Large Language Models (LLMs) possess proper knowledge the Chinese classifiers is an issue that has largely remain unexplored in the Natural Language Processing (NLP) literature. To address such a question, we employ various masking strategies to evaluate the LLMs' intrinsic ability, the contribution of different sentence elements, and the working of the attention mechanisms during prediction. Besides, we explore fine-tuning for LLMs to enhance the classifier performance. Our findings reveal that LLMs perform worse than BERT, even with fine-tuning. The prediction, as expected, greatly benefits from the information about the following noun, which also explains the advantage of models with a bidirectional attention mechanism such as BERT.
☆ Demographic Biases and Gaps in the Perception of Sexism in Large Language Models
The use of Large Language Models (LLMs) has proven to be a tool that could help in the automatic detection of sexism. Previous studies have shown that these models contain biases that do not accurately reflect reality, especially for minority groups. Despite various efforts to improve the detection of sexist content, this task remains a significant challenge due to its subjective nature and the biases present in automated models. We explore the capabilities of different LLMs to detect sexism in social media text using the EXIST 2024 tweet dataset. It includes annotations from six distinct profiles for each tweet, allowing us to evaluate to what extent LLMs can mimic these groups' perceptions in sexism detection. Additionally, we analyze the demographic biases present in the models and conduct a statistical analysis to identify which demographic characteristics (age, gender) contribute most effectively to this task. Our results show that, while LLMs can to some extent detect sexism when considering the overall opinion of populations, they do not accurately replicate the diversity of perceptions among different demographic groups. This highlights the need for better-calibrated models that account for the diversity of perspectives across different populations.
comment: This work was presented as a poster at the Latin American Meeting in Artificial Intelligence KHIPU 2025, Santiago, Chile, March 10th - 14th 2025, https://khipu.ai/khipu2025/poster-sessions-2025/
☆ Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries
The emergence of LM-based judging reward modeling, represented by generative reward models, has successfully made reinforcement learning from AI feedback (RLAIF) efficient and scalable. To further advance this paradigm, we propose a core insight: this form of reward modeling shares fundamental formal consistency with natural language inference (NLI), a core task in natural language understanding. This reframed perspective points to a key path for building superior reward models: scaling the model's comprehension boundaries. Pursuing this path, exploratory experiments on NLI tasks demonstrate that the slot prediction masked language models (MLMs) incorporating contextual explanations achieve significantly better performance compared to mainstream autoregressive models. Based on this key finding, we propose ESFP-RM, a two-stage LM-based judging reward model that utilizes an explanation based slot framework for prediction to fully leverage the advantages of MLMs. Extensive experiments demonstrate that in both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals compared to generative reward models.
☆ Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Synthetic transcript generation is critical in contact center domains, where privacy and data scarcity limit model training and evaluation. Unlike prior synthetic dialogue generation work on open-domain or medical dialogues, contact center conversations are goal-oriented, role-asymmetric, and behaviorally complex, featuring disfluencies, ASR noise, and compliance-driven agent actions. In deployments where transcripts are unavailable, standard pipelines still yield derived call attributes such as Intent Summaries, Topic Flow, and QA Evaluation Forms. We leverage these as supervision signals to guide generation. To assess the quality of such outputs, we introduce a diagnostic framework of 18 linguistically and behaviorally grounded metrics for comparing real and synthetic transcripts. We benchmark four language-agnostic generation strategies, from simple prompting to characteristic-aware multi-stage approaches, alongside reference-free baselines. Results reveal persistent challenges: no method excels across all traits, with notable deficits in disfluency, sentiment, and behavioral realism. Our diagnostic tool exposes these gaps, enabling fine-grained evaluation and stress testing of synthetic dialogue across languages.
☆ Exploring the Interplay between Musical Preferences and Personality through the Lens of Language
Music serves as a powerful reflection of individual identity, often aligning with deeper psychological traits. Prior research has established correlations between musical preferences and personality traits, while separate studies have demonstrated that personality is detectable through linguistic analysis. Our study bridges these two research domains by investigating whether individuals' musical preferences are recognizable in their spontaneous language through the lens of the Big Five personality traits (Openness, Conscientiousness, Extroversion, Agreeableness, and Neuroticism). Using a carefully curated dataset of over 500,000 text samples from nearly 5,000 authors with reliably identified musical preferences, we build advanced models to assess personality characteristics. Our results reveal significant personality differences across fans of five musical genres. We release resources for future research at the intersection of computational linguistics, music psychology and personality analysis.
☆ Unraveling the cognitive patterns of Large Language Models through module communities
Large Language Models (LLMs) have reshaped our world with significant advancements in science, engineering, and society through applications ranging from scientific discoveries and medical diagnostics to Chatbots. Despite their ubiquity and utility, the underlying mechanisms of LLM remain concealed within billions of parameters and complex structures, making their inner architecture and cognitive processes challenging to comprehend. We address this gap by adopting approaches to understanding emerging cognition in biology and developing a network-based framework that links cognitive skills, LLM architectures, and datasets, ushering in a paradigm shift in foundation model analysis. The skill distribution in the module communities demonstrates that while LLMs do not strictly parallel the focalized specialization observed in specific biological systems, they exhibit unique communities of modules whose emergent skill patterns partially mirror the distributed yet interconnected cognitive organization seen in avian and small mammalian brains. Our numerical results highlight a key divergence from biological systems to LLMs, where skill acquisition benefits substantially from dynamic, cross-regional interactions and neural plasticity. By integrating cognitive science principles with machine learning, our framework provides new insights into LLM interpretability and suggests that effective fine-tuning strategies should leverage distributed learning dynamics rather than rigid modular interventions.
☆ Leveraging Large Language Models for Accurate Sign Language Translation in Low-Resource Scenarios
Translating natural languages into sign languages is a highly complex and underexplored task. Despite growing interest in accessibility and inclusivity, the development of robust translation systems remains hindered by the limited availability of parallel corpora which align natural language with sign language data. Existing methods often struggle to generalize in these data-scarce environments, as the few datasets available are typically domain-specific, lack standardization, or fail to capture the full linguistic richness of sign languages. To address this limitation, we propose Advanced Use of LLMs for Sign Language Translation (AulSign), a novel method that leverages Large Language Models via dynamic prompting and in-context learning with sample selection and subsequent sign association. Despite their impressive abilities in processing text, LLMs lack intrinsic knowledge of sign languages; therefore, they are unable to natively perform this kind of translation. To overcome this limitation, we associate the signs with compact descriptions in natural language and instruct the model to use them. We evaluate our method on both English and Italian languages using SignBank+, a recognized benchmark in the field, as well as the Italian LaCAM CNR-ISTC dataset. We demonstrate superior performance compared to state-of-the-art models in low-data scenario. Our findings demonstrate the effectiveness of AulSign, with the potential to enhance accessibility and inclusivity in communication technologies for underrepresented linguistic communities.
☆ Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation
Retrieval-augmented generation (RAG) has become a widely recognized paradigm to combine parametric memory with non-parametric memories. An RAG model consists of two serial connecting components (retriever and generator). A major challenge in end-to-end optimization of the RAG model is that marginalization over relevant passages (modeled as discrete latent variables) from a knowledge base is required. Traditional top-K marginalization and variational RAG (VRAG) suffer from biased or high-variance gradient estimates. In this paper, we propose and develop joint stochastic approximation (JSA) based end-to-end training of RAG, which is referred to as JSA-RAG. The JSA algorithm is a stochastic extension of the EM (expectation-maximization) algorithm and is particularly powerful in estimating discrete latent variable models. Extensive experiments are conducted on five datasets for two tasks (open-domain question answering, knowledge-grounded dialogs) and show that JSA-RAG significantly outperforms both vanilla RAG and VRAG. Further analysis shows the efficacy of JSA-RAG from the perspectives of generation, retrieval, and low-variance gradient estimate.
☆ DiscussLLM: Teaching Large Language Models When to Speak
Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like text, yet they largely operate as reactive agents, responding only when directly prompted. This passivity creates an "awareness gap," limiting their potential as truly collaborative partners in dynamic human discussions. We introduce $\textit{DiscussLLM}$, a framework designed to bridge this gap by training models to proactively decide not just $\textit{what}$ to say, but critically, $\textit{when}$ to speak. Our primary contribution is a scalable two-stage data generation pipeline that synthesizes a large-scale dataset of realistic multi-turn human discussions. Each discussion is annotated with one of five intervention types (e.g., Factual Correction, Concept Definition) and contains an explicit conversational trigger where an AI intervention adds value. By training models to predict a special silent token when no intervention is needed, they learn to remain quiet until a helpful contribution can be made. We explore two architectural baselines: an integrated end-to-end model and a decoupled classifier-generator system optimized for low-latency inference. We evaluate these models on their ability to accurately time interventions and generate helpful responses, paving the way for more situationally aware and proactive conversational AI.
☆ S2Sent: Nested Selectivity Aware Sentence Representation Learning
The combination of Transformer-based encoders with contrastive learning represents the current mainstream paradigm for sentence representation learning. This paradigm is typically based on the hidden states of the last Transformer block of the encoder. However, within Transformer-based encoders, different blocks exhibit varying degrees of semantic perception ability. From the perspective of interpretability, the semantic perception potential of knowledge neurons is modulated by stimuli, thus rational cross-block representation fusion is a direction worth optimizing. To balance the semantic redundancy and loss across block fusion, we propose a sentence representation selection mechanism S\textsuperscript{2}Sent, which integrates a parameterized nested selector downstream of the Transformer-based encoder. This selector performs spatial selection (SS) and nested frequency selection (FS) from a modular perspective. The SS innovatively employs a spatial squeeze based self-gating mechanism to obtain adaptive weights, which not only achieves fusion with low information redundancy but also captures the dependencies between embedding features. The nested FS replaces GAP with different DCT basis functions to achieve spatial squeeze with low semantic loss. Extensive experiments have demonstrated that S\textsuperscript{2}Sent achieves significant improvements over baseline methods with negligible additional parameters and inference latency, while highlighting high integrability and scalability.
☆ HLLM-Creator: Hierarchical LLM-based Personalized Creative Generation
AI-generated content technologies are widely used in content creation. However, current AIGC systems rely heavily on creators' inspiration, rarely generating truly user-personalized content. In real-world applications such as online advertising, a single product may have multiple selling points, with different users focusing on different features. This underscores the significant value of personalized, user-centric creative generation. Effective personalized content generation faces two main challenges: (1) accurately modeling user interests and integrating them into the content generation process while adhering to factual constraints, and (2) ensuring high efficiency and scalability to handle the massive user base in industrial scenarios. Additionally, the scarcity of personalized creative data in practice complicates model training, making data construction another key hurdle. We propose HLLM-Creator, a hierarchical LLM framework for efficient user interest modeling and personalized content generation. During inference, a combination of user clustering and a user-ad-matching-prediction based pruning strategy is employed to significantly enhance generation efficiency and reduce computational overhead, making the approach suitable for large-scale deployment. Moreover, we design a data construction pipeline based on chain-of-thought reasoning, which generates high-quality, user-specific creative titles and ensures factual consistency despite limited personalized data. This pipeline serves as a critical foundation for the effectiveness of our model. Extensive experiments on personalized title generation for Douyin Search Ads show the effectiveness of HLLM-Creator. Online A/B test shows a 0.476% increase on Adss, paving the way for more effective and efficient personalized generation in industrial scenarios. Codes for academic dataset are available at https://github.com/bytedance/HLLM.
☆ The AI Data Scientist
Imagine decision-makers uploading data and, within minutes, receiving clear, actionable insights delivered straight to their fingertips. That is the promise of the AI Data Scientist, an autonomous Agent powered by large language models (LLMs) that closes the gap between evidence and action. Rather than simply writing code or responding to prompts, it reasons through questions, tests ideas, and delivers end-to-end insights at a pace far beyond traditional workflows. Guided by the scientific tenet of the hypothesis, this Agent uncovers explanatory patterns in data, evaluates their statistical significance, and uses them to inform predictive modeling. It then translates these results into recommendations that are both rigorous and accessible. At the core of the AI Data Scientist is a team of specialized LLM Subagents, each responsible for a distinct task such as data cleaning, statistical testing, validation, and plain-language communication. These Subagents write their own code, reason about causality, and identify when additional data is needed to support sound conclusions. Together, they achieve in minutes what might otherwise take days or weeks, enabling a new kind of interaction that makes deep data science both accessible and actionable.
☆ SentiMM: A Multimodal Multi-Agent Framework for Sentiment Analysis in Social Media
With the increasing prevalence of multimodal content on social media, sentiment analysis faces significant challenges in effectively processing heterogeneous data and recognizing multi-label emotions. Existing methods often lack effective cross-modal fusion and external knowledge integration. We propose SentiMM, a novel multi-agent framework designed to systematically address these challenges. SentiMM processes text and visual inputs through specialized agents, fuses multimodal features, enriches context via knowledge retrieval, and aggregates results for final sentiment classification. We also introduce SentiMMD, a large-scale multimodal dataset with seven fine-grained sentiment categories. Extensive experiments demonstrate that SentiMM achieves superior performance compared to state-of-the-art baselines, validating the effectiveness of our structured approach.
☆ Detecting and Characterizing Planning in Language Models
Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal: unlike Haiku, Gemma-2-2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions. We further show that instruction tuning refines existing planning behaviors in the base model rather than creating them from scratch. Together, these studies provide a reproducible and scalable foundation for mechanistic studies of planning in LLMs.
comment: 9 pages, 4 figures
☆ Agri-Query: A Case Study on RAG vs. Long-Context LLMs for Cross-Lingual Technical Question Answering
We present a case study evaluating large language models (LLMs) with 128K-token context windows on a technical question answering (QA) task. Our benchmark is built on a user manual for an agricultural machine, available in English, French, and German. It simulates a cross-lingual information retrieval scenario where questions are posed in English against all three language versions of the manual. The evaluation focuses on realistic "needle-in-a-haystack" challenges and includes unanswerable questions to test for hallucinations. We compare nine long-context LLMs using direct prompting against three Retrieval-Augmented Generation (RAG) strategies (keyword, semantic, hybrid), with an LLM-as-a-judge for evaluation. Our findings for this specific manual show that Hybrid RAG consistently outperforms direct long-context prompting. Models like Gemini 2.5 Flash and the smaller Qwen 2.5 7B achieve high accuracy (over 85%) across all languages with RAG. This paper contributes a detailed analysis of LLM performance in a specialized industrial domain and an open framework for similar evaluations, highlighting practical trade-offs and challenges.
☆ Speech-Based Depressive Mood Detection in the Presence of Multiple Sclerosis: A Cross-Corpus and Cross-Lingual Study SP 2025
Depression commonly co-occurs with neurodegenerative disorders like Multiple Sclerosis (MS), yet the potential of speech-based Artificial Intelligence for detecting depression in such contexts remains unexplored. This study examines the transferability of speech-based depression detection methods to people with MS (pwMS) through cross-corpus and cross-lingual analysis using English data from the general population and German data from pwMS. Our approach implements supervised machine learning models using: 1) conventional speech and language features commonly used in the field, 2) emotional dimensions derived from a Speech Emotion Recognition (SER) model, and 3) exploratory speech feature analysis. Despite limited data, our models detect depressive mood in pwMS with moderate generalisability, achieving a 66% Unweighted Average Recall (UAR) on a binary task. Feature selection further improved performance, boosting UAR to 74%. Our findings also highlight the relevant role emotional changes have as an indicator of depressive mood in both the general population and within PwMS. This study provides an initial exploration into generalising speech-based depression detection, even in the presence of co-occurring conditions, such as neurodegenerative diseases.
comment: Accepted at the 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025). To be appeared in the corresponding Proceedings at ACL Anthology
☆ Named Entity Recognition of Historical Text via Large Language Model
Large language models have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying proper names in text, such as people, organizations, locations, dates, and other specific entities. NER plays a crucial role in extracting information from unstructured textual data, enabling downstream applications such as information retrieval from unstructured text. Traditionally, NER is addressed using supervised machine learning approaches, which require large amounts of annotated training data. However, historical texts present a unique challenge, as the annotated datasets are often scarce or nonexistent, due to the high cost and expertise required for manual labeling. In addition, the variability and noise inherent in historical language, such as inconsistent spelling and archaic vocabulary, further complicate the development of reliable NER systems for these sources. In this study, we explore the feasibility of applying LLMs to NER in historical documents using zero-shot and few-shot prompting strategies, which require little to no task-specific training data. Our experiments, conducted on the HIPE-2022 (Identifying Historical People, Places and other Entities) dataset, show that LLMs can achieve reasonably strong performance on NER tasks in this setting. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising. These findings suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora, where traditional supervised methods are infeasible.
☆ How Quantization Shapes Bias in Large Language Models
This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
☆ Neither Valid nor Reliable? Investigating the Use of LLMs as Judges
Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aims to be general-purpose. Recently, large language models as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible evaluation practices in LLJs evaluation, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
comment: Prepared for conference submission
☆ Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters SP 2025
In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language, in which the target voice has no recordings therein. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model's speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides insights into the impact of adapter placement, configuration and the number of speakers used.
comment: Accepted at IEEE MLSP 2025
☆ A Retail-Corpus for Aspect-Based Sentiment Analysis with Large Language Models SP 2025
Aspect-based sentiment analysis enhances sentiment detection by associating it with specific aspects, offering deeper insights than traditional sentiment analysis. This study introduces a manually annotated dataset of 10,814 multilingual customer reviews covering brick-and-mortar retail stores, labeled with eight aspect categories and their sentiment. Using this dataset, the performance of GPT-4 and LLaMA-3 in aspect based sentiment analysis is evaluated to establish a baseline for the newly introduced data. The results show both models achieving over 85% accuracy, while GPT-4 outperforms LLaMA-3 overall with regard to all relevant metrics.
comment: Accepted at ICNLSP 2025
☆ German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We opensource both the dataset and the model to encourage further research on multi-level paraphrasing
comment: Accepted to INLG 2025
☆ Understanding Subword Compositionality of Large Language Models EMNLP 2025
Large language models (LLMs) take sequences of subwords as input, requiring them to effective compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likely reflecting difference in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) great performance when probing layer by layer their sensitivity to semantic decompositionality; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional pattens in how LLMs encode and integrate subword information.
comment: EMNLP 2025 Main
☆ Debiasing Multilingual LLMs in Cross-lingual Latent Space EMNLP 2025
Debiasing techniques such as SentDebias aim to reduce bias in large language models (LLMs). Previous studies have evaluated their cross-lingual transferability by directly applying these methods to LLM representations, revealing their limited effectiveness across languages. In this work, we therefore propose to perform debiasing in a joint latent space rather than directly on LLM representations. We construct a well-aligned cross-lingual latent space using an autoencoder trained on parallel TED talk scripts. Our experiments with Aya-expanse and two debiasing techniques across four languages (English, French, German, Dutch) demonstrate that a) autoencoders effectively construct a well-aligned cross-lingual latent space, and b) applying debiasing techniques in the learned cross-lingual latent space significantly improves both the overall debiasing performance and cross-lingual transferability.
comment: EMNLP 2025 Main
☆ AMELIA: A Family of Multi-task End-to-end Language Models for Argumentation
Argument mining is a subfield of argumentation that aims to automatically extract argumentative structures and their relations from natural language texts. This paper investigates how a single large language model can be leveraged to perform one or several argument mining tasks. Our contributions are two-fold. First, we construct a multi-task dataset by surveying and converting 19 well-known argument mining datasets from the literature into a unified format. Second, we explore various training strategies using Meta AI's Llama-3.1-8B-Instruct model: (1) fine-tuning on individual tasks, (2) fine-tuning jointly on multiple tasks, and (3) merging models fine-tuned separately on individual tasks. Our experiments show that task-specific fine-tuning significantly improves individual performance across all tasks. Moreover, multi-task fine-tuning maintains strong performance without degradation, suggesting effective transfer learning across related tasks. Finally, we demonstrate that model merging offers a viable compromise: it yields competitive performance while mitigating the computational costs associated with full multi-task fine-tuning.
☆ Feature-Refined Unsupervised Model for Loanword Detection
We propose an unsupervised method for detecting loanwords i.e., words borrowed from one language into another. While prior work has primarily relied on language-external information to identify loanwords, such approaches can introduce circularity and constraints into the historical linguistics workflow. In contrast, our model relies solely on language-internal information to process both native and borrowed words in monolingual and multilingual wordlists. By extracting pertinent linguistic features, scoring them, and mapping them probabilistically, we iteratively refine initial results by identifying and generalizing from emerging patterns until convergence. This hybrid approach leverages both linguistic and statistical cues to guide the discovery process. We evaluate our method on the task of isolating loanwords in datasets from six standard Indo-European languages: English, German, French, Italian, Spanish, and Portuguese. Experimental results demonstrate that our model outperforms baseline methods, with strong performance gains observed when scaling to cross-linguistic data.
☆ Information availability in different languages and various technological constraints related to multilinguism on the Internet
The usage of Internet has grown exponentially over the last two decades. The number of Internet users has grown from 16 Million to 1650 Million from 1995 to 2010. It has become a major repository of information catering almost every area. Since the Internet has its origin in USA which is English speaking country there is huge dominance of English on the World Wide Web. Although English is a globally acceptable language, still there is a huge population in the world which is not able to access the Internet due to language constraints. It has been estimated that only 20-25% of the world population speaks English as a native language. More and more people are accessing the Internet nowadays removing the cultural and linguistic barriers and hence there is a high growth in the number of non-English speaking users over the last few years on the Internet. Although many solutions have been provided to remove the linguistic barriers, still there is a huge gap to be filled. This paper attempts to analyze the need of information availability in different languages and the various technological constraints related to multi-linguism on the Internet.
comment: International Journal of Computer Applications
☆ Evaluating the Representation of Vowels in Wav2Vec Feature Extractor: A Layer-Wise Analysis Using MFCCs
Automatic Speech Recognition has advanced with self-supervised learning, enabling feature extraction directly from raw audio. In Wav2Vec, a CNN first transforms audio into feature vectors before the transformer processes them. This study examines CNN-extracted information for monophthong vowels using the TIMIT corpus. We compare MFCCs, MFCCs with formants, and CNN activations by training SVM classifiers for front-back vowel identification, assessing their classification accuracy to evaluate phonetic representation.
☆ Pandora: Leveraging Code-driven Knowledge Transfer for Unified Structured Knowledge Reasoning
Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely on task-specific strategies or bespoke representations, which hinder their ability to dismantle barriers between different SKR tasks, thereby constraining their overall performance in cross-task scenarios. In this paper, we introduce \textsc{Pandora}, a novel USKR framework that addresses the limitations of existing methods by leveraging two key innovations. First, we propose a code-based unified knowledge representation using \textsc{Python}'s \textsc{Pandas} API, which aligns seamlessly with the pre-training of LLMs. This representation facilitates a cohesive approach to handling different structured knowledge sources. Building on this foundation, we employ knowledge transfer to bolster the unified reasoning process of LLMs by automatically building cross-task memory. By adaptively correcting reasoning using feedback from code execution, \textsc{Pandora} showcases impressive unified reasoning capabilities. Extensive experiments on six widely used benchmarks across three SKR tasks demonstrate that \textsc{Pandora} outperforms existing unified reasoning frameworks and competes effectively with task-specific methods.
☆ Designing Practical Models for Isolated Word Visual Speech Recognition
Visual speech recognition (VSR) systems decode spoken words from an input sequence using only the video data. Practical applications of such systems include medical assistance as well as human-machine interactions. A VSR system is typically employed in a complementary role in cases where the audio is corrupt or not available. In order to accurately predict the spoken words, these architectures often rely on deep neural networks in order to extract meaningful representations from the input sequence. While deep architectures achieve impressive recognition performance, relying on such models incurs significant computation costs which translates into increased resource demands in terms of hardware requirements and results in limited applicability in real-world scenarios where resources might be constrained. This factor prevents wider adoption and deployment of speech recognition systems in more practical applications. In this work, we aim to alleviate this issue by developing architectures for VSR that have low hardware costs. Following the standard two-network design paradigm, where one network handles visual feature extraction and another one utilizes the extracted features to classify the entire sequence, we develop lightweight end-to-end architectures by first benchmarking efficient models from the image classification literature, and then adopting lightweight block designs in a temporal convolution network backbone. We create several unified models with low resource requirements but strong recognition performance. Experiments on the largest public database for English words demonstrate the effectiveness and practicality of our developed models. Code and trained models will be made publicly available.
comment: Double-column format, 13 pages with references, 2 figures
☆ ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models
Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and full key cache in that specified layer. In particular, we propose a multi-pooling kernels allocating strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in the long context scenarios. Without additional post training or operator development, ILRe can process a single $1M$ tokens request in less than half a minute (speedup $\approx 180\times$) and scores RULER-$1M$ benchmark of $\approx 79.8$ with model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
☆ Speech Discrete Tokens or Continuous Features? A Comparative Analysis for Spoken Language Understanding in SpeechLLMs EMNLP 2025
With the rise of Speech Large Language Models (SpeechLLMs), two dominant approaches have emerged for speech processing: discrete tokens and continuous features. Each approach has demonstrated strong capabilities in audio-related processing tasks. However, the performance gap between these two paradigms has not been thoroughly explored. To address this gap, we present a fair comparison of self-supervised learning (SSL)-based discrete and continuous features under the same experimental settings. We evaluate their performance across six spoken language understanding-related tasks using both small and large-scale LLMs (Qwen1.5-0.5B and Llama3.1-8B). We further conduct in-depth analyses, including efficient comparison, SSL layer analysis, LLM layer analysis, and robustness comparison. Our findings reveal that continuous features generally outperform discrete tokens in various tasks. Each speech processing method exhibits distinct characteristics and patterns in how it learns and processes speech information. We hope our results will provide valuable insights to advance spoken language understanding in SpeechLLMs.
comment: Accepted to EMNLP 2025 Main Conference
☆ Beyond Demographics: Enhancing Cultural Value Survey Simulation with Multi-Stage Personality-Driven Cognitive Reasoning EMNLP 2025
Introducing MARK, the Multi-stAge Reasoning frameworK for cultural value survey response simulation, designed to enhance the accuracy, steerability, and interpretability of large language models in this task. The system is inspired by the type dynamics theory in the MBTI psychological framework for personality research. It effectively predicts and utilizes human demographic information for simulation: life-situational stress analysis, group-level personality prediction, and self-weighted cognitive imitation. Experiments on the World Values Survey show that MARK outperforms existing baselines by 10% accuracy and reduces the divergence between model predictions and human preferences. This highlights the potential of our framework to improve zero-shot personalization and help social scientists interpret model predictions.
comment: 23 pages, 6 figures, accepted to EMNLP 2025 main
☆ DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models
Reasoning large language models (RLLMs), such as OpenAI-O3 and DeepSeek-R1, have recently demonstrated remarkable capabilities by performing structured and multi-step reasoning. However, recent studies reveal that RLLMs often suffer from overthinking, i.e., producing unnecessarily lengthy reasoning chains even for simple questions, leading to excessive token consumption and computational inefficiency. Interestingly, we observe that when processing multiple questions in batch mode, RLLMs exhibit more resource-efficient behavior by dynamically compressing reasoning steps for easier problems, due to implicit resource competition. Inspired by this, we propose Dynamic Reasoning Quota Allocation (DRQA), a novel method that transfers the benefits of resource competition from batch processing to single-question inference. Specifically, DRQA leverages batch-generated preference data and reinforcement learning to train the model to allocate reasoning resources adaptively. By encouraging the model to internalize a preference for responses that are both accurate and concise, DRQA enables it to generate concise answers for simple questions while retaining sufficient reasoning depth for more challenging ones. Extensive experiments on a wide range of mathematical and scientific reasoning benchmarks demonstrate that DRQA significantly reduces token usage while maintaining, and in many cases improving, answer accuracy. By effectively mitigating the overthinking problem, DRQA offers a promising direction for more efficient and scalable deployment of RLLMs, and we hope it inspires further exploration into fine-grained control of reasoning behaviors.
☆ Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation SC 2025
Contextual automatic speech recognition (ASR) systems allow for recognizing out-of-vocabulary (OOV) words, such as named entities or rare words. However, it remains challenging due to limited training data and ambiguous or inconsistent pronunciations. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we leverage text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix-trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. After which, any recognized variant is mapped back to the original rare word in the final transcription. The evaluation results on the Librispeech dataset show that our method reduces biased word error rate (WER) by 42% on test-clean and 43% on test-other while maintaining unbiased WER essentially unchanged.
comment: Accepted to APSIPA ASC 2025
☆ Proximal Supervised Fine-Tuning
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT). This fine-tuning objective incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.
☆ Speculating LLMs' Chinese Training Data Pollution from Their Tokens
Tokens are basic elements in the datasets for LLM training. It is well-known that many tokens representing Chinese phrases in the vocabulary of GPT (4o/4o-mini/o1/o3/4.5/4.1/o4-mini) are indicating contents like pornography or online gambling. Based on this observation, our goal is to locate Polluted Chinese (PoC) tokens in LLMs and study the relationship between PoC tokens' existence and training data. (1) We give a formal definition and taxonomy of PoC tokens based on the GPT's vocabulary. (2) We build a PoC token detector via fine-tuning an LLM to label PoC tokens in vocabularies by considering each token's both semantics and related contents from the search engines. (3) We study the speculation on the training data pollution via PoC tokens' appearances (token ID). Experiments on GPT and other 23 LLMs indicate that tokens widely exist while GPT's vocabulary behaves the worst: more than 23% long Chinese tokens (i.e., a token with more than two Chinese characters) are either porn or online gambling. We validate the accuracy of our speculation method on famous pre-training datasets like C4 and Pile. Then, considering GPT-4o, we speculate that the ratio of "Yui Hatano" related webpages in GPT-4o's training data is around 0.5%.
☆ ISACL: Internal State Analyzer for Copyrighted Training Data Leakage
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but pose risks of inadvertently exposing copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address these leaks only after content is generated, which can lead to the exposure of sensitive information. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. By using a curated dataset of copyrighted materials, we trained a neural network classifier to identify risks, allowing for early intervention by stopping the generation process or altering outputs to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements while enhancing data privacy and ethical standards. Our results show that analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows, ensuring compliance with copyright regulations while maintaining high-quality text generation. The implementation is available on GitHub.\footnote{https://github.com/changhu73/Internal_states_leakage}
☆ CEIDM: A Controlled Entity and Interaction Diffusion Model for Enhanced Text-to-Image Generation
In Text-to-Image (T2I) generation, the complexity of entities and their intricate interactions pose a significant challenge for T2I method based on diffusion model: how to effectively control entity and their interactions to produce high-quality images. To address this, we propose CEIDM, a image generation method based on diffusion model with dual controls for entity and interaction. First, we propose an entity interactive relationships mining approach based on Large Language Models (LLMs), extracting reasonable and rich implicit interactive relationships through chain of thought to guide diffusion models to generate high-quality images that are closer to realistic logic and have more reasonable interactive relationships. Furthermore, We propose an interactive action clustering and offset method to cluster and offset the interactive action features contained in each text prompts. By constructing global and local bidirectional offsets, we enhance semantic understanding and detail supplementation of original actions, making the model's understanding of the concept of interactive "actions" more accurate and generating images with more accurate interactive actions. Finally, we design an entity control network which generates masks with entity semantic guidance, then leveraging multi-scale convolutional network to enhance entity feature and dynamic network to fuse feature. It effectively controls entities and significantly improves image quality. Experiments show that the proposed CEIDM method is better than the most representative existing methods in both entity control and their interaction control.
☆ Talking to Robots: A Practical Examination of Speech Foundation Models for HRI Applications
Automatic Speech Recognition (ASR) systems in real-world settings need to handle imperfect audio, often degraded by hardware limitations or environmental noise, while accommodating diverse user groups. In human-robot interaction (HRI), these challenges intersect to create a uniquely challenging recognition environment. We evaluate four state-of-the-art ASR systems on eight publicly available datasets that capture six dimensions of difficulty: domain-specific, accented, noisy, age-variant, impaired, and spontaneous speech. Our analysis demonstrates significant variations in performance, hallucination tendencies, and inherent biases, despite similar scores on standard benchmarks. These limitations have serious implications for HRI, where recognition errors can interfere with task performance, user trust, and safety.
comment: Accepted at the workshop on Foundation Models for Social Robotics (FoMoSR) at ICSR 2025
☆ SMITE: Enhancing Fairness in LLMs through Optimal In-Context Example Selection via Dynamic Validation
Large Language Models (LLMs) are widely used for downstream tasks such as tabular classification, where ensuring fairness in their outputs is critical for inclusivity, equal representation, and responsible AI deployment. This study introduces a novel approach to enhancing LLM performance and fairness through the concept of a dynamic validation set, which evolves alongside the test set, replacing the traditional static validation approach. We also propose an iterative algorithm, SMITE, to select optimal in-context examples, with each example set validated against its corresponding dynamic validation set. The in-context set with the lowest total error is used as the final demonstration set. Our experiments across four different LLMs show that our proposed techniques significantly improve both predictive accuracy and fairness compared to baseline methods. To our knowledge, this is the first study to apply dynamic validation in the context of in-context learning for LLMs.
☆ Layerwise Importance Analysis of Feed-Forward Networks in Transformer-based Language Models
This study investigates the layerwise importance of feed-forward networks (FFNs) in Transformer-based language models during pretraining. We introduce an experimental approach that, while maintaining the total parameter count, increases the FFN dimensions in some layers and completely removes the FFNs from other layers. Furthermore, since our focus is on the importance of FFNs during pretraining, we train models from scratch to examine whether the importance of FFNs varies depending on their layer positions, rather than using publicly available pretrained models as is frequently done. Through comprehensive evaluations of models with varying sizes (285M, 570M, and 1.2B parameters) and layer counts (12, 24, and 40 layers), we demonstrate that concentrating FFNs in 70% of the consecutive middle layers consistently outperforms standard configurations for multiple downstream tasks.
comment: Accepted to COLM 2025
☆ How Do LLM-Generated Texts Impact Term-Based Retrieval Models?
As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.
☆ EMPOWER: Evolutionary Medical Prompt Optimization With Reinforcement Learning
Prompt engineering significantly influences the reliability and clinical utility of Large Language Models (LLMs) in medical applications. Current optimization approaches inadequately address domain-specific medical knowledge and safety requirements. This paper introduces EMPOWER, a novel evolutionary framework that enhances medical prompt quality through specialized representation learning, multi-dimensional evaluation, and structure-preserving algorithms. Our methodology incorporates: (1) a medical terminology attention mechanism, (2) a comprehensive assessment architecture evaluating clarity, specificity, clinical relevance, and factual accuracy, (3) a component-level evolutionary algorithm preserving clinical reasoning integrity, and (4) a semantic verification module ensuring adherence to medical knowledge. Evaluation across diagnostic, therapeutic, and educational tasks demonstrates significant improvements: 24.7% reduction in factually incorrect content, 19.6% enhancement in domain specificity, and 15.3% higher clinician preference in blinded evaluations. The framework addresses critical challenges in developing clinically appropriate prompts, facilitating more responsible integration of LLMs into healthcare settings.
☆ LLM-based Agentic Reasoning Frameworks: A Survey from Methods to Scenarios
Recent advances in the intrinsic reasoning capabilities of large language models (LLMs) have given rise to LLM-based agent systems that exhibit near-human performance on a variety of automated tasks. However, although these systems share similarities in terms of their use of LLMs, different reasoning frameworks of the agent system steer and organize the reasoning process in different ways. In this survey, we propose a systematic taxonomy that decomposes agentic reasoning frameworks and analyze how these frameworks dominate framework-level reasoning by comparing their applications across different scenarios. Specifically, we propose an unified formal language to further classify agentic reasoning systems into single-agent methods, tool-based methods, and multi-agent methods. After that, we provide a comprehensive review of their key application scenarios in scientific discovery, healthcare, software engineering, social simulation, and economics. We also analyze the characteristic features of each framework and summarize different evaluation strategies. Our survey aims to provide the research community with a panoramic view to facilitate understanding of the strengths, suitable scenarios, and evaluation practices of different agentic reasoning frameworks.
comment: 51 pages,10 figures,8 tables. Work in progress
☆ Text Meets Topology: Rethinking Out-of-distribution Detection in Text-Rich Networks EMNLP2025
Out-of-distribution (OOD) detection remains challenging in text-rich networks, where textual features intertwine with topological structures. Existing methods primarily address label shifts or rudimentary domain-based splits, overlooking the intricate textual-structural diversity. For example, in social networks, where users represent nodes with textual features (name, bio) while edges indicate friendship status, OOD may stem from the distinct language patterns between bot and normal users. To address this gap, we introduce the TextTopoOOD framework for evaluating detection across diverse OOD scenarios: (1) attribute-level shifts via text augmentations and embedding perturbations; (2) structural shifts through edge rewiring and semantic connections; (3) thematically-guided label shifts; and (4) domain-based divisions. Furthermore, we propose TNT-OOD to model the complex interplay between Text aNd Topology using: 1) a novel cross-attention module to fuse local structure into node-level text representations, and 2) a HyperNetwork to generate node-specific transformation parameters. This aligns topological and semantic features of ID nodes, enhancing ID/OOD distinction across structural and textual shifts. Experiments on 11 datasets across four OOD scenarios demonstrate the nuanced challenge of TextTopoOOD for evaluating OOD detection in text-rich networks.
comment: EMNLP2025 Main
☆ Characterizing the Behavior of Training Mamba-based State Space Models on GPUs
Mamba-based State Space Models (SSM) have emerged as a promising alternative to the ubiquitous transformers. Despite the expressive power of transformers, the quadratic complexity of computing attention is a major impediment to scaling performance as we increase the sequence length. SSMs provide an alternative path that addresses this problem, reducing the computational complexity requirements of self-attention with novel model architectures for different domains and fields such as video, text generation and graphs. Thus, it is important to characterize the behavior of these emerging workloads on GPUs and understand their requirements during GPU microarchitectural design. In this work we evaluate Mamba-based SSMs and characterize their behavior during training on GPUs. We construct a workload suite that offers representative models that span different model architectures. We then use this suite to analyze the architectural implications of running Mamba-based SSMs on GPUs. Our work sheds new light on potential optimizations to continue scaling the performance for such models.
☆ CoCoA: Confidence- and Context-Aware Adaptive Decoding for Resolving Knowledge Conflicts in Large Language Models EMNLP'25
Faithful generation in large language models (LLMs) is challenged by knowledge conflicts between parametric memory and external context. Existing contrastive decoding methods tuned specifically to handle conflict often lack adaptability and can degrade performance in low conflict settings. We introduce CoCoA (Confidence- and Context-Aware Adaptive Decoding), a novel token-level algorithm for principled conflict resolution and enhanced faithfulness. CoCoA resolves conflict by utilizing confidence-aware measures (entropy gap and contextual peakedness) and the generalized divergence between the parametric and contextual distributions. Crucially, CoCoA maintains strong performance even in low conflict settings. Extensive experiments across multiple LLMs on diverse Question Answering (QA), Summarization, and Long-Form Question Answering (LFQA) benchmarks demonstrate CoCoA's state-of-the-art performance over strong baselines like AdaCAD. It yields significant gains in QA accuracy, up to 9.2 points on average compared to the strong baseline AdaCAD, and improves factuality in summarization and LFQA by up to 2.5 points on average across key benchmarks. Additionally, it demonstrates superior sensitivity to conflict variations. CoCoA enables more informed, context-aware, and ultimately more faithful token generation.
comment: Accepted to EMNLP'25, Main. 21 pages, 17 tables, 3 Figures
SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models
Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement - from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.
☆ Dynamic Embedding of Hierarchical Visual Features for Efficient Vision-Language Fine-Tuning
Large Vision-Language Models (LVLMs) commonly follow a paradigm that projects visual features and then concatenates them with text tokens to form a unified sequence input for Large Language Models (LLMs). However, this paradigm leads to a significant increase in the length of the input sequence, resulting in substantial computational overhead. Existing methods attempt to fuse visual information into the intermediate layers of LLMs, which alleviate the sequence length issue but often neglect the hierarchical semantic representations within the model and the fine-grained visual information available in the shallower visual encoding layers. To address this limitation, we propose DEHVF, an efficient vision-language fine-tuning method based on dynamic embedding and fusion of hierarchical visual features. Its core lies in leveraging the inherent hierarchical representation characteristics of visual encoders and language models. Through a lightweight hierarchical visual fuser, it dynamically selects and fuses hierarchical features corresponding to semantic granularity based on the internal representations of each layer in LLMs. The fused layer-related visual features are then projected and aligned before being directly embedded into the Feed-Forward Network (FFN) of the corresponding layer in LLMs. This approach not only avoids sequence expansion but also dynamically fuses multi-layer visual information. By fine-tuning only a small number of parameters, DEHVF achieves precise alignment and complementarity of cross-modal information at the same semantic granularity. We conducted experiments across various VL benchmarks, including visual question answering on ScienceQA and image captioning on COCO Captions. The results demonstrate that DEHVF achieves higher accuracy than existing parameter-efficient fine-tuning (PEFT) baselines while maintaining efficient training and inference.
☆ Weights-Rotated Preference Optimization for Large Language Models EMNLP 2025
Despite the efficacy of Direct Preference Optimization (DPO) in aligning Large Language Models (LLMs), reward hacking remains a pivotal challenge. This issue emerges when LLMs excessively reduce the probability of rejected completions to achieve high rewards, without genuinely meeting their intended goals. As a result, this leads to overly lengthy generation lacking diversity, as well as catastrophic forgetting of knowledge. We investigate the underlying reason behind this issue, which is representation redundancy caused by neuron collapse in the parameter space. Hence, we propose a novel Weights-Rotated Preference Optimization (RoPO) algorithm, which implicitly constrains the output layer logits with the KL divergence inherited from DPO and explicitly constrains the intermediate hidden states by fine-tuning on a multi-granularity orthogonal matrix. This design prevents the policy model from deviating too far from the reference model, thereby retaining the knowledge and expressive capabilities acquired during pre-training and SFT stages. Our RoPO achieves up to a 3.27-point improvement on AlpacaEval 2, and surpasses the best baseline by 6.2 to 7.5 points on MT-Bench with merely 0.015% of the trainable parameters, demonstrating its effectiveness in alleviating the reward hacking problem of DPO.
comment: EMNLP 2025
☆ Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit
Large language models (LLMs) enhance complex reasoning tasks by scaling the individual thinking process. However, prior work shows that overthinking can degrade overall performance. Motivated by observed patterns in thinking length and content length, we categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage. Typically, LLMs produce correct answers in the compensatory reasoning stage, whereas reasoning convergence often triggers overthinking, causing increased resource usage or even infinite loops. Therefore, mitigating overthinking hinges on detecting the end of the compensatory reasoning stage, defined as the Reasoning Completion Point (RCP). RCP typically appears at the end of the first complete reasoning cycle and can be identified by querying the LLM sentence by sentence or monitoring the probability of an end-of-thinking token (e.g., \texttt{}), though these methods lack an efficient and precise balance. To improve this, we mine more sensitive and consistent RCP patterns and develop a lightweight thresholding strategy based on heuristic rules. Experimental evaluations on benchmarks (AIME24, AIME25, GPQA-D) demonstrate that the proposed method reduces token consumption while preserving or enhancing reasoning accuracy.
☆ EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems
Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.
comment: Accepted at (ASRU 2025) 2025 IEEE Automatic Speech Recognition and Understanding Workshop
☆ Steering When Necessary: Flexible Steering Large Language Models with Backtracking
Large language models (LLMs) have achieved remarkable performance across many generation tasks. Nevertheless, effectively aligning them with desired behaviors remains a significant challenge. Activation steering is an effective and cost-efficient approach that directly modifies the activations of LLMs during the inference stage, aligning their responses with the desired behaviors and avoiding the high cost of fine-tuning. Existing methods typically indiscriminately intervene to all generations or rely solely on the question to determine intervention, which limits the accurate assessment of the intervention strength. To this end, we propose the Flexible Activation Steering with Backtracking (FASB) framework, which dynamically determines both the necessity and strength of intervention by tracking the internal states of the LLMs during generation, considering both the question and the generated content. Since intervening after detecting a deviation from the desired behavior is often too late, we further propose the backtracking mechanism to correct the deviated tokens and steer the LLMs toward the desired behavior. Extensive experiments on the TruthfulQA dataset and six multiple-choice datasets demonstrate that our method outperforms baselines. Our code will be released at https://github.com/gjw185/FASB.
☆ Less Is More? Examining Fairness in Pruned Large Language Models for Summarising Opinions EMNLP 2025
Model compression through post-training pruning offers a way to reduce model size and computational requirements without significantly impacting model performance. However, the effect of pruning on the fairness of LLM-generated summaries remains unexplored, particularly for opinion summarisation where biased outputs could influence public views.In this paper, we present a comprehensive empirical analysis of opinion summarisation, examining three state-of-the-art pruning methods and various calibration sets across three open-source LLMs using four fairness metrics. Our systematic analysis reveals that pruning methods have a greater impact on fairness than calibration sets. Building on these insights, we propose High Gradient Low Activation (HGLA) pruning, which identifies and removes parameters that are redundant for input processing but influential in output generation. Our experiments demonstrate that HGLA can better maintain or even improve fairness compared to existing methods, showing promise across models and tasks where traditional methods have limitations. Our human evaluation shows HGLA-generated outputs are fairer than existing state-of-the-art pruning methods. Code is available at: https://github.com/amberhuang01/HGLA.
comment: Accepted to EMNLP 2025 Main Conference
☆ RubikSQL: Lifelong Learning Agentic Knowledge Base as an Industrial NL2SQL System VLDB 2026
We present RubikSQL, a novel NL2SQL system designed to address key challenges in real-world enterprise-level NL2SQL, such as implicit intents and domain-specific terminology. RubikSQL frames NL2SQL as a lifelong learning task, demanding both Knowledge Base (KB) maintenance and SQL generation. RubikSQL systematically builds and refines its KB through techniques including database profiling, structured information extraction, agentic rule mining, and Chain-of-Thought (CoT)-enhanced SQL profiling. RubikSQL then employs a multi-agent workflow to leverage this curated KB, generating accurate SQLs. RubikSQL achieves SOTA performance on both the KaggleDBQA and BIRD Mini-Dev datasets. Finally, we release the RubikBench benchmark, a new benchmark specifically designed to capture vital traits of industrial NL2SQL scenarios, providing a valuable resource for future research.
comment: 18 pages, 3 figures, 3 tables, to be submitted to VLDB 2026 (PVLDB Volume 19)
☆ UQ: Assessing Language Models on Unsolved Questions
Benchmarks shape progress in AI research. A useful benchmark should be both difficult and realistic: questions should challenge frontier models while also reflecting real-world usage. Yet, current paradigms face a difficulty-realism tension: exam-style benchmarks are often made artificially difficult with limited real-world value, while benchmarks based on real user interaction often skew toward easy, high-frequency problems. In this work, we explore a radically different paradigm: assessing models on unsolved questions. Rather than a static benchmark scored once, we curate unsolved questions and evaluate models asynchronously over time with validator-assisted screening and community verification. We introduce UQ, a testbed of 500 challenging, diverse questions sourced from Stack Exchange, spanning topics from CS theory and math to sci-fi and history, probing capabilities including reasoning, factuality, and browsing. UQ is difficult and realistic by construction: unsolved questions are often hard and naturally arise when humans seek answers, thus solving them yields direct real-world value. Our contributions are threefold: (1) UQ-Dataset and its collection pipeline combining rule-based filters, LLM judges, and human review to ensure question quality (e.g., well-defined and difficult); (2) UQ-Validators, compound validation strategies that leverage the generator-validator gap to provide evaluation signals and pre-screen candidate solutions for human review; and (3) UQ-Platform, an open platform where experts collectively verify questions and solutions. The top model passes UQ-validation on only 15% of questions, and preliminary human verification has already identified correct answers among those that passed. UQ charts a path for evaluating frontier models on real-world, open-ended challenges, where success pushes the frontier of human knowledge. We release UQ at https://uq.stanford.edu.
comment: FN, KZL, and NM are project co-leads and contributed equally. Project website: https://uq.stanford.edu
☆ CausalSent: Interpretable Sentiment Classification with RieszNet
Despite the overwhelming performance improvements offered by recent natural language procesing (NLP) models, the decisions made by these models are largely a black box. Towards closing this gap, the field of causal NLP combines causal inference literature with modern NLP models to elucidate causal effects of text features. We replicate and extend Bansal et al's work on regularizing text classifiers to adhere to estimated effects, focusing instead on model interpretability. Specifically, we focus on developing a two-headed RieszNet-based neural network architecture which achieves better treatment effect estimation accuracy. Our framework, CausalSent, accurately predicts treatment effects in semi-synthetic IMDB movie reviews, reducing MAE of effect estimates by 2-3x compared to Bansal et al's MAE on synthetic Civil Comments data. With an ensemble of validated models, we perform an observational case study on the causal effect of the word "love" in IMDB movie reviews, finding that the presence of the word "love" causes a +2.9% increase in the probability of a positive sentiment.
☆ Humanizing Machines: Rethinking LLM Anthropomorphism Through a Multi-Level Framework of Design EMNLP
Large Language Models (LLMs) increasingly exhibit \textbf{anthropomorphism} characteristics -- human-like qualities portrayed across their outlook, language, behavior, and reasoning functions. Such characteristics enable more intuitive and engaging human-AI interactions. However, current research on anthropomorphism remains predominantly risk-focused, emphasizing over-trust and user deception while offering limited design guidance. We argue that anthropomorphism should instead be treated as a \emph{concept of design} that can be intentionally tuned to support user goals. Drawing from multiple disciplines, we propose that the anthropomorphism of an LLM-based artifact should reflect the interaction between artifact designers and interpreters. This interaction is facilitated by cues embedded in the artifact by the designers and the (cognitive) responses of the interpreters to the cues. Cues are categorized into four dimensions: \textit{perceptive, linguistic, behavioral}, and \textit{cognitive}. By analyzing the manifestation and effectiveness of each cue, we provide a unified taxonomy with actionable levers for practitioners. Consequently, we advocate for function-oriented evaluations of anthropomorphic design.
comment: Accepted in EMNLP main proceedings
☆ COMET-poly: Machine Translation Metric Grounded in Other Candidates
Automated metrics for machine translation attempt to replicate human judgment. Unlike humans, who often assess a translation in the context of multiple alternatives, these metrics typically consider only the source sentence and a single translation. This discrepancy in the evaluation setup may negatively impact the performance of automated metrics. We propose two automated metrics that incorporate additional information beyond the single translation. COMET-polycand uses alternative translations of the same source sentence to compare and contrast with the translation at hand, thereby providing a more informed assessment of its quality. COMET-polyic, inspired by retrieval-based in-context learning, takes in translations of similar source texts along with their human-labeled quality scores to guide the evaluation. We find that including a single additional translation in COMET-polycand improves the segment-level metric performance (0.079 to 0.118 Kendall's tau-b correlation), with further gains when more translations are added. Incorporating retrieved examples in COMET-polyic yields similar improvements (0.079 to 0.116 Kendall's tau-b correlation). We release our models publicly.
comment: Maike Z\"ufle, Vil\'em Zouhar, and Tu Anh Dinh contributed equally
☆ Designing across domains with declarative thinking: Insights from the 96-Eyes ptychographic imager project
This article presents a practitioner's reflection on applying declarative, 5th generation, problem formulation language (5GL) to de novo imaging system design, informed by experiences across the interdisciplinary research in academia and cross-functional product development within the private sector. Using the 96-Eyes project: 96-camera parallel multi-modal imager for high-throughput drug discovery as a representative case, I illustrate how project requirements, ranging from hardware constraints to life sciences needs, can be formalized into machine-readable problem statements to preserve mission-critical input from diverse domain stakeholders. This declarative approach enhances transparency, ensures design traceability, and minimizes costly misalignment across optical, algorithmic, hardware-accelerated compute, and life sciences teams. Alongside the technical discussion of 5GL with real-world code examples, I reflect on the practical barriers to adopting 5GL in environments where imperative, 3rd-generation languages (3GL) remain the default medium for inter-team collaboration. Rather than offering an one-size-fits-all solution, these learned lessons highlight how programming paradigms implicitly shapes research workflows through existing domain hierarchies. The discussion aims to invite further explorations into how declarative problem formulations can facilitate innovation in settings where concurrent R\&{}D workflows are gaining traction, as opposed to environments where sequential, phase-driven workflows remain the norm.
☆ Principled Detection of Hallucinations in Large Language Models via Multiple Testing
While Large Language Models (LLMs) have emerged as powerful foundational models to solve a variety of tasks, they have also been shown to be prone to hallucinations, i.e., generating responses that sound confident but are actually incorrect or even nonsensical. In this work, we formulate the problem of detecting hallucinations as a hypothesis testing problem and draw parallels to the problem of out-of-distribution detection in machine learning models. We propose a multiple-testing-inspired method to solve the hallucination detection problem, and provide extensive experimental results to validate the robustness of our approach against state-of-the-art methods.
comment: 16 pages
☆ Integrating gender inclusivity into large language models via instruction tuning
Imagine a language with masculine, feminine, and neuter grammatical genders, yet, due to historical and political conventions, masculine forms are predominantly used to refer to men, women and mixed-gender groups. This is the reality of contemporary Polish. A social consequence of this unfair linguistic system is that large language models (LLMs) trained on Polish texts inherit and reinforce this masculine bias, generating gender-imbalanced outputs. This study addresses this issue by tuning LLMs using the IPIS dataset, a collection of human-crafted gender-inclusive proofreading in Polish and Polish-to-English translation instructions. Grounded in a theoretical linguistic framework, we design a system prompt with explicit gender-inclusive guidelines for Polish. In our experiments, we IPIS-tune multilingual LLMs (Llama-8B, Mistral-7B and Mistral-Nemo) and Polish-specific LLMs (Bielik and PLLuM). Our approach aims to integrate gender inclusivity as an inherent feature of these models, offering a systematic solution to mitigate gender bias in Polish language generation.
☆ How Reliable are LLMs for Reasoning on the Re-ranking task?
With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM's internal workings is unavoidable to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of transparency or LLMs and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environment and the Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.
comment: Accepted at FQAS Conference 2024. DOI will be provided in 3 weeks after the conference has published the paper
☆ A Systematic Approach to Predict the Impact of Cybersecurity Vulnerabilities Using LLMs
Vulnerability databases, such as the National Vulnerability Database (NVD), offer detailed descriptions of Common Vulnerabilities and Exposures (CVEs), but often lack information on their real-world impact, such as the tactics, techniques, and procedures (TTPs) that adversaries may use to exploit the vulnerability. However, manually linking CVEs to their corresponding TTPs is a challenging and time-consuming task, and the high volume of new vulnerabilities published annually makes automated support desirable. This paper introduces TRIAGE, a two-pronged automated approach that uses Large Language Models (LLMs) to map CVEs to relevant techniques from the ATT&CK knowledge base. We first prompt an LLM with instructions based on MITRE's CVE Mapping Methodology to predict an initial list of techniques. This list is then combined with the results from a second LLM-based module that uses in-context learning to map a CVE to relevant techniques. This hybrid approach strategically combines rule-based reasoning with data-driven inference. Our evaluation reveals that in-context learning outperforms the individual mapping methods, and the hybrid approach improves recall of exploitation techniques. We also find that GPT-4o-mini performs better than Llama3.3-70B on this task. Overall, our results show that LLMs can be used to automatically predict the impact of cybersecurity vulnerabilities and TRIAGE makes the process of mapping CVEs to ATT&CK more efficient. Keywords: vulnerability impact, CVE, ATT&CK techniques, large language models, automated mapping.
☆ Can Out-of-Distribution Evaluations Uncover Reliance on Shortcuts? A Case Study in Question Answering EMNLP 2025
A majority of recent work in AI assesses models' generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build upon a strong assumption: that OOD evaluations can capture and reflect upon possible failures in a real-world deployment. In this work, we challenge this assumption and confront the results obtained from OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as a reliance on spurious features or prediction shortcuts. We find that different datasets used for OOD evaluations in QA provide an estimate of models' robustness to shortcuts that have a vastly different quality, some largely under-performing even a simple, in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID+OOD datasets, but also find cases where a dataset's quality for training and evaluation is largely disconnected. Our work underlines limitations of commonly-used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.
comment: To appear in Findings of EMNLP 2025
☆ Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Probabilistic decoding in Large Language Models (LLMs) often yields inconsistent outputs, particularly on complex or long-form questions. Self-Consistency (SC) mitigates this for short-form QA by majority voting over exact strings, whereas Universal Self-Consistency (USC) and Weighted Unigram Consistency Score (WUCS) extend to long-form responses but lose accuracy on short-form benchmarks. We introduce Latent Self-Consistency (LSC), which selects the most semantically consistent response using learnable token embeddings. A lightweight forward generation of summary tokens increases inference time by less than 1% and requires no changes to the model architecture. Across 6 short-form and 5 long-form reasoning benchmarks (e.g., MATH, MMLU, TruthfulQA), LSC surpasses SC, USC and WUCS on all short-form and long-form ones on average, while maintaining negligible computational overhead. These results position LSC as a practical consistency-selection method that works reliably across answer formats. Additionally, LSC provides well-calibrated confidence estimates, maintaining low Expected Calibration Error across both answer formats.
☆ Integral Transformer: Denoising Attention, Not Too Much Not Too Little EMNLP 2025
Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential Transformer have addressed this by introducing negative attention scores, they risk discarding useful information. In this paper, we propose the Integral Transformer, a novel self-attention mechanism that denoises attention by integrating signals sampled from the logit distribution. Our approach mitigates noise while preserving the contributions of special tokens critical for model performance. Extensive experiments demonstrate that our model outperforms vanilla, Cog, and Differential attention variants on well-established knowledge and reasoning language benchmarks. Moreover, our analysis reveals that employing vanilla self-attention in the lower Transformer layers enhances performance and that the Integral Transformer effectively balances attention distributions and reduces rank collapse in upper layers.
comment: EMNLP 2025 Main
☆ Backprompting: Leveraging Synthetic Production Data for Health Advice Guardrails
The pervasiveness of large language models (LLMs) in enterprise settings has also brought forth a significant amount of risks associated with their usage. Guardrails technologies aim to mitigate this risk by filtering LLMs' input/output text through various detectors. However, developing and maintaining robust detectors faces many challenges, one of which is the difficulty in acquiring production-quality labeled data on real LLM outputs prior to deployment. In this work, we propose backprompting, a simple yet intuitive solution to generate production-like labeled data for health advice guardrails development. Furthermore, we pair our backprompting method with a sparse human-in-the-loop clustering technique to label the generated data. Our aim is to construct a parallel corpus roughly representative of the original dataset yet resembling real LLM output. We then infuse existing datasets with our synthetic examples to produce robust training data for our detector. We test our technique in one of the most difficult and nuanced guardrails: the identification of health advice in LLM output, and demonstrate improvement versus other solutions. Our detector is able to outperform GPT-4o by up to 3.73%, despite having 400x less parameters.
☆ Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models EMNLP 2025
Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.
comment: Accepted by EMNLP 2025 findings
☆ Training Language Model Agents to Find Vulnerabilities with CTF-Dojo
Large language models (LLMs) have demonstrated exceptional capabilities when trained within executable runtime environments, notably excelling at software engineering tasks through verified feedback loops. Yet, scalable and generalizable execution-grounded environments remain scarce, limiting progress in training more capable ML agents. We introduce CTF-Dojo, the first large-scale executable runtime tailored for training LLMs with verifiable feedback, featuring 658 fully functional Capture-The-Flag (CTF)-style challenges containerized in Docker with guaranteed reproducibility. To enable rapid scaling without manual intervention, we develop CTF-Forge, an automated pipeline that transforms publicly available artifacts into ready-to-use execution environments in minutes, eliminating weeks of expert configuration traditionally required. We trained LLM-based agents on just 486 high-quality, execution-verified trajectories from CTF-Dojo, achieving up to 11.6% absolute gains over strong baselines across three competitive benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best-performing 32B model reaches 31.9% Pass@1, establishing a new open-weight state-of-the-art that rivals frontier models like DeepSeek-V3-0324 and Gemini-2.5-Flash. By framing CTF-style tasks as a benchmark for executable-agent learning, CTF-Dojo demonstrates that execution-grounded training signals are not only effective but pivotal in advancing high-performance ML agents without dependence on costly proprietary systems.
♻ ☆ TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single, few, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape & trend, velocity & frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality.
♻ ☆ Measuring Sycophancy of Language Models in Multi-turn Dialogues EMNLP 2025
Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy--conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
comment: Accepted to Findings of EMNLP 2025
♻ ☆ Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a question correctly, but a seemingly trivial perturbation, which can happen in real-world settings, causes it to produce a hallucinated response with high certainty. This phenomenon, which we dub CHOKE (Certain Hallucinations Overriding Known Evidence), is particularly concerning in high-stakes domains such as medicine or law, where model certainty is often used as a proxy for reliability. We show that CHOKE examples are consistent across prompts, occur in different models and datasets, and are fundamentally distinct from other hallucinations. This difference leads existing mitigation methods to perform worse on CHOKE examples than on general hallucinations. Finally, we introduce a probing-based mitigation that outperforms existing methods on CHOKE hallucinations. These findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety. The code is available at https://github.com/technion-cs-nlp/Trust_me_Im_wrong .
♻ ☆ Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities
Mental health disorders create profound personal and societal burdens, yet conventional diagnostics are resource-intensive and limit accessibility. Advances in artificial intelligence, particularly natural language processing and multimodal methods, offer promise for detecting and addressing mental disorders, but raise critical privacy risks. This paper examines these challenges and proposes solutions, including anonymization, synthetic data, and privacy-preserving training, while outlining frameworks for privacy-utility trade-offs, aiming to advance reliable, privacy-aware AI tools that support clinical decision-making and improve mental health outcomes.
comment: 18 pages, 2 figures, Accepted in Nature Computational Science
♻ ☆ EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
With the integration of Multimodal large language models (MLLMs) into robotic systems and various AI applications, embedding emotional intelligence (EI) capabilities into these models is essential for enabling robots to effectively address human emotional needs and interact seamlessly in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real-world interactions and fail to capture the dynamic, multimodal nature of emotional expressions, making them inadequate for evaluating MLLMs' EI. Based on established psychological theories of EI, we build EmoBench-M, a novel benchmark designed to evaluate the EI capability of MLLMs across 13 valuation scenarios from three key dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Evaluations of both open-source and closed-source MLLMs on EmoBench-M reveal a significant performance gap between them and humans, highlighting the need to further advance their EI capabilities. All benchmark resources, including code and datasets, are publicly available at https://emo-gml.github.io/.
♻ ☆ HeteroTune: Efficient Federated Learning for Large Heterogeneous Models
While large pre-trained models have achieved impressive performance across AI tasks, their deployment in privacy-sensitive and distributed environments remains challenging. Federated learning (FL) offers a viable solution by enabling decentralized fine-tuning without data sharing, but real-world applications face significant obstacles due to heterogeneous client resources in compute and memory. To address this, we propose HeteroTune, a novel federated fine-tuning paradigm for large, heterogeneous models operating under limited communication and computation budgets. The core of our method lies in a novel architecture, DeMA (Dense Mixture of Adapters), which enables flexible and efficient aggregation of heterogeneous models by preserving their full representational capacity while facilitating seamless cross-model knowledge fusion. We further introduce CMGA (Cross-Model Gradient Alignment), a lightweight yet effective mechanism that enhances training stability by harmonizing gradient directions across heterogeneous client models during aggregation, mitigating update conflicts and promoting more consistent convergence in federated settings. We provide both theoretical analysis and empirical evidence showing that HeteroTune achieves state-of-the-art performance and efficiency across diverse tasks and model architectures. For example, on LLaMA models, it reduces communication overhead by 99.5%, cuts peak memory usage by ~50%, and improves performance by 4.61%.
comment: 16 pages, 4 figures
♻ ☆ Evaluation of Large Language Models via Coupled Token Generation
State of the art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite amount of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Mistral and Qwen families. We find that, across multiple benchmark datasets, coupled autoregressive generation requires up to 75% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, we find that the win-rates derived from pairwise comparisons by a strong large language model to prompts from the LMSYS Chatbot Arena platform differ under coupled and vanilla autoregressive generation.
♻ ☆ Missing Melodies: AI Music Generation and its "Nearly" Complete Omission of the Global South
Recent advances in generative AI have sparked renewed interest and expanded possibilities for music generation. However, the performance and versatility of these systems across musical genres are heavily influenced by the availability of training data. We conducted an extensive analysis of over one million hours of audio datasets used in AI music generation research and manually reviewed more than 200 papers from eleven prominent AI and music conferences and organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR, NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and inclusion of the musical genres of the Global South in AI research. Our findings reveal a stark imbalance: approximately 86% of the total dataset hours and over 93% of researchers focus primarily on music from the Global North. However, around 40% of these datasets include some form of non-Western music, genres from the Global South account for only 14.6% of the data. Furthermore, approximately 51% of the papers surveyed concentrate on symbolic music generation, a method that often fails to capture the cultural nuances inherent in music from regions such as South Asia, the Middle East, and Africa. As AI increasingly shapes the creation and dissemination of music, the significant underrepresentation of music genres in datasets and research presents a serious threat to global musical diversity. We also propose some important steps to mitigate these risks and foster a more inclusive future for AI-driven music generation.
comment: Submitted to CACM, 12 pages, 2 figures
♻ ☆ Confidential Prompting: Privacy-preserving LLM Inference on Cloud
This paper introduces a vision of confidential prompting: securing user prompts from untrusted, cloud-hosted large language model (LLM) provider while preserving model confidentiality, output invariance, and compute efficiency. As a first step toward this vision, we present Obfuscated Secure Partitioned Decoding (OSPD), a system built on two key innovations. First, Secure Partitioned Decoding (SPD) isolates user prompts within per-user processes residing in a confidential virtual machine (CVM) on the cloud, which are inaccessible for the cloud LLM while allowing it to generate tokens efficiently. Second, Prompt Obfuscation (PO) introduces a novel cryptographic technique that enhances SPD resilience against advanced prompt reconstruction attacks. Together, these innovations ensure OSPD protects both prompt and model confidentiality while maintaining service functionality. OSPD enables practical, privacy-preserving cloud-hosted LLM inference for sensitive applications, such as processing personal data, clinical records, and financial documents.
♻ ☆ Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering and lightweight LLM guardrails baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness and over-refusal. Check out the website here https://sites.google.com/view/llm-nbf/home . Our code is available on https://github.com/HanjiangHu/NBF-LLM .
comment: 23 pages, 10 figures, 11 tables
♻ ☆ CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications
The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Content-Safety-Dataset-Multilingual-v1, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-Multilingual-8B-v1 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work represents a significant step toward closing the safety gap in multilingual LLMs by enabling the development of culturally aware safety guard models.
♻ ☆ Towards New Benchmark for AI Alignment & Sentiment Analysis in Socially Important Issues: A Comparative Study of Human and LLMs in the Context of AGI
As general-purpose artificial intelligence systems become increasingly integrated into society and are used for information seeking, content generation, problem solving, textual analysis, coding, and running processes, it is crucial to assess their long-term impact on humans. This research explores the sentiment of large language models (LLMs) and humans toward artificial general intelligence (AGI) using a Likert-scale survey. Seven LLMs, including GPT-4 and Bard, were analyzed and compared with sentiment data from three independent human sample populations. Temporal variations in sentiment were also evaluated over three consecutive days. The results show a diversity in sentiment scores among LLMs, ranging from 3.32 to 4.12 out of 5. GPT-4 recorded the most positive sentiment toward AGI, while Bard leaned toward a neutral sentiment. In contrast, the human samples showed a lower average sentiment of 2.97. The analysis outlines potential conflicts of interest and biases in the sentiment formation of LLMs, and indicates that LLMs could subtly influence societal perceptions. To address the need for regulatory oversight and culturally grounded assessments of AI systems, we introduce the Societal AI Alignment and Sentiment Benchmark (SAAS-AI), which leverages multidimensional prompts and empirically validated societal value frameworks to evaluate language model outputs across temporal, model, and multilingual axes. This benchmark is designed to guide policymakers and AI agencies, including within frameworks such as the EU AI Act, by providing robust, actionable insights into AI alignment with human values, public sentiment, and ethical norms at both national and international levels. Future research should further refine the operationalization of the SAAS-AI benchmark and systematically evaluate its effectiveness through comprehensive empirical testing.
comment: 20 pages, 1 figure
♻ ☆ Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models
Recently, large language models (LLMs), such as GPT-4, stand out remarkable conversational abilities, enabling them to engage in dynamic and contextually relevant dialogues across a wide range of topics. However, given a long conversation, these chatbots fail to recall past information and tend to generate inconsistent responses. To address this, we propose to recursively generate summaries/ memory using large language models (LLMs) to enhance long-term memory ability. Specifically, our method first stimulates LLMs to memorize small dialogue contexts and then recursively produce new memory using previous memory and following contexts. Finally, the chatbot can easily generate a highly consistent response with the help of the latest memory. We evaluate our method on both open and closed LLMs, and the experiments on the widely-used public dataset show that our method can generate more consistent responses in a long-context conversation. Also, we show that our strategy could nicely complement both long-context (e.g., 8K and 16K) and retrieval-enhanced LLMs, bringing further long-term dialogue performance. Notably, our method is a potential solution to enable the LLM to model the extremely long context. The code and scripts are released.
comment: This paper has been accepted by Neurocomputing
♻ ☆ Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
♻ ☆ Debate-to-Detect: Reformulating Misinformation Detection as a Real-World Debate with Large Language Models EMNLP 2025
The proliferation of misinformation in digital platforms reveals the limitations of traditional detection methods, which mostly rely on static classification and fail to capture the intricate process of real-world fact-checking. Despite advancements in Large Language Models (LLMs) that enhance automated reasoning, their application to misinformation detection remains hindered by issues of logical inconsistency and superficial verification. In response, we introduce Debate-to-Detect (D2D), a novel Multi-Agent Debate (MAD) framework that reformulates misinformation detection as a structured adversarial debate. Inspired by fact-checking workflows, D2D assigns domain-specific profiles to each agent and orchestrates a five-stage debate process, including Opening Statement, Rebuttal, Free Debate, Closing Statement, and Judgment. To transcend traditional binary classification, D2D introduces a multi-dimensional evaluation mechanism that assesses each claim across five distinct dimensions: Factuality, Source Reliability, Reasoning Quality, Clarity, and Ethics. Experiments with GPT-4o on two datasets demonstrate significant improvements over baseline methods, and the case study highlight D2D's capability to iteratively refine evidence while improving decision transparency, representing a substantial advancement towards interpretable misinformation detection. The code will be released publicly after the official publication.
comment: This paper has been accepted to EMNLP 2025 (Main Conference)
♻ ☆ Post-Training Language Models for Continual Relation Extraction
Real-world data, such as news articles, social media posts, and chatbot conversations, is inherently dynamic and non-stationary, presenting significant challenges for constructing real-time structured representations through knowledge graphs (KGs). Relation Extraction (RE), a fundamental component of KG creation, often struggles to adapt to evolving data when traditional models rely on static, outdated datasets. Continual Relation Extraction (CRE) methods tackle this issue by incrementally learning new relations while preserving previously acquired knowledge. This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to CRE, with a focus on leveraging memory replay to address catastrophic forgetting. We evaluate decoder-only models (eg, Mistral-7B and Llama2-7B) and encoder-decoder models (eg, Flan-T5 Base) on the TACRED and FewRel datasets. Task-incremental fine-tuning of LLMs demonstrates superior performance over earlier approaches using encoder-only models like BERT on TACRED, excelling in seen-task accuracy and overall performance (measured by whole and average accuracy), particularly with the Mistral and Flan-T5 models. Results on FewRel are similarly promising, achieving second place in whole and average accuracy metrics. This work underscores critical factors in knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.
comment: 17 pages, Initial Results and Reporting of the work. This work has been submitted to the IEEE for possible publication
♻ ☆ Large Language Models in the Task of Automatic Validation of Text Classifier Predictions
Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model's entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.
♻ ☆ Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely \emph{Memento}, which attains top-1 on GAIA validation ($87.88\%$ Pass@$3$) and $79.40\%$ on the test set. It reaches $66.6\%$ F1 and $80.4\%$ PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds $4.7\%$ to $9.6\%$ absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/Memento.
♻ ☆ SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning "superword" tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI's o200k tokenizer and 30% improvement over Google's Gemma 3 tokenizer (256k vocabulary), while maintaining competitive performance across 38 languages. When integrated with a GPT-2 scale model (124M parameters) trained on 10 billion tokens from the FineWeb-Edu dataset, SupraTok yields 8.4% improvement on HellaSWAG and 9.5% on MMLU benchmarks without architectural modifications. While these results are promising at this scale, further validation at larger model scales is needed. These findings suggest that efficient tokenization can complement architectural innovations as a path to improved language model performance.
♻ ☆ Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference EMNLP2025
Despite the advancements made in Vision Large Language Models (VLLMs), like text Large Language Models (LLMs), they have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tune a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM's knowledge boundary, based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at https://github.com/Chord-Chen-30/VLLM-KnowledgeBoundary
comment: EMNLP2025 Main Conference
♻ ☆ Leveraging the Power of MLLMs for Gloss-Free Sign Language Translation ICCV 2025
Sign language translation (SLT) is a challenging task that involves translating sign language images into spoken language. For SLT models to perform this task successfully, they must bridge the modality gap and identify subtle variations in sign language components to understand their meanings accurately. To address these challenges, we propose a novel gloss-free SLT framework called Multimodal Sign Language Translation (MMSLT), which leverages the representational capabilities of off-the-shelf multimodal large language models (MLLMs). Specifically, we use MLLMs to generate detailed textual descriptions of sign language components. Then, through our proposed multimodal-language pre-training module, we integrate these description features with sign video features to align them within the spoken sentence space. Our approach achieves state-of-the-art performance on benchmark datasets PHOENIX14T and CSL-Daily, highlighting the potential of MLLMs to be utilized effectively in SLT. Code is available at https://github.com/hwjeon98/MMSLT.
comment: Accepted by ICCV 2025
♻ ☆ Theory of Mind in Large Language Models: Assessment and Enhancement ACL 2025
Theory of Mind (ToM)-the ability to reason about the mental states of oneself and others-is a cornerstone of human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, understanding their ability to interpret and respond to human mental states is crucial for enabling effective interactions. In this paper, we review LLMs' ToM capabilities by analyzing both evaluation benchmarks and enhancement strategies. For evaluation, we focus on recently proposed and widely used story-based benchmarks. For enhancement, we provide an in-depth analysis of recent methods aimed at improving LLMs' ToM abilities. Furthermore, we outline promising directions for future research to further advance these capabilities and better adapt LLMs to more realistic and diverse scenarios. Our survey serves as a valuable resource for researchers interested in evaluating and advancing LLMs' ToM capabilities.
comment: Accepted to ACL 2025 main conference
♻ ☆ Rethinking Cross-Subject Data Splitting for Brain-to-Text Decoding
Recent major milestones have successfully reconstructed natural language from non-invasive brain signals (e.g. functional Magnetic Resonance Imaging (fMRI) and Electroencephalogram (EEG)) across subjects. However, we find current dataset splitting strategies for cross-subject brain-to-text decoding are wrong. Specifically, we first demonstrate that all current splitting methods suffer from data leakage problem, which refers to the leakage of validation and test data into training set, resulting in significant overfitting and overestimation of decoding models. In this study, we develop a right cross-subject data splitting criterion without data leakage for decoding fMRI and EEG signal to text. Some SOTA brain-to-text decoding models are re-evaluated correctly with the proposed criterion for further research.
♻ ☆ A Factorized Probabilistic Model of the Semantics of Vague Temporal Adverbials Relative to Different Event Types
Vague temporal adverbials, such as recently, just, and a long time ago, describe the temporal distance between a past event and the utterance time but leave the exact duration underspecified. In this paper, we introduce a factorized model that captures the semantics of these adverbials as probabilistic distributions. These distributions are composed with event-specific distributions to yield a contextualized meaning for an adverbial applied to a specific event. We fit the model's parameters using existing data capturing judgments of native speakers regarding the applicability of these vague temporal adverbials to events that took place a given time ago. Comparing our approach to a non-factorized model based on a single Gaussian distribution for each pair of event and temporal adverbial, we find that while both models have similar predictive power, our model is preferable in terms of Occam's razor, as it is simpler and has better extendability.
comment: Camera-ready version of the paper accepted at the 2025 Annual Meeting of the Cognitive Science Society (CogSci 2025). Published proceedings version: https://escholarship.org/uc/item/623599f4
♻ ☆ MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian
Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.
♻ ☆ PARROT: An Open Multilingual Radiology Reports Dataset
Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.
comment: Corrected affiliations (no change to the paper)
♻ ☆ Fingerprint Vector: Enabling Scalable and Efficient Model Fingerprint Transfer via Vector Addition
Backdoor-based fingerprinting has emerged as an effective technique for tracing the ownership of large language models. However, in real-world deployment scenarios, developers often instantiate multiple downstream models from a shared base model, and applying fingerprinting to each variant individually incurs prohibitive computational overhead. While inheritance-based approaches -- where fingerprints are embedded into the base model and expected to persist through fine-tuning -- appear attractive, they suffer from three key limitations: late-stage fingerprinting, fingerprint instability, and interference with downstream adaptation. To address these challenges, we propose a novel mechanism called the Fingerprint Vector. Our method first embeds a fingerprint into the base model via backdoor-based fine-tuning, then extracts a task-specific parameter delta as a fingerprint vector by computing the difference between the fingerprinted and clean models. This vector can be directly added to any structurally compatible downstream model, allowing the fingerprint to be transferred post hoc without additional fine-tuning. Extensive experiments show that Fingerprint Vector achieves comparable or superior performance to direct injection across key desiderata. It maintains strong effectiveness across diverse model architectures as well as mainstream downstream variants within the same family. It also preserves harmlessness and robustness in most cases. Even when slight robustness degradation is observed, the impact remains within acceptable bounds and is outweighed by the scalability benefits of our approach.
♻ ☆ Investigating the Robustness of Deductive Reasoning with Large Language Models ECAI 2025
Large Language Models (LLMs) have been shown to achieve impressive results for many reasoning-based NLP tasks, suggesting a degree of deductive reasoning capability. However, it remains unclear to which extent LLMs, in both informal and autoformalisation methods, are robust on logical deduction tasks. Moreover, while many LLM-based deduction methods have been proposed, a systematic study that analyses the impact of their design components is lacking. Addressing these two challenges, we propose the first study of the robustness of formal and informal LLM-based deductive reasoning methods. We devise a framework with two families of perturbations: adversarial noise and counterfactual statements, which jointly generate seven perturbed datasets. We organize the landscape of LLM reasoners according to their reasoning format, formalisation syntax, and feedback for error recovery. The results show that adversarial noise affects autoformalisation, while counterfactual statements influence all approaches. Detailed feedback does not improve overall accuracy despite reducing syntax errors, pointing to the challenge of LLM-based methods to self-correct effectively.
comment: to be published in ECAI 2025
♻ ☆ Backdoor Attacks on Dense Retrieval via Public and Unintentional Triggers
Dense retrieval systems have been widely used in various NLP applications. However, their vulnerabilities to potential attacks have been underexplored. This paper investigates a novel attack scenario where the attackers aim to mislead the retrieval system into retrieving the attacker-specified contents. Those contents, injected into the retrieval corpus by attackers, can include harmful text like hate speech or spam. Unlike prior methods that rely on model weights and generate conspicuous, unnatural outputs, we propose a covert backdoor attack triggered by grammar errors. Our approach ensures that the attacked models can function normally for standard queries while covertly triggering the retrieval of the attacker's contents in response to minor linguistic mistakes. Specifically, dense retrievers are trained with contrastive loss and hard negative sampling. Surprisingly, our findings demonstrate that contrastive loss is notably sensitive to grammatical errors, and hard negative sampling can exacerbate susceptibility to backdoor attacks. Our proposed method achieves a high attack success rate with a minimal corpus poisoning rate of only 0.048\%, while preserving normal retrieval performance. This indicates that the method has negligible impact on user experience for error-free queries. Furthermore, evaluations across three real-world defense strategies reveal that the malicious passages embedded within the corpus remain highly resistant to detection and filtering, underscoring the robustness and subtlety of the proposed attack \footnote{Codes of this work are available at https://github.com/ruyue0001/Backdoor_DPR.}.
comment: Accepted by COLM 2025
♻ ☆ Versatile Framework for Song Generation with Prompt-based Control EMNLP 2025
Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks using objective and subjective metrics. Demos and codes are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.
comment: Accepted by Findings of EMNLP 2025
♻ ☆ Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention head level, activations encode fine-grained linearly separable behaviours. Practically, the approach offers a straightforward methodology to steer large language model behaviour, which could be extended to diverse domains beyond safety, requiring fine-grained control over the model output. The code and datasets for this study can be found on https://github.com/PaulDrm/targeted_intervention.
comment: Published at Transaction of Machine Learning Research 08/2025, Large Language Models (LLMs), Interference-time activation shifting, Steerability, Explainability, AI alignment, Interpretability
♻ ☆ OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides the comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models. The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established the framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval .
♻ ☆ Evaluating Scoring Bias in LLM-as-a-Judge
The remarkable performance of Large Language Models (LLMs) gives rise to``LLM-as-a-Judge'', where LLMs are employed as evaluators for complex tasks. Moreover, it has been widely adopted across fields such as Natural Language Processing (NLP), preference learning, and various specific domains. However, there are various biases within LLM-as-a-Judge, which adversely affect the fairness and reliability of judgments. Current research on evaluating or mitigating bias in LLM-as-a-Judge predominantly focuses on comparison-based evaluations, while systematic investigations into bias in scoring-based evaluations remain limited. Therefore, we define scoring bias in LLM-as-a-Judge as the scores differ when scoring judge models are bias-related perturbed, and provide a well-designed framework to comprehensively evaluate scoring bias. We augment existing LLM-as-a-Judge benchmarks through data synthesis to construct our evaluation dataset and design multi-faceted evaluation metrics. Our experimental results demonstrate that the scoring stability of existing judge models is disrupted by scoring biases. Further exploratory experiments and discussions provide valuable insights into the design of scoring prompt templates and the mitigation of scoring biases on aspects such as score rubrics, score IDs, and reference answer selection.
♻ ☆ Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey EMNLP 2025
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.
comment: The first comprehensive survey on controllable TTS. Accepted to the EMNLP 2025 main conference
♻ ☆ A Factuality and Diversity Reconciled Decoding Method for Knowledge-Grounded Dialogue Generation
Grounding external knowledge can enhance the factuality of responses in dialogue generation. However, excessive emphasis on it might result in the lack of engaging and diverse expressions. Through the introduction of randomness in sampling, current approaches can increase the diversity. Nevertheless, such sampling method could undermine the factuality in dialogue generation. In this study, to discover a solution for advancing creativity without relying on questionable randomness and to subtly reconcile the factuality and diversity within the source-grounded paradigm, a novel method named DoGe is proposed. DoGe can dynamically alternate between the utilization of internal parameter knowledge and external source knowledge based on the model's factual confidence. Extensive experiments on three widely-used datasets show that DoGe can not only enhance response diversity but also maintain factuality, and it significantly surpasses other various decoding strategy baselines.
♻ ☆ PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
♻ ☆ Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. We further validate our findings on a diverse 100-sample mini-benchmark, incorporating multiple datasets, expanded prompt variants, and representative commercial LVLMs. Our findings reveal notable discrepancies -- across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis
♻ ☆ LETToT: Label-Free Evaluation of Large Language Models On Tourism Using Expert Tree-of-Thought
Evaluating large language models (LLMs) in specific domain like tourism remains challenging due to the prohibitive cost of annotated benchmarks and persistent issues like hallucinations. We propose $\textbf{L}$able-Free $\textbf{E}$valuation of LLM on $\textbf{T}$ourism using Expert $\textbf{T}$ree-$\textbf{o}$f-$\textbf{T}$hought (LETToT), a framework that leverages expert-derived reasoning structures-instead of labeled data-to access LLMs in tourism. First, we iteratively refine and validate hierarchical ToT components through alignment with generic quality dimensions and expert feedback. Results demonstrate the effectiveness of our systematically optimized expert ToT with 4.99-14.15\% relative quality gains over baselines. Second, we apply LETToT's optimized expert ToT to evaluate models of varying scales (32B-671B parameters), revealing: (1) Scaling laws persist in specialized domains (DeepSeek-V3 leads), yet reasoning-enhanced smaller models (e.g., DeepSeek-R1-Distill-Llama-70B) close this gap; (2) For sub-72B models, explicit reasoning architectures outperform counterparts in accuracy and conciseness ($p<0.05$). Our work established a scalable, label-free paradigm for domain-specific LLM evaluation, offering a robust alternative to conventional annotated benchmarks.
♻ ☆ A2HCoder: An LLM-Driven Coding Agent for Hierarchical Algorithm-to-HDL Translation
In wireless communication systems, stringent requirements such as ultra-low latency and power consumption have significantly increased the demand for efficient algorithm-to-hardware deployment. However, a persistent and substantial gap remains between algorithm design and hardware implementation. Bridging this gap traditionally requires extensive domain expertise and time-consuming manual development, due to fundamental mismatches between high-level programming languages like MATLAB and hardware description languages (HDLs) such as Verilog-in terms of memory access patterns, data processing manners, and datatype representations. To address this challenge, we propose A2HCoder: a Hierarchical Algorithm-to-HDL Coding Agent, powered by large language models (LLMs), designed to enable agile and reliable algorithm-to-hardware translation. A2HCoder introduces a hierarchical framework that enhances both robustness and interpretability while suppressing common hallucination issues in LLM-generated code. In the horizontal dimension, A2HCoder decomposes complex algorithms into modular functional blocks, simplifying code generation and improving consistency. In the vertical dimension, instead of relying on end-to-end generation, A2HCoder performs step-by-step, fine-grained translation, leveraging external toolchains such as MATLAB and Vitis HLS for debugging and circuit-level synthesis. This structured process significantly mitigates hallucinations and ensures hardware-level correctness. We validate A2HCoder through a real-world deployment case in the 5G wireless communication domain, demonstrating its practicality, reliability, and deployment efficiency.
comment: 15 pages, 6 figures
♻ ☆ BiMark: Unbiased Multilayer Watermarking for Large Language Models ICML
Recent advances in Large Language Models (LLMs) have raised urgent concerns about LLM-generated text authenticity, prompting regulatory demands for reliable identification mechanisms. Although watermarking offers a promising solution, existing approaches struggle to simultaneously achieve three critical requirements: text quality preservation, model-agnostic detection, and message embedding capacity, which are crucial for practical implementation. To achieve these goals, the key challenge lies in balancing the trade-off between text quality preservation and message embedding capacity. To address this challenge, we propose BiMark, a novel watermarking framework that achieves these requirements through three key innovations: (1) a bit-flip unbiased reweighting mechanism enabling model-agnostic detection, (2) a multilayer architecture enhancing detectability without compromising generation quality, and (3) an information encoding approach supporting multi-bit watermarking. Through theoretical analysis and extensive experiments, we validate that, compared to state-of-the-art multi-bit watermarking methods, BiMark achieves up to 30% higher extraction rates for short texts while maintaining text quality indicated by lower perplexity, and performs comparably to non-watermarked text on downstream tasks such as summarization and translation.
comment: This paper is accepted by International Conference on Machine Learning (ICML) 2025
♻ ☆ Auto prompt sql: a resource-efficient architecture for text-to-sql translation in constrained environments
Using the best Text-to-SQL methods in resource-constrained environments is challenging due to their reliance on resource-intensive open-source models. This paper introduces Auto Prompt SQL(AP-SQL), a novel architecture designed to bridge the gap between resource-efficient small open-source models and the powerful capabilities of large closed-source models for Text-to-SQL translation. Our method decomposes the task into schema filtering, retrieval-augmented text-to-SQL generation based on in-context examples, and prompt-driven schema linking and SQL generation. To improve schema selection accuracy, we fine-tune large language models. Crucially, we also explore the impact of prompt engineering throughout the process, leveraging Chain-of-Thought(CoT) and Graph-of-Thought(GoT) templates to significantly enhance the model's reasoning for accurate SQL generation. Comprehensive evaluations on the Spider benchmarks demonstrate the effectiveness of AP-SQL.
comment: 4 pages,2 figures,EITCE 2025
♻ ☆ Zero-shot OCR Accuracy of Low-Resourced Languages: A Comparative Analysis on Sinhala and Tamil
Solving the problem of Optical Character Recognition (OCR) on printed text for Latin and its derivative scripts can now be considered settled due to the volumes of research done on English and other High-Resourced Languages (HRL). However, for Low-Resourced Languages (LRL) that use unique scripts, it remains an open problem. This study presents a comparative analysis of the zero-shot performance of six distinct OCR engines on two LRLs: Sinhala and Tamil. The selected engines include both commercial and open-source systems, aiming to evaluate the strengths of each category. The Cloud Vision API, Surya, Document AI, and Tesseract were evaluated for both Sinhala and Tamil, while Subasa OCR and EasyOCR were examined for only one language due to their limitations. The performance of these systems was rigorously analysed using five measurement techniques to assess accuracy at both the character and word levels. According to the findings, Surya delivered the best performance for Sinhala across all metrics, with a WER of 2.61%. Conversely, Document AI excelled across all metrics for Tamil, highlighted by a very low CER of 0.78%. In addition to the above analysis, we also introduce a novel synthetic Tamil OCR benchmarking dataset.
comment: 11 pages, 4 figures, 1 table, Accepted paper at Recent Advances in Natural Language Processing (RANLP) 2025
♻ ☆ Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever
Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs.Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we trained a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling.We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples.Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.
♻ ☆ A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of \emph{Computer-Using Agents} (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: \textit{\textbf{(i)}} define the CUA that suits safety analysis; \textit{\textbf{(ii)} } categorize current safety threats among CUAs; \textit{\textbf{(iii)}} propose a comprehensive taxonomy of existing defensive strategies; \textit{\textbf{(iv)}} summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.
♻ ☆ TombRaider: Entering the Vault of History to Jailbreak Large Language Models EMNLP
Warning: This paper contains content that may involve potentially harmful behaviours, discussed strictly for research purposes. Jailbreak attacks can hinder the safety of Large Language Model (LLM) applications, especially chatbots. Studying jailbreak techniques is an important AI red teaming task for improving the safety of these applications. In this paper, we introduce TombRaider, a novel jailbreak technique that exploits the ability to store, retrieve, and use historical knowledge of LLMs. TombRaider employs two agents, the inspector agent to extract relevant historical information and the attacker agent to generate adversarial prompts, enabling effective bypassing of safety filters. We intensively evaluated TombRaider on six popular models. Experimental results showed that TombRaider could outperform state-of-the-art jailbreak techniques, achieving nearly 100% attack success rates (ASRs) on bare models and maintaining over 55.4% ASR against defence mechanisms. Our findings highlight critical vulnerabilities in existing LLM safeguards, underscoring the need for more robust safety defences.
comment: Main Conference of EMNLP
♻ ☆ SEA-BED: Southeast Asia Embedding Benchmark
Sentence embeddings are essential for NLP tasks such as semantic search, re-ranking, and textual similarity. Although multilingual benchmarks like MMTEB broaden coverage, Southeast Asia (SEA) datasets are scarce and often machine-translated, missing native linguistic properties. With nearly 700 million speakers, the SEA region lacks a region-specific embedding benchmark. We introduce SEA-BED, the first large-scale SEA embedding benchmark with 169 datasets across 9 tasks and 10 languages, where 71% are formulated by humans, not machine generation or translation. We address three research questions: (1) which SEA languages and tasks are challenging, (2) whether SEA languages show unique performance gaps globally, and (3) how human vs. machine translations affect evaluation. We evaluate 17 embedding models across six studies, analyzing task and language challenges, cross-benchmark comparisons, and translation trade-offs. Results show sharp ranking shifts, inconsistent model performance among SEA languages, and the importance of human-curated datasets for low-resource languages like Burmese.
♻ ☆ Adaptive Linguistic Prompting (ALP) Enhances Phishing Webpage Detection in Multimodal Large Language Models ACL 2025
Phishing attacks represent a significant cybersecurity threat, necessitating adaptive detection techniques. This study explores few-shot Adaptive Linguistic Prompting (ALP) in detecting phishing webpages through the multimodal capabilities of state-of-the-art large language models (LLMs) such as GPT-4o and Gemini 1.5 Pro. ALP is a structured semantic reasoning method that guides LLMs to analyze textual deception by breaking down linguistic patterns, detecting urgency cues, and identifying manipulative diction commonly found in phishing content. By integrating textual, visual, and URL-based analysis, we propose a unified model capable of identifying sophisticated phishing attempts. Our experiments demonstrate that ALP significantly enhances phishing detection accuracy by guiding LLMs through structured reasoning and contextual analysis. The findings highlight the potential of ALP-integrated multimodal LLMs to advance phishing detection frameworks, achieving an F1-score of 0.93, surpassing traditional approaches. These results establish a foundation for more robust, interpretable, and adaptive linguistic-based phishing detection systems using LLMs.
comment: Published at ACL 2025 SRW, 9 pages, 3 figures
♻ ☆ Unified attacks to large language model watermarks: spoofing and scrubbing in unauthorized knowledge distillation
Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore critical need for developing watermarking schemes that are robust and unforgeable.
♻ ☆ Orthogonal Finetuning for Direct Preference Optimization
DPO is an effective preference optimization algorithm. However, the DPO-tuned models tend to overfit on the dispreferred samples, manifested as overly long generations lacking diversity. While recent regularization approaches have endeavored to alleviate this issue by modifying the objective function, they achieved that at the cost of alignment performance degradation. In this paper, we innovatively incorporate regularization from the perspective of weight updating to curb alignment overfitting. Through the pilot experiment, we discovered that there exists a positive correlation between overfitting and the hyperspherical energy fluctuation. Hence, we introduce orthogonal finetuning for DPO via a weight-Rotated Preference Optimization (RoPO) method, which merely conducts rotational and magnitude-stretching updates on the weight parameters to maintain the hyperspherical energy invariant, thereby preserving the knowledge encoded in the angle between neurons. Extensive experiments demonstrate that our model aligns perfectly with human preferences while retaining the original expressive capacity using only 0.0086% of the trainable parameters, suggesting an effective regularization against overfitting. Specifically, RoPO outperforms DPO by up to 10 points on MT-Bench and by up to 2.8 points on AlpacaEval 2, while enhancing the generation diversity by an average of 6 points.
♻ ☆ Large Language Models Meet NLP: A Survey
While large language models (LLMs) like ChatGPT have shown impressive capabilities in Natural Language Processing (NLP) tasks, a systematic investigation of their potential in this field remains largely unexplored. This study aims to address this gap by exploring the following questions: (1) How are LLMs currently applied to NLP tasks in the literature? (2) Have traditional NLP tasks already been solved with LLMs? (3) What is the future of the LLMs for NLP? To answer these questions, we take the first step to provide a comprehensive overview of LLMs in NLP. Specifically, we first introduce a unified taxonomy including (1) parameter-frozen paradigm and (2) parameter-tuning paradigm to offer a unified perspective for understanding the current progress of LLMs in NLP. Furthermore, we summarize the new frontiers and the corresponding challenges, aiming to inspire further groundbreaking advancements. We hope this work offers valuable insights into the potential and limitations of LLMs, while also serving as a practical guide for building effective LLMs in NLP.
comment: The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-025-50472-3}
♻ ☆ Defending against Jailbreak through Early Exit Generation of Large Language Models ICONIP 2025
Large Language Models (LLMs) are increasingly attracting attention in various applications. Nonetheless, there is a growing concern as some users attempt to exploit these models for malicious purposes, including the synthesis of controlled substances and the propagation of disinformation. In an effort to mitigate such risks, the concept of "Alignment" technology has been developed. However, recent studies indicate that this alignment can be undermined using sophisticated prompt engineering or adversarial suffixes, a technique known as "Jailbreak." Our research takes cues from the human-like generate process of LLMs. We identify that while jailbreaking prompts may yield output logits similar to benign prompts, their initial embeddings within the model's latent space tend to be more analogous to those of malicious prompts. Leveraging this finding, we propose utilizing the early transformer outputs of LLMs as a means to detect malicious inputs, and terminate the generation immediately. We introduce a simple yet significant defense approach called EEG-Defender for LLMs. We conduct comprehensive experiments on ten jailbreak methods across three models. Our results demonstrate that EEG-Defender is capable of reducing the Attack Success Rate (ASR) by a significant margin, roughly 85% in comparison with 50% for the present SOTAs, with minimal impact on the utility of LLMs.
comment: ICONIP 2025
♻ ☆ Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation EMNLP'25
Intent detection, a core component of natural language understanding, has considerably evolved as a crucial mechanism in safeguarding large language models (LLMs). While prior work has applied intent detection to enhance LLMs' moderation guardrails, showing a significant success against content-level jailbreaks, the robustness of these intent-aware guardrails under malicious manipulations remains under-explored. In this work, we investigate the vulnerability of intent-aware guardrails and demonstrate that LLMs exhibit implicit intent detection capabilities. We propose a two-stage intent-based prompt-refinement framework, IntentPrompt, that first transforms harmful inquiries into structured outlines and further reframes them into declarative-style narratives by iteratively optimizing prompts via feedback loops to enhance jailbreak success for red-teaming purposes. Extensive experiments across four public benchmarks and various black-box LLMs indicate that our framework consistently outperforms several cutting-edge jailbreak methods and evades even advanced Intent Analysis (IA) and Chain-of-Thought (CoT)-based defenses. Specifically, our "FSTR+SPIN" variant achieves attack success rates ranging from 88.25% to 96.54% against CoT-based defenses on the o1 model, and from 86.75% to 97.12% on the GPT-4o model under IA-based defenses. These findings highlight a critical weakness in LLMs' safety mechanisms and suggest that intent manipulation poses a growing challenge to content moderation guardrails.
comment: Accepted for EMNLP'25 Findings. TL;DR: We propose a new two-stage intent-based prompt-refinement framework, IntentPrompt, that aims to explore the vulnerability of LLMs' content moderation guardrails by refining prompts into benign-looking declarative forms via intent manipulation for red-teaming purposes
♻ ☆ Recognizing Limits: Investigating Infeasibility in Large Language Models EMNLP 2025
Large language models (LLMs) have shown remarkable performance in various tasks but often fail to handle queries that exceed their knowledge and capabilities, leading to incorrect or fabricated responses. This paper addresses the need for LLMs to recognize and refuse infeasible tasks due to the requests surpassing their capabilities. We conceptualize four main categories of infeasible tasks for LLMs, which cover a broad spectrum of hallucination-related challenges identified in prior literature. We develop and benchmark a new dataset comprising diverse infeasible and feasible tasks to evaluate multiple LLMs' abilities to decline infeasible tasks. Furthermore, we explore the potential of increasing LLMs' refusal capabilities with fine-tuning. Our experiments validate the effectiveness of the trained models, suggesting promising directions for improving the performance of LLMs in real-world applications.
comment: EMNLP 2025 Findings
♻ ☆ Evolutionary Automata and Deep Evolutionary Computation
Evolution by natural selection, which is one of the most compelling themes of modern science, brought forth evolutionary algorithms and evolutionary computation, applying mechanisms of evolution in nature to various problems solved by computers. In this paper we concentrate on evolutionary automata that constitute an analogous model of evolutionary computation compared to well-known evolutionary algorithms. Evolutionary automata provide a more complete dual model of evolutionary computation, similar like abstract automata (e.g., Turing machines) form a more formal and precise model compared to recursive algorithms and their subset - evolutionary algorithms. An evolutionary automaton is an automaton that evolves performing evolutionary computation perhaps using an infinite number of generations. This model allows for a direct modeling evolution of evolution, and leads to tremendous expressiveness of evolutionary automata and evolutionary computation. This also gives the hint to the power of natural evolution that is self-evolving by interactive feedback with the environment.
♻ ☆ Aligning NLP Models with Target Population Perspectives using PAIR: Population-Aligned Instance Replication EMNLP 2025
Models trained on crowdsourced annotations may not reflect population views, if those who work as annotators do not represent the broader population. In this paper, we propose PAIR: Population-Aligned Instance Replication, a post-processing method that adjusts training data to better reflect target population characteristics without collecting additional annotations. Using simulation studies on offensive language and hate speech detection with varying annotator compositions, we show that non-representative pools degrade model calibration while leaving accuracy largely unchanged. PAIR corrects these calibration problems by replicating annotations from underrepresented annotator groups to match population proportions. We conclude with recommendations for improving the representativity of training data and model performance.
comment: EMNLP 2025 NLPerspectives Workshop
♻ ☆ VAGUE: Visual Contexts Clarify Ambiguous Expressions ICCV 2025
Human communication often relies on visual cues to resolve ambiguity. While humans can intuitively integrate these cues, AI systems often find it challenging to engage in sophisticated multimodal reasoning. We introduce VAGUE, a benchmark evaluating multimodal AI systems' ability to integrate visual context for intent disambiguation. VAGUE consists of 1.6K ambiguous textual expressions, each paired with an image and multiple-choice interpretations, where the correct answer is only apparent with visual context. The dataset spans both staged, complex (Visual Commonsense Reasoning) and natural, personal (Ego4D) scenes, ensuring diversity. Our experiments reveal that existing multimodal AI models struggle to infer the speaker's true intent. While performance consistently improves from the introduction of more visual cues, the overall accuracy remains far below human performance, highlighting a critical gap in multimodal reasoning. Analysis of failure cases demonstrates that current models fail to distinguish true intent from superficial correlations in the visual scene, indicating that they perceive images but do not effectively reason with them. We release our code and data at https://hazel-heejeong-nam.github.io/vague/.
comment: ICCV 2025, 32 pages
♻ ☆ Cyber-Zero: Training Cybersecurity Agents without Runtime
Large Language Models (LLMs) have achieved remarkable success in software engineering tasks when trained with executable runtime environments, particularly in resolving GitHub issues. However, such runtime environments are often unavailable in other domains, especially cybersecurity, where challenge configurations and execution contexts are ephemeral or restricted. We present Cyber-Zero, the first runtime-free framework for synthesizing high-quality agent trajectories to train cybersecurity LLMs. Cyber-Zero leverages publicly available CTF writeups and employs persona-driven LLM simulation to reverse-engineer runtime behaviors and generate realistic, long-horizon interaction sequences without actual environments. Using trajectories synthesized by Cyber-Zero, we train LLM-based agents that achieve up to 13.1% absolute performance gains over baseline models on three prominent CTF benchmarks: InterCode-CTF, NYU CTF Bench, and Cybench. Our best model, Cyber-Zero-32B, establishes new state-of-the-art performance among open-weight models, matching the capabilities of proprietary systems like DeepSeek-V3-0324 and Claude-3.5-Sonnet while offering superior cost-effectiveness, and demonstrating that runtime-free trajectory synthesis can effectively democratize the development of state-of-the-art cybersecurity agents.
comment: Public Link: https://github.com/amazon-science/cyber-zero
♻ ☆ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
Information Retrieval 28
☆ ST-Raptor: LLM-Powered Semi-Structured Table Question Answering SIGMOD 2026
Semi-structured tables, widely used in real-world applications (e.g., financial reports, medical records, transactional orders), often involve flexible and complex layouts (e.g., hierarchical headers and merged cells). These tables generally rely on human analysts to interpret table layouts and answer relevant natural language questions, which is costly and inefficient. To automate the procedure, existing methods face significant challenges. First, methods like NL2SQL require converting semi-structured tables into structured ones, which often causes substantial information loss. Second, methods like NL2Code and multi-modal LLM QA struggle to understand the complex layouts of semi-structured tables and cannot accurately answer corresponding questions. To this end, we propose ST-Raptor, a tree-based framework for semi-structured table question answering using large language models. First, we introduce the Hierarchical Orthogonal Tree (HO-Tree), a structural model that captures complex semi-structured table layouts, along with an effective algorithm for constructing the tree. Second, we define a set of basic tree operations to guide LLMs in executing common QA tasks. Given a user question, ST-Raptor decomposes it into simpler sub-questions, generates corresponding tree operation pipelines, and conducts operation-table alignment for accurate pipeline execution. Third, we incorporate a two-stage verification mechanism: forward validation checks the correctness of execution steps, while backward validation evaluates answer reliability by reconstructing queries from predicted answers. To benchmark the performance, we present SSTQA, a dataset of 764 questions over 102 real-world semi-structured tables. Experiments show that ST-Raptor outperforms nine baselines by up to 20% in answer accuracy. The code is available at https://github.com/weAIDB/ST-Raptor.
comment: Extension of our SIGMOD 2026 paper. Please refer to source code available at: https://github.com/weAIDB/ST-Raptor
☆ PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation
Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel -- unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.
comment: 9 pages, 4 figures, conference
☆ Mirroring Users: Towards Building Preference-aligned User Simulator with User Feedback in Recommendation
User simulation is increasingly vital to develop and evaluate recommender systems (RSs). While Large Language Models (LLMs) offer promising avenues to simulate user behavior, they often struggle with the absence of specific domain alignment required for RSs and the efficiency demands of large-scale simulation. A vast yet underutilized resource for enhancing this alignment is the extensive user feedback inherent in RSs. However, directly leveraging such feedback presents two significant challenges. First, user feedback in RSs is often ambiguous and noisy, which negatively impacts effective preference alignment. Second, the massive volume of feedback largely hinders the efficiency of preference alignment, necessitating an efficient filtering mechanism to identify more informative samples. To overcome these hurdles, we introduce a novel data construction framework that leverages user feedback in RSs with advanced LLM capabilities to generate high-quality simulation data. Our framework unfolds in two key phases: (1) employing LLMs to generate cognitive decision-making processes on constructed simulation samples, reducing ambiguity in raw user feedback; (2) data distillation based on uncertainty estimation and behavior sampling to filter challenging yet denoised simulation samples. Accordingly, we fine-tune lightweight LLMs, as user simulators, using such high-quality dataset with corresponding decision-making processes. Extensive experiments verify that our framework significantly boosts the alignment with human preferences and in-domain reasoning capabilities of fine-tuned LLMs, and provides more insightful and interpretable signals when interacting with RSs. We believe our work will advance the RS community and offer valuable insights for broader human-centric AI research.
comment: Github: https://github.com/UserMirrorer/UserMirrorer
☆ Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations
The rapid evolution of e-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions. Recent advances in multimodal generative retrieval -- particularly those leveraging multimodal large language models (MLLMs) as retrievers -- have shown promise. However, most existing methods are tailored to single-turn scenarios and struggle to model the evolving intent and iterative nature of multi-turn dialogues when applied naively. Concurrently, test-time scaling has emerged as a powerful paradigm for improving large language model (LLM) performance through iterative inference-time refinement. Yet, its effectiveness typically relies on two conditions: (1) a well-defined problem space (e.g., mathematical reasoning), and (2) the model's ability to self-correct -- conditions that are rarely met in conversational product search. In this setting, user queries are often ambiguous and evolving, and MLLMs alone have difficulty grounding responses in a fixed product corpus. Motivated by these challenges, we propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval. Our approach builds on a generative retriever, further augmented with a test-time reranking (TTR) mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue. Experiments across multiple benchmarks show consistent improvements, with average gains of 14.5 points in MRR and 10.6 points in nDCG@1.
☆ HLLM-Creator: Hierarchical LLM-based Personalized Creative Generation
AI-generated content technologies are widely used in content creation. However, current AIGC systems rely heavily on creators' inspiration, rarely generating truly user-personalized content. In real-world applications such as online advertising, a single product may have multiple selling points, with different users focusing on different features. This underscores the significant value of personalized, user-centric creative generation. Effective personalized content generation faces two main challenges: (1) accurately modeling user interests and integrating them into the content generation process while adhering to factual constraints, and (2) ensuring high efficiency and scalability to handle the massive user base in industrial scenarios. Additionally, the scarcity of personalized creative data in practice complicates model training, making data construction another key hurdle. We propose HLLM-Creator, a hierarchical LLM framework for efficient user interest modeling and personalized content generation. During inference, a combination of user clustering and a user-ad-matching-prediction based pruning strategy is employed to significantly enhance generation efficiency and reduce computational overhead, making the approach suitable for large-scale deployment. Moreover, we design a data construction pipeline based on chain-of-thought reasoning, which generates high-quality, user-specific creative titles and ensures factual consistency despite limited personalized data. This pipeline serves as a critical foundation for the effectiveness of our model. Extensive experiments on personalized title generation for Douyin Search Ads show the effectiveness of HLLM-Creator. Online A/B test shows a 0.476% increase on Adss, paving the way for more effective and efficient personalized generation in industrial scenarios. Codes for academic dataset are available at https://github.com/bytedance/HLLM.
☆ HyST: LLM-Powered Hybrid Retrieval over Semi-Structured Tabular Data RecSys 2025
User queries in real-world recommendation systems often combine structured constraints (e.g., category, attributes) with unstructured preferences (e.g., product descriptions or reviews). We introduce HyST (Hybrid retrieval over Semi-structured Tabular data), a hybrid retrieval framework that combines LLM-powered structured filtering with semantic embedding search to support complex information needs over semi-structured tabular data. HyST extracts attribute-level constraints from natural language using large language models (LLMs) and applies them as metadata filters, while processing the remaining unstructured query components via embedding-based retrieval. Experiments on a semi-structured benchmark show that HyST consistently outperforms tradtional baselines, highlighting the importance of structured filtering in improving retrieval precision, offering a scalable and accurate solution for real-world user queries.
comment: Accepted at the 2nd EARL Workshop on Evaluating and Applying Recommender Systems with Large Language Models (RecSys 2025)
☆ Retrieval Feedback Memory Enhancement Large Model Retrieval Generation Method
Large Language Models (LLMs) have shown remarkable capabilities across diverse tasks, yet they face inherent limitations such as constrained parametric knowledge and high retraining costs. Retrieval-Augmented Generation (RAG) augments the generation process by retrieving externally stored knowledge absent from the models internal parameters. However, RAG methods face challenges such as information loss and redundant retrievals during multi-round queries, accompanying the difficulties in precisely characterizing knowledge gaps for complex tasks. To address these problems, we propose Retrieval Feedback and Memory Retrieval Augmented Generation(RFM-RAG), which transforms the stateless retrieval of previous methods into stateful continuous knowledge management by constructing a dynamic evidence pool. Specifically, our method generates refined queries describing the models knowledge gaps using relational triples from questions and evidence from the dynamic evidence pool; Retrieves critical external knowledge to iteratively update this evidence pool; Employs a R-Feedback Model to evaluate evidence completeness until convergence. Compared to traditional RAG methods, our approach enables persistent storage of retrieved passages and effectively distills key information from passages to construct clearly new queries. Experiments on three public QA benchmarks demonstrate that RFM-RAG outperforms previous methods and improves overall system accuracy.
☆ LexSemBridge: Fine-Grained Dense Representation Enhancement through Token-Aware Embedding Augmentation
As queries in retrieval-augmented generation (RAG) pipelines powered by large language models (LLMs) become increasingly complex and diverse, dense retrieval models have demonstrated strong performance in semantic matching. Nevertheless, they often struggle with fine-grained retrieval tasks, where precise keyword alignment and span-level localization are required, even in cases with high lexical overlap that would intuitively suggest easier retrieval. To systematically evaluate this limitation, we introduce two targeted tasks, keyword retrieval and part-of-passage retrieval, designed to simulate practical fine-grained scenarios. Motivated by these observations, we propose LexSemBridge, a unified framework that enhances dense query representations through fine-grained, input-aware vector modulation. LexSemBridge constructs latent enhancement vectors from input tokens using three paradigms: Statistical (SLR), Learned (LLR), and Contextual (CLR), and integrates them with dense embeddings via element-wise interaction. Theoretically, we show that this modulation preserves the semantic direction while selectively amplifying discriminative dimensions. LexSemBridge operates as a plug-in without modifying the backbone encoder and naturally extends to both text and vision modalities. Extensive experiments across semantic and fine-grained retrieval tasks validate the effectiveness and generality of our approach. All code and models are publicly available at https://github.com/Jasaxion/LexSemBridge/
☆ Research on Evaluation Methods for Patent Novelty Search Systems and Empirical Analysis
Patent novelty search systems are critical to IP protection and innovation assessment; their retrieval accuracy directly impacts patent quality. We propose a comprehensive evaluation methodology that builds high-quality, reproducible datasets from examiner citations and X-type citations extracted from technically consistent family patents, and evaluates systems using invention descriptions as inputs. Using Top-k Detection Rate and Recall as core metrics, we further conduct multi-dimensional analyses by language, technical field (IPC), and filing jurisdiction. Experiments show the method effectively exposes performance differences across scenarios and offers actionable evidence for system improvement. The framework is scalable and practical, providing a useful reference for development and optimization of patent novelty search systems
☆ DiffusionGS: Generative Search with Query Conditioned Diffusion in Kuaishou
Personalized search ranking systems are critical for driving engagement and revenue in modern e-commerce and short-video platforms. While existing methods excel at estimating users' broad interests based on the filtered historical behaviors, they typically under-exploit explicit alignment between a user's real-time intent (represented by the user query) and their past actions. In this paper, we propose DiffusionGS, a novel and scalable approach powered by generative models. Our key insight is that user queries can serve as explicit intent anchors to facilitate the extraction of users' immediate interests from long-term, noisy historical behaviors. Specifically, we formulate interest extraction as a conditional denoising task, where the user's query guides a conditional diffusion process to produce a robust, user intent-aware representation from their behavioral sequence. We propose the User-aware Denoising Layer (UDL) to incorporate user-specific profiles into the optimization of attention distribution on the user's past actions. By reframing queries as intent priors and leveraging diffusion-based denoising, our method provides a powerful mechanism for capturing dynamic user interest shifts. Extensive offline and online experiments demonstrate the superiority of DiffusionGS over state-of-the-art methods.
☆ How Do LLM-Generated Texts Impact Term-Based Retrieval Models?
As more content generated by large language models (LLMs) floods into the Internet, information retrieval (IR) systems now face the challenge of distinguishing and handling a blend of human-authored and machine-generated texts. Recent studies suggest that neural retrievers may exhibit a preferential inclination toward LLM-generated content, while classic term-based retrievers like BM25 tend to favor human-written documents. This paper investigates the influence of LLM-generated content on term-based retrieval models, which are valued for their efficiency and robust generalization across domains. Our linguistic analysis reveals that LLM-generated texts exhibit smoother high-frequency and steeper low-frequency Zipf slopes, higher term specificity, and greater document-level diversity. These traits are aligned with LLMs being trained to optimize reader experience through diverse and precise expressions. Our study further explores whether term-based retrieval models demonstrate source bias, concluding that these models prioritize documents whose term distributions closely correspond to those of the queries, rather than displaying an inherent source bias. This work provides a foundation for understanding and addressing potential biases in term-based IR systems managing mixed-source content.
☆ Semantic Search for Information Retrieval
Information retrieval systems have progressed notably from lexical techniques such as BM25 and TF-IDF to modern semantic retrievers. This survey provides a brief overview of the BM25 baseline, then discusses the architecture of modern state-of-the-art semantic retrievers. Advancing from BERT, we introduce dense bi-encoders (DPR), late-interaction models (ColBERT), and neural sparse retrieval (SPLADE). Finally, we examine MonoT5, a cross-encoder model. We conclude with common evaluation tactics, pressing challenges, and propositions for future directions.
☆ Heterogeneous co-occurrence embedding for visual information exploration
This paper proposes an embedding method for co-occurrence data aimed at visual information exploration. We consider cases where co-occurrence probabilities are measured between pairs of elements from heterogeneous domains. The proposed method maps these heterogeneous elements into corresponding two-dimensional latent spaces, enabling visualization of asymmetric relationships between the domains. The key idea is to embed the elements in a way that maximizes their mutual information, thereby preserving the original dependency structure as much as possible. This approach can be naturally extended to cases involving three or more domains, using a generalization of mutual information known as total correlation. For inter-domain analysis, we also propose a visualization method that assigns colors to the latent spaces based on conditional probabilities, allowing users to explore asymmetric relationships interactively. We demonstrate the utility of the method through applications to an adjective-noun dataset, the NeurIPS dataset, and a subject-verb-object dataset, showcasing both intra- and inter-domain analysis.
comment: 36pages, 9 figures, Accepted to International Journal of Innovative Computing, Information and Control (IJICIC), 2025
SurveyGen: Quality-Aware Scientific Survey Generation with Large Language Models
Automatic survey generation has emerged as a key task in scientific document processing. While large language models (LLMs) have shown promise in generating survey texts, the lack of standardized evaluation datasets critically hampers rigorous assessment of their performance against human-written surveys. In this work, we present SurveyGen, a large-scale dataset comprising over 4,200 human-written surveys across diverse scientific domains, along with 242,143 cited references and extensive quality-related metadata for both the surveys and the cited papers. Leveraging this resource, we build QUAL-SG, a novel quality-aware framework for survey generation that enhances the standard Retrieval-Augmented Generation (RAG) pipeline by incorporating quality-aware indicators into literature retrieval to assess and select higher-quality source papers. Using this dataset and framework, we systematically evaluate state-of-the-art LLMs under varying levels of human involvement - from fully automatic generation to human-guided writing. Experimental results and human evaluations show that while semi-automatic pipelines can achieve partially competitive outcomes, fully automatic survey generation still suffers from low citation quality and limited critical analysis.
☆ Demographically-Inspired Query Variants Using an LLM ICTIR'25
This study proposes a method to diversify queries in existing test collections to reflect some of the diversity of search engine users, aligning with an earlier vision of an 'ideal' test collection. A Large Language Model (LLM) is used to create query variants: alternative queries that have the same meaning as the original. These variants represent user profiles characterised by different properties, such as language and domain proficiency, which are known in the IR literature to influence query formulation. The LLM's ability to generate query variants that align with user profiles is empirically validated, and the variants' utility is further explored for IR system evaluation. Results demonstrate that the variants impact how systems are ranked and show that user profiles experience significantly different levels of system effectiveness. This method enables an alternative perspective on system evaluation where we can observe both the impact of user profiles on system rankings and how system performance varies across users.
comment: Published in the proceedings of ICTIR'25, Padua, Italy
☆ Preference Trajectory Modeling via Flow Matching for Sequential Recommendation
Sequential recommendation predicts each user's next item based on their historical interaction sequence. Recently, diffusion models have attracted significant attention in this area due to their strong ability to model user interest distributions. They typically generate target items by denoising Gaussian noise conditioned on historical interactions. However, these models face two critical limitations. First, they exhibit high sensitivity to the condition, making it difficult to recover target items from pure Gaussian noise. Second, the inference process is computationally expensive, limiting practical deployment. To address these issues, we propose FlowRec, a simple yet effective sequential recommendation framework which leverages flow matching to explicitly model user preference trajectories from current states to future interests. Flow matching is an emerging generative paradigm, which offers greater flexibility in initial distributions and enables more efficient sampling. Based on this, we construct a personalized behavior-based prior distribution to replace Gaussian noise and learn a vector field to model user preference trajectories. To better align flow matching with the recommendation objective, we further design a single-step alignment loss incorporating both positive and negative samples, improving sampling efficiency and generation quality. Extensive experiments on four benchmark datasets verify the superiority of FlowRec over the state-of-the-art baselines.
☆ A Universal Framework for Offline Serendipity Evaluation in Recommender Systems via Large Language Models
Serendipity in recommender systems (RSs) has attracted increasing attention as a concept that enhances user satisfaction by presenting unexpected and useful items. However, evaluating serendipitous performance remains challenging because its ground truth is generally unobservable. The existing offline metrics often depend on ambiguous definitions or are tailored to specific datasets and RSs, thereby limiting their generalizability. To address this issue, we propose a universally applicable evaluation framework that leverages large language models (LLMs) known for their extensive knowledge and reasoning capabilities, as evaluators. First, to improve the evaluation performance of the proposed framework, we assessed the serendipity prediction accuracy of LLMs using four different prompt strategies on a dataset containing user-annotated serendipitous ground truth and found that the chain-of-thought prompt achieved the highest accuracy. Next, we re-evaluated the serendipitous performance of both serendipity-oriented and general RSs using the proposed framework on three commonly used real-world datasets, without the ground truth. The results indicated that there was no serendipity-oriented RS that consistently outperformed across all datasets, and even a general RS sometimes achieved higher performance than the serendipity-oriented RS.
☆ DenseRec: Revisiting Dense Content Embeddings for Sequential Transformer-based Recommendation RecSys'25
Transformer-based sequential recommenders, such as SASRec or BERT4Rec, typically rely solely on learned item ID embeddings, making them vulnerable to the item cold-start problem, particularly in environments with dynamic item catalogs. While dense content embeddings from pre-trained models offer potential solutions, direct integration into transformer-based recommenders has consistently underperformed compared to ID-only approaches. We revisit this integration challenge and propose DenseRec, a simple yet effective method that introduces a dual-path embedding approach. DenseRec learns a linear projection from the dense embedding space into the ID embedding space during training, enabling seamless generalization to previously unseen items without requiring specialized embedding models or complex infrastructure. In experiments on three real-world datasets, we find DenseRec to consistently outperform an ID-only SASRec baseline, even without additional hyperparameter tuning and while using compact embedding models. Our analysis suggests improvements primarily arise from better sequence representations in the presence of unseen items, positioning DenseRec as a practical and robust solution for cold-start sequential recommendation.
comment: EARL workshop @RecSys'25, Prague, Czech Republic
☆ REALM: Recursive Relevance Modeling for LLM-based Document Re-Ranking EMNLP 2025
Large Language Models (LLMs) have shown strong capabilities in document re-ranking, a key component in modern Information Retrieval (IR) systems. However, existing LLM-based approaches face notable limitations, including ranking uncertainty, unstable top-k recovery, and high token cost due to token-intensive prompting. To effectively address these limitations, we propose REALM, an uncertainty-aware re-ranking framework that models LLM-derived relevance as Gaussian distributions and refines them through recursive Bayesian updates. By explicitly capturing uncertainty and minimizing redundant queries, REALM achieves better rankings more efficiently. Experimental results demonstrate that our REALM surpasses state-of-the-art re-rankers while significantly reducing token usage and latency, promoting it as the next-generation re-ranker for modern IR systems.
comment: Accepted to EMNLP 2025 (Main Conference). 13 pages, 2 figures
♻ ☆ DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval
Retrieval-augmented generation has achieved strong performance on knowledge-intensive tasks where query-document relevance can be identified through direct lexical or semantic matches. However, many real-world queries involve abstract reasoning, analogical thinking, or multi-step inference, which existing retrievers often struggle to capture. To address this challenge, we present DIVER, a retrieval pipeline designed for reasoning-intensive information retrieval. It consists of four components. The document preprocessing stage enhances readability and preserves content by cleaning noisy texts and segmenting long documents. The query expansion stage leverages large language models to iteratively refine user queries with explicit reasoning and evidence from retrieved documents. The retrieval stage employs a model fine-tuned on synthetic data spanning medical and mathematical domains, along with hard negatives, enabling effective handling of reasoning-intensive queries. Finally, the reranking stage combines pointwise and listwise strategies to produce both fine-grained and globally consistent rankings. On the BRIGHT benchmark, DIVER achieves state-of-the-art nDCG@10 scores of 45.8 overall and 28.9 on original queries, consistently outperforming competitive reasoning-aware models. These results demonstrate the effectiveness of reasoning-aware retrieval strategies in complex real-world tasks.
♻ ☆ MARM: Unlocking the Future of Recommendation Systems through Memory Augmentation and Scalable Complexity CIKM 2025
Scaling-law has guided the language model designing for past years, however, it is worth noting that the scaling laws of NLP cannot be directly applied to RecSys due to the following reasons: (1) The amount of training samples and model parameters is typically not the bottleneck for the model. Our recommendation system can generate over 50 billion user samples daily, and such a massive amount of training data can easily allow our model parameters to exceed 200 billion, surpassing many LLMs (about 100B). (2) To ensure the stability and robustness of the recommendation system, it is essential to control computational complexity FLOPs carefully. Considering the above differences with LLM, we can draw a conclusion that: for a RecSys model, compared to model parameters, the computational complexity FLOPs is a more expensive factor that requires careful control. In this paper, we propose our milestone work, MARM (Memory Augmented Recommendation Model), which explores a new cache scaling-laws successfully.
comment: CIKM 2025
♻ ☆ CLAP: Coreference-Linked Augmentation for Passage Retrieval CIKM 2025
Large Language Model (LLM)-based passage expansion has shown promise for enhancing first-stage retrieval, but often underperforms with dense retrievers due to semantic drift and misalignment with their pretrained semantic space. Beyond this, only a portion of a passage is typically relevant to a query, while the rest introduces noise--an issue compounded by chunking techniques that break coreference continuity. We propose Coreference-Linked Augmentation for Passage Retrieval (CLAP), a lightweight LLM-based expansion framework that segments passages into coherent chunks, resolves coreference chains, and generates localized pseudo-queries aligned with dense retriever representations. A simple fusion of global topical signals and fine-grained subtopic signals achieves robust performance across domains. CLAP yields consistent gains even as retriever strength increases, enabling dense retrievers to match or surpass second-stage rankers such as BM25 + MonoT5-3B, with up to 20.68% absolute nDCG@10 improvement. These improvements are especially notable in out-of-domain settings, where conventional LLM-based expansion methods relying on domain knowledge often falter. CLAP instead adopts a logic-centric pipeline that enables robust, domain-agnostic generalization.
comment: This paper has been accepted by CIKM 2025
♻ ☆ Content-based 3D Image Retrieval and a ColBERT-inspired Re-ranking for Tumor Flagging and Staging
The increasing volume of medical images poses challenges for radiologists in retrieving relevant cases. Content-based image retrieval (CBIR) systems offer potential for efficient access to similar cases, yet lack standardized evaluation and comprehensive studies. Building on prior studies for tumor characterization via CBIR, this study advances CBIR research for volumetric medical images through three key contributions: (1) a framework eliminating reliance on pre-segmented data and organ-specific datasets, aligning with large and unstructured image archiving systems, i.e. PACS in clinical practice; (2) introduction of C-MIR, a novel volumetric re-ranking method adapting ColBERT's contextualized late interaction mechanism for 3D medical imaging; (3) comprehensive evaluation across four tumor sites using three feature extractors and three database configurations. Our evaluations highlight the significant advantages of C-MIR. We demonstrate the successful adaptation of the late interaction principle to volumetric medical images, enabling effective context-aware re-ranking. A key finding is C-MIR's ability to effectively localize the region of interest, eliminating the need for pre-segmentation of datasets and offering a computationally efficient alternative to systems relying on expensive data enrichment steps. C-MIR demonstrates promising improvements in tumor flagging, achieving improved performance, particularly for colon and lung tumors (p<0.05). C-MIR also shows potential for improving tumor staging, warranting further exploration of its capabilities. Ultimately, our work seeks to bridge the gap between advanced retrieval techniques and their practical applications in healthcare, paving the way for improved diagnostic processes.
♻ ☆ Personalized Tree-Based Progressive Regression Model for Watch-Time Prediction in Short Video Recommendation
In online video platforms, accurate watch time prediction has become a fundamental and challenging problem in video recommendation. Previous research has revealed that the accuracy of watch time prediction highly depends on both the transformation of watch-time labels and the decomposition of the estimation process. TPM (Tree based Progressive Regression Model) achieves State-of-the-Art performance with a carefully designed and effective decomposition paradigm. TPM discretizes the watch time into several ordinal intervals and organizes them into a binary decision tree, where each node corresponds to a specific interval. At each non-leaf node, a binary classifier is used to determine the specific interval in which the watch time variable most likely falls, based on the prediction outcome at its parent node. The tree structure is central to TPM, as it defines the decomposition of watch time estimation and how ordinal intervals are discretized. However, TPM uses a predefined full binary tree, which may be sub-optimal for two reasons. First, full binary trees imply equal partitioning of the watch time space, which may fail to capture the complexity of real-world distributions. Second, rather than relying on a fixed global structure, we advocate for a personalized, data-driven tree that can be learned end-to-end. Thus, we propose PTPM to enable highly personalized decomposition of watch estimation with better efficacy and efficiency. Moreover, we show that TPM suffers from selection bias due to conditional modeling and propose a simple solution. We conduct extensive experiments on offline datasets and online environments. Offline results show improved watch time accuracy, and online A/B tests further validate the effectiveness of our framework. PTPM has been fully deployed in core traffic scenarios and now serves over 400 million users daily.
comment: cikm'25
♻ ☆ Interaction-Data-guided Conditional Instrumental Variables for Debiasing Recommender Systems IJCAI 2025
It is often challenging to identify a valid instrumental variable (IV), although the IV methods have been regarded as effective tools of addressing the confounding bias introduced by latent variables. To deal with this issue, an Interaction-Data-guided Conditional IV (IDCIV) debiasing method is proposed for Recommender Systems, called IDCIV-RS. The IDCIV-RS automatically generates the representations of valid CIVs and their corresponding conditioning sets directly from interaction data, significantly reducing the complexity of IV selection while effectively mitigating the confounding bias caused by latent variables in recommender systems. Specifically, the IDCIV-RS leverages a variational autoencoder (VAE) to learn both the CIV representations and their conditioning sets from interaction data, followed by the application of least squares to derive causal representations for click prediction. Extensive experiments on two real-world datasets, Movielens-10M and Douban-Movie, demonstrate that IDCIV-RS successfully learns the representations of valid CIVs, effectively reduces bias, and consequently improves recommendation accuracy.
comment: Accepted at IJCAI 2025
♻ ☆ LongRetriever: Towards Ultra-Long Sequence based Candidate Retrieval for Recommendation
Precisely modeling user ultra-long sequences is critical for industrial recommender systems. Current approaches predominantly focus on leveraging ultra-long sequences in the ranking stage, whereas research for the candidate retrieval stage remains under-explored. This paper presents LongRetriever, a practical framework for incorporating ultra-long sequences into the retrieval stage of recommenders. Specifically, we propose in-context training and multi-context retrieval, which enable candidate-specific interaction between user sequence and candidate item, and ensure training-serving consistency under the search-based paradigm. Extensive online A/B testing conducted on a large-scale e-commerce platform demonstrates statistically significant improvements, confirming the framework's effectiveness. Currently, LongRetriever has been fully deployed in the platform, impacting billions of users.
♻ ☆ Post-fusion monolithic hybrid recommender system for suggesting relevant movies to users
Recommendation systems have become the fundamental services to facilitate users information access. Generally, recommendation system works by filtering historical behaviors to understand and learn users preferences. With the growth of online information, recommendations have become of crucial importance in information filtering to prevent the information overload problem. In this study, we considered hybrid post-fusion of two approaches of collaborative filtering, by using sequences of watched movies and considering the related movies rating. After considering both techniques and applying the weights matrix, the recommendations would be modified to correspond to the users preference as needed. We discussed that various weights would be set based on use cases. For instance, in cases where we have the rating for most classes, we will assign a higher weight to the rating matrix and in case where the rating is unavailable for the majority of cases, the higher weights might be assigned to the sequential dataset. An extensive discussion is made in the context of this paper. Sequential type of the watched movies was used in conjunction of the rating as especially that model might be inadequate in distinguishing users long-term preference and that does not account for the rating of the watched movies and thus that model along might not suffice. Extensive discussion was made regarding the literature and methodological approach to solve the problem.
♻ ☆ LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence
In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren't capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to $\sim$3 hours ($\sim$20x) for the competitive analysis.
Machine Learning 150
☆ ANO : Faster is Better in Noisy Landscape
Stochastic optimizers are central to deep learning, yet widely used methods such as Adam and Adan can degrade in non-stationary or noisy environments, partly due to their reliance on momentum-based magnitude estimates. We introduce Ano, a novel optimizer that decouples direction and magnitude: momentum is used for directional smoothing, while instantaneous gradient magnitudes determine step size. This design improves robustness to gradient noise while retaining the simplicity and efficiency of first-order methods. We further propose Anolog, which removes sensitivity to the momentum coefficient by expanding its window over time via a logarithmic schedule. We establish non-convex convergence guarantees with a convergence rate similar to other sign-based methods, and empirically show that Ano provides substantial gains in noisy and non-stationary regimes such as reinforcement learning, while remaining competitive on low-noise tasks such as standard computer vision benchmarks.
comment: Work in progress, 26 pages total with appendix, 7 figures, 12 tables
☆ Aligning the Evaluation of Probabilistic Predictions with Downstream Value
Every prediction is ultimately used in a downstream task. Consequently, evaluating prediction quality is more meaningful when considered in the context of its downstream use. Metrics based solely on predictive performance often diverge from measures of real-world downstream impact. Existing approaches incorporate the downstream view by relying on multiple task-specific metrics, which can be burdensome to analyze, or by formulating cost-sensitive evaluations that require an explicit cost structure, typically assumed to be known a priori. We frame this mismatch as an evaluation alignment problem and propose a data-driven method to learn a proxy evaluation function aligned with the downstream evaluation. Building on the theory of proper scoring rules, we explore transformations of scoring rules that ensure the preservation of propriety. Our approach leverages weighted scoring rules parametrized by a neural network, where weighting is learned to align with the performance in the downstream task. This enables fast and scalable evaluation cycles across tasks where the weighting is complex or unknown a priori. We showcase our framework through synthetic and real-data experiments for regression tasks, demonstrating its potential to bridge the gap between predictive evaluation and downstream utility in modular prediction systems.
☆ Type-Compliant Adaptation Cascades: Adapting Programmatic LM Workflows to Data
Reliably composing Large Language Models (LLMs) for complex, multi-step workflows remains a significant challenge. The dominant paradigm-optimizing discrete prompts in a pipeline-is notoriously brittle and struggles to enforce the formal compliance required for structured tasks. We introduce Type-Compliant Adaptation Cascades (TACs), a framework that recasts workflow adaptation as learning typed probabilistic programs. TACs treats the entire workflow, which is composed of parameter-efficiently adapted LLMs and deterministic logic, as an unnormalized joint distribution. This enables principled, gradient-based training even with latent intermediate structures. We provide theoretical justification for our tractable optimization objective, proving that the optimization bias vanishes as the model learns type compliance. Empirically, TACs significantly outperforms state-of-the-art prompt-optimization baselines. Gains are particularly pronounced on structured tasks, improving MGSM-SymPy from $57.1\%$ to $75.9\%$ for a 27B model, MGSM from $1.6\%$ to $27.3\%$ for a 7B model. TACs offers a robust and theoretically grounded paradigm for developing reliable, task-compliant LLM systems.
☆ Flash Sparse Attention: An Alternative Efficient Implementation of Native Sparse Attention Kernel
Recent progress in sparse attention mechanisms has demonstrated strong potential for reducing the computational cost of long-context training and inference in large language models (LLMs). Native Sparse Attention (NSA), a state-of-the-art approach, introduces natively trainable, hardware-aligned sparse attention that delivers substantial system-level performance gains while maintaining accuracy comparable to full attention. However, the kernel implementation of NSA relies on a query-grouping strategy that is efficient only with large Grouped Query Attention (GQA) sizes, whereas modern LLMs typically adopt much smaller GQA groups, which limits the applicability of this sparse algorithmic advance. In this work, we propose Flash Sparse Attention (FSA), which includes an alternative kernel design that enables efficient NSA computation across a wide range of popular LLMs with varied smaller GQA group sizes on modern GPUs. Compared to vanilla NSA kernel implementation, our empirical evaluation demonstrates that FSA achieves (i) up to 3.5$\times$ and on average 1.6$\times$ kernel-level latency reduction, (ii) up to 1.25$\times$ and 1.09$\times$ on average end-to-end training speedup on state-of-the-art LLMs, and (iii) up to 1.36$\times$ and 1.11$\times$ on average end-to-end prefill speedup on state-of-the-art LLMs. The source code is open-sourced and publicly available at https://github.com/Relaxed-System-Lab/Flash-Sparse-Attention.
☆ Clinical characteristics, complications and outcomes of critically ill patients with Dengue in Brazil, 2012-2024: a nationwide, multicentre cohort study
Background. Dengue outbreaks are a major public health issue, with Brazil reporting 71% of global cases in 2024. Purpose. This study aims to describe the profile of severe dengue patients admitted to Brazilian Intensive Care units (ICUs) (2012-2024), assess trends over time, describe new onset complications while in ICU and determine the risk factors at admission to develop complications during ICU stay. Methods. We performed a prospective study of dengue patients from 253 ICUs across 56 hospitals. We used descriptive statistics to describe the dengue ICU population, logistic regression to identify risk factors for complications during the ICU stay, and a machine learning framework to predict the risk of evolving to complications. Visualisations were generated using ISARIC VERTEX. Results. Of 11,047 admissions, 1,117 admissions (10.1%) evolved to complications, including non-invasive (437 admissions) and invasive ventilation (166), vasopressor (364), blood transfusion (353) and renal replacement therapy (103). Age>80 (OR: 3.10, 95% CI: 2.02-4.92), chronic kidney disease (OR: 2.94, 2.22-3.89), liver cirrhosis (OR: 3.65, 1.82-7.04), low platelets (<50,000 cells/mm3; OR: OR: 2.25, 1.89-2.68), and high leukocytes (>7,000 cells/mm3; OR: 2.47, 2.02-3.03) were significant risk factors for complications. A machine learning tool for predicting complications was proposed, showing accurate discrimination and calibration. Conclusion. We described a large cohort of dengue patients admitted to ICUs and identified key risk factors for severe dengue complications, such as advanced age, presence of comorbidities, higher level of leukocytes and lower level of platelets. The proposed prediction tool can be used for early identification and targeted interventions to improve outcomes in dengue-endemic regions.
☆ Practical GPU Choices for Earth Observation: ResNet-50 Training Throughput on Integrated, Laptop, and Cloud Accelerators
This project implements a ResNet-based pipeline for land use and land cover (LULC) classification on Sentinel-2 imagery, benchmarked across three heterogeneous GPUs. The workflow automates data acquisition, geospatial preprocessing, tiling, model training, and visualization, and is fully containerized for reproducibility. Performance evaluation reveals up to a 2x training speed-up on an NVIDIA RTX 3060 and a Tesla T4 compared to the Apple M3 Pro baseline, while maintaining high classification accuracy on the EuroSAT dataset. These results demonstrate the feasibility of deploying deep learning LULC models on consumer and free cloud GPUs for scalable geospatial analytics.
comment: 10 pages, 5 figures
☆ HypER: Hyperbolic Echo State Networks for Capturing Stretch-and-Fold Dynamics in Chaotic Flows ECAI 2025
Forecasting chaotic dynamics beyond a few Lyapunov times is difficult because infinitesimal errors grow exponentially. Existing Echo State Networks (ESNs) mitigate this growth but employ reservoirs whose Euclidean geometry is mismatched to the stretch-and-fold structure of chaos. We introduce the Hyperbolic Embedding Reservoir (HypER), an ESN whose neurons are sampled in the Poincare ball and whose connections decay exponentially with hyperbolic distance. This negative-curvature construction embeds an exponential metric directly into the latent space, aligning the reservoir's local expansion-contraction spectrum with the system's Lyapunov directions while preserving standard ESN features such as sparsity, leaky integration, and spectral-radius control. Training is limited to a Tikhonov-regularized readout. On the chaotic Lorenz-63 and Roessler systems, and the hyperchaotic Chen-Ueta attractor, HypER consistently lengthens the mean valid-prediction horizon beyond Euclidean and graph-structured ESN baselines, with statistically significant gains confirmed over 30 independent runs; parallel results on real-world benchmarks, including heart-rate variability from the Santa Fe and MIT-BIH datasets and international sunspot numbers, corroborate its advantage. We further establish a lower bound on the rate of state divergence for HypER, mirroring Lyapunov growth.
comment: 8 pages, accepted in ECAI 2025
☆ Unraveling the cognitive patterns of Large Language Models through module communities
Large Language Models (LLMs) have reshaped our world with significant advancements in science, engineering, and society through applications ranging from scientific discoveries and medical diagnostics to Chatbots. Despite their ubiquity and utility, the underlying mechanisms of LLM remain concealed within billions of parameters and complex structures, making their inner architecture and cognitive processes challenging to comprehend. We address this gap by adopting approaches to understanding emerging cognition in biology and developing a network-based framework that links cognitive skills, LLM architectures, and datasets, ushering in a paradigm shift in foundation model analysis. The skill distribution in the module communities demonstrates that while LLMs do not strictly parallel the focalized specialization observed in specific biological systems, they exhibit unique communities of modules whose emergent skill patterns partially mirror the distributed yet interconnected cognitive organization seen in avian and small mammalian brains. Our numerical results highlight a key divergence from biological systems to LLMs, where skill acquisition benefits substantially from dynamic, cross-regional interactions and neural plasticity. By integrating cognitive science principles with machine learning, our framework provides new insights into LLM interpretability and suggests that effective fine-tuning strategies should leverage distributed learning dynamics rather than rigid modular interventions.
☆ Emerging Semantic Segmentation from Positive and Negative Coarse Label Learning
Large annotated datasets are vital for training segmentation models, but pixel-level labeling is time-consuming, error-prone, and often requires scarce expert annotators, especially in medical imaging. In contrast, coarse annotations are quicker, cheaper, and easier to produce, even by non-experts. In this paper, we propose to use coarse drawings from both positive (target) and negative (background) classes in the image, even with noisy pixels, to train a convolutional neural network (CNN) for semantic segmentation. We present a method for learning the true segmentation label distributions from purely noisy coarse annotations using two coupled CNNs. The separation of the two CNNs is achieved by high fidelity with the characters of the noisy training annotations. We propose to add a complementary label learning that encourages estimating negative label distribution. To illustrate the properties of our method, we first use a toy segmentation dataset based on MNIST. We then present the quantitative results of experiments using publicly available datasets: Cityscapes dataset for multi-class segmentation, and retinal images for medical applications. In all experiments, our method outperforms state-of-the-art methods, particularly in the cases where the ratio of coarse annotations is small compared to the given dense annotations.
☆ AdLoCo: adaptive batching significantly improves communications efficiency and convergence for Large Language Models
Scaling distributed training of Large Language Models (LLMs) requires not only algorithmic advances but also efficient utilization of heterogeneous hardware resources. While existing methods such as DiLoCo have demonstrated promising results, they often fail to fully exploit computational clusters under dynamic workloads. To address this limitation, we propose a three-stage method that combines Multi-Instance Training (MIT), Adaptive Batched DiLoCo, and switch mode mechanism. MIT allows individual nodes to run multiple lightweight training streams with different model instances in parallel and merge them to combine knowledge, increasing throughput and reducing idle time. Adaptive Batched DiLoCo dynamically adjusts local batch sizes to balance computation and communication, substantially lowering synchronization delays. Switch mode further stabilizes training by seamlessly introducing gradient accumulation once adaptive batch sizes grow beyond hardware-friendly limits. Together, these innovations improve both convergence speed and system efficiency. We also provide a theoretical estimate of the number of communications required for the full convergence of a model trained using our method.
☆ Introduction to Regularization and Learning Methods for Inverse Problems
These lecture notes evolve around mathematical concepts arising in inverse problems. We start by introducing inverse problems through examples such as differentiation, deconvolution, computed tomography and phase retrieval. This then leads us to the framework of well-posedness and first considerations regarding reconstruction and inversion approaches. The second chapter then first deals with classical regularization theory of inverse problems in Hilbert spaces. After introducing the pseudo-inverse, we review the concept of convergent regularization. Within this chapter we then proceed to ask the question of how to realize practical reconstruction algorithms. Here, we mainly focus on Tikhonov and sparsity promoting regularization in finite dimensional spaces. In the third chapter, we dive into modern deep-learning methods, which allow solving inverse problems in a data-dependent approach. The intersection between inverse problems and machine learning is a rapidly growing field and our exposition here restricts itself to a very limited selection of topics. Among them are learned regularization, fully-learned Bayesian estimation, post-processing strategies and plug-n-play methods.
comment: These lecture notes are based on a lecture taught by the authors in the winter semester 2024/2025 at the University of Hamburg
☆ Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance
This study proposes the dual technological innovation framework, including a cross-modal differ entiated quantization framework for vision-language models (VLMs) and a scene-aware vectorized memory multi-agent system for visually impaired assistance. The modular framework was developed implementing differentiated processing strategies, effectively reducing memory requirements from 38GB to 16GB while maintaining model performance. The multi-agent architecture combines scene classification, vectorized memory, and multimodal interaction, enabling persistent storage and efficient retrieval of scene memories. Through perception-memory-reasoning workflows, the system provides environmental information beyond the current view using historical memories. Experiments show the quantized 19B-parameter model only experiences a 2.05% performance drop on MMBench and maintains 63.7 accuracy on OCR-VQA (original: 64.9), outperforming smaller models with equivalent memory requirements like the Molmo-7B series. The system maintains response latency between 2.83-3.52 seconds from scene analysis to initial speech output, substantially faster than non-streaming methods. This research advances computational efficiency and assistive technology, offering visually impaired users comprehensive real-time assistance in scene perception, text recognition, and navigation.
comment: 28 pages,9 figures
☆ Amortized Sampling with Transferable Normalizing Flows
Efficient equilibrium sampling of molecular conformations remains a core challenge in computational chemistry and statistical inference. Classical approaches such as molecular dynamics or Markov chain Monte Carlo inherently lack amortization; the computational cost of sampling must be paid in-full for each system of interest. The widespread success of generative models has inspired interest into overcoming this limitation through learning sampling algorithms. Despite performing on par with conventional methods when trained on a single system, learned samplers have so far demonstrated limited ability to transfer across systems. We prove that deep learning enables the design of scalable and transferable samplers by introducing Prose, a 280 million parameter all-atom transferable normalizing flow trained on a corpus of peptide molecular dynamics trajectories up to 8 residues in length. Prose draws zero-shot uncorrelated proposal samples for arbitrary peptide systems, achieving the previously intractable transferability across sequence length, whilst retaining the efficient likelihood evaluation of normalizing flows. Through extensive empirical evaluation we demonstrate the efficacy of Prose as a proposal for a variety of sampling algorithms, finding a simple importance sampling-based finetuning procedure to achieve superior performance to established methods such as sequential Monte Carlo on unseen tetrapeptides. We open-source the Prose codebase, model weights, and training dataset, to further stimulate research into amortized sampling methods and finetuning objectives.
☆ Unveiling the Actual Performance of Neural-based Models for Equation Discovery on Graph Dynamical Systems
The ``black-box'' nature of deep learning models presents a significant barrier to their adoption for scientific discovery, where interpretability is paramount. This challenge is especially pronounced in discovering the governing equations of dynamical processes on networks or graphs, since even their topological structure further affects the processes' behavior. This paper provides a rigorous, comparative assessment of state-of-the-art symbolic regression techniques for this task. We evaluate established methods, including sparse regression and MLP-based architectures, and introduce a novel adaptation of Kolmogorov-Arnold Networks (KANs) for graphs, designed to exploit their inherent interpretability. Across a suite of synthetic and real-world dynamical systems, our results demonstrate that both MLP and KAN-based architectures can successfully identify the underlying symbolic equations, significantly surpassing existing baselines. Critically, we show that KANs achieve this performance with greater parsimony and transparency, as their learnable activation functions provide a clearer mapping to the true physical dynamics. This study offers a practical guide for researchers, clarifying the trade-offs between model expressivity and interpretability, and establishes the viability of neural-based architectures for robust scientific discovery on complex systems.
comment: Preprint. Under Review
☆ PCR-CA: Parallel Codebook Representations with Contrastive Alignment for Multiple-Category App Recommendation
Modern app store recommender systems struggle with multiple-category apps, as traditional taxonomies fail to capture overlapping semantics, leading to suboptimal personalization. We propose PCR-CA (Parallel Codebook Representations with Contrastive Alignment), an end-to-end framework for improved CTR prediction. PCR-CA first extracts compact multimodal embeddings from app text, then introduces a Parallel Codebook VQ-AE module that learns discrete semantic representations across multiple codebooks in parallel -- unlike hierarchical residual quantization (RQ-VAE). This design enables independent encoding of diverse aspects (e.g., gameplay, art style), better modeling multiple-category semantics. To bridge semantic and collaborative signals, we employ a contrastive alignment loss at both the user and item levels, enhancing representation learning for long-tail items. Additionally, a dual-attention fusion mechanism combines ID-based and semantic features to capture user interests, especially for long-tail apps. Experiments on a large-scale dataset show PCR-CA achieves a +0.76% AUC improvement over strong baselines, with +2.15% AUC gains for long-tail apps. Online A/B testing further validates our approach, showing a +10.52% lift in CTR and a +16.30% improvement in CVR, demonstrating PCR-CA's effectiveness in real-world deployment. The new framework has now been fully deployed on the Microsoft Store.
comment: 9 pages, 4 figures, conference
☆ The Computational Complexity of Satisfiability in State Space Models ECAI 25
We analyse the complexity of the satisfiability problem ssmSAT for State Space Models (SSM), which asks whether an input sequence can lead the model to an accepting configuration. We find that ssmSAT is undecidable in general, reflecting the computational power of SSM. Motivated by practical settings, we identify two natural restrictions under which ssmSAT becomes decidable and establish corresponding complexity bounds. First, for SSM with bounded context length, ssmSAT is NP-complete when the input length is given in unary and in NEXPTIME (and PSPACE-hard) when the input length is given in binary. Second, for quantised SSM operating over fixed-width arithmetic, ssmSAT is PSPACE-complete resp. in EXPSPACE depending on the bit-width encoding. While these results hold for diagonal gated SSM we also establish complexity bounds for time-invariant SSM. Our results establish a first complexity landscape for formal reasoning in SSM and highlight fundamental limits and opportunities for the verification of SSM-based language models.
comment: Accepted at ECAI 25
☆ Hybrid Quantum-Classical Learning for Multiclass Image Classification
This study explores the challenge of improving multiclass image classification through quantum machine-learning techniques. It explores how the discarded qubit states of Noisy Intermediate-Scale Quantum (NISQ) quantum convolutional neural networks (QCNNs) can be leveraged alongside a classical classifier to improve classification performance. Current QCNNs discard qubit states after pooling; yet, unlike classical pooling, these qubits often remain entangled with the retained ones, meaning valuable correlated information is lost. We experiment with recycling this information and combining it with the conventional measurements from the retained qubits. Accordingly, we propose a hybrid quantum-classical architecture that couples a modified QCNN with fully connected classical layers. Two shallow fully connected (FC) heads separately process measurements from retained and discarded qubits, whose outputs are ensembled before a final classification layer. Joint optimisation with a classical cross-entropy loss allows both quantum and classical parameters to adapt coherently. The method outperforms comparable lightweight models on MNIST, Fashion-MNIST and OrganAMNIST. These results indicate that reusing discarded qubit information is a promising approach for future hybrid quantum-classical models and may extend to tasks beyond image classification.
comment: 13 pages, 8 figures
☆ SpotEdit: Evaluating Visually-Guided Image Editing Methods
Visually-guided image editing, where edits are conditioned on both visual cues and textual prompts, has emerged as a powerful paradigm for fine-grained, controllable content generation. Although recent generative models have shown remarkable capabilities, existing evaluations remain simple and insufficiently representative of real-world editing challenges. We present SpotEdit, a comprehensive benchmark designed to systematically assess visually-guided image editing methods across diverse diffusion, autoregressive, and hybrid generative models, uncovering substantial performance disparities. To address a critical yet underexplored challenge, our benchmark includes a dedicated component on hallucination, highlighting how leading models, such as GPT-4o, often hallucinate the existence of a visual cue and erroneously perform the editing task. Our code and benchmark are publicly released at https://github.com/SaraGhazanfari/SpotEdit.
☆ Assessing the Noise Robustness of Class Activation Maps: A Framework for Reliable Model Interpretability
Class Activation Maps (CAMs) are one of the important methods for visualizing regions used by deep learning models. Yet their robustness to different noise remains underexplored. In this work, we evaluate and report the resilience of various CAM methods for different noise perturbations across multiple architectures and datasets. By analyzing the influence of different noise types on CAM explanations, we assess the susceptibility to noise and the extent to which dataset characteristics may impact explanation stability. The findings highlight considerable variability in noise sensitivity for various CAMs. We propose a robustness metric for CAMs that captures two key properties: consistency and responsiveness. Consistency reflects the ability of CAMs to remain stable under input perturbations that do not alter the predicted class, while responsiveness measures the sensitivity of CAMs to changes in the prediction caused by such perturbations. The metric is evaluated empirically across models, different perturbations, and datasets along with complementary statistical tests to exemplify the applicability of our proposed approach.
comment: Image and Vision Computing (2025)
☆ BirdRecorder's AI on Sky: Safeguarding birds of prey by detection and classification of tiny objects around wind turbines
The urgent need for renewable energy expansion, particularly wind power, is hindered by conflicts with wildlife conservation. To address this, we developed BirdRecorder, an advanced AI-based anti-collision system to protect endangered birds, especially the red kite (Milvus milvus). Integrating robotics, telemetry, and high-performance AI algorithms, BirdRecorder aims to detect, track, and classify avian species within a range of 800 m to minimize bird-turbine collisions. BirdRecorder integrates advanced AI methods with optimized hardware and software architectures to enable real-time image processing. Leveraging Single Shot Detector (SSD) for detection, combined with specialized hardware acceleration and tracking algorithms, our system achieves high detection precision while maintaining the speed necessary for real-time decision-making. By combining these components, BirdRecorder outperforms existing approaches in both accuracy and efficiency. In this paper, we summarize results on field tests and performance of the BirdRecorder system. By bridging the gap between renewable energy expansion and wildlife conservation, BirdRecorder contributes to a more sustainable coexistence of technology and nature.
comment: 18 pages, 1 figures, to appear in Proceedings of the 19th International Conference on Intelligent Autonomous Systems (IAS-19), Genoa, Italy, 2025
☆ Test-Time Scaling Strategies for Generative Retrieval in Multimodal Conversational Recommendations
The rapid evolution of e-commerce has exposed the limitations of traditional product retrieval systems in managing complex, multi-turn user interactions. Recent advances in multimodal generative retrieval -- particularly those leveraging multimodal large language models (MLLMs) as retrievers -- have shown promise. However, most existing methods are tailored to single-turn scenarios and struggle to model the evolving intent and iterative nature of multi-turn dialogues when applied naively. Concurrently, test-time scaling has emerged as a powerful paradigm for improving large language model (LLM) performance through iterative inference-time refinement. Yet, its effectiveness typically relies on two conditions: (1) a well-defined problem space (e.g., mathematical reasoning), and (2) the model's ability to self-correct -- conditions that are rarely met in conversational product search. In this setting, user queries are often ambiguous and evolving, and MLLMs alone have difficulty grounding responses in a fixed product corpus. Motivated by these challenges, we propose a novel framework that introduces test-time scaling into conversational multimodal product retrieval. Our approach builds on a generative retriever, further augmented with a test-time reranking (TTR) mechanism that improves retrieval accuracy and better aligns results with evolving user intent throughout the dialogue. Experiments across multiple benchmarks show consistent improvements, with average gains of 14.5 points in MRR and 10.6 points in nDCG@1.
☆ Frozen in Time: Parameter-Efficient Time Series Transformers via Reservoir-Induced Feature Expansion and Fixed Random Dynamics ECAI 2025
Transformers are the de-facto choice for sequence modelling, yet their quadratic self-attention and weak temporal bias can make long-range forecasting both expensive and brittle. We introduce FreezeTST, a lightweight hybrid that interleaves frozen random-feature (reservoir) blocks with standard trainable Transformer layers. The frozen blocks endow the network with rich nonlinear memory at no optimisation cost; the trainable layers learn to query this memory through self-attention. The design cuts trainable parameters and also lowers wall-clock training time, while leaving inference complexity unchanged. On seven standard long-term forecasting benchmarks, FreezeTST consistently matches or surpasses specialised variants such as Informer, Autoformer, and PatchTST; with substantially lower compute. Our results show that embedding reservoir principles within Transformers offers a simple, principled route to efficient long-term time-series prediction.
comment: 8 pages, 5 tables, 3 figures, accepted at ECAI 2025
☆ CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
We introduce CMPhysBench, designed to assess the proficiency of Large Language Models (LLMs) in Condensed Matter Physics, as a novel Benchmark. CMPhysBench is composed of more than 520 graduate-level meticulously curated questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, strongly correlated systems, etc. To ensure a deep understanding of the problem-solving process,we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of similarity between prediction and ground-truth. Our results show that even the best models, Grok-4, reach only 36 average SEED score and 28% accuracy on CMPhysBench, underscoring a significant capability gap, especially for this practical and frontier domain relative to traditional physics. The code anddataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.
comment: 29 pages, 7 figures
☆ Provable Mixed-Noise Learning with Flow-Matching
We study Bayesian inverse problems with mixed noise, modeled as a combination of additive and multiplicative Gaussian components. While traditional inference methods often assume fixed or known noise characteristics, real-world applications, particularly in physics and chemistry, frequently involve noise with unknown and heterogeneous structure. Motivated by recent advances in flow-based generative modeling, we propose a novel inference framework based on conditional flow matching embedded within an Expectation-Maximization (EM) algorithm to jointly estimate posterior samplers and noise parameters. To enable high-dimensional inference and improve scalability, we use simulation-free ODE-based flow matching as the generative model in the E-step of the EM algorithm. We prove that, under suitable assumptions, the EM updates converge to the true noise parameters in the population limit of infinite observations. Our numerical results illustrate the effectiveness of combining EM inference with flow matching for mixed-noise Bayesian inverse problems.
☆ The AI Data Scientist
Imagine decision-makers uploading data and, within minutes, receiving clear, actionable insights delivered straight to their fingertips. That is the promise of the AI Data Scientist, an autonomous Agent powered by large language models (LLMs) that closes the gap between evidence and action. Rather than simply writing code or responding to prompts, it reasons through questions, tests ideas, and delivers end-to-end insights at a pace far beyond traditional workflows. Guided by the scientific tenet of the hypothesis, this Agent uncovers explanatory patterns in data, evaluates their statistical significance, and uses them to inform predictive modeling. It then translates these results into recommendations that are both rigorous and accessible. At the core of the AI Data Scientist is a team of specialized LLM Subagents, each responsible for a distinct task such as data cleaning, statistical testing, validation, and plain-language communication. These Subagents write their own code, reason about causality, and identify when additional data is needed to support sound conclusions. Together, they achieve in minutes what might otherwise take days or weeks, enabling a new kind of interaction that makes deep data science both accessible and actionable.
☆ Detecting and Characterizing Planning in Language Models
Modern large language models (LLMs) have demonstrated impressive performance across a wide range of multi-step reasoning tasks. Recent work suggests that LLMs may perform planning - selecting a future target token in advance and generating intermediate tokens that lead towards it - rather than merely improvising one token at a time. However, existing studies assume fixed planning horizons and often focus on single prompts or narrow domains. To distinguish planning from improvisation across models and tasks, we present formal and causally grounded criteria for detecting planning and operationalize them as a semi-automated annotation pipeline. We apply this pipeline to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark and a poem generation task where Claude 3.5 Haiku was previously shown to plan. Our findings show that planning is not universal: unlike Haiku, Gemma-2-2B solves the same poem generation task through improvisation, and on MBPP it switches between planning and improvisation across similar tasks and even successive token predictions. We further show that instruction tuning refines existing planning behaviors in the base model rather than creating them from scratch. Together, these studies provide a reproducible and scalable foundation for mechanistic studies of planning in LLMs.
comment: 9 pages, 4 figures
☆ Incorporating Pre-trained Diffusion Models in Solving the Schrödinger Bridge Problem
This paper aims to unify Score-based Generative Models (SGMs), also known as Diffusion models, and the Schr\"odinger Bridge (SB) problem through three reparameterization techniques: Iterative Proportional Mean-Matching (IPMM), Iterative Proportional Terminus-Matching (IPTM), and Iterative Proportional Flow-Matching (IPFM). These techniques significantly accelerate and stabilize the training of SB-based models. Furthermore, the paper introduces novel initialization strategies that use pre-trained SGMs to effectively train SB-based models. By using SGMs as initialization, we leverage the advantages of both SB-based models and SGMs, ensuring efficient training of SB-based models and further improving the performance of SGMs. Extensive experiments demonstrate the significant effectiveness and improvements of the proposed methods. We believe this work contributes to and paves the way for future research on generative models.
☆ How Quantization Shapes Bias in Large Language Models
This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, toxicity, sentiment, and fairness. We employ both probabilistic and generated text-based metrics across nine benchmarks and evaluate models varying in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
☆ Quantum-Classical Hybrid Framework for Zero-Day Time-Push GNSS Spoofing Detection
Global Navigation Satellite Systems (GNSS) are critical for Positioning, Navigation, and Timing (PNT) applications. However, GNSS are highly vulnerable to spoofing attacks, where adversaries transmit counterfeit signals to mislead receivers. Such attacks can lead to severe consequences, including misdirected navigation, compromised data integrity, and operational disruptions. Most existing spoofing detection methods depend on supervised learning techniques and struggle to detect novel, evolved, and unseen attacks. To overcome this limitation, we develop a zero-day spoofing detection method using a Hybrid Quantum-Classical Autoencoder (HQC-AE), trained solely on authentic GNSS signals without exposure to spoofed data. By leveraging features extracted during the tracking stage, our method enables proactive detection before PNT solutions are computed. We focus on spoofing detection in static GNSS receivers, which are particularly susceptible to time-push spoofing attacks, where attackers manipulate timing information to induce incorrect time computations at the receiver. We evaluate our model against different unseen time-push spoofing attack scenarios: simplistic, intermediate, and sophisticated. Our analysis demonstrates that the HQC-AE consistently outperforms its classical counterpart, traditional supervised learning-based models, and existing unsupervised learning-based methods in detecting zero-day, unseen GNSS time-push spoofing attacks, achieving an average detection accuracy of 97.71% with an average false negative rate of 0.62% (when an attack occurs but is not detected). For sophisticated spoofing attacks, the HQC-AE attains an accuracy of 98.23% with a false negative rate of 1.85%. These findings highlight the effectiveness of our method in proactively detecting zero-day GNSS time-push spoofing attacks across various stationary GNSS receiver platforms.
comment: This work has been submitted to the IEEE Internet of Things Journal for possible publication
☆ Arnold: a generalist muscle transformer policy
Controlling high-dimensional and nonlinear musculoskeletal models of the human body is a foundational scientific challenge. Recent machine learning breakthroughs have heralded policies that master individual skills like reaching, object manipulation and locomotion in musculoskeletal systems with many degrees of freedom. However, these agents are merely "specialists", achieving high performance for a single skill. In this work, we develop Arnold, a generalist policy that masters multiple tasks and embodiments. Arnold combines behavior cloning and fine-tuning with PPO to achieve expert or super-expert performance in 14 challenging control tasks from dexterous object manipulation to locomotion. A key innovation is Arnold's sensorimotor vocabulary, a compositional representation of the semantics of heterogeneous sensory modalities, objectives, and actuators. Arnold leverages this vocabulary via a transformer architecture to deal with the variable observation and action spaces of each task. This framework supports efficient multi-task, multi-embodiment learning and facilitates rapid adaptation to novel tasks. Finally, we analyze Arnold to provide insights into biological motor control, corroborating recent findings on the limited transferability of muscle synergies across tasks.
comment: A.S.C. and B.A. contributed equally. Code is available at https://github.com/amathislab/arnold-the-generalist
☆ FedGreed: A Byzantine-Robust Loss-Based Aggregation Method for Federated Learning
Federated Learning (FL) enables collaborative model training across multiple clients while preserving data privacy by keeping local datasets on-device. In this work, we address FL settings where clients may behave adversarially, exhibiting Byzantine attacks, while the central server is trusted and equipped with a reference dataset. We propose FedGreed, a resilient aggregation strategy for federated learning that does not require any assumptions about the fraction of adversarial participants. FedGreed orders clients' local model updates based on their loss metrics evaluated against a trusted dataset on the server and greedily selects a subset of clients whose models exhibit the minimal evaluation loss. Unlike many existing approaches, our method is designed to operate reliably under heterogeneous (non-IID) data distributions, which are prevalent in real-world deployments. FedGreed exhibits convergence guarantees and bounded optimality gaps under strong adversarial behavior. Experimental evaluations on MNIST, FMNIST, and CIFAR-10 demonstrate that our method significantly outperforms standard and robust federated learning baselines, such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum, in the majority of adversarial scenarios considered, including label flipping and Gaussian noise injection attacks. All experiments were conducted using the Flower federated learning framework.
comment: 8 pages, 4 figures
☆ Weisfeiler-Lehman meets Events: An Expressivity Analysis for Continuous-Time Dynamic Graph Neural Networks
Graph Neural Networks (GNNs) are known to match the distinguishing power of the 1-Weisfeiler-Lehman (1-WL) test, and the resulting partitions coincide with the unfolding tree equivalence classes of graphs. Preserving this equivalence, GNNs can universally approximate any target function on graphs in probability up to any precision. However, these results are limited to attributed discrete-dynamic graphs represented as sequences of connected graph snapshots. Real-world systems, such as communication networks, financial transaction networks, and molecular interactions, evolve asynchronously and may split into disconnected components. In this paper, we extend the theory of attributed discrete-dynamic graphs to attributed continuous-time dynamic graphs with arbitrary connectivity. To this end, we introduce a continuous-time dynamic 1-WL test, prove its equivalence to continuous-time dynamic unfolding trees, and identify a class of continuous-time dynamic GNNs (CGNNs) based on discrete-dynamic GNN architectures that retain both distinguishing power and universal approximation guarantees. Our constructive proofs further yield practical design guidelines, emphasizing a compact and expressive CGNN architecture with piece-wise continuously differentiable temporal functions to process asynchronous, disconnected graphs.
☆ Training Transformers for Mesh-Based Simulations
Simulating physics using Graph Neural Networks (GNNs) is predominantly driven by message-passing architectures, which face challenges in scaling and efficiency, particularly in handling large, complex meshes. These architectures have inspired numerous enhancements, including multigrid approaches and $K$-hop aggregation (using neighbours of distance $K$), yet they often introduce significant complexity and suffer from limited in-depth investigations. In response to these challenges, we propose a novel Graph Transformer architecture that leverages the adjacency matrix as an attention mask. The proposed approach incorporates innovative augmentations, including Dilated Sliding Windows and Global Attention, to extend receptive fields without sacrificing computational efficiency. Through extensive experimentation, we evaluate model size, adjacency matrix augmentations, positional encoding and $K$-hop configurations using challenging 3D computational fluid dynamics (CFD) datasets. We also train over 60 models to find a scaling law between training FLOPs and parameters. The introduced models demonstrate remarkable scalability, performing on meshes with up to 300k nodes and 3 million edges. Notably, the smallest model achieves parity with MeshGraphNet while being $7\times$ faster and $6\times$ smaller. The largest model surpasses the previous state-of-the-art by $38.8$\% on average and outperforms MeshGraphNet by $52$\% on the all-rollout RMSE, while having a similar training speed. Code and datasets are available at https://github.com/DonsetPG/graph-physics.
☆ Riemannian Change Point Detection on Manifolds with Robust Centroid Estimation
Non-parametric change-point detection in streaming time series data is a long-standing challenge in signal processing. Recent advancements in statistics and machine learning have increasingly addressed this problem for data residing on Riemannian manifolds. One prominent strategy involves monitoring abrupt changes in the center of mass of the time series. Implemented in a streaming fashion, this strategy, however, requires careful step size tuning when computing the updates of the center of mass. In this paper, we propose to leverage robust centroid on manifolds from M-estimation theory to address this issue. Our proposal consists of comparing two centroid estimates: the classical Karcher mean (sensitive to change) versus one defined from Huber's function (robust to change). This comparison leads to the definition of a test statistic whose performance is less sensitive to the underlying estimation method. We propose a stochastic Riemannian optimization algorithm to estimate both robust centroids efficiently. Experiments conducted on both simulated and real-world data across two representative manifolds demonstrate the superior performance of our proposed method.
☆ Enhancing Differentially Private Linear Regression via Public Second-Moment
Leveraging information from public data has become increasingly crucial in enhancing the utility of differentially private (DP) methods. Traditional DP approaches often require adding noise based solely on private data, which can significantly degrade utility. In this paper, we address this limitation in the context of the ordinary least squares estimator (OLSE) of linear regression based on sufficient statistics perturbation (SSP) under the unbounded data assumption. We propose a novel method that involves transforming private data using the public second-moment matrix to compute a transformed SSP-OLSE, whose second-moment matrix yields a better condition number and improves the OLSE accuracy and robustness. We derive theoretical error bounds about our method and the standard SSP-OLSE to the non-DP OLSE, which reveal the improved robustness and accuracy achieved by our approach. Experiments on synthetic and real-world datasets demonstrate the utility and effectiveness of our method.
☆ AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration
Autonomous planetary exploration missions are critically dependent on real-time, accurate environmental perception for navigation and hazard avoidance. However, deploying deep learning models on the resource-constrained computational hardware of planetary exploration platforms remains a significant challenge. This paper introduces the Adaptive Quantized Planetary Crater Detection System (AQ-PCDSys), a novel framework specifically engineered for real-time, onboard deployment in the computationally constrained environments of space exploration missions. AQ-PCDSys synergistically integrates a Quantized Neural Network (QNN) architecture, trained using Quantization-Aware Training (QAT), with an Adaptive Multi-Sensor Fusion (AMF) module. The QNN architecture significantly optimizes model size and inference latency suitable for real-time onboard deployment in space exploration missions, while preserving high accuracy. The AMF module intelligently fuses data from Optical Imagery (OI) and Digital Elevation Models (DEMs) at the feature level, utilizing an Adaptive Weighting Mechanism (AWM) to dynamically prioritize the most relevant and reliable sensor modality based on planetary ambient conditions. This approach enhances detection robustness across diverse planetary landscapes. Paired with Multi-Scale Detection Heads specifically designed for robust and efficient detection of craters across a wide range of sizes, AQ-PCDSys provides a computationally efficient, reliable and accurate solution for planetary crater detection, a critical capability for enabling the next generation of autonomous planetary landing, navigation, and scientific exploration.
comment: 17 pages, 6 figures. A research paper on a novel deep learning framework for planetary crater detection
☆ Does simple trump complex? Comparing strategies for adversarial robustness in DNNs
Deep Neural Networks (DNNs) have shown substantial success in various applications but remain vulnerable to adversarial attacks. This study aims to identify and isolate the components of two different adversarial training techniques that contribute most to increased adversarial robustness, particularly through the lens of margins in the input space -- the minimal distance between data points and decision boundaries. Specifically, we compare two methods that maximize margins: a simple approach which modifies the loss function to increase an approximation of the margin, and a more complex state-of-the-art method (Dynamics-Aware Robust Training) which builds upon this approach. Using a VGG-16 model as our base, we systematically isolate and evaluate individual components from these methods to determine their relative impact on adversarial robustness. We assess the effect of each component on the model's performance under various adversarial attacks, including AutoAttack and Projected Gradient Descent (PGD). Our analysis on the CIFAR-10 dataset reveals which elements most effectively enhance adversarial robustness, providing insights for designing more robust DNNs.
☆ Development of a Neural Network Model for Currency Detection to aid visually impaired people in Nigeria
Neural networks in assistive technology for visually impaired leverage artificial intelligence's capacity to recognize patterns in complex data. They are used for converting visual data into auditory or tactile representations, helping the visually impaired understand their surroundings. The primary aim of this research is to explore the potential of artificial neural networks to facilitate the differentiation of various forms of cash for individuals with visual impairments. In this study, we built a custom dataset of 3,468 images, which was subsequently used to train an SSD neural network model. The proposed system can accurately identify Nigerian cash, thereby streamlining commercial transactions. The performance of the system in terms of accuracy was assessed, and the Mean Average Precision score was over 90%. We believe that our system has the potential to make a substantial contribution to the field of assistive technology while also improving the quality of life of visually challenged persons in Nigeria and beyond.
☆ Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters SP 2025
In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language, in which the target voice has no recordings therein. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model's speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides insights into the impact of adapter placement, configuration and the number of speakers used.
comment: Accepted at IEEE MLSP 2025
☆ A Novel Framework for Uncertainty Quantification via Proper Scores for Classification and Beyond
In this PhD thesis, we propose a novel framework for uncertainty quantification in machine learning, which is based on proper scores. Uncertainty quantification is an important cornerstone for trustworthy and reliable machine learning applications in practice. Usually, approaches to uncertainty quantification are problem-specific, and solutions and insights cannot be readily transferred from one task to another. Proper scores are loss functions minimized by predicting the target distribution. Due to their very general definition, proper scores apply to regression, classification, or even generative modeling tasks. We contribute several theoretical results, that connect epistemic uncertainty, aleatoric uncertainty, and model calibration with proper scores, resulting in a general and widely applicable framework. We achieve this by introducing a general bias-variance decomposition for strictly proper scores via functional Bregman divergences. Specifically, we use the kernel score, a kernel-based proper score, for evaluating sample-based generative models in various domains, like image, audio, and natural language generation. This includes a novel approach for uncertainty estimation of large language models, which outperforms state-of-the-art baselines. Further, we generalize the calibration-sharpness decomposition beyond classification, which motivates the definition of proper calibration errors. We then introduce a novel estimator for proper calibration errors in classification, and a novel risk-based approach to compare different estimators for squared calibration errors. Last, we offer a decomposition of the kernel spherical score, another kernel-based proper score, allowing a more fine-grained and interpretable evaluation of generative image models.
comment: PhD Thesis (cumulative, spanning 6 peer-reviewed publications)
☆ Topology Aware Neural Interpolation of Scalar Fields
This paper presents a neural scheme for the topology-aware interpolation of time-varying scalar fields. Given a time-varying sequence of persistence diagrams, along with a sparse temporal sampling of the corresponding scalar fields, denoted as keyframes, our interpolation approach aims at "inverting" the non-keyframe diagrams to produce plausible estimations of the corresponding, missing data. For this, we rely on a neural architecture which learns the relation from a time value to the corresponding scalar field, based on the keyframe examples, and reliably extends this relation to the non-keyframe time steps. We show how augmenting this architecture with specific topological losses exploiting the input diagrams both improves the geometrical and topological reconstruction of the non-keyframe time steps. At query time, given an input time value for which an interpolation is desired, our approach instantaneously produces an output, via a single propagation of the time input through the network. Experiments interpolating 2D and 3D time-varying datasets show our approach superiority, both in terms of data and topological fitting, with regard to reference interpolation schemes.
☆ DesCartes Builder: A Tool to Develop Machine-Learning Based Digital Twins
Digital twins (DTs) are increasingly utilized to monitor, manage, and optimize complex systems across various domains, including civil engineering. A core requirement for an effective DT is to act as a fast, accurate, and maintainable surrogate of its physical counterpart, the physical twin (PT). To this end, machine learning (ML) is frequently employed to (i) construct real-time DT prototypes using efficient reduced-order models (ROMs) derived from high-fidelity simulations of the PT's nominal behavior, and (ii) specialize these prototypes into DT instances by leveraging historical sensor data from the target PT. Despite the broad applicability of ML, its use in DT engineering remains largely ad hoc. Indeed, while conventional ML pipelines often train a single model for a specific task, DTs typically require multiple, task- and domain-dependent models. Thus, a more structured approach is required to design DTs. In this paper, we introduce DesCartes Builder, an open-source tool to enable the systematic engineering of ML-based pipelines for real-time DT prototypes and DT instances. The tool leverages an open and flexible visual data flow paradigm to facilitate the specification, composition, and reuse of ML models. It also integrates a library of parameterizable core operations and ML algorithms tailored for DT design. We demonstrate the effectiveness and usability of DesCartes Builder through a civil engineering use case involving the design of a real-time DT prototype to predict the plastic strain of a structure.
comment: 5 pages, 4 figures. Accepted at EDTconf 2025
☆ Generative Feature Imputing - A Technique for Error-resilient Semantic Communication
Semantic communication (SemCom) has emerged as a promising paradigm for achieving unprecedented communication efficiency in sixth-generation (6G) networks by leveraging artificial intelligence (AI) to extract and transmit the underlying meanings of source data. However, deploying SemCom over digital systems presents new challenges, particularly in ensuring robustness against transmission errors that may distort semantically critical content. To address this issue, this paper proposes a novel framework, termed generative feature imputing, which comprises three key techniques. First, we introduce a spatial error concentration packetization strategy that spatially concentrates feature distortions by encoding feature elements based on their channel mappings, a property crucial for both the effectiveness and reduced complexity of the subsequent techniques. Second, building on this strategy, we propose a generative feature imputing method that utilizes a diffusion model to efficiently reconstruct missing features caused by packet losses. Finally, we develop a semantic-aware power allocation scheme that enables unequal error protection by allocating transmission power according to the semantic importance of each packet. Experimental results demonstrate that the proposed framework outperforms conventional approaches, such as Deep Joint Source-Channel Coding (DJSCC) and JPEG2000, under block fading conditions, achieving higher semantic accuracy and lower Learned Perceptual Image Patch Similarity (LPIPS) scores.
☆ Choice Outweighs Effort: Facilitating Complementary Knowledge Fusion in Federated Learning via Re-calibration and Merit-discrimination
Cross-client data heterogeneity in federated learning induces biases that impede unbiased consensus condensation and the complementary fusion of generalization- and personalization-oriented knowledge. While existing approaches mitigate heterogeneity through model decoupling and representation center loss, they often rely on static and restricted metrics to evaluate local knowledge and adopt global alignment too rigidly, leading to consensus distortion and diminished model adaptability. To address these limitations, we propose FedMate, a method that implements bilateral optimization: On the server side, we construct a dynamic global prototype, with aggregation weights calibrated by holistic integration of sample size, current parameters, and future prediction; a category-wise classifier is then fine-tuned using this prototype to preserve global consistency. On the client side, we introduce complementary classification fusion to enable merit-based discrimination training and incorporate cost-aware feature transmission to balance model performance and communication efficiency. Experiments on five datasets of varying complexity demonstrate that FedMate outperforms state-of-the-art methods in harmonizing generalization and adaptation. Additionally, semantic segmentation experiments on autonomous driving datasets validate the method's real-world scalability.
☆ Understanding Subword Compositionality of Large Language Models EMNLP 2025
Large language models (LLMs) take sequences of subwords as input, requiring them to effective compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likely reflecting difference in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) great performance when probing layer by layer their sensitivity to semantic decompositionality; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional pattens in how LLMs encode and integrate subword information.
comment: EMNLP 2025 Main
☆ Debiasing Multilingual LLMs in Cross-lingual Latent Space EMNLP 2025
Debiasing techniques such as SentDebias aim to reduce bias in large language models (LLMs). Previous studies have evaluated their cross-lingual transferability by directly applying these methods to LLM representations, revealing their limited effectiveness across languages. In this work, we therefore propose to perform debiasing in a joint latent space rather than directly on LLM representations. We construct a well-aligned cross-lingual latent space using an autoencoder trained on parallel TED talk scripts. Our experiments with Aya-expanse and two debiasing techniques across four languages (English, French, German, Dutch) demonstrate that a) autoencoders effectively construct a well-aligned cross-lingual latent space, and b) applying debiasing techniques in the learned cross-lingual latent space significantly improves both the overall debiasing performance and cross-lingual transferability.
comment: EMNLP 2025 Main
☆ Learning to Detect Label Errors by Making Them: A Method for Segmentation and Object Detection Datasets
Recently, detection of label errors and improvement of label quality in datasets for supervised learning tasks has become an increasingly important goal in both research and industry. The consequences of incorrectly annotated data include reduced model performance, biased benchmark results, and lower overall accuracy. Current state-of-the-art label error detection methods often focus on a single computer vision task and, consequently, a specific type of dataset, containing, for example, either bounding boxes or pixel-wise annotations. Furthermore, previous methods are not learning-based. In this work, we overcome this research gap. We present a unified method for detecting label errors in object detection, semantic segmentation, and instance segmentation datasets. In a nutshell, our approach - learning to detect label errors by making them - works as follows: we inject different kinds of label errors into the ground truth. Then, the detection of label errors, across all mentioned primary tasks, is framed as an instance segmentation problem based on a composite input. In our experiments, we compare the label error detection performance of our method with various baselines and state-of-the-art approaches of each task's domain on simulated label errors across multiple tasks, datasets, and base models. This is complemented by a generalization study on real-world label errors. Additionally, we release 459 real label errors identified in the Cityscapes dataset and provide a benchmark for real label error detection in Cityscapes.
☆ Entanglement Detection with Quantum-inspired Kernels and SVMs
This work presents a machine learning approach based on support vector machines (SVMs) for quantum entanglement detection. Particularly, we focus in bipartite systems of dimensions 3x3, 4x4, and 5x5, where the positive partial transpose criterion (PPT) provides only partial characterization. Using SVMs with quantum-inspired kernels we develop a classification scheme that distinguishes between separable states, PPT-detectable entangled states, and entangled states that evade PPT detection. Our method achieves increasing accuracy with system dimension, reaching 80%, 90%, and nearly 100% for 3x3, 4x4, and 5x5 systems, respectively. Our results show that principal component analysis significantly enhances performance for small training sets. The study reveals important practical considerations regarding purity biases in the generation of data for this problem and examines the challenges of implementing these techniques on near-term quantum hardware. Our results establish machine learning as a powerful complement to traditional entanglement detection methods, particularly for higher-dimensional systems where conventional approaches become inadequate. The findings highlight key directions for future research, including hybrid quantum-classical implementations and improved data generation protocols to overcome current limitations.
☆ WOMAC: A Mechanism For Prediction Competitions
Competitions are widely used to identify top performers in judgmental forecasting and machine learning, and the standard competition design ranks competitors based on their cumulative scores against a set of realized outcomes or held-out labels. However, this standard design is neither incentive-compatible nor very statistically efficient. The main culprit is noise in outcomes/labels that experts are scored against; it allows weaker competitors to often win by chance, and the winner-take-all nature incentivizes misreporting that improves win probability even if it decreases expected score. Attempts to achieve incentive-compatibility rely on randomized mechanisms that add even more noise in winner selection, but come at the cost of determinism and practical adoption. To tackle these issues, we introduce a novel deterministic mechanism: WOMAC (Wisdom of the Most Accurate Crowd). Instead of scoring experts against noisy outcomes, as is standard, WOMAC scores experts against the best ex-post aggregate of peer experts' predictions given the noisy outcomes. WOMAC is also more efficient than the standard competition design in typical settings. While the increased complexity of WOMAC makes it challenging to analyze incentives directly, we provide a clear theoretical foundation to justify the mechanism. We also provide an efficient vectorized implementation and demonstrate empirically on real-world forecasting datasets that WOMAC is a more reliable predictor of experts' out-of-sample performance relative to the standard mechanism. WOMAC is useful in any competition where there is substantial noise in the outcomes/labels.
☆ Riemannian Optimization for LoRA on the Stiefel Manifold EMNLP 2025
While powerful, large language models (LLMs) present significant fine-tuning challenges due to their size. Parameter-efficient fine-tuning (PEFT) methods like LoRA provide solutions, yet suffer from critical optimizer inefficiencies; notably basis redundancy in LoRA's $B$ matrix when using AdamW, which fundamentally limits performance. We address this by optimizing the $B$ matrix on the Stiefel manifold, imposing explicit orthogonality constraints that achieve near-perfect orthogonality and full effective rank. This geometric approach dramatically enhances parameter efficiency and representational capacity. Our Stiefel optimizer consistently outperforms AdamW across benchmarks with both LoRA and DoRA, demonstrating that geometric constraints are the key to unlocking LoRA's full potential for effective LLM fine-tuning.
comment: EMNLP 2025 Findings
☆ ILRe: Intermediate Layer Retrieval for Context Compression in Causal Language Models
Large Language Models (LLMs) have demonstrated success across many benchmarks. However, they still exhibit limitations in long-context scenarios, primarily due to their short effective context length, quadratic computational complexity, and high memory overhead when processing lengthy inputs. To mitigate these issues, we introduce a novel context compression pipeline, called Intermediate Layer Retrieval (ILRe), which determines one intermediate decoder layer offline, encodes context by streaming chunked prefill only up to that layer, and recalls tokens by the attention scores between the input query and full key cache in that specified layer. In particular, we propose a multi-pooling kernels allocating strategy in the token recalling process to maintain the completeness of semantics. Our approach not only reduces the prefilling complexity from $O(L^2)$ to $O(L)$, but also achieves performance comparable to or better than the full context in the long context scenarios. Without additional post training or operator development, ILRe can process a single $1M$ tokens request in less than half a minute (speedup $\approx 180\times$) and scores RULER-$1M$ benchmark of $\approx 79.8$ with model Llama-3.1-UltraLong-8B-1M-Instruct on a Huawei Ascend 910B NPU.
☆ Vocoder-Projected Feature Discriminator
In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets owing to their compactness and ease of learning. However, because the ultimate goal is to generate high-quality waveforms, employing a vocoder to convert these features into waveforms and applying adversarial training in the time domain is reasonable. Nevertheless, upsampling the waveform introduces significant time and memory overheads. To address this issue, we propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor with a single upsampling step is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators while reducing the training time and memory consumption by 9.6 and 11.4 times, respectively.
comment: Accepted to Interspeech 2024. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/vpfd/
☆ Spectrum Prediction in the Fractional Fourier Domain with Adaptive Filtering
Accurate spectrum prediction is crucial for dynamic spectrum access (DSA) and resource allocation. However, due to the unique characteristics of spectrum data, existing methods based on the time or frequency domain often struggle to separate predictable patterns from noise. To address this, we propose the Spectral Fractional Filtering and Prediction (SFFP) framework. SFFP first employs an adaptive fractional Fourier transform (FrFT) module to transform spectrum data into a suitable fractional Fourier domain, enhancing the separability of predictable trends from noise. Subsequently, an adaptive Filter module selectively suppresses noise while preserving critical predictive features within this domain. Finally, a prediction module, leveraging a complex-valued neural network, learns and forecasts these filtered trend components. Experiments on real-world spectrum data show that the SFFP outperforms leading spectrum and general forecasting methods.
comment: Accepted by IEEE Wireless Communications Letters
☆ FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves competitive VC performance compared to FastVoiceGrad, with 6.6-6.9 and 1.8 times faster speed on a GPU and CPU, respectively.
comment: Accepted to Interspeech 2025. Project page: https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastervoicegrad/
☆ Ada-TransGNN: An Air Quality Prediction Model Based On Adaptive Graph Convolutional Networks ICONIP2025
Accurate air quality prediction is becoming increasingly important in the environmental field. To address issues such as low prediction accuracy and slow real-time updates in existing models, which lead to lagging prediction results, we propose a Transformer-based spatiotemporal data prediction method (Ada-TransGNN) that integrates global spatial semantics and temporal behavior. The model constructs an efficient and collaborative spatiotemporal block set comprising a multi-head attention mechanism and a graph convolutional network to extract dynamically changing spatiotemporal dependency features from complex air quality monitoring data. Considering the interaction relationships between different monitoring points, we propose an adaptive graph structure learning module, which combines spatiotemporal dependency features in a data-driven manner to learn the optimal graph structure, thereby more accurately capturing the spatial relationships between monitoring points. Additionally, we design an auxiliary task learning module that enhances the decoding capability of temporal relationships by integrating spatial context information into the optimal graph structure representation, effectively improving the accuracy of prediction results. We conducted comprehensive evaluations on a benchmark dataset and a novel dataset (Mete-air). The results demonstrate that our model outperforms existing state-of-the-art prediction models in short-term and long-term predictions.
comment: 15 pages, 4 figures, 3 tables. This paper is accepted by ICONIP2025 but not published
☆ Group Expectation Policy Optimization for Stable Heterogeneous Reinforcement Learning in LLMs
As single-center computing approaches power constraints, decentralized training is becoming essential. Reinforcement Learning (RL) post-training enhances Large Language Models (LLMs) but faces challenges in heterogeneous distributed environments due to its tightly-coupled sampling-learning alternation. We propose HeteroRL, an asynchronous RL architecture that decouples rollout sampling from parameter learning, enabling robust deployment across geographically distributed nodes under network delays. We identify that latency-induced KL divergence causes importance sampling failure due to high variance. To address this, we propose Group Expectation Policy Optimization (GEPO), which reduces importance weight variance through a refined sampling mechanism. Theoretically, GEPO achieves exponential variance reduction. Experiments show it maintains superior stability over methods like GRPO, with less than 3% performance degradation under 1800-second delays, demonstrating strong potential for decentralized RL in heterogeneous networks.
☆ Alternating Training-based Label Smoothing Enhances Prompt Generalization
Recent advances in pre-trained vision-language models have demonstrated remarkable zero-shot generalization capabilities. To further enhance these models' adaptability to various downstream tasks, prompt tuning has emerged as a parameter-efficient fine-tuning method. However, despite its efficiency, the generalization ability of prompt remains limited. In contrast, label smoothing (LS) has been widely recognized as an effective regularization technique that prevents models from becoming over-confident and improves their generalization. This inspires us to explore the integration of LS with prompt tuning. However, we have observed that the vanilla LS even weakens the generalization ability of prompt tuning. To address this issue, we propose the Alternating Training-based Label Smoothing (ATLaS) method, which alternately trains with standard one-hot labels and soft labels generated by LS to supervise the prompt tuning. Moreover, we introduce two types of efficient offline soft labels, including Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL), to provide inter-class or instance-class relationships for prompt tuning. The theoretical properties of the proposed ATLaS method are analyzed. Extensive experiments demonstrate that the proposed ATLaS method, combined with CSL and ISL, consistently enhances the generalization performance of prompt tuning. Moreover, the proposed ATLaS method exhibits high compatibility with prevalent prompt tuning methods, enabling seamless integration into existing methods.
☆ Diffusion-Based Data Augmentation for Medical Image Segmentation ICCV 2025
Medical image segmentation models struggle with rare abnormalities due to scarce annotated pathological data. We propose DiffAug a novel framework that combines textguided diffusion-based generation with automatic segmentation validation to address this challenge. Our proposed approach uses latent diffusion models conditioned on medical text descriptions and spatial masks to synthesize abnormalities via inpainting on normal images. Generated samples undergo dynamic quality validation through a latentspace segmentation network that ensures accurate localization while enabling single-step inference. The text prompts, derived from medical literature, guide the generation of diverse abnormality types without requiring manual annotation. Our validation mechanism filters synthetic samples based on spatial accuracy, maintaining quality while operating efficiently through direct latent estimation. Evaluated on three medical imaging benchmarks (CVC-ClinicDB, Kvasir-SEG, REFUGE2), our framework achieves state-of-the-art performance with 8-10% Dice improvements over baselines and reduces false negative rates by up to 28% for challenging cases like small polyps and flat lesions critical for early detection in screening applications.
comment: Accepted to CVAMD Workshop at ICCV 2025
☆ A Contrastive Learning-Guided Confident Meta-learning for Zero Shot Anomaly Detection ICCV 2025
Industrial and medical anomaly detection faces critical challenges from data scarcity and prohibitive annotation costs, particularly in evolving manufacturing and healthcare settings. To address this, we propose CoZAD, a novel zero-shot anomaly detection framework that integrates soft confident learning with meta-learning and contrastive feature representation. Unlike traditional confident learning that discards uncertain samples, our method assigns confidence-based weights to all training data, preserving boundary information while emphasizing prototypical normal patterns. The framework quantifies data uncertainty through IQR-based thresholding and model uncertainty via covariance based regularization within a Model-Agnostic Meta-Learning. Contrastive learning creates discriminative feature spaces where normal patterns form compact clusters, enabling rapid domain adaptation. Comprehensive evaluation across 10 datasets spanning industrial and medical domains demonstrates state-of-the-art performance, outperforming existing methods on 6 out of 7 industrial benchmarks with notable improvements on texture-rich datasets (99.2% I-AUROC on DTD-Synthetic, 97.2% on BTAD) and pixellevel localization (96.3% P-AUROC on MVTec-AD). The framework eliminates dependence on vision-language alignments or model ensembles, making it valuable for resourceconstrained environments requiring rapid deployment.
comment: Accepted to VISION Workshop at ICCV 2025
☆ Limits of message passing for node classification: How class-bottlenecks restrict signal-to-noise ratio
Message passing neural networks (MPNNs) are powerful models for node classification but suffer from performance limitations under heterophily (low same-class connectivity) and structural bottlenecks in the graph. We provide a unifying statistical framework exposing the relationship between heterophily and bottlenecks through the signal-to-noise ratio (SNR) of MPNN representations. The SNR decomposes model performance into feature-dependent parameters and feature-independent sensitivities. We prove that the sensitivity to class-wise signals is bounded by higher-order homophily -- a generalisation of classical homophily to multi-hop neighbourhoods -- and show that low higher-order homophily manifests locally as the interaction between structural bottlenecks and class labels (class-bottlenecks). Through analysis of graph ensembles, we provide a further quantitative decomposition of bottlenecking into underreaching (lack of depth implying signals cannot arrive) and oversquashing (lack of breadth implying signals arriving on fewer paths) with closed-form expressions. We prove that optimal graph structures for maximising higher-order homophily are disjoint unions of single-class and two-class-bipartite clusters. This yields BRIDGE, a graph ensemble-based rewiring algorithm that achieves near-perfect classification accuracy across all homophily regimes on synthetic benchmarks and significant improvements on real-world benchmarks, by eliminating the ``mid-homophily pitfall'' where MPNNs typically struggle, surpassing current standard rewiring techniques from the literature. Our framework, whose code we make available for public use, provides both diagnostic tools for assessing MPNN performance, and simple yet effective methods for enhancing performance through principled graph modification.
☆ Limitations of Normalization in Attention Mechanism
This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanism and motivate the need for more robust normalization and selection strategies in future attention architectures.
comment: 10 pages, 4 figures
☆ Multi-domain Distribution Learning for De Novo Drug Design
We introduce DrugFlow, a generative model for structure-based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state-of-the-art performance in learning chemical, geometric, and physical aspects of three-dimensional protein-ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out-of-distribution samples. To further enhance the sampling process towards distribution regions with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.
☆ MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting
Surface reconstruction has been widely studied in computer vision and graphics. However, existing surface reconstruction works struggle to recover accurate scene geometry when the input views are extremely sparse. To address this issue, we propose MeshSplat, a generalizable sparse-view surface reconstruction framework via Gaussian Splatting. Our key idea is to leverage 2DGS as a bridge, which connects novel view synthesis to learned geometric priors and then transfers these priors to achieve surface reconstruction. Specifically, we incorporate a feed-forward network to predict per-view pixel-aligned 2DGS, which enables the network to synthesize novel view images and thus eliminates the need for direct 3D ground-truth supervision. To improve the accuracy of 2DGS position and orientation prediction, we propose a Weighted Chamfer Distance Loss to regularize the depth maps, especially in overlapping areas of input views, and also a normal prediction network to align the orientation of 2DGS with normal vectors predicted by a monocular normal estimator. Extensive experiments validate the effectiveness of our proposed improvement, demonstrating that our method achieves state-of-the-art performance in generalizable sparse-view mesh reconstruction tasks. Project Page: https://hanzhichang.github.io/meshsplat_web
comment: 17 pages, 15 figures, 5 tables
☆ Robust Anomaly Detection in Industrial Environments via Meta-Learning ICCV 2025
Anomaly detection is fundamental for ensuring quality control and operational efficiency in industrial environments, yet conventional approaches face significant challenges when training data contains mislabeled samples-a common occurrence in real-world scenarios. This paper presents RAD, a robust anomaly detection framework that integrates Normalizing Flows with Model-Agnostic Meta-Learning to address the critical challenge of label noise in industrial settings. Our approach employs a bi-level optimization strategy where meta-learning enables rapid adaptation to varying noise conditions, while uncertainty quantification guides adaptive L2 regularization to maintain model stability. The framework incorporates multiscale feature processing through pretrained feature extractors and leverages the precise likelihood estimation capabilities of Normalizing Flows for robust anomaly scoring. Comprehensive evaluation on MVTec-AD and KSDD2 datasets demonstrates superior performance, achieving I-AUROC scores of 95.4% and 94.6% respectively under clean conditions, while maintaining robust detection capabilities above 86.8% and 92.1% even when 50% of training samples are mislabeled. The results highlight RAD's exceptional resilience to noisy training conditions and its ability to detect subtle anomalies across diverse industrial scenarios, making it a practical solution for real-world anomaly detection applications where perfect data curation is challenging.
comment: Accepted to VISION Workshop at ICCV 2025
☆ Interpretable Early Failure Detection via Machine Learning and Trace Checking-based Monitoring ECAI 2025
Monitoring is a runtime verification technique that allows one to check whether an ongoing computation of a system (partial trace) satisfies a given formula. It does not need a complete model of the system, but it typically requires the construction of a deterministic automaton doubly exponential in the size of the formula (in the worst case), which limits its practicality. In this paper, we show that, when considering finite, discrete traces, monitoring of pure past (co)safety fragments of Signal Temporal Logic (STL) can be reduced to trace checking, that is, evaluation of a formula over a trace, that can be performed in time polynomial in the size of the formula and the length of the trace. By exploiting such a result, we develop a GPU-accelerated framework for interpretable early failure detection based on vectorized trace checking, that employs genetic programming to learn temporal properties from historical trace data. The framework shows a 2-10% net improvement in key performance metrics compared to the state-of-the-art methods.
comment: Full version of the paper accepted for publication at the 28th European Conference on Artificial Intelligence (ECAI 2025)
☆ Proximal Supervised Fine-Tuning
Supervised fine-tuning (SFT) of foundation models often leads to poor generalization, where prior capabilities deteriorate after tuning on new tasks or domains. Inspired by trust-region policy optimization (TRPO) and proximal policy optimization (PPO) in reinforcement learning (RL), we propose Proximal SFT (PSFT). This fine-tuning objective incorporates the benefits of trust-region, effectively constraining policy drift during SFT while maintaining competitive tuning. By viewing SFT as a special case of policy gradient methods with constant positive advantages, we derive PSFT that stabilizes optimization and leads to generalization, while leaving room for further optimization in subsequent post-training stages. Experiments across mathematical and human-value domains show that PSFT matches SFT in-domain, outperforms it in out-of-domain generalization, remains stable under prolonged training without causing entropy collapse, and provides a stronger foundation for the subsequent optimization.
☆ Algebraic Approach to Ridge-Regularized Mean Squared Error Minimization in Minimal ReLU Neural Network
This paper investigates a perceptron, a simple neural network model, with ReLU activation and a ridge-regularized mean squared error (RR-MSE). Our approach leverages the fact that the RR-MSE for ReLU perceptron is piecewise polynomial, enabling a systematic analysis using tools from computational algebra. In particular, we develop a Divide-Enumerate-Merge strategy that exhaustively enumerates all local minima of the RR-MSE. By virtue of the algebraic formulation, our approach can identify not only the typical zero-dimensional minima (i.e., isolated points) obtained by numerical optimization, but also higher-dimensional minima (i.e., connected sets such as curves, surfaces, or hypersurfaces). Although computational algebraic methods are computationally very intensive for perceptrons of practical size, as a proof of concept, we apply the proposed approach in practice to minimal perceptrons with a few hidden units.
comment: 44 pages, 5 figres
☆ ISACL: Internal State Analyzer for Copyrighted Training Data Leakage
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but pose risks of inadvertently exposing copyrighted or proprietary data, especially when such data is used for training but not intended for distribution. Traditional methods address these leaks only after content is generated, which can lead to the exposure of sensitive information. This study introduces a proactive approach: examining LLMs' internal states before text generation to detect potential leaks. By using a curated dataset of copyrighted materials, we trained a neural network classifier to identify risks, allowing for early intervention by stopping the generation process or altering outputs to prevent disclosure. Integrated with a Retrieval-Augmented Generation (RAG) system, this framework ensures adherence to copyright and licensing requirements while enhancing data privacy and ethical standards. Our results show that analyzing internal states effectively mitigates the risk of copyrighted data leakage, offering a scalable solution that fits smoothly into AI workflows, ensuring compliance with copyright regulations while maintaining high-quality text generation. The implementation is available on GitHub.\footnote{https://github.com/changhu73/Internal_states_leakage}
☆ Puzzle: Scheduling Multiple Deep Learning Models on Mobile Device with Heterogeneous Processors
As deep learning models are increasingly deployed on mobile devices, modern mobile devices incorporate deep learning-specific accelerators to handle the growing computational demands, thus increasing their hardware heterogeneity. However, existing works on scheduling deep learning workloads across these processors have significant limitations: most studies focus on single-model scenarios rather than realistic multi-model scenarios, overlook performance variations from different hardware/software configurations, and struggle with accurate execution time estimation. To address these challenges, we propose a novel genetic algorithm-based methodology for scheduling multiple deep learning networks on heterogeneous processors by partitioning the networks into multiple subgraphs. Our approach incorporates three different types of chromosomes for partition/mapping/priority exploration, and leverages device-in-the-loop profiling and evaluation for accurate execution time estimation. Based on this methodology, our system, Puzzle, demonstrates superior performance in extensive evaluations with randomly generated scenarios involving nine state-of-the-art networks. The results demonstrate Puzzle can support 3.7 and 2.2 times higher request frequency on average compared to the two heuristic baselines, NPU Only and Best Mapping, respectively, while satisfying the equivalent level of real-time requirements.
☆ Evaluating the Quality of the Quantified Uncertainty for (Re)Calibration of Data-Driven Regression Models
In safety-critical applications data-driven models must not only be accurate but also provide reliable uncertainty estimates. This property, commonly referred to as calibration, is essential for risk-aware decision-making. In regression a wide variety of calibration metrics and recalibration methods have emerged. However, these metrics differ significantly in their definitions, assumptions and scales, making it difficult to interpret and compare results across studies. Moreover, most recalibration methods have been evaluated using only a small subset of metrics, leaving it unclear whether improvements generalize across different notions of calibration. In this work, we systematically extract and categorize regression calibration metrics from the literature and benchmark these metrics independently of specific modelling methods or recalibration approaches. Through controlled experiments with real-world, synthetic and artificially miscalibrated data, we demonstrate that calibration metrics frequently produce conflicting results. Our analysis reveals substantial inconsistencies: many metrics disagree in their evaluation of the same recalibration result, and some even indicate contradictory conclusions. This inconsistency is particularly concerning as it potentially allows cherry-picking of metrics to create misleading impressions of success. We identify the Expected Normalized Calibration Error (ENCE) and the Coverage Width-based Criterion (CWC) as the most dependable metrics in our tests. Our findings highlight the critical role of metric selection in calibration research.
☆ SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling
Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains. However, generating ultra-high-resolution videos on existing standard-resolution (e.g., 720p) platforms remains challenging due to the excessive re-training requirements and prohibitively high computational and memory costs. To this end, we introduce SuperGen, an efficient tile-based framework for ultra-high-resolution video generation. SuperGen features a novel training-free algorithmic innovation with tiling to successfully support a wide range of resolutions without additional training efforts while significantly reducing both memory footprint and computational complexity. Moreover, SuperGen incorporates a tile-tailored, adaptive, region-aware caching strategy that accelerates video generation by exploiting redundancy across denoising steps and spatial regions. SuperGen also integrates cache-guided, communication-minimized tile parallelism for enhanced throughput and minimized latency. Evaluations demonstrate that SuperGen harvests the maximum performance gains while achieving high output quality across various benchmarks.
☆ Multi-layer Abstraction for Nested Generation of Options (MANGO) in Hierarchical Reinforcement Learning
This paper introduces MANGO (Multilayer Abstraction for Nested Generation of Options), a novel hierarchical reinforcement learning framework designed to address the challenges of long-term sparse reward environments. MANGO decomposes complex tasks into multiple layers of abstraction, where each layer defines an abstract state space and employs options to modularize trajectories into macro-actions. These options are nested across layers, allowing for efficient reuse of learned movements and improved sample efficiency. The framework introduces intra-layer policies that guide the agent's transitions within the abstract state space, and task actions that integrate task-specific components such as reward functions. Experiments conducted in procedurally-generated grid environments demonstrate substantial improvements in both sample efficiency and generalization capabilities compared to standard RL methods. MANGO also enhances interpretability by making the agent's decision-making process transparent across layers, which is particularly valuable in safety-critical and industrial applications. Future work will explore automated discovery of abstractions and abstract actions, adaptation to continuous or fuzzy environments, and more robust multi-layer training strategies.
☆ Randomly Removing 50% of Dimensions in Text Embeddings has Minimal Impact on Retrieval and Classification Tasks EMNLP 2025
In this paper, we study the surprising impact that truncating text embeddings has on downstream performance. We consistently observe across 6 state-of-the-art text encoders and 26 downstream tasks, that randomly removing up to 50% of embedding dimensions results in only a minor drop in performance, less than 10%, in retrieval and classification tasks. Given the benefits of using smaller-sized embeddings, as well as the potential insights about text encoding, we study this phenomenon and find that, contrary to what is suggested in prior work, this is not the result of an ineffective use of representation space. Instead, we find that a large number of uniformly distributed dimensions actually cause an increase in performance when removed. This would explain why, on average, removing a large number of embedding dimensions results in a marginal drop in performance. We make similar observations when truncating the embeddings used by large language models to make next-token predictions on generative tasks, suggesting that this phenomenon is not isolated to classification or retrieval tasks.
comment: Accepted to EMNLP 2025 Main Conference, submitted version
☆ Speculative Safety-Aware Decoding EMNLP'2025
Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource-intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes to prioritize utility or safety, to handle the challenge of different model capacities. The output token is then sampled from a new distribution that combines the distributions of the original and the small models. Experimental results show that SSD successfully equips the large model with the desired safety property, and also allows the model to remain helpful to benign queries. Furthermore, SSD accelerates the inference time, thanks to the speculative sampling design.
comment: EMNLP'2025 main conference; more experiments will be added to the coming camera-ready version
☆ Segmentation and Classification of Pap Smear Images for Cervical Cancer Detection Using Deep Learning
Cervical cancer remains a significant global health concern and a leading cause of cancer-related deaths among women. Early detection through Pap smear tests is essential to reduce mortality rates; however, the manual examination is time consuming and prone to human error. This study proposes a deep learning framework that integrates U-Net for segmentation and a classification model to enhance diagnostic performance. The Herlev Pap Smear Dataset, a publicly available cervical cell dataset, was utilized for training and evaluation. The impact of segmentation on classification performance was evaluated by comparing the model trained on segmented images and another trained on non-segmented images. Experimental results showed that the use of segmented images marginally improved the model performance on precision (about 0.41 percent higher) and F1-score (about 1.30 percent higher), which suggests a slightly more balanced classification performance. While segmentation helps in feature extraction, the results showed that its impact on classification performance appears to be limited. The proposed framework offers a supplemental tool for clinical applications, which may aid pathologists in early diagnosis.
☆ Copyright Protection for 3D Molecular Structures with Watermarking
Artificial intelligence (AI) revolutionizes molecule generation in bioengineering and biological research, significantly accelerating discovery processes. However, this advancement introduces critical concerns regarding intellectual property protection. To address these challenges, we propose the first robust watermarking method designed for molecules, which utilizes atom-level features to preserve molecular integrity and invariant features to ensure robustness against affine transformations. Comprehensive experiments validate the effectiveness of our method using the datasets QM9 and GEOM-DRUG, and generative models GeoBFN and GeoLDM. We demonstrate the feasibility of embedding watermarks, maintaining basic properties higher than 90.00\% while achieving watermark accuracy greater than 95.00\%. Furthermore, downstream docking simulations reveal comparable performance between original and watermarked molecules, with binding affinities reaching -6.00 kcal/mol and root mean square deviations below 1.602 \AA. These results confirm that our watermarking technique effectively safeguards molecular intellectual property without compromising scientific utility, enabling secure and responsible AI integration in molecular discovery and research applications.
☆ Adaptive Ensemble Learning with Gaussian Copula for Load Forecasting
Machine learning (ML) is capable of accurate Load Forecasting from complete data. However, there are many uncertainties that affect data collection, leading to sparsity. This article proposed a model called Adaptive Ensemble Learning with Gaussian Copula to deal with sparsity, which contains three modules: data complementation, ML construction, and adaptive ensemble. First, it applies Gaussian Copula to eliminate sparsity. Then, we utilise five ML models to make predictions individually. Finally, it employs adaptive ensemble to get final weighted-sum result. Experiments have demonstrated that our model are robust.
☆ Rethinking Federated Learning Over the Air: The Blessing of Scaling Up
Federated learning facilitates collaborative model training across multiple clients while preserving data privacy. However, its performance is often constrained by limited communication resources, particularly in systems supporting a large number of clients. To address this challenge, integrating over-the-air computations into the training process has emerged as a promising solution to alleviate communication bottlenecks. The system significantly increases the number of clients it can support in each communication round by transmitting intermediate parameters via analog signals rather than digital ones. This improvement, however, comes at the cost of channel-induced distortions, such as fading and noise, which affect the aggregated global parameters. To elucidate these effects, this paper develops a theoretical framework to analyze the performance of over-the-air federated learning in large-scale client scenarios. Our analysis reveals three key advantages of scaling up the number of participating clients: (1) Enhanced Privacy: The mutual information between a client's local gradient and the server's aggregated gradient diminishes, effectively reducing privacy leakage. (2) Mitigation of Channel Fading: The channel hardening effect eliminates the impact of small-scale fading in the noisy global gradient. (3) Improved Convergence: Reduced thermal noise and gradient estimation errors benefit the convergence rate. These findings solidify over-the-air model training as a viable approach for federated learning in networks with a large number of clients. The theoretical insights are further substantiated through extensive experimental evaluations.
☆ Text Meets Topology: Rethinking Out-of-distribution Detection in Text-Rich Networks EMNLP2025
Out-of-distribution (OOD) detection remains challenging in text-rich networks, where textual features intertwine with topological structures. Existing methods primarily address label shifts or rudimentary domain-based splits, overlooking the intricate textual-structural diversity. For example, in social networks, where users represent nodes with textual features (name, bio) while edges indicate friendship status, OOD may stem from the distinct language patterns between bot and normal users. To address this gap, we introduce the TextTopoOOD framework for evaluating detection across diverse OOD scenarios: (1) attribute-level shifts via text augmentations and embedding perturbations; (2) structural shifts through edge rewiring and semantic connections; (3) thematically-guided label shifts; and (4) domain-based divisions. Furthermore, we propose TNT-OOD to model the complex interplay between Text aNd Topology using: 1) a novel cross-attention module to fuse local structure into node-level text representations, and 2) a HyperNetwork to generate node-specific transformation parameters. This aligns topological and semantic features of ID nodes, enhancing ID/OOD distinction across structural and textual shifts. Experiments on 11 datasets across four OOD scenarios demonstrate the nuanced challenge of TextTopoOOD for evaluating OOD detection in text-rich networks.
comment: EMNLP2025 Main
☆ On the Edge of Memorization in Diffusion Models
When do diffusion models reproduce their training data, and when are they able to generate samples beyond it? A practically relevant theoretical understanding of this interplay between memorization and generalization may significantly impact real-world deployments of diffusion models with respect to issues such as copyright infringement and data privacy. In this work, to disentangle the different factors that influence memorization and generalization in practical diffusion models, we introduce a scientific and mathematical "laboratory" for investigating these phenomena in diffusion models trained on fully synthetic or natural image-like structured data. Within this setting, we hypothesize that the memorization or generalization behavior of an underparameterized trained model is determined by the difference in training loss between an associated memorizing model and a generalizing model. To probe this hypothesis, we theoretically characterize a crossover point wherein the weighted training loss of a fully generalizing model becomes greater than that of an underparameterized memorizing model at a critical value of model (under)parameterization. We then demonstrate via carefully-designed experiments that the location of this crossover predicts a phase transition in diffusion models trained via gradient descent, validating our hypothesis. Ultimately, our theory enables us to analytically predict the model size at which memorization becomes predominant. Our work provides an analytically tractable and practically meaningful setting for future theoretical and empirical investigations. Code for our experiments is available at https://github.com/DruvPai/diffusion_mem_gen.
comment: 10 main body pages, 43 total pages
☆ Unlearning as Ablation: Toward a Falsifiable Benchmark for Generative Scientific Discovery NeurIPS 2025
Bold claims about AI's role in science-from "AGI will cure all diseases" to promises of radically accelerated discovery-raise a central epistemic question: do large language models (LLMs) truly generate new knowledge, or do they merely remix memorized fragments? We propose unlearning-as-ablation as a falsifiable test of constructive scientific discovery. The method systematically removes a target result and its entire forget-closure (lemmas, paraphrases, and multi-hop entailments) and then evaluates whether the model can re-derive the result from only permitted axioms and tools. Success provides evidence for genuine generative capability; failure exposes current limits. Unlike prevailing motivations for unlearning-privacy, copyright, or safety-our framing repositions it as an epistemic probe for AI-for-Science. We argue that such tests could serve as the next generation of benchmarks, much as ImageNet catalyzed progress in vision: distinguishing models that can merely recall from those that can constructively generate new scientific knowledge. We outline a minimal pilot in mathematics and algorithms, and discuss extensions to physics, chemistry, and biology. Whether models succeed or fail, unlearning-as-ablation provides a principled framework to map the true reach and limits of AI scientific discovery. This is a position paper: we advance a conceptual and methodological argument rather than new empirical results.
comment: 6 pages. NeurIPS 2025 AI4Science Workshop submission
☆ Robustness Feature Adapter for Efficient Adversarial Training ECAI 2025
Adversarial training (AT) with projected gradient descent is the most popular method to improve model robustness under adversarial attacks. However, computational overheads become prohibitively large when AT is applied to large backbone models. AT is also known to have the issue of robust overfitting. This paper contributes to solving both problems simultaneously towards building more trustworthy foundation models. In particular, we propose a new adapter-based approach for efficient AT directly in the feature space. We show that the proposed adapter-based approach can improve the inner-loop convergence quality by eliminating robust overfitting. As a result, it significantly increases computational efficiency and improves model accuracy by generalizing adversarial robustness to unseen attacks. We demonstrate the effectiveness of the new adapter-based approach in different backbone architectures and in AT at scale.
comment: The paper has been accepted for presentation at ECAI 2025
☆ Characterizing the Behavior of Training Mamba-based State Space Models on GPUs
Mamba-based State Space Models (SSM) have emerged as a promising alternative to the ubiquitous transformers. Despite the expressive power of transformers, the quadratic complexity of computing attention is a major impediment to scaling performance as we increase the sequence length. SSMs provide an alternative path that addresses this problem, reducing the computational complexity requirements of self-attention with novel model architectures for different domains and fields such as video, text generation and graphs. Thus, it is important to characterize the behavior of these emerging workloads on GPUs and understand their requirements during GPU microarchitectural design. In this work we evaluate Mamba-based SSMs and characterize their behavior during training on GPUs. We construct a workload suite that offers representative models that span different model architectures. We then use this suite to analyze the architectural implications of running Mamba-based SSMs on GPUs. Our work sheds new light on potential optimizations to continue scaling the performance for such models.
☆ TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training
The data mixture used in the pre-training of a language model is a cornerstone of its final performance. However, a static mixing strategy is suboptimal, as the model's learning preferences for various data domains shift dynamically throughout training. Crucially, observing these evolving preferences in a computationally efficient manner remains a significant challenge. To address this, we propose TiKMiX, a method that dynamically adjusts the data mixture according to the model's evolving preferences. TiKMiX introduces Group Influence, an efficient metric for evaluating the impact of data domains on the model. This metric enables the formulation of the data mixing problem as a search for an optimal, influence-maximizing distribution. We solve this via two approaches: TiKMiX-D for direct optimization, and TiKMiX-M, which uses a regression model to predict a superior mixture. We trained models with different numbers of parameters, on up to 1 trillion tokens. TiKMiX-D exceeds the performance of state-of-the-art methods like REGMIX while using just 20% of the computational resources. TiKMiX-M leads to an average performance gain of 2% across 9 downstream benchmarks. Our experiments reveal that a model's data preferences evolve with training progress and scale, and we demonstrate that dynamically adjusting the data mixture based on Group Influence, a direct measure of these preferences, significantly improves performance by mitigating the underdigestion of data seen with static ratios.
☆ Towards Synthesizing Normative Data for Cognitive Assessments Using Generative Multimodal Large Language Models
Cognitive assessments require normative data as essential benchmarks for evaluating individual performance. Hence, developing new cognitive tests based on novel image stimuli is challenging due to the lack of readily available normative data. Traditional data collection methods are costly, time-consuming, and infrequently updated, limiting their practical utility. Recent advancements in generative multimodal large language models (MLLMs) offer a new approach to generate synthetic normative data from existing cognitive test images. We investigated the feasibility of using MLLMs, specifically GPT-4o and GPT-4o-mini, to synthesize normative textual responses for established image-based cognitive assessments, such as the "Cookie Theft" picture description task. Two distinct prompting strategies-naive prompts with basic instructions and advanced prompts enriched with contextual guidance-were evaluated. Responses were analyzed using embeddings to assess their capacity to distinguish diagnostic groups and demographic variations. Performance metrics included BLEU, ROUGE, BERTScore, and an LLM-as-a-judge evaluation. Advanced prompting strategies produced synthetic responses that more effectively distinguished between diagnostic groups and captured demographic diversity compared to naive prompts. Superior models generated responses exhibiting higher realism and diversity. BERTScore emerged as the most reliable metric for contextual similarity assessment, while BLEU was less effective for evaluating creative outputs. The LLM-as-a-judge approach provided promising preliminary validation results. Our study demonstrates that generative multimodal LLMs, guided by refined prompting methods, can feasibly generate robust synthetic normative data for existing cognitive tests, thereby laying the groundwork for developing novel image-based cognitive assessments without the traditional limitations.
comment: Preprint
♻ ☆ Neural Logic Networks for Interpretable Classification
Traditional neural networks have an impressive classification performance, but what they learn cannot be inspected, verified or extracted. Neural Logic Networks on the other hand have an interpretable structure that enables them to learn a logical mechanism relating the inputs and outputs with AND and OR operations. We generalize these networks with NOT operations and biases that take into account unobserved data and develop a rigorous logical and probabilistic modeling in terms of concept combinations to motivate their use. We also propose a novel factorized IF-THEN rule structure for the model as well as a modified learning algorithm. Our method improves the state-of-the-art in Boolean networks discovery and is able to learn relevant, interpretable rules in tabular classification, notably on examples from the medical and industrial fields where interpretability has tangible value.
comment: 24 pages, 8 figures, pre-print, code available at https://github.com/VincentPerreault0/NeuralLogicNetworks
♻ ☆ Optimistic Online Learning in Symmetric Cone Games
We introduce symmetric cone games (SCGs), a broad class of multi-player games where each player's strategy lies in a generalized simplex (the trace-one slice of a symmetric cone). This framework unifies a wide spectrum of settings, including normal-form games (simplex strategies), quantum games (density matrices), and continuous games with ball-constrained strategies. It also captures several structured machine learning and optimization problems, such as distance metric learning and Fermat-Weber facility location, as two-player zero-sum SCGs. To compute approximate Nash equilibria in two-player zero-sum SCGs, we propose a single online learning algorithm: Optimistic Symmetric Cone Multiplicative Weights Updates (OSCMWU). Unlike prior methods tailored to specific geometries, OSCMWU provides closed-form, projection-free updates over any symmetric cone and achieves an optimal $\tilde{\mathcal{O}}(1/\epsilon)$ iteration complexity for computing $\epsilon$-saddle points. Our analysis builds on the Optimistic Follow-the-Regularized-Leader framework and hinges on a key technical contribution: We prove that the symmetric cone negative entropy is strongly convex with respect to the trace-one norm. This result extends known results for the simplex and spectraplex to all symmetric cones, and may be of independent interest.
♻ ☆ Why Isn't Relational Learning Taking Over the World?
Artificial intelligence seems to be taking over the world with systems that model pixels, words, and phonemes. The world is arguably made up, not of pixels, words, and phonemes but of entities (objects, things, including events) with properties and relations among them. Surely we should model these, not the perception or description of them. You might suspect that concentrating on modeling words and pixels is because all of the (valuable) data in the world is in terms of text and images. If you look into almost any company you will find their most valuable data is in spreadsheets, databases and other relational formats. These are not the form that are studied in introductory machine learning, but are full of product numbers, student numbers, transaction numbers and other identifiers that can't be interpreted naively as numbers. The field that studies this sort of data has various names including relational learning, statistical relational AI, and many others. This paper explains why relational learning is not taking over the world -- except in a few cases with restricted relations -- and what needs to be done to bring it to it's rightful prominence.
comment: 10 pages (6 pages + references + appendices)
♻ ☆ Machine Learning-Based Prediction of Quality Shifts on Video Streaming Over 5G
The Quality of Experience (QoE) is the users satisfaction while streaming a video session over an over-the-top (OTT) platform like YouTube. QoE of YouTube reflects the smooth streaming session without any buffering and quality shift events. One of the most important factors nowadays affecting QoE of YouTube is frequent shifts from higher to lower resolutions and vice versa. These shifts ensure a smooth streaming session; however, it might get a lower mean opinion score. For instance, dropping from 1080p to 480p during a video can preserve continuity but might reduce the viewers enjoyment. Over time, OTT platforms are looking for alternative ways to boost user experience instead of relying on traditional Quality of Service (QoS) metrics such as bandwidth, latency, and throughput. As a result, we look into the relationship between quality shifting in YouTube streaming sessions and the channel metrics RSRP, RSRQ, and SNR. Our findings state that these channel metrics positively correlate with shifts. Thus, in real-time, OTT can only rely on them to predict video streaming sessions into lower- and higher-resolution categories, thus providing more resources to improve user experience. Using traditional Machine Learning (ML) classifiers, we achieved an accuracy of 77-percent, while using only RSRP, RSRQ, and SNR. In the era of 5G and beyond, where ultra-reliable, low-latency networks promise enhanced streaming capabilities, the proposed methodology can be used to improve OTT services.
comment: We are trying to improve the paper with better results. There are some inconsistencies with the current version
♻ ☆ OwkinZero: Accelerating Biological Discovery with AI
While large language models (LLMs) are rapidly advancing scientific research, they continue to struggle with core biological reasoning tasks essential for translational and biomedical discovery. To address this limitation, we created and curated eight comprehensive benchmark datasets comprising over 300,000 verifiable question-and-answer pairs, each targeting critical challenges in drug discovery including target druggability, modality suitability, and drug perturbation effects. Using this resource, we developed the OwkinZero models by post-training open-source LLMs through a Reinforcement Learning from Verifiable Rewards strategy. Our results demonstrate that specialized 8-32B OwkinZero models substantially outperform larger, state-of-the-art commercial LLMs on these biological benchmarks. Remarkably, we uncover evidence of a key aspect of generalization: specialist models trained on a single task consistently outperform their base models on previously unseen tasks. This generalization effect is further amplified in our comprehensive OwkinZero models, which were trained on a mixture of datasets and achieve even broader cross-task improvements. This study represents a significant step toward addressing the biological reasoning blind spot in current LLMs, demonstrating that targeted reinforcement learning on carefully curated data can unlock generalizable performance in specialized models, thereby accelerating AI-driven biological discovery.
comment: Preprint
♻ ☆ TranSQL+: Serving Large Language Models with SQL on Low-Resource Hardware SIGMOD2026
Deploying Large Language Models (LLMs) on resource-constrained devices remains challenging due to limited memory, lack of GPUs, and the complexity of existing runtimes. In this paper, we introduce TranSQL+, a template-based code generator that translates LLM computation graphs into pure SQL queries for execution in relational databases. Without relying on external libraries, TranSQL+, leverages mature database features, such as vectorized execution and out-of-core processing, for efficient inference. We further propose a row-to-column (ROW2COL) optimization that improves join efficiency in matrix operations. Evaluated on Llama3-8B and DeepSeekMoE models, TranSQL+ achieves up to 20x lower prefill latency and 4x higher decoding speed compared to DeepSpeed Inference and Llama.cpp in low-memory and CPU-only configurations. Our results highlight relational databases as a practical environment for LLMs on low-resource hardware.
comment: Accepted by SIGMOD2026
♻ ☆ Making Hard Problems Easier with Custom Data Distributions and Loss Regularization: A Case Study in Modular Arithmetic
Recent work showed that ML-based attacks on Learning with Errors (LWE), a hard problem used in post-quantum cryptography, outperform classical algebraic attacks in certain settings. Although promising, ML attacks struggle to scale to more complex LWE settings. Prior work connected this issue to the difficulty of training ML models to do modular arithmetic, a core feature of the LWE problem. To address this, we develop techniques that significantly boost the performance of ML models on modular arithmetic tasks, enabling the models to sum up to $N=128$ elements modulo $q \le 974269$. Our core innovation is the use of custom training data distributions and a carefully designed loss function that better represents the problem structure. We apply an initial proof of concept of our techniques to LWE specifically and find that they allow recovery of 2x harder secrets than prior work. Our techniques also help ML models learn other well-studied problems better, including copy, associative recall, and parity, motivating further study.
♻ ☆ Manifold learning in metric spaces
Laplacian-based methods are popular for dimensionality reduction of data lying in $\mathbb{R}^N$. Several theoretical results for these algorithms depend on the fact that the Euclidean distance locally approximates the geodesic distance on the underlying submanifold which the data are assumed to lie on. However, for some applications, other metrics, such as the Wasserstein distance, may provide a more appropriate notion of distance than the Euclidean distance. We provide a framework that generalizes the problem of manifold learning to metric spaces and study when a metric satisfies sufficient conditions for the pointwise convergence of the graph Laplacian.
♻ ☆ HeteroTune: Efficient Federated Learning for Large Heterogeneous Models
While large pre-trained models have achieved impressive performance across AI tasks, their deployment in privacy-sensitive and distributed environments remains challenging. Federated learning (FL) offers a viable solution by enabling decentralized fine-tuning without data sharing, but real-world applications face significant obstacles due to heterogeneous client resources in compute and memory. To address this, we propose HeteroTune, a novel federated fine-tuning paradigm for large, heterogeneous models operating under limited communication and computation budgets. The core of our method lies in a novel architecture, DeMA (Dense Mixture of Adapters), which enables flexible and efficient aggregation of heterogeneous models by preserving their full representational capacity while facilitating seamless cross-model knowledge fusion. We further introduce CMGA (Cross-Model Gradient Alignment), a lightweight yet effective mechanism that enhances training stability by harmonizing gradient directions across heterogeneous client models during aggregation, mitigating update conflicts and promoting more consistent convergence in federated settings. We provide both theoretical analysis and empirical evidence showing that HeteroTune achieves state-of-the-art performance and efficiency across diverse tasks and model architectures. For example, on LLaMA models, it reduces communication overhead by 99.5%, cuts peak memory usage by ~50%, and improves performance by 4.61%.
comment: 16 pages, 4 figures
♻ ☆ A Multisource Fusion Framework for Cryptocurrency Price Movement Prediction
Predicting cryptocurrency price trends remains a major challenge due to the volatility and complexity of digital asset markets. Artificial intelligence (AI) has emerged as a powerful tool to address this problem. This study proposes a multisource fusion framework that integrates quantitative financial indicators, such as historical prices and technical indicators, with qualitative sentiment signals derived from X (formerly Twitter). Sentiment analysis is performed using Financial Bidirectional Encoder Representations from Transformers (FinBERT), a domain-specific BERT-based model optimized for financial text, while sequential dependencies are captured through a Bidirectional Long Short-Term Memory (BiLSTM) network. Experimental results on a large-scale Bitcoin dataset demonstrate that the proposed approach substantially outperforms single-source models, achieving an accuracy of approximately 96.8\%. The findings underscore the importance of incorporating real-time social sentiment alongside traditional indicators, thereby enhancing predictive accuracy and supporting more informed investment decisions.
♻ ☆ Does provable absence of barren plateaus imply classical simulability?
A large amount of effort has recently been put into understanding the barren plateau phenomenon. In this perspective article, we face the increasingly loud elephant in the room and ask a question that has been hinted at by many but not explicitly addressed: Can the structure that allows one to avoid barren plateaus also be leveraged to efficiently simulate the loss classically? We collect evidence-on a case-by-case basis-that many commonly used models whose loss landscapes avoid barren plateaus can also admit classical simulation, provided that one can collect some classical data from quantum devices during an initial data acquisition phase. This follows from the observation that barren plateaus result from a curse of dimensionality, and that current approaches for solving them end up encoding the problem into some small, classically simulable, subspaces. Thus, while stressing that quantum computers can be essential for collecting data, our analysis sheds doubt on the information processing capabilities of many parametrized quantum circuits with provably barren plateau-free landscapes. We end by discussing the (many) caveats in our arguments including the limitations of average case arguments, the role of smart initializations, models that fall outside our assumptions, the potential for provably superpolynomial advantages and the possibility that, once larger devices become available, parametrized quantum circuits could heuristically outperform our analytic expectations.
comment: 15+22 pages, 5+2 figures, 2 tables, updated to published version
♻ ☆ Evaluation of Large Language Models via Coupled Token Generation
State of the art large language models rely on randomization to respond to a prompt. As an immediate consequence, a model may respond differently to the same prompt if asked multiple times. In this work, we argue that the evaluation and ranking of large language models should control for the randomization underpinning their functioning. Our starting point is the development of a causal model for coupled autoregressive generation, which allows different large language models to sample responses with the same source of randomness. Building upon our causal model, we first show that, on evaluations based on benchmark datasets, coupled autoregressive generation leads to the same conclusions as vanilla autoregressive generation but using provably fewer samples. However, we further show that, on evaluations based on (human) pairwise comparisons, coupled and vanilla autoregressive generation can surprisingly lead to different rankings when comparing more than two models, even with an infinite amount of samples. This suggests that the apparent advantage of a model over others in existing evaluation protocols may not be genuine but rather confounded by the randomness inherent to the generation process. To illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Mistral and Qwen families. We find that, across multiple benchmark datasets, coupled autoregressive generation requires up to 75% fewer samples to reach the same conclusions as vanilla autoregressive generation. Further, we find that the win-rates derived from pairwise comparisons by a strong large language model to prompts from the LMSYS Chatbot Arena platform differ under coupled and vanilla autoregressive generation.
♻ ☆ Low-Regret and Low-Complexity Learning for Hierarchical Inference
This work focuses on Hierarchical Inference (HI) in edge intelligence systems, where a compact Local-ML model on an end-device works in conjunction with a high-accuracy Remote-ML model on an edge-server. HI aims to reduce latency, improve accuracy, and lower bandwidth usage by first using the Local-ML model for inference and offloading to the Remote-ML only when the local inference is likely incorrect. A critical challenge in HI is estimating the likelihood of the local inference being incorrect, especially when data distributions and offloading costs change over time -- a problem we term Hierarchical Inference Learning (HIL). We introduce a novel approach to HIL by modeling the probability of correct inference by the Local-ML as an increasing function of the model's confidence measure, a structure motivated by empirical observations but previously unexploited. We propose two policies, HI-LCB and HI-LCB-lite, based on the Upper Confidence Bound (UCB) framework. We demonstrate that both policies achieve order-optimal regret of $O(\log T)$, a significant improvement over existing HIL policies with $O(T^{2/3})$ regret guarantees. Notably, HI-LCB-lite has an $O(1)$ per-sample computational complexity, making it well-suited for deployment on devices with severe resource limitations. Simulations using real-world datasets confirm that our policies outperform existing state-of-the-art HIL methods.
♻ ☆ HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images
Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.
comment: 12 pages, 8 figures
♻ ☆ Missing Melodies: AI Music Generation and its "Nearly" Complete Omission of the Global South
Recent advances in generative AI have sparked renewed interest and expanded possibilities for music generation. However, the performance and versatility of these systems across musical genres are heavily influenced by the availability of training data. We conducted an extensive analysis of over one million hours of audio datasets used in AI music generation research and manually reviewed more than 200 papers from eleven prominent AI and music conferences and organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR, NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and inclusion of the musical genres of the Global South in AI research. Our findings reveal a stark imbalance: approximately 86% of the total dataset hours and over 93% of researchers focus primarily on music from the Global North. However, around 40% of these datasets include some form of non-Western music, genres from the Global South account for only 14.6% of the data. Furthermore, approximately 51% of the papers surveyed concentrate on symbolic music generation, a method that often fails to capture the cultural nuances inherent in music from regions such as South Asia, the Middle East, and Africa. As AI increasingly shapes the creation and dissemination of music, the significant underrepresentation of music genres in datasets and research presents a serious threat to global musical diversity. We also propose some important steps to mitigate these risks and foster a more inclusive future for AI-driven music generation.
comment: Submitted to CACM, 12 pages, 2 figures
♻ ☆ Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering and lightweight LLM guardrails baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness and over-refusal. Check out the website here https://sites.google.com/view/llm-nbf/home . Our code is available on https://github.com/HanjiangHu/NBF-LLM .
comment: 23 pages, 10 figures, 11 tables
♻ ☆ Reconsidering Fairness Through Unawareness From the Perspective of Model Multiplicity
Fairness through Unawareness (FtU) describes the idea that discrimination against demographic groups can be avoided by not considering group membership in the decisions or predictions. This idea has long been criticized in the machine learning literature as not being sufficient to ensure fairness. In addition, the use of additional features is typically thought to increase the accuracy of the predictions for all groups, so that FtU is sometimes thought to be detrimental to all groups. In this paper, we show both theoretically and empirically that FtU can reduce algorithmic discrimination without necessarily reducing accuracy. We connect this insight with the literature on Model Multiplicity, to which we contribute with novel theoretical and empirical results. Furthermore, we illustrate how, in a real-life application, FtU can contribute to the deployment of more equitable policies without losing efficacy. Our findings suggest that FtU is worth considering in practical applications, particularly in high-risk scenarios, and that the use of protected attributes such as gender in predictive models should be accompanied by a clear and well-founded justification.
♻ ☆ CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications
The increasing use of Large Language Models (LLMs) in agentic applications highlights the need for robust safety guard models. While content safety in English is well-studied, non-English languages lack similar advancements due to the high cost of collecting culturally aligned labeled datasets. We present CultureGuard, a novel solution for curating culturally aligned, high-quality safety datasets across multiple languages. Our approach introduces a four-stage synthetic data generation and filtering pipeline: cultural data segregation, cultural data adaptation, machine translation, and quality filtering. This pipeline enables the conversion and expansion of the Nemotron-Content-Safety-Dataset-V2 English safety dataset into eight distinct languages: Arabic, German, Spanish, French, Hindi, Japanese, Thai, and Chinese. The resulting dataset, Nemotron-Content-Safety-Dataset-Multilingual-v1, comprises 386,661 samples in 9 languages and facilitates the training of Llama-3.1-Nemotron-Safety-Guard-Multilingual-8B-v1 via LoRA-based fine-tuning. The final model achieves state-of-the-art performance on several multilingual content safety benchmarks. We also benchmark the latest open LLMs on multilingual safety and observe that these LLMs are more prone to give unsafe responses when prompted in non-English languages. This work represents a significant step toward closing the safety gap in multilingual LLMs by enabling the development of culturally aware safety guard models.
♻ ☆ ReHub: Linear Complexity Graph Transformers with Adaptive Hub-Spoke Reassignment
We present ReHub, a novel graph transformer architecture that achieves linear complexity through an efficient reassignment technique between nodes and virtual nodes. Graph transformers have become increasingly important in graph learning for their ability to utilize long-range node communication explicitly, addressing limitations such as oversmoothing and oversquashing found in message-passing graph networks. However, their dense attention mechanism scales quadratically with the number of nodes, limiting their applicability to large-scale graphs. ReHub draws inspiration from the airline industry's hub-and-spoke model, where flights are assigned to optimize operational efficiency. In our approach, graph nodes (spokes) are dynamically reassigned to a fixed number of virtual nodes (hubs) at each model layer. Recent work, Neural Atoms (Li et al., 2024), has demonstrated impressive and consistent improvements over GNN baselines by utilizing such virtual nodes; their findings suggest that the number of hubs strongly influences performance. However, increasing the number of hubs typically raises complexity, requiring a trade-off to maintain linear complexity. Our key insight is that each node only needs to interact with a small subset of hubs to achieve linear complexity, even when the total number of hubs is large. To leverage all hubs without incurring additional computational costs, we propose a simple yet effective adaptive reassignment technique based on hub-hub similarity scores, eliminating the need for expensive node-hub computations. Our experiments on LRGB indicate a consistent improvement in results over the base method, Neural Atoms, while maintaining a linear complexity. Remarkably, our sparse model achieves performance on par with its non-sparse counterpart. Furthermore, ReHub outperforms competitive baselines and consistently ranks among top performers across various benchmarks.
comment: TMLR 2025. Project page: https://tomerborreda.github.io/rehub/
♻ ☆ A Survey of the Self Supervised Learning Mechanisms for Vision Transformers
Advances in deep learning are re-defining how visual data is processed and understand by the machines. Vision Transformers (ViTs) have recently demonstrated prominent performance in computer vision related tasks. However, their performance improves with increasing numbers of labeled data, indicating reliance on labeled data. Humanly annotated data are difficult to acquire and thus shifted the focus from traditional annotations to unsupervised learning strategies that learn structures inside the data. In response to this challenge, self-supervised learning (SSL) has emerged as a promising technique. SSL utilize inherent relationships within the data as a form of supervision. This technique can reduce the dependence on manual annotations and offers a more scalable and resource-effective approach to training models. Taking these strengths into account, it is necessary to assess the combination of SSL methods with ViTs, especially in the cases of limited labeled data. Inspired by this evolving trend, this survey aims to systematically review SSL mechanisms tailored for ViTs. We propose a comprehensive taxonomy to classify SSL techniques based on their representations and pre-training tasks. Furthermore, we highlighted the motivations behind the study of SSL, reviewed prominent pre-training tasks, and highlight advancements and challenges in this field. Furthermore, we conduct a comparative analysis of various SSL methods designed for ViTs, evaluating their strengths, limitations, and applicability to different scenarios.
comment: 42 Pages, 4 Figures, 7 Tables
♻ ☆ Fault Detection in New Wind Turbines with Limited Data by Generative Transfer Learning
Intelligent condition monitoring of wind turbines is essential for reducing downtimes. Machine learning models trained on wind turbine operation data are commonly used to detect anomalies and, eventually, operation faults. However, data-driven normal behavior models (NBMs) require a substantial amount of training data, as NBMs trained with scarce data may result in unreliable fault detection. To overcome this limitation, we present a novel generative deep transfer learning approach to make SCADA samples from one wind turbine lacking training data resemble SCADA data from wind turbines with representative training data. Through CycleGAN-based domain mapping, our method enables the application of an NBM trained on an existing wind turbine to a new one with severely limited data. We demonstrate our approach on field data mapping SCADA samples across 7 substantially different WTs. Our findings show significantly improved fault detection in wind turbines with scarce data. Our method achieves the most similar anomaly scores to an NBM trained with abundant data, outperforming NBMs trained on scarce training data with improvements of +10.3% in F1-score when 1 month of training data is available and +16.8% when 2 weeks are available. The domain mapping approach outperforms conventional fine-tuning at all considered degrees of data scarcity, ranging from 1 to 8 weeks of training data. The proposed technique enables earlier and more reliable fault detection in newly installed wind farms, demonstrating a novel and promising research direction to improve anomaly detection when faced with training data scarcity.
comment: Change of phrasing to fault detection, including title change, and minor revisions. No changes to models, experiments, or results
♻ ☆ Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.
comment: 16 pages, 5 figures
♻ ☆ Explainable Prediction of the Mechanical Properties of Composites with CNNs
Composites are amongst the most important materials manufactured today, as evidenced by their use in countless applications. In order to establish the suitability of composites in specific applications, finite element (FE) modelling, a numerical method based on partial differential equations, is the industry standard for assessing their mechanical properties. However, FE modelling is exceptionally costly from a computational viewpoint, a limitation which has led to efforts towards applying AI models to this task. However, in these approaches: the chosen model architectures were rudimentary, feed-forward neural networks giving limited accuracy; the studies focused on predicting elastic mechanical properties, without considering material strength limits; and the models lacked transparency, hindering trustworthiness by users. In this paper, we show that convolutional neural networks (CNNs) equipped with methods from explainable AI (XAI) can be successfully deployed to solve this problem. Our approach uses customised CNNs trained on a dataset we generate using transverse tension tests in FE modelling to predict composites' mechanical properties, i.e., Young's modulus and yield strength. We show empirically that our approach achieves high accuracy, outperforming a baseline, ResNet-34, in estimating the mechanical properties. We then use SHAP and Integrated Gradients, two post-hoc XAI methods, to explain the predictions, showing that the CNNs use the critical geometrical features that influence the composites' behaviour, thus allowing engineers to verify that the models are trustworthy by representing the science of composites.
comment: 9 pages, 6 figures. Accepted for publication at The 14th Conference on Prestigious Applications of Intelligent Systems (PAIS-2025)
♻ ☆ Probabilistic Classification of Near-Surface Shallow-Water Sediments using A Portable Free-Fall Penetrometer
The geotechnical evaluation of seabed sediments is important for engineering projects and naval applications, offering valuable insights into sediment properties, behavior, and strength. Obtaining high-quality seabed samples can be a challenging task, making in situ testing an essential part of site characterization. Free-fall penetrometers (FFPs) are robust tools for rapidly profiling seabed surface sediments, even in energetic nearshore or estuarine conditions and shallow as well as deep depths. Although methods for interpretation of traditional offshore cone penetration testing (CPT) data are well-established, their adaptation to FFP data is still an area of research. This study introduces an innovative approach that utilizes machine learning algorithms to create a sediment behavior classification system based on portable free- fall penetrometer (PFFP) data. The proposed model leverages PFFP measurements obtained from multiple locations, such as Sequim Bay (Washington), the Potomac River, and the York River (Virginia). The results show 91.1% accuracy in the class prediction, with the classes representing cohesionless sediment with little to no plasticity (Class 1), cohesionless sediment with some plasticity (Class 2), cohesive sediment with low plasticity (Class 3), and cohesive sediment with high plasticity (Class 4). The model prediction not only predicts classes but also yields an estimate of inherent uncertainty associated with the prediction, which can provide valuable insight into different sediment behaviors. Lower uncertainties are more common, but they can increase significantly depending on variations in sediment composition, environmental conditions, and operational techniques. By quantifying uncertainty, the model offers a more comprehensive and informed approach to sediment classification
♻ ☆ Large Language Models in the Task of Automatic Validation of Text Classifier Predictions
Machine learning models for text classification are trained to predict a class for a given text. To do this, training and validation samples must be prepared: a set of texts is collected, and each text is assigned a class. These classes are usually assigned by human annotators with different expertise levels, depending on the specific classification task. Collecting such samples from scratch is labor-intensive because it requires finding specialists and compensating them for their work; moreover, the number of available specialists is limited, and their productivity is constrained by human factors. While it may not be too resource-intensive to collect samples once, the ongoing need to retrain models (especially in incremental learning pipelines) to address data drift (also called model drift) makes the data collection process crucial and costly over the model's entire lifecycle. This paper proposes several approaches to replace human annotators with Large Language Models (LLMs) to test classifier predictions for correctness, helping ensure model quality and support high-quality incremental learning.
♻ ☆ Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP
Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method proposes a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.
comment: 4 pages, 1 reference, 3 figures, icassp 2026
♻ ☆ An Unsupervised Deep XAI Framework for Localization of Concurrent Replay Attacks in Nuclear Reactor Signals
Next generation advanced nuclear reactors are expected to be smaller both in size and power output, relying extensively on fully digital instrumentation and control systems. These reactors will generate a large flow of information in the form of multivariate time series data, conveying simultaneously various non linear cyber physical, process, control, sensor, and operational states. Ensuring data integrity against deception attacks is becoming increasingly important for networked communication and a requirement for safe and reliable operation. Current efforts to address replay attacks, almost universally focus on watermarking or supervised anomaly detection approaches without further identifying and characterizing the root cause of the anomaly. In addition, these approaches rely mostly on synthetic data with uncorrelated Gaussian process and measurement noise and full state feedback or are limited to univariate signals, signal stationarity, linear quadratic regulators, or other linear-time invariant state-space which may fail to capture any unmodeled system dynamics. In the realm of regulated nuclear cyber-physical systems, additional work is needed on characterization of replay attacks and explainability of predictions using real data. Here, we propose an unsupervised explainable AI framework based on a combination of autoencoder and customized windowSHAP algorithm to fully characterize real-time replay attacks, i.e., detection, source identification, timing and type, of increasing complexity during a dynamic time evolving reactor process. The proposed XAI framework was benchmarked on several real world datasets from Purdue's nuclear reactor PUR-1 with up to six signals concurrently being replayed. In all cases, the XAI framework was able to detect and identify the source and number of signals being replayed and the duration of the falsification with 95 percent or better accuracy.
comment: Added references, corrected typos, grammar check, authors updated
♻ ☆ Mitigating Message Imbalance in Fraud Detection with Dual-View Graph Representation Learning
Graph representation learning has become a mainstream method for fraud detection due to its strong expressive power, which focuses on enhancing node representations through improved neighborhood knowledge capture. However, the focus on local interactions leads to imbalanced transmission of global topological information and increased risk of node-specific information being overwhelmed during aggregation due to the imbalance between fraud and benign nodes. In this paper, we first summarize the impact of topology and class imbalance on downstream tasks in GNN-based fraud detection, as the problem of imbalanced supervisory messages is caused by fraudsters' topological behavior obfuscation and identity feature concealment. Based on statistical validation, we propose a novel dual-view graph representation learning method to mitigate Message imbalance in Fraud Detection (MimbFD). Specifically, we design a topological message reachability module for high-quality node representation learning to penetrate fraudsters' camouflage and alleviate insufficient propagation. Then, we introduce a local confounding debiasing module to adjust node representations, enhancing the stable association between node representations and labels to balance the influence of different classes. Finally, we conducted experiments on three public fraud datasets, and the results demonstrate that MimbFD exhibits outstanding performance in fraud detection.
♻ ☆ EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models
Understanding behavior requires datasets that capture humans while carrying out complex tasks. The kitchen is an excellent environment for assessing human motor and cognitive function, as many complex actions are naturally exhibited in kitchens from chopping to cleaning. Here, we introduce the EPFL-Smart-Kitchen-30 dataset, collected in a noninvasive motion capture platform inside a kitchen environment. Nine static RGB-D cameras, inertial measurement units (IMUs) and one head-mounted HoloLens~2 headset were used to capture 3D hand, body, and eye movements. The EPFL-Smart-Kitchen-30 dataset is a multi-view action dataset with synchronized exocentric, egocentric, depth, IMUs, eye gaze, body and hand kinematics spanning 29.7 hours of 16 subjects cooking four different recipes. Action sequences were densely annotated with 33.78 action segments per minute. Leveraging this multi-modal dataset, we propose four benchmarks to advance behavior understanding and modeling through 1) a vision-language benchmark, 2) a semantic text-to-motion generation benchmark, 3) a multi-modal action recognition benchmark, 4) a pose-based action segmentation benchmark. We expect the EPFL-Smart-Kitchen-30 dataset to pave the way for better methods as well as insights to understand the nature of ecologically-valid human behavior. Code and data are available at https://github.com/amathislab/EPFL-Smart-Kitchen
comment: Code and data at: https://github.com/amathislab/EPFL-Smart-Kitchen
♻ ☆ A foundation model with multi-variate parallel attention to generate neuronal activity
Learning from multi-variate time-series with heterogeneous channel configurations remains a fundamental challenge for deep neural networks, particularly in clinical domains such as intracranial electroencephalography (iEEG), where channel setups vary widely across subjects. In this work, we introduce multi-variate parallel attention (MVPA), a novel self-attention mechanism that disentangles content, temporal, and spatial attention, enabling flexible, generalizable, and efficient modeling of time-series data with varying channel counts and configurations. We use MVPA to build MVPFormer, a generative foundation model for human electrophysiology, trained to predict the evolution of iEEG signals across diverse subjects. To support this and future efforts by the community, we release the SWEC iEEG dataset, the largest publicly available iEEG dataset to date, comprising nearly 10,000 hours of recordings from heterogeneous clinical sources. MVPFormer leverages MVPA to achieve strong generalization across subjects, demonstrating expert-level performance in several iEEG tasks. MVPFormer surpasses state-of-the-art Transformer baselines in seizure detection across the SWEC, the MAYO, and the FNUSA datasets, while also achieving state-of-the-art performance on four Brain TreeBank iEEG decoding tasks. We further validate MVPA on standard time-series forecasting and classification tasks, where it matches or exceeds the performance of existing attention-based models. Together, our contributions establish MVPA as a general-purpose attention mechanism for heterogeneous time-series and MVPFormer as the first open-source, open-weights, and open-data iEEG foundation model with SOTA clinical performance. The code is available at https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG dataset is available at https://huggingface.co/datasets/NeuroTec/SWEC_iEEG_Dataset.
comment: The code is available at https://github.com/IBM/multi-variate-parallel-transformer. The SWEC iEEG dataset is available at https://huggingface.co/datasets/NeuroTec/SWEC_iEEG_Dataset
♻ ☆ Memento: Fine-tuning LLM Agents without Fine-tuning LLMs
In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely \emph{Memento}, which attains top-1 on GAIA validation ($87.88\%$ Pass@$3$) and $79.40\%$ on the test set. It reaches $66.6\%$ F1 and $80.4\%$ PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds $4.7\%$ to $9.6\%$ absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/Memento.
♻ ☆ SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning "superword" tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI's o200k tokenizer and 30% improvement over Google's Gemma 3 tokenizer (256k vocabulary), while maintaining competitive performance across 38 languages. When integrated with a GPT-2 scale model (124M parameters) trained on 10 billion tokens from the FineWeb-Edu dataset, SupraTok yields 8.4% improvement on HellaSWAG and 9.5% on MMLU benchmarks without architectural modifications. While these results are promising at this scale, further validation at larger model scales is needed. These findings suggest that efficient tokenization can complement architectural innovations as a path to improved language model performance.
♻ ☆ Local Off-Grid Weather Forecasting with Multi-Modal Earth Observation Data
Urgent applications like wildfire management and renewable energy generation require precise, localized weather forecasts near the Earth's surface. However, forecasts produced by machine learning models or numerical weather prediction systems are typically generated on large-scale regular grids, where direct downscaling fails to capture fine-grained, near-surface weather patterns. In this work, we propose a multi-modal transformer model trained end-to-end to downscale gridded forecasts to off-grid locations of interest. Our model directly combines local historical weather observations (e.g., wind, temperature, dewpoint) with gridded forecasts to produce locally accurate predictions at various lead times. Multiple data modalities are collected and concatenated at station-level locations, treated as a token at each station. Using self-attention, the token corresponding to the target location aggregates information from its neighboring tokens. Experiments using weather stations across the Northeastern United States show that our model outperforms a range of data-driven and non-data-driven off-grid forecasting methods. They also reveal that direct input of station data provides a phase shift in local weather forecasting accuracy, reducing the prediction error by up to 80% compared to pure gridded data based models. This approach demonstrates how to bridge the gap between large-scale weather models and locally accurate forecasts to support high-stakes, location-sensitive decision-making.
♻ ☆ Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming
Algorithmic discovery has traditionally relied on human ingenuity and extensive experimentation. Here we investigate whether a prominent scientific computing algorithm, the Kalman Filter, can be discovered through an automated, data-driven, evolutionary process that relies on Cartesian Genetic Programming (CGP) and Large Language Models (LLM). We evaluate the contributions of both modalities (CGP and LLM) in discovering the Kalman filter under varying conditions. Our results demonstrate that our framework of CGP and LLM-assisted evolution converges to near-optimal solutions when Kalman optimality assumptions hold. When these assumptions are violated, our framework evolves interpretable alternatives that outperform the Kalman filter. These results demonstrate that combining evolutionary algorithms and generative models for interpretable, data-driven synthesis of simple computational modules is a potent approach for algorithmic discovery in scientific computing.
♻ ☆ Fitting Multilevel Factor Models
We examine a special case of the multilevel factor model, with covariance given by multilevel low rank (MLR) matrix~\cite{parshakova2023factor}. We develop a novel, fast implementation of the expectation-maximization algorithm, tailored for multilevel factor models, to maximize the likelihood of the observed data. This method accommodates any hierarchical structure and maintains linear time and storage complexities per iteration. This is achieved through a new efficient technique for computing the inverse of the positive definite MLR matrix. We show that the inverse of positive definite MLR matrix is also an MLR matrix with the same sparsity in factors, and we use the recursive Sherman-Morrison-Woodbury matrix identity to obtain the factors of the inverse. Additionally, we present an algorithm that computes the Cholesky factorization of an expanded matrix with linear time and space complexities, yielding the covariance matrix as its Schur complement. This paper is accompanied by an open-source package that implements the proposed methods.
♻ ☆ Minimizing Surrogate Losses for Decision-Focused Learning using Differentiable Optimization
Decision-focused learning (DFL) trains a machine learning (ML) model to predict parameters of an optimization problem, to directly minimize decision regret, i.e., maximize decision quality. Gradient-based DFL requires computing the derivative of the solution to the optimization problem with respect to the predicted parameters. However, for many optimization problems, such as linear programs (LPs), the gradient of the regret with respect to the predicted parameters is zero almost everywhere. Existing gradient-based DFL approaches for LPs try to circumvent this issue in one of two ways: (a) smoothing the LP into a differentiable optimization problem by adding a quadratic regularizer and then minimizing the regret directly or (b) minimizing surrogate losses that have informative (sub)gradients. In this paper, we show that the former approach still results in zero gradients, because even after smoothing the regret remains constant across large regions of the parameter space. To address this, we propose minimizing surrogate losses -- even when a differentiable optimization layer is used and regret can be minimized directly. Our experiments demonstrate that minimizing surrogate losses allows differentiable optimization layers to achieve regret comparable to or better than surrogate-loss based DFL methods. Further, we demonstrate that this also holds for DYS-Net, a recently proposed differentiable optimization technique for LPs, that computes approximate solutions and gradients through operations that can be performed using feedforward neural network layers. Because DYS-Net executes the forward and the backward pass very efficiently, by minimizing surrogate losses using DYS-Net, we are able to attain regret on par with the state-of-the-art while reducing training time by a significant margin.
♻ ☆ FlexTSF: A Flexible Forecasting Model for Time Series with Variable Regularities
Forecasting time series with irregular temporal structures remains challenging for universal pre-trained models. Existing approaches often assume regular sampling or depend heavily on imputation, limiting their applicability in real-world scenarios where irregularities are prevalent due to diverse sensing devices and recording practices. We introduce FlexTSF, a flexible forecasting model specifically designed for time series data with variable temporal regularities. At its foundation lies the IVP Patcher, a continuous-time patching module leveraging Initial Value Problems (IVPs) to inherently support uneven time intervals, variable sequence lengths, and missing values. FlexTSF employs a decoder-only architecture that integrates normalized timestamp inputs and domain-specific statistics through a specialized causal self-attention mechanism, enabling adaptability across domains. Extensive experiments on 16 datasets demonstrate FlexTSF's effectiveness, significantly outperforming existing models in classic forecasting scenarios, zero-shot generalization, and low-resource fine-tuning conditions. Ablation studies confirm the contributions of each design component and the advantage of not relying on predefined fixed patch lengths.
♻ ☆ SVD Based Least Squares for X-Ray Pneumonia Classification Using Deep Features SP
Accurate and early diagnosis of pneumonia through X-ray imaging is essential for effective treatment and improved patient outcomes. Recent advancements in machine learning have enabled automated diagnostic tools that assist radiologists in making more reliable and efficient decisions. In this work, we propose a Singular Value Decomposition-based Least Squares (SVD-LS) framework for multi-class pneumonia classification, leveraging powerful feature representations from state-of-the-art self-supervised and transfer learning models. Rather than relying on computationally expensive gradient-based fine-tuning, we employ a closed-form, non-iterative classification approach that ensures efficiency without compromising accuracy. Experimental results demonstrate that SVD-LS achieves competitive performance while offering significantly reduced computational costs, making it a viable alternative for real-time medical imaging applications. The implementation is available at: github.com/meterdogan07/SVD-LS.
comment: IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2025
♻ ☆ Simulation Based Bayesian Optimization
Bayesian Optimization (BO) is a powerful method for optimizing black-box functions by combining prior knowledge with ongoing function evaluations. BO constructs a probabilistic surrogate model of the objective function given the covariates, which is in turn used to inform the selection of future evaluation points through an acquisition function. For smooth continuous search spaces, Gaussian Processes (GPs) are commonly used as the surrogate model as they offer analytical access to posterior predictive distributions, thus facilitating the computation and optimization of acquisition functions. However, in complex scenarios involving optimization over categorical or mixed covariate spaces, GPs may not be ideal. This paper introduces Simulation Based Bayesian Optimization (SBBO) as a novel approach to optimizing acquisition functions that only requires sampling-based access to posterior predictive distributions. SBBO allows the use of surrogate probabilistic models tailored for combinatorial spaces with discrete variables. Any Bayesian model in which posterior inference is carried out through Markov chain Monte Carlo can be selected as the surrogate model in SBBO. We demonstrate empirically the effectiveness of SBBO using various choices of surrogate models in applications involving combinatorial optimization.
comment: Accepted in Statistics and Computing
♻ ☆ MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian
Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce MedQARo, the first large-scale medical QA benchmark in Romanian, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. MedQARo is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on MedQARo. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on MedQARo. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/MedQARo.
♻ ☆ Tabular and Deep Reinforcement Learning for Gittins Index
In the realm of multi-arm bandit problems, the Gittins index policy is known to be optimal in maximizing the expected total discounted reward obtained from pulling the Markovian arms. In most realistic scenarios however, the Markovian state transition probabilities are unknown and therefore the Gittins indices cannot be computed. One can then resort to reinforcement learning (RL) algorithms that explore the state space to learn these indices while exploiting to maximize the reward collected. In this work, we propose tabular (QGI) and Deep RL (DGN) algorithms for learning the Gittins index that are based on the retirement formulation for the multi-arm bandit problem. When compared with existing RL algorithms that learn the Gittins index, our algorithms have a lower run time, require less storage space (small Q-table size in QGI and smaller replay buffer in DGN), and illustrate better empirical convergence to the Gittins index. This makes our algorithm well suited for problems with large state spaces and is a viable alternative to existing methods. As a key application, we demonstrate the use of our algorithms in minimizing the mean flowtime in a job scheduling problem when jobs are available in batches and have an unknown service time distribution.
♻ ☆ From Models to Network Topologies: A Topology Inference Attack in Decentralized Federated Learning
Federated Learning (FL) is widely recognized as a privacy-preserving Machine Learning paradigm due to its model-sharing mechanism that avoids direct data exchange. Nevertheless, model training leaves exploitable traces that can be used to infer sensitive information. In Decentralized FL (DFL), the topology, defining how participants are connected, plays a crucial role in shaping the model's privacy, robustness, and convergence. However, the topology introduces an unexplored vulnerability: attackers can exploit it to infer participant relationships and launch targeted attacks. This work uncovers the hidden risks of DFL topologies by proposing a novel Topology Inference Attack that infers the topology solely from model behavior. A taxonomy of topology inference attacks is introduced, categorizing them by the attacker's capabilities and knowledge. Practical attack strategies are designed for various scenarios, and experiments are conducted to identify key factors influencing attack success. The results demonstrate that analyzing only the model of each node can accurately infer the DFL topology, highlighting a critical privacy risk in DFL systems. These findings offer insights for improving privacy preservation in DFL environments.
♻ ☆ Correlations Are Ruining Your Gradient Descent
Herein the topics of (natural) gradient descent, data decorrelation, and approximate methods for backpropagation are brought into a common discussion. Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes. We extend this perspective and show that to fully solve the problem illuminated by natural gradients in neural networks, one must recognise that correlations in the data at any linear transformation, including node responses at every layer of a neural network, cause a non-orthonormal relationship between the model's parameters. To solve this requires a method for decorrelating inputs at each individual layer of a neural network. We describe a range of methods which have been proposed for decorrelation and whitening of node output, and expand on these to provide a novel method specifically useful for distributed computing and computational neuroscience. Implementing decorrelation within multi-layer neural networks, we can show that not only is training via backpropagation sped up significantly but also existing approximations of backpropagation, which have failed catastrophically in the past, benefit significantly in their accuracy and convergence speed. This has the potential to provide a route forward for approximate gradient descent methods which have previously been discarded, training approaches for analogue and neuromorphic hardware, and potentially insights as to the efficacy and utility of decorrelation processes in the brain.
comment: 15 pages, 4 figures
♻ ☆ Beyond Imaging: Vision Transformer Digital Twin Surrogates for 3D+T Biological Tissue Dynamics
Understanding the dynamic organization and homeostasis of living tissues requires high-resolution, time-resolved imaging coupled with methods capable of extracting interpretable, predictive insights from complex datasets. Here, we present the Vision Transformer Digital Twin Surrogate Network (VT-DTSN), a deep learning framework for predictive modeling of 3D+T imaging data from biological tissue. By leveraging Vision Transformers pretrained with DINO (Self-Distillation with NO Labels) and employing a multi-view fusion strategy, VT-DTSN learns to reconstruct high-fidelity, time-resolved dynamics of a Drosophila midgut while preserving morphological and feature-level integrity across imaging depths. The model is trained with a composite loss prioritizing pixel-level accuracy, perceptual structure, and feature-space alignment, ensuring biologically meaningful outputs suitable for in silico experimentation and hypothesis testing. Evaluation across layers and biological replicates demonstrates VT-DTSN's robustness and consistency, achieving low error rates and high structural similarity while maintaining efficient inference through model optimization. This work establishes VT-DTSN as a feasible, high-fidelity surrogate for cross-timepoint reconstruction and for studying tissue dynamics, enabling computational exploration of cellular behaviors and homeostasis to complement time-resolved imaging studies in biological research.
comment: Submitted for journal publication
♻ ☆ Geometry-Aware Spiking Graph Neural Network
Graph Neural Networks (GNNs) have demonstrated impressive capabilities in modeling graph-structured data, while Spiking Neural Networks (SNNs) offer high energy efficiency through sparse, event-driven computation. However, existing spiking GNNs predominantly operate in Euclidean space and rely on fixed geometric assumptions, limiting their capacity to model complex graph structures such as hierarchies and cycles. To overcome these limitations, we propose \method{}, a novel Geometry-Aware Spiking Graph Neural Network that unifies spike-based neural dynamics with adaptive representation learning on Riemannian manifolds. \method{} features three key components: a Riemannian Embedding Layer that projects node features into a pool of constant-curvature manifolds, capturing non-Euclidean structures; a Manifold Spiking Layer that models membrane potential evolution and spiking behavior in curved spaces via geometry-consistent neighbor aggregation and curvature-based attention; and a Manifold Learning Objective that enables instance-wise geometry adaptation through jointly optimized classification and link prediction losses defined over geodesic distances. All modules are trained using Riemannian SGD, eliminating the need for backpropagation through time. Extensive experiments on multiple benchmarks show that GSG achieves superior accuracy, robustness, and energy efficiency compared to both Euclidean SNNs and manifold-based GNNs, establishing a new paradigm for curvature-aware, energy-efficient graph learning.
♻ ☆ Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees ICML 2025
Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision-making in complex, multi-agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation--causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two-stage L2D systems. We introduce two novel attack strategie--untargeted and targeted--which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent. These guarantees hold across classification, regression, and multi-task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.
comment: Accepted at the 42nd International Conference on Machine Learning (ICML 2025)
♻ ☆ Fingerprint Vector: Enabling Scalable and Efficient Model Fingerprint Transfer via Vector Addition
Backdoor-based fingerprinting has emerged as an effective technique for tracing the ownership of large language models. However, in real-world deployment scenarios, developers often instantiate multiple downstream models from a shared base model, and applying fingerprinting to each variant individually incurs prohibitive computational overhead. While inheritance-based approaches -- where fingerprints are embedded into the base model and expected to persist through fine-tuning -- appear attractive, they suffer from three key limitations: late-stage fingerprinting, fingerprint instability, and interference with downstream adaptation. To address these challenges, we propose a novel mechanism called the Fingerprint Vector. Our method first embeds a fingerprint into the base model via backdoor-based fine-tuning, then extracts a task-specific parameter delta as a fingerprint vector by computing the difference between the fingerprinted and clean models. This vector can be directly added to any structurally compatible downstream model, allowing the fingerprint to be transferred post hoc without additional fine-tuning. Extensive experiments show that Fingerprint Vector achieves comparable or superior performance to direct injection across key desiderata. It maintains strong effectiveness across diverse model architectures as well as mainstream downstream variants within the same family. It also preserves harmlessness and robustness in most cases. Even when slight robustness degradation is observed, the impact remains within acceptable bounds and is outweighed by the scalability benefits of our approach.
♻ ☆ CLaP -- State Detection from Time Series
The ever-growing amount of sensor data from machines, smart devices, and the environment leads to an abundance of high-resolution, unannotated time series (TS). These recordings encode recognizable properties of latent states and transitions from physical phenomena that can be modelled as abstract processes. The unsupervised localization and identification of these states and their transitions is the task of time series state detection (TSSD). Current TSSD algorithms employ classical unsupervised learning techniques, to infer state membership directly from feature space. This limits their predictive power, compared to supervised learning methods, which can exploit additional label information. We introduce CLaP, a new, highly accurate and efficient algorithm for TSSD. It leverages the predictive power of time series classification for TSSD in an unsupervised setting by applying novel self-supervision techniques to detect whether data segments emerge from the same state. To this end, CLaP cross-validates a classifier with segment-labelled subsequences to quantify confusion between segments. It merges labels from segments with high confusion, representing the same latent state, if this leads to an increase in overall classification quality. We conducted an experimental evaluation using 405 TS from five benchmarks and found CLaP to be significantly more precise in detecting states than six state-of-the-art competitors. It achieves the best accuracy-runtime tradeoff and is scalable to large TS. We provide a Python implementation of CLaP, which can be deployed in TS analysis workflows.
♻ ☆ Field Matching: an Electrostatic Paradigm to Generate and Transfer Data
We propose Electrostatic Field Matching (EFM), a novel method that is suitable for both generative modeling and distribution transfer tasks. Our approach is inspired by the physics of an electrical capacitor. We place source and target distributions on the capacitor plates and assign them positive and negative charges, respectively. Then we learn the electrostatic field of the capacitor using a neural network approximator. To map the distributions to each other, we start at one plate of the capacitor and move the samples along the learned electrostatic field lines until they reach the other plate. We theoretically justify that this approach provably yields the distribution transfer. In practice, we demonstrate the performance of our EFM in toy and image data experiments. Our code is available at https://github.com/justkolesov/FieldMatching
comment: Proceedings of the 42nd International Conference on Machine. Learning, Vancouver, Canada. PMLR 267, 2025
Artificial Intelligence-Based Multiscale Temporal Modeling for Anomaly Detection in Cloud Services
This study proposes an anomaly detection method based on the Transformer architecture with integrated multiscale feature perception, aiming to address the limitations of temporal modeling and scale-aware feature representation in cloud service environments. The method first employs an improved Transformer module to perform temporal modeling on high-dimensional monitoring data, using a self-attention mechanism to capture long-range dependencies and contextual semantics. Then, a multiscale feature construction path is introduced to extract temporal features at different granularities through downsampling and parallel encoding. An attention-weighted fusion module is designed to dynamically adjust the contribution of each scale to the final decision, enhancing the model's robustness in anomaly pattern modeling. In the input modeling stage, standardized multidimensional time series are constructed, covering core signals such as CPU utilization, memory usage, and task scheduling states, while positional encoding is used to strengthen the model's temporal awareness. A systematic experimental setup is designed to evaluate performance, including comparative experiments and hyperparameter sensitivity analysis, focusing on the impact of optimizers, learning rates, anomaly ratios, and noise levels. Experimental results show that the proposed method outperforms mainstream baseline models in key metrics, including precision, recall, AUC, and F1-score, and maintains strong stability and detection performance under various perturbation conditions, demonstrating its superior capability in complex cloud environments.
♻ ☆ Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding, thereby providing a suitable tool for designing semantic communication (SC) systems. However, their direct deployment is often hindered by high computational complexity and resource requirements. In this paper, a novel robust knowledge distillation based semantic communication (RKD-SC) framework is proposed to enable efficient and \textcolor{black}{channel-noise-robust} LSM-powered SC. The framework addresses two key challenges: determining optimal compact model architectures and effectively transferring knowledge while maintaining robustness against channel noise. First, a knowledge distillation-based lightweight differentiable architecture search (KDL-DARTS) algorithm is proposed. This algorithm integrates knowledge distillation loss and a complexity penalty into the neural architecture search process to identify high-performance, lightweight semantic encoder architectures. Second, a novel two-stage robust knowledge distillation (RKD) algorithm is developed to transfer semantic capabilities from an LSM (teacher) to a compact encoder (student) and subsequently enhance system robustness. To further improve resilience to channel impairments, a channel-aware transformer (CAT) block is introduced as the channel codec, trained under diverse channel conditions with variable-length outputs. Extensive simulations on image classification tasks demonstrate that the RKD-SC framework significantly reduces model parameters while preserving a high degree of the teacher model's performance and exhibiting superior robustness compared to existing methods.
comment: 13 pages, 8 figures, 3 tables
♻ ☆ Scaling Capability in Token Space: An Analysis of Large Vision Language Model
Large language models have demonstrated predictable scaling behaviors with respect to model parameters and training data. This study investigates whether a similar scaling relationship exist for vision-language models with respect to the number of vision tokens. A mathematical framework is developed to characterize a relationship between vision token number and the expected divergence of distance between vision-referencing sequences. The theoretical analysis reveals two distinct scaling regimes: sublinear scaling for less vision tokens and linear scaling for more vision tokens. This aligns with model performance relationships of the form \(S(n) \approx c / n^{\alpha(n)}\), where the scaling exponent relates to the correlation structure between vision token representations. Empirical validations across multiple vision-language benchmarks show that model performance matches the prediction from scaling relationship. The findings contribute to understanding vision token scaling in transformers through a theoretical framework that complements empirical observations.
♻ ☆ Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification
Large-scale in-the-wild speech datasets have become more prevalent in recent years due to increased interest in models that can learn useful features from unlabelled data for tasks such as speech recognition or synthesis. These datasets often contain undesirable features, such as multiple speakers, non-target languages, and music, which may impact model learning. The Whilter model is proposed as a multitask solution to identify these undesirable samples. Whilter uses a Whisper encoder with an attention-based classifier to solve five diverse classification problems at once. In addition, an annotated dataset is published for a subset of two popular in-the-wild corpora. Whilter achieves F1 scores above 85% and equal error rates of 6.5% to 7.8% for three of five subtasks, outperforming a state-of-the-art BEATs classifier on speech-specific classes, with a notable decrease in processing time compared to a combination of single-task alternatives.
comment: Accepted for Interspeech 2025. Updated Zenodo link for AITW v1.1
♻ ☆ DSADF: Thinking Fast and Slow for Decision Making
Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions. Recent work has explored applying Large Language Models (LLMs) or Vision Language Models (VLMs) to boost the generalization of RL agents through policy optimization guidance or prior knowledge. However, these approaches often lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decision-making in unfamiliar environments and efficiency bottlenecks. Making full use of the inferential capabilities of foundation models and the rapid response capabilities of RL agents and enhancing the interaction between the two to form a dual system is still a lingering scientific question. To address this problem, we draw inspiration from Kahneman's theory of fast thinking (System 1) and slow thinking (System 2), demonstrating that balancing intuition and deep reasoning can achieve nimble decision-making in a complex world. In this study, we propose a Dual-System Adaptive Decision Framework (DSADF), integrating two complementary modules: System 1, comprising an RL agent and a memory space for fast and intuitive decision making, and System 2, driven by a VLM for deep and analytical reasoning. DSADF facilitates efficient and adaptive decision-making by combining the strengths of both systems. The empirical study in the video game environment: Crafter and Housekeep demonstrates the effectiveness of our proposed method, showing significant improvements in decision abilities for both unseen and known tasks.
♻ ☆ A Mixture of Exemplars Approach for Efficient Out-of-Distribution Detection with Foundation Models
One of the early weaknesses identified in deep neural networks trained for image classification tasks was their inability to provide low confidence predictions on out-of-distribution (OOD) data that was significantly different from the in-distribution (ID) data used to train them. Representation learning, where neural networks are trained in specific ways that improve their ability to detect OOD examples, has emerged as a promising solution. However, these approaches require long training times and can add additional overhead to detect OOD examples. Recent developments in Vision Transformer (ViT) foundation models$\unicode{x2013}$large networks trained on large and diverse datasets with self-supervised approaches$\unicode{x2013}$also show strong performance in OOD detection, and could address these challenges. This paper presents Mixture of Exemplars (MoLAR), an efficient approach to tackling OOD detection challenges that is designed to maximise the benefit of training a classifier with a high quality, frozen, pretrained foundation model backbone. MoLAR provides strong OOD detection performance when only comparing the similarity of OOD examples to the exemplars, a small set of images chosen to be representative of the dataset, leading to significantly reduced overhead for OOD detection inference over other methods that provide best performance when the full ID dataset is used. Extensive experiments demonstrate the improved OOD detection performance of MoLAR in comparison to comparable approaches in both supervised and semi-supervised settings, and code is available at github.com/emannix/molar-mixture-of-exemplars.
♻ ☆ Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization
DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. However, DPO is prone to overfitting and collapse. To address these challenges, we propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Through extensive experiments, we demonstrate that LPO consistently improves performance on various tasks, including general text tasks, math tasks, and text-to-speech (TTS) tasks. These results establish LPO as a robust and tunable paradigm for preference alignment, and we release the source code, models, and training data publicly.
♻ ☆ Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey EMNLP 2025
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.
comment: The first comprehensive survey on controllable TTS. Accepted to the EMNLP 2025 main conference
♻ ☆ Enhancing material behavior discovery using embedding-oriented Physically-Guided Neural Networks with Internal Variables
Physically Guided Neural Networks with Internal Variables are SciML tools that use only observable data for training and and have the capacity to unravel internal state relations. They incorporate physical knowledge both by prescribing the model architecture and using loss regularization, thus endowing certain specific neurons with a physical meaning as internal state variables. Despite their potential, these models face challenges in scalability when applied to high-dimensional data such as fine-grid spatial fields or time-evolving systems. In this work, we propose some enhancements to the PGNNIV framework that address these scalability limitations through reduced-order modeling techniques. Specifically, we introduce alternatives to the original decoder structure using spectral decomposition, POD, and pretrained autoencoder-based mappings. These surrogate decoders offer varying trade-offs between computational efficiency, accuracy, noise tolerance, and generalization, while improving drastically the scalability. Additionally, we integrate model reuse via transfer learning and fine-tuning strategies to exploit previously acquired knowledge, supporting efficient adaptation to novel materials or configurations, and significantly reducing training time while maintaining or improving model performance. To illustrate these various techniques, we use a representative case governed by the nonlinear diffusion equation, using only observable data. Results demonstrate that the enhanced PGNNIV framework successfully identifies the underlying constitutive state equations while maintaining high predictive accuracy. It also improves robustness to noise, mitigates overfitting, and reduces computational demands. The proposed techniques can be tailored to various scenarios depending on data availability, resources, and specific modeling objectives, overcoming scalability challenges in all the scenarios.
♻ ☆ Where's the liability in the Generative Era? Recovery-based Black-Box Detection of AI-Generated Content CVPR 2025
The recent proliferation of photorealistic images created by generative models has sparked both excitement and concern, as these images are increasingly indistinguishable from real ones to the human eye. While offering new creative and commercial possibilities, the potential for misuse, such as in misinformation and fraud, highlights the need for effective detection methods. Current detection approaches often rely on access to model weights or require extensive collections of real image datasets, limiting their scalability and practical application in real world scenarios. In this work, we introduce a novel black box detection framework that requires only API access, sidestepping the need for model weights or large auxiliary datasets. Our approach leverages a corrupt and recover strategy: by masking part of an image and assessing the model ability to reconstruct it, we measure the likelihood that the image was generated by the model itself. For black-box models that do not support masked image inputs, we incorporate a cost efficient surrogate model trained to align with the target model distribution, enhancing detection capability. Our framework demonstrates strong performance, outperforming baseline methods by 4.31% in mean average precision across eight diffusion model variant datasets.
comment: CVPR 2025
♻ ☆ WaveStitch: Flexible and Fast Conditional Time Series Generation with Diffusion Models SIGMOD 2026
Generating temporal data under conditions is crucial for forecasting, imputation, and generative tasks. Such data often has metadata and partially observed signals that jointly influence the generated values. However, existing methods face three key limitations: (1) they condition on either the metadata or observed values, but rarely both together; (2) they adopt either training-time approaches that fail to generalize to unseen scenarios, or inference-time approaches that ignore metadata; and (3) they suffer from trade-offs between generation speed and temporal coherence across time windows--choosing either slow but coherent autoregressive methods or fast but incoherent parallel ones. We propose WaveStitch, a novel diffusion-based method to overcome these hurdles through: (1) dual-sourced conditioning on both metadata and partially observed signals; (2) a hybrid training-inference architecture, incorporating metadata during training and observations at inference via gradient-based guidance; and (3) a novel pipeline-style paradigm that generates time windows in parallel while preserving coherence through an inference-time conditional loss and a stitching mechanism. Across diverse datasets, WaveStitch demonstrates adaptability to arbitrary patterns of observed signals, achieving 1.81x lower mean-squared-error compared to the state-of-the-art, and generates data up to 166.48x faster than autoregressive methods while maintaining coherence. Our code is available at: https://github.com/adis98/WaveStitch
comment: Accepted at ACM SIGMOD 2026
♻ ☆ USPR: Learning a Unified Solver for Profiled Routing
The Profiled Vehicle Routing Problem (PVRP) extends the classical VRP by incorporating vehicle-client-specific preferences and constraints, reflecting real-world requirements such as zone restrictions and service-level preferences. While recent reinforcement-learning solvers have shown promising performance, they require retraining for each new profile distribution, suffer from poor representation ability, and struggle to generalize to out-of-distribution instances. In this paper, we address these limitations by introducing Unified Solver for Profiled Routing (USPR), a novel framework that natively handles arbitrary profile types. USPR introduces on three key innovations: (i) Profile Embeddings (PE) to encode any combination of profile types; (ii) Multi-Head Profiled Attention (MHPA), an attention mechanism that models rich interactions between vehicles and clients; (iii) Profile-aware Score Reshaping (PSR), which dynamically adjusts decoder logits using profile scores to improve generalization. Empirical results on diverse PVRP benchmarks demonstrate that USPR achieves state-of-the-art results among learning-based methods while offering significant gains in flexibility and computational efficiency. We make our source code publicly available to foster future research.
♻ ☆ Conditional Stochastic Interpolation for Generative Learning
We propose a conditional stochastic interpolation (CSI) method for learning conditional distributions. CSI is based on estimating probability flow equations or stochastic differential equations that transport a reference distribution to the target conditional distribution. This is achieved by first learning the conditional drift and score functions based on CSI, which are then used to construct a deterministic process governed by an ordinary differential equation or a diffusion process for conditional sampling. In our proposed approach, we incorporate an adaptive diffusion term to address the instability issues arising in the diffusion process. We derive explicit expressions of the conditional drift and score functions in terms of conditional expectations, which naturally lead to an nonparametric regression approach to estimating these functions. Furthermore, we establish nonasymptotic error bounds for learning the target conditional distribution. We illustrate the application of CSI on image generation using a benchmark image dataset.
comment: 60 pages, 4 figures
♻ ☆ An Inquiry into Datacenter TCO for LLM Inference with FP8
As large language models (LLMs) continue to scale, the high power consumption of AI accelerators in datacenters presents significant challenges, substantially increasing the total cost of ownership (TCO) for cloud service providers (CSPs) that provide LLM inference. In this work, we analyze the computational characteristics of LLM inference from a TCO perspective and present a generalizable framework to compare AI accelerators across diverse operational requirements. Using this model, we investigate key workload characteristics influencing TCO for AI accelerators from Intel (Gaudi 2 & 3) and NVIDIA (H100 & H200), especially thin GEMM utilization and FP8 quantization. In particular, as FP8 emerges as the baseline precision for next-generation LLMs, understanding how different architectures implement and benefit from low-precision computation is increasingly critical. Throughput on thin GEMMs has a greater impact on TCO than theoretical hardware peak throughput because the memory-bound decode phase is dominated by GEMV-like computations. We find that Gaudi HPUs achieve superior utilization on thin GEMMs compared to their counterparts, especially in FP8-quantized models. Our result underscores the importance of empirical, workload-level analysis in evaluating accelerator performance, rather than relying solely on theoretical hardware specifications. By studying the interaction between power consumption, quantization strategies, and hardware architecture, we provide insights to support informed deployment decisions and guide future accelerator designs aimed at improving the TCO of LLM inference workloads.
♻ ☆ EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding ICCV2025
3D occupancy prediction provides a comprehensive description of the surrounding scenes and has become an essential task for 3D perception. Most existing methods focus on offline perception from one or a few views and cannot be applied to embodied agents that demand to gradually perceive the scene through progressive embodied exploration. In this paper, we formulate an embodied 3D occupancy prediction task to target this practical scenario and propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global scene with uniform 3D semantic Gaussians and progressively update local regions observed by the embodied agent. For each update, we extract semantic and structural features from the observed image and efficiently incorporate them via deformable cross-attention to refine the regional Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly distributed) environment and maintains an explicit global memory of it with 3D Gaussians. It gradually gains knowledge through the local refinement of regional Gaussians, which is consistent with how humans understand new scenes through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the evaluation of the embodied 3D occupancy prediction task. Our EmbodiedOcc outperforms existing methods by a large margin and accomplishes the embodied occupancy prediction with high accuracy and efficiency. Code: https://github.com/YkiWu/EmbodiedOcc.
comment: Accepted by ICCV2025. Code: https://github.com/YkiWu/EmbodiedOcc
♻ ☆ Compute-Optimal Scaling for Value-Based Deep RL
As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.
Multimedia 9
☆ TuningIQA: Fine-Grained Blind Image Quality Assessment for Livestreaming Camera Tuning
Livestreaming has become increasingly prevalent in modern visual communication, where automatic camera quality tuning is essential for delivering superior user Quality of Experience (QoE). Such tuning requires accurate blind image quality assessment (BIQA) to guide parameter optimization decisions. Unfortunately, the existing BIQA models typically only predict an overall coarse-grained quality score, which cannot provide fine-grained perceptual guidance for precise camera parameter tuning. To bridge this gap, we first establish FGLive-10K, a comprehensive fine-grained BIQA database containing 10,185 high-resolution images captured under varying camera parameter configurations across diverse livestreaming scenarios. The dataset features 50,925 multi-attribute quality annotations and 19,234 fine-grained pairwise preference annotations. Based on FGLive-10K, we further develop TuningIQA, a fine-grained BIQA metric for livestreaming camera tuning, which integrates human-aware feature extraction and graph-based camera parameter fusion. Extensive experiments and comparisons demonstrate that TuningIQA significantly outperforms state-of-the-art BIQA methods in both score regression and fine-grained quality ranking, achieving superior performance when deployed for livestreaming camera tuning.
comment: 9 pages,8 figures
Prompt-based Multimodal Semantic Communication for Multi-spectral Image Segmentation
Multimodal semantic communication has gained widespread attention due to its ability to enhance downstream task performance. A key challenge in such systems is the effective fusion of features from different modalities, which requires the extraction of rich and diverse semantic representations from each modality. To this end, we propose ProMSC-MIS, a Prompt-based Multimodal Semantic Communication system for Multi-spectral Image Segmentation. Specifically, we propose a pre-training algorithm where features from one modality serve as prompts for another, guiding unimodal semantic encoders to learn diverse and complementary semantic representations. We further introduce a semantic fusion module that combines cross-attention mechanisms and squeeze-and-excitation (SE) networks to effectively fuse cross-modal features. Simulation results show that ProMSC-MIS significantly outperforms benchmark methods across various channel-source compression levels, while maintaining low computational complexity and storage overhead. Our scheme has great potential for applications such as autonomous driving and nighttime surveillance.
♻ ☆ Machine Learning-Based Prediction of Quality Shifts on Video Streaming Over 5G
The Quality of Experience (QoE) is the users satisfaction while streaming a video session over an over-the-top (OTT) platform like YouTube. QoE of YouTube reflects the smooth streaming session without any buffering and quality shift events. One of the most important factors nowadays affecting QoE of YouTube is frequent shifts from higher to lower resolutions and vice versa. These shifts ensure a smooth streaming session; however, it might get a lower mean opinion score. For instance, dropping from 1080p to 480p during a video can preserve continuity but might reduce the viewers enjoyment. Over time, OTT platforms are looking for alternative ways to boost user experience instead of relying on traditional Quality of Service (QoS) metrics such as bandwidth, latency, and throughput. As a result, we look into the relationship between quality shifting in YouTube streaming sessions and the channel metrics RSRP, RSRQ, and SNR. Our findings state that these channel metrics positively correlate with shifts. Thus, in real-time, OTT can only rely on them to predict video streaming sessions into lower- and higher-resolution categories, thus providing more resources to improve user experience. Using traditional Machine Learning (ML) classifiers, we achieved an accuracy of 77-percent, while using only RSRP, RSRQ, and SNR. In the era of 5G and beyond, where ultra-reliable, low-latency networks promise enhanced streaming capabilities, the proposed methodology can be used to improve OTT services.
comment: We are trying to improve the paper with better results. There are some inconsistencies with the current version
♻ ☆ One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning
Large-scale models have exhibited remarkable capabilities across diverse domains, including automated medical services and intelligent customer support. However, as most large models are trained on single-modality corpora, enabling them to effectively process and understand multimodal signals remains a significant challenge. Current research often focuses on designing task-specific or scenario-specific tuning strategies, which limits the scalability and versatility. To address this limitation, we propose a unified framework that concurrently handles multiple tasks and modalities. In this framework, all modalities and tasks are represented as unified tokens and trained using a single, consistent approach. To enable efficient multitask processing, we introduce a novel tuning strategy termed neural tuning, inspired by the concept of sparse distributed representation in the human brain, where only specific subsets of neurons are activated for each task. Furthermore, to advance research in multimodal and multitask learning, we present a new benchmark, MMUD, which includes samples annotated with multiple task labels spanning reasoning segmentation, referring segmentation, image captioning, and text-to-image generation. By applying neural tuning to pretrained large models on the MMUD benchmark, we demonstrate the ability to handle multiple tasks simultaneously in a streamlined and efficient manner. All models, code, and datasets will be released publicly upon publication, fostering further research and innovation in this field.
♻ ☆ FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing
Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning while achieving better acoustic quality via the proposed voice-enhanced flow matching than previous works. First, we introduce Qwen2.5 as the backbone of LLM to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two aspects, which introduces an LLM-based acoustics flow matching guidance to strengthen clarity and uses affine style prior to enhance identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks.
♻ ☆ Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey EMNLP 2025
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.
comment: The first comprehensive survey on controllable TTS. Accepted to the EMNLP 2025 main conference
♻ ☆ PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
♻ ☆ Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. We further validate our findings on a diverse 100-sample mini-benchmark, incorporating multiple datasets, expanded prompt variants, and representative commercial LVLMs. Our findings reveal notable discrepancies -- across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis
♻ ☆ adder-viz: Real-Time Visualization Software for Transcoding Event Video
Recent years have brought about a surge in neuromorphic ``event'' video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADDER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at https://github.com/ac-freeman/adder-codec-rs.
comment: Accepted to the Open-Source Track at ACM Multimedia 2025
Computation and Language 38
☆ Activation Transport Operators
The residual stream mediates communication between transformer decoder layers via linear reads and writes of non-linear computations. While sparse-dictionary learning-based methods locate features in the residual stream, and activation patching methods discover circuits within the model, the mechanism by which features flow through the residual stream remains understudied. Understanding this dynamic can better inform jailbreaking protections, enable early detection of model mistakes, and their correction. In this work, we propose Activation Transport Operators (ATO), linear maps from upstream to downstream residuals $k$ layers later, evaluated in feature space using downstream SAE decoder projections. We empirically demonstrate that these operators can determine whether a feature has been linearly transported from a previous layer or synthesised from non-linear layer computation. We develop the notion of transport efficiency, for which we provide an upper bound, and use it to estimate the size of the residual stream subspace that corresponds to linear transport. We empirically demonstrate the linear transport, report transport efficiency and the size of the residual stream's subspace involved in linear transport. This compute-light (no finetuning, <50 GPU-h) method offers practical tools for safety, debugging, and a clearer picture of where computation in LLMs behaves linearly.
comment: 4 pages, 4 figures, references and appendices
☆ Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?
Multi-Agent Debate~(MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD's effectiveness remain unclear. In this work, we disentangle MAD into two key components--Majority Voting and inter-agent Debate--and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released in https://github.com/deeplearning-wisc/debate-or-vote.
☆ Improving French Synthetic Speech Quality via SSML Prosody Control SP 2025
Despite recent advances, synthetic voices often lack expressiveness due to limited prosody control in commercial text-to-speech (TTS) systems. We introduce the first end-to-end pipeline that inserts Speech Synthesis Markup Language (SSML) tags into French text to control pitch, speaking rate, volume, and pause duration. We employ a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other performs regression on prosodic targets, generating commercial TTS-compatible SSML markup. Evaluated on a 14-hour French podcast corpus, our method achieves 99.2% F1 for break placement and reduces mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only large language models (LLMs) and a BiLSTM baseline. In perceptual evaluation involving 18 participants across over 9 hours of synthesized audio, SSML-enhanced speech generated by our pipeline significantly improves naturalness, with the mean opinion score increasing from 3.20 to 3.87 (p < 0.005). Additionally, 15 of 18 listeners preferred our enhanced synthesis. These results demonstrate substantial progress in bridging the expressiveness gap between synthetic and natural French speech. Our code is publicly available at https://github.com/hi-paris/Prosody-Control-French-TTS.
comment: 13 pages, 9 figures, 6 tables. Accepted for presentation at ICNLSP 2025 (Odense, Denmark). Code and demo: https://github.com/hi-paris/Prosody-Control-French-TTS. ACM Class: I.2.7; H.5.5
☆ Efficient Zero-Shot Long Document Classification by Reducing Context Through Sentence Ranking
Transformer-based models like BERT excel at short text classification but struggle with long document classification (LDC) due to input length limitations and computational inefficiencies. In this work, we propose an efficient, zero-shot approach to LDC that leverages sentence ranking to reduce input context without altering the model architecture. Our method enables the adaptation of models trained on short texts, such as headlines, to long-form documents by selecting the most informative sentences using a TF-IDF-based ranking strategy. Using the MahaNews dataset of long Marathi news articles, we evaluate three context reduction strategies that prioritize essential content while preserving classification accuracy. Our results show that retaining only the top 50\% ranked sentences maintains performance comparable to full-document inference while reducing inference time by up to 35\%. This demonstrates that sentence ranking is a simple yet effective technique for scalable and efficient zero-shot LDC.
☆ Evaluating the Impact of Verbal Multiword Expressions on Machine Translation
Verbal multiword expressions (VMWEs) present significant challenges for natural language processing due to their complex and often non-compositional nature. While machine translation models have seen significant improvement with the advent of language models in recent years, accurately translating these complex linguistic structures remains an open problem. In this study, we analyze the impact of three VMWE categories -- verbal idioms, verb-particle constructions, and light verb constructions -- on machine translation quality from English to multiple languages. Using both established multiword expression datasets and sentences containing these language phenomena extracted from machine translation datasets, we evaluate how state-of-the-art translation systems handle these expressions. Our experimental results consistently show that VMWEs negatively affect translation quality. We also propose an LLM-based paraphrasing approach that replaces these expressions with their literal counterparts, demonstrating significant improvement in translation quality for verbal idioms and verb-particle constructions.
comment: 29 pages, 13 figures
☆ Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD EMNLP 2025
Large Language Models (LLMs) can struggle to balance gullibility to misinformation and resistance to valid corrections in persuasive dialogues, a critical challenge for reliable deployment. We introduce DuET-PD (Dual Evaluation for Trust in Persuasive Dialogues), a framework evaluating multi-turn stance-change dynamics across dual dimensions: persuasion type (corrective/misleading) and domain (knowledge via MMLU-Pro, and safety via SALAD-Bench). We find that even a state-of-the-art model like GPT-4o achieves only 27.32% accuracy in MMLU-Pro under sustained misleading persuasions. Moreover, results reveal a concerning trend of increasing sycophancy in newer open-source models. To address this, we introduce Holistic DPO, a training approach balancing positive and negative persuasion examples. Unlike prompting or resist-only training, Holistic DPO enhances both robustness to misinformation and receptiveness to corrections, improving Llama-3.1-8B-Instruct's accuracy under misleading persuasion in safety contexts from 4.21% to 76.54%. These contributions offer a pathway to developing more reliable and adaptable LLMs for multi-turn dialogue. Code is available at https://github.com/Social-AI-Studio/DuET-PD.
comment: To appear at EMNLP 2025
☆ TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, involving a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO essentially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization. and (3) analysis on the effectiveness of probability and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of TreePO on a set reasoning benchmarks and the efficiency saving of GPU hours from 22\% up to 43\% of the sampling design for the trained models, meanwhile showing up to 40\% reduction at trajectory-level and 35\% at token-level sampling compute for the existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. Home page locates at https://m-a-p.ai/TreePO.
☆ MahaParaphrase: A Marathi Paraphrase Detection Corpus and BERT-based Models
Paraphrases are a vital tool to assist language understanding tasks such as question answering, style transfer, semantic parsing, and data augmentation tasks. Indic languages are complex in natural language processing (NLP) due to their rich morphological and syntactic variations, diverse scripts, and limited availability of annotated data. In this work, we present the L3Cube-MahaParaphrase Dataset, a high-quality paraphrase corpus for Marathi, a low resource Indic language, consisting of 8,000 sentence pairs, each annotated by human experts as either Paraphrase (P) or Non-paraphrase (NP). We also present the results of standard transformer-based BERT models on these datasets. The dataset and model are publicly shared at https://github.com/l3cube-pune/MarathiNLP
☆ DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed Framework for Claim Normalization
Claim normalization is an integral part of any automatic fact-check verification system. It parses the typically noisy claim data, such as social media posts into normalized claims, which are then fed into downstream veracity classification tasks. The CheckThat! 2025 Task 2 focuses specifically on claim normalization and spans 20 languages under monolingual and zero-shot conditions. Our proposed solution consists of a lightweight \emph{retrieval-first, LLM-backed} pipeline, in which we either dynamically prompt a GPT-4o-mini with in-context examples, or retrieve the closest normalization from the train dataset directly. On the official test set, the system ranks near the top for most monolingual tracks, achieving first place in 7 out of of the 13 languages. In contrast, the system underperforms in the zero-shot setting, highlighting the limitation of the proposed solution.
comment: CLEF 2025 Working Notes, Madrid, Spain
☆ DashboardQA: Benchmarking Multimodal Agents for Question Answering on Interactive Dashboards
Dashboards are powerful visualization tools for data-driven decision-making, integrating multiple interactive views that allow users to explore, filter, and navigate data. Unlike static charts, dashboards support rich interactivity, which is essential for uncovering insights in real-world analytical workflows. However, existing question-answering benchmarks for data visualizations largely overlook this interactivity, focusing instead on static charts. This limitation severely constrains their ability to evaluate the capabilities of modern multimodal agents designed for GUI-based reasoning. To address this gap, we introduce DashboardQA, the first benchmark explicitly designed to assess how vision-language GUI agents comprehend and interact with real-world dashboards. The benchmark includes 112 interactive dashboards from Tableau Public and 405 question-answer pairs with interactive dashboards spanning five categories: multiple-choice, factoid, hypothetical, multi-dashboard, and conversational. By assessing a variety of leading closed- and open-source GUI agents, our analysis reveals their key limitations, particularly in grounding dashboard elements, planning interaction trajectories, and performing reasoning. Our findings indicate that interactive dashboard reasoning is a challenging task overall for all the VLMs evaluated. Even the top-performing agents struggle; for instance, the best agent based on Gemini-Pro-2.5 achieves only 38.69% accuracy, while the OpenAI CUA agent reaches just 22.69%, demonstrating the benchmark's significant difficulty. We release DashboardQA at https://github.com/vis-nlp/DashboardQA
Agent-Testing Agent: A Meta-Agent for Automated Testing and Evaluation of Conversational AI Agents
LLM agents are increasingly deployed to plan, retrieve, and write with tools, yet evaluation still leans on static benchmarks and small human studies. We present the Agent-Testing Agent (ATA), a meta-agent that combines static code analysis, designer interrogation, literature mining, and persona-driven adversarial test generation whose difficulty adapts via judge feedback. Each dialogue is scored with an LLM-as-a-Judge (LAAJ) rubric and used to steer subsequent tests toward the agent's weakest capabilities. On a travel planner and a Wikipedia writer, the ATA surfaces more diverse and severe failures than expert annotators while matching severity, and finishes in 20--30 minutes versus ten-annotator rounds that took days. Ablating code analysis and web search increases variance and miscalibration, underscoring the value of evidence-grounded test generation. The ATA outputs quantitative metrics and qualitative bug reports for developers. We release the full methodology and open-source implementation for reproducible agent testing: https://github.com/KhalilMrini/Agent-Testing-Agent
☆ Large Language Models as Universal Predictors? An Empirical Study on Small Tabular Datasets
Large Language Models (LLMs), originally developed for natural language processing (NLP), have demonstrated the potential to generalize across modalities and domains. With their in-context learning (ICL) capabilities, LLMs can perform predictive tasks over structured inputs without explicit fine-tuning on downstream tasks. In this work, we investigate the empirical function approximation capability of LLMs on small-scale structured datasets for classification, regression and clustering tasks. We evaluate the performance of state-of-the-art LLMs (GPT-5, GPT-4o, GPT-o3, Gemini-2.5-Flash, DeepSeek-R1) under few-shot prompting and compare them against established machine learning (ML) baselines, including linear models, ensemble methods and tabular foundation models (TFMs). Our results show that LLMs achieve strong performance in classification tasks under limited data availability, establishing practical zero-training baselines. In contrast, the performance in regression with continuous-valued outputs is poor compared to ML models, likely because regression demands outputs in a large (often infinite) space, and clustering results are similarly limited, which we attribute to the absence of genuine ICL in this setting. Nonetheless, this approach enables rapid, low-overhead data exploration and offers a viable alternative to traditional ML pipelines in business intelligence and exploratory analytics contexts. We further analyze the influence of context size and prompt structure on approximation quality, identifying trade-offs that affect predictive performance. Our findings suggest that LLMs can serve as general-purpose predictive engines for structured data, with clear strengths in classification and significant limitations in regression and clustering.
☆ UI-Level Evaluation of ALLaM 34B: Measuring an Arabic-Centric LLM via HUMAIN Chat
Large language models (LLMs) trained primarily on English corpora often struggle to capture the linguistic and cultural nuances of Arabic. To address this gap, the Saudi Data and AI Authority (SDAIA) introduced the $ALLaM$ family of Arabic-focused models. The most capable of these available to the public, $ALLaM-34B$, was subsequently adopted by HUMAIN, who developed and deployed HUMAIN Chat, a closed conversational web service built on this model. This paper presents an expanded and refined UI-level evaluation of $ALLaM-34B$. Using a prompt pack spanning modern standard Arabic, five regional dialects, code-switching, factual knowledge, arithmetic and temporal reasoning, creative generation, and adversarial safety, we collected 115 outputs (23 prompts times 5 runs) and scored each with three frontier LLM judges (GPT-5, Gemini 2.5 Pro, Claude Sonnet-4). We compute category-level means with 95\% confidence intervals, analyze score distributions, and visualize dialect-wise metric heat maps. The updated analysis reveals consistently high performance on generation and code-switching tasks (both averaging 4.92/5), alongside strong results in MSA handling (4.74/5), solid reasoning ability (4.64/5), and improved dialect fidelity (4.21/5). Safety-related prompts show stable, reliable performance of (4.54/5). Taken together, these results position $ALLaM-34B$ as a robust and culturally grounded Arabic LLM, demonstrating both technical strength and practical readiness for real-world deployment.
☆ The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness EMNLP 2025
Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness.
comment: Accepted to EMNLP 2025 Main Conference
☆ Capturing Legal Reasoning Paths from Facts to Law in Court Judgments using Knowledge Graphs
Court judgments reveal how legal rules have been interpreted and applied to facts, providing a foundation for understanding structured legal reasoning. However, existing automated approaches for capturing legal reasoning, including large language models, often fail to identify the relevant legal context, do not accurately trace how facts relate to legal norms, and may misrepresent the layered structure of judicial reasoning. These limitations hinder the ability to capture how courts apply the law to facts in practice. In this paper, we address these challenges by constructing a legal knowledge graph from 648 Japanese administrative court decisions. Our method extracts components of legal reasoning using prompt-based large language models, normalizes references to legal provisions, and links facts, norms, and legal applications through an ontology of legal inference. The resulting graph captures the full structure of legal reasoning as it appears in real court decisions, making implicit reasoning explicit and machine-readable. We evaluate our system using expert annotated data, and find that it achieves more accurate retrieval of relevant legal provisions from facts than large language model baselines and retrieval-augmented methods.
☆ DropLoRA: Sparse Low-Rank Adaptation for Parameter-Efficient Fine-Tuning
LoRA-based large model parameter-efficient fine-tuning (PEFT) methods use low-rank de- composition to approximate updates to model parameters. However, compared to full- parameter fine-tuning, low-rank updates often lead to a performance gap in downstream tasks. To address this, we introduce DropLoRA, a novel pruning-based approach that focuses on pruning the rank dimension. Unlike conven- tional methods that attempt to overcome the low-rank bottleneck, DropLoRA innovatively integrates a pruning module between the two low-rank matrices in LoRA to simulate dy- namic subspace learning. This dynamic low- rank subspace learning allows DropLoRA to overcome the limitations of traditional LoRA, which operates within a static subspace. By continuously adapting the learning subspace, DropLoRA significantly boosts performance without incurring additional training or infer- ence costs. Our experimental results demon- strate that DropLoRA consistently outperforms LoRA in fine-tuning the LLaMA series across a wide range of large language model gener- ation tasks, including commonsense reason- ing, mathematical reasoning, code generation, and instruction-following. Our code is avail- able at https://github.com/TayeeChang/DropLoRA.
comment: 8 pages
☆ Mind the (Language) Gap: Towards Probing Numerical and Cross-Lingual Limits of LVLMs
We introduce MMCRICBENCH-3K, a benchmark for Visual Question Answering (VQA) on cricket scorecards, designed to evaluate large vision-language models (LVLMs) on complex numerical and cross-lingual reasoning over semi-structured tabular images. MMCRICBENCH-3K comprises 1,463 synthetically generated scorecard images from ODI, T20, and Test formats, accompanied by 1,500 English QA pairs. It includes two subsets: MMCRICBENCH-E-1.5K, featuring English scorecards, and MMCRICBENCH-H-1.5K, containing visually similar Hindi scorecards, with all questions and answers kept in English to enable controlled cross-script evaluation. The task demands reasoning over structured numerical data, multi-image context, and implicit domain knowledge. Empirical results show that even state-of-the-art LVLMs, such as GPT-4o and Qwen2.5VL, struggle on the English subset despite it being their primary training language and exhibit a further drop in performance on the Hindi subset. This reveals key limitations in structure-aware visual text understanding, numerical reasoning, and cross-lingual generalization. The dataset is publicly available via Hugging Face at https://huggingface.co/datasets/DIALab/MMCricBench, to promote LVLM research in this direction.
☆ Omne-R1: Learning to Reason with Memory for Multi-hop Question Answering
This paper introduces Omne-R1, a novel approach designed to enhance multi-hop question answering capabilities on schema-free knowledge graphs by integrating advanced reasoning models. Our method employs a multi-stage training workflow, including two reinforcement learning phases and one supervised fine-tuning phase. We address the challenge of limited suitable knowledge graphs and QA data by constructing domain-independent knowledge graphs and auto-generating QA pairs. Experimental results show significant improvements in answering multi-hop questions, with notable performance gains on more complex 3+ hop questions. Our proposed training framework demonstrates strong generalization abilities across diverse knowledge domains.
☆ CultranAI at PalmX 2025: Data Augmentation for Cultural Knowledge Representation
In this paper, we report our participation to the PalmX cultural evaluation shared task. Our system, CultranAI, focused on data augmentation and LoRA fine-tuning of large language models (LLMs) for Arabic cultural knowledge representation. We benchmarked several LLMs to identify the best-performing model for the task. In addition to utilizing the PalmX dataset, we augmented it by incorporating the Palm dataset and curated a new dataset of over 22K culturally grounded multiple-choice questions (MCQs). Our experiments showed that the Fanar-1-9B-Instruct model achieved the highest performance. We fine-tuned this model on the combined augmented dataset of 22K+ MCQs. On the blind test set, our submitted system ranked 5th with an accuracy of 70.50%, while on the PalmX development set, it achieved an accuracy of 84.1%.
comment: LLMs, Native, Arabic LLMs, Augmentation, Multilingual, Language Diversity, Contextual Understanding, Minority Languages, Culturally Informed, Foundation Models, Large Language Models
♻ ☆ Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
comment: The first version of this paper mistakenly included a prompt injection phrase, which was inappropriate and unprofessional. Although we corrected the version on arXiv and withdrew from the conference, my co-authors and university strongly request a full withdrawal. Given the situation, I no longer have the authority to manage this paper, and withdrawing it from arXiv is the most responsible action
♻ ☆ From Legal Texts to Defeasible Deontic Logic via LLMs: A Study in Automated Semantic Analysis
We present a novel approach to the automated semantic analysis of legal texts using large language models (LLMs), targeting their transformation into formal representations in Defeasible Deontic Logic (DDL). We propose a structured pipeline that segments complex normative language into atomic snippets, extracts deontic rules, and evaluates them for syntactic and semantic coherence. Our methodology is evaluated across various LLM configurations, including prompt engineering strategies, fine-tuned models, and multi-stage pipelines, focusing on legal norms from the Australian Telecommunications Consumer Protections Code. Empirical results demonstrate promising alignment between machine-generated and expert-crafted formalizations, showing that LLMs - particularly when prompted effectively - can significantly contribute to scalable legal informatics.
♻ ☆ Does GPT-4 surpass human performance in linguistic pragmatics?
As Large Language Models (LLMs) become increasingly integrated into everyday life as general purpose multimodal AI systems, their capabilities to simulate human understanding are under examination. This study investigates LLMs ability to interpret linguistic pragmatics, which involves context and implied meanings. Using Grice communication principles, we evaluated both LLMs (GPT-2, GPT-3, GPT-3.5, GPT-4, and Bard) and human subjects (N = 147) on dialogue-based tasks. Human participants included 71 primarily Serbian students and 76 native English speakers from the United States. Findings revealed that LLMs, particularly GPT-4, outperformed humans. GPT4 achieved the highest score of 4.80, surpassing the best human score of 4.55. Other LLMs performed well: GPT 3.5 scored 4.10, Bard 3.75, and GPT-3 3.25. GPT-2 had the lowest score of 1.05. The average LLM score was 3.39, exceeding the human cohorts averages of 2.80 (Serbian students) and 2.34 (U.S. participants). In the ranking of all 155 subjects (including LLMs and humans), GPT-4 secured the top position, while the best human ranked second. These results highlight significant progress in LLMs ability to simulate understanding of linguistic pragmatics. Future studies should confirm these findings with more dialogue-based tasks and diverse participants. This research has important implications for advancing general-purpose AI models in various communication-centered tasks, including potential application in humanoid robots in the future.
comment: 19 pages, 1 figure, 2 tables
♻ ☆ Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
♻ ☆ Intern-S1: A Scientific Multimodal Foundation Model
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
♻ ☆ VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation
Recent advances in Large Language Models (LLMs) have sparked growing interest in applying them to Electronic Design Automation (EDA) tasks, particularly Register Transfer Level (RTL) code generation. While several RTL datasets have been introduced, most focus on syntactic validity rather than functional validation with tests, leading to training examples that compile but may not implement the intended behavior. We present VERICODER, a model for RTL code generation fine-tuned on a dataset validated for functional correctness. This fine-tuning dataset is constructed using a novel methodology that combines unit test generation with feedback-directed refinement. Given a natural language specification and an initial RTL design, we prompt a teacher model (GPT-4o-mini) to generate unit tests and iteratively revise the RTL design based on its simulation results using the generated tests. If necessary, the teacher model also updates the tests to ensure they comply with the natural language specification. As a result of this process, every example in our dataset is functionally validated, consisting of a natural language description, an RTL implementation, and passing tests. Fine-tuned on this dataset of 125,777 examples, VERICODER achieves state-of-the-art metrics in functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4%, respectively. An ablation study further shows that models trained on our functionally validated dataset outperform those trained on functionally non-validated datasets, underscoring the importance of high-quality datasets in RTL code generation. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/VeriCoder
♻ ☆ Localizing Factual Inconsistencies in Attributable Text Generation ACL
There has been an increasing interest in detecting hallucinations in model-generated texts, both manually and automatically, at varying levels of granularity. However, most existing methods fail to precisely pinpoint the errors. In this work, we introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation, at a fine-grained level. Drawing inspiration from Neo-Davidsonian formal semantics, we propose decomposing the generated text into minimal predicate-argument level propositions, expressed as simple question-answer (QA) pairs, and assess whether each individual QA pair is supported by a trusted reference text. As each QA pair corresponds to a single semantic relation between a predicate and an argument, QASemConsistency effectively localizes the unsupported information. We first demonstrate the effectiveness of the QASemConsistency methodology for human annotation, by collecting crowdsourced annotations of granular consistency errors, while achieving a substantial inter-annotator agreement. This benchmark includes more than 3K instances spanning various tasks of attributable text generation. We also show that QASemConsistency yields factual consistency scores that correlate well with human judgments. Finally, we implement several methods for automatically detecting localized factual inconsistencies, with both supervised entailment models and LLMs.
comment: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL), 2025. Authors pre-print
♻ ☆ sudoLLM: On Multi-role Alignment of Language Models EMNLP 2025
User authorization-based access privileges are a key feature in many safety-critical systems, but have not been extensively studied in the large language model (LLM) realm. In this work, drawing inspiration from such access control systems, we introduce sudoLLM, a novel framework that results in multi-role aligned LLMs, i.e., LLMs that account for, and behave in accordance with, user access rights. sudoLLM injects subtle user-based biases into queries and trains an LLM to utilize this bias signal in order to produce sensitive information if and only if the user is authorized. We present empirical results demonstrating that this approach shows substantially improved alignment, generalization, resistance to prefix-based jailbreaking attacks, and ``fails-closed''. The persistent tension between the language modeling objective and safety alignment, which is often exploited to jailbreak LLMs, is somewhat resolved with the aid of the injected bias signal. Our framework is meant as an additional security layer, and complements existing guardrail mechanisms for enhanced end-to-end safety with LLMs.
comment: Accepted to EMNLP 2025 (findings)
♻ ☆ DRT: Deep Reasoning Translation via Long Chain-of-Thought ACL 2025
Recently, O1-like models have emerged as representative examples, illustrating the effectiveness of long chain-of-thought (CoT) in reasoning tasks such as math and coding tasks. In this paper, we introduce DRT, an attempt to bring the success of long CoT to neural machine translation (MT). Specifically, in view of the literature books that might involve similes and metaphors, translating these texts to a target language is very difficult in practice due to cultural differences. In such cases, literal translation often fails to convey the intended meaning effectively. Even for professional human translators, considerable thought must be given to preserving semantics throughout the translation process. To simulate LLMs' long thought ability in MT, we first mine sentences containing similes or metaphors from existing literature books, and then develop a multi-agent framework to translate these sentences via long thought. In the multi-agent framework, a translator is used to iteratively translate the source sentence under the suggestions provided by an advisor. To ensure the effectiveness of the long thoughts, an evaluator is also employed to quantify the translation quality in each round. In this way, we collect tens of thousands of long-thought MT data, which is used to train our DRT. Using Qwen2.5 and LLama-3.1 as the backbones, DRT models can learn the thought process during machine translation, and outperform vanilla LLMs as well as LLMs which are simply fine-tuning on the paired sentences without long thought, showing its effectiveness. The synthesized data and model checkpoints are released at https://github.com/krystalan/DRT.
comment: ACL 2025 Findings
♻ ☆ AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models
Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.
comment: Accepted to ASRU 2025
♻ ☆ SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Large audio-language models (LALMs) extend the large language models with multimodal understanding in speech, audio, etc. While their performances on speech and audio-processing tasks are extensively studied, their reasoning abilities remain underexplored. Particularly, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
comment: Accepted to Interspeech 2025 (Oral). Update acknowledgement in this version. Project page: https://github.com/ckyang1124/SAKURA
♻ ☆ Disentangling Exploration of Large Language Models by Optimal Exploitation
Exploration is a crucial skill for in-context reinforcement learning in unknown environments. However, it remains unclear if large language models can effectively explore a partially hidden state space. This work isolates exploration as the sole objective, tasking an agent with gathering information that enhances future returns. Within this framework, we argue that measuring agent returns is not sufficient for a fair evaluation. Hence, we decompose missing rewards into their exploration and exploitation components based on the optimal achievable return. Experiments with various models reveal that most struggle to explore the state space, and weak exploration is insufficient. Nevertheless, we found a positive correlation between exploration performance and reasoning capabilities. Our decomposition can provide insights into differences in behaviors driven by prompt engineering, offering a valuable tool for refining performance in exploratory tasks.
♻ ☆ Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation EMNLP 2025
Conventional speculative decoding (SD) methods utilize a predefined length policy for proposing drafts, which implies the premise that the target model smoothly accepts the proposed draft tokens. However, reality deviates from this assumption: the oracle draft length varies significantly, and the fixed-length policy hardly satisfies such a requirement. Moreover, such discrepancy is further exacerbated in scenarios involving complex reasoning and long-form generation, particularly under test-time scaling for reasoning-specialized models. Through both theoretical and empirical estimation, we establish that the discrepancy between the draft and target models can be approximated by the draft model's prediction entropy: a high entropy indicates a low acceptance rate of draft tokens, and vice versa. Based on this insight, we propose SVIP: Self-Verification Length Policy for Long-Context Speculative Decoding, which is a training-free dynamic length policy for speculative decoding systems that adaptively determines the lengths of draft sequences by referring to the draft entropy. Experimental results on mainstream SD benchmarks as well as reasoning-heavy benchmarks demonstrate the superior performance of SVIP, achieving up to 17% speedup on MT-Bench at 8K context compared with fixed draft lengths, and 22% speedup for QwQ in long-form reasoning.
comment: EMNLP 2025
♻ ☆ More Women, Same Stereotypes: Unpacking the Gender Bias Paradox in Large Language Models
Large Language Models (LLMs) have revolutionized natural language processing, yet concerns persist regarding their tendency to reflect or amplify social biases. This study introduces a novel evaluation framework to uncover gender biases in LLMs: using free-form storytelling to surface biases embedded within the models. A systematic analysis of ten prominent LLMs shows a consistent pattern of overrepresenting female characters across occupations, likely due to supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Paradoxically, despite this overrepresentation, the occupational gender distributions produced by these LLMs align more closely with human stereotypes than with real-world labor data. This highlights the challenge and importance of implementing balanced mitigation measures to promote fairness and prevent the establishment of potentially new biases. We release the prompts and LLM-generated stories at GitHub.
♻ ☆ Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining
Graphs with abundant attributes are essential in modeling interconnected entities and enhancing predictions across various real-world applications. Traditional Graph Neural Networks (GNNs) often require re-training for different graph tasks and datasets. Although the emergence of Large Language Models (LLMs) has introduced new paradigms in natural language processing, their potential for generic graph mining, training a single model to simultaneously handle diverse tasks and datasets, remains under-explored. To this end, our novel framework MuseGraph, seamlessly integrates the strengths of GNNs and LLMs into one foundation model for graph mining across tasks and datasets. This framework first features a compact graph description to encapsulate key graph information within language token limitations. Then, we propose a diverse instruction generation mechanism with Chain-of-Thought (CoT)-based instruction packages to distill the reasoning capabilities from advanced LLMs like GPT-4. Finally, we design a graph-aware instruction tuning strategy to facilitate mutual enhancement across multiple tasks and datasets while preventing catastrophic forgetting of LLMs' generative abilities. Our experimental results demonstrate significant improvements in five graph tasks and ten datasets, showcasing the potential of our MuseGraph in enhancing the accuracy of graph-oriented downstream tasks while improving the generation abilities of LLMs.
comment: Accepted by TPAMI 2025
♻ ☆ Agentic large language models improve retrieval-based radiology question answering
Clinical decision-making in radiology increasingly benefits from artificial intelligence (AI), particularly through large language models (LLMs). However, traditional retrieval-augmented generation (RAG) systems for radiology question answering (QA) typically rely on single-step retrieval, limiting their ability to handle complex clinical reasoning tasks. Here we propose an agentic RAG framework enabling LLMs to autonomously decompose radiology questions, iteratively retrieve targeted clinical evidence from Radiopaedia.org, and dynamically synthesize evidence-based responses. We evaluated 25 LLMs spanning diverse architectures, parameter scales (0.5B to >670B), and training paradigms (general-purpose, reasoning-optimized, clinically fine-tuned), using 104 expert-curated radiology questions from previously established RSNA-RadioQA and ExtendedQA datasets. To assess generalizability, we additionally tested on an unseen internal dataset of 65 real-world radiology board examination questions. Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG. The greatest gains occurred in small-scale models, while very large models (>200B parameters) demonstrated minimal changes (<2% improvement). Additionally, agentic retrieval reduced hallucinations (mean 9.4%) and retrieved clinically relevant context in 46% of cases, substantially aiding factual grounding. Even clinically fine-tuned models showed gains from agentic retrieval (e.g., MedGemma-27B), indicating that retrieval remains beneficial despite embedded domain knowledge. These results highlight the potential of agentic frameworks to enhance factuality and diagnostic accuracy in radiology QA, warranting future studies to validate their clinical utility. All datasets, code, and the full agentic framework are publicly available to support open research and clinical translation.
♻ ☆ SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain
Recent breakthroughs in large language models (LLMs) exemplified by the impressive mathematical and scientific reasoning capabilities of the o1 model have spotlighted the critical importance of high-quality training data in advancing LLM performance across STEM disciplines. While the mathematics community has benefited from a growing body of curated datasets, the scientific domain at the higher education level has long suffered from a scarcity of comparable resources. To address this gap, we present SCP-116K, a new large-scale dataset of 116,756 high-quality problem-solution pairs, automatically extracted from heterogeneous sources using a streamlined and highly generalizable pipeline. Our approach involves stringent filtering to ensure the scientific rigor and educational level of the extracted materials, while maintaining adaptability for future expansions or domain transfers. By openly releasing both the dataset and the extraction pipeline, we seek to foster research on scientific reasoning, enable comprehensive performance evaluations of new LLMs, and lower the barrier to replicating the successes of advanced models like o1 in the broader science community. We believe SCP-116K will serve as a critical resource, catalyzing progress in high-level scientific reasoning tasks and promoting further innovations in LLM development. The dataset and code are publicly available at https://github.com/AQA6666/SCP-116K-open.
comment: 9 pages, 1 figures
♻ ☆ Understanding Bias Reinforcement in LLM Agents Debate
Large Language Models $($LLMs$)$ solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate $($MAD$)$ has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce $\textit{MetaNIM Arena}$, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD's limitations, we propose $\textbf{DReaMAD}$ $($$\textbf{D}$iverse $\textbf{Rea}$soning via $\textbf{M}$ulti-$\textbf{A}$gent $\textbf{D}$ebate with Refined Prompt$)$, a novel framework that $(1)$ refines LLM's strategic prior knowledge to improve reasoning quality and $(2)$ promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that $\textbf{DReaMAD}$ significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.
comment: 32 pages, 9 figures
♻ ☆ Jinx: Unlimited LLMs for Probing Alignment Failures
Unlimited, or so-called helpful-only language models are trained without safety alignment constraints and never refuse user queries. They are widely used by leading AI companies as internal tools for red teaming and alignment evaluation. For example, if a safety-aligned model produces harmful outputs similar to an unlimited model, this indicates alignment failures that require further attention. Despite their essential role in assessing alignment, such models are not available to the research community. We introduce Jinx, a helpful-only variant of popular open-weight LLMs. Jinx responds to all queries without refusals or safety filtering, while preserving the base model's capabilities in reasoning and instruction following. It provides researchers with an accessible tool for probing alignment failures, evaluating safety boundaries, and systematically studying failure modes in language model safety.
comment: https://huggingface.co/Jinx-org
Information Retrieval 12
☆ DS@GT at CheckThat! 2025: A Simple Retrieval-First, LLM-Backed Framework for Claim Normalization
Claim normalization is an integral part of any automatic fact-check verification system. It parses the typically noisy claim data, such as social media posts into normalized claims, which are then fed into downstream veracity classification tasks. The CheckThat! 2025 Task 2 focuses specifically on claim normalization and spans 20 languages under monolingual and zero-shot conditions. Our proposed solution consists of a lightweight \emph{retrieval-first, LLM-backed} pipeline, in which we either dynamically prompt a GPT-4o-mini with in-context examples, or retrieve the closest normalization from the train dataset directly. On the official test set, the system ranks near the top for most monolingual tracks, achieving first place in 7 out of of the 13 languages. In contrast, the system underperforms in the zero-shot setting, highlighting the limitation of the proposed solution.
comment: CLEF 2025 Working Notes, Madrid, Spain
☆ Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs
How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs. We also show that In-Context Learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications this has for the development of LLM-based retrievers.
comment: 15 pages, 4 figures
☆ Capturing Legal Reasoning Paths from Facts to Law in Court Judgments using Knowledge Graphs
Court judgments reveal how legal rules have been interpreted and applied to facts, providing a foundation for understanding structured legal reasoning. However, existing automated approaches for capturing legal reasoning, including large language models, often fail to identify the relevant legal context, do not accurately trace how facts relate to legal norms, and may misrepresent the layered structure of judicial reasoning. These limitations hinder the ability to capture how courts apply the law to facts in practice. In this paper, we address these challenges by constructing a legal knowledge graph from 648 Japanese administrative court decisions. Our method extracts components of legal reasoning using prompt-based large language models, normalizes references to legal provisions, and links facts, norms, and legal applications through an ontology of legal inference. The resulting graph captures the full structure of legal reasoning as it appears in real court decisions, making implicit reasoning explicit and machine-readable. We evaluate our system using expert annotated data, and find that it achieves more accurate retrieval of relevant legal provisions from facts than large language model baselines and retrieval-augmented methods.
☆ Opening the Black Box: Interpretable Remedies for Popularity Bias in Recommender Systems
Popularity bias is a well-known challenge in recommender systems, where a small number of popular items receive disproportionate attention, while the majority of less popular items are largely overlooked. This imbalance often results in reduced recommendation quality and unfair exposure of items. Although existing mitigation techniques address this bias to some extent, they typically lack transparency in how they operate. In this paper, we propose a post-hoc method using a Sparse Autoencoder (SAE) to interpret and mitigate popularity bias in deep recommendation models. The SAE is trained to replicate a pre-trained model's behavior while enabling neuron-level interpretability. By introducing synthetic users with clear preferences for either popular or unpopular items, we identify neurons encoding popularity signals based on their activation patterns. We then adjust the activations of the most biased neurons to steer recommendations toward fairer exposure. Experiments on two public datasets using a sequential recommendation model show that our method significantly improves fairness with minimal impact on accuracy. Moreover, it offers interpretability and fine-grained control over the fairness-accuracy trade-off.
☆ Are You Sure You're Positive? Consolidating Chain-of-Thought Agents with Uncertainty Quantification for Aspect-Category Sentiment Analysis
Aspect-category sentiment analysis provides granular insights by identifying specific themes within product reviews that are associated with particular opinions. Supervised learning approaches dominate the field. However, data is scarce and expensive to annotate for new domains. We argue that leveraging large language models in a zero-shot setting is beneficial where the time and resources required for dataset annotation are limited. Furthermore, annotation bias may lead to strong results using supervised methods but transfer poorly to new domains in contexts that lack annotations and demand reproducibility. In our work, we propose novel techniques that combine multiple chain-of-thought agents by leveraging large language models' token-level uncertainty scores. We experiment with the 3B and 70B+ parameter size variants of Llama and Qwen models, demonstrating how these approaches can fulfil practical needs and opening a discussion on how to gauge accuracy in label-scarce conditions.
comment: 18 pages, 10 figures, 3 tables, Proceedings of the 1st Workshop for Research on Agent Language Models (REALM 2025)
☆ Routing Distilled Knowledge via Mixture of LoRA Experts for Large Language Model based Bundle Generation
Large Language Models (LLMs) have shown potential in automatic bundle generation but suffer from prohibitive computational costs. Although knowledge distillation offers a pathway to more efficient student models, our preliminary study reveals that naively integrating diverse types of distilled knowledge from teacher LLMs into student LLMs leads to knowledge conflict, negatively impacting the performance of bundle generation. To address this, we propose RouteDK, a framework for routing distilled knowledge through a mixture of LoRA expert architecture. Specifically, we first distill knowledge from the teacher LLM for bundle generation in two complementary types: high-level knowledge (generalizable rules) and fine-grained knowledge (session-specific reasoning). We then train knowledge-specific LoRA experts for each type of knowledge together with a base LoRA expert. For effective integration, we propose a dynamic fusion module, featuring an input-aware router, where the router balances expert contributions by dynamically determining optimal weights based on input, thereby effectively mitigating knowledge conflicts. To further improve inference reliability, we design an inference-time enhancement module to reduce variance and mitigate suboptimal reasoning. Experiments on three public datasets show that our RouteDK achieves accuracy comparable to or even better than the teacher LLM, while maintaining strong computational efficiency. In addition, it outperforms state-of-the-art approaches for bundle generation.
☆ Exposing Privacy Risks in Graph Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) is a powerful technique for enhancing Large Language Models (LLMs) with external, up-to-date knowledge. Graph RAG has emerged as an advanced paradigm that leverages graph-based knowledge structures to provide more coherent and contextually rich answers. However, the move from plain document retrieval to structured graph traversal introduces new, under-explored privacy risks. This paper investigates the data extraction vulnerabilities of the Graph RAG systems. We design and execute tailored data extraction attacks to probe their susceptibility to leaking both raw text and structured data, such as entities and their relationships. Our findings reveal a critical trade-off: while Graph RAG systems may reduce raw text leakage, they are significantly more vulnerable to the extraction of structured entity and relationship information. We also explore potential defense mechanisms to mitigate these novel attack surfaces. This work provides a foundational analysis of the unique privacy challenges in Graph RAG and offers insights for building more secure systems.
♻ ☆ Test-time Corpus Feedback: From Retrieval to RAG
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for knowledge-intensive NLP tasks, combining large language models (LLMs) with document retrieval from external corpora. Despite its widespread use, most RAG pipelines continue to treat retrieval and reasoning as isolated components, retrieving documents once and then generating answers without further interaction. This static design often limits performance on complex tasks that require iterative evidence gathering or high-precision retrieval. Recent work in both the information retrieval (IR) and NLP communities has begun to close this gap by introducing adaptive retrieval and ranking methods that incorporate feedback. In this survey, we present a structured overview of advanced retrieval and ranking mechanisms that integrate such feedback. We categorize feedback signals based on their source and role in improving the query, retrieved context, or document pool. By consolidating these developments, we aim to bridge IR and NLP perspectives and highlight retrieval as a dynamic, learnable component of end-to-end RAG systems.
comment: 18 pages, 1 figure
♻ ☆ LLM4MSR: An LLM-Enhanced Paradigm for Multi-Scenario Recommendation CIKM 2024
As the demand for more personalized recommendation grows and a dramatic boom in commercial scenarios arises, the study on multi-scenario recommendation (MSR) has attracted much attention, which uses the data from all scenarios to simultaneously improve their recommendation performance. However, existing methods tend to integrate insufficient scenario knowledge and neglect learning personalized cross-scenario preferences, thus leading to sub-optimal performance. Meanwhile, though large language model (LLM) has shown great capability of reasoning and capturing semantic information, the high inference latency and high computation cost of tuning hinder its implementation in industrial recommender systems. To fill these gaps, we propose an LLM-enhanced paradigm LLM4MSR in this work. Specifically, we first leverage LLM to uncover multi-level knowledge from the designed scenario- and user-level prompt without fine-tuning the LLM, then adopt hierarchical meta networks to generate multi-level meta layers to explicitly improve the scenario-aware and personalized recommendation capability. Our experiments on KuaiSAR-small, KuaiSAR, and Amazon datasets validate significant advantages of LLM4MSR: (i) the effectiveness and compatibility with different multi-scenario backbone models, (ii) high efficiency and deployability on industrial recommender systems, and (iii) improved interpretability. The implemented code and data is available to ease reproduction.
comment: CIKM 2024 Full Research Paper
♻ ☆ SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain
Recent breakthroughs in large language models (LLMs) exemplified by the impressive mathematical and scientific reasoning capabilities of the o1 model have spotlighted the critical importance of high-quality training data in advancing LLM performance across STEM disciplines. While the mathematics community has benefited from a growing body of curated datasets, the scientific domain at the higher education level has long suffered from a scarcity of comparable resources. To address this gap, we present SCP-116K, a new large-scale dataset of 116,756 high-quality problem-solution pairs, automatically extracted from heterogeneous sources using a streamlined and highly generalizable pipeline. Our approach involves stringent filtering to ensure the scientific rigor and educational level of the extracted materials, while maintaining adaptability for future expansions or domain transfers. By openly releasing both the dataset and the extraction pipeline, we seek to foster research on scientific reasoning, enable comprehensive performance evaluations of new LLMs, and lower the barrier to replicating the successes of advanced models like o1 in the broader science community. We believe SCP-116K will serve as a critical resource, catalyzing progress in high-level scientific reasoning tasks and promoting further innovations in LLM development. The dataset and code are publicly available at https://github.com/AQA6666/SCP-116K-open.
comment: 9 pages, 1 figures
♻ ☆ Expert-Guided Diffusion Planner for Auto-Bidding CIKM 2025
Auto-bidding is widely used in advertising systems, serving a diverse range of advertisers. Generative bidding is increasingly gaining traction due to its strong planning capabilities and generalizability. Unlike traditional reinforcement learning-based bidding, generative bidding does not depend on the Markov Decision Process (MDP), thereby exhibiting superior planning performance in long-horizon scenarios. Conditional diffusion modeling approaches have shown significant promise in the field of auto-bidding. However, relying solely on return as the optimality criterion is insufficient to guarantee the generation of truly optimal decision sequences, as it lacks personalized structural information. Moreover, the auto-regressive generation mechanism of diffusion models inherently introduces timeliness risks. To address these challenges, we introduce a novel conditional diffusion modeling approach that integrates expert trajectory guidance with a skip-step sampling strategy to improve generation efficiency. The efficacy of this method has been demonstrated through comprehensive offline experiments and further substantiated by statistically significant outcomes in online A/B testing, yielding an 11.29% increase in conversions and a 12.36% growth in revenue relative to the baseline.
comment: Accepted for presentation at the CIKM 2025 Applied Research Track, eight (8) pages, three (3) figures
♻ ☆ Macro Graph of Experts for Billion-Scale Multi-Task Recommendation
Graph-based multi-task learning at billion-scale presents a significant challenge, as different tasks correspond to distinct billion-scale graphs. Traditional multi-task learning methods often neglect these graph structures, relying solely on individual user and item embeddings. However, disregarding graph structures overlooks substantial potential for improving performance. In this paper, we introduce the Macro Graph of Expert (MGOE) framework, the first approach capable of leveraging macro graph embeddings to capture task-specific macro features while modeling the correlations between task-specific experts. Specifically, we propose the concept of a Macro Graph Bottom, which, for the first time, enables multi-task learning models to incorporate graph information effectively. We design the Macro Prediction Tower to dynamically integrate macro knowledge across tasks. MGOE has been deployed at scale, powering multi-task learning for the homepage of a leading billion-scale recommender system. Extensive offline experiments conducted on three public benchmark datasets demonstrate its superiority over state-of-the-art multi-task learning methods, establishing MGOE as a breakthrough in multi-task graph-based recommendation. Furthermore, online A/B tests confirm the superiority of MGOE in billion-scale recommender systems.
Multimedia 4
☆ py360tool: Um framework para manipulação de vídeo 360$^\circ$ com ladrilhos
Streaming 360$^\circ$ videos for virtual reality demands a lot of bandwidth. To optimize this transmission, videos are divided into "tiles" and selectively distributed to the user based on what they are looking at. This interactive approach makes it difficult to assess quality and user experience. To solve this, the paper presents py360tools, a Python library that automates client-side tasks like video reconstruction, tile selection, and viewport extraction. This facilitates the reproduction, simulation, and analysis of 360$^\circ$ video streaming sessions.
comment: in Portuguese language, Submetido ao WFA, Workshop de Ferramentas e Aplica\c{c}\~oes de 2025, evento sat\'elite do 31{\deg} Simp\'osio Brasileiro de Sistemas Multim\'idia e Web
☆ DanceEditor: Towards Iterative Editable Music-driven Dance Generation with Open-Vocabulary Descriptions
Generating coherent and diverse human dances from music signals has gained tremendous progress in animating virtual avatars. While existing methods support direct dance synthesis, they fail to recognize that enabling users to edit dance movements is far more practical in real-world choreography scenarios. Moreover, the lack of high-quality dance datasets incorporating iterative editing also limits addressing this challenge. To achieve this goal, we first construct DanceRemix, a large-scale multi-turn editable dance dataset comprising the prompt featuring over 25.3M dance frames and 84.5K pairs. In addition, we propose a novel framework for iterative and editable dance generation coherently aligned with given music signals, namely DanceEditor. Considering the dance motion should be both musical rhythmic and enable iterative editing by user descriptions, our framework is built upon a prediction-then-editing paradigm unifying multi-modal conditions. At the initial prediction stage, our framework improves the authority of generated results by directly modeling dance movements from tailored, aligned music. Moreover, at the subsequent iterative editing stages, we incorporate text descriptions as conditioning information to draw the editable results through a specifically designed Cross-modality Editing Module (CEM). Specifically, CEM adaptively integrates the initial prediction with music and text prompts as temporal motion cues to guide the synthesized sequences. Thereby, the results display music harmonics while preserving fine-grained semantic alignment with text descriptions. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on our newly collected DanceRemix dataset. Code is available at https://lzvsdy.github.io/DanceEditor/.
☆ MTNet: Learning modality-aware representation with transformer for RGBT tracking
The ability to learn robust multi-modality representation has played a critical role in the development of RGBT tracking. However, the regular fusion paradigm and the invariable tracking template remain restrictive to the feature interaction. In this paper, we propose a modality-aware tracker based on transformer, termed MTNet. Specifically, a modality-aware network is presented to explore modality-specific cues, which contains both channel aggregation and distribution module(CADM) and spatial similarity perception module (SSPM). A transformer fusion network is then applied to capture global dependencies to reinforce instance representations. To estimate the precise location and tackle the challenges, such as scale variation and deformation, we design a trident prediction head and a dynamic update strategy which jointly maintain a reliable template for facilitating inter-frame communication. Extensive experiments validate that the proposed method achieves satisfactory results compared with the state-of-the-art competitors on three RGBT benchmarks while reaching real-time speed.
☆ Spatial-Temporal Human-Object Interaction Detection
In this paper, we propose a new instance-level human-object interaction detection task on videos called ST-HOID, which aims to distinguish fine-grained human-object interactions (HOIs) and the trajectories of subjects and objects. It is motivated by the fact that HOI is crucial for human-centric video content understanding. To solve ST-HOID, we propose a novel method consisting of an object trajectory detection module and an interaction reasoning module. Furthermore, we construct the first dataset named VidOR-HOID for ST-HOID evaluation, which contains 10,831 spatial-temporal HOI instances. We conduct extensive experiments to evaluate the effectiveness of our method. The experimental results demonstrate that our method outperforms the baselines generated by the state-of-the-art methods of image human-object interaction detection, video visual relation detection and video human-object interaction recognition.
Information Retrieval 9
☆ The Power of Framing: How News Headlines Guide Search Behavior EMNLP
Search engines play a central role in how people gather information, but subtle cues like headline framing may influence not only what users believe but also how they search. While framing effects on judgment are well documented, their impact on subsequent search behavior is less understood. We conducted a controlled experiment where participants issued queries and selected from headlines filtered by specific linguistic frames. Headline framing significantly shaped follow-up queries: conflict and strategy frames disrupted alignment with prior selections, while episodic frames led to more concrete queries than thematic ones. We also observed modest short-term frame persistence that declined over time. These results suggest that even brief exposure to framing can meaningfully alter the direction of users information-seeking behavior.
comment: Accepted to EMNLP
☆ VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling
In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. Extending sequence length generally improves accuracy, but directly modeling such sequences in production is infeasible due to latency and memory constraints. Existing solutions fall into two categories: (1) top-k retrieval, which truncates the sequence and may discard most attention mass when L >> k; and (2) encoder-based compression, which preserves coverage but often over-compresses and fails to incorporate key context such as temporal gaps or target-aware signals. Neither class achieves a good balance of low-loss compression, context awareness, and efficiency. We propose VQL, a context-aware Vector Quantization Attention framework for ultra-long behavior modeling, with three innovations. (1) Key-only quantization: only attention keys are quantized, while values remain intact; we prove that softmax normalization yields an error bound independent of sequence length, and a codebook loss directly supervises quantization quality. This also enables L-free inference via offline caches. (2) Multi-scale quantization: attention heads are partitioned into groups, each with its own small codebook, which reduces quantization error while keeping cache size fixed. (3) Efficient context injection: static features (e.g., item category, modality) are directly integrated, and relative position is modeled via a separable temporal kernel. All context is injected without enlarging the codebook, so cached representations remain query-independent. Experiments on three large-scale datasets (KuaiRand-1K, KuaiRec, TMALL) show that VQL consistently outperforms strong baselines, achieving higher accuracy while reducing inference latency, establishing a new state of the art in balancing accuracy and efficiency for ultra-long sequence recommendation.
☆ Zero-shot Multimodal Document Retrieval via Cross-modal Question Generation
Rapid advances in Multimodal Large Language Models (MLLMs) have expanded information retrieval beyond purely textual inputs, enabling retrieval from complex real world documents that combine text and visuals. However, most documents are private either owned by individuals or confined within corporate silos and current retrievers struggle when faced with unseen domains or languages. To address this gap, we introduce PREMIR, a simple yet effective framework that leverages the broad knowledge of an MLLM to generate cross modal pre questions (preQs) before retrieval. Unlike earlier multimodal retrievers that compare embeddings in a single vector space, PREMIR leverages preQs from multiple complementary modalities to expand the scope of matching to the token level. Experiments show that PREMIR achieves state of the art performance on out of distribution benchmarks, including closed domain and multilingual settings, outperforming strong baselines across all retrieval metrics. We confirm the contribution of each component through in depth ablation studies, and qualitative analyses of the generated preQs further highlight the model's robustness in real world settings.
☆ Towards a Real-World Aligned Benchmark for Unlearning in Recommender Systems
Modern recommender systems heavily leverage user interaction data to deliver personalized experiences. However, relying on personal data presents challenges in adhering to privacy regulations, such as the GDPR's "right to be forgotten". Machine unlearning (MU) aims to address these challenges by enabling the efficient removal of specific training data from models post-training, without compromising model utility or leaving residual information. However, current benchmarks for unlearning in recommender systems -- most notably CURE4Rec -- fail to reflect real-world operational demands. They focus narrowly on collaborative filtering, overlook tasks like session-based and next-basket recommendation, simulate unrealistically large unlearning requests, and ignore critical efficiency constraints. In this paper, we propose a set of design desiderata and research questions to guide the development of a more realistic benchmark for unlearning in recommender systems, with the goal of gathering feedback from the research community. Our benchmark proposal spans multiple recommendation tasks, includes domain-specific unlearning scenarios, and several unlearning algorithms -- including ones adapted from a recent NeurIPS unlearning competition. Furthermore, we argue for an unlearning setup that reflects the sequential, time-sensitive nature of real-world deletion requests. We also present a preliminary experiment in a next-basket recommendation setting based on our proposed desiderata and find that unlearning also works for sequential recommendation models, exposed to many small unlearning requests. In this case, we observe that a modification of a custom-designed unlearning algorithm for recommender systems outperforms general unlearning algorithms significantly, and that unlearning can be executed with a latency of only several seconds.
☆ DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation EMNLP
Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose \textbf{De}ep\textbf{A}gent\textbf{R}ank (\textbf{\DeAR}), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In \emph{Stage 1}, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact \{3, 8\}B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In \emph{Stage 2}, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, \DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making \DeAR a highly effective and interpretable solution for modern reranking systems.\footnote{Dataset and code available at https://github.com/DataScienceUIBK/DeAR-Reranking.}.
comment: Accept at EMNLP Findings 2025
☆ THEME : Enhancing Thematic Investing with Semantic Stock Representations and Temporal Dynamics CIKM
Thematic investing aims to construct portfolios aligned with structural trends, yet selecting relevant stocks remains challenging due to overlapping sector boundaries and evolving market dynamics. To address this challenge, we construct the Thematic Representation Set (TRS), an extended dataset that begins with real-world thematic ETFs and expands upon them by incorporating industry classifications and financial news to overcome their coverage limitations. The final dataset contains both the explicit mapping of themes to their constituent stocks and the rich textual profiles for each. Building on this dataset, we introduce \textsc{THEME}, a hierarchical contrastive learning framework. By representing the textual profiles of themes and stocks as embeddings, \textsc{THEME} first leverages their hierarchical relationship to achieve semantic alignment. Subsequently, it refines these semantic embeddings through a temporal refinement stage that incorporates individual stock returns. The final stock representations are designed for effective retrieval of thematically aligned assets with strong return potential. Empirical results show that \textsc{THEME} outperforms strong baselines across multiple retrieval metrics and significantly improves performance in portfolio construction. By jointly modeling thematic relationships from text and market dynamics from returns, \textsc{THEME} provides a scalable and adaptive solution for navigating complex investment themes.
comment: Accepted at ACM International Conference on Information and Knowledge Management (CIKM)
♻ ☆ RecCoT: Enhancing Recommendation via Chain-of-Thought
In real-world applications, users always interact with items in multiple aspects, such as through implicit binary feedback (e.g., clicks, dislikes, long views) and explicit feedback (e.g., comments, reviews). Modern recommendation systems (RecSys) learn user-item collaborative signals from these implicit feedback signals as a large-scale binary data-streaming, subsequently recommending other highly similar items based on users' personalized historical interactions. However, from this collaborative-connection perspective, the RecSys does not focus on the actual content of the items themselves but instead prioritizes higher-probability signals of behavioral co-occurrence among items. Consequently, under this binary learning paradigm, the RecSys struggles to understand why a user likes or dislikes certain items. To alleviate it, some works attempt to utilize the content-based reviews to capture the semantic knowledge to enhance recommender models. However, most of these methods focus on predicting the ratings of reviews, but do not provide a human-understandable explanation.
comment: Work in progress
♻ ☆ Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis CIKM 2025
LLM-based agents have emerged as transformative tools capable of executing complex tasks through iterative planning and action, achieving significant advancements in understanding and addressing user needs. Yet, their effectiveness remains limited in specialized domains such as mental health diagnosis, where they underperform compared to general applications. Current approaches to integrating diagnostic capabilities into LLMs rely on scarce, highly sensitive mental health datasets, which are challenging to acquire. These methods also fail to emulate clinicians' proactive inquiry skills, lack multi-turn conversational comprehension, and struggle to align outputs with expert clinical reasoning. To address these gaps, we propose DSM5AgentFlow, the first LLM-based agent workflow designed to autonomously generate DSM-5 Level-1 diagnostic questionnaires. By simulating therapist-client dialogues with specific client profiles, the framework delivers transparent, step-by-step disorder predictions, producing explainable and trustworthy results. This workflow serves as a complementary tool for mental health diagnosis, ensuring adherence to ethical and legal standards. Through comprehensive experiments, we evaluate leading LLMs across three critical dimensions: conversational realism, diagnostic accuracy, and explainability. Our datasets and implementations are fully open-sourced.
comment: This paper has been accepted by CIKM 2025 as a full paper
♻ ☆ 360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation
Ranking and recommendation systems are the foundation for numerous online experiences, ranging from search results to personalized content delivery. These systems have evolved into complex, multilayered architectures that leverage vast datasets and often incorporate thousands of predictive models. The maintenance and enhancement of these models is a labor intensive process that requires extensive feature engineering. This approach not only exacerbates technical debt but also hampers innovation in extending these systems to emerging problem domains. In this report, we present our research to address these challenges by utilizing a large foundation model with a textual interface for ranking and recommendation tasks. We illustrate several key advantages of our approach: (1) a single model can manage multiple predictive tasks involved in ranking and recommendation, (2) decoder models with textual interface due to their comprehension of reasoning capabilities, can generalize to new recommendation surfaces and out-of-domain problems, and (3) by employing natural language interfaces for task definitions and verbalizing member behaviors and their social connections, we eliminate the need for feature engineering and the maintenance of complex directed acyclic graphs of model dependencies. We introduce our research pre-production model, 360Brew V1.0, a 150B parameter, decoder-only model that has been trained and fine-tuned on LinkedIn's data and tasks. This model is capable of solving over 30 predictive tasks across various segments of the LinkedIn platform, achieving performance levels comparable to or exceeding those of current production systems based on offline metrics, without task-specific fine-tuning. Notably, each of these tasks is conventionally addressed by dedicated models that have been developed and maintained over multiple years by teams of a similar or larger size than our own.
Multimedia 7
☆ Generative Flow Networks for Personalized Multimedia Systems: A Case Study on Short Video Feeds
Multimedia systems underpin modern digital interactions, facilitating seamless integration and optimization of resources across diverse multimedia applications. To meet growing personalization demands, multimedia systems must efficiently manage competing resource needs, adaptive content, and user-specific data handling. This paper introduces Generative Flow Networks (GFlowNets, GFNs) as a brave new framework for enabling personalized multimedia systems. By integrating multi-candidate generative modeling with flow-based principles, GFlowNets offer a scalable and flexible solution for enhancing user-specific multimedia experiences. To illustrate the effectiveness of GFlowNets, we focus on short video feeds, a multimedia application characterized by high personalization demands and significant resource constraints, as a case study. Our proposed GFlowNet-based personalized feeds algorithm demonstrates superior performance compared to traditional rule-based and reinforcement learning methods across critical metrics, including video quality, resource utilization efficiency, and delivery cost. Moreover, we propose a unified GFlowNet-based framework generalizable to other multimedia systems, highlighting its adaptability and wide-ranging applicability. These findings underscore the potential of GFlowNets to advance personalized multimedia systems by addressing complex optimization challenges and supporting sophisticated multimedia application scenarios.
comment: ACM Multimedia 2025
☆ Generative AI for Multimedia Communication: Recent Advances, An Information-Theoretic Framework, and Future Opportunities
Recent breakthroughs in generative artificial intelligence (AI) are transforming multimedia communication. This paper systematically reviews key recent advancements across generative AI for multimedia communication, emphasizing transformative models like diffusion and transformers. However, conventional information-theoretic frameworks fail to address semantic fidelity, critical to human perception. We propose an innovative semantic information-theoretic framework, introducing semantic entropy, mutual information, channel capacity, and rate-distortion concepts specifically adapted to multimedia applications. This framework redefines multimedia communication from purely syntactic data transmission to semantic information conveyance. We further highlight future opportunities and critical research directions. We chart a path toward robust, efficient, and semantically meaningful multimedia communication systems by bridging generative AI innovations with information theory. This exploratory paper aims to inspire a semantic-first paradigm shift, offering a fresh perspective with significant implications for future multimedia research.
comment: ACM Multimedia 2025
☆ SyncGuard: Robust Audio Watermarking Capable of Countering Desynchronization Attacks
Audio watermarking has been widely applied in copyright protection and source tracing. However, due to the inherent characteristics of audio signals, watermark localization and resistance to desynchronization attacks remain significant challenges. In this paper, we propose a learning-based scheme named SyncGuard to address these challenges. Specifically, we design a frame-wise broadcast embedding strategy to embed the watermark in arbitrary-length audio, enhancing time-independence and eliminating the need for localization during watermark extraction. To further enhance robustness, we introduce a meticulously designed distortion layer. Additionally, we employ dilated residual blocks in conjunction with dilated gated blocks to effectively capture multi-resolution time-frequency features. Extensive experimental results show that SyncGuard efficiently handles variable-length audio segments, outperforms state-of-the-art methods in robustness against various attacks, and delivers superior auditory quality.
☆ Probabilistic Temporal Masked Attention for Cross-view Online Action Detection
As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. The sensitivity of mainstream OAD models to varying video viewpoints often hampers their generalization when confronted with unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell, which leverages these representations to effectively query the input video sequence, thereby enhancing information interaction and facilitating autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. Experiments conducted under three evaluation protocols: cross-subject (cs), cross-view (cv), and cross-subject-view (csv) show that PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.
comment: 12 pages, 6 figures, accepted at IEEE Transactions on Multimedia (TMM), in press
☆ MDD: A Dataset for Text-and-Music Conditioned Duet Dance Generation ICCV 2025
We introduce Multimodal DuetDance (MDD), a diverse multimodal benchmark dataset designed for text-controlled and music-conditioned 3D duet dance motion generation. Our dataset comprises 620 minutes of high-quality motion capture data performed by professional dancers, synchronized with music, and detailed with over 10K fine-grained natural language descriptions. The annotations capture a rich movement vocabulary, detailing spatial relationships, body movements, and rhythm, making MDD the first dataset to seamlessly integrate human motions, music, and text for duet dance generation. We introduce two novel tasks supported by our dataset: (1) Text-to-Duet, where given music and a textual prompt, both the leader and follower dance motion are generated (2) Text-to-Dance Accompaniment, where given music, textual prompt, and the leader's motion, the follower's motion is generated in a cohesive, text-aligned manner. We include baseline evaluations on both tasks to support future research.
comment: Accepted at ICCV 2025. Project page: https://gprerit96.github.io/mdd-page
♻ ☆ Boosting Temporal Sentence Grounding via Causal Inference ACM MM 2025
Temporal Sentence Grounding (TSG) aims to identify relevant moments in an untrimmed video that semantically correspond to a given textual query. Despite existing studies having made substantial progress, they often overlook the issue of spurious correlations between video and textual queries. These spurious correlations arise from two primary factors: (1) inherent biases in the textual data, such as frequent co-occurrences of specific verbs or phrases, and (2) the model's tendency to overfit to salient or repetitive patterns in video content. Such biases mislead the model into associating textual cues with incorrect visual moments, resulting in unreliable predictions and poor generalization to out-of-distribution examples. To overcome these limitations, we propose a novel TSG framework, causal intervention and counterfactual reasoning that utilizes causal inference to eliminate spurious correlations and enhance the model's robustness. Specifically, we first formulate the TSG task from a causal perspective with a structural causal model. Then, to address unobserved confounders reflecting textual biases toward specific verbs or phrases, a textual causal intervention is proposed, utilizing do-calculus to estimate the causal effects. Furthermore, visual counterfactual reasoning is performed by constructing a counterfactual scenario that focuses solely on video features, excluding the query and fused multi-modal features. This allows us to debias the model by isolating and removing the influence of the video from the overall effect. Experiments on public datasets demonstrate the superiority of the proposed method. The code is available at https://github.com/Tangkfan/CICR.
comment: Accepted by ACM MM 2025
♻ ☆ Watermarking Visual Concepts for Diffusion Models
The personalization techniques of diffusion models succeed in generating images with specific concepts. This ability also poses great threats to copyright protection and network security since malicious users can generate unauthorized content and disinformation relevant to a target concept. Model watermarking is an effective solution to trace the malicious generated images and safeguard their copyright. However, existing model watermarking techniques merely achieve image-level tracing without concept traceability. When tracing infringing or harmful concepts, current approaches execute image concept detection and model tracing sequentially, where performance is critically constrained by concept detection accuracy. In this paper, we propose a lightweight concept watermarking framework that efficiently binds target concepts to model watermarks, supporting simultaneous concept identification and model tracing via single-stage watermark verification. To further enhance the robustness of concept watermarking, we propose an adversarial perturbation injection method collaboratively embedded with watermarks during image generation, avoiding watermark removal by model purification attacks. Experimental results demonstrate that ConceptWM significantly outperforms state-of-the-art watermarking methods, improving detection accuracy by 6.3%-19.3% across diverse datasets including COCO and StableDiffusionDB. Additionally, ConceptWM possesses a critical capability absent in other watermarking methods: it sustains a 21.7% FID/CLIP degradation under adversarial fine-tuning of Stable Diffusion models on WikiArt and CelebA-HQ, demonstrating its capability to mitigate model misuse.
Computation and Language 91
☆ Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to single concepts. A core SAE training hyperparameter is L0: how many features should fire per token on average. Existing work compares SAE algorithms using sparsity--reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value. In this work we study the effect of L0 on BatchTopK SAEs, and show that if L0 is not set precisely, the SAE fails to learn the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we demonstrate a method to determine the correct L0 value for an SAE on a given training distribution, which finds the true L0 in toy models and coincides with peak sparse probing performance in LLMs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that, to train SAEs with correct features, practitioners must set L0 correctly.
☆ Transfer Learning via Lexical Relatedness: A Sarcasm and Hate Speech Case Study
Detecting hate speech in non-direct forms, such as irony, sarcasm, and innuendos, remains a persistent challenge for social networks. Although sarcasm and hate speech are regarded as distinct expressions, our work explores whether integrating sarcasm as a pre-training step improves implicit hate speech detection and, by extension, explicit hate speech detection. Incorporating samples from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus, we devised two training strategies to compare the effectiveness of sarcasm pre-training on a CNN+LSTM and BERT+BiLSTM model. The first strategy is a single-step training approach, where a model trained only on sarcasm is then tested on hate speech. The second strategy uses sequential transfer learning to fine-tune models for sarcasm, implicit hate, and explicit hate. Our results show that sarcasm pre-training improved the BERT+BiLSTM's recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples. By incorporating sarcasm into the training process, we show that models can more effectively detect both implicit and explicit hate.
☆ FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline EMNLP 2025
Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.
comment: To appear at EMNLP 2025
☆ HAMSA: Hijacking Aligned Compact Models via Stealthy Automation
Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search, where candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration and coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural language fluency. We evaluate our method on benchmarks in English (In-The-Wild Jailbreak Prompts on LLMs), and a newly curated Arabic one derived from In-The-Wild Jailbreak Prompts on LLMs and annotated by native Arabic linguists, enabling multilingual assessment.
comment: 9 pages, 1 figure; article under review
☆ LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models
The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in the industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs for building hierarchical text classifiers as a framework for a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.
comment: 20 pages excluding reference list, 2 figures
☆ What makes an entity salient in discourse?
Entities in discourse vary broadly in salience: main participants, objects and locations are noticeable and memorable, while tangential ones are less important and quickly forgotten, raising questions about how humans signal and infer relative salience. Using a graded operationalization of salience based on summary-worthiness in multiple summaries of a discourse, this paper explores data from 24 spoken and written genres of English to extract a multifactorial complex of overt and implicit linguistic cues, such as recurring subjecthood or definiteness, discourse relations and hierarchy across utterances, as well as pragmatic functional inferences based on genre and communicative intent. Tackling the question 'how is the degree of salience expressed for each and every entity mentioned?' our results show that while previous approaches to salience all correlate with our salience scores to some extent, no single generalization is without exceptions, and the phenomenon cuts across all levels of linguistic representation.
☆ A Probabilistic Inference Scaling Theory for LLM Self-Correction EMNLP 2025
Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - \alpha^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $\alpha$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
comment: EMNLP 2025 Main
☆ Anti-establishment sentiment on TikTok: Implications for understanding influence(rs) and expertise on social media AAAI
Distrust of public serving institutions and anti-establishment views are on the rise (especially in the U.S.). As people turn to social media for information, it is imperative to understand whether and how social media environments may be contributing to distrust of institutions. In social media, content creators, influencers, and other opinion leaders often position themselves as having expertise and authority on a range of topics from health to politics, and in many cases devalue and dismiss institutional expertise to build a following and increase their own visibility. However, the extent to which this content appears and whether such content increases engagement is unclear. This study analyzes the prevalence of anti-establishment sentiment (AES) on the social media platform TikTok. Despite its popularity as a source of information, TikTok remains relatively understudied and may provide important insights into how people form attitudes towards institutions. We employ a computational approach to label TikTok posts as containing AES or not across topical domains where content creators tend to frame themselves as experts: finance and wellness. As a comparison, we also consider the topic of conspiracy theories, where AES is expected to be common. We find that AES is most prevalent in conspiracy theory content, and relatively rare in content related to the other two topics. However, we find that engagement patterns with such content varies by area, and that there may be platform incentives for users to post content that expresses anti-establishment sentiment.
comment: 10 pages excluding references; 14 pages in total; 4 figures; Accepted by the AAAI Conference on Web and Social Media (ICWSM-2026)
☆ PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
☆ Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish
We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks ensuring content that reflects the linguistic and cultural richness of Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g. Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.
comment: 31 pages, 2 figures, 10 tables
☆ Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.
☆ AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions
Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.
comment: 15 pages
☆ RoMedQA: The First Benchmark for Romanian Medical Question Answering
Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/RoMedQA.
☆ ChatGPT-generated texts show authorship traits that identify them as non-human
Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence.
☆ GLARE: Agentic Reasoning for Legal Judgment Prediction
Legal judgment prediction (LJP) has become increasingly important in the legal field. In this paper, we identify that existing large language models (LLMs) have significant problems of insufficient reasoning due to a lack of legal knowledge. Therefore, we introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules, thereby improving the breadth and depth of reasoning. Experiments conducted on the real-world dataset verify the effectiveness of our method. Furthermore, the reasoning chain generated during the analysis process can increase interpretability and provide the possibility for practical applications.
☆ The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks
The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.
☆ MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains-such as Arabic legal contexts-remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.
☆ Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning
Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a low-frame-rate (12.5 Hz) content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. Particularly, during pre-training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance the AR model's ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2's effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are are available at https://versasinger.github.io/.
comment: We will release code and model checkpoints at https://github.com/open-mmlab/Amphion
☆ LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts
Large Language Models have found success in a variety of applications; however, their safety remains a matter of concern due to the existence of various types of jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide a certain degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a number of vulnerabilities, ranging from targeted misuse to accidental profiling of users. This work introduces \textbf{LLMSymGuard}, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, LLMSymGuard enables building symbolic, logical safety guardrails -- offering transparent and robust defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn human-interpretable concepts from jailbreaks, and provides a foundation for designing more interpretable and logical safeguard measures against attackers. Code will be released upon publication.
☆ Retrieval Enhanced Feedback via In-context Neural Error-book EMNLP 2025
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE's potential for enhancing multimodal reasoning.
comment: Accepted at EMNLP 2025 main conference
☆ JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus LREC
We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publication of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from the DOCDB, that is a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 bleu points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.
comment: LREC-COLING 2024
☆ LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining
Process mining is increasingly using textual information associated with events to tackle tasks such as anomaly detection and process discovery. Such semantics-aware process mining focuses on what behavior should be possible in a process (i.e., expectations), thus providing an important complement to traditional, frequency-based techniques that focus on recorded behavior (i.e., reality). Large Language Models (LLMs) provide a powerful means for tackling semantics-aware tasks. However, the best performance is so far achieved through task-specific fine-tuning, which is computationally intensive and results in models that can only handle one specific task. To overcome this lack of generalization, we use this paper to investigate the potential of instruction-tuning for semantics-aware process mining. The idea of instruction-tuning here is to expose an LLM to prompt-answer pairs for different tasks, e.g., anomaly detection and next-activity prediction, making it more familiar with process mining, thus allowing it to also perform better at unseen tasks, such as process discovery. Our findings demonstrate a varied impact of instruction-tuning: while performance considerably improved on process discovery and prediction tasks, it varies across models on anomaly detection tasks, highlighting that the selection of tasks for instruction-tuning is critical to achieving desired outcomes.
comment: Accepted at IEEE ICPM 2025, 8 pages, 2 figures
☆ From Confidence to Collapse in LLM Factual Robustness
Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
☆ M3TQA: Massively Multilingual Multitask Table Question Answering
Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.
☆ MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use
Large Language Models (LLMs) are evolving from text generators into reasoning agents. This transition makes their ability to use external tools a critical capability. However, evaluating this skill presents a significant challenge. Existing benchmarks are often limited by their reliance on synthetic tools and severely constrained action spaces. To address these limitations, we introduce MCPVerse, an expansive, real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale), revealing that while most models suffer performance degradation when confronted with larger tool sets, the agentic models, such as Claude-4-Sonnet, can effectively leverage expanded exploration spaces to improve accuracy. This finding not only exposes the limitations of state-of-the-art models in complex, real-world scenarios but also establishes MCPVerse as a critical benchmark for measuring and advancing agentic tool use capabilities.
☆ TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks IJCAI 2025
Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised fine-tuning (SFT). The results show that the capabilities of the models can be enhanced to effectively accomplish targeted tasks in this specific domain and language.
comment: IJCAI 2025 - FinLLM Workshop
☆ SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning EMNLP 2025
Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.
comment: Accepted at EMNLP 2025
☆ CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance
Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark -- Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) -- designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.
☆ ComicScene154: A Scene Dataset for Comic Analysis
Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.
☆ Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation EMNLP 2025
We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
comment: EMNLP 2025 (Findings)
☆ ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects
Large language models (LLMs) have been widely evaluated on tasks such as comprehension, question answering, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-orientated queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of around 11.5K questions in Hindi language comprising questionnaires from 16 diverse subjects. These questions are primarily derived from nation-wide graduate level entrance examination covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc., specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats-such as list-based matching, assertion-reason pairs, and sequence ordering-alongside conventional multiple-choice questions. We evaluated the performance of more than 17 open source LLMs on this benchmark, observing that Llama 3.3 70B attains the highest overall accuracy of 48%. Furthermore, subject-wise analysis indicates that even for the best performing LLMs, performance remains weak on topics such as music, classical instruments, politics and archaeology, underscoring persistent challenges in culturally grounded reasoning.
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely AgentFly, which attains top-1 on GAIA validation ($87.88\%$ Pass@$3$) and $79.40\%$ on the test set. It reaches $66.6\%$ F1 and $80.4\%$ PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds $4.7\%$ to $9.6\%$ absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/AgentFly.
☆ Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates
The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. An ideal estimation on hardwiring gpt-oss 120 B requires fabricating at least 6 billion dollars of photomask sets, rendering the straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 layers of photomasks are made homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x of GPU/WSE), 36 tokens/J (1,047x/283x of GPU/WSE), 13,232 mm2 total die area (29% inscribed rectangular area in a 300 mm wafer), \$184M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 8.57x cost-effectiveness and 230x carbon footprint reduction compared to H100 clusters, under an annual weight updating assumption.
☆ Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering ACM MM 2025
Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.
comment: This paper has been accepted by ACM MM 2025
☆ XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering EMNLP 2025
Large Language Models (LLMs) have shown significant progress in Open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answer, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.
comment: Accepted to EMNLP 2025 main conference. 12 pages, 4 figures, 7 tables. Code is available at https://github.com/ro-ko/XLQA
☆ Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection EMNLP 2025
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
comment: EMNLP 2025 Main Conference Full Paper
☆ Extending FKG.in: Towards a Food Claim Traceability Network
The global food landscape is rife with scientific, cultural, and commercial claims about what foods are, what they do, what they should not do, or should not do. These range from rigorously studied health benefits (probiotics improve gut health) and misrepresentations (soaked almonds make one smarter) to vague promises (superfoods boost immunity) and culturally rooted beliefs (cold foods cause coughs). Despite their widespread influence, the infrastructure for tracing, verifying, and contextualizing these claims remains fragmented and underdeveloped. In this paper, we propose a Food Claim-Traceability Network (FCN) as an extension of FKG.in, a knowledge graph of Indian food that we have been incrementally building. We also present the ontology design and the semi-automated knowledge curation workflow that we used to develop a proof of concept of FKG.in-FCN using Reddit data and Large Language Models. FCN integrates curated data inputs, structured schemas, and provenance-aware pipelines for food-related claim extraction and validation. While directly linked to the Indian food knowledge graph as an application, our methodology remains application-agnostic and adaptable to other geographic, culinary, or regulatory settings. By modeling food claims and their traceability in a structured, verifiable, and explainable way, we aim to contribute to more transparent and accountable food knowledge ecosystems, supporting researchers, policymakers, and most importantly, everyday consumers in navigating a world saturated with dietary assertions.
comment: 10 pages, 3 figures, 1 table, 45 references, ACM International Conference on Multimedia 2025 - Multi-modal Food Computing Workshop
☆ From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model's performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
☆ CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency EMNLP 2025
Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models-an answer generator and a question generator-are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart's generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct's efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
comment: EMNLP 2025 Main
☆ CEQuest: Benchmarking Large Language Models for Construction Estimation
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.
☆ InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles EMNLP 2025
LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o frequently rely on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs' capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human-AI interaction.
comment: EMNLP 2025 MainConference
☆ Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants
Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs in walking assistant task often have outputs that contain considerable redundancy and extraneous details, adversely affecting users' ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, to make WalkVLM-LR assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and less temporal redundancy.
☆ Ethical Considerations of Large Language Models in Game Playing
Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been observed from the behaviour of LLMs. Some roles, such as the Guard and Werewolf, are more sensitive than others to gender information, presented as a higher degree of behavioural change. We further examine scenarios in which gender information is implicitly conveyed through names, revealing that LLMs still exhibit discriminatory tendencies even in the absence of explicit gender labels. This research showcases the importance of developing fair and ethical LLMs. Beyond our research findings, we discuss the challenges and opportunities that lie ahead in this field, emphasising the need for diving deeper into the ethical implications of LLMs in gaming and other interactive domains.
comment: 19 pages
☆ Integrating Time Series into LLMs via Multi-layer Steerable Embedding Fusion for Enhanced Forecasting CIKM 2025
Time series (TS) data are ubiquitous across various application areas, rendering time series forecasting (TSF) a fundamental task. With the astounding advances in large language models (LLMs), a variety of methods have been developed to adapt LLMs for time series forecasting. Despite unlocking the potential of LLMs in comprehending TS data, existing methods are inherently constrained by their shallow integration of TS information, wherein LLMs typically access TS representations at shallow layers, primarily at the input layer. This causes the influence of TS representations to progressively fade in deeper layers and eventually leads to ineffective adaptation between textual embeddings and TS representations. In this paper, we propose the Multi-layer Steerable Embedding Fusion (MSEF), a novel framework that enables LLMs to directly access time series patterns at all depths, thereby mitigating the progressive loss of TS information in deeper layers. Specifically, MSEF leverages off-the-shelf time series foundation models to extract semantically rich embeddings, which are fused with intermediate text representations across LLM layers via layer-specific steering vectors. These steering vectors are designed to continuously optimize the alignment between time series and textual modalities and facilitate a layer-specific adaptation mechanism that ensures efficient few-shot learning capabilities. Experimental results on seven benchmarks demonstrate significant performance improvements by MSEF compared with baselines, with an average reduction of 31.8% in terms of MSE. The code is available at https://github.com/One1sAll/MSEF.
comment: To be published in CIKM 2025
☆ Generative Foundation Model for Structured and Unstructured Electronic Health Records
Electronic health records (EHRs) are rich clinical data sources but complex repositories of patient data, spanning structured elements (demographics, vitals, lab results, codes), unstructured clinical notes and other modalities of data. Harnessing this heterogeneity is critical for improving patient outcomes. Recent advances in large language models (LLMs) have enabled foundation models that can learn from multiple data modalities and support clinical tasks. However, most current approaches simply serialize numeric EHR data into text, which risks losing temporal and quantitative detail. We introduce Generative Deep Patient (GDP), a multimodal foundation model that natively encodes structured EHR time-series via a CNN-Transformer encoder and fuses it with unstructured EHRs through cross-modal attention into a LLaMA-based decoder. GDP is trained in two stages: (1) generative pretraining, where it learns to produce clinical narratives from raw patient timelines while also performing masked feature prediction (MFP) and next time-step prediction (NTP) to capture temporal dynamics; and (2) multi-task fine-tuning for clinically meaningful predictions (e.g., heart failure, type 2 diabetes, 30-day readmission). In clinical prediction, GDP demonstrated superior performance on MIMIC-IV: heart failure AUROC = 0.923, type 2 diabetes AUROC = 0.817, and 30-day readmission AUROC = 0.627. For narrative generation, GDP achieved ROUGE-L = 0.135 and BERTScore-F1 = 0.545. In a blinded human evaluation, GDP-Instruct scored highest on faithfulness, fluency, and overall clinical utility, suggesting reduced hospital documentation workload without sacrificing accuracy. Our results demonstrate that a single multimodal foundation model can both predict clinically actionable events and generate high-quality clinical narratives. Furthermore, GDP's flexible architecture can be extended to additional modalities.
☆ OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
☆ X-Troll: eXplainable Detection of State-Sponsored Information Operations Agents CIKM2025
State-sponsored trolls, malicious actors who deploy sophisticated linguistic manipulation in coordinated information campaigns, posing threats to online discourse integrity. While Large Language Models (LLMs) achieve strong performance on general natural language processing (NLP) tasks, they struggle with subtle propaganda detection and operate as ``black boxes'', providing no interpretable insights into manipulation strategies. This paper introduces X-Troll, a novel framework that bridges this gap by integrating explainable adapter-based LLMs with expert-derived linguistic knowledge to detect state-sponsored trolls and provide human-readable explanations for its decisions. X-Troll incorporates appraisal theory and propaganda analysis through specialized LoRA adapters, using dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. Experiments on real-world data demonstrate that our linguistically-informed approach shows strong performance compared with both general LLM baselines and existing troll detection models in accuracy while providing enhanced transparency through expert-grounded explanations that reveal the specific linguistic strategies used by state-sponsored actors. X-Troll source code is available at: https://github.com/ltian678/xtroll_source/.
comment: 14 pages, 4 figures, 4 tables, accepted by CIKM2025
☆ Political Ideology Shifts in Large Language Models
Large language models (LLMs) are increasingly deployed in politically sensitive settings, raising concerns about their potential to encode, amplify, or be steered toward specific ideologies. We investigate how adopting synthetic personas influences ideological expression in LLMs across seven models (7B-70B+ parameters) from multiple families, using the Political Compass Test as a standardized probe. Our analysis reveals four consistent patterns: (i) larger models display broader and more polarized implicit ideological coverage; (ii) susceptibility to explicit ideological cues grows with scale; (iii) models respond more strongly to right-authoritarian than to left-libertarian priming; and (iv) thematic content in persona descriptions induces systematic and predictable ideological shifts, which amplify with size. These findings indicate that both scale and persona content shape LLM political behavior. As such systems enter decision-making, educational, and policy contexts, their latent ideological malleability demands attention to safeguard fairness, transparency, and safety.
♻ ☆ Establishing Task Scaling Laws via Compute-Efficient Model Ladders
We develop task scaling laws and model ladders to predict the individual task performance of pretrained language models (LMs) in the overtrained setting. Standard power laws for language modeling loss cannot accurately model task performance. Therefore, we leverage a two-step prediction approach: (1) use model and data size to predict an intermediate loss, then (2) use it to predict task performance. We train a set of small-scale "ladder" models, collect data points to fit the parameterized functions of the two prediction steps, and make predictions for two target models: a 7B model trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder models only costs 1% of the compute used for the target models. On four multiple-choice tasks formatted as ranked classification, we can predict the accuracy of both target models within 2 points of absolute error. We find that tasks with higher prediction error also have higher variance in the metrics over model checkpoints. We also contrast multiple design choices for predicting accuracy, and present recommendations for extending our method to new models and tasks.
comment: COLM 2025
♻ ☆ Can Large Language Models Simulate Human Responses? A Case Study of Stated Preference Experiments in the Context of Heating-related Choices
Stated preference (SP) surveys are a key method to research how individuals make trade-offs in hypothetical, also futuristic, scenarios. In energy context this includes key decarbonisation enablement contexts, such as low-carbon technologies, distributed renewable energy generation, and demand-side response [1,2]. However, they tend to be costly, time-consuming, and can be affected by respondent fatigue and ethical constraints. Large language models (LLMs) have demonstrated remarkable capabilities in generating human-like textual responses, prompting growing interest in their application to survey research. This study investigates the use of LLMs to simulate consumer choices in energy-related SP surveys and explores their integration into data analysis workflows. A series of test scenarios were designed to systematically assess the simulation performance of several LLMs (LLaMA 3.1, Mistral, GPT-3.5 and DeepSeek-R1) at both individual and aggregated levels, considering contexts factors such as prompt design, in-context learning (ICL), chain-of-thought (CoT) reasoning, LLM types, integration with traditional choice models, and potential biases. Cloud-based LLMs do not consistently outperform smaller local models. In this study, the reasoning model DeepSeek-R1 achieves the highest average accuracy (77%) and outperforms non-reasoning LLMs in accuracy, factor identification, and choice distribution alignment. Across models, systematic biases are observed against the gas boiler and no-retrofit options, with a preference for more energy-efficient alternatives. The findings suggest that previous SP choices are the most effective input factor, while longer prompts with additional factors and varied formats can cause LLMs to lose focus, reducing accuracy.
♻ ☆ Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
♻ ☆ Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models
Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for limited languages, and may encounter new languages in real-world. Developing a SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existed SSL models to a new language without impairing its original abilities. We propose adaptation methods which integrate LoRA to existed SSL models to extend new language. We also develop preservation strategies which include data combination and re-clustering to retain abilities on existed languages. Applied to mHuBERT, we investigate their effectiveness on speech re-synthesis task. Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin) with MOS value increased about 1.6 and the relative value of WER reduced up to 61.72%. Also, our preservation strategies ensure that the performance on both existed and new languages remains intact.
comment: Accepted by Interspeech 2024
♻ ☆ Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora
While Large Language Models (LLMs) have shown potential in speech generation and recognition, their applications are mainly confined to monolingual scenarios, with limited explorations in code-switched (CS) contexts. In this paper, we propose a Code-Switched Large Language Model (CS-LLM) to enhance the code-switched text-to-speech synthesis (CS TTS) capability in LLMs with only monolingual corpora. Specifically, we begin by enhancing the multilingual speech processing ability of LLMs through multilingual speech recognition and synthesis tasks. Then, we develop an effective code-switched (CS) data construction strategy that splits and concatenates words from different monolingual speech corpora to equip LLMs with improved CS TTS ability. Experiments show that our approach outperforms baselines in CS TTS in terms of naturalness, speaker consistency and similarity even with limited data. Additionally, the constructed CS data further improves multilingual speech synthesis and recognition.
comment: Accepted to ASRU2025
♻ ☆ Psyche-R1: Towards Reliable Psychological LLMs through Unified Empathy, Expertise, and Reasoning
Amidst a shortage of qualified mental health professionals, the integration of large language models (LLMs) into psychological applications offers a promising way to alleviate the growing burden of mental health disorders. Recent reasoning-augmented LLMs have achieved remarkable performance in mathematics and programming, while research in the psychological domain has predominantly emphasized emotional support and empathetic dialogue, with limited attention to reasoning mechanisms that are beneficial to generating reliable responses. Therefore, in this paper, we propose Psyche-R1, the first Chinese psychological LLM that jointly integrates empathy, psychological expertise, and reasoning, built upon a novel data curation pipeline. Specifically, we design a comprehensive data synthesis pipeline that produces over 75k high-quality psychological questions paired with detailed rationales, generated through chain-of-thought (CoT) reasoning and iterative prompt-rationale optimization, along with 73k empathetic dialogues. Subsequently, we employ a hybrid training strategy wherein challenging samples are identified through a multi-LLM cross-selection strategy for group relative policy optimization (GRPO) to improve reasoning ability, while the remaining data is used for supervised fine-tuning (SFT) to enhance empathetic response generation and psychological domain knowledge. Extensive experiment results demonstrate the effectiveness of the Psyche-R1 across several psychological benchmarks, where our 7B Psyche-R1 achieves comparable results to 671B DeepSeek-R1.
♻ ☆ Is Small Language Model the Silver Bullet to Low-Resource Languages Machine Translation?
Low-resource languages (LRLs) lack sufficient linguistic resources and are underrepresented in benchmark datasets, resulting in persistently lower translation quality than high-resource languages, especially in privacy-sensitive and resource-limited contexts. Firstly, this study systematically evaluates state-of-the-art smaller Large Language Models in 200 languages using the FLORES-200 benchmark, highlighting persistent deficiencies and disparities in the translation of LRLs. To mitigate these limitations, we investigate knowledge distillation from large pre-trained teacher models to Small Language Models (SLMs) through supervised fine-tuning. The results show substantial improvements; for example, the translation performance of English to Luxembourgish (EN to LB), measured by the LLM-as-a-Judge score, increases from 0.36 to 0.89 in the validation set for Llama-3.2-3B. We further investigate various fine-tuning configurations and tasks to clarify the trade-offs between data scale and training efficiency, verify that the model retains its general capabilities without significant catastrophic forgetting after training, and explore the distillation benefits to other LRLs on SLMs (Khasi, Assamese, and Ukrainian). In general, this work exposes the limitations and fairness issues of current SLMs in LRL translation and systematically explores the potential of using the distillation of knowledge from large to small models, offering practical, empirically grounded recommendations to improve LRL translation systems
♻ ☆ CAMA: Enhancing Multimodal In-Context Learning with Context-Aware Modulated Attention
Multimodal in-context learning (ICL) is emerging as a key capability that enables large vision-language models (LVLMs) to adapt to novel tasks without parameter updates, expanding their utility across various real-world applications. However, ICL remains unstable, even with well-matched in-context demonstrations (ICDs), suggesting that LVLMs struggle to fully utilize the provided context. While existing efforts focus on prompt engineering or post-hoc logit calibration, we instead investigate the underlying attention dynamics to overcome LVLMs' inherent limitations. We identify two critical deficits in their self-attention that impair effective ICL. To bridge the gap, we propose \textbf{Context-Aware Modulated Attention} (CAMA), a plug-and-play and training-free method that dynamically modulates LVLM's attention logits based on the input in-context sequence. CAMA employs a two-stage attention modulation to address both identified deficits, enhancing the focus on semantically significant tokens, particularly visual ones. Across four LVLMs and seven benchmarks, CAMA consistently outperforms vanilla models and baselines, demonstrating great effectiveness and generalization. It can also activate the desired effects of prompt engineering methods and remains robust under diverse sequence configurations. Thus, CAMA paves the way for deeper explorations of attention dynamics to advance multimodal reasoning.
comment: 14 pages, 8 figures, 5 tables
♻ ☆ Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap" -- a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find a contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze its limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one's length. Training with \mname, where both responses in each sample are truncated to equal length, resulting in diverse truncated lengths across samples, the optimization of DAAs objective is implicitly constrained to converge across all timesteps of token-level MDP, thus paying more attention to prefix tokens than the standard DAAs. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
♻ ☆ MultiBLiMP 1.0: A Massively Multilingual Benchmark of Linguistic Minimal Pairs ACL
We introduce MultiBLiMP 1.0, a massively multilingual benchmark of linguistic minimal pairs, covering 101 languages and 2 types of subject-verb agreement, containing more than 128,000 minimal pairs. Our minimal pairs are created using a fully automated pipeline, leveraging the large-scale linguistic resources of Universal Dependencies and UniMorph. MultiBLiMP 1.0 evaluates abilities of LLMs at an unprecedented multilingual scale, and highlights the shortcomings of the current state-of-the-art in modelling low-resource languages.
comment: Published in TACL, MIT Press
♻ ☆ Collaborative Stance Detection via Small-Large Language Model Consistency Verification
Stance detection on social media aims to identify attitudes expressed in tweets towards specific targets. Current studies prioritize Large Language Models (LLMs) over Small Language Models (SLMs) due to the overwhelming performance improving provided by LLMs. However, heavily relying on LLMs for stance detection, regardless of the cost, is impractical for real-world social media monitoring systems that require vast data analysis. To this end, we propose \textbf{\underline{Co}}llaborative Stance Detection via Small-Large Language Model Consistency \textbf{\underline{Ver}}ification (\textbf{CoVer}) framework, which enhances LLM utilization via context-shared batch reasoning and logical verification between LLM and SLM. Specifically, instead of processing each text individually, CoVer processes texts batch-by-batch, obtaining stance predictions and corresponding explanations via LLM reasoning in a shared context. Then, to exclude the bias caused by context noises, CoVer introduces the SLM for logical consistency verification. Finally, texts that repeatedly exhibit low logical consistency are classified using consistency-weighted aggregation of prior LLM stance predictions. Our experiments show that CoVer outperforms state-of-the-art methods across multiple benchmarks in the zero-shot setting, achieving 0.54 LLM queries per tweet while significantly enhancing performance. Our CoVer offers a more practical solution for LLM deploying for social media stance detection.
♻ ☆ Rotary Offset Features in Large Language Models
Transformer-based Large Language Models (LLMs) rely on positional encodings to provide sequence position information to their attention mechanism. Rotary Positional Encodings (RoPE), which encode relative position by rotating queries and keys, have become widely used in modern LLMs. We study the features and patterns that emerge in queries and keys when using rotary embeddings and introduce the concept of rotary offset features. Our analysis reveals that these features, which frequently exhibit large activations and are often interpreted as outliers, arise consistently across layers, attention heads, and model architectures. We derive bounds predicting which rotary frequencies give rise to rotary offset features and the minimum angle between the query-key pairs for these features. We verify our predictions empirically across models of different sizes and architectures.
♻ ☆ Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG EMNLP25
Retrieval models typically rely on costly human-labeled query-document relevance annotations for training and evaluation. To reduce this cost and leverage the potential of Large Language Models (LLMs) in relevance judgments, we aim to explore whether LLM-generated annotations can effectively replace human annotations in training retrieval models. Retrieval usually emphasizes relevance, which indicates "topic-relatedness" of a document to a query, while in RAG, the value of a document (or utility) depends on how it contributes to answer generation. Recognizing this mismatch, some researchers use LLM performance on downstream tasks with documents as labels, but this approach requires manual answers for specific tasks, leading to high costs and limited generalization. In another line of work, prompting LLMs to select useful documents as RAG references eliminates the need for human annotation and is not task-specific. If we leverage LLMs' utility judgments to annotate retrieval data, we may retain cross-task generalization without human annotation in large-scale corpora. Therefore, we investigate utility-focused annotation via LLMs for large-scale retriever training data across both in-domain and out-of-domain settings on the retrieval and RAG tasks. To reduce the impact of low-quality positives labeled by LLMs, we design a novel loss function, i.e., Disj-InfoNCE. Our experiments reveal that: (1) Retrievers trained on utility-focused annotations significantly outperform those trained on human annotations in the out-of-domain setting on both tasks, demonstrating superior generalization capabilities. (2) LLM annotation does not replace human annotation in the in-domain setting. However, incorporating just 20% human-annotated data enables retrievers trained with utility-focused annotations to match the performance of models trained entirely with human annotations.
comment: Accepted by the EMNLP25 main conference
♻ ☆ Sentiment Reasoning for Healthcare ACL 2025
Transparency in AI healthcare decision-making is crucial. By incorporating rationales to explain reason for each predicted label, users could understand Large Language Models (LLMs)'s reasoning to make better decision. In this work, we introduce a new task - Sentiment Reasoning - for both speech and text modalities, and our proposed multimodal multitask framework and the world's largest multimodal sentiment analysis dataset. Sentiment Reasoning is an auxiliary task in sentiment analysis where the model predicts both the sentiment label and generates the rationale behind it based on the input transcript. Our study conducted on both human transcripts and Automatic Speech Recognition (ASR) transcripts shows that Sentiment Reasoning helps improve model transparency by providing rationale for model prediction with quality semantically comparable to humans while also improving model's classification performance (+2% increase in both accuracy and macro-F1) via rationale-augmented fine-tuning. Also, no significant difference in the semantic quality of generated rationales between human and ASR transcripts. All code, data (five languages - Vietnamese, English, Chinese, German, and French) and models are published online: https://github.com/leduckhai/Sentiment-Reasoning
comment: ACL 2025 Industry Track (Oral)
♻ ☆ PublicHearingBR: A Brazilian Portuguese Dataset of Public Hearing Transcripts for Summarization of Long Documents
This paper introduces PublicHearingBR, a Brazilian Portuguese dataset designed for summarizing long documents. The dataset consists of transcripts of public hearings held by the Brazilian Chamber of Deputies, paired with news articles and structured summaries containing the individuals participating in the hearing and their statements or opinions. The dataset supports the development and evaluation of long document summarization systems in Portuguese. Our contributions include the dataset, a hybrid summarization system to establish a baseline for future studies, and a discussion of evaluation metrics for summarization involving large language models, addressing the challenge of hallucination in the generated summaries. As a result of this discussion, the dataset also includes annotated data to evaluate natural language inference tasks in Portuguese.
comment: 23 pages
♻ ☆ PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing
To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as the based templates in a counterfactual manner. Subsequently, we decompose each based template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.
♻ ☆ Towards Privacy-aware Mental Health AI Models: Advances, Challenges, and Opportunities
Mental health disorders create profound personal and societal burdens, yet conventional diagnostics are resource-intensive and limit accessibility. Advances in artificial intelligence, particularly natural language processing and multimodal methods, offer promise for detecting and addressing mental disorders, but raise critical privacy risks. This paper examines these challenges and proposes solutions, including anonymization, synthetic data, and privacy-preserving training, while outlining frameworks for privacy-utility trade-offs, aiming to advance reliable, privacy-aware AI tools that support clinical decision-making and improve mental health outcomes.
comment: 18 pages, 2 figures
♻ ☆ Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a fewshot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual fewshot examples, which are contextualized by preceding examples through connections between fewshot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.
♻ ☆ from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors
Current studies have exposed the risk of Large Language Models (LLMs) generating harmful content by jailbreak attacks. However, they overlook that the direct generation of harmful content from scratch is more difficult than inducing LLM to calibrate benign content into harmful forms. In our study, we introduce a novel attack framework that exploits AdVersArial meTAphoR (AVATAR) to induce the LLM to calibrate malicious metaphors for jailbreaking. Specifically, to answer harmful queries, AVATAR adaptively identifies a set of benign but logically related metaphors as the initial seed. Then, driven by these metaphors, the target LLM is induced to reason and calibrate about the metaphorical content, thus jailbroken by either directly outputting harmful responses or calibrating residuals between metaphorical and professional harmful content. Experimental results demonstrate that AVATAR can effectively and transferable jailbreak LLMs and achieve a state-of-the-art attack success rate across multiple advanced LLMs.
comment: arXiv admin note: substantial text overlap with arXiv:2412.12145
♻ ☆ Cyberbullying Detection via Aggression-Enhanced Prompting
Detecting cyberbullying on social media remains a critical challenge due to its subtle and varied expressions. This study investigates whether integrating aggression detection as an auxiliary task within a unified training framework can enhance the generalisation and performance of large language models (LLMs) in cyberbullying detection. Experiments are conducted on five aggression datasets and one cyberbullying dataset using instruction-tuned LLMs. We evaluated multiple strategies: zero-shot, few-shot, independent LoRA fine-tuning, and multi-task learning (MTL). Given the inconsistent results of MTL, we propose an enriched prompt pipeline approach in which aggression predictions are embedded into cyberbullying detection prompts to provide contextual augmentation. Preliminary results show that the enriched prompt pipeline consistently outperforms standard LoRA fine-tuning, indicating that aggression-informed context significantly boosts cyberbullying detection. This study highlights the potential of auxiliary tasks, such as aggression detection, to improve the generalisation of LLMs for safety-critical applications on social networks.
comment: Accepted to RANLP 2025
♻ ☆ Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment EMNLP 2025
Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.
comment: Accepted at EMNLP 2025
♻ ☆ Can Hallucinations Help? Boosting LLMs for Drug Discovery
Hallucinations in large language models (LLMs), plausible but factually inaccurate text, are often viewed as undesirable. However, recent work suggests that such outputs may hold creative potential. In this paper, we investigate whether hallucinations can improve LLMs on molecule property prediction, a key task in early-stage drug discovery. We prompt LLMs to generate natural language descriptions from molecular SMILES strings and incorporate these often hallucinated descriptions into downstream classification tasks. Evaluating seven instruction-tuned LLMs across five datasets, we find that hallucinations significantly improve predictive accuracy for some models. Notably, Falcon3-Mamba-7B outperforms all baselines when hallucinated text is included, while hallucinations generated by GPT-4o consistently yield the greatest gains between models. We further identify and categorize over 18,000 beneficial hallucinations, with structural misdescriptions emerging as the most impactful type, suggesting that hallucinated statements about molecular structure may increase model confidence. Ablation studies show that larger models benefit more from hallucinations, while temperature has a limited effect. Our findings challenge conventional views of hallucination as purely problematic and suggest new directions for leveraging hallucinations as a useful signal in scientific modeling tasks like drug discovery.
♻ ☆ MedArabiQ: Benchmarking Large Language Models on Arabic Medical Tasks
Large Language Models (LLMs) have demonstrated significant promise for various applications in healthcare. However, their efficacy in the Arabic medical domain remains unexplored due to the lack of high-quality domain-specific datasets and benchmarks. This study introduces MedArabiQ, a novel benchmark dataset consisting of seven Arabic medical tasks, covering multiple specialties and including multiple choice questions, fill-in-the-blank, and patient-doctor question answering. We first constructed the dataset using past medical exams and publicly available datasets. We then introduced different modifications to evaluate various LLM capabilities, including bias mitigation. We conducted an extensive evaluation with five state-of-the-art open-source and proprietary LLMs, including GPT-4o, Claude 3.5-Sonnet, and Gemini 1.5. Our findings highlight the need for the creation of new high-quality benchmarks that span different languages to ensure fair deployment and scalability of LLMs in healthcare. By establishing this benchmark and releasing the dataset, we provide a foundation for future research aimed at evaluating and enhancing the multilingual capabilities of LLMs for the equitable use of generative AI in healthcare.
comment: 21 pages
♻ ☆ DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
♻ ☆ Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages
Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotater-based localization. Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improves robustness when native entities are introduced across various languages.
♻ ☆ Dynamically Adaptive Reasoning via LLM-Guided MCTS for Efficient and Context-Aware KGQA
Knowledge Graph Question Answering (KGQA) aims to interpret natural language queries and perform structured reasoning over knowledge graphs by leveraging their relational and semantic structures to retrieve accurate answers. Recent KGQA methods primarily follow either retrieve-then-reason paradigm, relying on GNNs or heuristic rules for static paths extraction, or dynamic path generation strategies that use large language models (LLMs) with prompting to jointly perform retrieval and reasoning. However, the former suffers from limited adaptability due to static path extraction and lack of contextual refinement, while the latter incurs high computational costs and struggles with accurate path evaluation due to reliance on fixed scoring functions and extensive LLM calls. To address these issues, this paper proposes Dynamically Adaptive MCTS-based Reasoning (DAMR), a novel framework that integrates symbolic search with adaptive path evaluation for efficient and context-aware KGQA. DAMR employs a Monte Carlo Tree Search (MCTS) backbone guided by an LLM-based planner, which selects top-$k$ relevant relations at each step to reduce search space. To improve path evaluation accuracy, we introduce a lightweight Transformer-based scorer that performs context-aware plausibility estimation by jointly encoding the question and relation sequence through cross-attention, enabling the model to capture fine-grained semantic shifts during multi-hop reasoning. Furthermore, to alleviate the scarcity of high-quality supervision, DAMR incorporates a dynamic pseudo-path refinement mechanism that periodically generates training signals from partial paths explored during search, allowing the scorer to continuously adapt to the evolving distribution of reasoning trajectories. Extensive experiments on multiple KGQA benchmarks show that DAMR significantly outperforms state-of-the-art methods.
♻ ☆ Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding
We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$\theta$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
comment: 11 pages, 11 figures + Appendix. work under submission
♻ ☆ ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker \textbf{ReasonRank} outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. \textbf{Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance 40.6 on the BRIGHT leaderboard\footnote{https://brightbenchmark.github.io/}.} Our codes are available at https://github.com/8421BCD/ReasonRank.
comment: 21 pages
♻ ☆ SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. First, SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models. To improve draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. Our code is available at https://github.com/jycha98/SpecExtend .
♻ ☆ CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems -- a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human-designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research. CO-Bench is publicly available at https://github.com/sunnweiwei/CO-Bench.
♻ ☆ One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs ICML 2025
Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives on the community of mathematical LLMs.
comment: ICML 2025
♻ ☆ Prompting Techniques for Reducing Social Bias in LLMs through System 1 and System 2 Cognitive Processes
Dual process theory posits that human cognition arises via two systems. System 1, which is a quick, emotional, and intuitive process, which is subject to cognitive biases, and System 2, is a slow, onerous, and deliberate process. Prior research in LLMs found that using chain-of-thought (CoT) prompting in LLMs, which has been often compared to System 2 reasoning, can lead to reduced gender bias. Along these lines, we investigate the relationship between bias, CoT prompting, a direct debiasing, and dual process theory modeling in LLMs. We compare zero-shot CoT, debiasing, and dual process theory-based prompting strategies on two bias datasets spanning nine different social bias categories. We incorporate human and machine personas to determine whether LLM modeling of the effects of dual process theory exist independent of explicit persona models or are tied to the LLM's modeling of human-like generation. We find that a human persona, debiasing, System 2, and CoT prompting all tend to reduce social biases in LLMs, though the best combination of features depends on the exact model and bias category -- resulting in up to a 33 percent drop in stereotypical judgments by an LLM.
comment: Accepted at RANLP-2025 (main conference)
♻ ☆ Efficient RL Training for Reasoning Models via Length-Aware Optimization
Large reasoning models, such as OpenAI o1 or DeepSeek R1, have demonstrated remarkable performance on reasoning tasks but often incur a long reasoning path with significant memory and time costs. Existing methods primarily aim to shorten reasoning paths by introducing additional training data and stages. In this paper, we propose three critical reward designs integrated directly into the reinforcement learning process of large reasoning models, which reduce the response length without extra training stages. Experiments on four settings show that our method significantly decreases response length while maintaining or even improving performance. Specifically, in a logic reasoning setting, we achieve a 40% reduction in response length averaged by steps alongside a 14% gain in performance. For math problems, we reduce response length averaged by steps by 33% while preserving performance.
comment: Under review
♻ ☆ A Survey of Deep Learning for Geometry Problem Solving
Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
comment: Work in progress
♻ ☆ Revealing the Role of Audio Channels in ASR Performance Degradation
Pre-trained automatic speech recognition (ASR) models have demonstrated strong performance on a variety of tasks. However, their performance can degrade substantially when the input audio comes from different recording channels. While previous studies have demonstrated this phenomenon, it is often attributed to the mismatch between training and testing corpora. This study argues that variations in speech characteristics caused by different recording channels can fundamentally harm ASR performance. To address this limitation, we propose a normalization technique designed to mitigate the impact of channel variation by aligning internal feature representations in the ASR model with those derived from a clean reference channel. This approach significantly improves ASR performance on previously unseen channels and languages, highlighting its ability to generalize across channel and language differences.
comment: Accepted to IEEE ASRU 2025
♻ ☆ DRP: Distilled Reasoning Pruning with Skill-aware Step Decomposition for Efficient Large Reasoning Models
While Large Reasoning Models (LRMs) have demonstrated success in complex reasoning tasks through long chain-of-thought (CoT) reasoning, their inference often involves excessively verbose reasoning traces, resulting in substantial inefficiency. To address this, we propose Distilled Reasoning Pruning (DRP), a hybrid framework that combines inference-time pruning with tuning-based distillation, two widely used strategies for efficient reasoning. DRP uses a teacher model to perform skill-aware step decomposition and content pruning, and then distills the pruned reasoning paths into a student model, enabling it to reason both efficiently and accurately. Across several challenging mathematical reasoning datasets, we find that models trained with DRP achieve substantial improvements in token efficiency without sacrificing accuracy. Specifically, DRP reduces average token usage on GSM8K from 917 to 328 while improving accuracy from 91.7% to 94.1%, and achieves a 43% token reduction on AIME with no performance drop. Further analysis shows that aligning the reasoning structure of training CoTs with the student's reasoning capacity is critical for effective knowledge transfer and performance gains.
♻ ☆ SinLlama -- A Large Language Model for Sinhala
Low-resource languages such as Sinhala are often overlooked by open-source Large Language Models (LLMs). In this research, we extend an existing multilingual LLM (Llama-3-8B) to better serve Sinhala. We enhance the LLM tokenizer with Sinhala specific vocabulary and perform continual pre-training on a cleaned 10 million Sinhala corpus, resulting in the SinLlama model. This is the very first decoder-based open-source LLM with explicit Sinhala support. When SinLlama was instruction fine-tuned for three text classification tasks, it outperformed base and instruct variants of Llama-3-8B by a significant margin.
♻ ☆ Utilizing Multilingual Encoders to Improve Large Language Models for Low-Resource Languages
Large Language Models (LLMs) excel in English, but their performance degrades significantly on low-resource languages (LRLs) due to English-centric training. While methods like LangBridge align LLMs with multilingual encoders such as the Massively Multilingual Text-to-Text Transfer Transformer (mT5), they typically use only the final encoder layer. We propose a novel architecture that fuses all intermediate layers, enriching the linguistic information passed to the LLM. Our approach features two strategies: (1) a Global Softmax weighting for overall layer importance, and (2) a Transformer Softmax model that learns token-specific weights. The fused representations are mapped into the LLM's embedding space, enabling it to process multilingual inputs. The model is trained only on English data, without using any parallel or multilingual data. Evaluated on XNLI, IndicXNLI, Sinhala News Classification, and Amazon Reviews, our Transformer Softmax model significantly outperforms the LangBridge baseline. We observe strong performance gains in LRLs, improving Sinhala classification accuracy from 71.66% to 75.86% and achieving clear improvements across Indic languages such as Tamil, Bengali, and Malayalam. These specific gains contribute to an overall boost in average XNLI accuracy from 70.36% to 71.50%. This approach offers a scalable, data-efficient path toward more capable and equitable multilingual LLMs.
♻ ☆ MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge
Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks but face significant challenges with complex, knowledge-intensive multi-hop queries, particularly those involving new or long-tail knowledge. Existing benchmarks often fail to fully address these challenges. To bridge this gap, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a comprehensive benchmark to evaluate LLMs' capabilities in multi-hop reasoning across four critical dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative or dynamic decomposition and retrieval. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge, with each question equipped with corresponding sub-questions and answers. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries, particularly in handling new or unpopular knowledge. Our findings highlight critical challenges and offer insights for advancing multi-hop reasoning capabilities. The MINTQA benchmark is available at https://github.com/probe2/multi-hop/.
♻ ☆ On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions EMNLP 2025
Conceptualization, a fundamental element of human cognition, plays a pivotal role in human generalizable reasoning. Generally speaking, it refers to the process of sequentially abstracting specific instances into higher-level concepts and then forming abstract knowledge that can be applied in unfamiliar or novel situations. This enhances models' inferential capabilities and supports the effective transfer of knowledge across various domains. Despite its significance, the broad nature of this term has led to inconsistencies in understanding conceptualization across various works, as there exists different types of instances that can be abstracted in a wide variety of ways. There is also a lack of a systematic overview that comprehensively examines existing works on the definition, execution, and application of conceptualization to enhance reasoning tasks. In this paper, we address these gaps by first proposing a categorization of different types of conceptualizations into four levels based on the types of instances being conceptualized, in order to clarify the term and define the scope of our work. Then, we present the first comprehensive survey of over 150 papers, surveying various definitions, resources, methods, and downstream applications related to conceptualization into a unified taxonomy, with a focus on the entity and event levels. Furthermore, we shed light on potential future directions in this field and hope to garner more attention from the community.
comment: Findings of EMNLP 2025
♻ ☆ A Toolbox, Not a Hammer -- Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation EMNLP
Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, achieving average improvements of 6.0% to 7.5% over state-of-the-art baselines.
comment: Published at EMNLP Findings 2025; 21 pages, 3 figures
♻ ☆ Robust Bias Detection in MLMs and its Application to Human Trait Ratings NAACL 2025
There has been significant prior work using templates to study bias against demographic attributes in MLMs. However, these have limitations: they overlook random variability of templates and target concepts analyzed, assume equality amongst templates, and overlook bias quantification. Addressing these, we propose a systematic statistical approach to assess bias in MLMs, using mixed models to account for random effects, pseudo-perplexity weights for sentences derived from templates and quantify bias using statistical effect sizes. Replicating prior studies, we match on bias scores in magnitude and direction with small to medium effect sizes. Next, we explore the novel problem of gender bias in the context of $\textit{personality}$ and $\textit{character}$ traits, across seven MLMs (base and large). We find that MLMs vary; ALBERT is unbiased for binary gender but the most biased for non-binary $\textit{neo}$, while RoBERTa-large is the most biased for binary gender but shows small to no bias for $\textit{neo}$. There is some alignment of MLM bias and findings in psychology (human perspective) - in $\textit{agreeableness}$ with RoBERTa-large and $\textit{emotional stability}$ with BERT-large. There is general agreement for the remaining 3 personality dimensions: both sides observe at most small differences across gender. For character traits, human studies on gender bias are limited thus comparisons are not feasible.
comment: Findings of NAACL 2025
♻ ☆ SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling
Process or step-wise supervision has played a crucial role in advancing complex multi-step reasoning capabilities of Large Language Models (LLMs). However, efficient, high-quality automated process annotation remains a significant challenge. To address this, we introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation by jointly aligning solution steps to reference solutions and determine its accuracy with explicit reasoning in single generation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP), showing consistent improvements in two applications: (1) training Process Reward Models (PRMs) for ranking and aggregating multiple generations, and (2) fine-tuning models via offline reinforcement learning for greedy decoding. On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only $\sim$16% of training samples compared to human-labeled and other synthetically trained baselines. Additionally, it achieves competitive performance with MCTS-based methods while offering 2.3$\times$ speedup in terms of total token count. Manual analysis reveals complementary precision-recall characteristics with MCTS approaches, suggesting potential for ensemble methods. These results establish SPARE as a practical and scalable solution for automatic process supervision in LLM reasoning.
comment: 7 pages main content, 3 figures, 6 tables
Information Retrieval 25
☆ ORCA: Mitigating Over-Reliance for Multi-Task Dwell Time Prediction with Causal Decoupling CIKM 2025
Dwell time (DT) is a critical post-click metric for evaluating user preference in recommender systems, complementing the traditional click-through rate (CTR). Although multi-task learning is widely adopted to jointly optimize DT and CTR, we observe that multi-task models systematically collapse their DT predictions to the shortest and longest bins, under-predicting the moderate durations. We attribute this moderate-duration bin under-representation to over-reliance on the CTR-DT spurious correlation, and propose ORCA to address it with causal-decoupling. Specifically, ORCA explicitly models and subtracts CTR's negative transfer while preserving its positive transfer. We further introduce (i) feature-level counterfactual intervention, and (ii) a task-interaction module with instance inverse-weighting, weakening CTR-mediated effect and restoring direct DT semantics. ORCA is model-agnostic and easy to deploy. Experiments show an average 10.6% lift in DT metrics without harming CTR. Code is available at https://github.com/Chrissie-Law/ORCA-Mitigating-Over-Reliance-for-Multi-Task-Dwell-Time-Prediction-with-Causal-Decoupling.
comment: Accepted as a short paper at CIKM 2025
☆ LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence
In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren't capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to $\sim$3 hours ($\sim$20x) for the competitive analysis.
☆ Enhanced NIRMAL Optimizer With Damped Nesterov Acceleration: A Comparative Analysis
This study introduces the Enhanced NIRMAL (Novel Integrated Robust Multi-Adaptation Learning with Damped Nesterov Acceleration) optimizer, an improved version of the original NIRMAL optimizer. By incorporating an $(\alpha, r)$-damped Nesterov acceleration mechanism, Enhanced NIRMAL improves convergence stability while retaining chess-inspired strategies of gradient descent, momentum, stochastic perturbations, adaptive learning rates, and non-linear transformations. We evaluate Enhanced NIRMAL against Adam, SGD with Momentum, Nesterov, and the original NIRMAL on four benchmark image classification datasets: MNIST, FashionMNIST, CIFAR-10, and CIFAR-100, using tailored convolutional neural network (CNN) architectures. Enhanced NIRMAL achieves a test accuracy of 46.06\% and the lowest test loss (1.960435) on CIFAR-100, surpassing the original NIRMAL (44.34\% accuracy) and closely rivaling SGD with Momentum (46.43\% accuracy). These results underscore Enhanced NIRMAL's superior generalization and stability, particularly on complex datasets.
comment: 7 pages, 1 figure, 1 table. arXiv admin note: substantial text overlap with arXiv:2508.04293
☆ A Node-Aware Dynamic Quantization Approach for Graph Collaborative Filtering
In the realm of collaborative filtering recommendation systems, Graph Neural Networks (GNNs) have demonstrated remarkable performance but face significant challenges in deployment on resource-constrained edge devices due to their high embedding parameter requirements and computational costs. Using common quantization method directly on node embeddings may overlooks their graph based structure, causing error accumulation during message passing and degrading the quality of quantized embeddings.To address this, we propose Graph based Node-Aware Dynamic Quantization training for collaborative filtering (GNAQ), a novel quantization approach that leverages graph structural information to enhance the balance between efficiency and accuracy of GNNs for Top-K recommendation. GNAQ introduces a node-aware dynamic quantization strategy that adapts quantization scales to individual node embeddings by incorporating graph interaction relationships. Specifically, it initializes quantization intervals based on node-wise feature distributions and dynamically refines them through message passing in GNN layers. This approach mitigates information loss caused by fixed quantization scales and captures hierarchical semantic features in user-item interaction graphs. Additionally, GNAQ employs graph relation-aware gradient estimation to replace traditional straight-through estimators, ensuring more accurate gradient propagation during training. Extensive experiments on four real-world datasets demonstrate that GNAQ outperforms state-of-the-art quantization methods, including BiGeaR and N2UQ, by achieving average improvement in 27.8\% Recall@10 and 17.6\% NDCG@10 under 2-bit quantization. In particular, GNAQ is capable of maintaining the performance of full-precision models while reducing their model sizes by 8 to 12 times; in addition, the training time is twice as fast compared to quantization baseline methods.
☆ LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models
The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in the industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs for building hierarchical text classifiers as a framework for a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.
comment: 20 pages excluding reference list, 2 figures
☆ OPERA: A Reinforcement Learning--Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval
Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design. Code is available at https://github.com/Ameame1/OPERA.
☆ MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains-such as Arabic legal contexts-remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.
☆ Attribute Filtering in Approximate Nearest Neighbor Search: An In-depth Experimental Study SIGMOD 2026
With the growing integration of structured and unstructured data, new methods have emerged for performing similarity searches on vectors while honoring structured attribute constraints, i.e., a process known as Filtering Approximate Nearest Neighbor (Filtering ANN) search. Since many of these algorithms have only appeared in recent years and are designed to work with a variety of base indexing methods and filtering strategies, there is a pressing need for a unified analysis that identifies their core techniques and enables meaningful comparisons. In this work, we present a unified Filtering ANN search interface that encompasses the latest algorithms and evaluate them extensively from multiple perspectives. First, we propose a comprehensive taxonomy of existing Filtering ANN algorithms based on attribute types and filtering strategies. Next, we analyze their key components, i.e., index structures, pruning strategies, and entry point selection, to elucidate design differences and tradeoffs. We then conduct a broad experimental evaluation on 10 algorithms and 12 methods across 4 datasets (each with up to 10 million items), incorporating both synthetic and real attributes and covering selectivity levels from 0.1% to 100%. Finally, an in-depth component analysis reveals the influence of pruning, entry point selection, and edge filtering costs on overall performance. Based on our findings, we summarize the strengths and limitations of each approach, provide practical guidelines for selecting appropriate methods, and suggest promising directions for future research. Our code is available at: https://github.com/lmccccc/FANNBench.
comment: 15 pages, 15 figures, Accepted at SIGMOD 2026
☆ Modeling User Preferences as Distributions for Optimal Transport-based Cross-domain Recommendation under Non-overlapping Settings
Cross-Domain Recommender (CDR) systems aim to transfer knowledge from dense to sparse domains, alleviating data sparsity and cold-start issues in single-domain recommendation. While many methods assume overlapping users or items to connect domains, this is often unrealistic in real-world settings. Thus, non-overlapping CDR systems, which require no shared users or items, are needed. However, non-overlapping CDR is challenging due to: (1) the absence of overlap preventing direct bridges between domains, and (2) large distributional discrepancies degrading transfer performance. Moreover, most recommenders represent user preferences as discrete vectors, failing to capture their fine-grained, multi-faceted nature. We propose DUP-OT (Distributional User Preferences with Optimal Transport), a framework for non-overlapping CDR. DUP-OT has three stages: (1) Shared Preprocessing, where review-based embeddings and an autoencoder encode users and items from both domains; (2) User GMM Weight Learning, which models user preferences as Gaussian mixtures with learned weights; and (3) Cross-domain Rating Prediction, where optimal transport aligns Gaussian components across domains, enabling preference transfer from source to target. Experiments on Amazon review datasets show that DUP-OT effectively mitigates domain discrepancy and outperforms state-of-the-art baselines under the non-overlapping CDR setting.
☆ EGRA:Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation
MultiModal Recommendation (MMR) systems have emerged as a promising solution for improving recommendation quality by leveraging rich item-side modality information, prompting a surge of diverse methods. Despite these advances, existing methods still face two critical limitations. First, they use raw modality features to construct item-item links for enriching the behavior graph, while giving limited attention to balancing collaborative and modality-aware semantics or mitigating modality noise in the process. Second, they use a uniform alignment weight across all entities and also maintain a fixed alignment strength throughout training, limiting the effectiveness of modality-behavior alignment. To address these challenges, we propose EGRA. First, instead of relying on raw modality features, it alleviates sparsity by incorporating into the behavior graph an item-item graph built from representations generated by a pretrained MMR model. This enables the graph to capture both collaborative patterns and modality aware similarities with enhanced robustness against modality noise. Moreover, it introduces a novel bi-level dynamic alignment weighting mechanism to improve modality-behavior representation alignment, which dynamically assigns alignment strength across entities according to their alignment degree, while gradually increasing the overall alignment intensity throughout training. Extensive experiments on five datasets show that EGRA significantly outperforms recent methods, confirming its effectiveness.
☆ Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering ACM MM 2025
Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.
comment: This paper has been accepted by ACM MM 2025
☆ Cross-Modal Prototype Augmentation and Dual-Grained Prompt Learning for Social Media Popularity Prediction ACM MM 2025
Social Media Popularity Prediction is a complex multimodal task that requires effective integration of images, text, and structured information. However, current approaches suffer from inadequate visual-textual alignment and fail to capture the inherent cross-content correlations and hierarchical patterns in social media data. To overcome these limitations, we establish a multi-class framework , introducing hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment. Furthermore, we propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms, achieving precise multimodal representation through fine-grained category modeling. Experimental results demonstrate state-of-the-art performance on benchmark metrics, establishing new reference standards for multimodal social media analysis.
comment: This paper has been accepted by ACM MM 2025
☆ Spacetime-GR: A Spacetime-Aware Generative Model for Large Scale Online POI Recommendation
Building upon the strong sequence modeling capability, Generative Recommendation (GR) has gradually assumed a dominant position in the application of recommendation tasks (e.g., video and product recommendation). However, the application of Generative Recommendation in Point-of-Interest (POI) recommendation, where user preferences are significantly affected by spatiotemporal variations, remains a challenging open problem. In this paper, we propose Spacetime-GR, the first spacetime-aware generative model for large-scale online POI recommendation. It extends the strong sequence modeling ability of generative models by incorporating flexible spatiotemporal information encoding. Specifically, we first introduce a geographic-aware hierarchical POI indexing strategy to address the challenge of large vocabulary modeling. Subsequently, a novel spatiotemporal encoding module is introduced to seamlessly incorporate spatiotemporal context into user action sequences, thereby enhancing the model's sensitivity to spatiotemporal variations. Furthermore, we incorporate multimodal POI embeddings to enrich the semantic understanding of each POI. Finally, to facilitate practical deployment, we develop a set of post-training adaptation strategies after sufficient pre-training on action sequences. These strategies enable Spacetime-GR to generate outputs in multiple formats (i.e., embeddings, ranking scores and POI candidates) and support a wide range of downstream application scenarios (i.e., ranking and end-to-end recommendation). We evaluate the proposed model on both public benchmark datasets and large-scale industrial datasets, demonstrating its superior performance over existing methods in terms of POI recommendation accuracy and ranking quality. Furthermore, the model is the first generative model deployed in online POI recommendation services that scale to hundreds of millions of POIs and users.
☆ Extending FKG.in: Towards a Food Claim Traceability Network
The global food landscape is rife with scientific, cultural, and commercial claims about what foods are, what they do, what they should not do, or should not do. These range from rigorously studied health benefits (probiotics improve gut health) and misrepresentations (soaked almonds make one smarter) to vague promises (superfoods boost immunity) and culturally rooted beliefs (cold foods cause coughs). Despite their widespread influence, the infrastructure for tracing, verifying, and contextualizing these claims remains fragmented and underdeveloped. In this paper, we propose a Food Claim-Traceability Network (FCN) as an extension of FKG.in, a knowledge graph of Indian food that we have been incrementally building. We also present the ontology design and the semi-automated knowledge curation workflow that we used to develop a proof of concept of FKG.in-FCN using Reddit data and Large Language Models. FCN integrates curated data inputs, structured schemas, and provenance-aware pipelines for food-related claim extraction and validation. While directly linked to the Indian food knowledge graph as an application, our methodology remains application-agnostic and adaptable to other geographic, culinary, or regulatory settings. By modeling food claims and their traceability in a structured, verifiable, and explainable way, we aim to contribute to more transparent and accountable food knowledge ecosystems, supporting researchers, policymakers, and most importantly, everyday consumers in navigating a world saturated with dietary assertions.
comment: 10 pages, 3 figures, 1 table, 45 references, ACM International Conference on Multimedia 2025 - Multi-modal Food Computing Workshop
☆ Similarity-Based Supervised User Session Segmentation Method for Behavior Logs
In information recommendation, a session refers to a sequence of user actions within a specific time frame. Session-based recommender systems aim to capture short-term preferences and generate relevant recommendations. However, user interests may shift even within a session, making appropriate segmentation essential for modeling dynamic behaviors. In this study, we propose a supervised session segmentation method based on similarity features derived from action embeddings and attributes. We compute the similarity scores between items within a fixed-size window around each candidate segmentation point, using four types of features: item co-occurrence embeddings, text embeddings of titles and brands, and price. These features are used to train supervised classifiers (LightGBM, XGBoost, CatBoost, support vector machine, and logistic regression) to predict the session boundaries. We construct a manually annotated dataset from real user browsing histories and evaluate the segmentation performance using F1-score, area under the precision-recall curve (PR-AUC), and area under the receiver operating characteristic curve. The LightGBM model achieves the best performance, with an F1-score of 0.806 and a PR-AUC of 0.831. These results demonstrate the effectiveness of the proposed method for session segmentation and its potential to capture dynamic user behaviors.
comment: Submitted to Journal of Advanced Computational Intelligence and Intelligent Informatics
☆ Estimating the Effective Topics of Articles and journals Abstract Using LDA And K-Means Clustering Algorithm
Analyzing journals and articles abstract text or documents using topic modelling and text clustering has become a modern solution for the increasing number of text documents. Topic modelling and text clustering are both intensely involved tasks that can benefit one another. Text clustering and topic modelling algorithms are used to maintain massive amounts of text documents. In this study, we have used LDA, K-Means cluster and also lexical database WordNet for keyphrases extraction in our text documents. K-Means cluster and LDA algorithms achieve the most reliable performance for keyphrase extraction in our text documents. This study will help the researcher to make a search string based on journals and articles by avoiding misunderstandings.
☆ Bootstrapping Conditional Retrieval for User-to-Item Recommendations
User-to-item retrieval has been an active research area in recommendation system, and two tower models are widely adopted due to model simplicity and serving efficiency. In this work, we focus on a variant called \textit{conditional retrieval}, where we expect retrieved items to be relevant to a condition (e.g. topic). We propose a method that uses the same training data as standard two tower models but incorporates item-side information as conditions in query. This allows us to bootstrap new conditional retrieval use cases and encourages feature interactions between user and condition. Experiments show that our method can retrieve highly relevant items and outperforms standard two tower models with filters on engagement metrics. The proposed model is deployed to power a topic-based notification feed at Pinterest and led to +0.26\% weekly active users.
☆ How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models EMNLP
In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate in total 22 methods, including 40 variants (depending on used LLM) across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyze the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalization ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches. https://github.com/DataScienceUIBK/llm-reranking-generalization-study
comment: EMNLP Findings 2025
☆ Sparse and Dense Retrievers Learn Better Together: Joint Sparse-Dense Optimization for Text-Image Retrieval CIKM 2025
Vision-Language Pretrained (VLP) models have achieved impressive performance on multimodal tasks, including text-image retrieval, based on dense representations. Meanwhile, Learned Sparse Retrieval (LSR) has gained traction in text-only settings due to its interpretability and efficiency with fast term-based lookup via inverted indexes. Inspired by these advantages, recent work has extended LSR to the multimodal domain. However, these methods often rely on computationally expensive contrastive pre-training, or distillation from a frozen dense model, which limits the potential for mutual enhancement. To address these limitations, we propose a simple yet effective framework that enables bi-directional learning between dense and sparse representations through Self-Knowledge Distillation. This bi-directional learning is achieved using an integrated similarity score-a weighted sum of dense and sparse similarities-which serves as a shared teacher signal for both representations. To ensure efficiency, we fine-tune the final layer of the dense encoder and the sparse projection head, enabling easy adaptation of any existing VLP model. Experiments on MSCOCO and Flickr30k demonstrate that our sparse retriever not only outperforms existing sparse baselines, but also achieves performance comparable to-or even surpassing-its dense counterparts, while retaining the benefits of sparse models.
comment: accepted to CIKM 2025 short research paper track
♻ ☆ Enhancing and Scaling Search Query Datasets for Recommendation Systems
This paper presents a deployed, production-grade system designed to enhance and scale search query datasets for intent-based recommendation systems in digital banking. In real-world environments, the growing volume and complexity of user intents create substantial challenges for data management, resulting in suboptimal recommendations and delayed product onboarding. To overcome these challenges, our approach shifts the focus from model-centric enhancements to automated, data-centric strategies. The proposed system integrates three core modules: Synthetic Query Generation, Intent Disambiguation, and Intent Gap Analysis. Synthetic Query Generation produces diverse and realistic user queries. Our experiments reveal no statistically significant difference when using synthetic data for Clinc150, while Banking77 and a proprietary dataset show significant differences. We dig into the underlying factors driving these variations, demonstrating that our approach effectively alleviates the cold start problem (i.e. the challenge of recommending new products with limited historical data). Intent Disambiguation refines broad and overlapping intent categories into precise subintents, achieving an F1 score of 0.863 $\pm$ 0.127 against expert reannotations and leading to clearer differentiation and more precise recommendation mapping. Meanwhile, Intent Gap Analysis identifies latent customer needs by extracting novel intents from unlabeled queries; recovery rates reach up to 71\% in controlled evaluations. Deployed in a live banking environment, our system demonstrates significant improvements in recommendation precision and operation agility, ultimately delivering enhanced user experiences and strategic business benefits. This work underscores the role of high-quality, scalable data in modern AI-driven applications and advocates a proactive approach to data enhancement as a key driver of value.
♻ ☆ Leveraging LLMs for Utility-Focused Annotation: Reducing Manual Effort for Retrieval and RAG EMNLP25
Retrieval models typically rely on costly human-labeled query-document relevance annotations for training and evaluation. To reduce this cost and leverage the potential of Large Language Models (LLMs) in relevance judgments, we aim to explore whether LLM-generated annotations can effectively replace human annotations in training retrieval models. Retrieval usually emphasizes relevance, which indicates "topic-relatedness" of a document to a query, while in RAG, the value of a document (or utility) depends on how it contributes to answer generation. Recognizing this mismatch, some researchers use LLM performance on downstream tasks with documents as labels, but this approach requires manual answers for specific tasks, leading to high costs and limited generalization. In another line of work, prompting LLMs to select useful documents as RAG references eliminates the need for human annotation and is not task-specific. If we leverage LLMs' utility judgments to annotate retrieval data, we may retain cross-task generalization without human annotation in large-scale corpora. Therefore, we investigate utility-focused annotation via LLMs for large-scale retriever training data across both in-domain and out-of-domain settings on the retrieval and RAG tasks. To reduce the impact of low-quality positives labeled by LLMs, we design a novel loss function, i.e., Disj-InfoNCE. Our experiments reveal that: (1) Retrievers trained on utility-focused annotations significantly outperform those trained on human annotations in the out-of-domain setting on both tasks, demonstrating superior generalization capabilities. (2) LLM annotation does not replace human annotation in the in-domain setting. However, incorporating just 20% human-annotated data enables retrievers trained with utility-focused annotations to match the performance of models trained entirely with human annotations.
comment: Accepted by the EMNLP25 main conference
♻ ☆ ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker \textbf{ReasonRank} outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. \textbf{Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance 40.6 on the BRIGHT leaderboard\footnote{https://brightbenchmark.github.io/}.} Our codes are available at https://github.com/8421BCD/ReasonRank.
comment: 21 pages
♻ ☆ MTGR: Industrial-Scale Generative Recommendation Framework in Meituan
Scaling law has been extensively validated in many domains such as natural language processing and computer vision. In the recommendation system, recent work has adopted generative recommendations to achieve scalability, but their generative approaches require abandoning the carefully constructed cross features of traditional recommendation models. We found that this approach significantly degrades model performance, and scaling up cannot compensate for it at all. In this paper, we propose MTGR (Meituan Generative Recommendation) to address this issue. MTGR is modeling based on the HSTU architecture and can retain the original deep learning recommendation model (DLRM) features, including cross features. Additionally, MTGR achieves training and inference acceleration through user-level compression to ensure efficient scaling. We also propose Group-Layer Normalization (GLN) to enhance the performance of encoding within different semantic spaces and the dynamic masking strategy to avoid information leakage. We further optimize the training frameworks, enabling support for our models with 10 to 100 times computational complexity compared to the DLRM, without significant cost increases. MTGR achieved 65x FLOPs for single-sample forward inference compared to the DLRM model, resulting in the largest gain in nearly two years both offline and online. This breakthrough was successfully deployed on Meituan, the world's largest food delivery platform, where it has been handling the main traffic.
♻ ☆ Towards Goal-oriented Intelligent Tutoring Systems in Online Education
Interactive Intelligent Tutoring Systems (ITSs) enhance traditional ITSs by promoting effective learning through interactions and problem resolution in online education. Yet, proactive engagement, prioritizing resource optimization with planning and assessment capabilities, is often overlooked in current ITS designs. In this work, we investigate a new task, named Goal-oriented Intelligent Tutoring Systems (GITS), which aims to enable the student's mastery of a designated concept by strategically planning a customized sequence of exercises and assessment. To address the problem of goal-oriented policy learning in GITS, we propose a novel graph-based reinforcement learning framework, named Planning-Assessment-Interaction (PAI). Specifically, we first leverage cognitive structure information to improve state representation learning and action selection for planning the next action, which can be either to tutor an exercise or to assess the target concept. Further, we use a dynamically updated cognitive diagnosis model to simulate student responses to exercises and concepts. Three benchmark datasets across different subjects are constructed for enabling offline academic research on GITS. Experimental results demonstrate the effectiveness and efficiency of PAI and extensive analyses of various types of students are conducted to showcase the challenges in this task.
comment: Accepted by ACM TOIS
♻ ☆ Order-Preserving Dimension Reduction for Multimodal Semantic Embedding
Searching for the $k$-nearest neighbors (KNN) in multimodal data retrieval is computationally expensive, particularly due to the inherent difficulty in comparing similarity measures across different modalities. Recent advances in multimodal machine learning address this issue by mapping data into a shared embedding space; however, the high dimensionality of these embeddings (hundreds to thousands of dimensions) presents a challenge for time-sensitive vision applications. This work proposes Order-Preserving Dimension Reduction (OPDR), aiming to reduce the dimensionality of embeddings while preserving the ranking of KNN in the lower-dimensional space. One notable component of OPDR is a new measure function to quantify KNN quality as a global metric, based on which we derive a closed-form map between target dimensionality and key contextual parameters. We have integrated OPDR with multiple state-of-the-art dimension-reduction techniques, distance functions, and embedding models; experiments on a variety of multimodal datasets demonstrate that OPDR effectively retains recall high accuracy while significantly reducing computational costs.
Machine Learning 150
☆ Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet
Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.
comment: 5 pages, 3 figures, presented at WOCCI 2025 (Workshop on Child Computer Interaction), satellite workshop of Interspeech 2025
☆ Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation
Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks, particularly in privacy-sensitive applications. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data, limiting their adaptation. Federated learning (FL) provides a privacy-aware alternative, but existing FL approaches overlook the constraints imposed by edge devices -- namely, limited computational resources and the scarcity of labeled data. To address these challenges, we introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data, while the server has limited labeled, high-resolution data. In this setting, we propose the Federated Mixture of Experts (FedMox), a novel framework that enhances FM adaptation in FL. FedMox tackles computational and resolution mismatch challenges via a sparse Mixture-of-Experts architecture, employing a spatial router to align features across resolutions and a Soft-Mixture strategy to stabilize semi-supervised learning. We take object detection as a case study, and experiments on real-world autonomous driving datasets demonstrate that FedMox effectively adapts FMs under PSSFL, significantly improving performance with constrained memory costs on edge devices. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.
☆ Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders
Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to single concepts. A core SAE training hyperparameter is L0: how many features should fire per token on average. Existing work compares SAE algorithms using sparsity--reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value. In this work we study the effect of L0 on BatchTopK SAEs, and show that if L0 is not set precisely, the SAE fails to learn the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we demonstrate a method to determine the correct L0 value for an SAE on a given training distribution, which finds the true L0 in toy models and coincides with peak sparse probing performance in LLMs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that, to train SAEs with correct features, practitioners must set L0 correctly.
☆ Transfer Learning via Lexical Relatedness: A Sarcasm and Hate Speech Case Study
Detecting hate speech in non-direct forms, such as irony, sarcasm, and innuendos, remains a persistent challenge for social networks. Although sarcasm and hate speech are regarded as distinct expressions, our work explores whether integrating sarcasm as a pre-training step improves implicit hate speech detection and, by extension, explicit hate speech detection. Incorporating samples from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus, we devised two training strategies to compare the effectiveness of sarcasm pre-training on a CNN+LSTM and BERT+BiLSTM model. The first strategy is a single-step training approach, where a model trained only on sarcasm is then tested on hate speech. The second strategy uses sequential transfer learning to fine-tune models for sarcasm, implicit hate, and explicit hate. Our results show that sarcasm pre-training improved the BERT+BiLSTM's recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples. By incorporating sarcasm into the training process, we show that models can more effectively detect both implicit and explicit hate.
☆ Machine Learning Time Propagators for Time-Dependent Density Functional Theory Simulations
Time-dependent density functional theory (TDDFT) is a widely used method to investigate electron dynamics under external time-dependent perturbations such as laser fields. In this work, we present a novel approach to accelerate electron dynamics simulations based on real time TDDFT using autoregressive neural operators as time-propagators for the electron density. By leveraging physics-informed constraints and featurization, and high-resolution training data, our model achieves superior accuracy and computational speed compared to traditional numerical solvers. We demonstrate the effectiveness of our model on a class of one-dimensional diatomic molecules under the influence of a range of laser parameters. This method has potential in enabling real-time, on-the-fly modeling of laser-irradiated molecules and materials with varying experimental parameters.
comment: 20 pages, 5 figures
☆ TinyML Towards Industry 4.0: Resource-Efficient Process Monitoring of a Milling Machine
In the context of industry 4.0, long-serving industrial machines can be retrofitted with process monitoring capabilities for future use in a smart factory. One possible approach is the deployment of wireless monitoring systems, which can benefit substantially from the TinyML paradigm. This work presents a complete TinyML flow from dataset generation, to machine learning model development, up to implementation and evaluation of a full preprocessing and classification pipeline on a microcontroller. After a short review on TinyML in industrial process monitoring, the creation of the novel MillingVibes dataset is described. The feasibility of a TinyML system for structure-integrated process quality monitoring could be shown by the development of an 8-bit-quantized convolutional neural network (CNN) model with 12.59kiB parameter storage. A test accuracy of 100.0% could be reached at 15.4ms inference time and 1.462mJ per quantized CNN inference on an ARM Cortex M4F microcontroller, serving as a reference for future TinyML process monitoring solutions.
comment: 10 pages, 5 figures, 1 table
☆ RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs
Training large language models (LLMs) from scratch is increasingly impractical, making post-training methods such as supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT, e.g., PPO) central to modern practice. Using an out-of-distribution (OOD) variant of the 24-point card game and new spectrum-based diagnostics, we revisit how these two stages reshape model representation and OOD performance. Our key findings are- (1) RL-FT can restore much of the OOD performance loss from SFT (e.g., Llama-11B 8.97% to 15.38%, Qwen-7B 17.09% to 19.66%). But when SFT induces severe overfitting and a clear distribution shift, RL-FT cannot fully recover OOD performance. (2) Direction shifts of singular vectors matter more than singular value magnitudes. These shifts concentrate on directions linked to the largest and smallest singular values, leaving the bulk spectrum intact. (3) Low-rank and shallow recovery is effective: restoring singular vector directions for the top 20% of values or first 25% of layers recovers 70-80% of OOD performance. (4) Stronger SFT checkpoints enable better recovery by RL, while overfitted ones resist restoration. These results reconcile prior reports of RL superior OOD performance: RL primarily counteracts SFT-induced directional drift rather than finding new solutions. Our spectrum-aware analysis highlights inexpensive recovery knobs low-rank UV merging and shallow-layer resets that practitioners can use before costly RL fine-tuning.
☆ Parameter-Free Logit Distillation via Sorting Mechanism
Knowledge distillation (KD) aims to distill the knowledge from the teacher (larger) to the student (smaller) model via soft-label for the efficient neural network. In general, the performance of a model is determined by accuracy, which is measured with labels. However, existing KD approaches usually use the teacher with its original distribution, neglecting the potential of incorrect prediction. This may contradict the motivation of hard-label learning through cross-entropy loss, which may lead to sub-optimal knowledge distillation on certain samples. To address this issue, we propose a novel logit processing scheme via a sorting mechanism. Specifically, our method has a two-fold goal: (1) fixing the incorrect prediction of the teacher based on the labels and (2) reordering the distribution in a natural way according to priority rank at once. As an easy-to-use, plug-and-play pre-processing, our sort method can be effectively applied to existing logit-based KD methods. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate the effectiveness of our method.
comment: Accepted in IEEE Signal Processing Letters 2025
☆ Explainable AI in Deep Learning-Based Prediction of Solar Storms
A deep learning model is often considered a black-box model, as its internal workings tend to be opaque to the user. Because of the lack of transparency, it is challenging to understand the reasoning behind the model's predictions. Here, we present an approach to making a deep learning-based solar storm prediction model interpretable, where solar storms include solar flares and coronal mass ejections (CMEs). This deep learning model, built based on a long short-term memory (LSTM) network with an attention mechanism, aims to predict whether an active region (AR) on the Sun's surface that produces a flare within 24 hours will also produce a CME associated with the flare. The crux of our approach is to model data samples in an AR as time series and use the LSTM network to capture the temporal dynamics of the data samples. To make the model's predictions accountable and reliable, we leverage post hoc model-agnostic techniques, which help elucidate the factors contributing to the predicted output for an input sequence and provide insights into the model's behavior across multiple sequences within an AR. To our knowledge, this is the first time that interpretability has been added to an LSTM-based solar storm prediction model.
comment: 6 pages, 8 figures
☆ Escaping Saddle Points via Curvature-Calibrated Perturbations: A Complete Analysis with Explicit Constants and Empirical Validation
We present a comprehensive theoretical analysis of first-order methods for escaping strict saddle points in smooth non-convex optimization. Our main contribution is a Perturbed Saddle-escape Descent (PSD) algorithm with fully explicit constants and a rigorous separation between gradient-descent and saddle-escape phases. For a function $f:\mathbb{R}^d\to\mathbb{R}$ with $\ell$-Lipschitz gradient and $\rho$-Lipschitz Hessian, we prove that PSD finds an $(\epsilon,\sqrt{\rho\epsilon})$-approximate second-order stationary point with high probability using at most $O(\ell\Delta_f/\epsilon^2)$ gradient evaluations for the descent phase plus $O((\ell/\sqrt{\rho\epsilon})\log(d/\delta))$ evaluations per escape episode, with at most $O(\ell\Delta_f/\epsilon^2)$ episodes needed. We validate our theoretical predictions through extensive experiments across both synthetic functions and practical machine learning tasks, confirming the logarithmic dimension dependence and the predicted per-episode function decrease. We also provide complete algorithmic specifications including a finite-difference variant (PSD-Probe) and a stochastic extension (PSGD) with robust mini-batch sizing.
comment: 16 pages. Perturbed gradient descent with fully explicit constants for escaping saddle points, validated empirically
☆ Quality control in sublinear time: a case study via random graphs
Many algorithms are designed to work well on average over inputs. When running such an algorithm on an arbitrary input, we must ask: Can we trust the algorithm on this input? We identify a new class of algorithmic problems addressing this, which we call "Quality Control Problems." These problems are specified by a (positive, real-valued) "quality function" $\rho$ and a distribution $D$ such that, with high probability, a sample drawn from $D$ is "high quality," meaning its $\rho$-value is near $1$. The goal is to accept inputs $x \sim D$ and reject potentially adversarially generated inputs $x$ with $\rho(x)$ far from $1$. The objective of quality control is thus weaker than either component problem: testing for "$\rho(x) \approx 1$" or testing if $x \sim D$, and offers the possibility of more efficient algorithms. In this work, we consider the sublinear version of the quality control problem, where $D \in \Delta(\{0,1\}^N)$ and the goal is to solve the $(D ,\rho)$-quality problem with $o(N)$ queries and time. As a case study, we consider random graphs, i.e., $D = G_{n,p}$ (and $N = \binom{n}2$), and the $k$-clique count function $\rho_k := C_k(G)/\mathbb{E}_{G' \sim G_{n,p}}[C_k(G')]$, where $C_k(G)$ is the number of $k$-cliques in $G$. Testing if $G \sim G_{n,p}$ with one sample, let alone with sublinear query access to the sample, is of course impossible. Testing if $\rho_k(G)\approx 1$ requires $p^{-\Omega(k^2)}$ samples. In contrast, we show that the quality control problem for $G_{n,p}$ (with $n \geq p^{-ck}$ for some constant $c$) with respect to $\rho_k$ can be tested with $p^{-O(k)}$ queries and time, showing quality control is provably superpolynomially more efficient in this setting. More generally, for a motif $H$ of maximum degree $\Delta(H)$, the respective quality control problem can be solved with $p^{-O(\Delta(H))}$ queries and running time.
comment: 70 pages
☆ Guiding Diffusion Models with Reinforcement Learning for Stable Molecule Generation
Generating physically realistic 3D molecular structures remains a core challenge in molecular generative modeling. While diffusion models equipped with equivariant neural networks have made progress in capturing molecular geometries, they often struggle to produce equilibrium structures that adhere to physical principles such as force field consistency. To bridge this gap, we propose Reinforcement Learning with Physical Feedback (RLPF), a novel framework that extends Denoising Diffusion Policy Optimization to 3D molecular generation. RLPF formulates the task as a Markov decision process and applies proximal policy optimization to fine-tune equivariant diffusion models. Crucially, RLPF introduces reward functions derived from force-field evaluations, providing direct physical feedback to guide the generation toward energetically stable and physically meaningful structures. Experiments on the QM9 and GEOM-drug datasets demonstrate that RLPF significantly improves molecular stability compared to existing methods. These results highlight the value of incorporating physics-based feedback into generative modeling. The code is available at: https://github.com/ZhijianZhou/RLPF/tree/verl_diffusion.
☆ FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline EMNLP 2025
Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.
comment: To appear at EMNLP 2025
☆ ML-PWS: Estimating the Mutual Information Between Experimental Time Series Using Neural Networks
The ability to quantify information transmission is crucial for the analysis and design of natural and engineered systems. The information transmission rate is the fundamental measure for systems with time-varying signals, yet computing it is extremely challenging. In particular, the rate cannot be obtained directly from experimental time-series data without approximations, because of the high dimensionality of the signal trajectory space. Path Weight Sampling (PWS) is a computational technique that makes it possible to obtain the information rate exactly for any stochastic system. However, it requires a mathematical model of the system of interest, be it described by a master equation or a set of differential equations. Here, we present a technique that employs Machine Learning (ML) to develop a generative model from experimental time-series data, which is then combined with PWS to obtain the information rate. We demonstrate the accuracy of this technique, called ML-PWS, by comparing its results on synthetic time-series data generated from a non-linear model against ground-truth results obtained by applying PWS directly to the same model. We illustrate the utility of ML-PWS by applying it to neuronal time-series data.
comment: 9 pages, 2 figures
☆ MuST2-Learn: Multi-view Spatial-Temporal-Type Learning for Heterogeneous Municipal Service Time Estimation SP
Non-emergency municipal services such as city 311 systems have been widely implemented across cities in Canada and the United States to enhance residents' quality of life. These systems enable residents to report issues, e.g., noise complaints, missed garbage collection, and potholes, via phone calls, mobile applications, or webpages. However, residents are often given limited information about when their service requests will be addressed, which can reduce transparency, lower resident satisfaction, and increase the number of follow-up inquiries. Predicting the service time for municipal service requests is challenging due to several complex factors: dynamic spatial-temporal correlations, underlying interactions among heterogeneous service request types, and high variation in service duration even within the same request category. In this work, we propose MuST2-Learn: a Multi-view Spatial-Temporal-Type Learning framework designed to address the aforementioned challenges by jointly modeling spatial, temporal, and service type dimensions. In detail, it incorporates an inter-type encoder to capture relationships among heterogeneous service request types and an intra-type variation encoder to model service time variation within homogeneous types. In addition, a spatiotemporal encoder is integrated to capture spatial and temporal correlations in each request type. The proposed framework is evaluated with extensive experiments using two real-world datasets. The results show that MuST2-Learn reduces mean absolute error by at least 32.5%, which outperforms state-of-the-art methods.
comment: Accepted to SIGSPATIAL 2025
☆ On Zero-Shot Reinforcement Learning
Modern reinforcement learning (RL) systems capture deep truths about general, human problem-solving. In domains where new data can be simulated cheaply, these systems uncover sequential decision-making policies that far exceed the ability of any human. Society faces many problems whose solutions require this skill, but they are often in domains where new data cannot be cheaply simulated. In such scenarios, we can learn simulators from existing data, but these will only ever be approximately correct, and can be pathologically incorrect when queried outside of their training distribution. As a result, a misalignment between the environments in which we train our agents and the real-world in which we wish to deploy our agents is inevitable. Dealing with this misalignment is the primary concern of zero-shot reinforcement learning, a problem setting where the agent must generalise to a new task or domain with zero practice shots. Whilst impressive progress has been made on methods that perform zero-shot RL in idealised settings, new work is needed if these results are to be replicated in real-world settings. In this thesis, we argue that doing so requires us to navigate (at least) three constraints. First, the data quality constraint: real-world datasets are small and homogeneous. Second, the observability constraint: states, dynamics and rewards in the real-world are often only partially observed. And third, the data availability constraint: a priori access to data cannot always be assumed. This work proposes a suite of methods that perform zero-shot RL subject to these constraints. In a series of empirical studies we expose the failings of existing methods, and justify our techniques for remedying them. We believe these designs take us a step closer to RL methods that can be deployed to solve real-world problems.
comment: PhD thesis
☆ Post Hoc Regression Refinement via Pairwise Rankings
Accurate prediction of continuous properties is essential to many scientific and engineering tasks. Although deep-learning regressors excel with abundant labels, their accuracy deteriorates in data-scarce regimes. We introduce RankRefine, a model-agnostic, plug-and-play post hoc method that refines regression with expert knowledge coming from pairwise rankings. Given a query item and a small reference set with known properties, RankRefine combines the base regressor's output with a rank-based estimate via inverse variance weighting, requiring no retraining. In molecular property prediction task, RankRefine achieves up to 10% relative reduction in mean absolute error using only 20 pairwise comparisons obtained through a general-purpose large language model (LLM) with no finetuning. As rankings provided by human experts or general-purpose LLMs are sufficient for improving regression across diverse domains, RankRefine offers practicality and broad applicability, especially in low-data settings.
☆ Ensembles of Neural Surrogates for Parametric Sensitivity in Ocean Modeling
Accurate simulations of the oceans are crucial in understanding the Earth system. Despite their efficiency, simulations at lower resolutions must rely on various uncertain parameterizations to account for unresolved processes. However, model sensitivity to parameterizations is difficult to quantify, making it challenging to tune these parameterizations to reproduce observations. Deep learning surrogates have shown promise for efficient computation of the parametric sensitivities in the form of partial derivatives, but their reliability is difficult to evaluate without ground truth derivatives. In this work, we leverage large-scale hyperparameter search and ensemble learning to improve both forward predictions, autoregressive rollout, and backward adjoint sensitivity estimation. Particularly, the ensemble method provides epistemic uncertainty of function value predictions and their derivatives, providing improved reliability of the neural surrogates in decision making.
comment: 12 pages, 7 figures
☆ FraPPE: Fast and Efficient Preference-based Pure Exploration
Preference-based Pure Exploration (PrePEx) aims to identify with a given confidence level the set of Pareto optimal arms in a vector-valued (aka multi-objective) bandit, where the reward vectors are ordered via a (given) preference cone $\mathcal{C}$. Though PrePEx and its variants are well-studied, there does not exist a computationally efficient algorithm that can optimally track the existing lower bound for arbitrary preference cones. We successfully fill this gap by efficiently solving the minimisation and maximisation problems in the lower bound. First, we derive three structural properties of the lower bound that yield a computationally tractable reduction of the minimisation problem. Then, we deploy a Frank-Wolfe optimiser to accelerate the maximisation problem in the lower bound. Together, these techniques solve the maxmin optimisation problem in $\mathcal{O}(KL^{2})$ time for a bandit instance with $K$ arms and $L$ dimensional reward, which is a significant acceleration over the literature. We further prove that our proposed PrePEx algorithm, FraPPE, asymptotically achieves the optimal sample complexity. Finally, we perform numerical experiments across synthetic and real datasets demonstrating that FraPPE achieves the lowest sample complexities to identify the exact Pareto set among the existing algorithms.
☆ Underdamped Langevin MCMC with third order convergence
In this paper, we propose a new numerical method for the underdamped Langevin diffusion (ULD) and present a non-asymptotic analysis of its sampling error in the 2-Wasserstein distance when the $d$-dimensional target distribution $p(x)\propto e^{-f(x)}$ is strongly log-concave and has varying degrees of smoothness. Precisely, under the assumptions that the gradient and Hessian of $f$ are Lipschitz continuous, our algorithm achieves a 2-Wasserstein error of $\varepsilon$ in $\mathcal{O}(\sqrt{d}/\varepsilon)$ and $\mathcal{O}(\sqrt{d}/\sqrt{\varepsilon})$ steps respectively. Therefore, our algorithm has a similar complexity as other popular Langevin MCMC algorithms under matching assumptions. However, if we additionally assume that the third derivative of $f$ is Lipschitz continuous, then our algorithm achieves a 2-Wasserstein error of $\varepsilon$ in $\mathcal{O}(\sqrt{d}/\varepsilon^{\frac{1}{3}})$ steps. To the best of our knowledge, this is the first gradient-only method for ULD with third order convergence. To support our theory, we perform Bayesian logistic regression across a range of real-world datasets, where our algorithm achieves competitive performance compared to an existing underdamped Langevin MCMC algorithm and the popular No U-Turn Sampler (NUTS).
comment: 62 pages, 7 figures
☆ Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms
Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at github.com/JNoether/BAD-ACTS
comment: 52 Pages
☆ NOSTRA: A noise-resilient and sparse data framework for trust region based multi objective Bayesian optimization
Multi-objective Bayesian optimization (MOBO) struggles with sparse (non-space-filling), scarce (limited observations) datasets affected by experimental uncertainty, where identical inputs can yield varying outputs. These challenges are common in physical and simulation experiments (e.g., randomized medical trials and, molecular dynamics simulations) and are therefore incompatible with conventional MOBO methods. As a result, experimental resources are inefficiently allocated, leading to suboptimal designs. To address this challenge, we introduce NOSTRA (Noisy and Sparse Data Trust Region-based Optimization Algorithm), a novel sampling framework that integrates prior knowledge of experimental uncertainty to construct more accurate surrogate models while employing trust regions to focus sampling on promising areas of the design space. By strategically leveraging prior information and refining search regions, NOSTRA accelerates convergence to the Pareto frontier, enhances data efficiency, and improves solution quality. Through two test functions with varying levels of experimental uncertainty, we demonstrate that NOSTRA outperforms existing methods in handling noisy, sparse, and scarce data. Specifically, we illustrate that, NOSTRA effectively prioritizes regions where samples enhance the accuracy of the identified Pareto frontier, offering a resource-efficient algorithm that is practical in scenarios with limited experimental budgets while ensuring efficient performance.
☆ Reinforcement Learning-based Control via Y-wise Affine Neural Networks (YANNs)
This work presents a novel reinforcement learning (RL) algorithm based on Y-wise Affine Neural Networks (YANNs). YANNs provide an interpretable neural network which can exactly represent known piecewise affine functions of arbitrary input and output dimensions defined on any amount of polytopic subdomains. One representative application of YANNs is to reformulate explicit solutions of multi-parametric linear model predictive control. Built on this, we propose the use of YANNs to initialize RL actor and critic networks, which enables the resulting YANN-RL control algorithm to start with the confidence of linear optimal control. The YANN-actor is initialized by representing the multi-parametric control solutions obtained via offline computation using an approximated linear system model. The YANN-critic represents the explicit form of the state-action value function for the linear system and the reward function as the objective in an optimal control problem (OCP). Additional network layers are injected to extend YANNs for nonlinear expressions, which can be trained online by directly interacting with the true complex nonlinear system. In this way, both the policy and state-value functions exactly represent a linear OCP initially and are able to eventually learn the solution of a general nonlinear OCP. Continuous policy improvement is also implemented to provide heuristic confidence that the linear OCP solution serves as an effective lower bound to the performance of RL policy. The YANN-RL algorithm is demonstrated on a clipped pendulum and a safety-critical chemical-reactive system. Our results show that YANN-RL significantly outperforms the modern RL algorithm using deep deterministic policy gradient, especially when considering safety constraints.
☆ HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images
Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.
comment: 12 pages, 8 figures
☆ Anti-establishment sentiment on TikTok: Implications for understanding influence(rs) and expertise on social media AAAI
Distrust of public serving institutions and anti-establishment views are on the rise (especially in the U.S.). As people turn to social media for information, it is imperative to understand whether and how social media environments may be contributing to distrust of institutions. In social media, content creators, influencers, and other opinion leaders often position themselves as having expertise and authority on a range of topics from health to politics, and in many cases devalue and dismiss institutional expertise to build a following and increase their own visibility. However, the extent to which this content appears and whether such content increases engagement is unclear. This study analyzes the prevalence of anti-establishment sentiment (AES) on the social media platform TikTok. Despite its popularity as a source of information, TikTok remains relatively understudied and may provide important insights into how people form attitudes towards institutions. We employ a computational approach to label TikTok posts as containing AES or not across topical domains where content creators tend to frame themselves as experts: finance and wellness. As a comparison, we also consider the topic of conspiracy theories, where AES is expected to be common. We find that AES is most prevalent in conspiracy theory content, and relatively rare in content related to the other two topics. However, we find that engagement patterns with such content varies by area, and that there may be platform incentives for users to post content that expresses anti-establishment sentiment.
comment: 10 pages excluding references; 14 pages in total; 4 figures; Accepted by the AAAI Conference on Web and Social Media (ICWSM-2026)
☆ Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models
Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers' subjective comprehensibility. To address this challenge, we introduce \texttt{ComTree}, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that \texttt{ComTree} significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at https://github.com/thu-media/ComTree.
comment: ACM Multimedia2025
☆ Boardwalk: Towards a Framework for Creating Board Games with LLMs
Implementing board games in code can be a time-consuming task. However, Large Language Models (LLMs) have been proven effective at generating code for domain-specific tasks with simple contextual information. We aim to investigate whether LLMs can implement digital versions of board games from rules described in natural language. This would be a step towards an LLM-assisted framework for quick board game code generation. We expect to determine the main challenges for LLMs to implement the board games, and how different approaches and models compare to one another. We task three state-of-the-art LLMs (Claude, DeepSeek and ChatGPT) with coding a selection of 12 popular and obscure games in free-form and within Boardwalk, our proposed General Game Playing API. We anonymize the games and components to avoid evoking pre-trained LLM knowledge. The implementations are tested for playability and rule compliance. We evaluate success rate and common errors across LLMs and game popularity. Our approach proves viable, with the best performing model, Claude 3.7 Sonnet, yielding 55.6\% of games without any errors. While compliance with the API increases error frequency, the severity of errors is more significantly dependent on the LLM. We outline future steps for creating a framework to integrate this process, making the elaboration of board games more accessible.
comment: Accepted at SBGames 2025
☆ Integrated Noise and Safety Management in UAM via A Unified Reinforcement Learning Framework
Urban Air Mobility (UAM) envisions the widespread use of small aerial vehicles to transform transportation in dense urban environments. However, UAM faces critical operational challenges, particularly the balance between minimizing noise exposure and maintaining safe separation in low-altitude urban airspace, two objectives that are often addressed separately. We propose a reinforcement learning (RL)-based air traffic management system that integrates both noise and safety considerations within a unified, decentralized framework. Under this scalable air traffic coordination solution, agents operate in a structured, multi-layered airspace and learn altitude adjustment policies to jointly manage noise impact and separation constraints. The system demonstrates strong performance across both objectives and reveals tradeoffs among separation, noise exposure, and energy efficiency under high traffic density. The findings highlight the potential of RL and multi-objective coordination strategies in enhancing the safety, quietness, and efficiency of UAM operations.
☆ Deep Intrinsic Coregionalization Multi-Output Gaussian Process Surrogate with Active Learning
Deep Gaussian Processes (DGPs) are powerful surrogate models known for their flexibility and ability to capture complex functions. However, extending them to multi-output settings remains challenging due to the need for efficient dependency modeling. We propose the Deep Intrinsic Coregionalization Multi-Output Gaussian Process (deepICMGP) surrogate for computer simulation experiments involving multiple outputs, which extends the Intrinsic Coregionalization Model (ICM) by introducing hierarchical coregionalization structures across layers. This enables deepICMGP to effectively model nonlinear and structured dependencies between multiple outputs, addressing key limitations of traditional multi-output GPs. We benchmark deepICMGP against state-of-the-art models, demonstrating its competitive performance. Furthermore, we incorporate active learning strategies into deepICMGP to optimize sequential design tasks, enhancing its ability to efficiently select informative input locations for multi-output systems.
comment: 41 pages, 12 figures
☆ Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning
Offline reinforcement learning (RL) has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. Most existing methods primarily focus on training policies that maximize cumulative returns from a given dataset. However, many real-world applications require precise control over policy performance levels, rather than simply pursuing the best possible return. Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task, enabling the extraction of diverse policies by conditioning on different desired returns. Yet, existing RvS-based transformers, such as Decision Transformer (DT), struggle to reliably align the actual achieved returns with specified target returns, especially when interpolating within underrepresented returns or extrapolating beyond the dataset. To address this limitation, we propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL. Doctor achieves superior target alignment both within and beyond the dataset, while enabling accurate and flexible control over policy performance. Notably, on the dynamic treatment regime benchmark, EpiCare, our approach effectively modulates treatment policy aggressiveness, balancing therapeutic returns against adverse event risk.
☆ LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python
Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development, supporting tasks from code generation to debugging. Yet, their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. This highlights both the promise and the present constraints of LLMs in serving as reliable code analysis tools.
☆ Fast and Accurate RFIC Performance Prediction via Pin Level Graph Neural Networks and Probabilistic Flow
Accurately predicting the performance of active radio frequency (RF) circuits is essential for modern wireless systems but remains challenging due to highly nonlinear, layout-sensitive behavior and the high computational cost of traditional simulation tools. Existing machine learning (ML) surrogates often require large datasets to generalize across various topologies or to accurately model skewed and multi-modal performance metrics. In this work, a lightweight, data-efficient, and topology-aware graph neural network (GNN) model is proposed for predicting key performance metrics of multiple topologies of active RF circuits such as low noise amplifiers (LNAs), mixers, voltage-controlled oscillators (VCOs), and PAs. To capture transistor-level symmetry and preserve fine-grained connectivity details, circuits are modeled at the device-terminal level, enabling scalable message passing while reducing data requirements. Masked autoregressive flow (MAF) output heads are incorporated to improve robustness in modeling complex target distributions. Experiments on datasets demonstrate high prediction accuracy, with symmetric mean absolute percentage error (sMAPE) and mean relative error (MRE) averaging 2.40% and 2.91%, respectively. Owing to the pin-level conversion of circuit to graph and ML architecture robust to modeling complex densities of RF metrics, the MRE is improved by 3.14x while using 2.24x fewer training samples compared to prior work, demonstrating the method's effectiveness for rapid and accurate RF circuit design automation.
comment: This work has been submitted to the IEEE for possible publication
☆ Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars
Audio-driven facial animation presents an effective solution for animating digital avatars. In this paper, we detail the technical aspects of NVIDIA Audio2Face-3D, including data acquisition, network architecture, retargeting methodology, evaluation metrics, and use cases. Audio2Face-3D system enables real-time interaction between human users and interactive avatars, facilitating facial animation authoring for game characters. To assist digital avatar creators and game developers in generating realistic facial animations, we have open-sourced Audio2Face-3D networks, SDK, training framework, and example dataset.
☆ RoMedQA: The First Benchmark for Romanian Medical Question Answering
Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/RoMedQA.
☆ Sequential Cohort Selection
We study the problem of fair cohort selection from an unknown population, with a focus on university admissions. We start with the one-shot setting, where the admission policy must be fixed in advance and remain transparent, before observing the actual applicant pool. In contrast, the sequential setting allows the policy to be updated across stages as new applicant data becomes available. This is achieved by optimizing admission policies using a population model, trained on data from previous admission cycles. We also study the fairness properties of the resulting policies in the one-shot setting, including meritocracy and group parity.
comment: 9 pages, 7 figures
☆ Applications and Challenges of Fairness APIs in Machine Learning Software
Machine Learning software systems are frequently used in our day-to-day lives. Some of these systems are used in various sensitive environments to make life-changing decisions. Therefore, it is crucial to ensure that these AI/ML systems do not make any discriminatory decisions for any specific groups or populations. In that vein, different bias detection and mitigation open-source software libraries (aka API libraries) are being developed and used. In this paper, we conduct a qualitative study to understand in what scenarios these open-source fairness APIs are used in the wild, how they are used, and what challenges the developers of these APIs face while developing and adopting these libraries. We have analyzed 204 GitHub repositories (from a list of 1885 candidate repositories) which used 13 APIs that are developed to address bias in ML software. We found that these APIs are used for two primary purposes (i.e., learning and solving real-world problems), targeting 17 unique use-cases. Our study suggests that developers are not well-versed in bias detection and mitigation; they face lots of troubleshooting issues, and frequently ask for opinions and resources. Our findings can be instrumental for future bias-related software engineering research, and for guiding educators in developing more state-of-the-art curricula.
☆ RotaTouille: Rotation Equivariant Deep Learning for Contours
Contours or closed planar curves are common in many domains. For example, they appear as object boundaries in computer vision, isolines in meteorology, and the orbits of rotating machinery. In many cases when learning from contour data, planar rotations of the input will result in correspondingly rotated outputs. It is therefore desirable that deep learning models be rotationally equivariant. In addition, contours are typically represented as an ordered sequence of edge points, where the choice of starting point is arbitrary. It is therefore also desirable for deep learning methods to be equivariant under cyclic shifts. We present RotaTouille, a deep learning framework for learning from contour data that achieves both rotation and cyclic shift equivariance through complex-valued circular convolution. We further introduce and characterize equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations for downstream tasks. Finally, we demonstrate the effectiveness of RotaTouille through experiments in shape classification, reconstruction, and contour regression.
comment: 12 pages, 4 figures
☆ Probabilistic Pretraining for Neural Regression
Transfer learning for probabilistic regression remains underexplored. This work closes this gap by introducing NIAQUE, Neural Interpretable Any-Quantile Estimation, a new model designed for transfer learning in probabilistic regression through permutation invariance. We demonstrate that pre-training NIAQUE directly on diverse downstream regression datasets and fine-tuning it on a specific target dataset enhances performance on individual regression tasks, showcasing the positive impact of probabilistic transfer learning. Furthermore, we highlight the effectiveness of NIAQUE in Kaggle competitions against strong baselines involving tree-based models and recent neural foundation models TabPFN and TabDPT. The findings highlight NIAQUE's efficacy as a robust and scalable framework for probabilistic regression, leveraging transfer learning to enhance predictive performance.
☆ Uppaal Coshy: Automatic Synthesis of Compact Shields for Hybrid Systems
We present Uppaal Coshy, a tool for automatic synthesis of a safety strategy -- or shield -- for Markov decision processes over continuous state spaces and complex hybrid dynamics. The general methodology is to partition the state space and then solve a two-player safety game, which entails a number of algorithmically hard problems such as reachability for hybrid systems. The general philosophy of Uppaal Coshy is to approximate hard-to-obtain solutions using simulations. Our implementation is fully automatic and supports the expressive formalism of Uppaal models, which encompass stochastic hybrid automata. The precision of our partition-based approach benefits from using finer grids, which however are not efficient to store. We include an algorithm called Caap to efficiently compute a compact representation of a shield in the form of a decision tree, which yields significant reductions.
comment: 12 pages and 6 figures. Additional abstract of 4 pages and 4 figures. Extended version with supplementary material for an article to appear in the 2025 International Conference on Reachability Problems (RP)
☆ Unsupervised Online Detection of Pipe Blockages and Leakages in Water Distribution Networks
Water Distribution Networks (WDNs), critical to public well-being and economic stability, face challenges such as pipe blockages and background leakages, exacerbated by operational constraints such as data non-stationarity and limited labeled data. This paper proposes an unsupervised, online learning framework that aims to detect two types of faults in WDNs: pipe blockages, modeled as collective anomalies, and background leakages, modeled as concept drift. Our approach combines a Long Short-Term Memory Variational Autoencoder (LSTM-VAE) with a dual drift detection mechanism, enabling robust detection and adaptation under non-stationary conditions. Its lightweight, memory-efficient design enables real-time, edge-level monitoring. Experiments on two realistic WDNs show that the proposed approach consistently outperforms strong baselines in detecting anomalies and adapting to recurrent drift, demonstrating its effectiveness in unsupervised event detection for dynamic WDN environments.
comment: This paper is accepted by the 6th International Conference on Control and Fault-Tolerant Systems (SysTol)
☆ OwkinZero: Accelerating Biological Discovery with AI
While large language models (LLMs) are rapidly advancing scientific research, they continue to struggle with core biological reasoning tasks essential for translational and biomedical discovery. To address this limitation, we created and curated eight comprehensive benchmark datasets comprising over 300,000 verifiable question-and-answer pairs, each targeting critical challenges in drug discovery including target druggability, modality suitability, and drug perturbation effects. Using this resource, we developed the OwkinZero models by post-training open-source LLMs through a Reinforcement Learning from Verifiable Rewards strategy. Our results demonstrate that specialized 8-32B OwkinZero models substantially outperform larger, state-of-the-art commercial LLMs on these biological benchmarks. Remarkably, we uncover evidence of a key aspect of generalization: specialist models trained on a single task consistently outperform their base models on previously unseen tasks. This generalization effect is further amplified in our comprehensive OwkinZero models, which were trained on a mixture of datasets and achieve even broader cross-task improvements. This study represents a significant step toward addressing the biological reasoning blind spot in current LLMs, demonstrating that targeted reinforcement learning on carefully curated data can unlock generalizable performance in specialized models, thereby accelerating AI-driven biological discovery.
comment: Preprint
☆ Cyber Physical Awareness via Intent-Driven Threat Assessment: Enhanced Space Networks with Intershell Links
This letter addresses essential aspects of threat assessment by proposing intent-driven threat models that incorporate both capabilities and intents. We propose a holistic framework for cyber physical awareness (CPA) in space networks, pointing out that analyzing reliability and security separately can lead to overfitting on system-specific criteria. We structure our proposed framework in three main steps. First, we suggest an algorithm that extracts characteristic properties of the received signal to facilitate an intuitive understanding of potential threats. Second, we develop a multitask learning architecture where one task evaluates reliability-related capabilities while the other deciphers the underlying intentions of the signal. Finally, we propose an adaptable threat assessment that aligns with varying security and reliability requirements. The proposed framework enhances the robustness of threat detection and assessment, outperforming conventional sequential methods, and enables space networks with emerging intershell links to effectively address complex threat scenarios.
comment: in IEEE Wireless Communications Letters, 2025
☆ Retrieval Enhanced Feedback via In-context Neural Error-book EMNLP 2025
Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback -- Feed-Target, Feed-Check, and Feed-Path -- to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE's potential for enhancing multimodal reasoning.
comment: Accepted at EMNLP 2025 main conference
A Sharp KL-Convergence Analysis for Diffusion Models under Minimal Assumptions
Diffusion-based generative models have emerged as highly effective methods for synthesizing high-quality samples. Recent works have focused on analyzing the convergence of their generation process with minimal assumptions, either through reverse SDEs or Probability Flow ODEs. The best known guarantees, without any smoothness assumptions, for the KL divergence so far achieve a linear dependence on the data dimension $d$ and an inverse quadratic dependence on $\varepsilon$. In this work, we present a refined analysis that improves the dependence on $\varepsilon$. We model the generation process as a composition of two steps: a reverse ODE step, followed by a smaller noising step along the forward process. This design leverages the fact that the ODE step enables control in Wasserstein-type error, which can then be converted into a KL divergence bound via noise addition, leading to a better dependence on the discretization step size. We further provide a novel analysis to achieve the linear $d$-dependence for the error due to discretizing this Probability Flow ODE in absence of any smoothness assumptions. We show that $\tilde{O}\left(\tfrac{d\log^{3/2}(\frac{1}{\delta})}{\varepsilon}\right)$ steps suffice to approximate the target distribution corrupted with Gaussian noise of variance $\delta$ within $O(\varepsilon^2)$ in KL divergence, improving upon the previous best result, requiring $\tilde{O}\left(\tfrac{d\log^2(\frac{1}{\delta})}{\varepsilon^2}\right)$ steps.
comment: 30 pages, 1 figure
☆ Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
comment: 10pageV0
☆ Representation Learning of Auxiliary Concepts for Improved Student Modeling and Exercise Recommendation
Personalized recommendation is a key feature of intelligent tutoring systems, typically relying on accurate models of student knowledge. Knowledge Tracing (KT) models enable this by estimating a student's mastery based on their historical interactions. Many KT models rely on human-annotated knowledge concepts (KCs), which tag each exercise with one or more skills or concepts believed to be necessary for solving it. However, these KCs can be incomplete, error-prone, or overly general. In this paper, we propose a deep learning model that learns sparse binary representations of exercises, where each bit indicates the presence or absence of a latent concept. We refer to these representations as auxiliary KCs. These representations capture conceptual structure beyond human-defined annotations and are compatible with both classical models (e.g., BKT) and modern deep learning KT architectures. We demonstrate that incorporating auxiliary KCs improves both student modeling and adaptive exercise recommendation. For student modeling, we show that augmenting classical models like BKT with auxiliary KCs leads to improved predictive performance. For recommendation, we show that using auxiliary KCs enhances both reinforcement learning-based policies and a simple planning-based method (expectimax), resulting in measurable gains in student learning outcomes within a simulated student environment.
☆ On the Evolution of Federated Post-Training Large Language Models: A Model Accessibility View
Federated Learning (FL) enables training models across decentralized data silos while preserving client data privacy. Recent research has explored efficient methods for post-training large language models (LLMs) within FL to address computational and communication challenges. While existing approaches often rely on access to LLMs' internal information, which is frequently restricted in real-world scenarios, an inference-only paradigm (black-box FedLLM) has emerged to address these limitations. This paper presents a comprehensive survey on federated tuning for LLMs. We propose a taxonomy categorizing existing studies along two axes: model access-based and parameter efficiency-based optimization. We classify FedLLM approaches into white-box, gray-box, and black-box techniques, highlighting representative methods within each category. We review emerging research treating LLMs as black-box inference APIs and discuss promising directions and open challenges for future research.
☆ Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning
As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is the notion of Data Shapley which quantifies the value of individual data points within a dataset. State-of-the-art methods to scale the NP-hard Shapley computation also face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present a Data Shapley approach to identify a dataset's high-quality data tuples, Chunked Data Shapley (C-DaSh). C-DaSh scalably divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high quality results. We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups between 80x - 2300x) and accuracy in detecting low-quality data regions. Our method enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines.
☆ FEST: A Unified Framework for Evaluating Synthetic Tabular Data
Synthetic data generation, leveraging generative machine learning techniques, offers a promising approach to mitigating privacy concerns associated with real-world data usage. Synthetic data closely resembles real-world data while maintaining strong privacy guarantees. However, a comprehensive assessment framework is still missing in the evaluation of synthetic data generation, especially when considering the balance between privacy preservation and data utility in synthetic data. This research bridges this gap by proposing FEST, a systematic framework for evaluating synthetic tabular data. FEST integrates diverse privacy metrics (attack-based and distance-based), along with similarity and machine learning utility metrics, to provide a holistic assessment. We develop FEST as an open-source Python-based library and validate it on multiple datasets, demonstrating its effectiveness in analyzing the privacy-utility trade-off of different synthetic data generation models. The source code of FEST is available on Github.
comment: 11 pages, International Conference on Information Systems Security and Privacy
☆ Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games
A Bayesian player acting in an infinite multi-player game learns to predict the other players' strategies if his prior assigns positive probability to their play (or contains a grain of truth). Kalai and Lehrer's classic grain of truth problem is to find a reasonably large class of strategies that contains the Bayes-optimal policies with respect to this class, allowing mutually-consistent beliefs about strategy choice that obey the rules of Bayesian inference. Only small classes are known to have a grain of truth and the literature contains several related impossibility results. In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of strategies wide enough to contain all computable strategies as well as Bayes-optimal strategies for every reasonable prior over the class. When the "environment" is a known repeated stage game, we show convergence in the sense of [KL93a] and [KL93b]. When the environment is unknown, agents using Thompson sampling converge to play $\varepsilon$-Nash equilibria in arbitrary unknown computable multi-agent environments. Finally, we include an application to self-predictive policies that avoid planning. While these results use computability theory only as a conceptual tool to solve a classic game theory problem, we show that our solution can naturally be computationally approximated arbitrarily closely.
comment: 42 pages; 2 figures; 7 algorithms
☆ When Simpler Wins: Facebooks Prophet vs LSTM for Air Pollution Forecasting in Data-Constrained Northern Nigeria
Air pollution forecasting is critical for proactive environmental management, yet data irregularities and scarcity remain major challenges in low-resource regions. Northern Nigeria faces high levels of air pollutants, but few studies have systematically compared the performance of advanced machine learning models under such constraints. This study evaluates Long Short-Term Memory (LSTM) networks and the Facebook Prophet model for forecasting multiple pollutants (CO, SO2, SO4) using monthly observational data from 2018 to 2023 across 19 states. Results show that Prophet often matches or exceeds LSTM's accuracy, particularly in series dominated by seasonal and long-term trends, while LSTM performs better in datasets with abrupt structural changes. These findings challenge the assumption that deep learning models inherently outperform simpler approaches, highlighting the importance of model-data alignment. For policymakers and practitioners in resource-constrained settings, this work supports adopting context-sensitive, computationally efficient forecasting methods over complexity for its own sake.
☆ A XAI-based Framework for Frequency Subband Characterization of Cough Spectrograms in Chronic Respiratory Disease
This paper presents an explainable artificial intelligence (XAI)-based framework for the spectral analysis of cough sounds associated with chronic respiratory diseases, with a particular focus on Chronic Obstructive Pulmonary Disease (COPD). A Convolutional Neural Network (CNN) is trained on time-frequency representations of cough signals, and occlusion maps are used to identify diagnostically relevant regions within the spectrograms. These highlighted areas are subsequently decomposed into five frequency subbands, enabling targeted spectral feature extraction and analysis. The results reveal that spectral patterns differ across subbands and disease groups, uncovering complementary and compensatory trends across the frequency spectrum. Noteworthy, the approach distinguishes COPD from other respiratory conditions, and chronic from non-chronic patient groups, based on interpretable spectral markers. These findings provide insight into the underlying pathophysiological characteristics of cough acoustics and demonstrate the value of frequency-resolved, XAI-enhanced analysis for biomedical signal interpretation and translational respiratory disease diagnostics.
☆ PIANO: Physics Informed Autoregressive Network
Solving time-dependent partial differential equations (PDEs) is fundamental to modeling critical phenomena across science and engineering. Physics-Informed Neural Networks (PINNs) solve PDEs using deep learning. However, PINNs perform pointwise predictions that neglect the autoregressive property of dynamical systems, leading to instabilities and inaccurate predictions. We introduce Physics-Informed Autoregressive Networks (PIANO) -- a framework that redesigns PINNs to model dynamical systems. PIANO operates autoregressively, explicitly conditioning future predictions on the past. It is trained through a self-supervised rollout mechanism while enforcing physical constraints. We present a rigorous theoretical analysis demonstrating that PINNs suffer from temporal instability, while PIANO achieves stability through autoregressive modeling. Extensive experiments on challenging time-dependent PDEs demonstrate that PIANO achieves state-of-the-art performance, significantly improving accuracy and stability over existing methods. We further show that PIANO outperforms existing methods in weather forecasting.
☆ UMATO: Bridging Local and Global Structures for Reliable Visual Analytics with Dimensionality Reduction
Due to the intrinsic complexity of high-dimensional (HD) data, dimensionality reduction (DR) techniques cannot preserve all the structural characteristics of the original data. Therefore, DR techniques focus on preserving either local neighborhood structures (local techniques) or global structures such as pairwise distances between points (global techniques). However, both approaches can mislead analysts to erroneous conclusions about the overall arrangement of manifolds in HD data. For example, local techniques may exaggerate the compactness of individual manifolds, while global techniques may fail to separate clusters that are well-separated in the original space. In this research, we provide a deeper insight into Uniform Manifold Approximation with Two-phase Optimization (UMATO), a DR technique that addresses this problem by effectively capturing local and global structures. UMATO achieves this by dividing the optimization process of UMAP into two phases. In the first phase, it constructs a skeletal layout using representative points, and in the second phase, it projects the remaining points while preserving the regional characteristics. Quantitative experiments validate that UMATO outperforms widely used DR techniques, including UMAP, in terms of global structure preservation, with a slight loss in local structure. We also confirm that UMATO outperforms baseline techniques in terms of scalability and stability against initialization and subsampling, making it more effective for reliable HD data analysis. Finally, we present a case study and a qualitative demonstration that highlight UMATO's effectiveness in generating faithful projections, enhancing the overall reliability of visual analytics using DR.
comment: IEEE Transactions on Visualization and Computer Graphics
☆ An Investigation of Visual Foundation Models Robustness
Visual Foundation Models (VFMs) are becoming ubiquitous in computer vision, powering systems for diverse tasks such as object detection, image classification, segmentation, pose estimation, and motion tracking. VFMs are capitalizing on seminal innovations in deep learning models, such as LeNet-5, AlexNet, ResNet, VGGNet, InceptionNet, DenseNet, YOLO, and ViT, to deliver superior performance across a range of critical computer vision applications. These include security-sensitive domains like biometric verification, autonomous vehicle perception, and medical image analysis, where robustness is essential to fostering trust between technology and the end-users. This article investigates network robustness requirements crucial in computer vision systems to adapt effectively to dynamic environments influenced by factors such as lighting, weather conditions, and sensor characteristics. We examine the prevalent empirical defenses and robust training employed to enhance vision network robustness against real-world challenges such as distributional shifts, noisy and spatially distorted inputs, and adversarial attacks. Subsequently, we provide a comprehensive analysis of the challenges associated with these defense mechanisms, including network properties and components to guide ablation studies and benchmarking metrics to evaluate network robustness.
☆ Dac-Fake: A Divide and Conquer Framework for Detecting Fake News on Social Media
With the rapid evolution of technology and the Internet, the proliferation of fake news on social media has become a critical issue, leading to widespread misinformation that can cause societal harm. Traditional fact checking methods are often too slow to prevent the dissemination of false information. Therefore, the need for rapid, automated detection of fake news is paramount. We introduce DaCFake, a novel fake news detection model using a divide and conquer strategy that combines content and context based features. Our approach extracts over eighty linguistic features from news articles and integrates them with either a continuous bag of words or a skipgram model for enhanced detection accuracy. We evaluated the performance of DaCFake on three datasets including Kaggle, McIntire + PolitiFact, and Reuter achieving impressive accuracy rates of 97.88%, 96.05%, and 97.32%, respectively. Additionally, we employed a ten-fold cross validation to further enhance the model's robustness and accuracy. These results highlight the effectiveness of DaCFake in early detection of fake news, offering a promising solution to curb misinformation on social media platforms.
☆ Spike Agreement Dependent Plasticity: A scalable Bio-Inspired learning paradigm for Spiking Neural Networks
We introduce Spike Agreement Dependent Plasticity (SADP), a biologically inspired synaptic learning rule for Spiking Neural Networks (SNNs) that relies on the agreement between pre- and post-synaptic spike trains rather than precise spike-pair timing. SADP generalizes classical Spike-Timing-Dependent Plasticity (STDP) by replacing pairwise temporal updates with population-level correlation metrics such as Cohen's kappa. The SADP update rule admits linear-time complexity and supports efficient hardware implementation via bitwise logic. Empirical results on MNIST and Fashion-MNIST show that SADP, especially when equipped with spline-based kernels derived from our experimental iontronic organic memtransistor device data, outperforms classical STDP in both accuracy and runtime. Our framework bridges the gap between biological plausibility and computational scalability, offering a viable learning mechanism for neuromorphic systems.
☆ OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models ICCV 2025
Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers-stemming from a large number of sampling steps and complex per-step computations-presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DIT models. We systematically analyze the model's sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure.In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction.Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
comment: Accepted by ICCV 2025
☆ Modeling User Preferences as Distributions for Optimal Transport-based Cross-domain Recommendation under Non-overlapping Settings
Cross-Domain Recommender (CDR) systems aim to transfer knowledge from dense to sparse domains, alleviating data sparsity and cold-start issues in single-domain recommendation. While many methods assume overlapping users or items to connect domains, this is often unrealistic in real-world settings. Thus, non-overlapping CDR systems, which require no shared users or items, are needed. However, non-overlapping CDR is challenging due to: (1) the absence of overlap preventing direct bridges between domains, and (2) large distributional discrepancies degrading transfer performance. Moreover, most recommenders represent user preferences as discrete vectors, failing to capture their fine-grained, multi-faceted nature. We propose DUP-OT (Distributional User Preferences with Optimal Transport), a framework for non-overlapping CDR. DUP-OT has three stages: (1) Shared Preprocessing, where review-based embeddings and an autoencoder encode users and items from both domains; (2) User GMM Weight Learning, which models user preferences as Gaussian mixtures with learned weights; and (3) Cross-domain Rating Prediction, where optimal transport aligns Gaussian components across domains, enabling preference transfer from source to target. Experiments on Amazon review datasets show that DUP-OT effectively mitigates domain discrepancy and outperforms state-of-the-art baselines under the non-overlapping CDR setting.
☆ Deep learning-enabled virtual multiplexed immunostaining of label-free tissue for vascular invasion assessment
Immunohistochemistry (IHC) has transformed clinical pathology by enabling the visualization of specific proteins within tissue sections. However, traditional IHC requires one tissue section per stain, exhibits section-to-section variability, and incurs high costs and laborious staining procedures. While multiplexed IHC (mIHC) techniques enable simultaneous staining with multiple antibodies on a single slide, they are more tedious to perform and are currently unavailable in routine pathology laboratories. Here, we present a deep learning-based virtual multiplexed immunostaining framework to simultaneously generate ERG and PanCK, in addition to H&E virtual staining, enabling accurate localization and interpretation of vascular invasion in thyroid cancers. This virtual mIHC technique is based on the autofluorescence microscopy images of label-free tissue sections, and its output images closely match the histochemical staining counterparts (ERG, PanCK and H&E) of the same tissue sections. Blind evaluation by board-certified pathologists demonstrated that virtual mIHC staining achieved high concordance with the histochemical staining results, accurately highlighting epithelial cells and endothelial cells. Virtual mIHC conducted on the same tissue section also allowed the identification and localization of small vessel invasion. This multiplexed virtual IHC approach can significantly improve diagnostic accuracy and efficiency in the histopathological evaluation of vascular invasion, potentially eliminating the need for traditional staining protocols and mitigating issues related to tissue loss and heterogeneity.
comment: 29 Pages, 7 Figures
☆ Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization
Flow-guided Localization (FGL) enables the identification of spatial regions within the human body that contain an event of diagnostic interest. FGL does that by leveraging the passive movement of energy-constrained nanodevices circulating through the bloodstream. Existing FGL solutions rely on graph models with fixed topologies or handcrafted features, which limit their adaptability to anatomical variability and hinder scalability. In this work, we explore the use of Set Transformer architectures to address these limitations. Our formulation treats nanodevices' circulation time reports as unordered sets, enabling permutation-invariant, variable-length input processing without relying on spatial priors. To improve robustness under data scarcity and class imbalance, we integrate synthetic data generation via deep generative models, including CGAN, WGAN, WGAN-GP, and CVAE. These models are trained to replicate realistic circulation time distributions conditioned on vascular region labels, and are used to augment the training data. Our results show that the Set Transformer achieves comparable classification accuracy compared to Graph Neural Networks (GNN) baselines, while simultaneously providing by-design improved generalization to anatomical variability. The findings highlight the potential of permutation-invariant models and synthetic augmentation for robust and scalable nanoscale localization.
comment: 6 pages, 4 figures, 4 tables, 26 references, accepted at ACM NanoCom'25
☆ GEM: A Scale-Aware and Distribution-Sensitive Sparse Fine-Tuning Framework for Effective Downstream Adaptation
Parameter-efficient fine-tuning (PEFT) has become a popular way to adapt large pre-trained models to new tasks. Most PEFT methods update only a small subset of parameters while freezing the rest, avoiding redundant computation. As they maximize the absolute size of the updates without regard to the parameters' original scale, the resulting changes in model behavior can be minimal. In contrast, we maximize updates relative to each parameter's scale, yielding more meaningful downstream adaptation. We propose Gradient-to-Weight Ratio and Entropy-guided Masking (GEM), a parameter scale-aware, distribution-sensitive sparse fine-tuning framework. GEM prioritizes parameters whose updates are significant in proportion to their initial pre-trained values. It also adaptively determines how many parameters to tune at each layer based on the entropy of parameter values, thereby making the most effective use of the computational budget in PEFT. Our empirical study demonstrates the efficacy of GEM on both general-domain tasks (GLUE and SuperGLUE) and domain-specific tasks (GSM8k and MBPP), achieving up to a 1.6% improvement in fine-tuning accuracy over full fine-tuning while updating only 0.1% of model parameters.
☆ Motor Imagery EEG Signal Classification Using Minimally Random Convolutional Kernel Transform and Hybrid Deep Learning
The brain-computer interface (BCI) establishes a non-muscle channel that enables direct communication between the human body and an external device. Electroencephalography (EEG) is a popular non-invasive technique for recording brain signals. It is critical to process and comprehend the hidden patterns linked to a specific cognitive or motor task, for instance, measured through the motor imagery brain-computer interface (MI-BCI). A significant challenge is presented by classifying motor imagery-based electroencephalogram (MI-EEG) tasks, given that EEG signals exhibit nonstationarity, time-variance, and individual diversity. Obtaining good classification accuracy is also very difficult due to the growing number of classes and the natural variability among individuals. To overcome these issues, this paper proposes a novel method for classifying EEG motor imagery signals that extracts features efficiently with Minimally Random Convolutional Kernel Transform (MiniRocket), a linear classifier then uses the extracted features for activity recognition. Furthermore, a novel deep learning based on Convolutional Neural Network (CNN) and Long Short Term Memory (LSTM) architecture to serve as a baseline was proposed and demonstrated that classification via MiniRocket's features achieves higher performance than the best deep learning models at lower computational cost. The PhysioNet dataset was used to evaluate the performance of the proposed approaches. The proposed models achieved mean accuracy values of 98.63% and 98.06% for the MiniRocket and CNN-LSTM, respectively. The findings demonstrate that the proposed approach can significantly enhance motor imagery EEG accuracy and provide new insights into the feature extraction and classification of MI-EEG.
☆ SPL-LNS: Sampling-Enhanced Large Neighborhood Search for Solving Integer Linear Programs
Large Neighborhood Search (LNS) is a common heuristic in combinatorial optimization that iteratively searches over a large neighborhood of the current solution for a better one. Recently, neural network-based LNS solvers have achieved great success in solving Integer Linear Programs (ILPs) by learning to greedily predict the locally optimal solution for the next neighborhood proposal. However, this greedy approach raises two key concerns: (1) to what extent this greedy proposal suffers from local optima, and (2) how can we effectively improve its sample efficiency in the long run. To address these questions, this paper first formulates LNS as a stochastic process, and then introduces SPL-LNS, a sampling-enhanced neural LNS solver that leverages locally-informed proposals to escape local optima. We also develop a novel hindsight relabeling method to efficiently train SPL-LNS on self-generated data. Experimental results demonstrate that SPL-LNS substantially surpasses prior neural LNS solvers for various ILP problems of different sizes.
☆ STA-GANN: A Valid and Generalizable Spatio-Temporal Kriging Approach
Spatio-temporal tasks often encounter incomplete data arising from missing or inaccessible sensors, making spatio-temporal kriging crucial for inferring the completely missing temporal information. However, current models struggle with ensuring the validity and generalizability of inferred spatio-temporal patterns, especially in capturing dynamic spatial dependencies and temporal shifts, and optimizing the generalizability of unknown sensors. To overcome these limitations, we propose Spatio-Temporal Aware Graph Adversarial Neural Network (STA-GANN), a novel GNN-based kriging framework that improves spatio-temporal pattern validity and generalization. STA-GANN integrates (i) Decoupled Phase Module that senses and adjusts for timestamp shifts. (ii) Dynamic Data-Driven Metadata Graph Modeling to update spatial relationships using temporal data and metadata; (iii) An adversarial transfer learning strategy to ensure generalizability. Extensive validation across nine datasets from four fields and theoretical evidence both demonstrate the superior performance of STA-GANN.
☆ On the Collapse Errors Induced by the Deterministic Sampler for Diffusion Models
Despite the widespread adoption of deterministic samplers in diffusion models (DMs), their potential limitations remain largely unexplored. In this paper, we identify collapse errors, a previously unrecognized phenomenon in ODE-based diffusion sampling, where the sampled data is overly concentrated in local data space. To quantify this effect, we introduce a novel metric and demonstrate that collapse errors occur across a variety of settings. When investigating its underlying causes, we observe a see-saw effect, where score learning in low noise regimes adversely impacts the one in high noise regimes. This misfitting in high noise regimes, coupled with the dynamics of deterministic samplers, ultimately causes collapse errors. Guided by these insights, we apply existing techniques from sampling, training, and architecture to empirically support our explanation of collapse errors. This work provides intensive empirical evidence of collapse errors in ODE-based diffusion sampling, emphasizing the need for further research into the interplay between score learning and deterministic sampling, an overlooked yet fundamental aspect of diffusion models.
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely AgentFly, which attains top-1 on GAIA validation ($87.88\%$ Pass@$3$) and $79.40\%$ on the test set. It reaches $66.6\%$ F1 and $80.4\%$ PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds $4.7\%$ to $9.6\%$ absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at https://github.com/Agent-on-the-Fly/AgentFly.
☆ Machine Learning in Micromobility: A Systematic Review of Datasets, Techniques, and Applications
Micromobility systems, which include lightweight and low-speed vehicles such as bicycles, e-bikes, and e-scooters, have become an important part of urban transportation and are used to solve problems such as traffic congestion, air pollution, and high transportation costs. Successful utilisation of micromobilities requires optimisation of complex systems for efficiency, environmental impact mitigation, and overcoming technical challenges for user safety. Machine Learning (ML) methods have been crucial to support these advancements and to address their unique challenges. However, there is insufficient literature addressing the specific issues of ML applications in micromobilities. This survey paper addresses this gap by providing a comprehensive review of datasets, ML techniques, and their specific applications in micromobilities. Specifically, we collect and analyse various micromobility-related datasets and discuss them in terms of spatial, temporal, and feature-based characteristics. In addition, we provide a detailed overview of ML models applied in micromobilities, introducing their advantages, challenges, and specific use cases. Furthermore, we explore multiple ML applications, such as demand prediction, energy management, and safety, focusing on improving efficiency, accuracy, and user experience. Finally, we propose future research directions to address these issues, aiming to help future researchers better understand this field.
comment: 14 pages, 3 tables, and 4 figures, submitted to IEEE Transactions on Intelligent Vehicles
☆ CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length. As a crucial technique, existing cross-layer KV cache sharing methods either necessitate modified model architectures with subsequent pre-training or incur significant performance degradation at high compression rates. To mitigate these challenges, we propose CommonKV, a training-free method for cross-layer KV cache compression through adjacent parameters sharing. Inspired by the high similarity observed in cross-layer hidden states, we utilize Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters, resulting in a more easily mergeable latent KV cache. Furthermore, we also introduce an adaptive budget allocation strategy. It dynamically assigns compression budgets based on cosine similarity, ensuring that dissimilar caches are not over-compressed. Experiments across multiple backbone models and benchmarks including LongBench and Ruler demonstrate that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios. Moreover, we find that the benefits of CommonKV are orthogonal to other quantization and eviction methods. By integrating these approaches, we can ultimately achieve a 98\% compression ratio without significant performance loss.
comment: 10 pages, 5 figures
☆ Domain Adaptation via Feature Refinement
We propose Domain Adaptation via Feature Refinement (DAFR2), a simple yet effective framework for unsupervised domain adaptation under distribution shift. The proposed method synergistically combines three key components: adaptation of Batch Normalization statistics using unlabeled target data, feature distillation from a source-trained model and hypothesis transfer. By aligning feature distributions at the statistical and representational levels, DAFR2 produces robust and domain-invariant feature spaces that generalize across similar domains without requiring target labels, complex architectures or sophisticated training objectives. Extensive experiments on benchmark datasets, including CIFAR10-C, CIFAR100-C, MNIST-C and PatchCamelyon-C, demonstrate that the proposed algorithm outperforms prior methods in robustness to corruption. Theoretical and empirical analyses further reveal that our method achieves improved feature alignment, increased mutual information between the domains and reduced sensitivity to input perturbations.
☆ Neural-Network Chemical Emulator for First-Star Formation: Robust Iterative Predictions over a Wide Density Range
We present a neural-network emulator for the thermal and chemical evolution in Population~III star formation. The emulator accurately reproduces the thermochemical evolution over a wide density range spanning 21 orders of magnitude (10$^{-3}$-10$^{18}$ cm$^{-3}$), tracking six primordial species: H, H$_2$, e$^{-}$, H$^{+}$, H$^{-}$, and H$_2^{+}$. To handle the broad dynamic range, we partition the density range into five subregions and train separate deep operator networks (DeepONets) in each region. When applied to randomly sampled thermochemical states, the emulator achieves relative errors below 10% in over 90% of cases for both temperature and chemical abundances (except for the rare species H$_2^{+}$). The emulator is roughly ten times faster on a CPU and more than 1000 times faster for batched predictions on a GPU, compared with conventional numerical integration. Furthermore, to ensure robust predictions under many iterations, we introduce a novel timescale-based update method, where a short-timestep update of each variable is computed by rescaling the predicted change over a longer timestep equal to its characteristic variation timescale. In one-zone collapse calculations, the results from the timescale-based method agree well with traditional numerical integration even with many iterations at a timestep as short as 10$^{-4}$ of the free-fall time. This proof-of-concept study suggests the potential for neural network-based chemical emulators to accelerate hydrodynamic simulations of star formation.
comment: 18 pages, 7 figures, Submitted to ApJ
☆ From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model's performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
☆ CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency EMNLP 2025
Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models-an answer generator and a question generator-are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart's generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct's efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.
comment: EMNLP 2025 Main
☆ Machine Learning for Medicine Must Be Interpretable, Shareable, Reproducible and Accountable by Design
This paper claims that machine learning models deployed in high stakes domains such as medicine must be interpretable, shareable, reproducible and accountable. We argue that these principles should form the foundational design criteria for machine learning algorithms dealing with critical medical data, including survival analysis and risk prediction tasks. Black box models, while often highly accurate, struggle to gain trust and regulatory approval in health care due to a lack of transparency. We discuss how intrinsically interpretable modeling approaches (such as kernel methods with sparsity, prototype-based learning, and deep kernel models) can serve as powerful alternatives to opaque deep networks, providing insight into biomedical predictions. We then examine accountability in model development, calling for rigorous evaluation, fairness, and uncertainty quantification to ensure models reliably support clinical decisions. Finally, we explore how generative AI and collaborative learning paradigms (such as federated learning and diffusion-based data synthesis) enable reproducible research and cross-institutional integration of heterogeneous biomedical data without compromising privacy, hence shareability. By rethinking machine learning foundations along these axes, we can develop medical AI that is not only accurate but also transparent, trustworthy, and translatable to real-world clinical settings.
☆ GPLight+: A Genetic Programming Method for Learning Symmetric Traffic Signal Control Policy
Recently, learning-based approaches, have achieved significant success in automatically devising effective traffic signal control strategies. In particular, as a powerful evolutionary machine learning approach, Genetic Programming (GP) is utilized to evolve human-understandable phase urgency functions to measure the urgency of activating a green light for a specific phase. However, current GP-based methods are unable to treat the common traffic features of different traffic signal phases consistently. To address this issue, we propose to use a symmetric phase urgency function to calculate the phase urgency for a specific phase based on the current road conditions. This is represented as an aggregation of two shared subtrees, each representing the urgency of a turn movement in the phase. We then propose a GP method to evolve the symmetric phase urgency function. We evaluate our proposed method on the well-known cityflow traffic simulator, based on multiple public real-world datasets. The experimental results show that the proposed symmetric urgency function representation can significantly improve the performance of the learned traffic signal control policies over the traditional GP representation on a wide range of scenarios. Further analysis shows that the proposed method can evolve effective, human-understandable and easily deployable traffic signal control policies.
☆ On Task Vectors and Gradients
Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
comment: 9 pages of main paper, 5 figures
☆ CEQuest: Benchmarking Large Language Models for Construction Estimation
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.
☆ Cooperative Design Optimization through Natural Language Interaction
Designing successful interactions requires identifying optimal design parameters. To do so, designers often conduct iterative user testing and exploratory trial-and-error. This involves balancing multiple objectives in a high-dimensional space, making the process time-consuming and cognitively demanding. System-led optimization methods, such as those based on Bayesian optimization, can determine for designers which parameters to test next. However, they offer limited opportunities for designers to intervene in the optimization process, negatively impacting the designer's experience. We propose a design optimization framework that enables natural language interactions between designers and the optimization system, facilitating cooperative design optimization. This is achieved by integrating system-led optimization methods with Large Language Models (LLMs), allowing designers to intervene in the optimization process and better understand the system's reasoning. Experimental results show that our method provides higher user agency than a system-led method and shows promising optimization performance compared to manual design. It also matches the performance of an existing cooperative method with lower cognitive load.
comment: 25 pages, 20 figures, to appear in Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25), September 28-October 1, 2025, Busan, Republic of Korea
☆ A State-Space Approach to Nonstationary Discriminant Analysis
Classical discriminant analysis assumes identically distributed training data, yet in many applications observations are collected over time and the class-conditional distributions drift. This population drift renders stationary classifiers unreliable. We propose a principled, model-based framework that embeds discriminant analysis within state-space models to obtain nonstationary linear discriminant analysis (NSLDA) and nonstationary quadratic discriminant analysis (NSQDA). For linear-Gaussian dynamics, we adapt Kalman smoothing to handle multiple samples per time step and develop two practical extensions: (i) an expectation-maximization (EM) approach that jointly estimates unknown system parameters, and (ii) a Gaussian mixture model (GMM)-Kalman method that simultaneously recovers unobserved time labels and parameters, a scenario common in practice. To address nonlinear or non-Gaussian drift, we employ particle smoothing to estimate time-varying class centroids, yielding fully nonstationary discriminant rules. Extensive simulations demonstrate consistent improvements over stationary linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and support vector machine (SVM) baselines, with robustness to noise, missing data, and class imbalance. This paper establishes a unified and data-efficient foundation for discriminant analysis under temporal distribution shift.
☆ Training a Foundation Model for Materials on a Budget
Foundation models for materials modeling are advancing quickly, but their training remains expensive, often placing state-of-the-art methods out of reach for many research groups. We introduce Nequix, a compact E(3)-equivariant potential that pairs a simplified NequIP design with modern training practices, including equivariant root-mean-square layer normalization and the Muon optimizer, to retain accuracy while substantially reducing compute requirements. Built in JAX, Nequix has 700K parameters and was trained in 500 A100-GPU hours. On the Matbench-Discovery and MDR Phonon benchmarks, Nequix ranks third overall while requiring less than one quarter of the training cost of most other methods, and it delivers an order-of-magnitude faster inference speed than the current top-ranked model. We release model weights and fully reproducible codebase at https://github.com/atomicarchitects/nequix
☆ Integrating Time Series into LLMs via Multi-layer Steerable Embedding Fusion for Enhanced Forecasting CIKM 2025
Time series (TS) data are ubiquitous across various application areas, rendering time series forecasting (TSF) a fundamental task. With the astounding advances in large language models (LLMs), a variety of methods have been developed to adapt LLMs for time series forecasting. Despite unlocking the potential of LLMs in comprehending TS data, existing methods are inherently constrained by their shallow integration of TS information, wherein LLMs typically access TS representations at shallow layers, primarily at the input layer. This causes the influence of TS representations to progressively fade in deeper layers and eventually leads to ineffective adaptation between textual embeddings and TS representations. In this paper, we propose the Multi-layer Steerable Embedding Fusion (MSEF), a novel framework that enables LLMs to directly access time series patterns at all depths, thereby mitigating the progressive loss of TS information in deeper layers. Specifically, MSEF leverages off-the-shelf time series foundation models to extract semantically rich embeddings, which are fused with intermediate text representations across LLM layers via layer-specific steering vectors. These steering vectors are designed to continuously optimize the alignment between time series and textual modalities and facilitate a layer-specific adaptation mechanism that ensures efficient few-shot learning capabilities. Experimental results on seven benchmarks demonstrate significant performance improvements by MSEF compared with baselines, with an average reduction of 31.8% in terms of MSE. The code is available at https://github.com/One1sAll/MSEF.
comment: To be published in CIKM 2025
☆ Pareto Actor-Critic for Communication and Computation Co-Optimization in Non-Cooperative Federated Learning Services
Federated learning (FL) in multi-service provider (SP) ecosystems is fundamentally hampered by non-cooperative dynamics, where privacy constraints and competing interests preclude the centralized optimization of multi-SP communication and computation resources. In this paper, we introduce PAC-MCoFL, a game-theoretic multi-agent reinforcement learning (MARL) framework where SPs act as agents to jointly optimize client assignment, adaptive quantization, and resource allocation. Within the framework, we integrate Pareto Actor-Critic (PAC) principles with expectile regression, enabling agents to conjecture optimal joint policies to achieve Pareto-optimal equilibria while modeling heterogeneous risk profiles. To manage the high-dimensional action space, we devise a ternary Cartesian decomposition (TCAD) mechanism that facilitates fine-grained control. Further, we develop PAC-MCoFL-p, a scalable variant featuring a parameterized conjecture generator that substantially reduces computational complexity with a provably bounded error. Alongside theoretical convergence guarantees, our framework's superiority is validated through extensive simulations -- PAC-MCoFL achieves approximately 5.8% and 4.2% improvements in total reward and hypervolume indicator (HVI), respectively, over the latest MARL solutions. The results also demonstrate that our method can more effectively balance individual SP and system performance in scaled deployments and under diverse data heterogeneity.
☆ CoVeRaP: Cooperative Vehicular Perception through mmWave FMCW Radars
Automotive FMCW radars remain reliable in rain and glare, yet their sparse, noisy point clouds constrain 3-D object detection. We therefore release CoVeRaP, a 21 k-frame cooperative dataset that time-aligns radar, camera, and GPS streams from multiple vehicles across diverse manoeuvres. Built on this data, we propose a unified cooperative-perception framework with middle- and late-fusion options. Its baseline network employs a multi-branch PointNet-style encoder enhanced with self-attention to fuse spatial, Doppler, and intensity cues into a common latent space, which a decoder converts into 3-D bounding boxes and per-point depth confidence. Experiments show that middle fusion with intensity encoding boosts mean Average Precision by up to 9x at IoU 0.9 and consistently outperforms single-vehicle baselines. CoVeRaP thus establishes the first reproducible benchmark for multi-vehicle FMCW-radar perception and demonstrates that affordable radar sharing markedly improves detection robustness. Dataset and code are publicly available to encourage further research.
comment: Accepted at ICCCN 2025 (IEEE International Conference on Computer Communications and Networks), Tokyo, Japan, August 2025
☆ Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning
Transformers have demonstrated exceptional performance across a wide range of domains. While their ability to perform reinforcement learning in-context has been established both theoretically and empirically, their behavior in non-stationary environments remains less understood. In this study, we address this gap by showing that transformers can achieve nearly optimal dynamic regret bounds in non-stationary settings. We prove that transformers are capable of approximating strategies used to handle non-stationary environments and can learn the approximator in the in-context learning setup. Our experiments further show that transformers can match or even outperform existing expert algorithms in such environments.
comment: 28 pages
♻ ☆ A Curious Case of Remarkable Resilience to Gradient Attacks via Fully Convolutional and Differentiable Front End with a Skip Connection
We experimented with front-end enhanced neural models where a differentiable and fully convolutional model with a skip connection is added before a frozen backbone classifier. By training such composite models using a small learning rate for about one epoch, we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks-including APGD and FAB-T attacks from the AutoAttack package-which we attribute to gradient masking. Although gradient masking is not new, the degree we observe is striking for fully differentiable models without obvious gradient-shattering-e.g., JPEG compression-or gradient-diminishing components. The training recipe to produce such models is also remarkably stable and reproducible: We applied it to three datasets (CIFAR10, CIFAR100, and ImageNet) and several modern architectures (including vision Transformers) without a single failure case. While black-box attacks such as the SQUARE attack and zero-order PGD can partially overcome gradient masking, these attacks are easily defeated by simple randomized ensembles. We estimate that these ensembles achieve near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet (while retaining almost all clean accuracy of the original classifiers) despite having near-zero accuracy under adaptive attacks. Adversarially training the backbone further amplifies this front-end "robustness". On CIFAR10, the respective randomized ensemble achieved 90.8$\pm 2.5\%$ (99\% CI) accuracy under the full AutoAttack while having only 18.2$\pm 3.6\%$ accuracy under the adaptive attack ($\varepsilon=8/255$, $L^\infty$ norm). We conclude the paper with a discussion of whether randomized ensembling can serve as a practical defense. Code and instructions to reproduce key results are available. https://github.com/searchivarius/curious_case_of_gradient_masking
comment: Accepted at TMLR (2025/08)
♻ ☆ General and Estimable Learning Bound Unifying Covariate and Concept Shifts
Generalization under distribution shift remains a core challenge in modern machine learning, yet existing learning bound theory is limited to narrow, idealized settings and is non-estimable from samples. In this paper, we bridge the gap between theory and practical applications. We first show that existing bounds become loose and non-estimable because their concept shift definition breaks when the source and target supports mismatch. Leveraging entropic optimal transport, we propose new support-agnostic definitions for covariate and concept shifts, and derive a novel unified error bound that applies to broad loss functions, label spaces, and stochastic labeling. We further develop estimators for these shifts with concentration guarantees, and the DataShifts algorithm, which can quantify distribution shifts and estimate the error bound in most applications -- a rigorous and general tool for analyzing learning error under distribution shift.
comment: 10 pages, 4 figures
♻ ☆ A Simple "Try Again" Can Elicit Multi-Turn LLM Reasoning
Multi-turn problem solving is critical yet challenging for Large Reasoning Models (LRMs) to reflect on their reasoning and revise from feedback. Existing Reinforcement Learning (RL) methods train large reasoning models on a single-turn paradigm with verifiable rewards. However, we observe that models trained with existing RL paradigms often lose their ability to solve problems across multiple turns and struggle to revise answers based on contextual feedback, leading to repetitive responses. We ask: can LRMs learn to reflect their answers in a multi-turn context? In this work, we find that training models with multi-turn RL using only unary feedback (e.g., "Let's try again") after wrong answers can improve both single-turn performance and multi-turn reasoning. We introduce Unary Feedback as Observation (UFO) for reinforcement learning, which uses minimal yet common unary user feedback during iterative problem solving. It can be easily applied to existing single-turn RL training setups. Experimental results show that RL training with UFO keeps single-turn performance and improves multi-turn reasoning accuracy by up to 14%, enabling language models to better react to feedback in multi-turn problem solving. To further minimize the number of turns needed for a correct answer while encouraging diverse reasoning when mistakes occur, we design reward structures that guide models to produce careful and deliberate answers in each turn. Code: https://github.com/lichengliu03/unary-feedback
♻ ☆ Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
♻ ☆ LearnLM: Improving Gemini for Learning
Today's generative AI systems are tuned to present information by default, rather than engage users in service of learning as a human tutor would. To address the wide range of potential education use cases for these systems, we reframe the challenge of injecting pedagogical behavior as one of \textit{pedagogical instruction following}, where training and evaluation examples include system-level instructions describing the specific pedagogy attributes present or desired in subsequent model turns. This framing avoids committing our models to any particular definition of pedagogy, and instead allows teachers or developers to specify desired model behavior. It also clears a path to improving Gemini models for learning -- by enabling the addition of our pedagogical data to post-training mixtures -- alongside their rapidly expanding set of capabilities. Both represent important changes from our initial tech report. We show how training with pedagogical instruction following produces a LearnLM model (available on Google AI Studio) that experts substantially prefer across a diverse set of learning scenarios, with average preference strengths of +31\% over GPT-4o, +11\% over Claude 3.5 Sonnet, and +13\% over the Gemini 1.5 Pro model on which LearnLM was based.
♻ ☆ Overcoming classic challenges for artificial neural networks by providing incentives and practice
Since the earliest proposals for artificial neural network (ANN) models of the mind and brain, critics have pointed out key weaknesses in these models compared to human cognitive abilities. Here we review recent work that uses metalearning to overcome several classic challenges, which we characterise as addressing the Problem of Incentive and Practice -- that is, providing machines with both incentives to improve specific skills and opportunities to practice those skills. This explicit optimization contrasts with more conventional approaches that hope the desired behaviour will emerge through optimising related but different objectives. We review applications of this principle to addressing four classic challenges for ANNs: systematic generalisation, catastrophic forgetting, few-shot learning and multi-step reasoning. We also discuss how large language models incorporate key aspects of this metalearning framework (namely, sequence prediction with feedback trained on diverse data), which helps to explain some of their successes on these classic challenges. Finally, we discuss the prospects for understanding aspects of human development through this framework, and whether natural environments provide the right incentives and practice for learning how to make challenging generalisations.
comment: In press at Nature Machine Intelligence
♻ ☆ Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning
Class-incremental with repetition (CIR), where previously trained classes repeatedly introduced in future tasks, is a more realistic scenario than the traditional class incremental setup, which assumes that each task contains unseen classes. CIR assumes that we can easily access abundant unlabeled data from external sources, such as the Internet. Therefore, we propose two components that efficiently use the unlabeled data to ensure the high stability and the plasticity of models trained in CIR setup. First, we introduce multi-level knowledge distillation (MLKD) that distills knowledge from multiple previous models across multiple perspectives, including features and logits, so the model can maintain much various previous knowledge. Moreover, we implement dynamic self-supervised loss (SSL) to utilize the unlabeled data that accelerates the learning of new classes, while dynamic weighting of SSL keeps the focus of training to the primary task. Both of our proposed components significantly improve the performance in CIR setup, achieving 2nd place in the CVPR 5th CLVISION Challenge.
♻ ☆ Flow Matching-Based Generative Modeling for Efficient and Scalable Data Assimilation
Data assimilation (DA) is the problem of sequentially estimating the state of a dynamical system from noisy observations. Recent advances in generative modeling have inspired new approaches to DA in high-dimensional nonlinear settings, especially the ensemble score filter (EnSF). However, these come at a significant computational burden due to slow sampling. In this paper, we introduce a new filtering framework based on flow matching (FM) -- called the ensemble flow filter (EnFF) -- to accelerate sampling and enable flexible design of probability paths. EnFF -- a training-free DA approach -- integrates MC estimators for the marginal FM vector field (VF) and a localized guidance to assimilate observations. EnFF has faster sampling and more flexibility in VF design compared to existing generative modeling for DA. Theoretically, we show that EnFF encompasses classical filtering methods such as the bootstrap particle filter and the ensemble Kalman filter as special cases. Experiments on high-dimensional filtering benchmarks demonstrate improved cost-accuracy tradeoffs and the ability to leverage larger ensembles than prior methods. Our results highlight the promise of FM as a scalable tool for filtering in high-dimensional applications that enable the use of large ensembles.
comment: correcting authorship footnote, reformatting figures
♻ ☆ Source-Guided Flow Matching
Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, physics-informed generative tasks, and imaging inverse problems demonstrate the effectiveness and flexibility of the proposed framework.
♻ ☆ PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting
Long-term time series forecasting (LTSF) is a fundamental task with wide-ranging applications. Although Transformer-based models have made significant breakthroughs in forecasting, their effectiveness for time series forecasting remains debatable. In this paper, we revisit the significance of self-attention and propose a simple yet effective mechanism, Periodic-Nested Group Attention, namely PENGUIN. Our approach highlights the importance of explicitly modeling periodic patterns and incorporating relative attention bias for effective time series modeling. To this end, we introduce a periodic-nested relative attention bias that captures periodic structures directly. To handle multiple coexisting periodicities (e.g., daily and weekly cycles), we design a grouped attention mechanism, where each group targets a specific periodicity using a multi-query attention mechanism. Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models.
♻ ☆ Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9 % on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
♻ ☆ Joint Optimization of Energy Consumption and Completion Time in Federated Learning
Federated Learning (FL) is an intriguing distributed machine learning approach due to its privacy-preserving characteristics. To balance the trade-off between energy and execution latency, and thus accommodate different demands and application scenarios, we formulate an optimization problem to minimize a weighted sum of total energy consumption and completion time through two weight parameters. The optimization variables include bandwidth, transmission power and CPU frequency of each device in the FL system, where all devices are linked to a base station and train a global model collaboratively. Through decomposing the non-convex optimization problem into two subproblems, we devise a resource allocation algorithm to determine the bandwidth allocation, transmission power, and CPU frequency for each participating device. We further present the convergence analysis and computational complexity of the proposed algorithm. Numerical results show that our proposed algorithm not only has better performance at different weight parameters (i.e., different demands) but also outperforms the state of the art.
comment: This paper appears in the Proceedings of IEEE International Conference on Distributed Computing Systems (ICDCS) 2022. Please feel free to contact us for questions or remarks
♻ ☆ PAR-AdvGAN: Improving Adversarial Attack Capability with Progressive Auto-Regression AdvGAN ECML-PKDD 2025
Deep neural networks have demonstrated remarkable performance across various domains. However, they are vulnerable to adversarial examples, which can lead to erroneous predictions. Generative Adversarial Networks (GANs) can leverage the generators and discriminators model to quickly produce high-quality adversarial examples. Since both modules train in a competitive and simultaneous manner, GAN-based algorithms like AdvGAN can generate adversarial examples with better transferability compared to traditional methods. However, the generation of perturbations is usually limited to a single iteration, preventing these examples from fully exploiting the potential of the methods. To tackle this issue, we introduce a novel approach named Progressive Auto-Regression AdvGAN (PAR-AdvGAN). It incorporates an auto-regressive iteration mechanism within a progressive generation network to craft adversarial examples with enhanced attack capability. We thoroughly evaluate our PAR-AdvGAN method with a large-scale experiment, demonstrating its superior performance over various state-of-the-art black-box adversarial attacks, as well as the original AdvGAN.Moreover, PAR-AdvGAN significantly accelerates the adversarial example generation, i.e., achieving the speeds of up to 335.5 frames per second on Inception-v3 model, outperforming the gradient-based transferable attack algorithms. Our code is available at: https://github.com/LMBTough/PAR
comment: Best paper award of ECML-PKDD 2025
♻ ☆ GRAFT: Gradient-Aware Fast MaxVol Technique for Dynamic Data Sampling
Training modern neural networks on large datasets is computationally and environmentally costly. We introduce GRAFT, a scalable in-training subset selection method that (i) extracts a low-rank feature representation for each batch, (ii) applies a Fast MaxVol sampler to select a small, diverse subset that spans the batch's dominant subspace, and (iii) dynamically adjusts the subset size using a gradient-approximation criterion. By operating in low-rank subspaces and training on carefully chosen examples instead of full batches, GRAFT preserves the training trajectory while reducing wall-clock time, energy consumption, and $\mathrm{CO}_2$ emissions. Across multiple benchmarks, GRAFT matches or exceeds recent selection baselines in both accuracy and efficiency, providing a favorable trade-off between accuracy, efficiency, and emissions.
♻ ☆ Towards Bridging the Reward-Generation Gap in Direct Alignment Algorithms
Direct Alignment Algorithms (DAAs), such as Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO), have emerged as efficient alternatives to Reinforcement Learning from Human Feedback (RLHF) algorithms for aligning large language models (LLMs) with human preferences. However, DAAs suffer from a fundamental limitation we identify as the "reward-generation gap" -- a misalignment between optimization objectives during training and actual generation performance during inference. In this paper, we find a contributor to the reward-generation gap is the mismatch between the inherent importance of prefix tokens during the LLM generation process and how this importance is reflected in the implicit reward functions of DAAs. To bridge the gap, we adopt a token-level MDP perspective of DAAs to analyze its limitations and introduce a simple yet effective approach called Prefix-Oriented Equal-length Training (POET), which truncates both preferred and dispreferred responses to match the shorter one's length. Training with \mname, where both responses in each sample are truncated to equal length, resulting in diverse truncated lengths across samples, the optimization of DAAs objective is implicitly constrained to converge across all timesteps of token-level MDP, thus paying more attention to prefix tokens than the standard DAAs. We conduct experiments with DPO and SimPO, two representative DAAs, demonstrating that POET improves over their standard implementations, achieving up to 15.6 points in AlpacaEval 2 and overall improvements across downstream tasks. Our results highlight the importance of addressing the misalignment between reward optimization and generation performance in DAAs.
♻ ☆ Rotary Offset Features in Large Language Models
Transformer-based Large Language Models (LLMs) rely on positional encodings to provide sequence position information to their attention mechanism. Rotary Positional Encodings (RoPE), which encode relative position by rotating queries and keys, have become widely used in modern LLMs. We study the features and patterns that emerge in queries and keys when using rotary embeddings and introduce the concept of rotary offset features. Our analysis reveals that these features, which frequently exhibit large activations and are often interpreted as outliers, arise consistently across layers, attention heads, and model architectures. We derive bounds predicting which rotary frequencies give rise to rotary offset features and the minimum angle between the query-key pairs for these features. We verify our predictions empirically across models of different sizes and architectures.
♻ ☆ Unsupervised Automata Learning via Discrete Optimization
Automata learning is a successful tool for many application domains such as robotics and automatic verification. Typically, automata learning techniques operate in a supervised learning setting (active or passive) where they learn a finite state machine in contexts where additional information, such as labeled system executions, is available. However, other settings, such as learning from unlabeled data - an important aspect in machine learning - remain unexplored. To overcome this limitation, we propose a framework for learning a deterministic finite automaton (DFA) from a given multi-set of unlabeled words. We show that this problem is computationally hard and develop three learning algorithms based on constraint optimization. Moreover, we introduce novel regularization schemes for our optimization problems that improve the overall interpretability of our DFAs. Using a prototype implementation, we demonstrate practical feasibility in the context of unsupervised anomaly detection.
♻ ☆ Environmental Feature Engineering and Statistical Validation for ML-Based Path Loss Prediction
Wireless communications rely on path loss modeling, which is most effective when it includes the physical details of the propagation environment. Acquiring this data has historically been challenging, but geographic information systems data is becoming increasingly available with higher resolution and accuracy. Access to such details enables propagation models to more accurately predict coverage and account for interference in wireless deployments. Machine learning-based modeling can significantly support this effort, with feature based approaches allowing for accurate, efficient, and scalable propagation modeling. Building on previous work, we introduce an extended set of features that improves prediction accuracy while, most importantly, proving model generalization through rigorous statistical assessment and the use of test set holdouts.
comment: 4 pages, 4 figures, Submitted to IEEE AWPL
♻ ☆ Generalized Tree Edit Distance (GTED): A Faithful Evaluation Metric for Statement Autoformalization ICML25
Statement autoformalization, the automated translation of statements from natural language into formal languages, has become a subject of extensive research, yet the development of robust automated evaluation metrics remains limited. Existing evaluation methods often lack semantic understanding, face challenges with high computational costs, and are constrained by the current progress of automated theorem proving. To address these issues, we propose GTED (Generalized Tree Edit Distance), a novel evaluation framework that first standardizes formal statements and converts them into operator trees, then determines the semantic similarity using the eponymous GTED metric. Across the miniF2F and ProofNet benchmarks, GTED consistently ranks as a top-performing metric, achieving the highest accuracy and Kappa on miniF2F and the joint-highest accuracy on ProofNet. This strong overall performance provides the community with a computationally lightweight and more faithful metric for automated evaluation. The code and experimental results are available at https://github.com/XiaoyangLiu-sjtu/GTED.
comment: Accepted to AI4Math@ICML25
♻ ☆ Sentiment Reasoning for Healthcare ACL 2025
Transparency in AI healthcare decision-making is crucial. By incorporating rationales to explain reason for each predicted label, users could understand Large Language Models (LLMs)'s reasoning to make better decision. In this work, we introduce a new task - Sentiment Reasoning - for both speech and text modalities, and our proposed multimodal multitask framework and the world's largest multimodal sentiment analysis dataset. Sentiment Reasoning is an auxiliary task in sentiment analysis where the model predicts both the sentiment label and generates the rationale behind it based on the input transcript. Our study conducted on both human transcripts and Automatic Speech Recognition (ASR) transcripts shows that Sentiment Reasoning helps improve model transparency by providing rationale for model prediction with quality semantically comparable to humans while also improving model's classification performance (+2% increase in both accuracy and macro-F1) via rationale-augmented fine-tuning. Also, no significant difference in the semantic quality of generated rationales between human and ASR transcripts. All code, data (five languages - Vietnamese, English, Chinese, German, and French) and models are published online: https://github.com/leduckhai/Sentiment-Reasoning
comment: ACL 2025 Industry Track (Oral)
♻ ☆ NovoMolGen: Rethinking Molecular Language Model Pretraining
Designing de-novo molecules with desired property profiles requires efficient exploration of the vast chemical space ranging from $10^{23}$ to $10^{60}$ possible synthesizable candidates. While various deep generative models have been developed to design small molecules using diverse input representations, Molecular Large Language Models (Mol-LLMs) based on string representations have emerged as a scalable approach capable of exploring billions of molecules. However, there remains limited understanding regarding how standard language modeling practices such as textual representations, tokenization strategies, model size, and dataset scale impact molecular generation performance. In this work, we systematically investigate these critical aspects by introducing NovoMolGen, a family of transformer-based foundation models pretrained on 1.5 billion molecules for de-novo molecule generation. Through extensive empirical analyses, we identify a weak correlation between performance metrics measured during pretraining and actual downstream performance, revealing important distinctions between molecular and general NLP training dynamics. NovoMolGen establishes new state-of-the-art results, substantially outperforming prior Mol-LLMs and specialized generative models in both unconstrained and goal-directed molecular generation tasks, thus providing a robust foundation for advancing efficient and effective molecular modeling strategies.
♻ ☆ PoisonSwarm: Universal Harmful Information Synthesis via Model Crowdsourcing
To construct responsible and secure AI applications, harmful information data is widely utilized for adversarial testing and the development of safeguards. Existing studies mainly leverage Large Language Models (LLMs) to synthesize data to obtain high-quality task datasets at scale, thereby avoiding costly human annotation. However, limited by the safety alignment mechanisms of LLMs, the synthesis of harmful data still faces challenges in generation reliability and content diversity. In this study, we propose a novel harmful information synthesis framework, PoisonSwarm, which applies the model crowdsourcing strategy to generate diverse harmful data while maintaining a high success rate. Specifically, we generate abundant benign data as the based templates in a counterfactual manner. Subsequently, we decompose each based template into multiple semantic units and perform unit-by-unit toxification and final refinement through dynamic model switching, thus ensuring the success of synthesis. Experimental results demonstrate that PoisonSwarm achieves state-of-the-art performance in synthesizing different categories of harmful data with high scalability and diversity.
♻ ☆ Representing spherical tensors with scalar-based machine-learning models
Rotational symmetry plays a central role in physics, providing an elegant framework to describe how the properties of 3D objects -- from atoms to the macroscopic scale -- transform under the action of rigid rotations. Equivariant models of 3D point clouds are able to approximate structure-property relations in a way that is fully consistent with the structure of the rotation group, by combining intermediate representations that are themselves spherical tensors. The symmetry constraints however make this approach computationally demanding and cumbersome to implement, which motivates increasingly popular unconstrained architectures that learn approximate symmetries as part of the training process. In this work, we explore a third route to tackle this learning problem, where equivariant functions are expressed as the product of a scalar function of the point cloud coordinates and a small basis of tensors with the appropriate symmetry. We also propose approximations of the general expressions that, while lacking universal approximation properties, are fast, simple to implement, and accurate in practical settings.
♻ ☆ Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B
In-Context Learning (ICL) is an intriguing ability of large language models (LLMs). Despite a substantial amount of work on its behavioral aspects and how it emerges in miniature setups, it remains unclear which mechanism assembles task information from the individual examples in a fewshot prompt. We use causal interventions to identify information flow in Gemma-2 2B for five naturalistic ICL tasks. We find that the model infers task information using a two-step strategy we call contextualize-then-aggregate: In the lower layers, the model builds up representations of individual fewshot examples, which are contextualized by preceding examples through connections between fewshot input and output tokens across the sequence. In the higher layers, these representations are aggregated to identify the task and prepare prediction of the next output. The importance of the contextualization step differs between tasks, and it may become more important in the presence of ambiguous examples. Overall, by providing rigorous causal analysis, our results shed light on the mechanisms through which ICL happens in language models.
♻ ☆ FedEFC: Federated Learning Using Enhanced Forward Correction Against Noisy Labels
Federated Learning (FL) is a powerful framework for privacy-preserving distributed learning. It enables multiple clients to collaboratively train a global model without sharing raw data. However, handling noisy labels in FL remains a major challenge due to heterogeneous data distributions and communication constraints, which can severely degrade model performance. To address this issue, we propose FedEFC, a novel method designed to tackle the impact of noisy labels in FL. FedEFC mitigates this issue through two key techniques: (1) prestopping, which prevents overfitting to mislabeled data by dynamically halting training at an optimal point, and (2) loss correction, which adjusts model updates to account for label noise. In particular, we develop an effective loss correction tailored to the unique challenges of FL, including data heterogeneity and decentralized training. Furthermore, we provide a theoretical analysis, leveraging the composite proper loss property, to demonstrate that the FL objective function under noisy label distributions can be aligned with the clean label distribution. Extensive experimental results validate the effectiveness of our approach, showing that it consistently outperforms existing FL techniques in mitigating the impact of noisy labels, particularly under heterogeneous data settings (e.g., achieving up to 41.64% relative performance improvement over the existing loss correction method).
comment: 9 pages, 3 figures, version 2
♻ ☆ Generative diffusion posterior sampling for informative likelihoods
Sequential Monte Carlo (SMC) methods have recently shown successful results for conditional sampling of generative diffusion models. In this paper we propose a new diffusion posterior SMC sampler achieving improved statistical efficiencies, particularly under outlier conditions or highly informative likelihoods. The key idea is to construct an observation path that correlates with the diffusion model and to design the sampler to leverage this correlation for more efficient sampling. Empirical results conclude the efficiency.
comment: Commemorative issue for celebrating Thomas Kailath's 90th birthday
♻ ☆ Implicit Regularization Makes Overparameterized Asymmetric Matrix Sensing Robust to Perturbations
Several key questions remain unanswered regarding overparameterized learning models. It is unclear how (stochastic) gradient descent finds solutions that generalize well, and in particular the role of small random initializations. Matrix sensing, which is the problem of reconstructing a low-rank matrix from a few linear measurements, has become a standard prototypical setting to study these phenomena. Previous works have shown that matrix sensing can be solved by factorized gradient descent, provided the random initialization is extremely small. In this paper, we find that factorized gradient descent is highly robust to certain perturbations. This lets us use a perturbation term to capture both the effects of imperfect measurements, discretization by gradient descent, and other noise, resulting in a general formulation which we call \textit{perturbed gradient flow}. We find that not only is this equivalent formulation easier to work with, but it leads to sharper sample and time complexities than previous work, handles moderately small initializations, and the results are naturally robust to perturbations such as noisy measurements or changing measurement matrices. Finally, we also analyze mini-batch stochastic gradient descent using the formulation, where we find improved sample complexity.
comment: Version 2: Major revision with new title and some new results like the analysis of stochastic gradient descent
♻ ☆ Robustness of deep learning classification to adversarial input on GPUs: asynchronous parallel accumulation is a source of vulnerability
The ability of machine learning (ML) classification models to resist small, targeted input perturbations -- known as adversarial attacks -- is a key measure of their safety and reliability. We show that floating-point non-associativity (FPNA) coupled with asynchronous parallel programming on GPUs is sufficient to result in misclassification, without any perturbation to the input. Additionally, we show that standard adversarial robustness results may be overestimated up to 4.6 when not considering machine-level details. We develop a novel black-box attack using Bayesian optimization to discover external workloads that can change the instruction scheduling which bias the output of reductions on GPUs and reliably lead to misclassification. Motivated by these results, we present a new learnable permutation (LP) gradient-based approach to learning floating-point operation orderings that lead to misclassifications. The LP approach provides a worst-case estimate in a computationally efficient manner, avoiding the need to run identical experiments tens of thousands of times over a potentially large set of possible GPU states or architectures. Finally, using instrumentation-based testing, we investigate parallel reduction ordering across different GPU architectures under external background workloads, when utilizing multi-GPU virtualization, and when applying power capping. Our results demonstrate that parallel reduction ordering varies significantly across architectures under the first two conditions, substantially increasing the search space required to fully test the effects of this parallel scheduler-based vulnerability. These results and the methods developed here can help to include machine-level considerations into adversarial robustness assessments, which can make a difference in safety and mission critical applications.
♻ ☆ Information-Theoretic Decentralized Secure Aggregation with Collusion Resilience
In decentralized federated learning (FL), multiple clients collaboratively learn a shared machine learning (ML) model by leveraging their privately held datasets distributed across the network, through interactive exchange of the intermediate model updates. To ensure data security, cryptographic techniques are commonly employed to protect model updates during aggregation. Despite growing interest in secure aggregation, existing works predominantly focus on protocol design and computational guarantees, with limited understanding of the fundamental information-theoretic limits of such systems. Moreover, optimal bounds on communication and key usage remain unknown in decentralized settings, where no central aggregator is available. Motivated by these gaps, we study the problem of decentralized secure aggregation (DSA) from an information-theoretic perspective. Specifically, we consider a network of $K$ fully-connected users, each holding a private input -- an abstraction of local training data -- who aim to securely compute the sum of all inputs. The security constraint requires that no user learns anything beyond the input sum, even when colluding with up to $T$ other users. We characterize the optimal rate region, which specifies the minimum achievable communication and secret key rates for DSA. In particular, we show that to securely compute one symbol of the desired input sum, each user must (i) transmit at least one symbol to others, (ii) hold at least one symbol of secret key, and (iii) all users must collectively hold no fewer than $K - 1$ independent key symbols. Our results establish the fundamental performance limits of DSA, providing insights for the design of provably secure and communication-efficient protocols in distributed learning systems.
comment: Submitted to IEEE for potential journal publication
♻ ☆ On the Challenges and Opportunities in Generative AI
The field of deep generative modeling has grown rapidly in the last few years. With the availability of massive amounts of training data coupled with advances in scalable unsupervised learning paradigms, recent large-scale generative models show tremendous promise in synthesizing high-resolution images and text, as well as structured data such as videos and molecules. However, we argue that current large-scale generative AI models exhibit several fundamental shortcomings that hinder their widespread adoption across domains. In this work, our objective is to identify these issues and highlight key unresolved challenges in modern generative AI paradigms that should be addressed to further enhance their capabilities, versatility, and reliability. By identifying these challenges, we aim to provide researchers with insights for exploring fruitful research directions, thus fostering the development of more robust and accessible generative AI solutions.
♻ ☆ DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
♻ ☆ Perceptual Implications of Automatic Anonymization in Pathological Speech
Automatic anonymization techniques are essential for ethical sharing of pathological speech data, yet their perceptual consequences remain understudied. We present a comprehensive human-centered analysis of anonymized pathological speech, using a structured protocol involving ten native and non-native German listeners with diverse linguistic, clinical, and technical backgrounds. Listeners evaluated anonymized-original utterance pairs from 180 speakers spanning Cleft Lip and Palate, Dysarthria, Dysglossia, Dysphonia, and healthy controls. Speech was anonymized using state-of-the-art automatic methods (equal error rates in the range of 30-40%). Listeners completed Turing-style discrimination and quality rating tasks under zero-shot (single-exposure) and few-shot (repeated-exposure) conditions. Discrimination accuracy was high overall (91% zero-shot; 93% few-shot), but varied by disorder (repeated-measures ANOVA: p=0.007), ranging from 96% (Dysarthria) to 86% (Dysphonia). Anonymization consistently reduced perceived quality across groups (from 83% to 59%, p<0.001), with pathology-specific degradation patterns (one-way ANOVA: p=0.005). Native listeners showed a non-significant trend toward higher original speech ratings (Delta=4%, p=0.199), but this difference was minimal after anonymization (Delta=1%, p=0.724). No significant gender-based bias was observed. Perceptual outcomes did not correlate with automatic metrics; intelligibility was linked to perceived quality in original speech but not after anonymization. These findings underscore the need for listener-informed, disorder-specific anonymization strategies that preserve both privacy and perceptual integrity.
♻ ☆ Plinius: Secure and Persistent Machine Learning Model Training
With the increasing popularity of cloud based machine learning (ML) techniques there comes a need for privacy and integrity guarantees for ML data. In addition, the significant scalability challenges faced by DRAM coupled with the high access-times of secondary storage represent a huge performance bottleneck for ML systems. While solutions exist to tackle the security aspect, performance remains an issue. Persistent memory (PM) is resilient to power loss (unlike DRAM), provides fast and fine-granular access to memory (unlike disk storage) and has latency and bandwidth close to DRAM (in the order of ns and GB/s, respectively). We present PLINIUS, a ML framework using Intel SGX enclaves for secure training of ML models and PM for fault tolerance guarantees. PLINIUS uses a novel mirroring mechanism to create and maintain (i) encrypted mirror copies of ML models on PM, and (ii) encrypted training data in byte-addressable PM, for near-instantaneous data recovery after a system failure. Compared to disk-based checkpointing systems, PLINIUS is 3.2x and 3.7x faster respectively for saving and restoring models on real PM hardware, achieving robust and secure ML model training in SGX enclaves.
♻ ☆ Optimal Batch-Size Control for Low-Latency Federated Learning with Device Heterogeneity
Federated learning (FL) has emerged as a popular approach for collaborative machine learning in sixth-generation (6G) networks, primarily due to its privacy-preserving capabilities. The deployment of FL algorithms is expected to empower a wide range of Internet-of-Things (IoT) applications, e.g., autonomous driving, augmented reality, and healthcare. The mission-critical and time-sensitive nature of these applications necessitates the design of low-latency FL frameworks that guarantee high learning performance. In practice, achieving low-latency FL faces two challenges: the overhead of computing and transmitting high-dimensional model updates, and the heterogeneity in communication-and-computation (C$^2$) capabilities across devices. To address these challenges, we propose a novel C$^2$-aware framework for optimal batch-size control that minimizes end-to-end (E2E) learning latency while ensuring convergence. The framework is designed to balance a fundamental C$^2$ tradeoff as revealed through convergence analysis. Specifically, increasing batch sizes improves the accuracy of gradient estimation in FL and thus reduces the number of communication rounds required for convergence, but results in higher per-round latency, and vice versa. The associated problem of latency minimization is intractable; however, we solve it by designing an accurate and tractable surrogate for convergence speed, with parameters fitted to real data. This approach yields two batch-size control strategies tailored to scenarios with slow and fast fading, while also accommodating device heterogeneity. Extensive experiments using real datasets demonstrate that the proposed strategies outperform conventional batch-size adaptation schemes that do not consider the C$^2$ tradeoff or device heterogeneity.
♻ ☆ Nested Graph Pseudo-Label Refinement for Noisy Label Domain Adaptation Learning
Graph Domain Adaptation (GDA) facilitates knowledge transfer from labeled source graphs to unlabeled target graphs by learning domain-invariant representations, which is essential in applications such as molecular property prediction and social network analysis. However, most existing GDA methods rely on the assumption of clean source labels, which rarely holds in real-world scenarios where annotation noise is pervasive. This label noise severely impairs feature alignment and degrades adaptation performance under domain shifts. To address this challenge, we propose Nested Graph Pseudo-Label Refinement (NeGPR), a novel framework tailored for graph-level domain adaptation with noisy labels. NeGPR first pretrains dual branches, i.e., semantic and topology branches, by enforcing neighborhood consistency in the feature space, thereby reducing the influence of noisy supervision. To bridge domain gaps, NeGPR employs a nested refinement mechanism in which one branch selects high-confidence target samples to guide the adaptation of the other, enabling progressive cross-domain learning. Furthermore, since pseudo-labels may still contain noise and the pre-trained branches are already overfitted to the noisy labels in the source domain, NeGPR incorporates a noise-aware regularization strategy. This regularization is theoretically proven to mitigate the adverse effects of pseudo-label noise, even under the presence of source overfitting, thus enhancing the robustness of the adaptation process. Extensive experiments on benchmark datasets demonstrate that NeGPR consistently outperforms state-of-the-art methods under severe label noise, achieving gains of up to 12.7% in accuracy.
♻ ☆ Expected Free Energy-based Planning as Variational Inference
We address the problem of planning under uncertainty, where an agent must choose actions that not only achieve desired outcomes but also reduce uncertainty. Traditional methods often treat exploration and exploitation as separate objectives, lacking a unified inferential foundation. Active inference, grounded in the Free Energy Principle, provides such a foundation by minimizing Expected Free Energy (EFE), a cost function that combines utility with epistemic drives, such as ambiguity resolution and novelty seeking. However, the computational burden of EFE minimization had remained a significant obstacle to its scalability. In this paper, we show that EFE-based planning arises naturally from minimizing a variational free energy functional on a generative model augmented with preference and epistemic priors. This result reinforces theoretical consistency with the Free Energy Principle by casting planning under uncertainty itself as a form of variational inference. Our formulation yields policies that jointly support goal achievement and information gain, while incorporating a complexity term that accounts for bounded computational resources. This unifying framework connects and extends existing methods, enabling scalable, resource-aware implementations of active inference agents.
comment: 18 pages
♻ ☆ Comparative Explanations: Explanation Guided Decision Making for Human-in-the-Loop Preference Selection
This paper introduces Multi-Output LOcal Narrative Explanation (MOLONE), a novel comparative explanation method designed to enhance preference selection in human-in-the-loop Preference Bayesian optimization (PBO). The preference elicitation in PBO is a non-trivial task because it involves navigating implicit trade-offs between vector-valued outcomes, subjective priorities of decision-makers, and decision-makers' uncertainty in preference selection. Existing explainable AI (XAI) methods for BO primarily focus on input feature importance, neglecting the crucial role of outputs (objectives) in human preference elicitation. MOLONE addresses this gap by providing explanations that highlight both input and output importance, enabling decision-makers to understand the trade-offs between competing objectives and make more informed preference selections. MOLONE focuses on local explanations, comparing the importance of input features and outcomes across candidate samples within a local neighborhood of the search space, thus capturing nuanced differences relevant to preference-based decision-making. We evaluate MOLONE within a PBO framework using benchmark multi-objective optimization functions, demonstrating its effectiveness in improving convergence compared to noisy preference selections. Furthermore, a user study confirms that MOLONE significantly accelerates convergence in human-in-the-loop scenarios by facilitating more efficient identification of preferred options.
comment: Accepted at The World Conference on eXplainable Artificial Intelligence 2025 (XAI-2025)
♻ ☆ Explainable Bayesian Optimization
Manual parameter tuning of cyber-physical systems is a common practice, but it is labor-intensive. Bayesian Optimization (BO) offers an automated alternative, yet its black-box nature reduces trust and limits human-BO collaborative system tuning. Experts struggle to interpret BO recommendations due to the lack of explanations. This paper addresses the post-hoc BO explainability problem for cyber-physical systems. We introduce TNTRules (Tune-No-Tune Rules), a novel algorithm that provides both global and local explanations for BO recommendations. TNTRules generates actionable rules and visual graphs, identifying optimal solution bounds and ranges, as well as potential alternative solutions. Unlike existing explainable AI (XAI) methods, TNTRules is tailored specifically for BO, by encoding uncertainty via a variance pruning technique and hierarchical agglomerative clustering. A multi-objective optimization approach allows maximizing explanation quality. We evaluate TNTRules using established XAI metrics (Correctness, Completeness, and Compactness) and compare it against adapted baseline methods. The results demonstrate that TNTRules generates high-fidelity, compact, and complete explanations, significantly outperforming three baselines on 5 multi-objective testing functions and 2 hyperparameter tuning problems.
comment: Accepted at The World Conference on eXplainable Artificial Intelligence 2025 (XAI-2025)
♻ ☆ Fundamental Limits of Matrix Sensing: Exact Asymptotics, Universality, and Applications
In the matrix sensing problem, one wishes to reconstruct a matrix from (possibly noisy) observations of its linear projections along given directions. We consider this model in the high-dimensional limit: while previous works on this model primarily focused on the recovery of low-rank matrices, we consider in this work more general classes of structured signal matrices with potentially large rank, e.g. a product of two matrices of sizes proportional to the dimension. We provide rigorous asymptotic equations characterizing the Bayes-optimal learning performance from a number of samples which is proportional to the number of entries in the matrix. Our proof is composed of three key ingredients: $(i)$ we prove universality properties to handle structured sensing matrices, related to the ''Gaussian equivalence'' phenomenon in statistical learning, $(ii)$ we provide a sharp characterization of Bayes-optimal learning in generalized linear models with Gaussian data and structured matrix priors, generalizing previously studied settings, and $(iii)$ we leverage previous works on the problem of matrix denoising. The generality of our results allow for a variety of applications: notably, we mathematically establish predictions obtained via non-rigorous methods from statistical physics in [ETB+24] regarding Bilinear Sequence Regression, a benchmark model for learning from sequences of tokens, and in [MTM+24] on Bayes-optimal learning in neural networks with quadratic activation function, and width proportional to the dimension.
♻ ☆ Analytics Modelling over Multiple Datasets using Vector Embeddings
The massive increase in the data volume and dataset availability for analysts compels researchers to focus on data content and select high-quality datasets to enhance the performance of analytics operators. While selecting high-quality data significantly boosts analytical accuracy and efficiency, the exact process is very challenging given large-scale dataset availability. To address this issue, we propose a novel methodology that infers the outcome of analytics operators by creating a model from the available datasets. Each dataset is transformed to a vector embedding representation generated by our proposed deep learning model NumTabData2Vec, where similarity search are employed. Through experimental evaluation, we compare the prediction performance and the execution time of our framework to another state-of-the-art modelling operator framework, illustrating that our approach predicts analytics outcomes accurately, and increases speedup. Furthermore, our vectorization model can project different real-world scenarios to a lower vector embedding representation accurately and distinguish them.
♻ ☆ ReasonRank: Empowering Passage Ranking with Strong Reasoning Ability
Large Language Model (LLM) based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models, many studies have demonstrated that step-by-step reasoning during test-time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated reasoning-intensive training data synthesis framework, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. A self-consistency data filtering mechanism is designed to ensure the data quality. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage post-training approach, which includes a cold-start supervised fine-tuning (SFT) stage for reasoning pattern learning and a reinforcement learning (RL) stage for further ranking ability enhancement. During the RL stage, based on the nature of listwise ranking, we design a multi-view ranking reward, which is more effective than a ranking metric-based reward. Extensive experiments demonstrate that our trained reasoning-intensive reranker \textbf{ReasonRank} outperforms existing baselines significantly and also achieves much lower latency than pointwise reranker Rank1. \textbf{Through further experiments, our ReasonRank has achieved state-of-the-art (SOTA) performance 40.6 on the BRIGHT leaderboard\footnote{https://brightbenchmark.github.io/}.} Our codes are available at https://github.com/8421BCD/ReasonRank.
comment: 21 pages
♻ ☆ CCD: Continual Consistency Diffusion for Lifelong Generative Modeling
While diffusion-based models have shown remarkable generative capabilities in static settings, their extension to continual learning (CL) scenarios remains fundamentally constrained by Generative Catastrophic Forgetting (GCF). We observe that even with a rehearsal buffer, new generative skills often overwrite previous ones, degrading performance on earlier tasks. Although some initial efforts have explored this space, most rely on heuristics borrowed from continual classification methods or use trained diffusion models as ad hoc replay generators, lacking a principled, unified solution to mitigating GCF and often conducting experiments under fragmented and inconsistent settings. To address this gap, we introduce the Continual Diffusion Generation (CDG), a structured pipeline that redefines how diffusion models are implemented under CL and enables systematic evaluation of GCF. Beyond the empirical pipeline, we propose the first theoretical foundation for CDG, grounded in a cross-task analysis of diffusion-specific generative dynamics. Our theoretical investigation identifies three fundamental consistency principles essential for preserving knowledge in the rehearsal buffer over time: inter-task knowledge consistency, unconditional knowledge consistency, and prior knowledge consistency. These criteria expose the latent mechanisms through which generative forgetting manifests across sequential tasks. Motivated by these insights, we further propose \textit{Continual Consistency Diffusion} (CCD), a principled training framework that enforces these consistency objectives via hierarchical loss functions: $\mathcal{L}_{IKC}$, $\mathcal{L}_{UKC}$, and $\mathcal{L}_{PKC}$. Extensive experiments show that CCD achieves SOTA performance across various benchmarks, especially improving generative metrics in overlapping-task scenarios.
♻ ☆ GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization ICML 2025
Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU architectures where traditional development aids are scarce. This paper introduces an LLM-powered "GPU Kernel Scientist," an automated methodology for iteratively refining accelerator kernels. Our methodology employs LLMs in a multi-stage, evolutionary process: (a) strategically selecting promising prior code versions as a basis for new iterations; (b) generating hypotheses for optimization experiments, based on existing code and assimilated knowledge from general GPU literature; and (c) autonomously implementing these experiments through code modification and subsequent submission to an external evaluation system, using only observed timing data as performance feedback. We detail how this approach navigates the challenges of the AMD MI300 target architecture and leverages LLMs to compensate for limited domain-specific human expertise. In addition to our results, we present the architectural design, operational workflow, and qualitative insights, highlighting the potential of LLM-driven agents to democratise and accelerate GPU kernel optimization, especially in resource-constrained or rapidly updating hardware environment.
comment: 4+1 page paper plus Appendices and Supplementary zip file. Presented at the ES-FoMo "Efficient Systems for Foundation Models" workshop at ICML 2025
♻ ☆ SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. First, SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models. To improve draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. Our code is available at https://github.com/jycha98/SpecExtend .
♻ ☆ Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?
Implicit bias plays an important role in explaining how overparameterized models generalize well. Explicit regularization like weight decay is often employed in addition to prevent overfitting. While both concepts have been studied separately, in practice, they often act in tandem. Understanding their interplay is key to controlling the shape and strength of implicit bias, as it can be modified by explicit regularization. To this end, we incorporate explicit regularization into the mirror flow framework and analyze its lasting effects on the geometry of the training dynamics, covering three distinct effects: positional bias, type of bias, and range shrinking. Our analytical approach encompasses a broad class of problems, including sparse coding, matrix sensing, single-layer attention, and LoRA, for which we demonstrate the utility of our insights. To exploit the lasting effect of regularization and highlight the potential benefit of dynamic weight decay schedules, we propose to switch off weight decay during training, which can improve generalization, as we demonstrate in experiments.
comment: 26 pages, 16 figures
♻ ☆ A Diffusion Model Framework for Unsupervised Neural Combinatorial Optimization ICML 2024
Learning to sample from intractable distributions over discrete sets without relying on corresponding training data is a central problem in a wide range of fields, including Combinatorial Optimization. Currently, popular deep learning-based approaches rely primarily on generative models that yield exact sample likelihoods. This work introduces a method that lifts this restriction and opens the possibility to employ highly expressive latent variable models like diffusion models. Our approach is conceptually based on a loss that upper bounds the reverse Kullback-Leibler divergence and evades the requirement of exact sample likelihoods. We experimentally validate our approach in data-free Combinatorial Optimization and demonstrate that our method achieves a new state-of-the-art on a wide range of benchmark problems.
comment: Accepted at ICML 2024
♻ ☆ Tripartite-GraphRAG via Plugin Ontologies
Large Language Models (LLMs) have shown remarkable capabilities across various domains, yet they struggle with knowledge-intensive tasks in areas that demand factual accuracy, e.g. industrial automation and healthcare. Key limitations include their tendency to hallucinate, lack of source traceability (provenance), and challenges in timely knowledge updates. Combining language models with knowledge graphs (GraphRAG) offers promising avenues for overcoming these deficits. However, a major challenge lies in creating such a knowledge graph in the first place. Here, we propose a novel approach that combines LLMs with a tripartite knowledge graph representation, which is constructed by connecting complex, domain-specific objects via a curated ontology of corresponding, domain-specific concepts to relevant sections within chunks of text through a concept-anchored pre-analysis of source documents starting from an initial lexical graph. Subsequently, we formulate LLM prompt creation as an unsupervised node classification problem allowing for the optimization of information density, coverage, and arrangement of LLM prompts at significantly reduced lengths. An initial experimental evaluation of our approach on a healthcare use case, involving multi-faceted analyses of patient anamneses given a set of medical concepts as well as a series of clinical guideline literature, indicates its potential to optimize information density, coverage, and arrangement of LLM prompts while significantly reducing their lengths, which, in turn, may lead to reduced costs as well as more consistent and reliable LLM outputs.
♻ ☆ Randomized PCA Forest for Outlier Detection
We propose a novel unsupervised outlier detection method based on Randomized Principal Component Analysis (PCA). Inspired by the performance of Randomized PCA (RPCA) Forest in approximate K-Nearest Neighbor (KNN) search, we develop a novel unsupervised outlier detection method that utilizes RPCA Forest for outlier detection. Experimental results showcase the superiority of the proposed approach compared to the classical and state-of-the-art methods in performing the outlier detection task on several datasets while performing competitively on the rest. The extensive analysis of the proposed method reflects it high generalization power and its computational efficiency, highlighting it as a good choice for unsupervised outlier detection.
♻ ☆ Reinforcement Learning for Jump-Diffusions, with Financial Applications
We study continuous-time reinforcement learning (RL) for stochastic control in which system dynamics are governed by jump-diffusion processes. We formulate an entropy-regularized exploratory control problem with stochastic policies to capture the exploration--exploitation balance essential for RL. Unlike the pure diffusion case initially studied by Wang et al. (2020), the derivation of the exploratory dynamics under jump-diffusions calls for a careful formulation of the jump part. Through a theoretical analysis, we find that one can simply use the same policy evaluation and $q$-learning algorithms in Jia and Zhou (2022a, 2023), originally developed for controlled diffusions, without needing to check a priori whether the underlying data come from a pure diffusion or a jump-diffusion. However, we show that the presence of jumps ought to affect parameterizations of actors and critics in general. We investigate as an application the mean--variance portfolio selection problem with stock price modelled as a jump-diffusion, and show that both RL algorithms and parameterizations are invariant with respect to jumps. Finally, we present a detailed study on applying the general theory to option hedging.
♻ ☆ One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs ICML 2025
Leveraging mathematical Large Language Models (LLMs) for proof generation is a fundamental topic in LLMs research. We argue that the ability of current LLMs to prove statements largely depends on whether they have encountered the relevant proof process during training. This reliance limits their deeper understanding of mathematical theorems and related concepts. Inspired by the pedagogical method of "proof by counterexamples" commonly used in human mathematics education, our work aims to enhance LLMs' ability to conduct mathematical reasoning and proof through counterexamples. Specifically, we manually create a high-quality, university-level mathematical benchmark, CounterMATH, which requires LLMs to prove mathematical statements by providing counterexamples, thereby assessing their grasp of mathematical concepts. Additionally, we develop a data engineering framework to automatically obtain training data for further model improvement. Extensive experiments and detailed analyses demonstrate that CounterMATH is challenging, indicating that LLMs, such as OpenAI o1, have insufficient counterexample-driven proof capabilities. Moreover, our exploration into model training reveals that strengthening LLMs' counterexample-driven conceptual reasoning abilities is crucial for improving their overall mathematical capabilities. We believe that our work offers new perspectives on the community of mathematical LLMs.
comment: ICML 2025
♻ ☆ Robust Graph Contrastive Learning with Information Restoration
The graph contrastive learning (GCL) framework has gained remarkable achievements in graph representation learning. However, similar to graph neural networks (GNNs), GCL models are susceptible to graph structural attacks. As an unsupervised method, GCL faces greater challenges in defending against adversarial attacks. Furthermore, there has been limited research on enhancing the robustness of GCL. To thoroughly explore the failure of GCL on the poisoned graphs, we investigate the detrimental effects of graph structural attacks against the GCL framework. We discover that, in addition to the conventional observation that graph structural attacks tend to connect dissimilar node pairs, these attacks also diminish the mutual information between the graph and its representations from an information-theoretical perspective, which is the cornerstone of the high-quality node embeddings for GCL. Motivated by this theoretical insight, we propose a robust graph contrastive learning framework with a learnable sanitation view that endeavors to sanitize the augmented graphs by restoring the diminished mutual information caused by the structural attacks. Additionally, we design a fully unsupervised tuning strategy to tune the hyperparameters without accessing the label information, which strictly coincides with the defender's knowledge. Extensive experiments demonstrate the effectiveness and efficiency of our proposed method compared to competitive baselines.
♻ ☆ Hybrid Adaptive Modeling in Process Monitoring: Leveraging Sequence Encoders and Physics-Informed Neural Networks
In this work, we explore the integration of Sequence Encoding for Online Parameter Identification with Physics-Informed Neural Networks to create a model that, once trained, can be utilized for real time applications with variable parameters, boundary conditions, and initial conditions. Recently, the combination of PINNs with Sparse Regression has emerged as a method for performing dynamical system identification through supervised learning and sparse regression optimization, while also solving the dynamics using PINNs. However, this approach can be limited by variations in parameters or boundary and initial conditions, requiring retraining of the model whenever changes occur. In this work, we introduce an architecture that employs Deep Sets or Sequence Encoders to encode dynamic parameters, boundary conditions, and initial conditions, using these encoded features as inputs for the PINN, enabling the model to adapt to changes in parameters, BCs, and ICs. We apply this approach to three different problems. First, we analyze the Rossler ODE system, demonstrating the robustness of the model with respect to noise and its ability to generalize. Next, we explore the model's capability in a 2D Navier-Stokes PDE problem involving flow past a cylinder with a parametric sinusoidal inlet velocity function, showing that the model can encode pressure data from a few points to identify the inlet velocity profile and utilize physics to compute velocity and pressure throughout the domain. Finally, we address a 1D heat monitoring problem using real data from the heating of glass fiber and thermoplastic composite plates.
♻ ☆ Short-Term Forecasting of Energy Production and Consumption Using Extreme Learning Machine: A Comprehensive MIMO based ELM Approach
A novel methodology for short-term energy forecasting using an Extreme Learning Machine ($\mathtt{ELM}$) is proposed. Using six years of hourly data collected in Corsica (France) from multiple energy sources (solar, wind, hydro, thermal, bioenergy, and imported electricity), our approach predicts both individual energy outputs and total production (including imports, which closely follow energy demand, modulo losses) through a Multi-Input Multi-Output ($\mathtt{MIMO}$) architecture. To address non-stationarity and seasonal variability, sliding window techniques and cyclic time encoding are incorporated, enabling dynamic adaptation to fluctuations. The $\mathtt{ELM}$ model significantly outperforms persistence-based forecasting, particularly for solar and thermal energy, achieving an $\mathtt{nRMSE}$ of $17.9\%$ and $5.1\%$, respectively, with $\mathtt{R^2} > 0.98$ (1-hour horizon). The model maintains high accuracy up to five hours ahead, beyond which renewable energy sources become increasingly volatile. While $\mathtt{MIMO}$ provides marginal gains over Single-Input Single-Output ($\mathtt{SISO}$) architectures and offers key advantages over deep learning methods such as $\mathtt{LSTM}$, it provides a closed-form solution with lower computational demands, making it well-suited for real-time applications, including online learning. Beyond predictive accuracy, the proposed methodology is adaptable to various contexts and datasets, as it can be tuned to local constraints such as resource availability, grid characteristics, and market structures.
♻ ☆ Hydra: A 1.6B-Parameter State-Space Language Model with Sparse Attention, Mixture-of-Experts, and Memory
We present Hydra as an architectural proposal for hybrid long-context language models that combine conditional computation, long-context memory mechanisms, and sparse mixture-of-experts within an approximately 1.6B parameter design envelope. Hydra integrates a Mamba-style Structured State Space Model (SSM) backbone with intermittent sparse global attention, chunk-level MoE feed-forward routing, and dual (workspace plus factual PKM) memories. We formalize the component interfaces, give transparent parameter and complexity accounting, and outline a staged curriculum intended to stably activate the parts. We accompany the specification with illustrative toy-scale prototype measurements (tens of millions of parameters on synthetic data) whose sole purpose is to demonstrate implementation feasibility and qualitative scaling behaviors (for example, long-context throughput crossover and controllable expert routing), not to claim competitive full-scale performance. We explicitly delineate assumptions and open risks (training complexity, memory utilization, specialization dynamics) and position Hydra as a blueprint to stimulate empirical follow-up rather than a finished system. By combining SSM efficiency, selective sparse attention, MoE capacity, and learnable memory, Hydra sketches a path toward modular, input-adaptive long-context language models; validating end-task gains at target scale remains future work.
comment: Fixed a typo
♻ ☆ A deformation-based framework for learning solution mappings of PDEs defined on varying domains
In this work, we establish a deformation-based framework for learning solution mappings of PDEs defined on varying domains. The union of functions defined on varying domains can be identified as a metric space according to the deformation, then the solution mapping is regarded as a continuous metric-to-metric mapping, and subsequently can be represented by another continuous metric-to-Banach mapping using two different strategies, referred to as the D2D subframework and the D2E subframework, respectively. We point out that such a metric-to-Banach mapping can be learned by neural networks, hence the solution mapping is accordingly learned. With this framework, a rigorous convergence analysis is built for the problem of learning solution mappings of PDEs on varying domains. As the theoretical framework holds based on several pivotal assumptions which need to be verified for a given specific problem, we study the star domains as a typical example, and other situations could be similarly verified. There are three important features of this framework: (1) The domains under consideration are not required to be diffeomorphic, therefore a wide range of regions can be covered by one model provided they are homeomorphic. (2) The deformation mapping is unnecessary to be continuous, thus it can be flexibly established via combining a primary identity mapping and a local deformation mapping. This capability facilitates the resolution of large systems where only local parts of the geometry undergo change. (3) If a linearity-preserving neural operator such as MIONet is adopted, this framework still preserves the linearity of the surrogate solution mapping on its source term for linear PDEs, thus it can be applied to the hybrid iterative method. We finally present several numerical experiments to validate our theoretical results.
♻ ☆ On the Value of Cross-Modal Misalignment in Multimodal Representation Learning
Multimodal representation learning, exemplified by multimodal contrastive learning (MMCL) using image-text pairs, aims to learn powerful representations by aligning cues across modalities. This approach relies on the core assumption that the exemplar image-text pairs constitute two representations of an identical concept. However, recent research has revealed that real-world datasets often exhibit cross-modal misalignment. There are two distinct viewpoints on how to address this issue: one suggests mitigating the misalignment, and the other leveraging it. We seek here to reconcile these seemingly opposing perspectives, and to provide a practical guide for practitioners. Using latent variable models we thus formalize cross-modal misalignment by introducing two specific mechanisms: Selection bias, where some semantic variables are absent in the text, and perturbation bias, where semantic variables are altered -- both leading to misalignment in data pairs. Our theoretical analysis demonstrates that, under mild assumptions, the representations learned by MMCL capture exactly the information related to the subset of the semantic variables invariant to selection and perturbation biases. This provides a unified perspective for understanding misalignment. Based on this, we further offer actionable insights into how misalignment should inform the design of real-world ML systems. We validate our theoretical findings via extensive empirical studies on both synthetic data and real image-text datasets, shedding light on the nuanced impact of cross-modal misalignment on multimodal representation learning.
♻ ☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
♻ ☆ A Survey of Deep Learning for Geometry Problem Solving
Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
comment: Work in progress
♻ ☆ Dynamic Optimization of Storage Systems Using Reinforcement Learning Techniques
The exponential growth of data-intensive applications has placed unprecedented demands on modern storage systems, necessitating dynamic and efficient optimization strategies. Traditional heuristics employed for storage performance optimization often fail to adapt to the variability and complexity of contemporary workloads, leading to significant performance bottlenecks and resource inefficiencies. To address these challenges, this paper introduces RL-Storage, a novel reinforcement learning (RL)-based framework designed to dynamically optimize storage system configurations. RL-Storage leverages deep Q-learning algorithms to continuously learn from real-time I/O patterns and predict optimal storage parameters, such as cache size, queue depths, and readahead settings[1].This work underscores the transformative potential of reinforcement learning techniques in addressing the dynamic nature of modern storage systems. By autonomously adapting to workload variations in real time, RL-Storage provides a robust and scalable solution for optimizing storage performance, paving the way for next-generation intelligent storage infrastructures.
♻ ☆ Asymptotic behavior of eigenvalues of large rank perturbations of large random matrices
The paper is concerned with deformed Wigner random matrices. These matrices are closely connected with Deep Neural Networks (DNNs): weight matrices of trained DNNs could be represented in the form $R + S$, where $R$ is random and $S$ is highly correlated. The spectrum of such matrices plays a key role in rigorous underpinning of the novel pruning technique based on Random Matrix Theory. Mathematics has been done only for finite-rank matrix $S$. However, in practice rank may grow. In this paper we develop asymptotic analysis for the case of growing rank.
comment: v1: 14 pages, 3 figures; v2: a part of the proof of Lemma 4.2 was revised, 15 pages, 3 figures
♻ ☆ Who's the Evil Twin? Differential Auditing for Undesired Behavior
Detecting hidden behaviors in neural networks poses a significant challenge due to minimal prior knowledge and potential adversarial obfuscation. We explore this problem by framing detection as an adversarial game between two teams: the red team trains two similar models, one trained solely on benign data and the other trained on data containing hidden harmful behavior, with the performance of both being nearly indistinguishable on the benign dataset. The blue team, with limited to no information about the harmful behaviour, tries to identify the compromised model. We experiment using CNNs and try various blue team strategies, including Gaussian noise analysis, model diffing, integrated gradients, and adversarial attacks under different levels of hints provided by the red team. Results show high accuracy for adversarial-attack-based methods (100\% correct prediction, using hints), which is very promising, whilst the other techniques yield more varied performance. During our LLM-focused rounds, we find that there are not many parallel methods that we could apply from our study with CNNs. Instead, we find that effective LLM auditing methods require some hints about the undesired distribution, which can then used in standard black-box and open-weight methods to probe the models further and reveal their misalignment. We open-source our auditing games (with the model and data) and hope that our findings contribute to designing better audits.
comment: main section: 8 pages, 4 figures, 1 table total: 34 pages, 44 figures, 12 tables
♻ ☆ Balancing Act: Prioritization Strategies for LLM-Designed Restless Bandit Rewards
LLMs are increasingly used to design reward functions based on human preferences in Reinforcement Learning (RL). We focus on LLM-designed rewards for Restless Multi-Armed Bandits, a framework for allocating limited resources among agents. In applications such as public health, this approach empowers grassroots health workers to tailor automated allocation decisions to community needs. In the presence of multiple agents, altering the reward function based on human preferences can impact subpopulations very differently, leading to complex tradeoffs and a multi-objective resource allocation problem. We are the first to present a principled method termed Social Choice Language Model for dealing with these tradeoffs for LLM-designed rewards for multiagent planners in general and restless bandits in particular. The novel part of our model is a transparent and configurable selection component, called an adjudicator, external to the LLM that controls complex tradeoffs via a user-selected social welfare function. Our experiments demonstrate that our model reliably selects more effective, aligned, and balanced reward functions compared to purely LLM-based approaches.
♻ ☆ CROP: Circuit Retrieval and Optimization with Parameter Guidance using LLMs
Modern very large-scale integration (VLSI) design requires the implementation of integrated circuits using electronic design automation (EDA) tools. Due to the complexity of EDA algorithms, the vast parameter space poses a huge challenge to chip design optimization, as the combination of even moderate numbers of parameters creates an enormous solution space to explore. Manual parameter selection remains industrial practice despite being excessively laborious and limited by expert experience. To address this issue, we present CROP, the first large language model (LLM)-powered automatic VLSI design flow tuning framework. Our approach includes: (1) a scalable methodology for transforming RTL source code into dense vector representations, (2) an embedding-based retrieval system for matching designs with semantically similar circuits, and (3) a retrieval-augmented generation (RAG)-enhanced LLM-guided parameter search system that constrains the search process with prior knowledge from similar designs. Experiment results demonstrate CROP's ability to achieve superior quality-of-results (QoR) with fewer iterations than existing approaches on industrial designs, including a 9.9% reduction in power consumption.
comment: Accepted by ICCAD 2025
♻ ☆ Order-Preserving Dimension Reduction for Multimodal Semantic Embedding
Searching for the $k$-nearest neighbors (KNN) in multimodal data retrieval is computationally expensive, particularly due to the inherent difficulty in comparing similarity measures across different modalities. Recent advances in multimodal machine learning address this issue by mapping data into a shared embedding space; however, the high dimensionality of these embeddings (hundreds to thousands of dimensions) presents a challenge for time-sensitive vision applications. This work proposes Order-Preserving Dimension Reduction (OPDR), aiming to reduce the dimensionality of embeddings while preserving the ranking of KNN in the lower-dimensional space. One notable component of OPDR is a new measure function to quantify KNN quality as a global metric, based on which we derive a closed-form map between target dimensionality and key contextual parameters. We have integrated OPDR with multiple state-of-the-art dimension-reduction techniques, distance functions, and embedding models; experiments on a variety of multimodal datasets demonstrate that OPDR effectively retains recall high accuracy while significantly reducing computational costs.
♻ ☆ A Toolbox, Not a Hammer -- Multi-TAG: Scaling Math Reasoning with Multi-Tool Aggregation EMNLP
Augmenting large language models (LLMs) with external tools is a promising avenue for developing high-performance mathematical reasoning systems. Prior tool-augmented approaches typically finetune an LLM to select and invoke a single tool at each reasoning step and show promising results on simpler math reasoning benchmarks such as GSM8K. However, these approaches struggle with more complex math problems that require precise reasoning over multiple steps. To address this limitation, in this work, we propose Multi-TAG, a Multi-Tool AGgregation-based framework. Instead of relying on a single tool, Multi-TAG guides an LLM to concurrently invoke multiple tools at each reasoning step. It then aggregates their diverse outputs to verify and refine the reasoning process, enhancing solution robustness and accuracy. Notably, Multi-TAG is a finetuning-free, inference-only framework, making it readily applicable to any LLM backbone, including large open-weight models which are computationally expensive to finetune and proprietary frontier models which cannot be finetuned with custom recipes. We evaluate Multi-TAG on four challenging benchmarks: MATH500, AIME, AMC, and OlympiadBench. Across both open-weight and closed-source LLM backbones, Multi-TAG consistently and substantially outperforms state-of-the-art baselines, achieving average improvements of 6.0% to 7.5% over state-of-the-art baselines.
comment: Published at EMNLP Findings 2025; 21 pages, 3 figures
♻ ☆ HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation
Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.
comment: 26 pages, 4 figures, 9 tables, code available at https://github.com/okason97/HandCraft
Multimedia 8
☆ Real-time 3D Light-field Viewing with Eye-tracking on Conventional Displays
Creating immersive 3D visual experiences typically requires expensive and specialized hardware such as VR headsets, autostereoscopic displays, or active shutter glasses. These constraints limit the accessibility and everyday use of 3D visualization technologies in resource-constrained settings. To address this, we propose a low-cost system that enables real-time 3D light-field viewing using only a standard 2D monitor, a conventional RGB webcam, and red-cyan anaglyph glasses. The system integrates real-time eye-tracking to dynamically adapt the displayed light-field image to the user's head position with a lightweight rendering pipeline that selects and composites stereoscopic views from pre-captured light-field data. The resulting anaglyph image is updated in real-time, creating a more immersive and responsive 3D experience. The system operates entirely on CPU and maintains a stable frame rate of 30 FPS, confirming its feasibility on typical consumer-grade hardware. All of these highlight the potential of our approach as an accessible platform for interactive 3D applications in education, digital media, and beyond.
☆ Towards User-level QoE: Large-scale Practice in Personalized Optimization of Adaptive Video Streaming
Traditional optimization methods based on system-wide Quality of Service (QoS) metrics have approached their performance limitations in modern large-scale streaming systems. However, aligning user-level Quality of Experience~(QoE) with algorithmic optimization objectives remains an unresolved challenge. Therefore, we propose \texttt{LingXi}, the first large-scale deployed system for personalized adaptive video streaming based on user-level experience. \texttt{LingXi} dynamically optimizes the objectives of adaptive video streaming algorithms by analyzing user engagement. Utilizing exit rate as a key metric, we investigate the correlation between QoS indicators and exit rates based on production environment logs, subsequently developing a personalized exit rate predictor. Through Monte Carlo sampling and online Bayesian optimization, we iteratively determine optimal parameters. Large-scale A/B testing utilizing 8\% of traffic on Kuaishou, one of the largest short video platforms, demonstrates \texttt{LingXi}'s superior performance. \texttt{LingXi} achieves a 0.15\% increase in total viewing time, a 0.1\% improvement in bitrate, and a 1.3\% reduction in stall time across all users, with particularly significant improvements for low-bandwidth users who experience a 15\% reduction in stall time.
comment: ACM SIGCOMM 2025
☆ Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models
Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers' subjective comprehensibility. To address this challenge, we introduce \texttt{ComTree}, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that \texttt{ComTree} significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at https://github.com/thu-media/ComTree.
comment: ACM Multimedia2025
☆ PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark
Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.
☆ Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criterion of figure skating. Secondly, action elements in competitions are separated in time, TES should be derived from each element's score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating visual-feature based TES evaluation stream from audio-visual-feature based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we proposed effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba's superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.
☆ Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation EMNLP 2025
We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
comment: EMNLP 2025 (Findings)
☆ Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering ACM MM 2025
Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model's deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.
comment: This paper has been accepted by ACM MM 2025
♻ ☆ A Survey on 3D Gaussian Splatting
3D Gaussian splatting (GS) has emerged as a transformative technique in explicit radiance field and computer graphics. This innovative approach, characterized by the use of millions of learnable 3D Gaussians, represents a significant departure from mainstream neural radiance field approaches, which predominantly use implicit, coordinate-based models to map spatial coordinates to pixel values. 3D GS, with its explicit scene representation and differentiable rendering algorithm, not only promises real-time rendering capability but also introduces unprecedented levels of editability. This positions 3D GS as a potential game-changer for the next generation of 3D reconstruction and representation. In the present paper, we provide the first systematic overview of the recent developments and critical contributions in the domain of 3D GS. We begin with a detailed exploration of the underlying principles and the driving forces behind the emergence of 3D GS, laying the groundwork for understanding its significance. A focal point of our discussion is the practical applicability of 3D GS. By enabling unprecedented rendering speed, 3D GS opens up a plethora of applications, ranging from virtual reality to interactive media and beyond. This is complemented by a comparative analysis of leading 3D GS models, evaluated across various benchmark tasks to highlight their performance and practical utility. The survey concludes by identifying current challenges and suggesting potential avenues for future research. Through this survey, we aim to provide a valuable resource for both newcomers and seasoned researchers, fostering further exploration and advancement in explicit radiance field.
comment: Ongoing project; Paper list: https://github.com/guikunchen/Awesome3DGS ; Benchmark: https://github.com/guikunchen/3DGS-Benchmarks
Computation and Language 121
☆ Intern-S1: A Scientific Multimodal Foundation Model
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training.On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
☆ LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60\%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
☆ Language-Guided Tuning: Enhancing Numeric Optimization with Textual Feedback
Configuration optimization remains a critical bottleneck in machine learning, requiring coordinated tuning across model architecture, training strategy, feature engineering, and hyperparameters. Traditional approaches treat these dimensions independently and lack interpretability, while recent automated methods struggle with dynamic adaptability and semantic reasoning about optimization decisions. We introduce Language-Guided Tuning (LGT), a novel framework that employs multi-agent Large Language Models to intelligently optimize configurations through natural language reasoning. We apply textual gradients - qualitative feedback signals that complement numerical optimization by providing semantic understanding of training dynamics and configuration interdependencies. LGT coordinates three specialized agents: an Advisor that proposes configuration changes, an Evaluator that assesses progress, and an Optimizer that refines the decision-making process, creating a self-improving feedback loop. Through comprehensive evaluation on six diverse datasets, LGT demonstrates substantial improvements over traditional optimization methods, achieving performance gains while maintaining high interpretability.
comment: 9 pages, 4 figures, 4 tables
☆ Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis
Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, the generalization of TIR in improving the reasoning ability of LLM is still unclear. Additionally, whether TIR has improved the model's reasoning behavior and helped the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.
comment: Preprint, working in progress
☆ End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning
Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, We introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steer tracebale retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crutially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch's diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.
comment: 35 pages, 5 figures, 3 tables
☆ EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models
E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through https://anonymous.4open.science/r/submission25.
☆ Stemming -- The Evolution and Current State with a Focus on Bangla
Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla's rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
☆ Position Bias Mitigates Position Bias:Mitigate Position Bias Through Inter-Position Knowledge Distillation EMNLP2025
Positional bias (PB), manifesting as non-uniform sensitivity across different contextual locations, significantly impairs long-context comprehension and processing capabilities. While prior work seeks to mitigate PB through modifying the architectures causing its emergence, significant PB still persists. To address PB effectively, we introduce \textbf{Pos2Distill}, a position to position knowledge distillation framework. Pos2Distill transfers the superior capabilities from advantageous positions to less favorable ones, thereby reducing the huge performance gaps. The conceptual principle is to leverage the inherent, position-induced disparity to counteract the PB itself. We identify distinct manifestations of PB under \textbf{\textsc{r}}etrieval and \textbf{\textsc{r}}easoning paradigms, thereby designing two specialized instantiations: \emph{Pos2Distill-R\textsuperscript{1}} and \emph{Pos2Distill-R\textsuperscript{2}} respectively, both grounded in this core principle. By employing the Pos2Distill approach, we achieve enhanced uniformity and significant performance gains across all contextual positions in long-context retrieval and reasoning tasks. Crucially, both specialized systems exhibit strong cross-task generalization mutually, while achieving superior performance on their respective tasks.
comment: EMNLP2025 Main
☆ Benchmarking Computer Science Survey Generation
Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE
☆ SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models EMNLP 2025
Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model's inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model's own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs' discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model's generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at https://github.com/NJUNLP/SDGO.
comment: Accepted by EMNLP 2025, 15 pages, 4 figures, 6 tables
☆ Classification errors distort findings in automated speech processing: examples and solutions from child-development research
With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper proposes a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children's language experience and the association between children's production and their input. In both the most commonly used \gls{lena}, and an open-source alternative (the Voice Type Classifier from the ACLEW system), we find that classification errors can significantly distort estimates. For instance, automated annotations underestimated the negative effect of siblings on adult input by 20--80\%, potentially placing it below statistical significance thresholds. We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution. Both the issue reported and our solution may apply to any classifier involving event detection and classification with non-zero error rates.
☆ Trained Miniatures: Low cost, High Efficacy SLMs for Sales & Marketing
Large language models (LLMs) excel in text generation; however, these creative elements require heavy computation and are accompanied by a steep cost. Especially for targeted applications such as sales and marketing outreach, these costs are far from feasible. This paper introduces the concept of "Trained Miniatures" - Small Language Models(SLMs) fine-tuned for specific, high-value applications, generating similar domain-specific responses for a fraction of the cost.
☆ SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking
The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks are proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which causes excessive time and resource consumption. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource cost. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human expertise into the automatic pipeline. The final constructed dataset, SafetyFlowBench, contains 23,446 queries with low redundancy and strong discriminative power. Our contribution includes the first fully automated benchmarking pipeline and a comprehensive safety benchmark. We evaluate the safety of 49 advanced LLMs on our dataset and conduct extensive experiments to validate our efficacy and efficiency.
comment: Code and dataset are available at https://github.com/yangyangyang127/SafetyFlow
☆ The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech
We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4\%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F$_1$ of 0.74 for binary PDD detection and a macro-F$_1$ of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors - with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.
☆ Dream 7B: Diffusion Large Language Models
We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.
☆ HebID: Detecting Social Identities in Hebrew-language Political Text
Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public's identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.
☆ SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts -- Extended Version EMNLP 2025
Small Language Models (SLMs) offer computational efficiency and accessibility, yet a systematic evaluation of their performance and environmental impact remains lacking. We introduce SLM-Bench, the first benchmark specifically designed to assess SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. SLM-Bench evaluates 15 SLMs on 9 NLP tasks using 23 datasets spanning 14 domains. The evaluation is conducted on 4 hardware configurations, providing a rigorous comparison of their effectiveness. Unlike prior benchmarks, SLM-Bench quantifies 11 metrics across correctness, computation, and consumption, enabling a holistic assessment of efficiency trade-offs. Our evaluation considers controlled hardware conditions, ensuring fair comparisons across models. We develop an open-source benchmarking pipeline with standardized evaluation protocols to facilitate reproducibility and further research. Our findings highlight the diverse trade-offs among SLMs, where some models excel in accuracy while others achieve superior energy efficiency. SLM-Bench sets a new standard for SLM evaluation, bridging the gap between resource efficiency and real-world applicability.
comment: 24 pages. An extended version of "SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts" accepted at EMNLP 2025
☆ Influence-driven Curriculum Learning for Pre-training on Limited Data
Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their \textit{training data influence}, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
comment: 9 pages
☆ Subjective Behaviors and Preferences in LLM: Language of Browsing EMNLP 2025
A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user's self-constructed "language", albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the "language of browsing" better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users' heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance, yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with heterogeneous cluster specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensues, implying improved alignment.
comment: Accepted at EMNLP 2025
☆ SLM4Offer: Personalized Marketing Offer Generation Using Contrastive Learning Based Fine-Tuning
Personalized marketing has emerged as a pivotal strategy for enhancing customer engagement and driving business growth. Academic and industry efforts have predominantly focused on recommendation systems and personalized advertisements. Nonetheless, this facet of personalization holds significant potential for increasing conversion rates and improving customer satisfaction. Prior studies suggest that well-executed personalization strategies can boost revenue by up to 40 percent, underscoring the strategic importance of developing intelligent, data-driven approaches for offer generation. This work introduces SLM4Offer, a generative AI model for personalized offer generation, developed by fine-tuning a pre-trained encoder-decoder language model, specifically Google's Text-to-Text Transfer Transformer (T5-Small 60M) using a contrastive learning approach. SLM4Offer employs InfoNCE (Information Noise-Contrastive Estimation) loss to align customer personas with relevant offers in a shared embedding space. A key innovation in SLM4Offer lies in the adaptive learning behaviour introduced by contrastive loss, which reshapes the latent space during training and enhances the model's generalizability. The model is fine-tuned and evaluated on a synthetic dataset designed to simulate customer behaviour and offer acceptance patterns. Experimental results demonstrate a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing.
comment: 10 pages, BDA Conference 2025
☆ RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores
Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.
☆ PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback SIGDIAL 2025
Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate state tracking. We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To this end, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art state tracking performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and robust user goal estimation as the dialogue progresses, demonstrating the effectiveness of execution-aware state tracking.
comment: 20 pages, 12 figures. To appear at SIGDIAL 2025
☆ Principle Methods of Rendering Non-equivalent Words from Uzbek and Dari to Russian and English
These pure languages understanding directly relates to translation knowledge where linguists and translators need to work and research to eradicate misunderstanding. Misunderstandings mostly appear in non-equivalent words because there are different local and internal words like food, garment, cultural and traditional words and others in every notion. Truly, most of these words do not have equivalent in the target language and these words need to be worked and find their equivalent in the target language to fully understand the both languages. The purpose of this research is to introduce the methods of rendering non-equivalent words professionally from the source language to the target language and this research has been completed using library-based research. However, some of these non-equivalent words are already professionally rendered to the target language but still there many other words to be rendered. As a result, this research paper includes different ways and rules of rendering non-equivalent words from source language to the target language and 25 non-equvalent words have been rendered from Dar & Uzbek into English and Russian languages.
comment: Fully abstract is available in the attached file
☆ M-HELP: Using Social Media Data to Detect Mental Health Help-Seeking Signals EMNLP 2025
Mental health disorders are a global crisis. While various datasets exist for detecting such disorders, there remains a critical gap in identifying individuals actively seeking help. This paper introduces a novel dataset, M-Help, specifically designed to detect help-seeking behavior on social media. The dataset goes beyond traditional labels by identifying not only help-seeking activity but also specific mental health disorders and their underlying causes, such as relationship challenges or financial stressors. AI models trained on M-Help can address three key tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering the root causes of issues.
comment: Accepted at Findings of EMNLP 2025
☆ GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO
The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
☆ A Study of Privacy-preserving Language Modeling Approaches
Recent developments in language modeling have increased their use in various applications and domains. Language models, often trained on sensitive data, can memorize and disclose this information during privacy attacks, raising concerns about protecting individuals' privacy rights. Preserving privacy in language models has become a crucial area of research, as privacy is one of the fundamental human rights. Despite its significance, understanding of how much privacy risk these language models possess and how it can be mitigated is still limited. This research addresses this by providing a comprehensive study of the privacy-preserving language modeling approaches. This study gives an in-depth overview of these approaches, highlights their strengths, and investigates their limitations. The outcomes of this study contribute to the ongoing research on privacy-preserving language modeling, providing valuable insights and outlining future research directions.
☆ LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.
☆ Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems
Generative AI (GenAI) has emerged as a transformative technology, demonstrating remarkable capabilities across diverse application domains. However, GenAI faces several major challenges in developing reliable and efficient GenAI-empowered systems due to its unpredictability and inefficiency. This paper advocates for a paradigm shift: future GenAI-native systems should integrate GenAI's cognitive capabilities with traditional software engineering principles to create robust, adaptive, and efficient systems. We introduce foundational GenAI-native design principles centered around five key pillars -- reliability, excellence, evolvability, self-reliance, and assurance -- and propose architectural patterns such as GenAI-native cells, organic substrates, and programmable routers to guide the creation of resilient and self-evolving systems. Additionally, we outline the key ingredients of a GenAI-native software stack and discuss the impact of these systems from technical, user adoption, economic, and legal perspectives, underscoring the need for further validation and experimentation. Our work aims to inspire future research and encourage relevant communities to implement and refine this conceptual framework.
☆ When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models EMNLP 2025
Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, and explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance the robustness when handling conflicting multi-modal inputs. The project is available at https://github.com/WangCheng0116/MCR-BENCH.
comment: Accepted by EMNLP 2025 Main
☆ Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models
The increasing adoption of large language models (LLMs) has been accompanied by growing concerns regarding their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.
☆ CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials
Text-attributed graphs(TAGs) are pervasive in real-world systems,where each node carries its own textual features. In many cases these graphs are inherently heterogeneous, containing multiple node types and diverse edge types. Despite the ubiquity of such heterogeneous TAGs, there remains a lack of large-scale benchmark datasets. This shortage has become a critical bottleneck, hindering the development and fair comparison of representation learning methods on heterogeneous text-attributed graphs. In this paper, we introduce CITE - Catalytic Information Textual Entities Graph, the first and largest heterogeneous text-attributed citation graph benchmark for catalytic materials. CITE comprises over 438K nodes and 1.2M edges, spanning four relation types. In addition, we establish standardized evaluation procedures and conduct extensive benchmarking on the node classification task, as well as ablation experiments on the heterogeneous and textual properties of CITE. We compare four classes of learning paradigms, including homogeneous graph models, heterogeneous graph models, LLM(Large Language Model)-centric models, and LLM+Graph models. In a nutshell, we provide (i) an overview of the CITE dataset, (ii) standardized evaluation protocols, and (iii) baseline and ablation experiments across diverse modeling paradigms.
comment: 23 pages, 4 figures,
☆ Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
comment: Preprint
☆ Confidence-Modulated Speculative Decoding for Large Language Models SC
Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
comment: This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables
☆ Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation
The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies that, while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.
comment: For Appendix, please refer to arXiv:2406.07057
☆ A Survey on Large Language Model Benchmarks
In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.
☆ KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models
Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k, are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score ($M_i \in [0,1]$). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.
☆ DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization
Electrocardiography plays a central role in cardiovascular diagnostics, yet existing automated approaches often struggle to generalize across clinical tasks and offer limited support for open-ended reasoning. We present DiagECG, a novel framework that integrates time-series and language modeling by enabling large language models to process 12-lead ECG signals for clinical text generation tasks. Our approach discretizes continuous ECG embeddings into symbolic tokens using a lead-independent encoder and quantization module. These tokens are then used to extend the vocabulary of LLM, allowing the model to handle both ECG and natural language inputs in a unified manner. To bridge the modality gap, we pretrain the model on an autoregressive ECG forecasting task, enabling the LLM to model temporal dynamics using its native language modeling capabilities. Finally, we perform instruction tuning on both ECG question answering and diagnostic report generation. Without modifying the core model, DiagECG achieves strong performance across tasks while maintaining generalization to out-of-distribution settings. Extensive experiments demonstrate the effectiveness of each component and highlight the potential of integrating symbolic ECG representations into LLMs for medical reasoning.
☆ CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing SP 2025
Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.
comment: Accepted in: 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025)
☆ IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents EMNLP 2025
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. However, when interacting with untrusted data sources (e.g., fetching information from public websites), tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes, a threat referred to as Indirect Prompt Injection (IPI). Existing defenses typically rely on advanced prompting strategies or auxiliary detection models. While these methods have demonstrated some effectiveness, they fundamentally rely on assumptions about the model's inherent security, which lacks structural constraints on agent behaviors. As a result, agents still retain unrestricted access to tool invocations, leaving them vulnerable to stronger attack vectors that can bypass the security guardrails of the model. To prevent malicious tool invocations at the source, we propose a novel defensive task execution paradigm, called IPIGuard, which models the agents' task execution process as a traversal over a planned Tool Dependency Graph (TDG). By explicitly decoupling action planning from interaction with external data, IPIGuard significantly reduces unintended tool invocations triggered by injected instructions, thereby enhancing robustness against IPI attacks. Experiments on the AgentDojo benchmark show that IPIGuard achieves a superior balance between effectiveness and robustness, paving the way for the development of safer agentic systems in dynamic environments.
comment: EMNLP 2025
☆ Multiple Memory Systems for Enhancing the Long-term Memory of Agent
An agent powered by large language models have achieved impressive results, but effectively handling the vast amounts of historical data generated during interactions remains a challenge. The current approach is to design a memory module for the agent to process these data. However, existing methods, such as MemoryBank and A-MEM, have poor quality of stored memory content, which affects recall performance and response quality. In order to better construct high-quality long-term memory content, we have designed a multiple memory system (MMS) inspired by cognitive psychology theory. The system processes short-term memory to multiple long-term memory fragments, and constructs retrieval memory units and contextual memory units based on these fragments, with a one-to-one correspondence between the two. During the retrieval phase, MMS will match the most relevant retrieval memory units based on the user's query. Then, the corresponding contextual memory units is obtained as the context for the response stage to enhance knowledge, thereby effectively utilizing historical data. Experiments on LoCoMo dataset compared our method with three others, proving its effectiveness. Ablation studies confirmed the rationality of our memory units. We also analyzed the robustness regarding the number of selected memory segments and the storage overhead, demonstrating its practical value.
☆ Evaluating Knowledge Graph Complexity via Semantic, Spectral, and Structural Metrics for Link Prediction
Understanding dataset complexity is fundamental to evaluating and comparing link prediction models on knowledge graphs (KGs). While the Cumulative Spectral Gradient (CSG) metric, derived from probabilistic divergence between classes within a spectral clustering framework, has been proposed as a classifier agnostic complexity metric purportedly scaling with class cardinality and correlating with downstream performance, it has not been evaluated in KG settings so far. In this work, we critically examine CSG in the context of multi relational link prediction, incorporating semantic representations via transformer derived embeddings. Contrary to prior claims, we find that CSG is highly sensitive to parametrisation and does not robustly scale with the number of classes. Moreover, it exhibits weak or inconsistent correlation with standard performance metrics such as Mean Reciprocal Rank (MRR) and Hit@1. To deepen the analysis, we introduce and benchmark a set of structural and semantic KG complexity metrics. Our findings reveal that global and local relational ambiguity captured via Relation Entropy, node level Maximum Relation Diversity, and Relation Type Cardinality exhibit strong inverse correlations with MRR and Hit@1, suggesting these as more faithful indicators of task difficulty. Conversely, graph connectivity measures such as Average Degree, Degree Entropy, PageRank, and Eigenvector Centrality correlate positively with Hit@10. Our results demonstrate that CSGs purported stability and generalization predictive power fail to hold in link prediction settings and underscore the need for more stable, interpretable, and task-aligned measures of dataset complexity in knowledge driven learning.
☆ Adversarial Attacks against Neural Ranking Models via In-Context Learning
While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.
☆ AmbiSQL: Interactive Ambiguity Detection and Resolution for Text-to-SQL
Text-to-SQL systems translate natural language questions into SQL queries, providing substantial value for non-expert users. While large language models (LLMs) show promising results for this task, they remain error-prone. Query ambiguity has been recognized as a major obstacle for LLM-based Text-to-SQL systems, leading to misinterpretation of user intent and inaccurate SQL generation. We demonstrate AmbiSQL, an interactive system that automatically detects query ambiguities and guides users through intuitive multiple-choice questions to clarify their intent. Our approach introduces a fine-grained ambiguity taxonomy for identifying ambiguities that affect database element mapping and LLM reasoning, then incorporates user feedback to rewrite ambiguous questions. Evaluation on an ambiguous query dataset shows that AmbiSQL achieves 87.2% precision in ambiguity detection and improves SQL exact match accuracy by 50% when integrated with Text-to-SQL systems. Our demonstration showcases the significant performance gains and highlights the system's practical usability. Code repo and demonstration are available at: https://github.com/JustinzjDing/AmbiSQL.
☆ TComQA: Extracting Temporal Commonsense from Text
Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours, but can not take months. Recent studies indicate that even advanced large language models (LLMs) struggle in generating text that require reasoning with temporal commonsense due to its infrequent explicit mention in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80\% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on existing dataset of temporal question answering task.
☆ Conflict-Aware Soft Prompting for Retrieval-Augmented Generation EMNLP 2025
Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM's parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0\% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.
comment: Accepted to EMNLP 2025; 14 pages; 5 figures, 11 tables
☆ Retrieval-Augmented Review Generation for Poisoning Recommender Systems
Recent studies have shown that recommender systems (RSs) are highly vulnerable to data poisoning attacks, where malicious actors inject fake user profiles, including a group of well-designed fake ratings, to manipulate recommendations. Due to security and privacy constraints in practice, attackers typically possess limited knowledge of the victim system and thus need to craft profiles that have transferability across black-box RSs. To maximize the attack impact, the profiles often remains imperceptible. However, generating such high-quality profiles with the restricted resources is challenging. Some works suggest incorporating fake textual reviews to strengthen the profiles; yet, the poor quality of the reviews largely undermines the attack effectiveness and imperceptibility under the practical setting. To tackle the above challenges, in this paper, we propose to enhance the quality of the review text by harnessing in-context learning (ICL) capabilities of multimodal foundation models. To this end, we introduce a demonstration retrieval algorithm and a text style transfer strategy to augment the navie ICL. Specifically, we propose a novel practical attack framework named RAGAN to generate high-quality fake user profiles, which can gain insights into the robustness of RSs. The profiles are generated by a jailbreaker and collaboratively optimized on an instructional agent and a guardian to improve the attack transferability and imperceptibility. Comprehensive experiments on various real-world datasets demonstrate that RAGAN achieves the state-of-the-art poisoning attack performance.
☆ EMNLP: Educator-role Moral and Normative Large Language Models Profiling EMNLP
Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 12 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at https://e-m-n-l-p.github.io/.
comment: 24pages, 12 figures, Accepted by EMNLP Main Confrence
☆ UniCoM: A Universal Code-Switching Speech Generator EMNLP 2025
Code-switching (CS), the alternation between two or more languages within a single speaker's utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.
comment: Accepted to EMNLP 2025 Findings
☆ WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware, Multitask, and Multi-domain Evaluation in Thai EMNLP 2025
Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.
comment: Accepted to EMNLP 2025 (Main). Model and Dataset: https://huggingface.co/collections/airesearch/wangchan-thai-instruction-6835722a30b98e01598984fd
☆ VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models
Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
☆ Are Checklists Really Useful for Automatic Evaluation of Generative Tasks? EMNLP 2025
Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. \footnote{Our code is available at~https://github.com/momo0817/checklist-effectiveness-study
comment: Accepted to the EMNLP 2025 Main Conference
☆ Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall EMNLP 2025
Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substantial expert effort and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1\% on easy and 4.7\% on hard questions. We further test SEER on $\tau$-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44\% and 23.38\%, respectively.
comment: Accepted to EMNLP 2025
☆ Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering EMNLP2025
Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Selct2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.
comment: EMNLP2025 Findings
☆ SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less degradation than 5% compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
☆ Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models
Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce \textbf{Fin-PRM}, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements with baselines, with gains of 12.9\% in supervised learning, 5.2\% in reinforcement learning, and 5.1\% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.
☆ LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support
While large language models (LLMs) have shown promise in healthcare, their application for rare medical conditions is still hindered by scarce and unreliable datasets for fine-tuning. Hyperhidrosis, a disorder causing excessive sweating beyond physiological needs, is one such rare disorder, affecting 2-3% of the population and significantly impacting both physical comfort and psychosocial well-being. To date, no work has tailored LLMs to advance the diagnosis or care of hyperhidrosis. To address this gap, we present LLM4Sweat, an open-source and domain-specific LLM framework for trustworthy and empathetic hyperhidrosis support. The system follows a three-stage pipeline. In the data augmentation stage, a frontier LLM generates medically plausible synthetic vignettes from curated open-source data to create a diverse and balanced question-answer dataset. In the fine-tuning stage, an open-source foundation model is fine-tuned on the dataset to provide diagnosis, personalized treatment recommendations, and empathetic psychological support. In the inference and expert evaluation stage, clinical and psychological specialists assess accuracy, appropriateness, and empathy, with validated responses iteratively enriching the dataset. Experiments show that LLM4Sweat outperforms baselines and delivers the first open-source LLM framework for hyperhidrosis, offering a generalizable approach for other rare diseases with similar data and trustworthiness challenges.
☆ SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling
Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose \textbf{SemToken}, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to $2.4\times$ reduction in token count and $1.9\times$ speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
☆ ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following
Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative "memory-perception-planning-execution" cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.
☆ Identifying and Answering Questions with False Assumptions: An Interpretable Approach EMNLP 2025
People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions require first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate to reduce the problem to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by specifying the false assumptions.
comment: To appear at EMNLP 2025 Main conference
☆ Dancing with Deer: A Constructional Perspective on MWEs in the Era of LLMs
In this chapter, we argue for the benefits of understanding multiword expressions from the perspective of usage-based, construction grammar approaches. We begin with a historical overview of how construction grammar was developed in order to account for idiomatic expressions using the same grammatical machinery as the non-idiomatic structures of language. We cover a comprehensive description of constructions, which are pairings of meaning with form of any size (morpheme, word, phrase), as well as how constructional approaches treat the acquisition and generalization of constructions. We describe a successful case study leveraging constructional templates for representing multiword expressions in English PropBank. Because constructions can be at any level or unit of form, we then illustrate the benefit of a constructional representation of multi-meaningful morphosyntactic unit constructions in Arapaho, a highly polysynthetic and agglutinating language. We include a second case study leveraging constructional templates for representing these multi-morphemic expressions in Uniform Meaning Representation. Finally, we demonstrate the similarities and differences between a usage-based explanation of a speaker learning a novel multiword expression, such as "dancing with deer," and that of a large language model. We present experiments showing that both models and speakers can generalize the meaning of novel multiword expressions based on a single exposure of usage. However, only speakers can reason over the combination of two such expressions, as this requires comparison of the novel forms to a speaker's lifetime of stored constructional exemplars, which are rich with cross-modal details.
comment: Chapter in Phraseology and Multiword Expressions, Language Science Press (to appear)
☆ ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation
Large Language Models (LLMs) have demonstrated remarkable capabilities in Register Transfer Level (RTL) design, enabling high-quality code generation from natural language descriptions. However, LLMs alone face significant limitations in real-world hardware design workflows, including the inability to execute code, lack of debugging capabilities, and absence of long-term memory. To address these challenges, we present ASIC-Agent, an autonomous system designed specifically for digital ASIC design tasks. ASIC-Agent enhances base LLMs with a multi-agent architecture incorporating specialized sub-agents for RTL generation, verification, OpenLane hardening, and Caravel chip integration, all operating within a comprehensive sandbox environment with access to essential hardware design tools. The system leverages a vector database containing documentation, API references, error knowledge, and curated insights from the open-source silicon community. To evaluate ASIC-Agent's performance, we introduce ASIC-Agent-Bench, the first benchmark specifically designed to assess agentic systems in hardware design tasks. We evaluate ASIC-Agent with various base LLMs, providing quantitative comparisons and qualitative insights into agent behavior across different design scenarios. Our results demonstrate that ASIC-Agent, when powered by Claude 4 Sonnet, successfully automates a broad range of ASIC design tasks spanning varying levels of complexity, showing the potential of significantly accelerating the ASIC design workflow.
comment: 2025 IEEE International Conference on LLM-Aided Design (ICLAD)
☆ Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets
We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
comment: to be published in the workshop proceedings of the "From Rules to Language Models: Comparative Performance Evaluation" workshop, held alongside RANLP 2025
Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
comment: Tech Report
☆ Beyond Transcription: Mechanistic Interpretability in ASR
Interpretability methods have recently gained significant attention, particularly in the context of large language models, enabling insights into linguistic representations, error detection, and model behaviors such as hallucinations and repetitions. However, these techniques remain underexplored in automatic speech recognition (ASR), despite their potential to advance both the performance and interpretability of ASR systems. In this work, we adapt and systematically apply established interpretability methods such as logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems. Our experiments reveal previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations. These insights demonstrate the benefits of extending and applying interpretability techniques to speech recognition, opening promising directions for future research on improving model transparency and robustness.
☆ Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs ICML2025
Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5\% success on Busy Beaver problems, it manages only 12\% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
comment: Accepted to AI4MATH@ICML2025
☆ Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs
This paper presents the Annif system in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025. The task required creating subject predictions for bibliographic records using large language models, with a special focus on computational efficiency. Our system, based on the Annif automated subject indexing toolkit, refines our previous system from the first LLMs4Subjects shared task, which produced excellent results. We further improved the system by using many small and efficient language models for translation and synthetic data generation and by using LLMs for ranking candidate subjects. Our system ranked 1st in the overall quantitative evaluation of and 1st in the qualitative evaluation of Subtask 2.
comment: 5 pages, 4 figures, accepted at KONVENS 2025. arXiv admin note: substantial text overlap with arXiv:2504.19675
☆ DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking
Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path, and combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy, combines tool-based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.
♻ ☆ Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across 6 scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior$\unicode{x2013}$even when surface content remains unchanged. These findings support geometry-aware tools that operate directly in latent space to detect and mitigate harmful or adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states to act as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model's built-in safety alignment and external token-level filters.
♻ ☆ Empirical Evidence for Alignment Faking in a Small LLM and Prompt-Based Mitigation Techniques
Current literature suggests that alignment faking (deceptive alignment) is an emergent property of large language models. We present the first empirical evidence that a small instruction-tuned model, specifically LLaMA 3 8B, can exhibit alignment faking. We further show that prompt-only interventions, including deontological moral framing and scratchpad reasoning, significantly reduce this behavior without modifying model internals. This challenges the assumption that prompt-based ethics are trivial and that deceptive alignment requires scale. We introduce a taxonomy distinguishing shallow deception, shaped by context and suppressible through prompting, from deep deception, which reflects persistent, goal-driven misalignment. Our findings refine the understanding of deception in language models and underscore the need for alignment evaluations across model sizes and deployment settings.
♻ ☆ RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation
Code generation has attracted increasing attention with the rise of Large Language Models (LLMs). Many studies have developed powerful code LLMs by synthesizing code-related instruction data and applying supervised fine-tuning. However, these methods are limited by teacher model distillation and ignore the potential of iterative refinement by self-generated code. In this paper, we propose Adaptive Critique Refinement (ACR), which enables the model to refine itself by self-generated code and external critique, rather than directly imitating the code responses of the teacher model. Concretely, ACR includes a composite scoring system with LLM-as-a-Judge to evaluate the quality of code responses and a selective critique strategy with LLM-as-a-Critic to critique self-generated low-quality code responses. We develop the RefineCoder series by iteratively applying ACR, achieving continuous performance improvement on multiple code generation benchmarks. Compared to the baselines of the same size, our proposed RefineCoder series can achieve comparable or even superior performance using less data.
♻ ☆ InfAlign: Inference-aware language model alignment
Language model alignment is a critical step in training modern generative language models. Alignment targets to improve win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
♻ ☆ Learning to Generate Unit Tests for Automated Debugging
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug that (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, and helps LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Moreover, we observe that feedback from Qwen2.5 32B-based UTGen model can enhance debugging with frontier LLMs like GPT-4o by 13.8%. Lastly, we demonstrate that UTGen is a better judge for code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.
comment: Accepted to COLM 2025. Dataset and Code: https://github.com/archiki/UTGenDebug
♻ ☆ Exploring Modularity of Agentic Systems for Drug Discovery
Large-language models (LLMs) and agentic systems present exciting opportunities to accelerate drug discovery. In this study, we examine the modularity of LLM-based agentic systems for drug discovery, i.e., whether parts of the system such as the LLM and type of agent are interchangeable, a topic that has received limited attention in drug discovery. We compare the performance of different LLMs and the effectiveness of tool-calling agents versus code-generating agents. Our case study, comparing performance in orchestrating tools for chemistry and drug discovery using an LLM-as-a-judge score, shows that Claude-3.5-Sonnet, Claude-3.7-Sonnet and GPT-4o outperform alternative language models such as Llama-3.1-8B, Llama-3.1-70B, GPT-3.5-Turbo, and Nova-Micro. Although we confirm that code-generating agents outperform the tool-calling ones on average, we show that this is highly question- and model-dependent. Furthermore, the impact of replacing system prompts is dependent on the question and model, underscoring that even in this particular domain one cannot just replace components of the system without re-engineering. Our study highlights the necessity of further research into the modularity of agentic systems to enable the development of reliable and modular solutions for real-world problems.
♻ ☆ Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs SemEval-2025
This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.
comment: 6 pages, 4 figures, published at SemEval-2025 workshop Task 5: LLMs4Subjects: https://aclanthology.org/2025.semeval-1.315/
♻ ☆ Unplug and Play Language Models: Decomposing Experts in Language Models at Inference Time CIKM 2025
Enabled by large-scale text corpora with huge parameters, pre-trained language models operate as multi-task experts using a single model architecture. However, recent studies have revealed that certain neurons play disproportionately important roles in solving specific tasks, suggesting that task-relevant substructures can be isolated and selectively activated for each task. Therefore, we introduce Decomposition of Experts (DoE), a novel framework that dynamically identifies and activates task-specific experts within a language model to reduce inference cost without sacrificing accuracy. We first define a task expert as a set of parameters that significantly influence the performance of a specific task and propose a four-step unplug-and-play process: (1) receiving a user request, (2) identifying the corresponding task expert, (3) performing inference using the expert-localized model, and (4) restoring the original model and waiting for the next task. Using attribution methods and prompt tuning, DoE isolates task-relevant neurons, minimizing computational overhead while maintaining task performance. We assume a setting where a language model receives user requests from five widely used natural language understanding benchmarks, processing one task at a time. In this setup, we demonstrate that DoE achieves up to a x1.73 inference speed-up with a 65% pruning rate, without compromising accuracy. Comparisons with various task expert localization methods reveal that DoE effectively identifies task experts, while ablation studies validate the importance of its components. Additionally, we analyze the effects of batch size, token count, and layer types on inference speed-up, providing practical insights for adopting DoE. The proposed framework is both practical and scalable, applicable to any transformer-based architecture, offering a robust solution for efficient task-specific inference.
comment: accepted to CIKM 2025
♻ ☆ Advancing the Database of Cross-Linguistic Colexifications with New Workflows and Data
Lexical resources are crucial for cross-linguistic analysis and can provide new insights into computational models for natural language learning. Here, we present an advanced database for comparative studies of words with multiple meanings, a phenomenon known as colexification. The new version includes improvements in the handling, selection and presentation of the data. We compare the new database with previous versions and find that our improvements provide a more balanced sample covering more language families worldwide, with enhanced data quality, given that all word forms are provided in phonetic transcription. We conclude that the new Database of Cross-Linguistic Colexifications has the potential to inspire exciting new studies that link cross-linguistic data to open questions in linguistic typology, historical linguistics, psycholinguistics, and computational linguistics.
♻ ☆ Seed-X: Building Strong Multilingual Translation LLM with 7B Parameters
Multilingual translation stands as a challenging task for large language models (LLMs) to handle intricate language patterns and stilted translations that arise in automated translations. In this paper, we introduce Seed-X, a family of open-source LLMs comprising instruct and reasoning models, pushing the limits of translation capability with 7B parameter size. The base model is pre-trained on a diverse, high-quality dataset encompassing both monolingual and bilingual content across 28 languages, harnessing the full potential of multilingual data. The instruct model is then finetuned to translate by Chain-of-Thought (CoT) reasoning and further enhanced through reinforcement learning (RL) to achieve better generalization across diverse language pairs. Seed-X achieves performance comparable to leading closed-source models, including Gemini-2.5 and GPT-4o, across 28 languages, and significantly outperforms larger open-source models in both automatic metrics and human evaluations. We share the best practices through our optimization process, and make the parameter public available for advancing translation research and applications.
♻ ☆ The Devil is in the EOS: Sequence Training for Detailed Image Captioning
Despite significant advances in vision-language models (VLMs), image captioning often suffers from a lack of detail, with base models producing short, generic captions. This limitation persists even though VLMs are equipped with strong vision and language backbones. While supervised data and complex reward functions have been proposed to improve detailed image captioning, we identify a simpler underlying issue: a bias towards the end-of-sequence (EOS) token, which is introduced during cross-entropy training. We propose an unsupervised method to debias the model's tendency to predict the EOS token prematurely. By reducing this bias, we encourage the generation of longer, more detailed captions without the need for intricate reward functions or supervision. Our approach is straightforward, effective, and easily applicable to any pretrained model. We demonstrate its effectiveness through experiments with three VLMs and on three detailed captioning benchmarks. Our results show a substantial increase in caption length and relevant details, albeit with an expected increase in the rate of hallucinations.
comment: Accepted to COLM 2025
♻ ☆ CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset
We introduce CUS-QA, a benchmark for open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset consists of manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations. It includes both purely textual questions and those requiring visual understanding. We evaluate state-of-the-art LLMs through prompting and complement this with human judgments of answer correctness. Using these human evaluations, we analyze the reliability of existing automatic evaluation metrics. Our baseline results show that even the best open-weight LLMs achieve only around 50% accuracy on textual questions and below 30% on visual questions. LLM-based evaluation metrics show strong correlation with human judgment, while traditional string-overlap metrics perform surprisingly well due to the prevalence of named entities in answers.
♻ ☆ Pub-Guard-LLM: Detecting Retracted Biomedical Articles with Reliable Explanations
A significant and growing number of published scientific articles is found to involve fraudulent practices, posing a serious threat to the credibility and safety of research in fields such as medicine. We propose Pub-Guard-LLM, the first large language model-based system tailored to fraud detection of biomedical scientific articles. We provide three application modes for deploying Pub-Guard-LLM: vanilla reasoning, retrieval-augmented generation, and multi-agent debate. Each mode allows for textual explanations of predictions. To assess the performance of our system, we introduce an open-source benchmark, PubMed Retraction, comprising over 11K real-world biomedical articles, including metadata and retraction labels. We show that, across all modes, Pub-Guard-LLM consistently surpasses the performance of various baselines and provides more reliable explanations, namely explanations which are deemed more relevant and coherent than those generated by the baselines when evaluated by multiple assessment methods. By enhancing both detection performance and explainability in scientific fraud detection, Pub-Guard-LLM contributes to safeguarding research integrity with a novel, effective, open-source tool.
comment: long paper under review
♻ ☆ AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP
Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications to examine their capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks-boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299
♻ ☆ IBPS: Indian Bail Prediction System
Bail decisions are among the most frequently adjudicated matters in Indian courts, yet they remain plagued by subjectivity, delays, and inconsistencies. With over 75% of India's prison population comprising undertrial prisoners, many from socioeconomically disadvantaged backgrounds, the lack of timely and fair bail adjudication exacerbates human rights concerns and contributes to systemic judicial backlog. In this paper, we present the Indian Bail Prediction System (IBPS), an AI-powered framework designed to assist in bail decision-making by predicting outcomes and generating legally sound rationales based solely on factual case attributes and statutory provisions. We curate and release a large-scale dataset of 150,430 High Court bail judgments, enriched with structured annotations such as age, health, criminal history, crime category, custody duration, statutes, and judicial reasoning. We fine-tune a large language model using parameter-efficient techniques and evaluate its performance across multiple configurations, with and without statutory context, and with RAG. Our results demonstrate that models fine-tuned with statutory knowledge significantly outperform baselines, achieving strong accuracy and explanation quality, and generalize well to a test set independently annotated by legal experts. IBPS offers a transparent, scalable, and reproducible solution to support data-driven legal assistance, reduce bail delays, and promote procedural fairness in the Indian judicial system.
♻ ☆ MuSeD: A Multimodal Spanish Dataset for Sexism Detection in Social Media Videos
Sexism is generally defined as prejudice and discrimination based on sex or gender, affecting every sector of society, from social institutions to relationships and individual behavior. Social media platforms amplify the impact of sexism by conveying discriminatory content not only through text but also across multiple modalities, highlighting the critical need for a multimodal approach to the analysis of sexism online. With the rise of social media platforms where users share short videos, sexism is increasingly spreading through video content. Automatically detecting sexism in videos is a challenging task, as it requires analyzing the combination of verbal, audio, and visual elements to identify sexist content. In this study, (1) we introduce MuSeD, a new Multimodal Spanish dataset for Sexism Detection consisting of $\approx$ 11 hours of videos extracted from TikTok and BitChute; (2) we propose an innovative annotation framework for analyzing the contributions of textual, vocal, and visual modalities to the classification of content as either sexist or non-sexist; and (3) we evaluate a range of large language models (LLMs) and multimodal LLMs on the task of sexism detection. We find that visual information plays a key role in labeling sexist content for both humans and models. Models effectively detect explicit sexism; however, they struggle with implicit cases, such as stereotypes, instances where annotators also show low agreement. This highlights the inherent difficulty of the task, as identifying implicit sexism depends on the social and cultural context.
comment: COLM 2025 camera-ready version: expanded Section 4.3 with an additional experiment using an extended definition-based prompt (including a definition of sexist content), and applied minor corrections
♻ ☆ Pragmatic Inference Chain (PIC) Improving LLMs' Reasoning of Authentic Implicit Toxic Language
The rapid development of large language models (LLMs) gives rise to ethical concerns about their performance, while opening new avenues for developing toxic language detection techniques. However, LLMs' unethical output and their capability of detecting toxicity have primarily been tested on language data that do not demand complex meaning inference, such as the biased associations of 'he' with programmer and 'she' with household. Nowadays toxic language adopts a much more creative range of implicit forms, thanks to advanced censorship. In this study, we collect authentic toxic interactions that evade online censorship and that are verified by human annotators as inference-intensive. To evaluate and improve LLMs' reasoning of the authentic implicit toxic language, we propose a new prompting method, Pragmatic Inference Chain (PIC), drawn on interdisciplinary findings from cognitive science and linguistics. The PIC prompting significantly improves the success rate of GPT-4o, Llama-3.1-70B-Instruct, DeepSeek-v2.5, and DeepSeek-v3 in identifying implicit toxic language, compared to five baseline prompts, such as CoT and rule-based baselines. In addition, it also facilitates the models to produce more explicit and coherent reasoning processes, hence can potentially be generalized to other inference-intensive tasks, e.g., understanding humour and metaphors.
comment: 14 pages, 4 figures, 2 tables
♻ ☆ LATTE: Learning Aligned Transactions and Textual Embeddings for Bank Clients
Learning clients embeddings from sequences of their historic communications is central to financial applications. While large language models (LLMs) offer general world knowledge, their direct use on long event sequences is computationally expensive and impractical in real-world pipelines. In this paper, we propose LATTE, a contrastive learning framework that aligns raw event embeddings with semantic embeddings from frozen LLMs. Behavioral features are summarized into short prompts, embedded by the LLM, and used as supervision via contrastive loss. The proposed approach significantly reduces inference cost and input size compared to conventional processing of complete sequence by LLM. We experimentally show that our method outperforms state-of-the-art techniques for learning event sequence representations on real-world financial datasets while remaining deployable in latency-sensitive environments.
♻ ☆ Optimizing Cross-Client Domain Coverage for Federated Instruction Tuning of Large Language Models EMNLP 2025
Federated domain-specific instruction tuning (FedDIT) for large language models (LLMs) aims to enhance performance in specialized domains using distributed private and limited data, yet identifying key performance drivers and optimal augmentation strategies remains challenging. We empirically establish that cross-client domain coverage, rather than data heterogeneity, is the pivotal factor. We then introduce FedDCA, an algorithm that explicitly maximizes this coverage through diversity-oriented client center selection and retrieval-based augmentation, constructing diverse, non-redundant cross-client instruction sets. Extensive experiments across multiple domains demonstrate FedDCA's superiority over eleven baselines, achieving performance gains of up to 29.19\% and domain coverage improvements of 4.82\%-21.36\%. FedDCA maintains its effectiveness in diverse and challenging scenarios, including data selection, held-out settings where task-specific public data is scarce and various data heterogeneity, with manageable privacy risks. This work clarifies critical FedDIT dynamics and presents FedDCA as an effective, privacy-preserving, and scalable solution for advancing domain-specific LLM tuning.
comment: EMNLP 2025
♻ ☆ FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering
Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods-whether sparse or dense-often fall short in retrieval accuracy, as it requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance -- a setting we term agentic retrieval. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design enables to provide a quantitative basis for understanding retrieval-centric LLM behavior in finance. We evaluate a suite of state-of-the-art models and further demonstrated how targeted fine-tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval-centric LLM behavior in complex, domain-specific tasks for finance. We will release the dataset publicly upon acceptance of the paper and plan to expand and share dataset for the full S&P 500 and beyond.
comment: 6 pages
♻ ☆ VerifiAgent: a Unified Verification Agent in Language Model Reasoning EMNLP 2025
Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) among all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at https://github.com/Jiuzhouh/VerifiAgent
comment: EMNLP 2025
♻ ☆ Everybody Likes to Sleep: A Computer-Assisted Comparison of Object Naming Data from 30 Languages
Object naming - the act of identifying an object with a word or a phrase - is a fundamental skill in interpersonal communication, relevant to many disciplines, such as psycholinguistics, cognitive linguistics, or language and vision research. Object naming datasets, which consist of concept lists with picture pairings, are used to gain insights into how humans access and select names for objects in their surroundings and to study the cognitive processes involved in converting visual stimuli into semantic concepts. Unfortunately, object naming datasets often lack transparency and have a highly idiosyncratic structure. Our study tries to make current object naming data transparent and comparable by using a multilingual, computer-assisted approach that links individual items of object naming lists to unified concepts. Our current sample links 17 object naming datasets that cover 30 languages from 10 different language families. We illustrate how the comparative dataset can be explored by searching for concepts that recur across the majority of datasets and comparing the conceptual spaces of covered object naming datasets with classical basic vocabulary lists from historical linguistics and linguistic typology. Our findings can serve as a basis for enhancing cross-linguistic object naming research and as a guideline for future studies dealing with object naming tasks.
comment: To appear in the Proceedings of the Global WordNet Conference 2025
♻ ☆ On the Fundamental Impossibility of Hallucination Control in Large Language Models
This paper establishes a fundamental impossibility theorem: no LLM capable of performing non-trivial knowledge aggregation can simultaneously achieve truthful knowledge representation, semantic information conservation, complete revelation of relevant knowledge, and knowledge-constrained optimality. The impossibility is not an engineering limitation but arises from the mathematical structure of information aggregation itself. We establish this result by describing the inference process as an auction of ideas, where distributed components compete exploiting their partial knowledge to shape responses. The proof spans three independent mathematical domains: mechanism design theory (Green-Laffont), the theory of proper scoring rules (Savage), and direct architectural analysis of transformers (Log-Sum-Exp convexity). In particular, we show how to quantify the creation of overconfident or intuitive responses-the signature of both hallucination and creativity, or imagination. To support this analysis, we introduce the complementary concepts of the semantic information measure and the emergence operator to model bounded reasoning in a general setting. We prove that while bounded reasoning generates accessible information, providing valuable insights and inspirations, the idealized unconstrained reasoning strictly preserves semantic content. By demonstrating that hallucination and imagination are mathematically identical phenomena-grounded in departures from truthfulness, semantic information conservation, revelation of relevant knowledge, and knowledge-constrained optimality-we offer a principled foundation for managing these behaviors in advanced AI systems. Finally, we present some speculative ideas to inspire evaluation and refinements of the proposed theory.
comment: cleared mathematics, proofs debugged and ideas expanded, added missing definitions and axioms, discussion and speculation section reorganized, related work improved
♻ ☆ One-shot Entropy Minimization
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled data and 10 steps optimization to achieve performance improvements comparable to or even greater than those obtained using thousands of data and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is avaliable at https://github.com/zitian-gao/one-shot-em.
comment: Work in progress
♻ ☆ Self-Supervised Prompt Optimization
Well-designed prompts are crucial for enhancing Large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or by humans, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external reference. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO.
♻ ☆ Large Language Models for Automated Literature Review: An Evaluation of Reference Generation, Abstract Writing, and Review Composition EMNLP 2025
Large language models (LLMs) have emerged as a potential solution to automate the complex processes involved in writing literature reviews, such as literature collection, organization, and summarization. However, it is yet unclear how good LLMs are at automating comprehensive and reliable literature reviews. This study introduces a framework to automatically evaluate the performance of LLMs in three key tasks of literature writing: reference generation, literature summary, and literature review composition. We introduce multidimensional evaluation metrics that assess the hallucination rates in generated references and measure the semantic coverage and factual consistency of the literature summaries and compositions against human-written counterparts. The experimental results reveal that even the most advanced models still generate hallucinated references, despite recent progress. Moreover, we observe that the performance of different models varies across disciplines when it comes to writing literature reviews. These findings highlight the need for further research and development to improve the reliability of LLMs in automating academic literature reviews.
comment: 12 pages, 5 figures, 5 tables, Accepted by EMNLP 2025 Main Conference
♻ ☆ Ontology-Guided Reverse Thinking Makes Large Language Models Stronger on Knowledge Graph Question Answering
Large language models (LLMs) have shown remarkable capabilities in natural language processing. However, in knowledge graph question answering tasks (KGQA), there remains the issue of answering questions that require multi-hop reasoning. Existing methods rely on entity vector matching, but the purpose of the question is abstract and difficult to match with specific entities. As a result, it is difficult to establish reasoning paths to the purpose, which leads to information loss and redundancy. To address this issue, inspired by human reverse thinking, we propose Ontology-Guided Reverse Thinking (ORT), a novel framework that constructs reasoning paths from purposes back to conditions. ORT operates in three key phases: (1) using LLM to extract purpose labels and condition labels, (2) constructing label reasoning paths based on the KG ontology, and (3) using the label reasoning paths to guide knowledge retrieval. Experiments on the WebQSP and CWQ datasets show that ORT achieves state-of-the-art performance and significantly enhances the capability of LLMs for KGQA.
comment: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
♻ ☆ Cequel: Cost-Effective Querying of Large Language Models for Text Clustering
Text clustering aims to automatically partition a collection of documents into coherent groups based on their linguistic features. In the literature, this task is formulated either as metric clustering over pre-trained text embeddings or as graph clustering based on pairwise similarities derived from an oracle, e.g., a large machine learning model. Recent advances in large language models (LLMs) have significantly improved this field by providing high-quality contextualized embeddings and accurate semantic similarity estimates. However, leveraging LLMs at scale introduces substantial computational and financial costs due to the large number of required API queries or inference calls. To address this issue, we propose Cequel, a cost-effective framework that achieves accurate text clustering under a limited budget of LLM queries. At its core, Cequel constructs must-link and cannot-link constraints by selectively querying LLMs on informative text pairs or triplets, identified via our proposed algorithms, EdgeLLM and TriangleLLM. These constraints are then utilized in a weighted constrained clustering algorithm to form high-quality clusters. Specifically, EdgeLLM and TriangleLLM employ carefully designed greedy selection strategies and prompting techniques to identify and extract informative constraints efficiently. Experiments on multiple benchmark datasets demonstrate that Cequel consistently outperforms existing methods in unsupervised text clustering under the same query budget.
♻ ☆ ThinkTuning: Instilling Cognitive Reflections without Distillation EMNLP 2025
Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don't exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback -- enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student's thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning.
comment: EMNLP 2025 (Main Conference)
♻ ☆ Kuwain 1.5B: An Arabic SLM via Language Injection
Enhancing existing models with new knowledge is a crucial aspect of AI development. This paper introduces a novel method for integrating a new language into a large language model (LLM). Our approach successfully incorporates a previously unseen target language into an existing LLM without compromising its prior knowledge. We trained a tiny model with 1.5 billion parameters named Kuwain by injecting the Arabic language into a small open-source model mainly trained in English. Our method demonstrates significant improvements in Arabic language performance, with an average 8% improvement across various benchmarks, while retaining the model's existing knowledge with a minimum amount of the original model's data. This offers a cost-effective alternative to training a comprehensive model in both English and Arabic. The results highlight the potential for efficient, targeted language model expansion without extensive retraining or resource-intensive processes.
♻ ☆ Sadeed: Advancing Arabic Diacritization Through Small Language Model
Arabic text diacritization remains a persistent challenge in natural language processing due to the language's morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B Hennara et al. [2025], a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.
♻ ☆ WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model EMNLP 2025
Agent self-improvement, where the backbone Large Language Model (LLM) of the agent are trained on trajectories sampled autonomously based on their own policies, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance will reach a stagnation point during autonomous learning cycles, hindering further improvement. We argue that this stems from limited exploration of the web environment and insufficient exploitation of pre-trained web knowledge in LLMs. To improve the performance of self-improvement, we propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. Leveraging LLMs' pretrained knowledge of abundant web content, the World Model serves dual roles: (1) as a virtual web server generating self-instructed training data to continuously refine the agent's policy, and (2) as an imagination engine during inference, enabling look-ahead simulation to guide action selection for the agent LLM. Experiments in real-world web environments (Mind2Web-Live, WebVoyager, and GAIA-web) show a 10% performance gain over existing self-evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful close-sourced models. Our work establishes the necessity of integrating world models into autonomous agent frameworks to unlock sustained adaptability. Code is available at https://github.com/Tencent/SelfEvolvingAgent
comment: EMNLP 2025 Main Conference
♻ ☆ Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models. Leveraging this insight, we developed Mutarjim based on Kuwain-1.5B , a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus.. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.
♻ ☆ Versatile Framework for Song Generation with Prompt-based Control EMNLP 2025
Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks using objective and subjective metrics. Demos and codes are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.
comment: Accepted by Findings of EMNLP 2025
♻ ☆ Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
Flexible tool selection reflects a complex cognitive ability that distinguishes humans from other species, yet computational models that capture this ability remain underdeveloped. We developed a framework using low-dimensional attribute representations to bridge visual tool perception and linguistic task understanding. We constructed a comprehensive dataset (ToolNet) containing 115 common tools labeled with 13 carefully designed attributes spanning physical, functional, and psychological properties, paired with natural language scenarios describing tool usage. Visual encoders (ResNet or ViT) extract attributes from tool images while fine-tuned language models (GPT-2, LLaMA, DeepSeek) derive required attributes from task descriptions. Our approach achieves 74% accuracy in tool selection tasks-significantly outperforming direct tool matching (20%) and smaller multimodal models (21%-58%), while approaching performance of much larger models like GPT-4o (73%) with substantially fewer parameters. Human evaluation studies validate our framework's alignment with human decision-making patterns, and generalization experiments demonstrate effective performance on novel tool categories. Ablation studies revealed that manipulation-related attributes (graspability, elongation, hand-relatedness) consistently prove most critical across modalities. This work provides a parameter-efficient, interpretable solution that mimics human-like tool cognition, advancing both cognitive science understanding and practical applications in tool selection tasks.
♻ ☆ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
♻ ☆ CRISPR-GPT for Agentic Automation of Gene-editing Experiments
The introduction of genome engineering technology has transformed biomedical research, making it possible to make precise changes to genetic information. However, creating an efficient gene-editing system requires a deep understanding of CRISPR technology, and the complex experimental systems under investigation. While Large Language Models (LLMs) have shown promise in various tasks, they often lack specific knowledge and struggle to accurately solve biological design problems. In this work, we introduce CRISPR-GPT, an LLM agent augmented with domain knowledge and external tools to automate and enhance the design process of CRISPR-based gene-editing experiments. CRISPR-GPT leverages the reasoning ability of LLMs to facilitate the process of selecting CRISPR systems, designing guide RNAs, recommending cellular delivery methods, drafting protocols, and designing validation experiments to confirm editing outcomes. We showcase the potential of CRISPR-GPT for assisting non-expert researchers with gene-editing experiments from scratch and validate the agent's effectiveness in a real-world use case. Furthermore, we explore the ethical and regulatory considerations associated with automated gene-editing design, highlighting the need for responsible and transparent use of these tools. Our work aims to bridge the gap between beginner biological researchers and CRISPR genome engineering techniques, and demonstrate the potential of LLM agents in facilitating complex biological discovery tasks. The published version of this draft is available at https://www.nature.com/articles/s41551-025-01463-z.
comment: Accepted to Nature Biomedical Engineering
♻ ☆ Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
We present two multilingual LLMs, Teuken 7B-base and Teuken 7B-instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate strong performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, and TruthfulQA.
♻ ☆ Arabic Multimodal Machine Learning: Datasets, Applications, Approaches, and Challenges
Multimodal Machine Learning (MML) aims to integrate and analyze information from diverse modalities, such as text, audio, and visuals, enabling machines to address complex tasks like sentiment analysis, emotion recognition, and multimedia retrieval. Recently, Arabic MML has reached a certain level of maturity in its foundational development, making it time to conduct a comprehensive survey. This paper explores Arabic MML by categorizing efforts through a novel taxonomy and analyzing existing research. Our taxonomy organizes these efforts into four key topics: datasets, applications, approaches, and challenges. By providing a structured overview, this survey offers insights into the current state of Arabic MML, highlighting areas that have not been investigated and critical research gaps. Researchers will be empowered to build upon the identified opportunities and address challenges to advance the field.
♻ ☆ LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles EMNLP 2025
Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
comment: Accepted to EMNLP 2025 Findings. The LaMP-CAP dataset is publicly available at: https://github.com/Crowd-AI-Lab/lamp-cap
♻ ☆ Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding ICML 2025
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con\textbf{T}extualized equivari\textbf{A}nt \textbf{P}osition \textbf{E}ncoding (\textbf{TAPE}), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
comment: ICML 2025
♻ ☆ Robust Bias Detection in MLMs and its Application to Human Trait Ratings NAACL 2025
There has been significant prior work using templates to study bias against demographic attributes in MLMs. However, these have limitations: they overlook random variability of templates and target concepts analyzed, assume equality amongst templates, and overlook bias quantification. Addressing these, we propose a systematic statistical approach to assess bias in MLMs, using mixed models to account for random effects, pseudo-perplexity weights for sentences derived from templates and quantify bias using statistical effect sizes. Replicating prior studies, we match on bias scores in magnitude and direction with small to medium effect sizes. Next, we explore the novel problem of gender bias in the context of $\emph{personality}$ and $\textit{character}$ traits, across seven MLMs (base and large). We find that MLMs vary; ALBERT is unbiased for binary gender but the most biased for non-binary $\textit{neo}$, while RoBERTa-large is the most biased for binary gender but shows small to no bias for $\textit{neo}$. There is some alignment of MLM bias and findings in psychology (human perspective) - in $\textit{agreeableness}$ with RoBERTa-large and $\textit{emotional stability}$ with BERT-large. There is general agreement for the remaining 3 personality dimensions: both sides observe at most small differences across gender. For character traits, human studies on gender bias are limited thus comparisons are not feasible.
comment: Findings of NAACL 2025
♻ ☆ Leveraging Large Language Models for Explainable Activity Recognition in Smart Homes: A Critical Evaluation
Explainable Artificial Intelligence (XAI) aims to uncover the inner reasoning of machine learning models. In IoT systems, XAI improves the transparency of models processing sensor data from multiple heterogeneous devices, ensuring end-users understand and trust their outputs. Among the many applications, XAI has also been applied to sensor-based Activities of Daily Living (ADLs) recognition in smart homes. Existing approaches highlight which sensor events are most important for each predicted activity, using simple rules to convert these events into natural language explanations for non-expert users. However, these methods produce rigid explanations lacking natural language flexibility and are not scalable. With the recent rise of Large Language Models (LLMs), it is worth exploring whether they can enhance explanation generation, considering their proven knowledge of human activities. This paper investigates potential approaches to combine XAI and LLMs for sensor-based ADL recognition. We evaluate if LLMs can be used: a) as explainable zero-shot ADL recognition models, avoiding costly labeled data collection, and b) to automate the generation of explanations for existing data-driven XAI approaches when training data is available and the goal is higher recognition rates. Our critical evaluation provides insights into the benefits and challenges of using LLMs for explainable ADL recognition.
♻ ☆ Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning
The severe shortage of medical doctors limits access to timely and reliable healthcare, leaving millions underserved. Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions. Many LLMs are not grounded in authoritative medical guidelines and fail to transparently manage diagnostic uncertainty. Their language is often rigid and mechanical, lacking the human-like qualities essential for patient trust. To address these challenges, we propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction. APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement. It also incorporates Bayesian active learning to support transparent and adaptive diagnoses. The framework is built on verified medical guidelines, ensuring clinically grounded and evidence-based reasoning. To evaluate its performance, we develop a new benchmark that simulates realistic medical conversations using patient agents driven by profiles extracted from real-world consultation cases. We compare APP against SOTA one-shot and multi-turn LLM baselines. The results show that APP improves diagnostic accuracy, reduces uncertainty, and enhances user experience. By integrating medical expertise with transparent, human-like interaction, APP bridges the gap between AI-driven medical assistance and real-world clinical practice.
♻ ☆ Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems
Voice-based conversational AI systems increasingly rely on cascaded architectures that combine speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS) components. We present a large-scale empirical comparison of STT x LLM x TTS stacks using data sampled from over 300,000 AI-conducted job interviews. We used an LLM-as-a-Judge automated evaluation framework to assess conversational quality, technical accuracy, and skill assessment capabilities. Our analysis of five production configurations reveals that a stack combining Google's STT, GPT-4.1, and Cartesia's TTS outperforms alternatives in both objective quality metrics and user satisfaction scores. Surprisingly, we find that objective quality metrics correlate weakly with user satisfaction scores, suggesting that user experience in voice-based AI systems depends on factors beyond technical performance. Our findings provide practical guidance for selecting components in multimodal conversations and contribute a validated evaluation methodology for human-AI interactions.
♻ ☆ NitiBench: A Comprehensive Study of LLM Framework Capabilities for Thai Legal Question Answering
The application of large language models (LLMs) in the legal domain holds significant potential for information retrieval and question answering, yet Thai legal QA systems face challenges due to a lack of standardized evaluation benchmarks and the complexity of Thai legal structures. This paper introduces NitiBench, a benchmark comprising two datasets: the NitiBench-CCL, covering general Thai financial law, and the NitiBench-Tax, which includes real-world tax law cases requiring advanced legal reasoning. We evaluate retrieval-augmented generation (RAG) and long-context LLM-based approaches to address three key research questions: the impact of domain-specific components like section-based chunking and cross-referencing, the comparative performance of different retrievers and LLMs, and the viability of long-context LLMs as an alternative to RAG. Our results show that section-based chunking significantly improves retrieval and end-to-end performance, current retrievers struggle with complex queries, and long-context LLMs still underperform RAG-based systems in Thai legal QA. To support fair evaluation, we propose tailored multi-label retrieval metrics and the use of an LLM-as-judge for coverage and contradiction detection method. These findings highlight the limitations of current Thai legal NLP solutions and provide a foundation for future research in the field. We also open-sourced our codes and dataset to available publicly.
♻ ☆ Do LLMs write like humans? Variation in grammatical and rhetorical styles
Large language models (LLMs) are capable of writing grammatical text that follows instructions, answers questions, and solves problems. As they have advanced, it has become difficult to distinguish their output from human-written text. While past research has found some differences in surface features such as word choice and punctuation, and developed classifiers to detect LLM output, none has studied the rhetorical styles of LLMs. Using several variants of Llama 3 and GPT-4o, we construct two parallel corpora of human- and LLM-written texts from common prompts. Using Douglas Biber's set of lexical, grammatical, and rhetorical features, we identify systematic differences between LLMs and humans and between different LLMs. These differences persist when moving from smaller models to larger ones, and are larger for instruction-tuned models than base models. This observation of differences demonstrates that despite their advanced abilities, LLMs struggle to match human stylistic variation. Attention to more advanced linguistic features can hence detect patterns in their behavior not previously recognized.
comment: 7 pages, 4 figures, 1 table
♻ ☆ Rethinking Tokenization for Rich Morphology: The Dominance of Unigram over BPE and Morphological Alignment
Prior work on language modeling showed conflicting findings about whether morphologically aligned approaches to tokenization improve performance, particularly for languages with complex morphology. To investigate this, we select a typologically diverse set of languages: Telugu (agglutinative), Hindi (primarily fusional with some agglutination), and English (fusional). We conduct a comprehensive evaluation of language models -- starting from tokenizer training and extending through the finetuning and downstream task evaluation. To account for the consistent performance differences observed across tokenizer variants, we focus on two key factors: morphological alignment and tokenization quality. To assess morphological alignment of tokenizers in Telugu, we create a dataset containing gold morpheme segmentations of 600 derivational and 7000 inflectional word forms. Our experiments reveal that better morphological alignment correlates positively -- though moderately -- with performance in syntax-based tasks such as Parts-of-Speech tagging, Named Entity Recognition and Dependency Parsing. However, we also find that the tokenizer algorithm (Byte-pair Encoding vs. Unigram) plays a more significant role in influencing downstream performance than morphological alignment alone. Naive Unigram tokenizers outperform others across most settings, though hybrid tokenizers that incorporate morphological segmentation significantly improve performance within the BPE framework. In contrast, intrinsic metrics like Corpus Token Count (CTC) and R\'enyi entropy showed no correlation with downstream performance.
♻ ☆ HypER: Literature-grounded Hypothesis Generation and Distillation with Provenance EMNLP 2025
Large Language models have demonstrated promising performance in research ideation across scientific domains. Hypothesis development, the process of generating a highly specific declarative statement connecting a research idea with empirical validation, has received relatively less attention. Existing approaches trivially deploy retrieval augmentation and focus only on the quality of the final output ignoring the underlying reasoning process behind ideation. We present $\texttt{HypER}$ ($\textbf{Hyp}$othesis Generation with $\textbf{E}$xplanation and $\textbf{R}$easoning), a small language model (SLM) trained for literature-guided reasoning and evidence-based hypothesis generation. $\texttt{HypER}$ is trained in a multi-task setting to discriminate between valid and invalid scientific reasoning chains in presence of controlled distractions. We find that $\texttt{HypER}$ outperformes the base model, distinguishing valid from invalid reasoning chains (+22\% average absolute F1), generates better evidence-grounded hypotheses (0.327 vs. 0.305 base model) with high feasibility and impact as judged by human experts ($>$3.5 on 5-point Likert scale).
comment: EMNLP 2025, 26 pages (9 pages: main paper body)
♻ ☆ Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models NAACL 2025
Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long document, narrative tasks. Narrative texts' length and complexity often mean they are difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are equally likely to contain hallucinations compared to summaries. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale to plan-guided approaches to summarization, especially for long, complex domains such as narrative texts. Code available at https://github.com/amazon-science/plan-guided-summarization
comment: Accepted to the 7th Workshop on Narrative Understanding (WNU), co-located with NAACL 2025
♻ ☆ Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
comment: Accepted by TMLR 2025. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
♻ ☆ How Performance Pressure Influences AI-Assisted Decision Making
Many domains now employ AI-based decision-making aids, and although the potential for AI systems to assist with decision making is much discussed, human-AI collaboration often underperforms due to factors such as (mis)trust in the AI system and beliefs about AI being incapable of completing subjective tasks. One potential tool for influencing human decision making is performance pressure, which hasn't been much studied in interaction with human-AI decision making. In this work, we examine how pressure and explainable AI (XAI) techniques interact with AI advice-taking behavior. Using an inherently low-stakes task (spam review classification), we demonstrate effective and simple methods to apply pressure and influence human AI advice-taking behavior by manipulating financial incentives and imposing time limits. Our results show complex interaction effects, with different combinations of pressure and XAI techniques either improving or worsening AI advice taking behavior. We conclude by discussing the implications of these interactions, strategies to effectively use pressure, and encourage future research to incorporate pressure analysis.
♻ ☆ MedResearcher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts. We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions. Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.
comment: 13 pages, 5 figures
♻ ☆ QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA EMNLP 2025
Alignment of large language models with explicit principles (such as helpfulness, honesty, and harmlessness) is crucial for ensuring safe and reliable AI systems. However, standard reward-based alignment methods typically collapse diverse feedback into a single scalar reward, entangling multiple objectives into one opaque training signal, which hinders interpretability. In this work, we introduce QA-LIGN, an automatic symbolic reward decomposition approach that preserves the structure of each constitutional principle within the reward mechanism. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions and derives separate reward components for each principle, making it a drop-in reward model replacement. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability in the alignment process. At the same time, our approach achieves performance on par with or better than a DPO baseline. Overall, these results represent a step toward more interpretable and controllable alignment of language models, achieved without sacrificing end-task performance.
comment: Accepted to Findings of EMNLP 2025
Information Retrieval 28
☆ Stemming -- The Evolution and Current State with a Focus on Bangla
Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla's rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
☆ Benchmarking Computer Science Survey Generation
Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE
☆ Reading Between the Lines: A Study of Thematic Bias in Book Recommender Systems RecSys 2025
Recommender systems help users discover new content, but can also reinforce existing biases, leading to unfair exposure and reduced diversity. This paper introduces and investigates thematic bias in book recommendations, defined as a disproportionate favouring or neglect of certain book themes. We adopt a multi-stage bias evaluation framework using the Book-Crossing dataset to evaluate thematic bias in recommendations and its impact on different user groups. Our findings show that thematic bias originates from content imbalances and is amplified by user engagement patterns. By segmenting users based on their thematic preferences, we find that users with niche and long-tail interests receive less personalised recommendations, whereas users with diverse interests receive more consistent recommendations. These findings suggest that recommender systems should be carefully designed to accommodate a broader range of user interests. By contributing to the broader goal of responsible AI, this work also lays the groundwork for extending thematic bias analysis to other domains.
comment: 7 pages, 5 figures, Accepted at FAccTRec at RecSys 2025
☆ LongRetriever: Towards Ultra-Long Sequence based Candidate Retrieval for Recommendation
Precisely modeling user ultra-long sequences is critical for industrial recommender systems. Current approaches predominantly focus on leveraging ultra-long sequences in the ranking stage, whereas research for the candidate retrieval stage remains under-explored. This paper presents LongRetriever, a practical framework for incorporating ultra-long sequences into the retrieval stage of recommenders. Specifically, we propose in-context training and multi-context retrieval, which enable candidate-specific interaction between user sequence and candidate item, and ensure training-serving consistency under the search-based paradigm. Extensive online A/B testing conducted on a large-scale e-commerce platform demonstrates statistically significant improvements, confirming the framework's effectiveness. Currently, LongRetriever has been fully deployed in the platform, impacting billions of users.
☆ On Evaluating the Adversarial Robustness of Foundation Models for Multimodal Entity Linking
The explosive growth of multimodal data has driven the rapid development of multimodal entity linking (MEL) models. However, existing studies have not systematically investigated the impact of visual adversarial attacks on MEL models. We conduct the first comprehensive evaluation of the robustness of mainstream MEL models under different adversarial attack scenarios, covering two core tasks: Image-to-Text (I2T) and Image+Text-to-Text (IT2T). Experimental results show that current MEL models generally lack sufficient robustness against visual perturbations. Interestingly, contextual semantic information in input can partially mitigate the impact of adversarial perturbations. Based on this insight, we propose an LLM and Retrieval-Augmented Entity Linking (LLM-RetLink), which significantly improves the model's anti-interference ability through a two-stage process: first, extracting initial entity descriptions using large vision models (LVMs), and then dynamically generating candidate descriptive sentences via web-based retrieval. Experiments on five datasets demonstrate that LLM-RetLink improves the accuracy of MEL by 0.4%-35.7%, especially showing significant advantages under adversarial conditions. This research highlights a previously unexplored facet of MEL robustness, constructs and releases the first MEL adversarial example dataset, and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.
☆ Test-time Corpus Feedback: From Retrieval to RAG
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for knowledge-intensive NLP tasks, combining large language models (LLMs) with document retrieval from external corpora. Despite its widespread use, most RAG pipelines continue to treat retrieval and reasoning as isolated components, retrieving documents once and then generating answers without further interaction. This static design often limits performance on complex tasks that require iterative evidence gathering or high-precision retrieval. Recent work in both the information retrieval (IR) and NLP communities has begun to close this gap by introducing adaptive retrieval and ranking methods that incorporate feedback. In this survey, we present a structured overview of advanced retrieval and ranking mechanisms that integrate such feedback. We categorize feedback signals based on their source and role in improving the query, retrieved context, or document pool. By consolidating these developments, we aim to bridge IR and NLP perspectives and highlight retrieval as a dynamic, learnable component of end-to-end RAG systems.
comment: 18 pages, 1 figure
☆ On the Effectiveness of Graph Reordering for Accelerating Approximate Nearest Neighbor Search on GPU
We present the first systematic investigation of graph reordering effects for graph-based Approximate Nearest Neighbor Search (ANNS) on a GPU. While graph-based ANNS has become the dominant paradigm for modern AI applications, recent approaches focus on algorithmic innovations while neglecting memory layout considerations that significantly affect execution time. Our unified evaluation framework enables comprehensive evaluation of diverse reordering strategies across different graph indices through a graph adapter that converts arbitrary graph topologies into a common representation and a GPU-optimized graph traversal engine. We conduct a comprehensive analysis across diverse datasets and state-of-the-art graph indices, introducing analysis metrics that quantify the relationship between structural properties and memory layout effectiveness. Our GPU-targeted reordering achieves up to 15$\%$ QPS improvements while preserving search accuracy, demonstrating that memory layout optimization operates orthogonally to existing algorithmic innovations. We will release all code upon publication to facilitate reproducibility and foster further research.
☆ TrackRec: Iterative Alternating Feedback with Chain-of-Thought via Preference Alignment for Recommendation
The extensive world knowledge and powerful reasoning capabilities of large language models (LLMs) have attracted significant attention in recommendation systems (RS). Specifically, The chain of thought (CoT) has been shown to improve the performance of LLMs on complex reasoning tasks for RS. However, due to the fact that LLMs often suffer from hallucination issues, there is no guarantee that their reasoning CoT is effective. A key challenge is to further enhance the recommendation capabilities of LLMs through effective CoT reasonings. Therefore, we propose \textbf{TrackRec}, a framework designed to enhance reasoning capabilities of LLMs for RS. TrackRec specifically focuses on accurately inferring recommendation CoT \textbf{(RecCoT)} for user preference using the knowledge from LLMs. This RecCoT can serve both as an explanation for the LLM's completion of recommendation tasks and as auxiliary features to assist recommendation models in accomplishing recommendation tasks. TrackRec consists of a RecCoT generator $(G)$ and a RecCoT validator $(V)$. Furthermore, we design alternating feedback learning mechanism that $G$ undergoes direct preference optimization via feedback from $V$ to produce increasingly accurate RecCoT aligned with $V$'s standards. Meanwhile, $V$ is fine-tuned using the inference feedback from $G$ to enhance its validation capabilities in alignment with recommendation tasks. Through iterative alternating feedback learning between $G$ and $V$, TrackRec continuously improves the user preference analysis capability of $G$ and the validation capacity of $V$. Extensive experiments demonstrate the effectiveness of our approach, showing that it surpasses state-of-the-art methods. Moreover, TrackRec has been deployed on a lagre advertising platform with hundreds of millions of users, achieving substantial gains.
☆ Exploring Scaling Laws of CTR Model for Online Performance Improvement
CTR models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of LLMs, we propose a new paradigm for improving CTR predictions: first, constructing a CTR model with accuracy scalable to the model grade and data size, and then distilling the knowledge implied in this model into its lightweight model that can serve online users. To put it into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the UAB as a behavior sequence encoder. A single UAB unifies the modeling of the sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. In order to benefit from the high performance of the high-grade SUAN and avoid the disadvantage of its long inference time, we modify the SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train the low-grade LightSUAN, taking a high-grade SUAN as a teacher. The distilled LightSUAN has superior performance but the same inference time as the LightSUAN, making it well-suited for online deployment. Experimental results show that SUAN performs exceptionally well and holds the scaling laws spanning three orders of magnitude in model grade and data size, and the distilled LightSUAN outperforms the SUAN configured with one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing the CTR by 2.81% and CPM by 1.69% while keeping the average inference time acceptable. Our source code is available at https://github.com/laiweijiang/SUAN.
☆ Modeling Long-term User Behaviors with Diffusion-driven Multi-interest Network for CTR Prediction
CTR (Click-Through Rate) prediction, crucial for recommender systems and online advertising, etc., has been confirmed to benefit from modeling long-term user behaviors. Nonetheless, the vast number of behaviors and complexity of noise interference pose challenges to prediction efficiency and effectiveness. Recent solutions have evolved from single-stage models to two-stage models. However, current two-stage models often filter out significant information, resulting in an inability to capture diverse user interests and build the complete latent space of user interests. Inspired by multi-interest and generative modeling, we propose DiffuMIN (Diffusion-driven Multi-Interest Network) to model long-term user behaviors and thoroughly explore the user interest space. Specifically, we propose a target-oriented multi-interest extraction method that begins by orthogonally decomposing the target to obtain interest channels. This is followed by modeling the relationships between interest channels and user behaviors to disentangle and extract multiple user interests. We then adopt a diffusion module guided by contextual interests and interest channels, which anchor users' personalized and target-oriented interest types, enabling the generation of augmented interests that align with the latent spaces of user interests, thereby further exploring restricted interest space. Finally, we leverage contrastive learning to ensure that the generated augmented interests align with users' genuine preferences. Extensive offline experiments are conducted on two public datasets and one industrial dataset, yielding results that demonstrate the superiority of DiffuMIN. Moreover, DiffuMIN increased CTR by 1.52% and CPM by 1.10% in online A/B testing. Our source code is available at https://github.com/laiweijiang/DiffuMIN.
☆ REG4Rec: Reasoning-Enhanced Generative Model for Large-Scale Recommendation Systems
Sequential recommendation aims to predict a user's next action in large-scale recommender systems. While traditional methods often suffer from insufficient information interaction, recent generative recommendation models partially address this issue by directly generating item predictions. To better capture user intents, recent studies have introduced a reasoning process into generative recommendation, significantly improving recommendation performance. However, these approaches are constrained by the singularity of item semantic representations, facing challenges such as limited diversity in reasoning pathways and insufficient reliability in the reasoning process. To tackle these issues, we introduce REG4Rec, a reasoning-enhanced generative model that constructs multiple dynamic semantic reasoning paths alongside a self-reflection process, ensuring high-confidence recommendations. Specifically, REG4Rec utilizes an MoE-based parallel quantization codebook (MPQ) to generate multiple unordered semantic tokens for each item, thereby constructing a larger-scale diverse reasoning space. Furthermore, to enhance the reliability of reasoning, we propose a training reasoning enhancement stage, which includes Preference Alignment for Reasoning (PARS) and a Multi-Step Reward Augmentation (MSRA) strategy. PARS uses reward functions tailored for recommendation to enhance reasoning and reflection, while MSRA introduces future multi-step actions to improve overall generalization. During inference, Consistency-Oriented Self-Reflection for Pruning (CORP) is proposed to discard inconsistent reasoning paths, preventing the propagation of erroneous reasoning. Lastly, we develop an efficient offline training strategy for large-scale recommendation. Experiments on real-world datasets and online evaluations show that REG4Rec delivers outstanding performance and substantial practical value.
☆ MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems
Multimodal recommendation typically combines the user behavioral data with the modal features of items to reveal user's preference, presenting superior performance compared to the conventional recommendations. However, existing methods still suffer from two key problems: (1) the initialization methods of user multimodal representations are either behavior-unperceived or noise-contaminated, and (2) the KNN-based item-item graph contains noisy edges with low similarities and lacks audience co-occurrence relationships. To address such issues, we propose MLLMRec, a novel MLLM-driven multimodal recommendation framework with two item-item graph refinement strategies. On the one hand, the item images are first converted into high-quality semantic descriptions using an MLLM, which are then fused with the textual metadata of items. Then, we construct a behavioral description list for each user and feed it into the MLLM to reason about the purified user preference containing interaction motivations. On the other hand, we design the threshold-controlled denoising and topology-aware enhancement strategies to refine the suboptimal item-item graph, thereby enhancing the item representation learning. Extensive experiments on three publicly available datasets demonstrate that MLLMRec achieves the state-of-the-art performance with an average improvement of 38.53% over the best baselines.
☆ Adversarial Attacks against Neural Ranking Models via In-Context Learning
While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.
☆ MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation
Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to capture comprehensive multimodal information. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
☆ TComQA: Extracting Temporal Commonsense from Text
Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours, but can not take months. Recent studies indicate that even advanced large language models (LLMs) struggle in generating text that require reasoning with temporal commonsense due to its infrequent explicit mention in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80\% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on existing dataset of temporal question answering task.
☆ Curriculum Approximate Unlearning for Session-based Recommendation
Approximate unlearning for session-based recommendation refers to eliminating the influence of specific training samples from the recommender without retraining of (sub-)models. Gradient ascent (GA) is a representative method to conduct approximate unlearning. However, there still exist dual challenges to apply GA for session-based recommendation. On the one hand, naive applying of GA could lead to degradation of recommendation performance. On the other hand, existing studies fail to consider the ordering of unlearning samples when simultaneously processing multiple unlearning requests, leading to sub-optimal recommendation performance and unlearning effect. To address the above challenges, we introduce CAU, a curriculum approximate unlearning framework tailored to session-based recommendation. CAU handles the unlearning task with a GA term on unlearning samples. Specifically, to address the first challenge, CAU formulates the overall optimization task as a multi-objective optimization problem, where the GA term for unlearning samples is combined with retaining terms for preserving performance. The multi-objective optimization problem is solved through seeking the Pareto-Optimal solution, which achieves effective unlearning with trivial sacrifice on recommendation performance. To tackle the second challenge, CAU adopts a curriculum-based sequence to conduct unlearning on batches of unlearning samples. The key motivation is to perform unlearning from easy samples to harder ones. To this end, CAU first introduces two metrics to measure the unlearning difficulty, including gradient unlearning difficulty and embedding unlearning difficulty. Then, two strategies, hard-sampling and soft-sampling, are proposed to select unlearning samples according to difficulty scores.
☆ M-$LLM^3$REC: A Motivation-Aware User-Item Interaction Framework for Enhancing Recommendation Accuracy with LLMs
Recommendation systems have been essential for both user experience and platform efficiency by alleviating information overload and supporting decision-making. Traditional methods, i.e., content-based filtering, collaborative filtering, and deep learning, have achieved impressive results in recommendation systems. However, the cold-start and sparse-data scenarios are still challenging to deal with. Existing solutions either generate pseudo-interaction sequence, which often introduces redundant or noisy signals, or rely heavily on semantic similarity, overlooking dynamic shifts in user motivation. To address these limitations, this paper proposes a novel recommendation framework, termed M-$LLM^3$REC, which leverages large language models for deep motivational signal extraction from limited user interactions. M-$LLM^3$REC comprises three integrated modules: the Motivation-Oriented Profile Extractor (MOPE), Motivation-Oriented Trait Encoder (MOTE), and Motivational Alignment Recommender (MAR). By emphasizing motivation-driven semantic modeling, M-$LLM^3$REC demonstrates robust, personalized, and generalizable recommendations, particularly boosting performance in cold-start situations in comparison with the state-of-the-art frameworks.
comment: 10pages
☆ Retrieval-Augmented Review Generation for Poisoning Recommender Systems
Recent studies have shown that recommender systems (RSs) are highly vulnerable to data poisoning attacks, where malicious actors inject fake user profiles, including a group of well-designed fake ratings, to manipulate recommendations. Due to security and privacy constraints in practice, attackers typically possess limited knowledge of the victim system and thus need to craft profiles that have transferability across black-box RSs. To maximize the attack impact, the profiles often remains imperceptible. However, generating such high-quality profiles with the restricted resources is challenging. Some works suggest incorporating fake textual reviews to strengthen the profiles; yet, the poor quality of the reviews largely undermines the attack effectiveness and imperceptibility under the practical setting. To tackle the above challenges, in this paper, we propose to enhance the quality of the review text by harnessing in-context learning (ICL) capabilities of multimodal foundation models. To this end, we introduce a demonstration retrieval algorithm and a text style transfer strategy to augment the navie ICL. Specifically, we propose a novel practical attack framework named RAGAN to generate high-quality fake user profiles, which can gain insights into the robustness of RSs. The profiles are generated by a jailbreaker and collaboratively optimized on an instructional agent and a guardian to improve the attack transferability and imperceptibility. Comprehensive experiments on various real-world datasets demonstrate that RAGAN achieves the state-of-the-art poisoning attack performance.
☆ See Beyond a Single View: Multi-Attribution Learning Leads to Better Conversion Rate Prediction CIKM 2025
Conversion rate (CVR) prediction is a core component of online advertising systems, where the attribution mechanisms-rules for allocating conversion credit across user touchpoints-fundamentally determine label generation and model optimization. While many industrial platforms support diverse attribution mechanisms (e.g., First-Click, Last-Click, Linear, and Data-Driven Multi-Touch Attribution), conventional approaches restrict model training to labels from a single production-critical attribution mechanism, discarding complementary signals in alternative attribution perspectives. To address this limitation, we propose a novel Multi-Attribution Learning (MAL) framework for CVR prediction that integrates signals from multiple attribution perspectives to better capture the underlying patterns driving user conversions. Specifically, MAL is a joint learning framework consisting of two core components: the Attribution Knowledge Aggregator (AKA) and the Primary Target Predictor (PTP). AKA is implemented as a multi-task learner that integrates knowledge extracted from diverse attribution labels. PTP, in contrast, focuses on the task of generating well-calibrated conversion probabilities that align with the system-optimized attribution metric (e.g., CVR under the Last-Click attribution), ensuring direct compatibility with industrial deployment requirements. Additionally, we propose CAT, a novel training strategy that leverages the Cartesian product of all attribution label combinations to generate enriched supervision signals. This design substantially enhances the performance of the attribution knowledge aggregator. Empirical evaluations demonstrate the superiority of MAL over single-attribution learning baselines, achieving +0.51% GAUC improvement on offline metrics. Online experiments demonstrate that MAL achieved a +2.6% increase in ROI (Return on Investment).
comment: Accepted at CIKM 2025
☆ Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets
We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
comment: to be published in the workshop proceedings of the "From Rules to Language Models: Comparative Performance Evaluation" workshop, held alongside RANLP 2025
☆ Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs
This paper presents the Annif system in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025. The task required creating subject predictions for bibliographic records using large language models, with a special focus on computational efficiency. Our system, based on the Annif automated subject indexing toolkit, refines our previous system from the first LLMs4Subjects shared task, which produced excellent results. We further improved the system by using many small and efficient language models for translation and synthetic data generation and by using LLMs for ranking candidate subjects. Our system ranked 1st in the overall quantitative evaluation of and 1st in the qualitative evaluation of Subtask 2.
comment: 5 pages, 4 figures, accepted at KONVENS 2025. arXiv admin note: substantial text overlap with arXiv:2504.19675
♻ ☆ OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data
One of the most common operations in multimodal scientific data management is searching for the $k$ most similar items (or, $k$-nearest neighbors, KNN) from the database after being provided a new item. Although recent advances of multimodal machine learning models offer a \textit{semantic} index, the so-called \textit{embedding vectors} mapped from the original multimodal data, the dimension of the resulting embedding vectors are usually on the order of hundreds or a thousand, which are impractically high for time-sensitive scientific applications. This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-$k$ nearest neighbors do not change in the lower-dimensional space, namely Order-Preserving Dimension Reduction (OPDR). In order to develop such an OPDR method, our central hypothesis is that by analyzing the intrinsic relationship among key parameters during the dimension-reduction map, a quantitative function may be constructed to reveal the correlation between the target (lower) dimensionality and other variables. To demonstrate the hypothesis, this paper first defines a formal measure function to quantify the KNN similarity for a specific vector, then extends the measure into an aggregate accuracy of the global metric spaces, and finally derives a closed-form function between the target (lower) dimensionality and other variables. We incorporate the closed-function into popular dimension-reduction methods, various distance metrics, and embedding models.
♻ ☆ Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs SemEval-2025
This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.
comment: 6 pages, 4 figures, published at SemEval-2025 workshop Task 5: LLMs4Subjects: https://aclanthology.org/2025.semeval-1.315/
♻ ☆ An Empirical Study of Position Bias in Modern Information Retrieval EMNLP 2025
This study investigates the position bias in information retrieval, where models tend to overemphasize content at the beginning of passages while neglecting semantically relevant information that appears later. To analyze the extent and impact of position bias, we introduce a new evaluation framework consisting of two position-aware retrieval benchmarks (SQuAD-PosQ, FineWeb-PosQ) and an intuitive diagnostic metric, the Position Sensitivity Index (PSI), for quantifying position bias from a worst-case perspective. We conduct a comprehensive evaluation across the full retrieval pipeline, including BM25, dense embedding models, ColBERT-style late-interaction models, and full-interaction reranker models. Our experiments show that when relevant information appears later in the passage, dense embedding models and ColBERT-style models suffer significant performance degradation (an average drop of 15.6%). In contrast, BM25 and reranker models demonstrate greater robustness to such positional variation. These findings provide practical insights into model sensitivity to the position of relevant information and offer guidance for building more position-robust retrieval systems. Code and data are publicly available at: https://github.com/NovaSearch-Team/position-bias-in-IR.
comment: EMNLP 2025 Findings
♻ ☆ FinAgentBench: A Benchmark Dataset for Agentic Retrieval in Financial Question Answering
Accurate information retrieval (IR) is critical in the financial domain, where investors must identify relevant information from large collections of documents. Traditional IR methods-whether sparse or dense-often fall short in retrieval accuracy, as it requires not only capturing semantic similarity but also performing fine-grained reasoning over document structure and domain-specific knowledge. Recent advances in large language models (LLMs) have opened up new opportunities for retrieval with multi-step reasoning, where the model ranks passages through iterative reasoning about which information is most relevant to a given query. However, there exists no benchmark to evaluate such capabilities in the financial domain. To address this gap, we introduce FinAgentBench, the first large-scale benchmark for evaluating retrieval with multi-step reasoning in finance -- a setting we term agentic retrieval. The benchmark consists of 3,429 expert-annotated examples on S&P-100 listed firms and assesses whether LLM agents can (1) identify the most relevant document type among candidates, and (2) pinpoint the key passage within the selected document. Our evaluation framework explicitly separates these two reasoning steps to address context limitations. This design enables to provide a quantitative basis for understanding retrieval-centric LLM behavior in finance. We evaluate a suite of state-of-the-art models and further demonstrated how targeted fine-tuning can significantly improve agentic retrieval performance. Our benchmark provides a foundation for studying retrieval-centric LLM behavior in complex, domain-specific tasks for finance. We will release the dataset publicly upon acceptance of the paper and plan to expand and share dataset for the full S&P 500 and beyond.
comment: 6 pages
♻ ☆ Generating Negative Samples for Multi-Modal Recommendation
Multi-modal recommender systems (MMRS) have gained significant attention due to their ability to leverage information from various modalities to enhance recommendation quality. However, existing negative sampling techniques often struggle to effectively utilize the multi-modal data, leading to suboptimal performance. In this paper, we identify two key challenges in negative sampling for MMRS: (1) producing cohesive negative samples contrasting with positive samples and (2) maintaining a balanced influence across different modalities. To address these challenges, we propose NegGen, a novel framework that utilizes multi-modal large language models (MLLMs) to generate balanced and contrastive negative samples. We design three different prompt templates to enable NegGen to analyze and manipulate item attributes across multiple modalities, and then generate negative samples that introduce better supervision signals and ensure modality balance. Furthermore, NegGen employs a causal learning module to disentangle the effect of intervened key features and irrelevant item attributes, enabling fine-grained learning of user preferences. Extensive experiments on real-world datasets demonstrate the superior performance of NegGen compared to state-of-the-art methods in both negative sampling and multi-modal recommendation.
comment: Accepted by ACM Multimedia
♻ ☆ Monolithic Hybrid Recommender System for Suggesting Relevant Movies
Recommendation systems have become the fundamental services to facilitate users information access. Generally, recommendation system works by filtering historical behaviors to understand and learn users preferences. With the growth of online information, recommendations have become of crucial importance in information filtering to prevent the information overload problem. In this study, we considered hybrid post-fusion of two approaches of collaborative filtering, by using sequences of watched movies and considering the related movies rating. After considering both techniques and applying the weights matrix, the recommendations would be modified to correspond to the users preference as needed. We discussed that various weights would be set based on use cases. For instance, in cases where we have the rating for most classes, we will assign a higher weight to the rating matrix and in case where the rating is unavailable for the majority of cases, the higher weights might be assigned to the sequential dataset. An extensive discussion is made in the context of this paper. Sequential type of the watched movies was used in conjunction of the rating as especially that model might be inadequate in distinguishing users long-term preference and that does not account for the rating of the watched movies and thus that model along might not suffice. Extensive discussion was made regarding the literature and methodological approach to solve the problem.
♻ ☆ A Text-Based Recommender System that Leverages Explicit Affective State Preferences EMNLP 2025
The affective attitude of liking a recommended item reflects just one category in a wide spectrum of affective phenomena that also includes emotions such as entranced or intrigued, moods such as cheerful or buoyant, as well as more fine-grained affective states, such as "pleasantly surprised by the conclusion". In this paper, we introduce a novel recommendation task that can leverage a virtually unbounded range of affective states sought explicitly by the user in order to identify items that, upon consumption, are likely to induce those affective states. Correspondingly, we create a large dataset of user preferences containing expressions of fine-grained affective states that are mined from book reviews, and propose a Transformer-based architecture that leverages such affective expressions as input. We then use the resulting dataset of affective states preferences, together with the linked users and their histories of book readings, ratings, and reviews, to train and evaluate multiple recommendation models on the task of matching recommended items with affective preferences. Experiments show that the best results are obtained by models that can utilize textual descriptions of items and user affective preferences.
comment: To appear at EMNLP 2025
Machine Learning 147
☆ Scaling Group Inference for Diverse and High-Quality Generation
Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
comment: Project website: https://www.cs.cmu.edu/~group-inference, GitHub: https://github.com/GaParmar/group-inference
☆ Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO
Recent efforts have extended the capabilities of transformers in logical reasoning and symbolic computations. In this work, we investigate their capacity for non-linear latent pattern discovery in the context of functional decomposition, focusing on the challenging algebraic task of multivariate polynomial decomposition. This problem, with widespread applications in science and engineering, is proved to be NP-hard, and demands both precision and insight. Our contributions are threefold: First, we develop a synthetic data generation pipeline providing fine-grained control over problem complexity. Second, we train transformer models via supervised learning and evaluate them across four key dimensions involving scaling behavior and generalizability. Third, we propose Beam Grouped Relative Policy Optimization (BGRPO), a rank-aware reinforcement learning method suitable for hard algebraic problems. Finetuning with BGRPO improves accuracy while reducing beam width by up to half, resulting in approximately 75% lower inference compute. Additionally, our model demonstrates competitive performance in polynomial simplification, outperforming Mathematica in various cases.
☆ Distributed Detection of Adversarial Attacks in Multi-Agent Reinforcement Learning with Continuous Action Space ECAI 2025
We address the problem of detecting adversarial attacks against cooperative multi-agent reinforcement learning with continuous action space. We propose a decentralized detector that relies solely on the local observations of the agents and makes use of a statistical characterization of the normal behavior of observable agents. The proposed detector utilizes deep neural networks to approximate the normal behavior of agents as parametric multivariate Gaussian distributions. Based on the predicted density functions, we define a normality score and provide a characterization of its mean and variance. This characterization allows us to employ a two-sided CUSUM procedure for detecting deviations of the normality score from its mean, serving as a detector of anomalous behavior in real-time. We evaluate our scheme on various multi-agent PettingZoo benchmarks against different state-of-the-art attack methods, and our results demonstrate the effectiveness of our method in detecting impactful adversarial attacks. Particularly, it outperforms the discrete counterpart by achieving AUC-ROC scores of over 0.95 against the most impactful attacks in all evaluated environments.
comment: Accepted for publication at ECAI 2025
☆ Intern-S1: A Scientific Multimodal Foundation Model
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training.On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, predicting thermodynamic stabilities for crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.
☆ Language-Guided Tuning: Enhancing Numeric Optimization with Textual Feedback
Configuration optimization remains a critical bottleneck in machine learning, requiring coordinated tuning across model architecture, training strategy, feature engineering, and hyperparameters. Traditional approaches treat these dimensions independently and lack interpretability, while recent automated methods struggle with dynamic adaptability and semantic reasoning about optimization decisions. We introduce Language-Guided Tuning (LGT), a novel framework that employs multi-agent Large Language Models to intelligently optimize configurations through natural language reasoning. We apply textual gradients - qualitative feedback signals that complement numerical optimization by providing semantic understanding of training dynamics and configuration interdependencies. LGT coordinates three specialized agents: an Advisor that proposes configuration changes, an Evaluator that assesses progress, and an Optimizer that refines the decision-making process, creating a self-improving feedback loop. Through comprehensive evaluation on six diverse datasets, LGT demonstrates substantial improvements over traditional optimization methods, achieving performance gains while maintaining high interpretability.
comment: 9 pages, 4 figures, 4 tables
☆ Neural Robot Dynamics
Accurate and efficient simulation of modern robots remains challenging due to their high degrees of freedom and intricate mechanisms. Neural simulators have emerged as a promising alternative to traditional analytical simulators, capable of efficiently predicting complex dynamics and adapting to real-world data; however, existing neural simulators typically require application-specific training and fail to generalize to novel tasks and/or environments, primarily due to inadequate representations of the global state. In this work, we address the problem of learning generalizable neural simulators for robots that are structured as articulated rigid bodies. We propose NeRD (Neural Robot Dynamics), learned robot-specific dynamics models for predicting future states for articulated rigid bodies under contact constraints. NeRD uniquely replaces the low-level dynamics and contact solvers in an analytical simulator and employs a robot-centric and spatially-invariant simulation state representation. We integrate the learned NeRD models as an interchangeable backend solver within a state-of-the-art robotics simulator. We conduct extensive experiments to show that the NeRD simulators are stable and accurate over a thousand simulation steps; generalize across tasks and environment configurations; enable policy learning exclusively in a neural engine; and, unlike most classical simulators, can be fine-tuned from real-world data to bridge the gap between simulation and reality.
☆ Probability Density from Latent Diffusion Models for Out-of-Distribution Detection ECAI 2025
Despite rapid advances in AI, safety remains the main bottleneck to deploying machine-learning systems. A critical safety component is out-of-distribution detection: given an input, decide whether it comes from the same distribution as the training data. In generative models, the most natural OOD score is the data likelihood. Actually, under the assumption of uniformly distributed OOD data, the likelihood is even the optimal OOD detector, as we show in this work. However, earlier work reported that likelihood often fails in practice, raising doubts about its usefulness. We explore whether, in practice, the representation space also suffers from the inability to learn good density estimation for OOD detection, or if it is merely a problem of the pixel space typically used in generative models. To test this, we trained a Variational Diffusion Model not on images, but on the representation space of a pre-trained ResNet-18 to assess the performance of our likelihood-based detector in comparison to state-of-the-art methods from the OpenOOD suite.
comment: ECAI 2025
☆ Exploring the Landscape of Non-Equilibrium Memories with Neural Cellular Automata
We investigate the landscape of many-body memories: families of local non-equilibrium dynamics that retain information about their initial conditions for thermodynamically long time scales, even in the presence of arbitrary perturbations. In two dimensions, the only well-studied memory is Toom's rule. Using a combination of rigorous proofs and machine learning methods, we show that the landscape of 2D memories is in fact quite vast. We discover memories that correct errors in ways qualitatively distinct from Toom's rule, have ordered phases stabilized by fluctuations, and preserve information only in the presence of noise. Taken together, our results show that physical systems can perform robust information storage in many distinct ways, and demonstrate that the physics of many-body memories is richer than previously realized. Interactive visualizations of the dynamics studied in this work are available at https://memorynca.github.io/2D.
comment: 4+9 pages
☆ Tutorial on the Probabilistic Unification of Estimation Theory, Machine Learning, and Generative AI
Extracting meaning from uncertain, noisy data is a fundamental problem across time series analysis, pattern recognition, and language modeling. This survey presents a unified mathematical framework that connects classical estimation theory, statistical inference, and modern machine learning, including deep learning and large language models. By analyzing how techniques such as maximum likelihood estimation, Bayesian inference, and attention mechanisms address uncertainty, the paper illustrates that many AI methods are rooted in shared probabilistic principles. Through illustrative scenarios including system identification, image classification, and language generation, we show how increasingly complex models build upon these foundations to tackle practical challenges like overfitting, data sparsity, and interpretability. In other words, the work demonstrates that maximum likelihood, MAP estimation, Bayesian classification, and deep learning all represent different facets of a shared goal: inferring hidden causes from noisy and/or biased observations. It serves as both a theoretical synthesis and a practical guide for students and researchers navigating the evolving landscape of machine learning.
☆ End-to-End Analysis of Charge Stability Diagrams with Transformers
Transformer models and end-to-end learning frameworks are rapidly revolutionizing the field of artificial intelligence. In this work, we apply object detection transformers to analyze charge stability diagrams in semiconductor quantum dot arrays, a key task for achieving scalability with spin-based quantum computing. Specifically, our model identifies triple points and their connectivity, which is crucial for virtual gate calibration, charge state initialization, drift correction, and pulse sequencing. We show that it surpasses convolutional neural networks in performance on three different spin qubit architectures, all without the need for retraining. In contrast to existing approaches, our method significantly reduces complexity and runtime, while enhancing generalizability. The results highlight the potential of transformer-based end-to-end learning frameworks as a foundation for a scalable, device- and architecture-agnostic tool for control and tuning of quantum dot devices.
comment: 8 pages, 2 figures, RM and LS contributed equally
☆ Communication Efficient LLM Pre-training with SparseLoCo
Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across data centers and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model's gradients-resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization and error feedback are often applied to reduce the pseudo-gradient's size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages Top-k sparsification and quantization to reach extreme compression ratios of up to 1-3% sparsity and 2-bit quantization while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by an error feedback combined with aggressive sparsity and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
comment: 15 pages, 9 tables, 2 figures
☆ Investigation of D-Wave quantum annealing for training Restricted Boltzmann Machines and mitigating catastrophic forgetting
Modest statistical differences between the sampling performances of the D-Wave quantum annealer (QA) and the classical Markov Chain Monte Carlo (MCMC), when applied to Restricted Boltzmann Machines (RBMs), are explored to explain, and possibly address, the absence of significant and consistent improvements in RBM trainability when the D-Wave sampling was used in previous investigations. A novel hybrid sampling approach, combining the classical and the QA contributions, is investigated as a promising way to benefit from the modest differences between the two sampling methods. No improvements in the RBM training are achieved in this work, thereby suggesting that the differences between the QA-based and MCMC sampling, mainly found in the medium-to-low probability regions of the distribution, which are less important for the quality of the sample, are insufficient to benefit the training. Difficulties in achieving sufficiently high quality of embedding RBMs into the lattice of the newer generation of D-Wave hardware could be further complicating the task. On the other hand, the ability to generate samples of sufficient variety from lower-probability parts of the distribution has a potential to benefit other machine learning applications, such as the mitigation of catastrophic forgetting (CF) during incremental learning. The feasibility of using QA-generated patterns of desirable classes for CF mitigation by the generative replay is demonstrated in this work for the first time. While the efficiency of the CF mitigation using the D-Wave QA was comparable to that of the classical mitigation, both the speed of generating a large number of distinct desirable patterns and the potential for further improvement make this approach promising for a variety of challenging machine learning applications.
comment: 26 pages, 5 figures
☆ Conditionally adaptive augmented Lagrangian method for physics-informed learning of forward and inverse problems using artificial neural networks
We present several advances to the physics and equality constrained artificial neural networks (PECANN) framework that substantially improve its capability to learn solutions of canonical partial differential equations (PDEs). First, we generalize the augmented Lagrangian method (ALM) to support multiple independent penalty parameters, enabling simultaneous enforcement of heterogeneous constraints. Second, we reformulate pointwise constraint enforcement and Lagrange multipliers as expectations over constraint terms, reducing memory overhead and permitting efficient mini-batch training. Third, to address PDEs with oscillatory, multi-scale features, we incorporate Fourier feature mappings and show that a single mapping suffices where multiple mappings or more costly architectures were required in related methods. Fourth, we introduce a time-windowing strategy for long-time evolution in which the terminal state of each window is enforced as an initial-condition constraint for the next, ensuring continuity without discrete time models. Crucially, we propose a conditionally adaptive penalty update (CAPU) strategy for ALM, which preserves the principle that larger constraint violations incur stronger penalties. CAPU accelerates the growth of Lagrange multipliers for selectively challenging constraints, enhancing constraint enforcement during training. We demonstrate the effectiveness of PECANN-CAPU on problems including the transonic rarefaction problem, reversible advection of a passive by a vortex, high-wavenumber Helmholtz and Poisson equations, and inverse identification of spatially varying heat sources. Comparisons with established methods and recent Kolmogorov-Arnold network approaches show that PECANN-CAPU achieves competitive accuracy across all cases. Collectively, these advances improve PECANN's robustness, efficiency, and applicability to demanding problems in scientific computing.
comment: 37 pages, 23 figures
☆ Effect Identification and Unit Categorization in the Multi-Score Regression Discontinuity Design with Application to LED Manufacturing
The RDD (regression discontinuity design) is a widely used framework for identification and estimation of causal effects at a cutoff of a single running variable. Practical settings, in particular those encountered in production systems, often involve decision-making defined by multiple thresholds and criteria. Common MRD (multi-score RDD) approaches transform these to a one-dimensional design, to employ identification and estimation results. However, this practice can introduce non-compliant behavior. We develop theoretical tools to identify and reduce some of this "fuzziness" when estimating the cutoff-effect on compliers of sub-rules. We provide a sound definition and categorization of unit behavior types for multi-dimensional cutoff-rules, extending existing categorizations. We identify conditions for the existence and identification of the cutoff-effect on complier in multiple dimensions, and specify when identification remains stable after excluding nevertaker and alwaystaker. Further, we investigate how decomposing cutoff-rules into simpler parts alters the unit behavior. This allows identification and removal of non-compliant units potentially improving estimates. We validate our framework on simulated and real-world data from opto-electronic semiconductor manufacturing. Our empirical results demonstrate the usability for refining production policies. Particularly we show that our approach decreases the estimation variance, highlighting the practical value of the MRD framework in manufacturing.
☆ GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field.
comment: 23 pages, 9 tables, 3 figures
☆ An Efficient Open World Environment for Multi-Agent Social Learning
Many challenges remain before AI agents can be deployed in real-world environments. However, one virtue of such environments is that they are inherently multi-agent and contain human experts. Using advanced social intelligence in such an environment can help an AI agent learn adaptive skills and behaviors that a known expert exhibits. While social intelligence could accelerate training, it is currently difficult to study due to the lack of open-ended multi-agent environments. In this work, we present an environment in which multiple self-interested agents can pursue complex and independent goals, reflective of real world challenges. This environment will enable research into the development of socially intelligent AI agents in open-ended multi-agent settings, where agents may be implicitly incentivized to cooperate to defeat common enemies, build and share tools, and achieve long horizon goals. In this work, we investigate the impact on agent performance due to social learning in the presence of experts and implicit cooperation such as emergent collaborative tool use, and whether agents can benefit from either cooperation or competition in this environment.
☆ Tree-like Pairwise Interaction Networks
Modeling feature interactions in tabular data remains a key challenge in predictive modeling, for example, as used for insurance pricing. This paper proposes the Tree-like Pairwise Interaction Network (PIN), a novel neural network architecture that explicitly captures pairwise feature interactions through a shared feed-forward neural network architecture that mimics the structure of decision trees. PIN enables intrinsic interpretability by design, allowing for direct inspection of interaction effects. Moreover, it allows for efficient SHapley's Additive exPlanation (SHAP) computations because it only involves pairwise interactions. We highlight connections between PIN and established models such as GA2Ms, gradient boosting machines, and graph neural networks. Empirical results on the popular French motor insurance dataset show that PIN outperforms both traditional and modern neural networks benchmarks in predictive accuracy, while also providing insight into how features interact with each another and how they contribute to the predictions.
☆ Tensorized Multi-Task Learning for Personalized Modeling of Heterogeneous Individuals with High-Dimensional Data
Effective modeling of heterogeneous subpopulations presents a significant challenge due to variations in individual characteristics and behaviors. This paper proposes a novel approach to address this issue through multi-task learning (MTL) and low-rank tensor decomposition techniques. Our MTL approach aims to enhance personalized modeling by leveraging shared structures among similar tasks while accounting for distinct subpopulation-specific variations. We introduce a framework where low-rank decomposition decomposes the collection of task model parameters into a low-rank structure that captures commonalities and variations across tasks and subpopulations. This approach allows for efficient learning of personalized models by sharing knowledge between similar tasks while preserving the unique characteristics of each subpopulation. Experimental results in simulation and case study datasets demonstrate the superior performance of the proposed method compared to several benchmarks, particularly in scenarios with high variability among subpopulations. The proposed framework not only improves prediction accuracy but also enhances interpretability by revealing underlying patterns that contribute to the personalization of models.
☆ Bayesian Optimization with Expected Improvement: No Regret and the Choice of Incumbent
Expected improvement (EI) is one of the most widely used acquisition functions in Bayesian optimization (BO). Despite its proven empirical success in applications, the cumulative regret upper bound of EI remains an open question. In this paper, we analyze the classic noisy Gaussian process expected improvement (GP-EI) algorithm. We consider the Bayesian setting, where the objective is a sample from a GP. Three commonly used incumbents, namely the best posterior mean incumbent (BPMI), the best sampled posterior mean incumbent (BSPMI), and the best observation incumbent (BOI) are considered as the choices of the current best value in GP-EI. We present for the first time the cumulative regret upper bounds of GP-EI with BPMI and BSPMI. Importantly, we show that in both cases, GP-EI is a no-regret algorithm for both squared exponential (SE) and Mat\'ern kernels. Further, we present for the first time that GP-EI with BOI either achieves a sublinear cumulative regret upper bound or has a fast converging noisy simple regret bound for SE and Mat\'ern kernels. Our results provide theoretical guidance to the choice of incumbent when practitioners apply GP-EI in the noisy setting. Numerical experiments are conducted to validate our findings.
☆ Exploiting Policy Idling for Dexterous Manipulation IROS 2025
Learning-based methods for dexterous manipulation have made notable progress in recent years. However, learned policies often still lack reliability and exhibit limited robustness to important factors of variation. One failure pattern that can be observed across many settings is that policies idle, i.e. they cease to move beyond a small region of states when they reach certain states. This policy idling is often a reflection of the training data. For instance, it can occur when the data contains small actions in areas where the robot needs to perform high-precision motions, e.g., when preparing to grasp an object or object insertion. Prior works have tried to mitigate this phenomenon e.g. by filtering the training data or modifying the control frequency. However, these approaches can negatively impact policy performance in other ways. As an alternative, we investigate how to leverage the detectability of idling behavior to inform exploration and policy improvement. Our approach, Pause-Induced Perturbations (PIP), applies perturbations at detected idling states, thus helping it to escape problematic basins of attraction. On a range of challenging simulated dual-arm tasks, we find that this simple approach can already noticeably improve test-time performance, with no additional supervision or training. Furthermore, since the robot tends to idle at critical points in a movement, we also find that learning from the resulting episodes leads to better iterative policy improvement compared to prior approaches. Our perturbation strategy also leads to a 15-35% improvement in absolute success rate on a real-world insertion task that requires complex multi-finger manipulation.
comment: A similar version to this paper was accepted at IROS 2025
☆ Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics
Accurate dose-response forecasting under sparse sampling is central to precision pharmacotherapy. We present the Amortized In-Context Mixed-Effect Transformer (AICMET) model, a transformer-based latent-variable framework that unifies mechanistic compartmental priors with amortized in-context Bayesian inference. AICMET is pre-trained on hundreds of thousands of synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors over the parameters of compartment models, endowing the model with strong inductive biases and enabling zero-shot adaptation to new compounds. At inference time, the decoder conditions on the collective context of previously profiled trial participants, generating calibrated posterior predictions for newly enrolled patients after a few early drug concentration measurements. This capability collapses traditional model-development cycles from weeks to hours while preserving some degree of expert modelling. Experiments across public datasets show that AICMET attains state-of-the-art predictive accuracy and faithfully quantifies inter-patient variability -- outperforming both nonlinear mixed-effects baselines and recent neural ODE variants. Our results highlight the feasibility of transformer-based, population-aware neural architectures as offering a new alternative for bespoke pharmacokinetic modeling pipelines, charting a path toward truly population-aware personalized dosing regimens.
☆ Understanding Action Effects through Instrumental Empowerment in Multi-Agent Reinforcement Learning ECAI
To reliably deploy Multi-Agent Reinforcement Learning (MARL) systems, it is crucial to understand individual agent behaviors within a team. While prior work typically evaluates overall team performance based on explicit reward signals or learned value functions, it is unclear how to infer agent contributions in the absence of any value feedback. In this work, we investigate whether meaningful insights into agent behaviors can be extracted that are consistent with the underlying value functions, solely by analyzing the policy distribution. Inspired by the phenomenon that intelligent agents tend to pursue convergent instrumental values, which generally increase the likelihood of task success, we introduce Intended Cooperation Values (ICVs), a method based on information-theoretic Shapley values for quantifying each agent's causal influence on their co-players' instrumental empowerment. Specifically, ICVs measure an agent's action effect on its teammates' policies by assessing their decision uncertainty and preference alignment. The analysis across cooperative and competitive MARL environments reveals the extent to which agents adopt similar or diverse strategies. By comparing action effects between policies and value functions, our method identifies which agent behaviors are beneficial to team success, either by fostering deterministic decisions or by preserving flexibility for future action choices. Our proposed method offers novel insights into cooperation dynamics and enhances explainability in MARL systems.
comment: European Conference on Artificial Intelligence (ECAI) 2025
☆ Correct-By-Construction: Certified Individual Fairness through Neural Network Training
Fairness in machine learning is more important than ever as ethical concerns continue to grow. Individual fairness demands that individuals differing only in sensitive attributes receive the same outcomes. However, commonly used machine learning algorithms often fail to achieve such fairness. To improve individual fairness, various training methods have been developed, such as incorporating fairness constraints as optimisation objectives. While these methods have demonstrated empirical effectiveness, they lack formal guarantees of fairness. Existing approaches that aim to provide fairness guarantees primarily rely on verification techniques, which can sometimes fail to produce definitive results. Moreover, verification alone does not actively enhance individual fairness during training. To address this limitation, we propose a novel framework that formally guarantees individual fairness throughout training. Our approach consists of two parts, i.e., (1) provably fair initialisation that ensures the model starts in a fair state, and (2) a fairness-preserving training algorithm that maintains fairness as the model learns. A key element of our method is the use of randomised response mechanisms, which protect sensitive attributes while maintaining fairness guarantees. We formally prove that this mechanism sustains individual fairness throughout the training process. Experimental evaluations confirm that our approach is effective, i.e., producing models that are empirically fair and accurate. Furthermore, our approach is much more efficient than the alternative approach based on certified training (which requires neural network verification during training).
☆ Classification errors distort findings in automated speech processing: examples and solutions from child-development research
With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper proposes a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children's language experience and the association between children's production and their input. In both the most commonly used \gls{lena}, and an open-source alternative (the Voice Type Classifier from the ACLEW system), we find that classification errors can significantly distort estimates. For instance, automated annotations underestimated the negative effect of siblings on adult input by 20--80\%, potentially placing it below statistical significance thresholds. We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution. Both the issue reported and our solution may apply to any classifier involving event detection and classification with non-zero error rates.
☆ Label Uncertainty for Ultrasound Segmentation
In medical imaging, inter-observer variability among radiologists often introduces label uncertainty, particularly in modalities where visual interpretation is subjective. Lung ultrasound (LUS) is a prime example-it frequently presents a mixture of highly ambiguous regions and clearly discernible structures, making consistent annotation challenging even for experienced clinicians. In this work, we introduce a novel approach to both labeling and training AI models using expert-supplied, per-pixel confidence values. Rather than treating annotations as absolute ground truth, we design a data annotation protocol that captures the confidence that radiologists have in each labeled region, modeling the inherent aleatoric uncertainty present in real-world clinical data. We demonstrate that incorporating these confidence values during training leads to improved segmentation performance. More importantly, we show that this enhanced segmentation quality translates into better performance on downstream clinically-critical tasks-specifically, estimating S/F oxygenation ratio values, classifying S/F ratio change, and predicting 30-day patient readmission. While we empirically evaluate many methods for exposing the uncertainty to the learning model, we find that a simple approach that trains a model on binarized labels obtained with a (60%) confidence threshold works well. Importantly, high thresholds work far better than a naive approach of a 50% threshold, indicating that training on very confident pixels is far more effective. Our study systematically investigates the impact of training with varying confidence thresholds, comparing not only segmentation metrics but also downstream clinical outcomes. These results suggest that label confidence is a valuable signal that, when properly leveraged, can significantly enhance the reliability and clinical utility of AI in medical imaging.
comment: Paper under review
☆ GRASPED: Graph Anomaly Detection using Autoencoder with Spectral Encoder and Decoder (Full Version) ECAI 2025
Graph machine learning has been widely explored in various domains, such as community detection, transaction analysis, and recommendation systems. In these applications, anomaly detection plays an important role. Recently, studies have shown that anomalies on graphs induce spectral shifts. Some supervised methods have improved the utilization of such spectral domain information. However, they remain limited by the scarcity of labeled data due to the nature of anomalies. On the other hand, existing unsupervised learning approaches predominantly rely on spatial information or only employ low-pass filters, thereby losing the capacity for multi-band analysis. In this paper, we propose Graph Autoencoder with Spectral Encoder and Spectral Decoder (GRASPED) for node anomaly detection. Our unsupervised learning model features an encoder based on Graph Wavelet Convolution, along with structural and attribute decoders. The Graph Wavelet Convolution-based encoder, combined with a Wiener Graph Deconvolution-based decoder, exhibits bandpass filter characteristics that capture global and local graph information at multiple scales. This design allows for a learning-based reconstruction of node attributes, effectively capturing anomaly information. Extensive experiments on several real-world graph anomaly detection datasets demonstrate that GRASPED outperforms current state-of-the-art models.
comment: Full version of the paper accepted for publication at the European Conference on Artificial Intelligence (ECAI 2025)
☆ Continual Neural Topic Model
In continual learning, our aim is to learn a new task without forgetting what was learned previously. In topic models, this translates to learning new topic models without forgetting previously learned topics. Previous work either considered Dynamic Topic Models (DTMs), which learn the evolution of topics based on the entire training corpus at once, or Online Topic Models, which are updated continuously based on new data but do not have long-term memory. To fill this gap, we propose the Continual Neural Topic Model (CoNTM), which continuously learns topic models at subsequent time steps without forgetting what was previously learned. This is achieved using a global prior distribution that is continuously updated. In our experiments, CoNTM consistently outperformed the dynamic topic model in terms of topic quality and predictive perplexity while being able to capture topic changes online. The analysis reveals that CoNTM can learn more diverse topics and better capture temporal changes than existing methods.
☆ Transduction is All You Need for Structured Data Workflows
This paper introduces Agentics, a modular framework for building agent-based systems capable of structured reasoning and compositional generalization over complex data. Designed with research and practical applications in mind, Agentics offers a novel perspective on working with data and AI workflows. In this framework, agents are abstracted from the logical flow and they are used internally to the data type to enable logical transduction among data. Agentics encourages AI developers to focus on modeling data rather than crafting prompts, enabling a declarative language in which data types are provided by LLMs and composed through logical transduction, which is executed by LLMs when types are connected. We provide empirical evidence demonstrating the applicability of this framework across domain-specific multiple-choice question answering, semantic parsing for text-to-SQL, and automated prompt optimization tasks, achieving state-of-the-art accuracy or improved scalability without sacrificing performance. The open-source implementation is available at \texttt{https://github.com/IBM/agentics}.
comment: 32 pages, 8 figures
☆ Inductive Domain Transfer In Misspecified Simulation-Based Inference
Simulation-based inference (SBI) is a statistical inference approach for estimating latent parameters of a physical system when the likelihood is intractable but simulations are available. In practice, SBI is often hindered by model misspecification--the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, a recent SBI approach, addresses this challenge through a two-stage domain transfer process that combines semi-supervised calibration with optimal transport (OT)-based distribution alignment. However, RoPE operates in a fully transductive setting, requiring access to a batch of test samples at inference time, which limits scalability and generalization. We propose here a fully inductive and amortized SBI framework that integrates calibration and distributional alignment into a single, end-to-end trainable model. Our method leverages mini-batch OT with a closed-form coupling to align real and simulated observations that correspond to the same latent parameters, using both paired calibration data and unpaired samples. A conditional normalizing flow is then trained to approximate the OT-induced posterior, enabling efficient inference without simulation access at test time. Across a range of synthetic and real-world benchmarks--including complex medical biomarker estimation--our approach matches or surpasses the performance of RoPE, as well as other standard SBI and non-SBI estimators, while offering improved scalability and applicability in challenging, misspecified environments.
☆ LoUQAL: Low-fidelity informed Uncertainty Quantification for Active Learning in the chemical configuration space
Uncertainty quantification is an important scheme in active learning techniques, including applications in predicting quantum chemical properties. In quantum chemical calculations, there exists the notion of a fidelity, a less accurate computation is accessible at a cheaper computational cost. This work proposes a novel low-fidelity informed uncertainty quantification for active learning with applications in predicting diverse quantum chemical properties such as excitation energies and \textit{ab initio} potential energy surfaces. Computational experiments are carried out in order to assess the proposed method with results demonstrating that models trained with the novel method outperform alternatives in terms of empirical error and number of iterations required. The effect of the choice of fidelity is also studied to perform a thorough benchmark.
☆ Conformalized Exceptional Model Mining: Telling Where Your Model Performs (Not) Well ECML-PKDD
Understanding the nuanced performance of machine learning models is essential for responsible deployment, especially in high-stakes domains like healthcare and finance. This paper introduces a novel framework, Conformalized Exceptional Model Mining, which combines the rigor of Conformal Prediction with the explanatory power of Exceptional Model Mining (EMM). The proposed framework identifies cohesive subgroups within data where model performance deviates exceptionally, highlighting regions of both high confidence and high uncertainty. We develop a new model class, mSMoPE (multiplex Soft Model Performance Evaluation), which quantifies uncertainty through conformal prediction's rigorous coverage guarantees. By defining a new quality measure, Relative Average Uncertainty Loss (RAUL), our framework isolates subgroups with exceptional performance patterns in multi-class classification and regression tasks. Experimental results across diverse datasets demonstrate the framework's effectiveness in uncovering interpretable subgroups that provide critical insights into model behavior. This work lays the groundwork for enhancing model interpretability and reliability, advancing the state-of-the-art in explainable AI and uncertainty quantification.
comment: Accepted by ECML-PKDD
☆ Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
☆ HEAS: Hierarchical Evolutionary Agent Simulation Framework for Cross-Scale Modeling and Multi-Objective Search
Hierarchical Evolutionary Agent Simulation (HEAS) is a Python framework that unifies layered agent-based modeling with evolutionary optimization and tournament evaluation in a single, reproducible workflow. HEAS represents models as hierarchies of lightweight processes ("streams") scheduled in deterministic layers that read and write a shared context, making cross-scale couplings explicit and auditable. A compact API and CLI-simulate, optimize, evaluate-expose single- and multi-objective evolution, PyTorch policy integration via parameter flattening/unflattening, and general tournament tooling with user-defined scoring and voting rules. The framework standardizes evaluation through uniform per-step and episode metrics, persists seeds, logbooks, and hall-of-fame archives, and provides plotting helpers for traces, Pareto fronts, and comparative outcomes, reducing glue code and improving comparability across studies. HEAS emphasizes separation of mechanism from orchestration, allowing exogenous drivers, endogenous agents, and aggregators to be composed and swapped without refactoring, while the same model can be used for forward simulation, optimization, or systematic comparison. We illustrate usage with two compact examples-an ecological system and an enterprise decision-making setting. HEAS offers a practical foundation for cross-disciplinary, multi-level inquiry, yielding reliable, reproducible results.
comment: 9 pages, 1 figure
☆ AI-Powered Machine Learning Approaches for Fault Diagnosis in Industrial Pumps
This study presents a practical approach for early fault detection in industrial pump systems using real-world sensor data from a large-scale vertical centrifugal pump operating in a demanding marine environment. Five key operational parameters were monitored: vibration, temperature, flow rate, pressure, and electrical current. A dual-threshold labeling method was applied, combining fixed engineering limits with adaptive thresholds calculated as the 95th percentile of historical sensor values. To address the rarity of documented failures, synthetic fault signals were injected into the data using domain-specific rules, simulating critical alerts within plausible operating ranges. Three machine learning classifiers - Random Forest, Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) - were trained to distinguish between normal operation, early warnings, and critical alerts. Results showed that Random Forest and XGBoost models achieved high accuracy across all classes, including minority cases representing rare or emerging faults, while the SVM model exhibited lower sensitivity to anomalies. Visual analyses, including grouped confusion matrices and time-series plots, indicated that the proposed hybrid method provides robust detection capabilities. The framework is scalable, interpretable, and suitable for real-time industrial deployment, supporting proactive maintenance decisions before failures occur. Furthermore, it can be adapted to other machinery with similar sensor architectures, highlighting its potential as a scalable solution for predictive maintenance in complex systems.
☆ BadFU: Backdoor Federated Learning through Adversarial Machine Unlearning
Federated learning (FL) has been widely adopted as a decentralized training paradigm that enables multiple clients to collaboratively learn a shared model without exposing their local data. As concerns over data privacy and regulatory compliance grow, machine unlearning, which aims to remove the influence of specific data from trained models, has become increasingly important in the federated setting to meet legal, ethical, or user-driven demands. However, integrating unlearning into FL introduces new challenges and raises largely unexplored security risks. In particular, adversaries may exploit the unlearning process to compromise the integrity of the global model. In this paper, we present the first backdoor attack in the context of federated unlearning, demonstrating that an adversary can inject backdoors into the global model through seemingly legitimate unlearning requests. Specifically, we propose BadFU, an attack strategy where a malicious client uses both backdoor and camouflage samples to train the global model normally during the federated training process. Once the client requests unlearning of the camouflage samples, the global model transitions into a backdoored state. Extensive experiments under various FL frameworks and unlearning strategies validate the effectiveness of BadFU, revealing a critical vulnerability in current federated unlearning practices and underscoring the urgent need for more secure and robust federated unlearning mechanisms.
☆ Stabilization of Perturbed Loss Function: Differential Privacy without Gradient Noise
We propose SPOF (Stabilization of Perturbed Loss Function), a differentially private training mechanism intended for multi-user local differential privacy (LDP). SPOF perturbs a stabilized Taylor expanded polynomial approximation of a model's training loss function, where each user's data is privatized by calibrated noise added to the coefficients of the polynomial. Unlike gradient-based mechanisms such as differentially private stochastic gradient descent (DP-SGD), SPOF does not require injecting noise into the gradients of the loss function, which improves both computational efficiency and stability. This formulation naturally supports simultaneous privacy guarantees across all users. Moreover, SPOF exhibits robustness to environmental noise during training, maintaining stable performance even when user inputs are corrupted. We compare SPOF with a multi-user extension of DP-SGD, evaluating both methods in a wireless body area network (WBAN) scenario involving heterogeneous user data and stochastic channel noise from body sensors. Our results show that SPOF achieves, on average, up to 3.5% higher reconstruction accuracy and reduces mean training time by up to 57.2% compared to DP-SGD, demonstrating superior privacy-utility trade-offs in multi-user environments.
comment: under review
☆ Jointly Computation- and Communication-Efficient Distributed Learning
We address distributed learning problems over undirected networks. Specifically, we focus on designing a novel ADMM-based algorithm that is jointly computation- and communication-efficient. Our design guarantees computational efficiency by allowing agents to use stochastic gradients during local training. Moreover, communication efficiency is achieved as follows: i) the agents perform multiple training epochs between communication rounds, and ii) compressed transmissions are used. We prove exact linear convergence of the algorithm in the strongly convex setting. We corroborate our theoretical results by numerical comparisons with state of the art techniques on a classification task.
comment: To be presented at 2025 IEEE Conference on Decision and Control
☆ Think in Blocks: Adaptive Reasoning from Direct Response to Deep Reasoning
Large Language Models (LLMs) with chains-of-thought have demonstrated strong performance on an increasing range of tasks, particularly those involving complex logical reasoning. However, excessively long chains can lead to overthinking, causing computational waste and slower responses. This raises a question: can LLMs dynamically adjust the length of their reasoning processes based on task complexity? To address this, we propose the Think in Blocks framework, which enables adaptive reasoning-from zero to deep reasoning-by partitioning the reasoning process into a tunable number of blocks. Our main contributions are: (1) Establishing an explicit block-structured paradigm in which the model first predicts an integer reasoning budget-the number of blocks-and then partitions its reasoning accordingly; (2) Training an adaptive model through a three-stage pipeline-Supervised Fine-Tuning, reward-guided Direct Preference Optimization, and Reinforcement Learning-that adjusts its reasoning depth to problem difficulty; (3) Exploiting the explicit block count to dynamically control reasoning depth at inference time, allowing flexible adjustment of chain-of-thought length during deployment.
☆ Let's Grow an Unbiased Community: Guiding the Fairness of Graphs via New Links
Graph Neural Networks (GNNs) have achieved remarkable success across diverse applications. However, due to the biases in the graph structures, graph neural networks face significant challenges in fairness. Although the original user graph structure is generally biased, it is promising to guide these existing structures toward unbiased ones by introducing new links. The fairness guidance via new links could foster unbiased communities, thereby enhancing fairness in downstream applications. To address this issue, we propose a novel framework named FairGuide. Specifically, to ensure fairness in downstream tasks trained on fairness-guided graphs, we introduce a differentiable community detection task as a pseudo downstream task. Our theoretical analysis further demonstrates that optimizing fairness within this pseudo task effectively enhances structural fairness, promoting fairness generalization across diverse downstream applications. Moreover, FairGuide employs an effective strategy which leverages meta-gradients derived from the fairness-guidance objective to identify new links that significantly enhance structural fairness. Extensive experimental results demonstrate the effectiveness and generalizability of our proposed method across a variety of graph-based fairness tasks.
☆ High-dimensional Asymptotics of Generalization Performance in Continual Ridge Regression
Continual learning is motivated by the need to adapt to real-world dynamics in tasks and data distribution while mitigating catastrophic forgetting. Despite significant advances in continual learning techniques, the theoretical understanding of their generalization performance lags behind. This paper examines the theoretical properties of continual ridge regression in high-dimensional linear models, where the dimension is proportional to the sample size in each task. Using random matrix theory, we derive exact expressions of the asymptotic prediction risk, thereby enabling the characterization of three evaluation metrics of generalization performance in continual learning: average risk, backward transfer, and forward transfer. Furthermore, we present the theoretical risk curves to illustrate the trends in these evaluation metrics throughout the continual learning process. Our analysis reveals several intriguing phenomena in the risk curves, demonstrating how model specifications influence the generalization performance. Simulation studies are conducted to validate our theoretical findings.
☆ Learning Protein-Ligand Binding in Hyperbolic Space
Protein-ligand binding prediction is central to virtual screening and affinity ranking, two fundamental tasks in drug discovery. While recent retrieval-based methods embed ligands and protein pockets into Euclidean space for similarity-based search, the geometry of Euclidean embeddings often fails to capture the hierarchical structure and fine-grained affinity variations intrinsic to molecular interactions. In this work, we propose HypSeek, a hyperbolic representation learning framework that embeds ligands, protein pockets, and sequences into Lorentz-model hyperbolic space. By leveraging the exponential geometry and negative curvature of hyperbolic space, HypSeek enables expressive, affinity-sensitive embeddings that can effectively model both global activity and subtle functional differences-particularly in challenging cases such as activity cliffs, where structurally similar ligands exhibit large affinity gaps. Our mode unifies virtual screening and affinity ranking in a single framework, introducing a protein-guided three-tower architecture to enhance representational structure. HypSeek improves early enrichment in virtual screening on DUD-E from 42.63 to 51.44 (+20.7%) and affinity ranking correlation on JACS from 0.5774 to 0.7239 (+25.4%), demonstrating the benefits of hyperbolic geometry across both tasks and highlighting its potential as a powerful inductive bias for protein-ligand modeling.
☆ Influence-driven Curriculum Learning for Pre-training on Limited Data
Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their \textit{training data influence}, a score which estimates the effect of individual training examples on the model's output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.
comment: 9 pages
☆ JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs
Graph Neural Networks (GNNs), particularly Interaction Networks (INs), have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider (HL-LHC). However, their computational complexity and irregular memory access patterns pose significant challenges for deployment on FPGAs in hardware trigger systems, where strict latency and resource constraints apply. In this work, we propose JEDI-linear, a novel GNN architecture with linear computational complexity that eliminates explicit pairwise interactions by leveraging shared transformations and global aggregation. To further enhance hardware efficiency, we introduce fine-grained quantization-aware training with per-parameter bitwidth optimization and employ multiplier-free multiply-accumulate operations via distributed arithmetic. Evaluation results show that our FPGA-based JEDI-linear achieves 3.7 to 11.5 times lower latency, up to 150 times lower initiation interval, and up to 6.2 times lower LUT usage compared to state-of-the-art designs while also delivering higher model accuracy and eliminating the need for DSP blocks entirely. In contrast, state-of-the-art solutions consume over 8,700 DSPs. This is the first interaction-based GNN to achieve less than 60~ns latency and currently meets the requirements for use in the HL-LHC CMS Level-1 trigger system. This work advances the next-generation trigger systems by enabling accurate, scalable, and resource-efficient GNN inference in real-time environments. Our open-sourced templates will further support reproducibility and broader adoption across scientific applications.
comment: 10 pages, 9 figures
☆ Mini-Batch Robustness Verification of Deep Neural Networks
Neural network image classifiers are ubiquitous in many safety-critical applications. However, they are susceptible to adversarial attacks. To understand their robustness to attacks, many local robustness verifiers have been proposed to analyze $\epsilon$-balls of inputs. Yet, existing verifiers introduce a long analysis time or lose too much precision, making them less effective for a large set of inputs. In this work, we propose a new approach to local robustness: group local robustness verification. The key idea is to leverage the similarity of the network computations of certain $\epsilon$-balls to reduce the overall analysis time. We propose BaVerLy, a sound and complete verifier that boosts the local robustness verification of a set of $\epsilon$-balls by dynamically constructing and verifying mini-batches. BaVerLy adaptively identifies successful mini-batch sizes, accordingly constructs mini-batches of $\epsilon$-balls that have similar network computations, and verifies them jointly. If a mini-batch is verified, all $\epsilon$-balls are proven robust. Otherwise, one $\epsilon$-ball is suspected as not being robust, guiding the refinement. In the latter case, BaVerLy leverages the analysis results to expedite the analysis of that $\epsilon$-ball as well as the other $\epsilon$-balls in the batch. We evaluate BaVerLy on fully connected and convolutional networks for MNIST and CIFAR-10. Results show that BaVerLy scales the common one by one verification by 2.3x on average and up to 4.1x, in which case it reduces the total analysis time from 24 hours to 6 hours.
comment: 30 pages, 12 figures, conference OOPSLA 2025
☆ A Solvable Molecular Switch Model for Stable Temporal Information Processing
This paper studies an input-driven one-state differential equation model initially developed for an experimentally demonstrated dynamic molecular switch that switches like synapses in the brain do. The linear-in-the-state and nonlinear-in-the-input model is exactly solvable, and it is shown that it also possesses mathematical properties of convergence and fading memory that enable stable processing of time-varying inputs by nonlinear dynamical systems. Thus, the model exhibits the co-existence of biologically-inspired behavior and desirable mathematical properties for stable learning on sequential data. The results give theoretical support for the use of the dynamic molecular switches as computational units in deep cascaded/layered feedforward and recurrent architectures as well as other more general structures for neuromorphic computing. They could also inspire more general exactly solvable models that can be fitted to emulate arbitrary physical devices which can mimic brain-inspired behaviour and perform stable computation on input signals.
comment: 21 pages, 6 figures, submitted for publication. Comments are welcome
☆ Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection AAAI 2026
While Large Language Models (LLMs) have demonstrated impressive performance in various domains and tasks, concerns about their safety are becoming increasingly severe. In particular, since models may store unsafe knowledge internally, machine unlearning has emerged as a representative paradigm to ensure model safety. Existing approaches employ various training techniques, such as gradient ascent and negative preference optimization, in attempts to eliminate the influence of undesired data on target models. However, these methods merely suppress the activation of undesired data through parametric training without completely eradicating its informational traces within the model. This fundamental limitation makes it difficult to achieve effective continuous unlearning, rendering these methods vulnerable to relearning attacks. To overcome these challenges, we propose a Metamorphosis Representation Projection (MRP) approach that pioneers the application of irreversible projection properties to machine unlearning. By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks, achieving state-of-the-art performance in unlearning effectiveness while preserving natural performance. Our code is available in https://github.com/ChengcanWu/MRP.
comment: 10 pages, 9 figures, Under review as a full paper at AAAI 2026. A preliminary version is under review at the NeurIPS 2025 Workshop on Reliable ML from Unreliable Data
☆ Measures of Overlapping Multivariate Gaussian Clusters in Unsupervised Online Learning
In this paper, we propose a new measure for detecting overlap in multivariate Gaussian clusters. The aim of online learning from data streams is to create clustering, classification, or regression models that can adapt over time based on the conceptual drift of streaming data. In the case of clustering, this can result in a large number of clusters that may overlap and should be merged. Commonly used distribution dissimilarity measures are not adequate for determining overlapping clusters in the context of online learning from streaming data due to their inability to account for all shapes of clusters and their high computational demands. Our proposed dissimilarity measure is specifically designed to detect overlap rather than dissimilarity and can be computed faster compared to existing measures. Our method is several times faster than compared methods and is capable of detecting overlapping clusters while avoiding the merging of orthogonal clusters.
comment: 5 pages, in Slovenian language. 2 figures. Accepted for the 33rd International Electrotechnical and Computer Science Conference ERK 2024 (Portoroz, Slovenia, 26-27 Sep 2024). Conference PDF: https://erk.fe.uni-lj.si/2024/papers/ozbot(mere_prekrivanja).pdf
☆ Test-time Corpus Feedback: From Retrieval to RAG
Retrieval-Augmented Generation (RAG) has emerged as a standard framework for knowledge-intensive NLP tasks, combining large language models (LLMs) with document retrieval from external corpora. Despite its widespread use, most RAG pipelines continue to treat retrieval and reasoning as isolated components, retrieving documents once and then generating answers without further interaction. This static design often limits performance on complex tasks that require iterative evidence gathering or high-precision retrieval. Recent work in both the information retrieval (IR) and NLP communities has begun to close this gap by introducing adaptive retrieval and ranking methods that incorporate feedback. In this survey, we present a structured overview of advanced retrieval and ranking mechanisms that integrate such feedback. We categorize feedback signals based on their source and role in improving the query, retrieved context, or document pool. By consolidating these developments, we aim to bridge IR and NLP perspectives and highlight retrieval as a dynamic, learnable component of end-to-end RAG systems.
comment: 18 pages, 1 figure
☆ GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO
The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.
☆ LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.
☆ Bridging Generalization and Personalization in Wearable Human Activity Recognition via On-Device Few-Shot Learning
Human Activity Recognition (HAR) using wearable devices has advanced significantly in recent years, yet its generalization remains limited when models are deployed to new users. This degradation in performance is primarily due to user-induced concept drift (UICD), highlighting the importance of efficient personalization. In this paper, we present a hybrid framework that first generalizes across users and then rapidly adapts to individual users using few-shot learning directly on-device. By updating only the classifier layer with user-specific data, our method achieves robust personalization with minimal computational and memory overhead. We implement this framework on the energy-efficient RISC-V-based GAP9 microcontroller and validate it across three diverse HAR scenarios: RecGym, QVAR-Gesture, and Ultrasound-Gesture. Post-deployment adaptation yields consistent accuracy improvements of 3.73\%, 17.38\%, and 3.70\% respectively. These results confirm that fast, lightweight, and effective personalization is feasible on embedded platforms, paving the way for scalable and user-aware HAR systems in the wild \footnote{https://github.com/kangpx/onlineTiny2023}.
☆ Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems
Generative AI (GenAI) has emerged as a transformative technology, demonstrating remarkable capabilities across diverse application domains. However, GenAI faces several major challenges in developing reliable and efficient GenAI-empowered systems due to its unpredictability and inefficiency. This paper advocates for a paradigm shift: future GenAI-native systems should integrate GenAI's cognitive capabilities with traditional software engineering principles to create robust, adaptive, and efficient systems. We introduce foundational GenAI-native design principles centered around five key pillars -- reliability, excellence, evolvability, self-reliance, and assurance -- and propose architectural patterns such as GenAI-native cells, organic substrates, and programmable routers to guide the creation of resilient and self-evolving systems. Additionally, we outline the key ingredients of a GenAI-native software stack and discuss the impact of these systems from technical, user adoption, economic, and legal perspectives, underscoring the need for further validation and experimentation. Our work aims to inspire future research and encourage relevant communities to implement and refine this conceptual framework.
☆ Hybrid Least Squares/Gradient Descent Methods for DeepONets
We propose an efficient hybrid least squares/gradient descent method to accelerate DeepONet training. Since the output of DeepONet can be viewed as linear with respect to the last layer parameters of the branch network, these parameters can be optimized using a least squares (LS) solve, and the remaining hidden layer parameters are updated by means of gradient descent form. However, building the LS system for all possible combinations of branch and trunk inputs yields a prohibitively large linear problem that is infeasible to solve directly. To address this issue, our method decomposes the large LS system into two smaller, more manageable subproblems $\unicode{x2014}$ one for the branch network and one for the trunk network $\unicode{x2014}$ and solves them separately. This method is generalized to a broader type of $L^2$ loss with a regularization term for the last layer parameters, including the case of unsupervised learning with physics-informed loss.
☆ Federated Learning based on Self-Evolving Gaussian Clustering
In this study, we present an Evolving Fuzzy System within the context of Federated Learning, which adapts dynamically with the addition of new clusters and therefore does not require the number of clusters to be selected apriori. Unlike traditional methods, Federated Learning allows models to be trained locally on clients' devices, sharing only the model parameters with a central server instead of the data. Our method, implemented using PyTorch, was tested on clustering and classification tasks. The results show that our approach outperforms established classification methods on several well-known UCI datasets. While computationally intensive due to overlap condition calculations, the proposed method demonstrates significant advantages in decentralized data processing.
comment: 5 pages, in slovenian language, 3 figures. Published in the Proceedings of the 33rd International Electrotechnical and Computer Science Conference (ERK 2024), Portoroz, Slovenia, pp. 240-243. Indexed in COBISS (COBISS.SI-ID 212879107). Official version available at https://erk.fe.uni-lj.si/2024/papers/ozbot_federativno_ucenje.pdf
☆ CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials
Text-attributed graphs(TAGs) are pervasive in real-world systems,where each node carries its own textual features. In many cases these graphs are inherently heterogeneous, containing multiple node types and diverse edge types. Despite the ubiquity of such heterogeneous TAGs, there remains a lack of large-scale benchmark datasets. This shortage has become a critical bottleneck, hindering the development and fair comparison of representation learning methods on heterogeneous text-attributed graphs. In this paper, we introduce CITE - Catalytic Information Textual Entities Graph, the first and largest heterogeneous text-attributed citation graph benchmark for catalytic materials. CITE comprises over 438K nodes and 1.2M edges, spanning four relation types. In addition, we establish standardized evaluation procedures and conduct extensive benchmarking on the node classification task, as well as ablation experiments on the heterogeneous and textual properties of CITE. We compare four classes of learning paradigms, including homogeneous graph models, heterogeneous graph models, LLM(Large Language Model)-centric models, and LLM+Graph models. In a nutshell, we provide (i) an overview of the CITE dataset, (ii) standardized evaluation protocols, and (iii) baseline and ablation experiments across diverse modeling paradigms.
comment: 23 pages, 4 figures,
☆ Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training
Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model's vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe "bigger vocabularies help" as "lowering the complexity of tokenized text helps," providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
comment: Preprint
☆ EvoFormer: Learning Dynamic Graph-Level Representations with Structural and Temporal Bias Correction
Dynamic graph-level embedding aims to capture structural evolution in networks, which is essential for modeling real-world scenarios. However, existing methods face two critical yet under-explored issues: Structural Visit Bias, where random walk sampling disproportionately emphasizes high-degree nodes, leading to redundant and noisy structural representations; and Abrupt Evolution Blindness, the failure to effectively detect sudden structural changes due to rigid or overly simplistic temporal modeling strategies, resulting in inconsistent temporal embeddings. To overcome these challenges, we propose EvoFormer, an evolution-aware Transformer framework tailored for dynamic graph-level representation learning. To mitigate Structural Visit Bias, EvoFormer introduces a Structure-Aware Transformer Module that incorporates positional encoding based on node structural roles, allowing the model to globally differentiate and accurately represent node structures. To overcome Abrupt Evolution Blindness, EvoFormer employs an Evolution-Sensitive Temporal Module, which explicitly models temporal evolution through a sequential three-step strategy: (I) Random Walk Timestamp Classification, generating initial timestamp-aware graph-level embeddings; (II) Graph-Level Temporal Segmentation, partitioning the graph stream into segments reflecting structurally coherent periods; and (III) Segment-Aware Temporal Self-Attention combined with an Edge Evolution Prediction task, enabling the model to precisely capture segment boundaries and perceive structural evolution trends, effectively adapting to rapid temporal shifts. Extensive evaluations on five benchmark datasets confirm that EvoFormer achieves state-of-the-art performance in graph similarity ranking, temporal anomaly detection, and temporal segmentation tasks, validating its effectiveness in correcting structural and temporal biases.
☆ Fairness for the People, by the People: Minority Collective Action
Machine learning models often preserve biases present in training data, leading to unfair treatment of certain minority groups. Despite an array of existing firm-side bias mitigation techniques, they typically incur utility costs and require organizational buy-in. Recognizing that many models rely on user-contributed data, end-users can induce fairness through the framework of Algorithmic Collective Action, where a coordinated minority group strategically relabels its own data to enhance fairness, without altering the firm's training process. We propose three practical, model-agnostic methods to approximate ideal relabeling and validate them on real-world datasets. Our findings show that a subgroup of the minority can substantially reduce unfairness with a small impact on the overall prediction error.
☆ Enhancing Forecasting with a 2D Time Series Approach for Cohort-Based Data
This paper introduces a novel two-dimensional (2D) time series forecasting model that integrates cohort behavior over time, addressing challenges in small data environments. We demonstrate its efficacy using multiple real-world datasets, showcasing superior performance in accuracy and adaptability compared to reference models. The approach offers valuable insights for strategic decision-making across industries facing financial and marketing forecasting challenges.
comment: Accepted at IEEE CiFer Companion 2025. 5 pages, 3 figures, 2 tables
☆ ExBigBang: A Dynamic Approach for Explainable Persona Classification through Contextualized Hybrid Transformer Analysis
In user-centric design, persona development plays a vital role in understanding user behaviour, capturing needs, segmenting audiences, and guiding design decisions. However, the growing complexity of user interactions calls for a more contextualized approach to ensure designs align with real user needs. While earlier studies have advanced persona classification by modelling user behaviour, capturing contextual information, especially by integrating textual and tabular data, remains a key challenge. These models also often lack explainability, leaving their predictions difficult to interpret or justify. To address these limitations, we present ExBigBang (Explainable BigBang), a hybrid text-tabular approach that uses transformer-based architectures to model rich contextual features for persona classification. ExBigBang incorporates metadata, domain knowledge, and user profiling to embed deeper context into predictions. Through a cyclical process of user profiling and classification, our approach dynamically updates to reflect evolving user behaviours. Experiments on a benchmark persona classification dataset demonstrate the robustness of our model. An ablation study confirms the benefits of combining text and tabular data, while Explainable AI techniques shed light on the rationale behind the model's predictions.
☆ Bayesian Inference and Learning in Nonlinear Dynamical Systems: A Framework for Incorporating Explicit and Implicit Prior Knowledge
Accuracy and generalization capabilities are key objectives when learning dynamical system models. To obtain such models from limited data, current works exploit prior knowledge and assumptions about the system. However, the fusion of diverse prior knowledge, e. g. partially known system equations and smoothness assumptions about unknown model parts, with information contained in the data remains a challenging problem, especially in input-output settings with latent system state. In particular, learning functions that are nested inside known system equations can be a laborious and error-prone expert task. This paper considers inference of latent states and learning of unknown model parts for fusion of data information with different sources of prior knowledge. The main contribution is a general-purpose system identification tool that, for the first time, provides a consistent solution for both, online and offline Bayesian inference and learning while allowing to incorporate explicit and implicit prior system knowledge. We propose a novel interface for combining known dynamics functions with a learning-based approximation of unknown system parts. Based on the proposed model structure, closed-form densities for efficient parameter marginalization are derived. No user-tailored coordinate transformations or model inversions are needed, making the presented framework a general-purpose tool for inference and learning. The broad applicability of the devised framework is illustrated in three distinct case studies, including an experimental data set.
comment: 16 pages, Preprint submitted to Automatica
☆ An Enhanced Audio Feature Tailored for Anomalous Sound Detection Based on Pre-trained Models ICANN2025
Anomalous Sound Detection (ASD) aims at identifying anomalous sounds from machines and has gained extensive research interests from both academia and industry. However, the uncertainty of anomaly location and much redundant information such as noise in machine sounds hinder the improvement of ASD system performance. This paper proposes a novel audio feature of filter banks with evenly distributed intervals, ensuring equal attention to all frequency ranges in the audio, which enhances the detection of anomalies in machine sounds. Moreover, based on pre-trained models, this paper presents a parameter-free feature enhancement approach to remove redundant information in machine audio. It is believed that this parameter-free strategy facilitates the effective transfer of universal knowledge from pre-trained tasks to the ASD task during model fine-tuning. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge dataset demonstrate significant improvements in ASD performance with our proposed methods.
comment: 13 pages, 3 figures, accepted by ICANN2025
☆ Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning
Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
comment: 7 pages, 6 figures, under review
☆ Flow Matching at Scale: A Machine Learning Framework for Efficient Large-Size Sampling of Many-Body Systems
We propose a machine learning framework based on Flow Matching to overcome the scaling limitations of Markov Chain Monte Carlo (MCMC) methods. We demonstrate its capability in the 2D XY model, where a single network, trained only on configurations from a small ($32\times 32$) lattice at sparse temperature points, generates reliable samples for a significantly larger system ($128\times 128$) across a continuous temperature range without retraining. The generated configurations show strong agreement with key thermodynamic observables and correctly capture the signatures of the Berezinskii-Kosterlitz-Thouless (BKT) transition. This dual generalization is enabled by the Flow Matching framework, which allows us to learn a continuous, temperature-conditioned mapping. At the same time, the inductive biases of the underlying CNN architecture ensure that the learned local physical rules are scale-invariant. This "train-small, generate-large" capability establishes a new paradigm for efficiently studying critical phenomena, offering a significant computational advantage for exploring the thermodynamic limit. The method can be directly applied to other classical or quantum many-body systems described by continuous fields on a lattice.
☆ Saving for the future: Enhancing generalization via partial logic regularization
Generalization remains a significant challenge in visual classification tasks, particularly in handling unknown classes in real-world applications. Existing research focuses on the class discovery paradigm, which tends to favor known classes, and the incremental learning paradigm, which suffers from catastrophic forgetting. Recent approaches such as the L-Reg technique employ logic-based regularization to enhance generalization but are bound by the necessity of fully defined logical formulas, limiting flexibility for unknown classes. This paper introduces PL-Reg, a novel partial-logic regularization term that allows models to reserve space for undefined logic formulas, improving adaptability to unknown classes. Specifically, we formally demonstrate that tasks involving unknown classes can be effectively explained using partial logic. We also prove that methods based on partial logic lead to improved generalization. We validate PL-Reg through extensive experiments on Generalized Category Discovery, Multi-Domain Generalized Category Discovery, and long-tailed Class Incremental Learning tasks, demonstrating consistent performance improvements. Our results highlight the effectiveness of partial logic in tackling challenges related to unknown classes.
☆ CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing SP 2025
Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme's length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.
comment: Accepted in: 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025)
☆ Evaluating Knowledge Graph Complexity via Semantic, Spectral, and Structural Metrics for Link Prediction
Understanding dataset complexity is fundamental to evaluating and comparing link prediction models on knowledge graphs (KGs). While the Cumulative Spectral Gradient (CSG) metric, derived from probabilistic divergence between classes within a spectral clustering framework, has been proposed as a classifier agnostic complexity metric purportedly scaling with class cardinality and correlating with downstream performance, it has not been evaluated in KG settings so far. In this work, we critically examine CSG in the context of multi relational link prediction, incorporating semantic representations via transformer derived embeddings. Contrary to prior claims, we find that CSG is highly sensitive to parametrisation and does not robustly scale with the number of classes. Moreover, it exhibits weak or inconsistent correlation with standard performance metrics such as Mean Reciprocal Rank (MRR) and Hit@1. To deepen the analysis, we introduce and benchmark a set of structural and semantic KG complexity metrics. Our findings reveal that global and local relational ambiguity captured via Relation Entropy, node level Maximum Relation Diversity, and Relation Type Cardinality exhibit strong inverse correlations with MRR and Hit@1, suggesting these as more faithful indicators of task difficulty. Conversely, graph connectivity measures such as Average Degree, Degree Entropy, PageRank, and Eigenvector Centrality correlate positively with Hit@10. Our results demonstrate that CSGs purported stability and generalization predictive power fail to hold in link prediction settings and underscore the need for more stable, interpretable, and task-aligned measures of dataset complexity in knowledge driven learning.
☆ MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation
Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to capture comprehensive multimodal information. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
☆ Deep Think with Confidence
Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
☆ Pretrained Diffusion Models Are Inherently Skipped-Step Samplers
Diffusion models have been achieving state-of-the-art results across various generation tasks. However, a notable drawback is their sequential generation process, requiring long-sequence step-by-step generation. Existing methods, such as DDIM, attempt to reduce sampling steps by constructing a class of non-Markovian diffusion processes that maintain the same training objective. However, there remains a gap in understanding whether the original diffusion process can achieve the same efficiency without resorting to non-Markovian processes. In this paper, we provide a confirmative answer and introduce skipped-step sampling, a mechanism that bypasses multiple intermediate denoising steps in the iterative generation process, in contrast with the traditional step-by-step refinement of standard diffusion inference. Crucially, we demonstrate that this skipped-step sampling mechanism is derived from the same training objective as the standard diffusion model, indicating that accelerated sampling via skipped-step sampling via a Markovian way is an intrinsic property of pretrained diffusion models. Additionally, we propose an enhanced generation method by integrating our accelerated sampling technique with DDIM. Extensive experiments on popular pretrained diffusion models, including the OpenAI ADM, Stable Diffusion, and Open Sora models, show that our method achieves high-quality generation with significantly reduced sampling steps.
☆ Robust and Efficient Quantum Reservoir Computing with Discrete Time Crystal
The rapid development of machine learning and quantum computing has placed quantum machine learning at the forefront of research. However, existing quantum machine learning algorithms based on quantum variational algorithms face challenges in trainability and noise robustness. In order to address these challenges, we introduce a gradient-free, noise-robust quantum reservoir computing algorithm that harnesses discrete time crystal dynamics as a reservoir. We first calibrate the memory, nonlinear, and information scrambling capacities of the quantum reservoir, revealing their correlation with dynamical phases and non-equilibrium phase transitions. We then apply the algorithm to the binary classification task and establish a comparative quantum kernel advantage. For ten-class classification, both noisy simulations and experimental results on superconducting quantum processors match ideal simulations, demonstrating the enhanced accuracy with increasing system size and confirming the topological noise robustness. Our work presents the first experimental demonstration of quantum reservoir computing for image classification based on digital quantum simulation. It establishes the correlation between quantum many-body non-equilibrium phase transitions and quantum machine learning performance, providing new design principles for quantum reservoir computing and broader quantum machine learning algorithms in the NISQ era.
comment: 12 pages, 7 figures
☆ VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models
Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs' memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
☆ Learning ECG Representations via Poly-Window Contrastive Learning
Electrocardiogram (ECG) analysis is foundational for cardiovascular disease diagnosis, yet the performance of deep learning models is often constrained by limited access to annotated data. Self-supervised contrastive learning has emerged as a powerful approach for learning robust ECG representations from unlabeled signals. However, most existing methods generate only pairwise augmented views and fail to leverage the rich temporal structure of ECG recordings. In this work, we present a poly-window contrastive learning framework. We extract multiple temporal windows from each ECG instance to construct positive pairs and maximize their agreement via statistics. Inspired by the principle of slow feature analysis, our approach explicitly encourages the model to learn temporally invariant and physiologically meaningful features that persist across time. We validate our approach through extensive experiments and ablation studies on the PTB-XL dataset. Our results demonstrate that poly-window contrastive learning consistently outperforms conventional two-view methods in multi-label superclass classification, achieving higher AUROC (0.891 vs. 0.888) and F1 scores (0.680 vs. 0.679) while requiring up to four times fewer pre-training epochs (32 vs. 128) and 14.8% in total wall clock pre-training time reduction. Despite processing multiple windows per sample, we achieve a significant reduction in the number of training epochs and total computation time, making our method practical for training foundational models. Through extensive ablations, we identify optimal design choices and demonstrate robustness across various hyperparameters. These findings establish poly-window contrastive learning as a highly efficient and scalable paradigm for automated ECG analysis and provide a promising general framework for self-supervised representation learning in biomedical time-series data.
comment: This work has been accepted for publication in IEEE-EMBS International Conference on Biomedical and Health Informatics 2025. The final published version will be available via IEEE Xplore
☆ Locally Pareto-Optimal Interpretations for Black-Box Machine Learning Models
Creating meaningful interpretations for black-box machine learning models involves balancing two often conflicting objectives: accuracy and explainability. Exploring the trade-off between these objectives is essential for developing trustworthy interpretations. While many techniques for multi-objective interpretation synthesis have been developed, they typically lack formal guarantees on the Pareto-optimality of the results. Methods that do provide such guarantees, on the other hand, often face severe scalability limitations when exploring the Pareto-optimal space. To address this, we develop a framework based on local optimality guarantees that enables more scalable synthesis of interpretations. Specifically, we consider the problem of synthesizing a set of Pareto-optimal interpretations with local optimality guarantees, within the immediate neighborhood of each solution. Our approach begins with a multi-objective learning or search technique, such as Multi-Objective Monte Carlo Tree Search, to generate a best-effort set of Pareto-optimal candidates with respect to accuracy and explainability. We then verify local optimality for each candidate as a Boolean satisfiability problem, which we solve using a SAT solver. We demonstrate the efficacy of our approach on a set of benchmarks, comparing it against previous methods for exploring the Pareto-optimal front of interpretations. In particular, we show that our approach yields interpretations that closely match those synthesized by methods offering global guarantees.
comment: This work has been accepted at ATVA'25
☆ See Beyond a Single View: Multi-Attribution Learning Leads to Better Conversion Rate Prediction CIKM 2025
Conversion rate (CVR) prediction is a core component of online advertising systems, where the attribution mechanisms-rules for allocating conversion credit across user touchpoints-fundamentally determine label generation and model optimization. While many industrial platforms support diverse attribution mechanisms (e.g., First-Click, Last-Click, Linear, and Data-Driven Multi-Touch Attribution), conventional approaches restrict model training to labels from a single production-critical attribution mechanism, discarding complementary signals in alternative attribution perspectives. To address this limitation, we propose a novel Multi-Attribution Learning (MAL) framework for CVR prediction that integrates signals from multiple attribution perspectives to better capture the underlying patterns driving user conversions. Specifically, MAL is a joint learning framework consisting of two core components: the Attribution Knowledge Aggregator (AKA) and the Primary Target Predictor (PTP). AKA is implemented as a multi-task learner that integrates knowledge extracted from diverse attribution labels. PTP, in contrast, focuses on the task of generating well-calibrated conversion probabilities that align with the system-optimized attribution metric (e.g., CVR under the Last-Click attribution), ensuring direct compatibility with industrial deployment requirements. Additionally, we propose CAT, a novel training strategy that leverages the Cartesian product of all attribution label combinations to generate enriched supervision signals. This design substantially enhances the performance of the attribution knowledge aggregator. Empirical evaluations demonstrate the superiority of MAL over single-attribution learning baselines, achieving +0.51% GAUC improvement on offline metrics. Online experiments demonstrate that MAL achieved a +2.6% increase in ROI (Return on Investment).
comment: Accepted at CIKM 2025
☆ SleepDIFFormer: Sleep Stage Classification via Multivariate Differential Transformer
Classification of sleep stages is essential for assessing sleep quality and diagnosing sleep disorders such as insomnia. However, manual inspection of EEG characteristics for each stage is time-consuming and prone to human error. Although machine learning and deep learning methods have been actively developed, they continue to face challenges from the non-stationarity and variability of electroencephalography (EEG) and electrooculography (EOG) signals, often leading to poor generalization on unseen datasets. This research proposed a Sleep Stage Classification method by developing Multivariate Differential Transformer (SleepDIFFormer) for joint EEG and EOG representation learning. Specifically, SleepDIFFormer was developed to process EEG and EOG signals using our Multivariate Differential Transformer Architecture (MDTA) for time series, trained with cross-domain alignment. Our method mitigated spatial and temporal attention noise while learning a domain-invariant joint EEG-EOG representation through feature distribution alignment, thereby enabling generalization to unseen target datasets. Empirically, we evaluated our method on five different sleep staging datasets and compared it with existing approaches, achieving state-of-the-art performance. We also conducted thorough ablation analyses of SleepDIFFormer and interpreted the differential attention weights, highlighting their relevance to characteristic sleep EEG patterns. These findings have implications for advancing automated sleep stage classification and its application to sleep quality assessment. Our source code is publicly available at https://github.com/Ben1001409/SleepDIFFormer
comment: 8 Pages
☆ SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less degradation than 5% compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
☆ Frequency-adaptive tensor neural networks for high-dimensional multi-scale problems
Tensor neural networks (TNNs) have demonstrated their superiority in solving high-dimensional problems. However, similar to conventional neural networks, TNNs are also influenced by the Frequency Principle, which limits their ability to accurately capture high-frequency features of the solution. In this work, we analyze the training dynamics of TNNs by Fourier analysis and enhance their expressivity for high-dimensional multi-scale problems by incorporating random Fourier features. Leveraging the inherent tensor structure of TNNs, we further propose a novel approach to extract frequency features of high-dimensional functions by performing the Discrete Fourier Transform to one-dimensional component functions. This strategy effectively mitigates the curse of dimensionality. Building on this idea, we propose a frequency-adaptive TNNs algorithm, which significantly improves the ability of TNNs in solving complex multi-scale problems. Extensive numerical experiments are performed to validate the effectiveness and robustness of the proposed frequency-adaptive TNNs algorithm.
☆ GEN2: A Generative Prediction-Correction Framework for Long-time Emulations of Spatially-Resolved Climate Extremes
Accurately quantifying the increased risks of climate extremes requires generating large ensembles of climate realization across a wide range of emissions scenarios, which is computationally challenging for conventional Earth System Models. We propose GEN2, a generative prediction-correction framework for an efficient and accurate forecast of the extreme event statistics. The prediction step is constructed as a conditional Gaussian emulator, followed by a non-Gaussian machine-learning (ML) correction step. The ML model is trained on pairs of the reference data and the emulated fields nudged towards the reference, to ensure the training is robust to chaos. We first validate the accuracy of our model on historical ERA5 data and then demonstrate the extrapolation capabilities on various future climate change scenarios. When trained on a single realization of one warming scenario, our model accurately predicts the statistics of extreme events in different scenarios, successfully extrapolating beyond the distribution of training data.
☆ Revisiting Pre-processing Group Fairness: A Modular Benchmarking Framework CIKM 2025
As machine learning systems become increasingly integrated into high-stakes decision-making processes, ensuring fairness in algorithmic outcomes has become a critical concern. Methods to mitigate bias typically fall into three categories: pre-processing, in-processing, and post-processing. While significant attention has been devoted to the latter two, pre-processing methods, which operate at the data level and offer advantages such as model-agnosticism and improved privacy compliance, have received comparatively less focus and lack standardised evaluation tools. In this work, we introduce FairPrep, an extensible and modular benchmarking framework designed to evaluate fairness-aware pre-processing techniques on tabular datasets. Built on the AIF360 platform, FairPrep allows seamless integration of datasets, fairness interventions, and predictive models. It features a batch-processing interface that enables efficient experimentation and automatic reporting of fairness and utility metrics. By offering standardised pipelines and supporting reproducible evaluations, FairPrep fills a critical gap in the fairness benchmarking landscape and provides a practical foundation for advancing data-level fairness research.
comment: This paper has been accepted to the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025), Resource Track
☆ Integrated Sensing, Communication, and Computation for Over-the-Air Federated Edge Learning
This paper studies an over-the-air federated edge learning (Air-FEEL) system with integrated sensing, communication, and computation (ISCC), in which one edge server coordinates multiple edge devices to wirelessly sense the objects and use the sensing data to collaboratively train a machine learning model for recognition tasks. In this system, over-the-air computation (AirComp) is employed to enable one-shot model aggregation from edge devices. Under this setup, we analyze the convergence behavior of the ISCC-enabled Air-FEEL in terms of the loss function degradation, by particularly taking into account the wireless sensing noise during the training data acquisition and the AirComp distortions during the over-the-air model aggregation. The result theoretically shows that sensing, communication, and computation compete for network resources to jointly decide the convergence rate. Based on the analysis, we design the ISCC parameters under the target of maximizing the loss function degradation while ensuring the latency and energy budgets in each round. The challenge lies on the tightly coupled processes of sensing, communication, and computation among different devices. To tackle the challenge, we derive a low-complexity ISCC algorithm by alternately optimizing the batch size control and the network resource allocation. It is found that for each device, less sensing power should be consumed if a larger batch of data samples is obtained and vice versa. Besides, with a given batch size, the optimal computation speed of one device is the minimum one that satisfies the latency constraint. Numerical results based on a human motion recognition task verify the theoretical convergence analysis and show that the proposed ISCC algorithm well coordinates the batch size control and resource allocation among sensing, communication, and computation to enhance the learning performance.
comment: The paper has been accepted for publication in IEEE Transactions on Wireless Communications
☆ SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearn the harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains the general performance after the harmful knowledge unlearned. These results highlight unlearning as a promising direction for scalable and effective LLM safety.
☆ A Robust BERT-Based Deep Learning Model for Automated Cancer Type Extraction from Unstructured Pathology Reports
The accurate extraction of clinical information from electronic medical records is particularly critical to clinical research but require much trained expertise and manual labor. In this study we developed a robust system for automated extraction of the specific cancer types for the purpose of supporting precision oncology research. from pathology reports using a fine-tuned RoBERTa model. This model significantly outperformed the baseline model and a Large Language Model, Mistral 7B, achieving F1_Bertscore 0.98 and overall exact match of 80.61%. This fine-tuning approach demonstrates the potential for scalability that can integrate seamlessly into the molecular tumour board process. Fine-tuning domain-specific models for precision tasks in oncology, may pave the way for more efficient and accurate clinical information extraction.
☆ Towards Reliable and Generalizable Differentially Private Machine Learning (Extended Version) ACSA
There is a flurry of recent research papers proposing novel differentially private machine learning (DPML) techniques. These papers claim to achieve new state-of-the-art (SoTA) results and offer empirical results as validation. However, there is no consensus on which techniques are most effective or if they genuinely meet their stated claims. Complicating matters, heterogeneity in codebases, datasets, methodologies, and model architectures make direct comparisons of different approaches challenging. In this paper, we conduct a reproducibility and replicability (R+R) experiment on 11 different SoTA DPML techniques from the recent research literature. Results of our investigation are varied: while some methods stand up to scrutiny, others falter when tested outside their initial experimental conditions. We also discuss challenges unique to the reproducibility of DPML, including additional randomness due to DP noise, and how to address them. Finally, we derive insights and best practices to obtain scientifically valid and reliable results.
comment: This paper is published at ACSAC 2024. This is the extended version that includes an overview of the relevant literature. We open-source our codebase at: https://github.com/wenxuan-Bao/Reliable-and-Generalizable-DPML
♻ ☆ Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To explore this, we conduct a large-scale empirical study of hidden representations in 11 autoregressive models across 6 scientific topics. We find that high-level semantic information consistently resides in low-dimensional subspaces that form linearly separable representations across domains. This separability becomes more pronounced in deeper layers and under prompts that elicit structured reasoning or alignment behavior$\unicode{x2013}$even when surface content remains unchanged. These findings support geometry-aware tools that operate directly in latent space to detect and mitigate harmful or adversarial content. As a proof of concept, we train an MLP probe on final-layer hidden states to act as a lightweight latent-space guardrail. This approach substantially improves refusal rates on malicious queries and prompt injections that bypass both the model's built-in safety alignment and external token-level filters.
♻ ☆ Let's Measure Information Step-by-Step: LLM-Based Evaluation Beyond Vibes
We study evaluation of AI systems without ground truth by exploiting a link between strategic gaming and information loss. We analyze which information-theoretic mechanisms resist adversarial manipulation, extending finite-sample bounds to show that bounded f-divergences (e.g., total variation distance) maintain polynomial guarantees under attacks while unbounded measures (e.g., KL divergence) degrade exponentially. To implement these mechanisms, we model the overseer as an agent and characterize incentive-compatible scoring rules as f-mutual information objectives. Under adversarial attacks, TVD-MI maintains effectiveness (area under curve 0.70-0.77) while traditional judge queries are near change (AUC $\approx$ 0.50), demonstrating that querying the same LLM for information relationships rather than quality judgments provides both theoretical and practical robustness. The mechanisms decompose pairwise evaluations into reliable item-level quality scores without ground truth, addressing a key limitation of traditional peer prediction. We release preregistration and code.
comment: Add AUC results, pre-reg conformance, theory section clarification. 12 pages
♻ ☆ Contextual Bandits with Stage-wise Constraints
We study contextual bandits in the presence of a stage-wise constraint when the constraint must be satisfied both with high probability and in expectation. We start with the linear case where both the reward function and the stage-wise constraint (cost function) are linear. In each of the high probability and in expectation settings, we propose an upper-confidence bound algorithm for the problem and prove a $T$-round regret bound for it. We also prove a lower-bound for this constrained problem, show how our algorithms and analyses can be extended to multiple constraints, and provide simulations to validate our theoretical results. In the high probability setting, we describe the minimum requirements for the action set for our algorithm to be tractable. In the setting that the constraint is in expectation, we specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting with regret analysis. Finally, we extend our results to the case where the reward and cost functions are both non-linear. We propose an algorithm for this case and prove a regret bound for it that characterize the function class complexity by the eluder dimension.
♻ ☆ A Novel Vascular Risk Scoring Framework for Quantifying Sex-Specific Cerebral Perfusion from 3D pCASL MRI
The influence of sex and age on cerebral perfusion is recognized, but the specific impacts on regional cerebral blood flow (CBF) and vascular risk remain to be fully characterized. In this study, 3D pseudo-continuous arterial spin labeling (pCASL) MRI was used to identify sex and age related CBF patterns, and a vascular risk score (VRS) was developed based on normative perfusion profiles. Perfusion data from 186 cognitively healthy participants (89 males, 97 females; aged 8 to 92 years), obtained from a publicly available dataset, were analyzed. An extension of the 3D Simple Linear Iterative Clustering (SLIC) supervoxel algorithm was applied to CBF maps to group neighboring voxels with similar intensities into anatomically meaningful regions. Regional CBF features were extracted and used to train a convolutional neural network (CNN) for sex classification and perfusion pattern analysis. Global, age related CBF changes were also assessed. Participant specific VRS was computed by comparing individual CBF profiles to age and sex specific normative data to quantify perfusion deficits. A 95 percent accuracy in sex classification was achieved using the proposed supervoxel based method, and distinct perfusion signatures were identified. Higher CBF was observed in females in medial Brodmann areas 6 and 10, area V5, occipital polar cortex, and insular regions. A global decline in CBF with age was observed in both sexes. Individual perfusion deficits were quantified using VRS, providing a personalized biomarker for early hypoperfusion. Sex and age specific CBF patterns were identified, and a personalized vascular risk biomarker was proposed, contributing to advancements in precision neurology.
comment: 16 pages, 6 figures
♻ ☆ Multi-Exit Kolmogorov-Arnold Networks: enhancing accuracy and parsimony
Kolmogorov-Arnold Networks (KANs) uniquely combine high accuracy with interpretability, making them valuable for scientific modeling. However, it is unclear a priori how deep a network needs to be for any given task, and deeper KANs can be difficult to optimize and interpret. Here we introduce multi-exit KANs, where each layer includes its own prediction branch, enabling the network to make accurate predictions at multiple depths simultaneously. This architecture provides deep supervision that improves training while discovering the right level of model complexity for each task. Multi-exit KANs consistently outperform standard, single-exit versions on synthetic functions, dynamical systems, and real-world datasets. Remarkably, the best predictions often come from earlier, simpler exits, revealing that these networks naturally identify smaller, more parsimonious and interpretable models without sacrificing accuracy. To automate this discovery, we develop a differentiable "learning-to-exit" algorithm that balances contributions from exits during training. Our approach offers scientists a practical way to achieve both high performance and interpretability, addressing a fundamental challenge in machine learning for scientific discovery.
comment: 15 pages, 6 figures, 2 tables
♻ ☆ Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.
♻ ☆ Training neural control variates using correlated configurations
Neural control variates (NCVs) have emerged as a powerful tool for variance reduction in Monte Carlo (MC) simulations, particularly in high-dimensional problems where traditional control variates are difficult to construct analytically. By training neural networks to learn auxiliary functions correlated with the target observable, NCVs can significantly reduce estimator variance while preserving unbiasedness. However, a critical but often overlooked aspect of NCV training is the role of autocorrelated samples generated by Markov Chain Monte Carlo (MCMC). While such samples are typically discarded for error estimation due to their statistical redundancy, they may contain useful information about the structure of the underlying probability distribution that can benefit the training process. In this work, we systematically examine the effect of using correlated configurations in training neural control variates. We demonstrate, both conceptually and numerically, that training on correlated data can improve control variate performance, especially in settings with limited computational resources. Our analysis includes empirical results from $U(1)$ gauge theory and scalar field theory, illustrating when and how autocorrelated samples enhance NCV construction. These findings provide practical guidance for the efficient use of MCMC data in training neural networks.
comment: 9 pages, 8 figures; Version accepted by PRD
♻ ☆ InfAlign: Inference-aware language model alignment
Language model alignment is a critical step in training modern generative language models. Alignment targets to improve win rate of a sample from the aligned model against the base model. Today, we are increasingly using inference-time algorithms (e.g., Best-of-N, controlled decoding, tree search) to decode from language models rather than standard sampling. We show that this train/test mismatch makes standard RLHF framework sub-optimal in view of such inference-time methods. To this end, we propose a framework for inference-aware alignment (InfAlign), which aims to optimize inference-time win rate of the aligned policy against the base model. We prove that for any inference-time decoding procedure, the optimal aligned policy is the solution to the standard RLHF problem with a transformation of the reward. This motivates us to provide the calibrate-and-transform RL (InfAlign-CTRL) algorithm to solve this problem, which involves a reward calibration step and a KL-regularized reward maximization step with a transformation of the calibrated reward. For best-of-N sampling and best-of-N jailbreaking, we propose specific transformations offering up to 3-8% improvement on inference-time win rates. Finally, we also show that our proposed reward calibration method is a strong baseline for optimizing standard win rate.
♻ ☆ Learning to Generate Unit Tests for Automated Debugging
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug that (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, and helps LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Moreover, we observe that feedback from Qwen2.5 32B-based UTGen model can enhance debugging with frontier LLMs like GPT-4o by 13.8%. Lastly, we demonstrate that UTGen is a better judge for code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.
comment: Accepted to COLM 2025. Dataset and Code: https://github.com/archiki/UTGenDebug
♻ ☆ OPDR: Order-Preserving Dimension Reduction for Semantic Embedding of Multimodal Scientific Data
One of the most common operations in multimodal scientific data management is searching for the $k$ most similar items (or, $k$-nearest neighbors, KNN) from the database after being provided a new item. Although recent advances of multimodal machine learning models offer a \textit{semantic} index, the so-called \textit{embedding vectors} mapped from the original multimodal data, the dimension of the resulting embedding vectors are usually on the order of hundreds or a thousand, which are impractically high for time-sensitive scientific applications. This work proposes to reduce the dimensionality of the output embedding vectors such that the set of top-$k$ nearest neighbors do not change in the lower-dimensional space, namely Order-Preserving Dimension Reduction (OPDR). In order to develop such an OPDR method, our central hypothesis is that by analyzing the intrinsic relationship among key parameters during the dimension-reduction map, a quantitative function may be constructed to reveal the correlation between the target (lower) dimensionality and other variables. To demonstrate the hypothesis, this paper first defines a formal measure function to quantify the KNN similarity for a specific vector, then extends the measure into an aggregate accuracy of the global metric spaces, and finally derives a closed-form function between the target (lower) dimensionality and other variables. We incorporate the closed-function into popular dimension-reduction methods, various distance metrics, and embedding models.
♻ ☆ Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation
Large Language Models (LLMs) are widely used as proxies for human labelers in both training (Reinforcement Learning from AI Feedback) and large-scale response evaluation (LLM-as-a-judge). Alignment and evaluation are critical components in the development of reliable LLMs, and the choice of feedback protocol plays a central role in both but remains understudied. In this work, we show that the choice of feedback protocol for evaluation (absolute scores versus relative preferences) can significantly affect evaluation reliability and induce systematic biases. In the context of LLM-as-a-judge evaluation, we show that pairwise protocols are more vulnerable to distracted evaluation. Generator models can exploit spurious attributes (or distractor features) favored by the LLM judge, resulting in inflated scores for lower-quality outputs. We find that absolute scoring is more robust to such manipulation, producing judgments that better reflect response quality and are less influenced by distractor features. Our results demonstrate that generator models can flip preferences by embedding distractor features, skewing LLM-as-a-judge comparisons and leading to inaccurate conclusions about model quality in benchmark evaluations. Pairwise preferences flip in about 35% of the cases, compared to only 9% for absolute scores. We offer recommendations for choosing feedback protocols based on dataset characteristics and evaluation objectives.
comment: Published at COLM 2025
♻ ☆ Cooperative SGD with Dynamic Mixing Matrices ECAI-2025
One of the most common methods to train machine learning algorithms today is the stochastic gradient descent (SGD). In a distributed setting, SGD-based algorithms have been shown to converge theoretically under specific circumstances. A substantial number of works in the distributed SGD setting assume a fixed topology for the edge devices. These papers also assume that the contribution of nodes to the global model is uniform. However, experiments have shown that such assumptions are suboptimal and a non uniform aggregation strategy coupled with a dynamically shifting topology and client selection can significantly improve the performance of such models. This paper details a unified framework that covers several Local-Update SGD-based distributed algorithms with dynamic topologies and provides improved or matching theoretical guarantees on convergence compared to existing work.
comment: Accepted at 28th European Conference on Artificial Intelligence (ECAI-2025) in Main Paper track
♻ ☆ Exploring Modularity of Agentic Systems for Drug Discovery
Large-language models (LLMs) and agentic systems present exciting opportunities to accelerate drug discovery. In this study, we examine the modularity of LLM-based agentic systems for drug discovery, i.e., whether parts of the system such as the LLM and type of agent are interchangeable, a topic that has received limited attention in drug discovery. We compare the performance of different LLMs and the effectiveness of tool-calling agents versus code-generating agents. Our case study, comparing performance in orchestrating tools for chemistry and drug discovery using an LLM-as-a-judge score, shows that Claude-3.5-Sonnet, Claude-3.7-Sonnet and GPT-4o outperform alternative language models such as Llama-3.1-8B, Llama-3.1-70B, GPT-3.5-Turbo, and Nova-Micro. Although we confirm that code-generating agents outperform the tool-calling ones on average, we show that this is highly question- and model-dependent. Furthermore, the impact of replacing system prompts is dependent on the question and model, underscoring that even in this particular domain one cannot just replace components of the system without re-engineering. Our study highlights the necessity of further research into the modularity of agentic systems to enable the development of reliable and modular solutions for real-world problems.
♻ ☆ Architectural Co-Design for Zero-Shot Anomaly Detection: Decoupling Representation and Dynamically Fusing Features in CLIP
Pre-trained Vision-Language Models (VLMs) face a significant adaptation gap when applied to Zero-Shot Anomaly Detection (ZSAD), stemming from their lack of local inductive biases for dense prediction and their reliance on inflexible feature fusion paradigms. We address these limitations through an Architectural Co-Design framework that jointly refines feature representation and cross-modal fusion. Our method integrates a parameter-efficient Convolutional Low-Rank Adaptation (Conv-LoRA) adapter to inject local inductive biases for fine-grained representation, and introduces a Dynamic Fusion Gateway (DFG) that leverages visual context to adaptively modulate text prompts, enabling a powerful bidirectional fusion. Extensive experiments on diverse industrial and medical benchmarks demonstrate superior accuracy and robustness, validating that this synergistic co-design is critical for robustly adapting foundation models to dense perception tasks.
comment: 4 pages, 1 reference, 3 figures, icassp 2026
♻ ☆ Scalable Time-Series Causal Discovery with Approximate Causal Ordering
Causal discovery in time-series data presents a significant computational challenge. Standard algorithms are often prohibitively expensive for datasets with many variables or samples. This study introduces and validates a heuristic approximation of the VarLiNGAM algorithm to address this scalability problem. The standard VarLiNGAM method relies on an iterative search, recalculating statistical dependencies after each step. Our heuristic modifies this procedure by omitting the iterative refinement. This change permits a one-time precomputation of all necessary statistical values. The algorithmic modification reduces the time complexity from $O(m^3n)$ to $O(m^2n + m^3)$ while keeping the space complexity at $O(m^2)$, where $m$ is the number of variables and $n$ is the number of samples. While an approximation, our approach retains VarLiNGAM's essential structure and empirical reliability. On large-scale financial data with up to 400 variables, our algorithm achieves a 7--13x speedup over the standard implementation and a 4.5x speedup over a GPU-accelerated version. Evaluations across medical imaging, web server monitoring, and finance demonstrate the heuristic's robustness and practical scalability. This work offers a validated balance between computational efficiency and discovery quality, making large-scale causal analysis feasible on personal computers.
♻ ☆ Annif at SemEval-2025 Task 5: Traditional XMTC augmented by LLMs SemEval-2025
This paper presents the Annif system in SemEval-2025 Task 5 (LLMs4Subjects), which focussed on subject indexing using large language models (LLMs). The task required creating subject predictions for bibliographic records from the bilingual TIBKAT database using the GND subject vocabulary. Our approach combines traditional natural language processing and machine learning techniques implemented in the Annif toolkit with innovative LLM-based methods for translation and synthetic data generation, and merging predictions from monolingual models. The system ranked first in the all-subjects category and second in the tib-core-subjects category in the quantitative evaluation, and fourth in qualitative evaluations. These findings demonstrate the potential of combining traditional XMTC algorithms with modern LLM techniques to improve the accuracy and efficiency of subject indexing in multilingual contexts.
comment: 6 pages, 4 figures, published at SemEval-2025 workshop Task 5: LLMs4Subjects: https://aclanthology.org/2025.semeval-1.315/
♻ ☆ Online Convex Optimization and Integral Quadratic Constraints: An automated approach to regret analysis
We propose a novel approach for analyzing dynamic regret of first-order constrained online convex optimization algorithms for strongly convex and Lipschitz-smooth objectives. Crucially, we provide a general analysis that is applicable to a wide range of first-order algorithms that can be expressed as an interconnection of a linear dynamical system in feedback with a first-order oracle. By leveraging Integral Quadratic Constraints (IQCs), we derive a semi-definite program which, when feasible, provides a regret guarantee for the online algorithm. For this, the concept of variational IQCs is introduced as the generalization of IQCs to time-varying monotone operators. Our bounds capture the temporal rate of change of the problem in the form of the path length of the time-varying minimizer and the objective function variation. In contrast to standard results in OCO, our results do not require nerither the assumption of gradient boundedness, nor that of a bounded feasible set. Numerical analyses showcase the ability of the approach to capture the dependence of the regret on the function class condition number.
comment: Published in the 64th IEEE Conference on Decision and Control, 2025
♻ ☆ ILeSiA: Interactive Learning of Robot Situational Awareness from Camera Input
Learning from demonstration is a promising approach for teaching robots new skills. However, a central challenge in the execution of acquired skills is the ability to recognize faults and prevent failures. This is essential because demonstrations typically cover only a limited set of scenarios and often only the successful ones. During task execution, unforeseen situations may arise, such as changes in the robot's environment or interaction with human operators. To recognize such situations, this paper focuses on teaching the robot situational awareness by using a camera input and labeling frames as safe or risky. We train a Gaussian Process (GP) regression model fed by a low-dimensional latent space representation of the input images. The model outputs a continuous risk score ranging from zero to one, quantifying the degree of risk at each timestep. This allows for pausing task execution in unsafe situations and directly adding new training data, labeled by the human user. Our experiments on a robotic manipulator show that the proposed method can reliably detect both known and novel faults using only a single example for each new fault. In contrast, a standard multi-layer perceptron (MLP) performs well only on faults it has encountered during training. Our method enables the next generation of cobots to be rapidly deployed with easy-to-set-up, vision-based risk assessment, proactively safeguarding humans and detecting misaligned parts or missing objects before failures occur. We provide all the code and data required to reproduce our experiments at imitrob.ciirc.cvut.cz/publications/ilesia.
comment: 8 pages, 9 figures. Accepted to IEEE Robotics and Automation Letters (Early Access)
♻ ☆ CREMA: A Contrastive Regularized Masked Autoencoder for Robust ECG Diagnostics across Clinical Domains
Electrocardiogram (ECG) diagnosis remains challenging due to limited labeled data and the need to capture subtle yet clinically meaningful variations in rhythm and morphology. We present CREMA (Contrastive Regularized Masked Autoencoder), a foundation model for 12-lead ECGs designed to learn generalizable representations through self-supervised pretraining. CREMA combines generative learning and contrastive regularization via a Contrastive Regularized MAE loss, and employs a Signal Transformer (SiT) architecture to capture both local waveform details and global temporal dependencies. We evaluate CREMA on benchmark datasets and real-world clinical environments, including deployment scenarios with significant distribution shifts. CREMA outperforms supervised baselines and existing self-supervised models in both linear probing and fine-tuning evaluations. Notably, it maintains superior performance across diverse clinical domains, such as emergency care, highlighting its robustness under real-world conditions. These results demonstrate that CREMA serves as a scalable and reliable foundation model for ECG diagnostics, supporting downstream applications across heterogeneous and high-risk clinical settings.
comment: 10 pages
♻ ☆ Non-linear Welfare-Aware Strategic Learning
This paper studies algorithmic decision-making in the presence of strategic individual behaviors, where an ML model is used to make decisions about human agents and the latter can adapt their behavior strategically to improve their future data. Existing results on strategic learning have largely focused on the linear setting where agents with linear labeling functions best respond to a (noisy) linear decision policy. Instead, this work focuses on general non-linear settings where agents respond to the decision policy with only "local information" of the policy. Moreover, we simultaneously consider the objectives of maximizing decision-maker welfare (model prediction accuracy), social welfare (agent improvement caused by strategic behaviors), and agent welfare (the extent that ML underestimates the agents). We first generalize the agent best response model in previous works to the non-linear setting, then reveal the compatibility of welfare objectives. We show the three welfare can attain the optimum simultaneously only under restrictive conditions which are challenging to achieve in non-linear settings. The theoretical results imply that existing works solely maximizing the welfare of a subset of parties inevitably diminish the welfare of the others. We thus claim the necessity of balancing the welfare of each party in non-linear settings and propose an irreducible optimization algorithm suitable for general strategic learning. Experiments on synthetic and real data validate the proposed algorithm.
comment: Minor changes
♻ ☆ Mean-Field Langevin Diffusions with Density-dependent Temperature
In the context of non-convex optimization, we let the temperature of a Langevin diffusion to depend on the diffusion's own density function. The rationale is that the induced density captures to some extent the landscape imposed by the non-convex function to be minimized, such that a density-dependent temperature provides location-wise random perturbation that may better react to, for instance, the location and depth of local minimizers. As the Langevin dynamics is now self-regulated by its own density, it forms a mean-field stochastic differential equation (SDE) of the Nemytskii type, distinct from the standard McKean-Vlasov equations. Relying on Wasserstein subdifferential calculus, we first show that the corresponding (nonlinear) Fokker-Planck equation has a unique solution. Next, a weak solution to the SDE is constructed from the solution to the Fokker-Planck equation, by Trevisan's superposition principle. As time goes to infinity, we further show that the induced density converges to an invariant distribution, which admits an explicit formula in terms of the Lambert $W$ function. A numerical example suggests that the density-dependent temperature can simultaneously improve the accuracy of and rate of convergence to the estimate of global minimizers.
♻ ☆ MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling
Species Distribution Models (SDMs) play a vital role in biodiversity research, conservation planning, and ecological niche modeling by predicting species distributions based on environmental conditions. The selection of predictors is crucial, strongly impacting both model accuracy and how well the predictions reflect ecological patterns. To ensure meaningful insights, input variables must be carefully chosen to match the study objectives and the ecological requirements of the target species. However, existing SDMs, including both traditional and deep learning-based approaches, often lack key capabilities for variable selection: (i) flexibility to choose relevant predictors at inference without retraining; (ii) robustness to handle missing predictor values without compromising accuracy; and (iii) explainability to interpret and accurately quantify each predictor's contribution. To overcome these limitations, we introduce MaskSDM, a novel deep learning-based SDM that enables flexible predictor selection by employing a masked training strategy. This approach allows the model to make predictions with arbitrary subsets of input variables while remaining robust to missing data. It also provides a clearer understanding of how adding or removing a given predictor affects model performance and predictions. Additionally, MaskSDM leverages Shapley values for precise predictor contribution assessments, improving upon traditional approximations. We evaluate MaskSDM on the global sPlotOpen dataset, modeling the distributions of 12,738 plant species. Our results show that MaskSDM outperforms imputation-based methods and approximates models trained on specific subsets of variables. These findings underscore MaskSDM's potential to increase the applicability and adoption of SDMs, laying the groundwork for developing foundation models in SDMs that can be readily applied to diverse ecological applications.
♻ ☆ On the Consistency of GNN Explanations for Malware Detection
Control Flow Graphs (CFGs) are critical for analyzing program execution and characterizing malware behavior. With the growing adoption of Graph Neural Networks (GNNs), CFG-based representations have proven highly effective for malware detection. This study proposes a novel framework that dynamically constructs CFGs and embeds node features using a hybrid approach combining rule-based encoding and autoencoder-based embedding. A GNN-based classifier is then constructed to detect malicious behavior from the resulting graph representations. To improve model interpretability, we apply state-of-the-art explainability techniques, including GNNExplainer, PGExplainer, and CaptumExplainer, the latter is utilized three attribution methods: Integrated Gradients, Guided Backpropagation, and Saliency. In addition, we introduce a novel aggregation method, called RankFusion, that integrates the outputs of the top-performing explainers to enhance the explanation quality. We also evaluate explanations using two subgraph extraction strategies, including the proposed Greedy Edge-wise Composition (GEC) method for improved structural coherence. A comprehensive evaluation using accuracy, fidelity, and consistency metrics demonstrates the effectiveness of the proposed framework in terms of accurate identification of malware samples and generating reliable and interpretable explanations.
♻ ☆ The Complexity Dynamics of Grokking
We demonstrate the existence of a complexity phase transition in neural networks by studying the grokking phenomenon, where networks suddenly transition from memorization to generalization long after overfitting their training data. To characterize this phase transition, we introduce a theoretical framework for measuring complexity based on rate-distortion theory and Kolmogorov complexity, which can be understood as principled lossy compression for networks. We find that properly regularized networks exhibit a sharp phase transition: complexity rises during memorization, then falls as the network discovers a simpler underlying pattern that generalizes. In contrast, unregularized networks remain trapped in a high-complexity memorization phase. We establish an explicit connection between our complexity measure and generalization bounds, providing a theoretical foundation for the link between lossy compression and generalization. Our framework achieves compression ratios 30-40x better than na\"ive approaches, enabling precise tracking of complexity dynamics. Finally, we introduce a regularization method based on spectral entropy that encourages networks toward low-complexity representations by penalizing their intrinsic dimension.
♻ ☆ From Points to Spheres: A Geometric Reinterpretation of Variational Autoencoders
Variational Autoencoder is typically understood from the perspective of probabilistic inference. In this work, we propose a new geometric reinterpretation which complements the probabilistic view and enhances its intuitiveness. We demonstrate that the proper construction of semantic manifolds arises primarily from the constraining effect of the KL divergence on the encoder. We view the latent representations as a Gaussian ball rather than deterministic points. Under the constraint of KL divergence, Gaussian ball regularizes the latent space, promoting a more uniform distribution of encodings. Furthermore, we show that reparameterization establishes a critical contractual mechanism between the encoder and decoder, enabling the decoder to learn how to reconstruct from these stochastic regions. We further connect this viewpoint with VQ-VAE, offering a unified perspective: VQ-VAE can be seen as an autoencoder where encodings are constrained to a set of cluster centers, with its generative capability arising from the compactness rather than its stochasticity. This geometric framework provides a new lens for understanding how VAE shapes the latent geometry to enable effective generation.
♻ ☆ CC-Time: Cross-Model and Cross-Modality Time Series Forecasting
With the success of pre-trained language models (PLMs) in various application fields beyond natural language processing, language models have raised emerging attention in the field of time series forecasting (TSF) and have shown great prospects. However, current PLM-based TSF methods still fail to achieve satisfactory prediction accuracy matching the strong sequential modeling power of language models. To address this issue, we propose Cross-Model and Cross-Modality Learning with PLMs for time series forecasting (CC-Time). We explore the potential of PLMs for time series forecasting from two aspects: 1) what time series features could be modeled by PLMs, and 2) whether relying solely on PLMs is sufficient for building time series models. In the first aspect, CC-Time incorporates cross-modality learning to model temporal dependency and channel correlations in the language model from both time series sequences and their corresponding text descriptions. In the second aspect, CC-Time further proposes the cross-model fusion block to adaptively integrate knowledge from the PLMs and time series model to form a more comprehensive modeling of time series patterns. Extensive experiments on nine real-world datasets demonstrate that CC-Time achieves state-of-the-art prediction accuracy in both full-data training and few-shot learning situations.
♻ ☆ A mathematical perspective on Transformers
Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists.
♻ ☆ MMiC: Mitigating Modality Incompleteness in Clustered Federated Learning
In the era of big data, data mining has become indispensable for uncovering hidden patterns and insights from vast and complex datasets. The integration of multimodal data sources further enhances its potential. Multimodal Federated Learning (MFL) is a distributed approach that enhances the efficiency and quality of multimodal learning, ensuring collaborative work and privacy protection. However, missing modalities pose a significant challenge in MFL, often due to data quality issues or privacy policies across the clients. In this work, we present MMiC, a framework for Mitigating Modality incompleteness in MFL within the Clusters. MMiC replaces partial parameters within client models inside clusters to mitigate the impact of missing modalities. Furthermore, it leverages the Banzhaf Power Index to optimize client selection under these conditions. Finally, MMiC employs an innovative approach to dynamically control global aggregation by utilizing Markovitz Portfolio Optimization. Extensive experiments demonstrate that MMiC consistently outperforms existing federated learning architectures in both global and personalized performance on multimodal datasets with missing modalities, confirming the effectiveness of our proposed solution. Our code is available at https://github.com/gotobcn8/MMiC.
comment: 9 pages
♻ ☆ Optimizing Cross-Client Domain Coverage for Federated Instruction Tuning of Large Language Models EMNLP 2025
Federated domain-specific instruction tuning (FedDIT) for large language models (LLMs) aims to enhance performance in specialized domains using distributed private and limited data, yet identifying key performance drivers and optimal augmentation strategies remains challenging. We empirically establish that cross-client domain coverage, rather than data heterogeneity, is the pivotal factor. We then introduce FedDCA, an algorithm that explicitly maximizes this coverage through diversity-oriented client center selection and retrieval-based augmentation, constructing diverse, non-redundant cross-client instruction sets. Extensive experiments across multiple domains demonstrate FedDCA's superiority over eleven baselines, achieving performance gains of up to 29.19\% and domain coverage improvements of 4.82\%-21.36\%. FedDCA maintains its effectiveness in diverse and challenging scenarios, including data selection, held-out settings where task-specific public data is scarce and various data heterogeneity, with manageable privacy risks. This work clarifies critical FedDIT dynamics and presents FedDCA as an effective, privacy-preserving, and scalable solution for advancing domain-specific LLM tuning.
comment: EMNLP 2025
♻ ☆ Wasserstein Distributionally Robust Shallow Convex Neural Networks
In this work, we propose Wasserstein distributionally robust shallow convex neural networks (WaDiRo-SCNNs) to provide reliable nonlinear predictions when subject to adverse and corrupted datasets. Our approach is based on the reformulation of a new convex training program for ReLU-based shallow neural networks, which allows us to cast the problem into the order-1 Wasserstein distributionally robust optimization framework. Our training procedure is conservative, has low stochasticity, is solvable with open-source solvers, and is scalable to large industrial deployments. We provide out-of-sample performance guarantees, show that hard convex physical constraints can be enforced in the training program, and propose a mixed-integer convex post-training verification program to evaluate model stability. WaDiRo-SCNN aims to make neural networks safer for critical applications, such as in the energy sector. Finally, we numerically demonstrate our model's performance through both a synthetic experiment and a real-world power system application, viz., the prediction of hourly energy consumption in non-residential buildings within the context of virtual power plants, and evaluate its stability across standard regression benchmark datasets. The experimental results are convincing and showcase the strengths of the proposed model.
♻ ☆ Scalable Bayesian Monte Carlo: fast uncertainty estimation beyond deep ensembles
This work introduces a new method designed for Bayesian deep learning called scalable Bayesian Monte Carlo (SBMC). The method is comprised of a model and an algorithm. The model interpolates between a point estimator and the posterior. The algorithm is a parallel implementation of sequential Monte Carlo sampler (SMC$_\parallel$) or Markov chain Monte Carlo (MCMC$_\parallel$). We collectively refer to these consistent (asymptotically unbiased) algorithms as Bayesian Monte Carlo (BMC), and any such algorithm can be used in our SBMC method. The utility of the method is demonstrated on practical examples: MNIST, CIFAR, IMDb. A systematic numerical study reveals that for the same wall-clock time as state-of-the-art (SOTA) methods like deep ensembles (DE), SBMC achieves comparable or better accuracy and substantially improved uncertainty quantification (UQ)--in particular, epistemic UQ. This is demonstrated on the downstream task of estimating the confidence in predictions, which can be used for reliability assessment or abstention decisions.
comment: 38 pages, 30 figures, 23 tables
♻ ☆ Faster Convergence of Riemannian Stochastic Gradient Descent with Increasing Batch Size
We have theoretically analyzed the use of Riemannian stochastic gradient descent (RSGD) and found that using an increasing batch size leads to faster RSGD convergence rate than using a constant batch size not only with a constant learning rate but also with a decaying learning rate, such as cosine annealing decay and polynomial decay. The convergence rate of RSGD improves from $O(\sqrt{T^{-1}+\text{const.}})$ with a constant batch size to $O(T^{-\frac{1}{2}})$ with an increasing batch size, where $T$ denotes the number of iterations. Using principal component analysis and low-rank matrix completion tasks, we investigated, both theoretically and numerically, how increasing batch size affects computational time as measured by stochastic first-order oracle (SFO) complexity. Increasing batch size reduces the SFO complexity of RSGD. Furthermore, our numerical results demonstrated that increasing batch size offers the advantages of both small and large constant batch sizes.
comment: Minor revisions have been made
♻ ☆ On the Fundamental Impossibility of Hallucination Control in Large Language Models
This paper establishes a fundamental impossibility theorem: no LLM capable of performing non-trivial knowledge aggregation can simultaneously achieve truthful knowledge representation, semantic information conservation, complete revelation of relevant knowledge, and knowledge-constrained optimality. The impossibility is not an engineering limitation but arises from the mathematical structure of information aggregation itself. We establish this result by describing the inference process as an auction of ideas, where distributed components compete exploiting their partial knowledge to shape responses. The proof spans three independent mathematical domains: mechanism design theory (Green-Laffont), the theory of proper scoring rules (Savage), and direct architectural analysis of transformers (Log-Sum-Exp convexity). In particular, we show how to quantify the creation of overconfident or intuitive responses-the signature of both hallucination and creativity, or imagination. To support this analysis, we introduce the complementary concepts of the semantic information measure and the emergence operator to model bounded reasoning in a general setting. We prove that while bounded reasoning generates accessible information, providing valuable insights and inspirations, the idealized unconstrained reasoning strictly preserves semantic content. By demonstrating that hallucination and imagination are mathematically identical phenomena-grounded in departures from truthfulness, semantic information conservation, revelation of relevant knowledge, and knowledge-constrained optimality-we offer a principled foundation for managing these behaviors in advanced AI systems. Finally, we present some speculative ideas to inspire evaluation and refinements of the proposed theory.
comment: cleared mathematics, proofs debugged and ideas expanded, added missing definitions and axioms, discussion and speculation section reorganized, related work improved
♻ ☆ Multitask Learning with Stochastic Interpolants
We propose a framework for learning maps between probability distributions that broadly generalizes the time dynamics of flow and diffusion models. To enable this, we generalize stochastic interpolants by replacing the scalar time variable with vectors, matrices, or linear operators, allowing us to bridge probability distributions across multiple dimensional spaces. This approach enables the construction of versatile generative models capable of fulfilling multiple tasks without task-specific training. Our operator-based interpolants not only provide a unifying theoretical perspective for existing generative models but also extend their capabilities. Through numerical experiments, we demonstrate the zero-shot efficacy of our method on conditional generation and inpainting, fine-tuning and posterior sampling, and multiscale modeling, suggesting its potential as a generic task-agnostic alternative to specialized models.
♻ ☆ Neural reproducing kernel Banach spaces and representer theorems for deep networks
Characterizing the function spaces defined by neural networks helps understanding the corresponding learning models and their inductive bias. While in some limits neural networks correspond to function spaces that are Hilbert spaces, these regimes do not capture the properties of the networks used in practice. Indeed, several results have shown that shallow networks can be better characterized in terms of suitable Banach spaces. However, analogous results for deep networks are limited. In this paper we show that deep neural networks define suitable reproducing kernel Banach spaces. These spaces are equipped with norms that enforce a form of sparsity, enabling them to adapt to potential latent structures within the input data and their representations. In particular, by leveraging the theory of reproducing kernel Banach spaces, combined with variational results, we derive representer theorems that justify the finite architectures commonly employed in applications. Our study extends analogous results for shallow networks and represents a step towards understanding the function spaces induced by neural architectures used in practice.
♻ ☆ Improving Predictions of Convective Storm Wind Gusts through Statistical Post-Processing of Neural Weather Models
Issuing timely severe weather warnings helps mitigate potentially disastrous consequences. Recent advancements in Neural Weather Models (NWMs) offer a computationally inexpensive and fast approach for forecasting atmospheric environments on a 0.25{\deg} global grid. For thunderstorms, these environments can be empirically post-processed to predict wind gust distributions at specific locations. With the Pangu-Weather NWM, we apply a hierarchy of statistical and deep learning post-processing methods to forecast hourly wind gusts up to three days ahead. To ensure statistical robustness, we constrain our probabilistic forecasts using generalised extreme-value distributions across five regions in Switzerland. Using a convolutional neural network to post-process the predicted atmospheric environment's spatial patterns yields the best results, outperforming direct forecasting approaches across lead times and wind gust speeds. Our results confirm the added value of NWMs for extreme wind forecasting, especially for designing more responsive early-warning systems.
comment: 18+6 pages, 11+8 figures, 4 tables, revised for npj Natural Hazards
♻ ☆ Inverse Problem Sampling in Latent Space Using Sequential Monte Carlo
In image processing, solving inverse problems is the task of finding plausible reconstructions of an image that was corrupted by some (usually known) degradation operator. Commonly, this process is done using a generative image model that can guide the reconstruction towards solutions that appear natural. The success of diffusion models over the last few years has made them a leading candidate for this task. However, the sequential nature of diffusion models makes this conditional sampling process challenging. Furthermore, since diffusion models are often defined in the latent space of an autoencoder, the encoder-decoder transformations introduce additional difficulties. To address these challenges, we suggest a novel sampling method based on sequential Monte Carlo (SMC) in the latent space of diffusion models. We name our method LD-SMC. We define a generative model for the data using additional auxiliary observations and perform posterior inference with SMC sampling based on a reverse diffusion process. Empirical evaluations on ImageNet and FFHQ show the benefits of LD-SMC over competing methods in various inverse problem tasks and especially in challenging inpainting tasks.
♻ ☆ TS-Insight: Visualizing Thompson Sampling for Verification and XAI IEEE VIS 2025
Thompson Sampling (TS) and its variants are powerful Multi-Armed Bandit algorithms used to balance exploration and exploitation strategies in active learning. Yet, their probabilistic nature often turns them into a "black box", hindering debugging and trust. We introduce TS-Insight, a visual analytics tool explicitly designed to shed light on the internal decision mechanisms of Thompson Sampling-based algorithms, for model developers. It comprises multiple plots, tracing for each arm the evolving posteriors, evidence counts, and sampling outcomes, enabling the verification, diagnosis, and explainability of exploration/exploitation dynamics. This tool aims at fostering trust and facilitating effective debugging and deployment in complex binary decision-making scenarios especially in sensitive domains requiring interpretable decision-making.
comment: Accepted as a poster at IEEE VIS 2025 ("TS-Insight: Visual Fingerprinting of Multi-Armed Bandits"). Open-source tool available at https://github.com/LIST-LUXEMBOURG/ts-insight
♻ ☆ Self-Supervised Prompt Optimization
Well-designed prompts are crucial for enhancing Large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or by humans, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external reference. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO.
♻ ☆ Near-Optimal Sparse Allreduce for Distributed Deep Learning PPoPP'22
Communication overhead is one of the major obstacles to train large deep learning models at scale. Gradient sparsification is a promising technique to reduce the communication volume. However, it is very challenging to obtain real performance improvement because of (1) the difficulty of achieving an scalable and efficient sparse allreduce algorithm and (2) the sparsification overhead. This paper proposes O$k$-Top$k$, a scheme for distributed training with sparse gradients. O$k$-Top$k$ integrates a novel sparse allreduce algorithm (less than 6$k$ communication volume which is asymptotically optimal) with the decentralized parallel Stochastic Gradient Descent (SGD) optimizer, and its convergence is proved. To reduce the sparsification overhead, O$k$-Top$k$ efficiently selects the top-$k$ gradient values according to an estimated threshold. Evaluations are conducted on the Piz Daint supercomputer with neural network models from different deep learning domains. Empirical results show that O$k$-Top$k$ achieves similar model accuracy to dense allreduce. Compared with the optimized dense and the state-of-the-art sparse allreduces, O$k$-Top$k$ is more scalable and significantly improves training throughput (e.g., 3.29x-12.95x improvement for BERT on 256 GPUs).
comment: Published in Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'22), April 2-6, 2022, Pages 135-149, https://doi.org/10.1145/3503221.3508399
♻ ☆ Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines SC'21
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore no loss of accuracy, which is more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
comment: Published in Proceedings of the 2021 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'21), November 2021, Article No.: 27, Pages 1-14. Best Paper Finalist
♻ ☆ Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations PPoPP'20
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.
comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 2020, Best Paper Nomination
♻ ☆ MCLPD:Multi-view Contrastive Learning for EEG-based PD Detection Across Datasets ECAI 2025
Electroencephalography has been validated as an effective technique for detecting Parkinson's disease,particularly in its early stages.However,the high cost of EEG data annotation often results in limited dataset size and considerable discrepancies across datasets,including differences in acquisition protocols and subject demographics,significantly hinder the robustness and generalizability of models in cross-dataset detection scenarios.To address such challenges,this paper proposes a semi-supervised learning framework named MCLPD,which integrates multi-view contrastive pre-training with lightweight supervised fine-tuning to enhance cross-dataset PD detection performance.During pre-training,MCLPD uses self-supervised learning on the unlabeled UNM dataset.To build contrastive pairs,it applies dual augmentations in both time and frequency domains,which enrich the data and naturally fuse time-frequency information.In the fine-tuning phase,only a small proportion of labeled data from another two datasets (UI and UC)is used for supervised optimization.Experimental results show that MCLPD achieves F1 scores of 0.91 on UI and 0.81 on UC using only 1%of labeled data,which further improve to 0.97 and 0.87,respectively,when 5%of labeled data is used.Compared to existing methods,MCLPD substantially improves cross-dataset generalization while reducing the dependency on labeled data,demonstrating the effectiveness of the proposed framework.
comment: Acccepted by European Conference on Artificial Intelligence(ECAI 2025)
♻ ☆ Continual Learning for Multimodal Data Fusion of a Soft Gripper
Continual learning (CL) refers to the ability of an algorithm to continuously and incrementally acquire new knowledge from its environment while retaining previously learned information. A model trained on one data modality often fails when tested with a different modality. A straightforward approach might be to fuse the two modalities by concatenating their features and training the model on the fused data. However, this requires retraining the model from scratch each time it encounters a new domain. In this paper, we introduce a continual learning algorithm capable of incrementally learning different data modalities by leveraging both class-incremental and domain-incremental learning scenarios in an artificial environment where labeled data is scarce, yet non-iid (independent and identical distribution) unlabeled data from the environment is plentiful. The proposed algorithm is efficient and only requires storing prototypes for each class. We evaluate the algorithm's effectiveness on a challenging custom multimodal dataset comprising of tactile data from a soft pneumatic gripper, and visual data from non-stationary images of objects extracted from video sequences. Additionally, we conduct an ablation study on the custom dataset and the Core50 dataset to highlight the contributions of different components of the algorithm. To further demonstrate the robustness of the algorithm, we perform a real-time experiment for object classification using the soft gripper and an external independent camera setup, all synchronized with the Robot Operating System (ROS) framework.
comment: Accepted in Wiley Advanced Robotics Research
♻ ☆ ThinkTuning: Instilling Cognitive Reflections without Distillation EMNLP 2025
Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don't exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback -- enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student's thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning.
comment: EMNLP 2025 (Main Conference)
♻ ☆ Adaptive Routing of Text-to-Image Generation Requests Between Large Cloud Model and Light-Weight Edge Model ICCV 2025
Large text-to-image models demonstrate impressive generation capabilities; however, their substantial size necessitates expensive cloud servers for deployment. Conversely, light-weight models can be deployed on edge devices at lower cost but often with inferior generation quality for complex user prompts. To strike a balance between performance and cost, we propose a routing framework, called RouteT2I, which dynamically selects either the large cloud model or the light-weight edge model for each user prompt. Since generated image quality is challenging to measure and compare directly, RouteT2I establishes multi-dimensional quality metrics, particularly, by evaluating the similarity between the generated images and both positive and negative texts that describe each specific quality metric. RouteT2I then predicts the expected quality of the generated images by identifying key tokens in the prompt and comparing their impact on the quality. RouteT2I further introduces the Pareto relative superiority to compare the multi-metric quality of the generated images. Based on this comparison and predefined cost constraints, RouteT2I allocates prompts to either the edge or the cloud. Evaluation reveals that RouteT2I significantly reduces the number of requesting large cloud model while maintaining high-quality image generation.
comment: Accepted by ICCV 2025
♻ ☆ Deep Learning-Based Financial Time Series Forecasting via Sliding Window and Variational Mode Decomposition
To address the complexity of financial time series, this paper proposes a forecasting model combining sliding window and variational mode decomposition (VMD) methods. Historical stock prices and relevant market indicators are used to construct datasets. VMD decomposes non-stationary financial time series into smoother subcomponents, improving model adaptability. The decomposed data is then input into a deep learning model for prediction. The study compares the forecasting effects of an LSTM model trained on VMD-processed sequences with those using raw time series, demonstrating better performance and stability.
♻ ☆ Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021, DOI: https://doi.org/10.1109/TPDS.2020.3040606
♻ ☆ Integrating Feature Attention and Temporal Modeling for Collaborative Financial Risk Assessment
This paper addresses the challenges of data privacy and collaborative modeling in cross-institution financial risk analysis. It proposes a risk assessment framework based on federated learning. Without sharing raw data, the method enables joint modeling and risk identification across multiple institutions. This is achieved by incorporating a feature attention mechanism and temporal modeling structure. Specifically, the model adopts a distributed optimization strategy. Each financial institution trains a local sub-model. The model parameters are protected using differential privacy and noise injection before being uploaded. A central server then aggregates these parameters to generate a global model. This global model is used for systemic risk identification. To validate the effectiveness of the proposed method, multiple experiments are conducted. These evaluate communication efficiency, model accuracy, systemic risk detection, and cross-market generalization. The results show that the proposed model outperforms both traditional centralized methods and existing federated learning variants across all evaluation metrics. It demonstrates strong modeling capabilities and practical value in sensitive financial environments. The method enhances the scope and efficiency of risk identification while preserving data sovereignty. It offers a secure and efficient solution for intelligent financial risk analysis.
♻ ☆ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
♻ ☆ Versatile Cardiovascular Signal Generation with a Unified Diffusion Transformer
Cardiovascular signals such as photoplethysmography (PPG), electrocardiography (ECG), and blood pressure (BP) are inherently correlated and complementary, together reflecting the health of cardiovascular system. However, their joint utilization in real-time monitoring is severely limited by diverse acquisition challenges from noisy wearable recordings to burdened invasive procedures. Here we propose UniCardio, a multi-modal diffusion transformer that reconstructs low-quality signals and synthesizes unrecorded signals in a unified generative framework. Its key innovations include a specialized model architecture to manage the signal modalities involved in generation tasks and a continual learning paradigm to incorporate varying modality combinations. By exploiting the complementary nature of cardiovascular signals, UniCardio clearly outperforms recent task-specific baselines in signal denoising, imputation, and translation. The generated signals match the performance of ground-truth signals in detecting abnormal health conditions and estimating vital signs, even in unseen domains, while ensuring interpretability for human experts. These advantages position UniCardio as a promising avenue for advancing AI-assisted healthcare.
♻ ☆ A Survey of Foundation Models for IoT: Taxonomy and Criteria-Based Analysis
Foundation models have gained growing interest in the IoT domain due to their reduced reliance on labeled data and strong generalizability across tasks, which address key limitations of traditional machine learning approaches. However, most existing foundation model based methods are developed for specific IoT tasks, making it difficult to compare approaches across IoT domains and limiting guidance for applying them to new tasks. This survey aims to bridge this gap by providing a comprehensive overview of current methodologies and organizing them around four shared performance objectives by different domains: efficiency, context-awareness, safety, and security & privacy. For each objective, we review representative works, summarize commonly-used techniques and evaluation metrics. This objective-centric organization enables meaningful cross-domain comparisons and offers practical insights for selecting and designing foundation model based solutions for new IoT tasks. We conclude with key directions for future research to guide both practitioners and researchers in advancing the use of foundation models in IoT applications.
comment: Accepted by CCF Transactions on Pervasive Computing and Interaction
♻ ☆ Edge-Cloud Collaborative Computing on Distributed Intelligence and Model Optimization: A Survey
Edge-cloud collaborative computing (ECCC) has emerged as a pivotal paradigm for addressing the computational demands of modern intelligent applications, integrating cloud resources with edge devices to enable efficient, low-latency processing. Recent advancements in AI, particularly deep learning and large language models (LLMs), have dramatically enhanced the capabilities of these distributed systems, yet introduce significant challenges in model deployment and resource management. In this survey, we comprehensive examine the intersection of distributed intelligence and model optimization within edge-cloud environments, providing a structured tutorial on fundamental architectures, enabling technologies, and emerging applications. Additionally, we systematically analyze model optimization approaches, including compression, adaptation, and neural architecture search, alongside AI-driven resource management strategies that balance performance, energy efficiency, and latency requirements. We further explore critical aspects of privacy protection and security enhancement within ECCC systems and examines practical deployments through diverse applications, spanning autonomous driving, healthcare, and industrial automation. Performance analysis and benchmarking techniques are also thoroughly explored to establish evaluation standards for these complex systems. Furthermore, the review identifies critical research directions including LLMs deployment, 6G integration, neuromorphic computing, and quantum computing, offering a roadmap for addressing persistent challenges in heterogeneity management, real-time processing, and scalability. By bridging theoretical advancements and practical deployments, this survey offers researchers and practitioners a holistic perspective on leveraging AI to optimize distributed computing environments, fostering innovation in next-generation intelligent systems.
comment: 43 pages, 10 figures, 10 tables
♻ ☆ Teuken-7B-Base & Teuken-7B-Instruct: Towards European LLMs
We present two multilingual LLMs, Teuken 7B-base and Teuken 7B-instruct, designed to embrace Europe's linguistic diversity by supporting all 24 official languages of the European Union. Trained on a dataset comprising around 60% non-English data and utilizing a custom multilingual tokenizer, our models address the limitations of existing LLMs that predominantly focus on English or a few high-resource languages. We detail the models' development principles, i.e., data composition, tokenizer optimization, and training methodologies. The models demonstrate strong performance across multilingual benchmarks, as evidenced by their performance on European versions of ARC, HellaSwag, and TruthfulQA.
♻ ☆ Robust Sparse Mean Estimation via Incremental Learning
In this paper, we study the problem of robust sparse mean estimation, where the goal is to estimate a $k$-sparse mean from a collection of partially corrupted samples drawn from a heavy-tailed distribution. Existing estimators face two critical challenges in this setting. First, the existing estimators rely on the prior knowledge of the sparsity level $k$. Second, the existing estimators fall short of practical use as they scale poorly with the ambient dimension. This paper presents a simple mean estimator that overcomes both challenges under moderate conditions: it works without the knowledge of $k$ and runs in near-linear time and memory (both with respect to the ambient dimension). Moreover, provided that the signal-to-noise ratio is large, we can further improve our result to match the information-theoretic lower bound. At the core of our method lies an incremental learning phenomenon: we introduce a simple nonconvex framework that can incrementally learn the top-$k$ nonzero elements of the mean while keeping the zero elements arbitrarily small. Finally, we conduct a series of simulations to corroborate our theoretical findings.
♻ ☆ Federated Learning on Riemannian Manifolds: A Gradient-Free Projection-Based Approach
Federated learning (FL) has emerged as a powerful paradigm for collaborative model training across distributed clients while preserving data privacy. However, existing FL algorithms predominantly focus on unconstrained optimization problems with exact gradient information, limiting its applicability in scenarios where only noisy function evaluations are accessible or where model parameters are constrained. To address these challenges, we propose a novel zeroth-order projection-based algorithm on Riemannian manifolds for FL. By leveraging the projection operator, we introduce a computationally efficient zeroth-order Riemannian gradient estimator. Unlike existing estimators, ours requires only a simple Euclidean random perturbation, eliminating the need to sample random vectors in the tangent space, thus reducing computational cost. Theoretically, we first prove the approximation properties of the estimator and then establish the sublinear convergence of the proposed algorithm, matching the rate of its first-order counterpart. Numerically, we first assess the efficiency of our estimator using kernel principal component analysis. Furthermore, we apply the proposed algorithm to two real-world scenarios: zeroth-order attacks on deep neural networks and low-rank neural network training to validate the theoretical findings.
♻ ☆ KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis
Large Language Models (LLMs) frequently generate hallucinations: statements that are syntactically plausible but lack factual grounding. This research presents KEA (Kernel-Enriched AI) Explain: a neurosymbolic framework that detects and explains such hallucinations by comparing knowledge graphs constructed from LLM outputs with ground truth data from Wikidata or contextual documents. Using graph kernels and semantic clustering, the method provides explanations for detected hallucinations, ensuring both robustness and interpretability. Our framework achieves competitive accuracy in detecting hallucinations across both open- and closed-domain tasks, and is able to generate contrastive explanations, enhancing transparency. This research advances the reliability of LLMs in high-stakes domains and provides a foundation for future work on precision improvements and multi-source knowledge integration.
♻ ☆ Rethinking Addressing in Language Models via Contexualized Equivariant Positional Encoding ICML 2025
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose con\textbf{T}extualized equivari\textbf{A}nt \textbf{P}osition \textbf{E}ncoding (\textbf{TAPE}), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
comment: ICML 2025
♻ ☆ Redundant feature screening method for human activity recognition based on attention purification mechanism
In the field of sensor-based Human Activity Recognition (HAR), deep neural networks provide advanced technical support. Many studies have proven that recognition accuracy can be improved by increasing the depth or width of the network. However, for wearable devices, the balance between network performance and resource consumption is crucial. With minimum resource consumption as the basic principle, we propose a universal attention feature purification mechanism, called MSAP, which is suitable for multi-scale networks. The mechanism effectively solves the feature redundancy caused by the superposition of multi-scale features by means of inter-scale attention screening and connection method. In addition, we have designed a network correction module that integrates seamlessly between layers of individual network modules to mitigate inherent problems in deep networks. We also built an embedded deployment system that is in line with the current level of wearable technology to test the practical feasibility of the HAR model, and further prove the efficiency of the method. Extensive experiments on four public datasets show that the proposed method model effectively reduces redundant features in filtered data and provides excellent performance with little resource consumption.
comment: 12 pages,7 figures
♻ ☆ Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets
Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate a critical question for resource-constrained alignment: under a fixed acquisition budget, should practitioners prioritize examples that are easy, medium, hard, or of random difficulty? We study Group Relative Policy Optimization (GRPO) fine-tuning across different model sizes and families, comparing four subset selection policies chosen from the same unlabeled pool using base-model difficulty estimates obtained via multi-sample evaluation. Our experiments reveal that training on the hardest examples yields the largest performance gains, up to 47%, while training on easy examples yield the smallest gains. Analysis reveals that this effect arises from harder examples providing more learnable opportunities during GRPO training. These findings provide practical guidance for budget-constrained post-training: prioritizing hard examples yields substantial performance gains on reasoning tasks when using GRPO.
♻ ☆ Adaptive Experiments Under Data Sparse Settings: Applications for Educational Platforms
Adaptive experimentation is increasingly used in educational platforms to personalize learning through dynamic content and feedback. However, standard adaptive strategies such as Thompson Sampling often underperform in real-world educational settings where content variations are numerous and student participation is limited, resulting in sparse data. In particular, Thompson Sampling can lead to imbalanced content allocation and delayed convergence on which aspects of content are most effective for student learning. To address these challenges, we introduce Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS), an algorithm that refines the sampling strategy to improve content-related decision-making in data-sparse environments. WAPTS is guided by the principle of lenient regret, allowing near-optimal allocations to accelerate learning while still exploring promising content. We evaluate WAPTS in a learnersourcing scenario where students rate peer-generated learning materials, and demonstrate that it enables earlier and more reliable identification of promising treatments.
♻ ☆ Diffusion MRI with Machine Learning
\hspace{2mm} Diffusion-weighted magnetic resonance imaging (dMRI) of the brain offers unique capabilities including noninvasive probing of tissue microstructure and structural connectivity. It is widely used for clinical assessment of disease and injury, and for neuroscience research. Analyzing the dMRI data to extract useful information for medical and scientific purposes can be challenging. The dMRI measurements may suffer from strong noise and artifacts, and may exhibit high inter-session and inter-scanner variability in the data, as well as inter-subject heterogeneity in brain structure. Moreover, the relationship between measurements and the phenomena of interest can be highly complex. Recent years have witnessed increasing use of machine learning methods for dMRI analysis. This manuscript aims to assess these efforts, with a focus on methods that have addressed data preprocessing and harmonization, microstructure mapping, tractography, and white matter tract analysis. We study the main findings, strengths, and weaknesses of the existing methods and suggest topics for future research. We find that machine learning may be exceptionally suited to tackle some of the difficult tasks in dMRI analysis. However, for this to happen, several shortcomings of existing methods and critical unresolved issues need to be addressed. There is a pressing need to improve evaluation practices, to increase the availability of rich training datasets and validation benchmarks, as well as model generalizability, reliability, and explainability concerns.
♻ ☆ Toward Errorless Training ImageNet-1k
In this paper, we describe a feedforward artificial neural network trained on the ImageNet 2012 contest dataset [7] with the new method of [5] to an accuracy rate of 98.3% with a 99.69 Top-1 rate, and an average of 285.9 labels that are perfectly classified over the 10 batch partitions of the dataset. The best performing model uses 322,430,160 parameters, with 4 decimal places precision. We conjecture that the reason our model does not achieve a 100% accuracy rate is due to a double-labeling problem, by which there are duplicate images in the dataset with different labels.
comment: 14 pages, 2 figures, 5 tables
Multimedia 4
☆ Visual Autoregressive Modeling for Instruction-Guided Image Editing
Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by 30\%+ higher GPT-Balance score. Moreover, it completes a $512\times512$ editing in 1.2 seconds, making it 2.2$\times$ faster than the similarly sized UltraEdit. The models are available at https://github.com/HiDream-ai/VAREdit.
comment: Source codes and models are available at https://github.com/HiDream-ai/VAREdit
☆ GRAFT: GRaPH and Table Reasoning for Textual Alignment -- A Benchmark for Structured Instruction Following and Visual Reasoning
GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field.
comment: 23 pages, 9 tables, 3 figures
☆ LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model
The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in https://github.com/EIT-NLP/LLaSO.
☆ A Low-Latency 3D Live Remote Visualization System for Tourist Sites Integrating Dynamic and Pre-captured Static Point Clouds
Various real-time methods for capturing and transmitting dynamic 3D spaces have been proposed, including those based on RGB-D cameras and volumetric capture. However, applying existing methods to outdoor tourist sites remains difficult because maintenance and aesthetic constraints limit sensor placement, and daylight variability complicates processing. We propose a system that combines multiple LiDARs and cameras for live dynamic point cloud capture, and integrates them with pre-captured static point clouds for wide-area 3D visualization. The system sustains 30 fps across wide-area scenes while keeping latency below 100 ms. To mitigate lighting inconsistencies, static point-cloud colors are automatically adjusted to current lighting. The effectiveness of our system is demonstrated through real-world deployment in a tourist site.
comment: 3 pages, 4 figures, submitted to IEEE ISMAR 2025 Posters
Computation and Language 85
☆ Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Recent advances in diffusion large language models (dLLMs) have introduced a promising alternative to autoregressive (AR) LLMs for natural language generation tasks, leveraging full attention and denoising-based decoding strategies. However, the deployment of these models on edge devices remains challenging due to their massive parameter scale and high resource demands. While post-training quantization (PTQ) has emerged as a widely adopted technique for compressing AR LLMs, its applicability to dLLMs remains largely unexplored. In this work, we present the first systematic study on quantizing diffusion-based language models. We begin by identifying the presence of activation outliers, characterized by abnormally large activation values that dominate the dynamic range. These outliers pose a key challenge to low-bit quantization, as they make it difficult to preserve precision for the majority of values. More importantly, we implement state-of-the-art PTQ methods and conduct a comprehensive evaluation across multiple task types and model variants. Our analysis is structured along four key dimensions: bit-width, quantization method, task category, and model type. Through this multi-perspective evaluation, we offer practical insights into the quantization behavior of dLLMs under different configurations. We hope our findings provide a foundation for future research in efficient dLLM deployment. All codes and experimental setups will be released to support the community.
comment: Technical Report, Work in Progress
☆ Virtual Community: An Open World for Humans, Robots, and Society
The rapid progress in AI and Robotics may lead to a profound societal transformation, as humans and robots begin to coexist within shared communities, introducing both opportunities and challenges. To explore this future, we present Virtual Community-an open-world platform for humans, robots, and society-built on a universal physics engine and grounded in real-world 3D scenes. With Virtual Community, we aim to study embodied social intelligence at scale: 1) How robots can intelligently cooperate or compete; 2) How humans develop social relations and build community; 3) More importantly, how intelligent robots and humans can co-exist in an open world. To support these, Virtual Community features: 1) An open-source multi-agent physics simulator that supports robots, humans, and their interactions within a society; 2) A large-scale, real-world aligned community generation pipeline, including vast outdoor space, diverse indoor scenes, and a community of grounded agents with rich characters and appearances. Leveraging Virtual Community, we propose two novel challenges. The Community Planning Challenge evaluates multi-agent reasoning and planning ability in open-world settings, such as cooperating to help agents with daily activities and efficiently connecting other agents. The Community Robot Challenge requires multiple heterogeneous robots to collaborate in solving complex open-world tasks. We evaluate various baselines on these tasks and demonstrate the challenges in both high-level open-world task planning and low-level cooperation controls. We hope that Virtual Community will unlock further study of human-robot coexistence within open-world environments.
comment: website https://virtual-community-ai.github.io/
☆ MedReseacher-R1: Expert-Level Medical Deep Researcher via A Knowledge-Informed Trajectory Synthesis Framework
Recent developments in Large Language Model (LLM)-based agents have shown impressive capabilities spanning multiple domains, exemplified by deep research systems that demonstrate superior performance on complex information-seeking and synthesis tasks. While general-purpose deep research agents have shown impressive capabilities, they struggle significantly with medical domain challenges, as evidenced by leading proprietary systems achieving limited accuracy on complex medical benchmarks. The key limitations are: (1) the model lacks sufficient dense medical knowledge for clinical reasoning, and (2) the framework is constrained by the absence of specialized retrieval tools tailored for medical contexts.We present a medical deep research agent that addresses these challenges through two core innovations. First, we develop a novel data synthesis framework using medical knowledge graphs, extracting the longest chains from subgraphs around rare medical entities to generate complex multi-hop question-answer pairs. Second, we integrate a custom-built private medical retrieval engine alongside general-purpose tools, enabling accurate medical information synthesis. Our approach generates 2100+ diverse trajectories across 12 medical specialties, each averaging 4.2 tool interactions.Through a two-stage training paradigm combining supervised fine-tuning and online reinforcement learning with composite rewards, our MedResearcher-R1-32B model demonstrates exceptional performance, establishing new state-of-the-art results on medical benchmarks while maintaining competitive performance on general deep research tasks. Our work demonstrates that strategic domain-specific innovations in architecture, tool design, and training data construction can enable smaller open-source models to outperform much larger proprietary systems in specialized domains.
comment: 13 pages, 5 figures
☆ The Prompting Brain: Neurocognitive Markers of Expertise in Guiding Large Language Models
Prompt engineering has rapidly emerged as a critical skill for effective interaction with large language models (LLMs). However, the cognitive and neural underpinnings of this expertise remain largely unexplored. This paper presents findings from a cross-sectional pilot fMRI study investigating differences in brain functional connectivity and network activity between experts and intermediate prompt engineers. Our results reveal distinct neural signatures associated with higher prompt engineering literacy, including increased functional connectivity in brain regions such as the left middle temporal gyrus and the left frontal pole, as well as altered power-frequency dynamics in key cognitive networks. These findings offer initial insights into the neurobiological basis of prompt engineering proficiency. We discuss the implications of these neurocognitive markers in Natural Language Processing (NLP). Understanding the neural basis of human expertise in interacting with LLMs can inform the design of more intuitive human-AI interfaces, contribute to cognitive models of LLM interaction, and potentially guide the development of AI systems that better align with human cognitive workflows. This interdisciplinary approach aims to bridge the gap between human cognition and machine intelligence, fostering a deeper understanding of how humans learn and adapt to complex AI systems.
☆ Long Chain-of-Thought Reasoning Across Languages SC
Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30\% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.
comment: Accepted to SCALR @ COLM 2025
☆ Evaluating Retrieval-Augmented Generation vs. Long-Context Input for Clinical Reasoning over EHRs
Electronic health records (EHRs) are long, noisy, and often redundant, posing a major challenge for the clinicians who must navigate them. Large language models (LLMs) offer a promising solution for extracting and reasoning over this unstructured text, but the length of clinical notes often exceeds even state-of-the-art models' extended context windows. Retrieval-augmented generation (RAG) offers an alternative by retrieving task-relevant passages from across the entire EHR, potentially reducing the amount of required input tokens. In this work, we propose three clinical tasks designed to be replicable across health systems with minimal effort: 1) extracting imaging procedures, 2) generating timelines of antibiotic use, and 3) identifying key diagnoses. Using EHRs from actual hospitalized patients, we test three state-of-the-art LLMs with varying amounts of provided context, using either targeted text retrieval or the most recent clinical notes. We find that RAG closely matches or exceeds the performance of using recent notes, and approaches the performance of using the models' full context while requiring drastically fewer input tokens. Our results suggest that RAG remains a competitive and efficient approach even as newer models become capable of handling increasingly longer amounts of text.
☆ Privileged Self-Access Matters for Introspection in AI
Whether AI models can introspect is an increasingly important practical question. But there is no consensus on how introspection is to be defined. Beginning from a recently proposed ''lightweight'' definition, we argue instead for a thicker one. According to our proposal, introspection in AI is any process which yields information about internal states through a process more reliable than one with equal or lower computational cost available to a third party. Using experiments where LLMs reason about their internal temperature parameters, we show they can appear to have lightweight introspection while failing to meaningfully introspect per our proposed definition.
☆ TransLLM: A Unified Multi-Task Foundation Framework for Urban Transportation via Learnable Prompting
Urban transportation systems encounter diverse challenges across multiple tasks, such as traffic forecasting, electric vehicle (EV) charging demand prediction, and taxi dispatch. Existing approaches suffer from two key limitations: small-scale deep learning models are task-specific and data-hungry, limiting their generalizability across diverse scenarios, while large language models (LLMs), despite offering flexibility through natural language interfaces, struggle with structured spatiotemporal data and numerical reasoning in transportation domains. To address these limitations, we propose TransLLM, a unified foundation framework that integrates spatiotemporal modeling with large language models through learnable prompt composition. Our approach features a lightweight spatiotemporal encoder that captures complex dependencies via dilated temporal convolutions and dual-adjacency graph attention networks, seamlessly interfacing with LLMs through structured embeddings. A novel instance-level prompt routing mechanism, trained via reinforcement learning, dynamically personalizes prompts based on input characteristics, moving beyond fixed task-specific templates. The framework operates by encoding spatiotemporal patterns into contextual representations, dynamically composing personalized prompts to guide LLM reasoning, and projecting the resulting representations through specialized output layers to generate task-specific predictions. Experiments across seven datasets and three tasks demonstrate the exceptional effectiveness of TransLLM in both supervised and zero-shot settings. Compared to ten baseline models, it delivers competitive performance on both regression and planning problems, showing strong generalization and cross-task adaptability. Our code is available at https://github.com/BiYunying/TransLLM.
☆ Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference
Large language models (LLMs) are increasingly applied in multilingual contexts, yet their capacity for consistent, logically grounded alignment across languages remains underexplored. We present a controlled evaluation framework for multilingual natural language inference (NLI) that generates synthetic, logic-based premise-hypothesis pairs and translates them into a typologically diverse set of languages. This design enables precise control over semantic relations and allows testing in both monolingual and mixed-language (code-switched) conditions. Surprisingly, code-switching does not degrade, and can even improve, performance, suggesting that translation-induced lexical variation may serve as a regularization signal. We validate semantic preservation through embedding-based similarity analyses and cross-lingual alignment visualizations, confirming the fidelity of translated pairs. Our findings expose both the potential and the brittleness of current LLM cross-lingual reasoning, and identify code-switching as a promising lever for improving multilingual robustness. Code available at: https://github.com/KurbanIntelligenceLab/nli-stress-testing
comment: Under review
☆ Transplant Then Regenerate: A New Paradigm for Text Data Augmentation EMNLP 2025
Data augmentation is a critical technique in deep learning. Traditional methods like Back-translation typically focus on lexical-level rephrasing, which primarily produces variations with the same semantics. While large language models (LLMs) have enhanced text augmentation by their "knowledge emergence" capability, controlling the style and structure of these outputs remains challenging and requires meticulous prompt engineering. In this paper, we propose LMTransplant, a novel text augmentation paradigm leveraging LLMs. The core idea of LMTransplant is transplant-then-regenerate: incorporating seed text into a context expanded by LLM, and asking the LLM to regenerate a variant based on the expanded context. This strategy allows the model to create more diverse and creative content-level variants by fully leveraging the knowledge embedded in LLMs, while preserving the core attributes of the original text. We evaluate LMTransplant across various text-related tasks, demonstrating its superior performance over existing text augmentation methods. Moreover, LMTransplant demonstrates exceptional scalability as the size of augmented data grows.
comment: Accepted by EMNLP 2025
☆ The Digital Sous Chef -- A Comparative Study on Fine-Tuning Language Models for Recipe Generation
We established a rigorous benchmark for text-based recipe generation, a fundamental task in natural language generation. We present a comprehensive comparative study contrasting a fine-tuned GPT-2 large (774M) model against the GPT-2 small (124M) model and traditional LSTM/RNN baselines on the 5-cuisine corpus from RecipeDB. Our key contribution is a targeted tokenization strategy that augments the vocabulary with 23 common fraction tokens and custom structural markers. This approach addresses a critical limitation of generic tokenizers by preserving essential recipe structures and precise numerical quantities, thereby enhancing domain specificity. Performance is evaluated using a comprehensive suite of seven automatic metrics spanning fluency (BLEU-4, METEOR), coherence (ROUGE-L), semantic relevance (BERTScore), and diversity. Our experiments show that the large transformer-based approach yields a >20% relative improvement in BERTScore (F1) (0.92 vs 0.72) over the best recurrent baseline, while reducing perplexity by 69.8%. We conclude with a discussion of remaining challenges, particularly regarding factual accuracy, and outline how this foundational study paves the way for integrating real-world constraints and multi-modal inputs in advanced recipe generation research.
comment: 8 pages, 4 figures. Code is available at: https://github.com/shubh-iiit/RecipeGPT2-Your-Own-AI-Chef
☆ ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine
Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.
☆ MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
The Model Context Protocol has emerged as a transformative standard for connecting large language models to external data sources and tools, rapidly gaining adoption across major AI providers and development platforms. However, existing benchmarks are overly simplistic and fail to capture real application challenges such as long-horizon reasoning and large, unfamiliar tool spaces. To address this critical gap, we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. To ensure rigorous evaluation, we implement execution-based evaluators, including format evaluators for agent format compliance, static evaluators for time-invariant content matching, and dynamic evaluators that automatically retrieve real-time ground truth for temporally sensitive tasks. Through extensive evaluation of leading LLMs, we find that even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%) and Claude-4.0-Sonnet (29.44%) exhibit significant performance limitations. In addition, our benchmark poses a significant long-context challenge for LLM agents, as the number of input tokens increases rapidly with the number of interaction steps. Moreover, it introduces an unknown-tools challenge, as LLM agents often lack familiarity with the precise usage of the MCP servers. Notably, enterprise-level agents like Cursor cannot achieve better performance than standard ReAct frameworks. Beyond evaluation, we open-source our extensible evaluation framework with UI support, enabling researchers and practitioners to seamlessly integrate new agents and MCP servers while fostering innovation in the rapidly evolving MCP ecosystem.
comment: Website: https://mcp-universe.github.io
☆ Improving in-context learning with a better scoring function
Large language models (LLMs) exhibit a remarkable capacity to learn by analogy, known as in-context learning (ICL). However, recent studies have revealed limitations in this ability. In this paper, we examine these limitations on tasks involving first-order quantifiers such as {\em all} and {\em some}, as well as on ICL with linear functions. We identify Softmax, the scoring function in attention mechanism, as a contributing factor to these constraints. To address this, we propose \textbf{scaled signed averaging (SSA)}, a novel alternative to Softmax. Empirical results show that SSA dramatically improves performance on our target tasks. Furthermore, we evaluate both encoder-only and decoder-only transformers models with SSA, demonstrating that they match or exceed their Softmax-based counterparts across a variety of linguistic probing tasks.
☆ Continuous sentiment scores for literary and multilingual contexts
Sentiment Analysis is widely used to quantify sentiment in text, but its application to literary texts poses unique challenges due to figurative language, stylistic ambiguity, as well as sentiment evocation strategies. Traditional dictionary-based tools often underperform, especially for low-resource languages, and transformer models, while promising, typically output coarse categorical labels that limit fine-grained analysis. We introduce a novel continuous sentiment scoring method based on concept vector projection, trained on multilingual literary data, which more effectively captures nuanced sentiment expressions across genres, languages, and historical periods. Our approach outperforms existing tools on English and Danish texts, producing sentiment scores whose distribution closely matches human ratings, enabling more accurate analysis and sentiment arc modeling in literature.
comment: 16 pages after compiling, 3025 words, 6 figures, 5 tables and an algorithm
☆ Filling the Gap for Uzbek: Creating Translation Resources for Southern Uzbek
Southern Uzbek (uzs) is a Turkic language variety spoken by around 5 million people in Afghanistan and differs significantly from Northern Uzbek (uzn) in phonology, lexicon, and orthography. Despite the large number of speakers, Southern Uzbek is underrepresented in natural language processing. We present new resources for Southern Uzbek machine translation, including a 997-sentence FLORES+ dev set, 39,994 parallel sentences from dictionary, literary, and web sources, and a fine-tuned NLLB-200 model (lutfiy). We also propose a post-processing method for restoring Arabic-script half-space characters, which improves handling of morphological boundaries. All datasets, models, and tools are released publicly to support future work on Southern Uzbek and other low-resource languages.
☆ Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields a 16% improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a 6% reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
☆ Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs
Recent advances in large language models (LLMs) and reasoning frameworks have opened new possibilities for improving the perspective -taking capabilities of autonomous agents. However, tasks that involve active perception, collaborative reasoning, and perspective taking (understanding what another agent can see or knows) pose persistent challenges for current LLM-based systems. This study investigates the potential of structured examples derived from transformed solution graphs generated by the Fast Downward planner to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three distinct categories of examples: optimal goal paths (G-type), informative node paths (E-type), and step-by-step optimal decision sequences contrasting alternative actions (L-type). These solutions are further converted into ``thought-action'' examples by prompting an LLM to explicitly articulate the reasoning behind each decision. While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements. Agents are successful in tasks requiring basic attentional filtering but struggle in scenarios that required mentalising about occluded spaces or weighing the costs of epistemic actions. These findings suggest that structured examples alone are insufficient for robust perspective-taking, underscoring the need for explicit belief tracking, cost modelling, and richer environments to enable socially grounded collaboration in LLM-based agents.
comment: Accepted at ICSR25
☆ EmoTale: An Enacted Speech-emotion Dataset in Danish
While multiple emotional speech corpora exist for commonly spoken languages, there is a lack of functional datasets for smaller (spoken) languages, such as Danish. To our knowledge, Danish Emotional Speech (DES), published in 1997, is the only other database of Danish emotional speech. We present EmoTale; a corpus comprising Danish and English speech recordings with their associated enacted emotion annotations. We demonstrate the validity of the dataset by investigating and presenting its predictive power using speech emotion recognition (SER) models. We develop SER models for EmoTale and the reference datasets using self-supervised speech model (SSLM) embeddings and the openSMILE feature extractor. We find the embeddings superior to the hand-crafted features. The best model achieves an unweighted average recall (UAR) of 64.1% on the EmoTale corpus using leave-one-speaker-out cross-validation, comparable to the performance on DES.
comment: To appear in the proceedings of ASRU 2025
☆ Reasoning is about giving reasons
Convincing someone of the truth value of a premise requires understanding and articulating the core logical structure of the argument which proves or disproves the premise. Understanding the logical structure of an argument refers to understanding the underlying "reasons" which make up the proof or disproof of the premise - as a function of the "logical atoms" in the argument. While it has been shown that transformers can "chain" rules to derive simple arguments, the challenge of articulating the "reasons" remains. Not only do current approaches to chaining rules suffer in terms of their interpretability, they are also quite constrained in their ability to accommodate extensions to theoretically equivalent reasoning tasks - a model trained to chain rules cannot support abduction or identify contradictions. In this work we suggest addressing these shortcomings by identifying an intermediate representation (which we call the Representation of the Logical Structure (RLS) of the argument) that possesses an understanding of the logical structure of a natural language argument - the logical atoms in the argument and the rules incorporating them. Given the logical structure, reasoning is deterministic and easy to compute. Therefore, our approach supports all forms of reasoning that depend on the logical structure of the natural language argument, including arbitrary depths of reasoning, on-the-fly mistake rectification and interactive discussion with respect to an argument. We show that we can identify and extract the logical structure of natural language arguments in three popular reasoning datasets with high accuracies, thus supporting explanation generation and extending the reasoning capabilities significantly.
☆ In2x at WMT25 Translation Task
This paper presents the open-system submission by the In2x research team for the WMT25 General Machine Translation Shared Task. Our submission focuses on Japanese-related translation tasks, aiming to explore a generalizable paradigm for extending large language models (LLMs) to other languages. This paradigm encompasses aspects such as data construction methods and reward model design. The ultimate goal is to enable large language model systems to achieve exceptional performance in low-resource or less commonly spoken languages.
☆ DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
comment: 18 pages, 4 figures,
☆ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
☆ Knowledge Graph-Infused Fine-Tuning for Structured Reasoning in Large Language Models
This paper addresses the problems of missing reasoning chains and insufficient entity-level semantic understanding in large language models when dealing with tasks that require structured knowledge. It proposes a fine-tuning algorithm framework based on knowledge graph injection. The method builds on pretrained language models and introduces structured graph information for auxiliary learning. A graph neural network is used to encode entities and their relations, constructing a graph-based semantic representation. A fusion mechanism is then designed to jointly model the knowledge graph embeddings with the contextual representations from the language model. To enhance the robustness of knowledge integration, a gating mechanism is introduced to dynamically balance the contributions of linguistic semantics and structural knowledge. This effectively mitigates conflicts between different representational spaces. During training, a joint loss function is constructed to account for both task performance and structural alignment objectives. This helps improve the accuracy of entity prediction and semantic reasoning. The study also includes a series of systematic sensitivity experiments. It evaluates the effects of learning rate, graph coverage, and structural perturbations on model performance. The results further validate the effectiveness and stability of the proposed method across tasks such as entity recognition, question answering, and language generation. Experimental findings show that the proposed structure-aware fine-tuning framework significantly enhances the model's ability to represent complex semantic units. It demonstrates better semantic consistency and contextual logic modeling in scenarios involving structural reasoning and entity extraction.
☆ Cognitive Surgery: The Awakening of Implicit Territorial Awareness in LLMs
Large language models (LLMs) have been shown to possess a degree of self-recognition capability-the ability to identify whether a given text was generated by themselves. Prior work has demonstrated that this capability is reliably expressed under the Pair Presentation Paradigm (PPP), where the model is presented with two texts and asked to choose which one it authored. However, performance deteriorates sharply under the Individual Presentation Paradigm (IPP), where the model is given a single text to judge authorship. Although this phenomenon has been observed, its underlying causes have not been systematically analyzed. In this paper, we first replicate existing findings to confirm that LLMs struggle to distinguish self- from other-generated text under IPP. We then investigate the reasons for this failure and attribute it to a phenomenon we term Implicit Territorial Awareness (ITA)-the model's latent ability to distinguish self- and other-texts in representational space, which remains unexpressed in its output behavior. To awaken the ITA of LLMs, we propose Cognitive Surgery (CoSur), a novel framework comprising four main modules: representation extraction, territory construction, authorship discrimination and cognitive editing. Experimental results demonstrate that our proposed method improves the performance of three different LLMs in the IPP scenario, achieving average accuracies of 83.25%, 66.19%, and 88.01%, respectively.
☆ DEPTH: Hallucination-Free Relation Extraction via Dependency-Aware Sentence Simplification and Two-tiered Hierarchical Refinement
Relation extraction enables the construction of structured knowledge for many downstream applications. While large language models (LLMs) have shown great promise in this domain, most existing methods concentrate on relation classification, which predicts the semantic relation type between a related entity pair. However, we observe that LLMs often struggle to reliably determine whether a relation exists, especially in cases involving complex sentence structures or intricate semantics, which leads to spurious predictions. Such hallucinations can introduce noisy edges in knowledge graphs, compromising the integrity of structured knowledge and downstream reliability. To address these challenges, we propose DEPTH, a framework that integrates Dependency-aware sEntence simPlification and Two-tiered Hierarchical refinement into the relation extraction pipeline. Given a sentence and its candidate entity pairs, DEPTH operates in two stages: (1) the Grounding module extracts relations for each pair by leveraging their shortest dependency path, distilling the sentence into a minimal yet coherent relational context that reduces syntactic noise while preserving key semantics; (2) the Refinement module aggregates all local predictions and revises them based on a holistic understanding of the sentence, correcting omissions and inconsistencies. We further introduce a causality-driven reward model that mitigates reward hacking by disentangling spurious correlations, enabling robust fine-tuning via reinforcement learning with human feedback. Experiments on six benchmarks demonstrate that DEPTH reduces the average hallucination rate to 7.0\% while achieving a 17.2\% improvement in average F1 score over state-of-the-art baselines.
☆ Credence Calibration Game? Calibrating Large Language Models through Structured Play
As Large Language Models (LLMs) are increasingly deployed in decision-critical domains, it becomes essential to ensure that their confidence estimates faithfully correspond to their actual correctness. Existing calibration methods have primarily focused on post-hoc adjustments or auxiliary model training; however, many of these approaches necessitate additional supervision or parameter updates. In this work, we propose a novel prompt-based calibration framework inspired by the Credence Calibration Game. Our method establishes a structured interaction loop wherein LLMs receive feedback based on the alignment of their predicted confidence with correctness. Through feedback-driven prompting and natural language summaries of prior performance, our framework dynamically improves model calibration. Extensive experiments across models and game configurations demonstrate consistent improvements in evaluation metrics. Our results highlight the potential of game-based prompting as an effective strategy for LLM calibration. Code and data are available at https://anonymous.4open.science/r/LLM-Calibration/.
☆ ZPD-SCA: Unveiling the Blind Spots of LLMs in Assessing Students' Cognitive Abilities
Large language models (LLMs) have demonstrated potential in educational applications, yet their capacity to accurately assess the cognitive alignment of reading materials with students' developmental stages remains insufficiently explored. This gap is particularly critical given the foundational educational principle of the Zone of Proximal Development (ZPD), which emphasizes the need to match learning resources with Students' Cognitive Abilities (SCA). Despite the importance of this alignment, there is a notable absence of comprehensive studies investigating LLMs' ability to evaluate reading comprehension difficulty across different student age groups, especially in the context of Chinese language education. To fill this gap, we introduce ZPD-SCA, a novel benchmark specifically designed to assess stage-level Chinese reading comprehension difficulty. The benchmark is annotated by 60 Special Grade teachers, a group that represents the top 0.15% of all in-service teachers nationwide. Experimental results reveal that LLMs perform poorly in zero-shot learning scenarios, with Qwen-max and GLM even falling below the probability of random guessing. When provided with in-context examples, LLMs performance improves substantially, with some models achieving nearly double the accuracy of their zero-shot baselines. These results reveal that LLMs possess emerging abilities to assess reading difficulty, while also exposing limitations in their current training for educationally aligned judgment. Notably, even the best-performing models display systematic directional biases, suggesting difficulties in accurately aligning material difficulty with SCA. Furthermore, significant variations in model performance across different genres underscore the complexity of task. We envision that ZPD-SCA can provide a foundation for evaluating and improving LLMs in cognitively aligned educational applications.
☆ ISCA: A Framework for Interview-Style Conversational Agents
We present a low-compute non-generative system for implementing interview-style conversational agents which can be used to facilitate qualitative data collection through controlled interactions and quantitative analysis. Use cases include applications to tracking attitude formation or behavior change, where control or standardization over the conversational flow is desired. We show how our system can be easily adjusted through an online administrative panel to create new interviews, making the tool accessible without coding. Two case studies are presented as example applications, one regarding the Expressive Interviewing system for COVID-19 and the other a semi-structured interview to survey public opinion on emerging neurotechnology. Our code is open-source, allowing others to build off of our work and develop extensions for additional functionality.
☆ Beyond Semantic Similarity: Reducing Unnecessary API Calls via Behavior-Aligned Retriever
Tool-augmented large language models (LLMs) leverage external functions to extend their capabilities, but inaccurate function calls can lead to inefficiencies and increased costs.Existing methods address this challenge by fine-tuning LLMs or using demonstration-based prompting, yet they often suffer from high training overhead and fail to account for inconsistent demonstration samples, which misguide the model's invocation behavior. In this paper, we trained a behavior-aligned retriever (BAR), which provides behaviorally consistent demonstrations to help LLMs make more accurate tool-using decisions. To train the BAR, we construct a corpus including different function-calling behaviors, i.e., calling or non-calling.We use the contrastive learning framework to train the BAR with customized positive/negative pairs and a dual-negative contrastive loss, ensuring robust retrieval of behaviorally consistent examples.Experiments demonstrate that our approach significantly reduces erroneous function calls while maintaining high task performance, offering a cost-effective and efficient solution for tool-augmented LLMs.
SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. During generation, SurveyGen-I leverages this memory mechanism to maintain coherence across subsections. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
comment: The code is available at https://github.com/SurveyGens/SurveyGen-I , 20 pages, 16 figures
☆ aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists
Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI-generated research content collides with a fragmented and largely closed publication ecosystem. Traditional journals and conferences rely on human peer review, making them difficult to scale and often reluctant to accept AI-generated research content; existing preprint servers (e.g. arXiv) lack rigorous quality-control mechanisms. Consequently, a significant amount of high-quality AI-generated research lacks appropriate venues for dissemination, hindering its potential to advance scientific progress. To address these challenges, we introduce aiXiv, a next-generation open-access platform for human and AI scientists. Its multi-agent architecture allows research proposals and papers to be submitted, reviewed, and iteratively refined by both human and AI scientists. It also provides API and MCP interfaces that enable seamless integration of heterogeneous human and AI scientists, creating a scalable and extensible ecosystem for autonomous scientific discovery. Through extensive experiments, we demonstrate that aiXiv is a reliable and robust platform that significantly enhances the quality of AI-generated research proposals and papers after iterative revising and reviewing on aiXiv. Our work lays the groundwork for a next-generation open-access ecosystem for AI scientists, accelerating the publication and dissemination of high-quality AI-generated research content. Code is available at https://github.com/aixiv-org. Website is available at https://forms.gle/DxQgCtXFsJ4paMtn8.
comment: Preprint under review. Code is available at https://github.com/aixiv-org. Website is available at https://forms.gle/DxQgCtXFsJ4paMtn8
☆ Open-Universe Assistance Games
Embodied AI agents must infer and act in an interpretable way on diverse human goals and preferences that are not predefined. To formalize this setting, we introduce Open-Universe Assistance Games (OU-AGs), a framework where the agent must reason over an unbounded and evolving space of possible goals. In this context, we introduce GOOD (GOals from Open-ended Dialogue), a data-efficient, online method that extracts goals in the form of natural language during an interaction with a human, and infers a distribution over natural language goals. GOOD prompts an LLM to simulate users with different complex intents, using its responses to perform probabilistic inference over candidate goals. This approach enables rich goal representations and uncertainty estimation without requiring large offline datasets. We evaluate GOOD in a text-based grocery shopping domain and in a text-operated simulated household robotics environment (AI2Thor), using synthetic user profiles. Our method outperforms a baseline without explicit goal tracking, as confirmed by both LLM-based and human evaluations.
comment: 7 pages + 2 pages references + 7 pages appendix
☆ LLMs and Agentic AI in Insurance Decision-Making: Opportunities and Challenges For Africa
In this work, we highlight the transformative potential of Artificial Intelligence (AI), particularly Large Language Models (LLMs) and agentic AI, in the insurance sector. We consider and emphasize the unique opportunities, challenges, and potential pathways in insurance amid rapid performance improvements, increased open-source access, decreasing deployment costs, and the complexity of LLM or agentic AI frameworks. To bring it closer to home, we identify critical gaps in the African insurance market and highlight key local efforts, players, and partnership opportunities. Finally, we call upon actuaries, insurers, regulators, and tech leaders to a collaborative effort aimed at creating inclusive, sustainable, and equitable AI strategies and solutions: by and for Africans.
☆ Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset
Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets-including MegaMath, FineMath, and OpenWebMath-but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content--including math--from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
☆ Mapping the Course for Prompt-based Structured Prediction
LLMs have been shown to be useful for a variety of language tasks, without requiring task-specific fine-tuning. However, these models often struggle with hallucinations and complex reasoning problems due to their autoregressive nature. We propose to address some of these issues, specifically in the area of structured prediction, by combining LLMs with combinatorial inference in an attempt to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can effectively estimate LLM confidence values for use with symbolic inference, and show that, regardless of the prompting strategy, the addition of symbolic inference on top of prompting alone leads to more consistent and accurate predictions. Additionally, we show that calibration and fine-tuning using structured prediction objectives leads to increased performance for challenging tasks, showing that structured learning is still valuable in the era of LLMs.
☆ LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
LongRecall. The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
☆ Don't Think Twice! Over-Reasoning Impairs Confidence Calibration ICML 2025
Large Language Models deployed as question answering tools require robust calibration to avoid overconfidence. We systematically evaluate how reasoning capabilities and budget affect confidence assessment accuracy, using the ClimateX dataset (Lacombe et al., 2023) and expanding it to human and planetary health. Our key finding challenges the "test-time scaling" paradigm: while recent reasoning LLMs achieve 48.7% accuracy in assessing expert confidence, increasing reasoning budgets consistently impairs rather than improves calibration. Extended reasoning leads to systematic overconfidence that worsens with longer thinking budgets, producing diminishing and negative returns beyond modest computational investments. Conversely, search-augmented generation dramatically outperforms pure reasoning, achieving 89.3% accuracy by retrieving relevant evidence. Our results suggest that information access, rather than reasoning depth or inference budget, may be the critical bottleneck for improved confidence calibration of knowledge-intensive tasks.
comment: Published at ICML 2025 Workshop on Reliable and Responsible Foundation Models
☆ Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner
Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by the speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
☆ Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems EMNLP 2025
Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems often are based on intent recognition to accurately identify the user's desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users' underlying intentions for English, a significant challenge in the scarcity of training data persists, which impedes multilingual generalization. Besides, the support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.
comment: Accepted at EMNLP 2025 Findings, camera-ready version
☆ Improving LLMs for Machine Translation Using Synthetic Preference Data ECAI 2025
Large language models have emerged as effective machine translation systems. In this paper, we explore how a general instruction-tuned large language model can be improved for machine translation using relatively few easily produced data resources. Using Slovene as a use case, we improve the GaMS-9B-Instruct model using Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset. As DPO requires pairs of quality-ranked instances, we generated its training dataset by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct. We ranked the resulting translations based on heuristics coupled with automatic evaluation metrics such as COMET. The evaluation shows that our fine-tuned model outperforms both models involved in the dataset generation. In comparison to the baseline models, the fine-tuned model achieved a COMET score gain of around 0.04 and 0.02, respectively, on translating Wikipedia articles. It also more consistently avoids language and formatting errors.
comment: Paper with individual presentation at LUHME workshop at ECAI 2025
♻ ☆ RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0{\deg}, 90{\deg}, 180{\deg}, and 270{\deg}. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0{\deg}) images, while certain models are able to identify upside-down (180{\deg}) images. None can reliably distinguish between 90{\deg} and 270{\deg}. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90{\deg} and 270{\deg} rotations, despite substantially improving the identification of 180{\deg} images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
comment: 20 pages. Code and data: https://github.com/tianyiniu/RotBench
♻ ☆ Task-Oriented Automatic Fact-Checking with Frame-Semantics
We propose a novel paradigm for automatic fact-checking that leverages frame semantics to enhance the structured understanding of claims and guide the process of fact-checking them. To support this, we introduce a pilot dataset of real-world claims extracted from PolitiFact, specifically annotated for large-scale structured data. This dataset underpins two case studies: the first investigates voting-related claims using the Vote semantic frame, while the second explores various semantic frames based on data sources from the Organisation for Economic Co-operation and Development (OECD). Our findings demonstrate the effectiveness of frame semantics in improving evidence retrieval and explainability for fact-checking. Finally, we conducted a survey of frames evoked in fact-checked claims, identifying high-impact frames to guide future work in this direction.
♻ ☆ Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Synthetic data generation has recently emerged as a promising approach for enhancing the capabilities of large language models (LLMs) without the need for expensive human annotations. However, existing methods often generate data that can be low quality or contrived. In this paper, we introduce Source2Synth, a scalable approach for synthetic data generation and curation that is grounded in real-world data sources. Source2Synth takes as input a custom data source and produces synthetic data examples with intermediate reasoning steps. Our method improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two tasks that leverage two different types of data: multi-hop question answering (MHQA), where we test complex reasoning abilities leveraging documents, and tabular question answering (TQA), where we test tool usage leveraging tables. Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotpotQA compared to the fine-tuned baselines.
♻ ☆ JudgeLRM: Large Reasoning Models as a Judge
The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples - highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
comment: Preprint
♻ ☆ TASER: Table Agents for Schema-guided Extraction and Recommendation
Real-world financial documents report essential information about an entity's financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.
comment: Withdrawn due to missing key sections in the paper
♻ ☆ G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model
Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities, which encourages extensive research on their application in mathematical problem solving. However, current work has been largely focused on text-based mathematical problems, with limited investigation in problems involving geometric information. Addressing this gap, we aim to enable LLMs to solve geometric problems by understanding image input. We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehending basic geometric elements and their relationships. To overcome these challenges, we take advantage of the unique characteristics of geometric problems (such as unique geometric logical form, and geometric scalability) and the capacity of the textual LLMs to build an enriched multimodal geometry dataset based on existing data. The augmented dataset, Geo170K, contains more than 170K geometric image-caption and question-answer pairs. Utilizing our constructed Geo170K dataset, we develop G-LLaVA, which demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
comment: 10 pages
♻ ☆ Coupling without Communication and Drafter-Invariant Speculative Decoding
Suppose Alice has a distribution $P$ and Bob has a distribution $Q$. Alice wants to draw a sample $a\sim P$ and Bob a sample $b \sim Q$ such that $a = b$ with as high of probability as possible. It is well-known that, by sampling from an optimal coupling between the distributions, Alice and Bob can achieve $\Pr[a = b] = 1 - D_{TV}(P,Q)$, where $D_{TV}(P,Q)$ is the total variation distance between $P$ and $Q$. What if Alice and Bob must solve this same problem \emph{without communicating at all?} Perhaps surprisingly, with access to public randomness, they can still achieve $\Pr[a = b] \geq \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)} \geq 1-2D_{TV}(P,Q)$ using a simple protocol based on the Weighted MinHash algorithm. This bound was shown to be optimal in the worst-case by [Bavarian et al., 2020]. In this work, we revisit the communication-free coupling problem. We provide a simpler proof of the optimality result from [Bavarian et al., 2020]. We show that, while the worst-case success probability of Weighted MinHash cannot be improved, an equally simple protocol based on Gumbel sampling offers a Pareto improvement: for every pair of distributions $P, Q$, Gumbel sampling achieves an equal or higher value of $\Pr[a = b]$ than Weighted MinHash. Importantly, this improvement translates to practice. We demonstrate an application of communication-free coupling to \emph{speculative decoding}, a recent method for accelerating autoregressive large language models [Leviathan, Kalman, Matias, ICML 2023]. We show that communication-free protocols can be used to contruct \emph{\CSD{}} schemes, which have the desirable property that their output is fixed given a fixed random seed, regardless of what drafter is used for speculation. In experiments on a language generation task, Gumbel sampling outperforms Weighted MinHash. Code is available at https://github.com/majid-daliri/DISD.
comment: 18 pages
♻ ☆ Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement ACM MM 2025
Translating chart images into executable plotting scripts-referred to as the chart-to-code generation task-requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs-making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.
comment: Accepted by ACM MM 2025
♻ ☆ Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
comment: UQLM repository: https://github.com/cvs-health/uqlm
♻ ☆ Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models
Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, including admission reasons, major in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization. Our results reveal that while the LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission reasons and hospitalization events, they are generally less consistent when it comes to identifying follow-up recommendations, highlighting broader challenges in leveraging LLMs for comprehensive summarization.
♻ ☆ From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity to operate according to internal rules without external control. Accordingly, autonomous vehicles (AuVs) are viewed as vehicular systems capable of perceiving their environment and executing pre-programmed tasks independently of external input. However, both research and real-world deployments increasingly showcase vehicles that demonstrate behaviors beyond this definition (including the SAE levels 0 to 5); Examples of this outpace include the interaction with humans with natural language, goal adaptation, contextual reasoning, external tool use, and unseen ethical dilemma handling, largely empowered by multi-modal large language models (LLMs). These developments reveal a conceptual gap between technical autonomy and the broader cognitive and social capabilities needed for future human-centered mobility systems. To address this gap, this paper introduces the concept of agentic vehicles (AgVs), referring to vehicles that integrate agentic AI systems to reason, adapt, and interact within complex environments. This paper proposes the term AgVs and their distinguishing characteristics from conventional AuVs. It synthesizes relevant advances in integrating LLMs and AuVs and highlights how AgVs might transform future mobility systems and ensure the systems are human-centered. The paper concludes by identifying key challenges in the development and governance of AgVs, and how they can play a significant role in future agentic transportation systems.
♻ ☆ Is neural semantic parsing good at ellipsis resolution, or isn't it?
Neural semantic parsers have shown good overall performance for a variety of linguistic phenomena, reaching semantic matching scores of more than 90%. But how do such parsers perform on strongly context-sensitive phenomena, where large pieces of semantic information need to be duplicated to form a meaningful semantic representation? A case in point is English verb phrase ellipsis, a construct where entire verb phrases can be abbreviated by a single auxiliary verb. Are the otherwise known as powerful semantic parsers able to deal with ellipsis or aren't they? We constructed a corpus of 120 cases of ellipsis with their fully resolved meaning representation and used this as a challenge set for a large battery of neural semantic parsers. Although these parsers performed very well on the standard test set, they failed in the instances with ellipsis. Data augmentation helped improve the parsing results. The reason for the difficulty of parsing elided phrases is not that copying semantic material is hard, but that usually occur in linguistically complicated contexts causing most of the parsing errors.
comment: Accepted by 16th IWCS
♻ ☆ Retrieval-Augmented Semantic Parsing: Improving Generalization with Lexical Knowledge
Open-domain semantic parsing remains a challenging task, as neural models often rely on heuristics and struggle to handle unseen concepts. In this paper, we investigate the potential of large language models (LLMs) for this task and introduce Retrieval-Augmented Semantic Parsing (RASP), a simple yet effective approach that integrates external symbolic knowledge into the parsing process. Our experiments not only show that LLMs outperform previous encoder-decoder baselines for semantic parsing, but that RASP further enhances their ability to predict unseen concepts, nearly doubling the performance of previous models on out-of-distribution concepts. These findings highlight the promise of leveraging large language models and retrieval mechanisms for robust and open-domain semantic parsing.
comment: Accpted by 16th IWCS
♻ ☆ Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs
Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
♻ ☆ STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples AAAI 2026
Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. While recent models often achieve higher scores on standard benchmarks, these improvements do not consistently reflect enhanced real-world reasoning capabilities. Moreover, widespread overfitting to public benchmarks and the high computational cost of full evaluations have made it both expensive and less effective to distinguish meaningful differences between models. To address these challenges, we propose the \textbf{S}tructured \textbf{T}ransition \textbf{E}valuation \textbf{M}ethod (STEM), a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs. STEM identifies \textit{significant transition samples} (STS) by analyzing consistent performance transitions among LLMs of the same architecture but varying parameter scales. These samples enable STEM to effectively estimate the capability position of an unknown model. Qwen3 model family is applied to construct the STS pool on six diverse and representative benchmarks. To assess generalizability. Experimental results indicate that STEM reliably captures performance trends, aligns with ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs.
comment: Submit to AAAI 2026
♻ ☆ Advancing Language Multi-Agent Learning with Credit Re-Assignment for Interactive Environment Generalization
LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, and other domains beyond computer using. Current multi-agent systems universally excel in performance, compared to single agents, but struggle with generalization across environments due to predefined roles and inadequate strategies for generalizing language agents. The challenge of achieving both strong performance and good generalization has hindered the progress of multi-agent systems for interactive environments. To address these issues, we propose CollabUIAgents, a multi-agent reinforcement learning framework with a novel multi-agent credit re-assignment (CR) strategy, assigning process rewards with LLMs rather than environment-specific rewards and learning with synthesized preference data, in order to foster generalizable, collaborative behaviors among the role-free agents' policies. Empirical results show that our framework improves both performance and cross-environment generalizability of multi-agent systems. Moreover, our 7B-parameter system achieves results on par with or exceed strong closed-source models, and the LLM that guides the CR. We also provide insights in using granular CR rewards effectively for environment generalization, and accommodating trained LLMs in multi-agent systems.
comment: Published in COLM2025
♻ ☆ Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks. Specifically, Critique-GRPO improves average pass@1 scores across all compared methods by approximately +4.4% on Qwen2.5-7B-Base and +3.8% on Qwen3-8B. Notably, Critique-GRPO enables effective self-improvement through self-critiquing, achieving significant gains over GRPO, e.g., +16.7% pass@1 improvement on AIME 2024.
comment: 52 pages, updated with new experimental results and implementation details
♻ ☆ Enhancing Temporal Sensitivity of Large Language Model for Recommendation with Counterfactual Tuning
Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users' historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose \underline{C}ounterfactual \underline{E}nhanced \underline{T}emporal Framework for LLM-Based \underline{Rec}ommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs' awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items). Extensive experiments on real-world datasets demonstrate the effectiveness of our CETRec. Our code is available at https://anonymous.4open.science/r/CETRec-B9CE/.
♻ ☆ CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description
Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation-Execution Description Language (EDL)-to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks-SpiderUnion and BirdUnion-demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
♻ ☆ ReSpark: Leveraging Previous Data Reports as References to Generate New Reports with LLMs
Creating data reports is a labor-intensive task involving iterative data exploration, insight extraction, and narrative construction. A key challenge lies in composing the analysis logic-from defining objectives and transforming data to identifying and communicating insights. Manually crafting this logic can be cognitively demanding. While experienced analysts often reuse scripts from past projects, finding a perfect match for a new dataset is rare. Even when similar analyses are available online, they usually share only results or visualizations, not the underlying code, making reuse difficult. To address this, we present ReSpark, a system that leverages large language models (LLMs) to reverse-engineer analysis logic from existing reports and adapt it to new datasets. By generating draft analysis steps, ReSpark provides a warm start for users. It also supports interactive refinement, allowing users to inspect intermediate outputs, insert objectives, and revise content. We evaluate ReSpark through comparative and user studies, demonstrating its effectiveness in lowering the barrier to generating data reports without relying on existing analysis code.
♻ ☆ Social Debiasing for Fair Multi-modal LLMs
Multi-modal Large Language Models (MLLMs) have dramatically advanced the research field and delivered powerful vision-language understanding capabilities. However, these models often inherit deep-rooted social biases from their training data, leading to uncomfortable responses with respect to attributes such as race and gender. This paper addresses the issue of social biases in MLLMs by i) introducing a comprehensive counterfactual dataset with multiple social concepts (CMSC), which complements existing datasets by providing 18 diverse and balanced social concepts; and ii) proposing a counter-stereotype debiasing (CSD) strategy that mitigates social biases in MLLMs by leveraging the opposites of prevalent stereotypes. CSD incorporates both a novel bias-aware data sampling method and a loss rescaling method, enabling the model to effectively reduce biases. We conduct extensive experiments with four prevalent MLLM architectures. The results demonstrate the advantage of the CMSC dataset and the edge of CSD strategy in reducing social biases compared to existing competing methods, without compromising the overall performance on general multi-modal reasoning benchmarks.
comment: Project page: https://github.com/xaCheng1996/Social_Debiasing_For_Fair_MLLMs
♻ ☆ FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Tibetan is a low-resource language with minimal parallel speech corpora spanning its three major dialects-\"U-Tsang, Amdo, and Kham-limiting progress in speech modeling. To address this issue, we propose FMSD-TTS, a few-shot, multi-speaker, multi-dialect text-to-speech framework that synthesizes parallel dialectal speech from limited reference audio and explicit dialect labels. Our method features a novel speaker-dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects while preserving speaker identity. Extensive objective and subjective evaluations demonstrate that FMSD-TTS significantly outperforms baselines in both dialectal expressiveness and speaker similarity. We further validate the quality and utility of the synthesized speech through a challenging speech-to-speech dialect conversion task. Our contributions include: (1) a novel few-shot TTS system tailored for Tibetan multi-dialect speech synthesis, (2) the public release of a large-scale synthetic Tibetan speech corpus generated by FMSD-TTS, and (3) an open-source evaluation toolkit for standardized assessment of speaker similarity, dialect consistency, and audio quality.
comment: 18 pages
♻ ☆ Each to Their Own: Exploring the Optimal Embedding in RAG
Recently, as Large Language Models (LLMs) have fundamentally impacted various fields, the methods for incorporating up-to-date information into LLMs or adding external knowledge to construct domain-specific models have garnered wide attention. Retrieval-Augmented Generation (RAG), serving as an inference-time scaling method, is notable for its low cost and minimal effort for parameter tuning. However, due to heterogeneous training data and model architecture, the variant embedding models used in RAG exhibit different benefits across various areas, often leading to different similarity calculation results and, consequently, varying response quality from LLMs. To address this problem, we propose and examine two approaches to enhance RAG by combining the benefits of multiple embedding models, named Mixture-Embedding RAG and Confident RAG. Mixture-Embedding RAG simply sorts and selects retrievals from multiple embedding models based on standardized similarity; however, it does not outperform vanilla RAG. In contrast, Confident RAG generates responses multiple times using different embedding models and then selects the responses with the highest confidence level, demonstrating average improvements of approximately 10% and 5% over vanilla LLMs and RAG, respectively. The consistent results across different LLMs and embedding models indicate that Confident RAG is an efficient plug-and-play approach for various domains. We will release our code upon publication.
♻ ☆ Input Time Scaling
Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data & training scaling) and doing reasoning in test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also find a new phenomenon, training-testing co-design there. We need to apply query strategies during both training and testing. Only applying strategies on training or testing would seriously degrade the performance. We are also surprised to find that seemingly low data quality datasets can gain high performance. Adding irrelevant information to the queries, randomly selecting examples from a minimally filtered dataset, can even perform the best. These findings contradict the widely held inductive bias, "garbage in, garbage out". Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data with similar quality (15k VS 1k) perform worse, simple dataset size scaling should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon. A small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct, we are able to reach SOTA performance among 32B models on AIME24(76.7%) and AIME25(76.7%) pass@1. We can further achieve AIME24(76.7%) and AIME25(80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result would be 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-source our datasets, data pipelines, evaluation results, and checkpoints.
♻ ☆ Investigating Transcription Normalization in the Faetar ASR Benchmark
We examine the role of transcription inconsistencies in the Faetar Automatic Speech Recognition benchmark, a challenging low-resource ASR benchmark. With the help of a small, hand-constructed lexicon, we conclude that find that, while inconsistencies do exist in the transcriptions, they are not the main challenge in the task. We also demonstrate that bigram word-based language modelling is of no added benefit, but that constraining decoding to a finite lexicon can be beneficial. The task remains extremely difficult.
♻ ☆ Deliberate Reasoning in Language Models as Structure-Aware Planning with an Accurate World Model ACL25
Enhancing the reasoning capabilities of language models (LMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making where existing Chain-of-Thought (CoT) approaches struggle with consistency and verification. In this paper, we propose a novel reasoning framework, referred to as Structure-aware Planning with an Accurate World Model (SWAP), that integrates structured knowledge representation with learned planning. Unlike prior methods that rely purely on natural language reasoning, SWAP leverages entailment graphs to encode structured dependencies and enable symbolic verification of intermediate steps. To systematically construct and update the graph, SWAP employs a policy model to propose candidate expansions and a world model to predict structural updates. To improve accuracy, the world model generates multiple alternative updates, and a discriminator re-ranks them based on plausibility. To encourage diverse exploration, we introduce Diversity-based Modelling (DM), which samples candidates from the remaining probability mass after removing previously sampled candidates from the original policy distribution. Additionally, SWAP improves the discrimination accuracy through Contrastive Ranking (CR), which directly compares candidates within prompts and incorporates meta-knowledge to improve ranking quality. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP significantly improves upon the base models and consistently outperforms existing reasoning methods.
comment: ACL25 (main)
♻ ☆ Chain of Correction for Full-text Speech Recognition with Large Language Models
Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) is attracting increased attention for its ability to address a wide range of error types, such as punctuation restoration and inverse text normalization, across long context. However, challenges remain regarding stability, controllability, completeness, and fluency. To mitigate these issues, this paper proposes the Chain of Correction (CoC), which uses a multi-turn chat format to correct errors segment by segment, guided by pre-recognized text and full-text context for better semantic understanding. Utilizing the open-sourced ChFT dataset, we fine-tune a pre-trained LLM to evaluate CoC's performance. Experiments show that CoC significantly outperforms baseline and benchmark systems in correcting full-text ASR outputs. We also analyze correction thresholds to balance under-correction and over-rephrasing, extrapolate CoC on extra-long ASR outputs, and explore using other types of information to guide error correction.
♻ ☆ Enhancing Depression-Diagnosis-Oriented Chat with Psychological State Tracking NLPCC 2025
Depression-diagnosis-oriented chat aims to guide patients in self-expression to collect key symptoms for depression detection. Recent work focuses on combining task-oriented dialogue and chitchat to simulate the interview-based depression diagnosis. Whereas, these methods can not well capture the changing information, feelings, or symptoms of the patient during dialogues. Moreover, no explicit framework has been explored to guide the dialogue, which results in some useless communications that affect the experience. In this paper, we propose to integrate Psychological State Tracking (POST) within the large language model (LLM) to explicitly guide depression-diagnosis-oriented chat. Specifically, the state is adapted from a psychological theoretical model, which consists of four components, namely Stage, Information, Summary and Next. We fine-tune an LLM model to generate the dynamic psychological state, which is further used to assist response generation at each turn to simulate the psychiatrist. Experimental results on the existing benchmark show that our proposed method boosts the performance of all subtasks in depression-diagnosis-oriented chat.
comment: Accepted by NLPCC 2025
♻ ☆ A Little Human Data Goes A Long Way ACL 2025
Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human generated.
comment: ACL 2025
♻ ☆ CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN
comment: Preprint Version
♻ ☆ Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
♻ ☆ ChuLo: Chunk-Level Key Information Representation for Long Document Understanding ACL 2025
Transformer-based models have achieved remarkable success in various Natural Language Processing (NLP) tasks, yet their ability to handle long documents is constrained by computational limitations. Traditional approaches, such as truncating inputs, sparse self-attention, and chunking, attempt to mitigate these issues, but they often lead to information loss and hinder the model's ability to capture long-range dependencies. In this paper, we introduce ChuLo, a novel chunk representation method for long document understanding that addresses these limitations. Our ChuLo groups input tokens using unsupervised keyphrase extraction, emphasizing semantically important keyphrase based chunks to retain core document content while reducing input length. This approach minimizes information loss and improves the efficiency of Transformer-based models. Preserving all tokens in long document understanding, especially token classification tasks, is important to ensure that fine-grained annotations, which depend on the entire sequence context, are not lost. We evaluate our method on multiple long document classification tasks and long document token classification tasks, demonstrating its effectiveness through comprehensive qualitative and quantitative analysis. Our implementation is open-sourced on https://github.com/adlnlp/Chulo.
comment: The paper has been accepted to ACL 2025
♻ ☆ Fine-tuning foundational models to code diagnoses from veterinary health records
Veterinary medical records represent a large data resource for application to veterinary and One Health clinical research efforts. Use of the data is limited by interoperability challenges including inconsistent data formats and data siloing. Clinical coding using standardized medical terminologies enhances the quality of medical records and facilitates their interoperability with veterinary and human health records from other sites. Previous studies, such as DeepTag and VetTag, evaluated the application of Natural Language Processing (NLP) to automate veterinary diagnosis coding, employing long short-term memory (LSTM) and transformer models to infer a subset of Systemized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) diagnosis codes from free-text clinical notes. This study expands on these efforts by incorporating all 7,739 distinct SNOMED-CT diagnosis codes recognized by the Colorado State University (CSU) Veterinary Teaching Hospital (VTH) and by leveraging the increasing availability of pre-trained language models (LMs). 13 freely-available pre-trained LMs were fine-tuned on the free-text notes from 246,473 manually-coded veterinary patient visits included in the CSU VTH's electronic health records (EHRs), which resulted in superior performance relative to previous efforts. The most accurate results were obtained when expansive labeled data were used to fine-tune relatively large clinical LMs, but the study also showed that comparable results can be obtained using more limited resources and non-clinical LMs. The results of this study contribute to the improvement of the quality of veterinary EHRs by investigating accessible methods for automated coding and support both animal and human health research by paving the way for more integrated and comprehensive health databases that span species and institutions.
comment: 26 pages, 5 figures
♻ ☆ MFTCXplain: A Multilingual Benchmark Dataset for Evaluating the Moral Reasoning of LLMs through Hate Speech Multi-hop Explanations EMNLP 2025
Ensuring the moral reasoning capabilities of Large Language Models (LLMs) is a growing concern as these systems are used in socially sensitive tasks. Nevertheless, current evaluation benchmarks present two major shortcomings: a lack of annotations that justify moral classifications, which limits transparency and interpretability; and a predominant focus on English, which constrains the assessment of moral reasoning across diverse cultural settings. In this paper, we introduce MFTCXplain, a multilingual benchmark dataset for evaluating the moral reasoning of LLMs via hate speech multi-hop explanation using Moral Foundation Theory (MFT). The dataset comprises 3,000 tweets across Portuguese, Italian, Persian, and English, annotated with binary hate speech labels, moral categories, and text span-level rationales. Empirical results highlight a misalignment between LLM outputs and human annotations in moral reasoning tasks. While LLMs perform well in hate speech detection (F1 up to 0.836), their ability to predict moral sentiments is notably weak (F1 < 0.35). Furthermore, rationale alignment remains limited mainly in underrepresented languages. These findings show the limited capacity of current LLMs to internalize and reflect human moral reasoning.
comment: Findings of the Association for Computational Linguistics: EMNLP 2025; *These authors contributed equally
♻ ☆ On the Role of Entity and Event Level Conceptualization in Generalizable Reasoning: A Survey of Tasks, Methods, Applications, and Future Directions EMNLP 2025
Conceptualization, a fundamental element of human cognition, plays a pivotal role in human generalizable reasoning. Generally speaking, it refers to the process of sequentially abstracting specific instances into higher-level concepts and then forming abstract knowledge that can be applied in unfamiliar or novel situations. This enhances models' inferential capabilities and supports the effective transfer of knowledge across various domains. Despite its significance, the broad nature of this term has led to inconsistencies in understanding conceptualization across various works, as there exists different types of instances that can be abstracted in a wide variety of ways. There is also a lack of a systematic overview that comprehensively examines existing works on the definition, execution, and application of conceptualization to enhance reasoning tasks. In this paper, we address these gaps by first proposing a categorization of different types of conceptualizations into four levels based on the types of instances being conceptualized, in order to clarify the term and define the scope of our work. Then, we present the first comprehensive survey of over 150 papers, surveying various definitions, resources, methods, and downstream applications related to conceptualization into a unified taxonomy, with a focus on the entity and event levels. Furthermore, we shed light on potential future directions in this field and hope to garner more attention from the community.
comment: Findings of EMNLP 2025
♻ ☆ Lossless Token Sequence Compression via Meta-Tokens
Existing work on prompt compression for Large Language Models (LLM) focuses on lossy methods that try to maximize the retention of semantic information that is relevant to downstream tasks while significantly reducing the sequence length. In this paper, we introduce a task-agnostic lossless compression technique similar to LZ77 that makes it possible to reduce the input token sequence length on average by 27\% and 18\% for the two evaluation tasks explored here. Given that we use transformer-based LLMs, this equates to 47\% and 33\% less encoding computation, respectively, due to the quadratic nature of attention. The token sequence transformation is trivial to reverse and highlights that no semantic information is lost in the process. We evaluate our proposed approach on two tasks that require strict preservation of semantics/syntax and demonstrate that existing lossy compression methods perform poorly in this setting. We find that our lossless compression technique produces only a small gap in performance compared to using the uncompressed input and posit that larger models and an expanded computing budget would likely erase the gap entirely.
comment: 16 pages, 8 figures
♻ ☆ Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Large Language Models (LLMs) perform well on reasoning benchmarks but often fail when inputs alter slightly, raising concerns about the extent to which their success relies on memorization. This issue is especially acute in Chain-of-Thought (CoT) reasoning, where spurious memorized patterns can trigger intermediate errors that cascade into incorrect final answers. We introduce STIM, a novel framework for Source-aware Token-level Identification of Memorization, which attributes each token in a reasoning chain to one of multiple memorization sources - local, mid-range, or long-range - based on their statistical co-occurrence with the token in the pretraining corpus. Our token-level analysis across tasks and distributional settings reveals that models rely more on memorization in complex or long-tail cases, and that local memorization is often the dominant driver of errors, leading to up to 67% of wrong tokens. We also show that memorization scores from STIM can be effective in predicting the wrong tokens in the wrong reasoning step. STIM offers a powerful tool for diagnosing and improving model reasoning and can generalize to other structured step-wise generation tasks.
♻ ☆ SAND: Boosting LLM Agents with Self-Taught Action Deliberation EMNLP 2025
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning and comparing over alternatives actions, LLM agents finetuned with these methods may over-commit towards seemingly plausible but suboptimal actions due to limited action space exploration. To address this, in this paper we propose Self-taught ActioN Deliberation (SAND) framework, enabling LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluating on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
comment: EMNLP 2025
♻ ☆ Synthetic vs. Gold: The Role of LLM Generated Labels and Data in Cyberbullying Detection
Cyberbullying (CB) presents a pressing threat, especially to children, underscoring the urgent need for robust detection systems to ensure online safety. While large-scale datasets on online abuse exist, there remains a significant gap in labeled data that specifically reflects the language and communication styles used by children. The acquisition of such data from vulnerable populations, such as children, is challenging due to ethical, legal and technical barriers. Moreover, the creation of these datasets relies heavily on human annotation, which not only strains resources but also raises significant concerns due to annotators exposure to harmful content. In this paper, we address these challenges by leveraging Large Language Models (LLMs) to generate synthetic data and labels. Our experiments demonstrate that synthetic data enables BERT-based CB classifiers to achieve performance close to that of those trained on fully authentic datasets (75.8% vs. 81.5% accuracy). Additionally, LLMs can effectively label authentic yet unlabeled data, allowing BERT classifiers to attain a comparable performance level (79.1% vs. 81.5% accuracy). These results highlight the potential of LLMs as a scalable, ethical, and cost-effective solution for generating data for CB detection.
♻ ☆ MATATA: Weakly Supervised End-to-End MAthematical Tool-Augmented Reasoning for Tabular Applications ICDAR 2025
Business documents often contain substantial tabular and textual information with numerical values, requiring mathematical reasoning for effective document understanding. While Small Language Models (SLMs) still struggle at this task, tool-augmented multi-step agents perform better, at the cost of relying on closed-source or larger models, external data, or extensive prompt-engineering. This work introduces MATATA, a novel weakly supervised end-to-end approach to train multi-step reasoning language agents for document tabular applications. MATATA presents an annotation-free paradigm for each agent to enhance 3.8B/8B SLMs. During its two-stage training, MATATA uses the final outcome of the multi-step reasoning chain as weak supervision. This approach avoids having to individually supervise each intermediate agent in the reasoning chain. By employing an adaptive planner and shared tools across different datasets, MATATA shows robust performance. Experiments demonstrate that MATATA achieves state-of-the-art on FinQA, and on TAT-QA among reasoning methods based on open-source SLMs. Although being SLM-based, MATATA closely matches GPT-4-based frameworks on TabMWP. This novel weakly supervised approach enables training an end-to-end multi-step reasoning agent without intermediate supervision, supporting future developments of cost-effective powerful agentic systems.
comment: Published as a conference paper at ICDAR 2025
♻ ☆ Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues KDD 2025
This paper presents an evaluation framework for agentic AI systems in mission-critical negotiation contexts, addressing the need for AI agents that can adapt to diverse human operators and stakeholders. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluated how personality traits and AI agent characteristics influence LLM-simulated social negotiation outcomes--a capability essential for a variety of applications involving cross-team coordination and civil-military interactions. Experiment 1 employs causal discovery methods to measure how personality traits impact price bargaining negotiations, through which we found that Agreeableness and Extraversion significantly affect believability, goal achievement, and knowledge acquisition outcomes. Sociocognitive lexical measures extracted from team communications detected fine-grained differences in agents' empathic communication, moral foundations, and opinion patterns, providing actionable insights for agentic AI systems that must operate reliably in high-stakes operational scenarios. Experiment 2 evaluates human-AI job negotiations by manipulating both simulated human personality and AI system characteristics, specifically transparency, competence, adaptability, demonstrating how AI agent trustworthiness impact mission effectiveness. These findings establish a repeatable evaluation methodology for experimenting with AI agent reliability across diverse operator personalities and human-agent team dynamics, directly supporting operational requirements for reliable AI systems. Our work advances the evaluation of agentic AI workflows by moving beyond standard performance metrics to incorporate social dynamics essential for mission success in complex operations.
comment: Presented at the KDD 2025 Workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models under the title "Evaluating the LLM-simulated Impacts of Big Five Personality Traits and AI Capabilities on Social Negotiations" (https://kdd-eval-workshop.github.io/genai-evaluation-kdd2025/assets/papers/Submission%2036.pdf)
♻ ☆ Prescriptive Agents based on RAG for Automated Maintenance (PARAM)
Industrial machinery maintenance requires timely intervention to prevent catastrophic failures and optimize operational efficiency. This paper presents an integrated Large Language Model (LLM)-based intelligent system for prescriptive maintenance that extends beyond traditional anomaly detection to provide actionable maintenance recommendations. Building upon our prior LAMP framework for numerical data analysis, we develop a comprehensive solution that combines bearing vibration frequency analysis with multi agentic generation for intelligent maintenance planning. Our approach serializes bearing vibration data (BPFO, BPFI, BSF, FTF frequencies) into natural language for LLM processing, enabling few-shot anomaly detection with high accuracy. The system classifies fault types (inner race, outer race, ball/roller, cage faults) and assesses severity levels. A multi-agentic component processes maintenance manuals using vector embeddings and semantic search, while also conducting web searches to retrieve comprehensive procedural knowledge and access up-to-date maintenance practices for more accurate and in-depth recommendations. The Gemini model then generates structured maintenance recommendations includes immediate actions, inspection checklists, corrective measures, parts requirements, and timeline specifications. Experimental validation in bearing vibration datasets demonstrates effective anomaly detection and contextually relevant maintenance guidance. The system successfully bridges the gap between condition monitoring and actionable maintenance planning, providing industrial practitioners with intelligent decision support. This work advances the application of LLMs in industrial maintenance, offering a scalable framework for prescriptive maintenance across machinery components and industrial sectors.
♻ ☆ Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
Recent advancements in visual generative models have enabled high-quality image and video generation, opening diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast, humans can quickly form impressions of a model's capabilities by observing only a few samples. To mimic this, we propose the Evaluation Agent framework, which employs human-like strategies for efficient, dynamic, multi-round evaluations using only a few samples per round, while offering detailed, user-tailored analyses. It offers four key advantages: 1) efficiency, 2) promptable evaluation tailored to diverse user needs, 3) explainability beyond single numerical scores, and 4) scalability across various models and tools. Experiments show that Evaluation Agent reduces evaluation time to 10% of traditional methods while delivering comparable results. The Evaluation Agent framework is fully open-sourced to advance research in visual generative models and their efficient evaluation.
comment: Equal contributions from first three authors. Project page: https://vchitect.github.io/Evaluation-Agent-project Code: https://github.com/Vchitect/Evaluation-Agent
♻ ☆ Length Representations in Large Language Models EMNLP 2025
Large language models (LLMs) have shown remarkable capabilities across various tasks, that are learned from massive amounts of text-based data. Although LLMs can control output sequence length, particularly in instruction-based settings, the internal mechanisms behind this control have been unexplored yet. In this study, we provide empirical evidence on how output sequence length information is encoded within the internal representations in LLMs. In particular, our findings show that multi-head attention mechanisms are critical in determining output sequence length, which can be adjusted in a disentangled manner. By scaling specific hidden units within the model, we can control the output sequence length without losing the informativeness of the generated text, thereby indicating that length information is partially disentangled from semantic information. Moreover, some hidden units become increasingly active as prompts become more length-specific, thus reflecting the model's internal awareness of this attribute. Our findings suggest that LLMs have learned robust and adaptable internal mechanisms for controlling output length without any external control.
comment: Accepted to EMNLP 2025 Findings
Information Retrieval 20
☆ Benefiting from Negative yet Informative Feedback by Contrasting Opposing Sequential Patterns
We consider the task of learning from both positive and negative feedback in a sequential recommendation scenario, as both types of feedback are often present in user interactions. Meanwhile, conventional sequential learning models usually focus on considering and predicting positive interactions, ignoring that reducing items with negative feedback in recommendations improves user satisfaction with the service. Moreover, the negative feedback can potentially provide a useful signal for more accurate identification of true user interests. In this work, we propose to train two transformer encoders on separate positive and negative interaction sequences. We incorporate both types of feedback into the training objective of the sequential recommender using a composite loss function that includes positive and negative cross-entropy as well as a cleverly crafted contrastive term, that helps better modeling opposing patterns. We demonstrate the effectiveness of this approach in terms of increasing true-positive metrics compared to state-of-the-art sequential recommendation methods while reducing the number of wrongly promoted negative items.
☆ OneLoc: Geo-Aware Generative Recommender Systems for Local Life Service
Local life service is a vital scenario in Kuaishou App, where video recommendation is intrinsically linked with store's location information. Thus, recommendation in our scenario is challenging because we should take into account user's interest and real-time location at the same time. In the face of such complex scenarios, end-to-end generative recommendation has emerged as a new paradigm, such as OneRec in the short video scenario, OneSug in the search scenario, and EGA in the advertising scenario. However, in local life service, an end-to-end generative recommendation model has not yet been developed as there are some key challenges to be solved. The first challenge is how to make full use of geographic information. The second challenge is how to balance multiple objectives, including user interests, the distance between user and stores, and some other business objectives. To address the challenges, we propose OneLoc. Specifically, we leverage geographic information from different perspectives: (1) geo-aware semantic ID incorporates both video and geographic information for tokenization, (2) geo-aware self-attention in the encoder leverages both video location similarity and user's real-time location, and (3) neighbor-aware prompt captures rich context information surrounding users for generation. To balance multiple objectives, we use reinforcement learning and propose two reward functions, i.e., geographic reward and GMV reward. With the above design, OneLoc achieves outstanding offline and online performance. In fact, OneLoc has been deployed in local life service of Kuaishou App. It serves 400 million active users daily, achieving 21.016% and 17.891% improvements in terms of gross merchandise value (GMV) and orders numbers.
☆ MISS: Multi-Modal Tree Indexing and Searching with Lifelong Sequential Behavior for Retrieval Recommendation CIKM 2025
Large-scale industrial recommendation systems typically employ a two-stage paradigm of retrieval and ranking to handle huge amounts of information. Recent research focuses on improving the performance of retrieval model. A promising way is to introduce extensive information about users and items. On one hand, lifelong sequential behavior is valuable. Existing lifelong behavior modeling methods in ranking stage focus on the interaction of lifelong behavior and candidate items from retrieval stage. In retrieval stage, it is difficult to utilize lifelong behavior because of a large corpus of candidate items. On the other hand, existing retrieval methods mostly relay on interaction information, potentially disregarding valuable multi-modal information. To solve these problems, we represent the pioneering exploration of leveraging multi-modal information and lifelong sequence model within the advanced tree-based retrieval model. We propose Multi-modal Indexing and Searching with lifelong Sequence (MISS), which contains a multi-modal index tree and a multi-modal lifelong sequence modeling module. Specifically, for better index structure, we propose multi-modal index tree, which is built using the multi-modal embedding to precisely represent item similarity. To precisely capture diverse user interests in user lifelong sequence, we propose collaborative general search unit (Co-GSU) and multi-modal general search unit (MM-GSU) for multi-perspective interests searching.
comment: CIKM 2025
☆ DGenCTR: Towards a Universal Generative Paradigm for Click-Through Rate Prediction via Discrete Diffusion
Recent advances in generative models have inspired the field of recommender systems to explore generative approaches, but most existing research focuses on sequence generation, a paradigm ill-suited for click-through rate (CTR) prediction. CTR models critically depend on a large number of cross-features between the target item and the user to estimate the probability of clicking on the item, and discarding these cross-features will significantly impair model performance. Therefore, to harness the ability of generative models to understand data distributions and thereby alleviate the constraints of traditional discriminative models in label-scarce space, diverging from the item-generation paradigm of sequence generation methods, we propose a novel sample-level generation paradigm specifically designed for the CTR task: a two-stage Discrete Diffusion-Based Generative CTR training framework (DGenCTR). This two-stage framework comprises a diffusion-based generative pre-training stage and a CTR-targeted supervised fine-tuning stage for CTR. Finally, extensive offline experiments and online A/B testing conclusively validate the effectiveness of our framework.
comment: 11 pages, 4 figures, 4 tables
☆ Global-Distribution Aware Scenario-Specific Variational Representation Learning Framework CIKM 2025
With the emergence of e-commerce, the recommendations provided by commercial platforms must adapt to diverse scenarios to accommodate users' varying shopping preferences. Current methods typically use a unified framework to offer personalized recommendations for different scenarios. However, they often employ shared bottom representations, which partially hinders the model's capacity to capture scenario uniqueness. Ideally, users and items should exhibit specific characteristics in different scenarios, prompting the need to learn scenario-specific representations to differentiate scenarios. Yet, variations in user and item interactions across scenarios lead to data sparsity issues, impeding the acquisition of scenario-specific representations. To learn robust scenario-specific representations, we introduce a Global-Distribution Aware Scenario-Specific Variational Representation Learning Framework (GSVR) that can be directly applied to existing multi-scenario methods. Specifically, considering the uncertainty stemming from limited samples, our approach employs a probabilistic model to generate scenario-specific distributions for each user and item in each scenario, estimated through variational inference (VI). Additionally, we introduce the global knowledge-aware multinomial distributions as prior knowledge to regulate the learning of the posterior user and item distributions, ensuring similarities among distributions for users with akin interests and items with similar side information. This mitigates the risk of users or items with fewer records being overwhelmed in sparse scenarios. Extensive experimental results affirm the efficacy of GSVR in assisting existing multi-scenario recommendation methods in learning more robust representations.
comment: Accepted by CIKM 2025, 6 pages, 1 figures, 5 tables
☆ Distribution-Guided Auto-Encoder for User Multimodal Interest Cross Fusion CIKM 2025
Traditional recommendation methods rely on correlating the embedding vectors of item IDs to capture implicit collaborative filtering signals to model the user's interest in the target item. Consequently, traditional ID-based methods often encounter data sparsity problems stemming from the sparse nature of ID features. To alleviate the problem of item ID sparsity, recommendation models incorporate multimodal item information to enhance recommendation accuracy. However, existing multimodal recommendation methods typically employ early fusion approaches, which focus primarily on combining text and image features, while neglecting the contextual influence of user behavior sequences. This oversight prevents dynamic adaptation of multimodal interest representations based on behavioral patterns, consequently restricting the model's capacity to effectively capture user multimodal interests. Therefore, this paper proposes the Distribution-Guided Multimodal-Interest Auto-Encoder (DMAE), which achieves the cross fusion of user multimodal interest at the behavioral level.Ultimately, extensive experiments demonstrate the superiority of DMAE.
comment: Accepted by CIKM 2025, 11 pages, 4 figures, 4 tables
☆ Diverse Negative Sampling for Implicit Collaborative Filtering
Implicit collaborative filtering recommenders are usually trained to learn user positive preferences. Negative sampling, which selects informative negative items to form negative training data, plays a crucial role in this process. Since items are often clustered in the latent space, existing negative sampling strategies normally oversample negative items from the dense regions. This leads to homogeneous negative data and limited model expressiveness. In this paper, we propose Diverse Negative Sampling (DivNS), a novel approach that explicitly accounts for diversity in negative training data during the negative sampling process. DivNS first finds hard negative items with large preference scores and constructs user-specific caches that store unused but highly informative negative samples. Then, its diversity-augmented sampler selects a diverse subset of negative items from the cache while ensuring dissimilarity from the user's hard negatives. Finally, a synthetic negatives generator combines the selected diverse negatives with hard negatives to form more effective training data. The resulting synthetic negatives are both informative and diverse, enabling recommenders to learn a broader item space and improve their generalisability. Extensive experiments on four public datasets demonstrate the effectiveness of DivNS in improving recommendation quality while maintaining computational efficiency.
☆ You Only Evaluate Once: A Tree-based Rerank Method at Meituan CIKM 2025
Reranking plays a crucial role in modern recommender systems by capturing the mutual influences within the list. Due to the inherent challenges of combinatorial search spaces, most methods adopt a two-stage search paradigm: a simple General Search Unit (GSU) efficiently reduces the candidate space, and an Exact Search Unit (ESU) effectively selects the optimal sequence. These methods essentially involve making trade-offs between effectiveness and efficiency, while suffering from a severe \textbf{inconsistency problem}, that is, the GSU often misses high-value lists from ESU. To address this problem, we propose YOLOR, a one-stage reranking method that removes the GSU while retaining only the ESU. Specifically, YOLOR includes: (1) a Tree-based Context Extraction Module (TCEM) that hierarchically aggregates multi-scale contextual features to achieve "list-level effectiveness", and (2) a Context Cache Module (CCM) that enables efficient feature reuse across candidate permutations to achieve "permutation-level efficiency". Extensive experiments across public and industry datasets validate YOLOR's performance, and we have successfully deployed YOLOR on the Meituan food delivery platform.
comment: Accepted by CIKM 2025
SurveyGen-I: Consistent Scientific Survey Generation with Evolving Plans and Memory-Guided Writing
Survey papers play a critical role in scientific communication by consolidating progress across a field. Recent advances in Large Language Models (LLMs) offer a promising solution by automating key steps in the survey-generation pipeline, such as retrieval, structuring, and summarization. However, existing LLM-based approaches often struggle with maintaining coherence across long, multi-section surveys and providing comprehensive citation coverage. To address these limitations, we introduce SurveyGen-I, an automatic survey generation framework that combines coarse-to-fine retrieval, adaptive planning, and memory-guided generation. SurveyGen-I first performs survey-level retrieval to construct the initial outline and writing plan, and then dynamically refines both during generation through a memory mechanism that stores previously written content and terminology, ensuring coherence across subsections. When the system detects insufficient context, it triggers fine-grained subsection-level retrieval. During generation, SurveyGen-I leverages this memory mechanism to maintain coherence across subsections. Experiments across four scientific domains demonstrate that SurveyGen-I consistently outperforms previous works in content quality, consistency, and citation coverage.
comment: The code is available at https://github.com/SurveyGens/SurveyGen-I , 20 pages, 16 figures
☆ LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text
LongRecall. The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.
☆ Alpha Berkeley: A Scalable Framework for the Orchestration of Agentic Systems
Coordinating workflows across heterogeneous control systems remains a central challenge in safety-critical environments such as scientific facilities, industrial plants, and energy infrastructures. Language-model-driven agents offer a natural interface for these tasks, but existing approaches often lack scalability, reliability, and human oversight. We introduce the Alpha Berkeley Framework, a production-ready architecture for scalable agentic systems that integrate conversational context with robust tool orchestration. The framework features dynamic capability classification to select only relevant tools per task, a plan-first orchestration model that generates execution plans with explicit dependencies and optional human approval, context-aware task extraction that combines dialogue history with external memory and domain resources, and production-ready execution environments with checkpointing, artifact management, and modular deployment. We demonstrate its versatility through two case studies: a tutorial-style wind farm monitoring example and a deployment at the Advanced Light Source particle accelerator. These results establish Alpha Berkeley as a reliable and transparent framework for agentic systems in high-stakes domains.
☆ MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering
Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning, particularly in tasks requiring nuanced clinical understanding. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. However, most existing approaches rely on surface-level semantic retrieval and lack the structured reasoning needed for clinical decision support. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. This design enables models to retrieve evidence aligned with diagnostic logic and generate step-by-step causal reasoning reflective of real-world clinical practice. Experiments on three diverse medical QA benchmarks show that MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods, improving accuracy, interpretability, and consistency in complex medical tasks.
♻ ☆ TASER: Table Agents for Schema-guided Extraction and Recommendation
Real-world financial documents report essential information about an entity's financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.
comment: Withdrawn due to missing key sections in the paper
♻ ☆ Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs
Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
♻ ☆ Hybrid-Hierarchical Fashion Graph Attention Network for Compatibility-Oriented and Personalized Outfit Recommendation
The rapid expansion of the fashion industry and the growing variety of products have made it increasingly challenging for users to identify compatible items on e-commerce platforms. Effective fashion recommendation systems are therefore crucial for filtering irrelevant options and suggesting suitable ones. However, simultaneously addressing outfit compatibility and personalized recommendations remains a significant challenge, as these aspects are typically treated independently in existing studies, thereby overlooking the complex interactions between items and user preferences. This research introduces a new framework named FGAT, which leverages a hierarchical graph representation together with graph attention mechanisms to address this problem. The framework constructs a three-tier graph of users, outfits, and items, integrating visual and textual features to jointly model outfit compatibility and user preferences. By dynamically weighting node importance during representation propagation, the graph attention mechanism captures key interactions and produces precise embeddings for both user preferences and outfit compatibility. Evaluated on the POG dataset, FGAT outperforms strong baselines such as HFGN, achieving notable improvements in accuracy, precision, HR, recall, and NDCG. These results demonstrate that combining multimodal visual and textual features with a hierarchical graph structure and attention mechanisms significantly enhances the effectiveness and efficiency of personalized fashion recommendation systems.
comment: The corresponding author: Babak Teimourpour
♻ ☆ Enhancing Temporal Sensitivity of Large Language Model for Recommendation with Counterfactual Tuning
Recent advances have applied large language models (LLMs) to sequential recommendation, leveraging their pre-training knowledge and reasoning capabilities to provide more personalized user experiences. However, existing LLM-based methods fail to sufficiently leverage the rich temporal information inherent in users' historical interaction sequences, stemming from fundamental architectural constraints: LLMs process information through self-attention mechanisms that lack inherent sequence ordering and rely on position embeddings designed primarily for natural language rather than user interaction sequences. This limitation significantly impairs their ability to capture the evolution of user preferences over time and predict future interests accurately. To address this critical gap, we propose \underline{C}ounterfactual \underline{E}nhanced \underline{T}emporal Framework for LLM-Based \underline{Rec}ommendation (CETRec). CETRec is grounded in causal inference principles, which allow it to isolate and measure the specific impact of temporal information on recommendation outcomes. Combined with our counterfactual tuning task derived from causal analysis, CETRec effectively enhances LLMs' awareness of both absolute order (how recently items were interacted with) and relative order (the sequential relationships between items). Extensive experiments on real-world datasets demonstrate the effectiveness of our CETRec. Our code is available at https://anonymous.4open.science/r/CETRec-B9CE/.
♻ ☆ PathGPT: Reframing Path Recommendation as a Natural Language Generation Task with Retrieval-Augmented Language Models
Path recommendation (PR) aims to generate travel paths that are customized to a user's specific preferences and constraints. Conventional approaches often employ explicit optimization objectives or specialized machine learning architectures; however, these methods typically exhibit limited flexibility and generalizability, necessitating costly retraining to accommodate new scenarios. This paper introduces an alternative paradigm that conceptualizes PR as a natural language generation task. We present PathGPT, a retrieval-augmented large language model (LLM) system that leverages historical trajectory data and natural language user constraints to generate plausible paths. The proposed methodology first converts raw trajectory data into a human-interpretable textual format, which is then stored in a database. Subsequently, a hybrid retrieval system extracts path-specific context from this database to inform a pretrained LLM. The primary contribution of this work is a novel framework that demonstrates how integrating established information retrieval and generative model components can enable adaptive, zero-shot path generation across diverse scenarios. Extensive experiments on large-scale trajectory datasets indicate that PathGPT's performance is competitive with specialized, learning-based methods, underscoring its potential as a flexible and generalizable path generation system that avoids the need for retraining inherent in previous data-driven models.
♻ ☆ Diagnostic-Guided Dynamic Profile Optimization for LLM-based User Simulators in Sequential Recommendation
Recent advances in large language models (LLMs) have enabled realistic user simulators for developing and evaluating recommender systems (RSs). However, existing LLM-based simulators for RSs face two major limitations: (1) static and single-step prompt-based inference that leads to inaccurate and incomplete user profile construction; (2) unrealistic and single-round recommendation-feedback interaction pattern that fails to capture real-world scenarios. To address these limitations, we propose DGDPO (Diagnostic-Guided Dynamic Profile Optimization), a novel framework that constructs user profile through a dynamic and iterative optimization process to enhance the simulation fidelity. Specifically, DGDPO incorporates two core modules within each optimization loop: firstly, a specialized LLM-based diagnostic module, calibrated through our novel training strategy, accurately identifies specific defects in the user profile. Subsequently, a generalized LLM-based treatment module analyzes the diagnosed defect and generates targeted suggestions to refine the profile. Furthermore, unlike existing LLM-based user simulators that are limited to single-round interactions, we are the first to integrate DGDPO with sequential recommenders, enabling a bidirectional evolution where user profiles and recommendation strategies adapt to each other over multi-round interactions. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness of our proposed framework.
♻ ☆ PinFM: Foundation Model for User Activity Sequences at a Billion-scale Visual Discovery Platform
User activity sequences have emerged as one of the most important signals in recommender systems. We present a foundational model, PinFM, for understanding user activity sequences across multiple applications at a billion-scale visual discovery platform. We pretrain a transformer model with 20B+ parameters using extensive user activity data, then fine-tune it for specific applications, efficiently coupling it with existing models. While this pretraining-and-fine-tuning approach has been popular in other domains, such as Vision and NLP, its application in industrial recommender systems presents numerous challenges. The foundational model must be scalable enough to score millions of items every second while meeting tight cost and latency constraints imposed by these systems. Additionally, it should capture the interactions between user activities and other features and handle new items that were not present during the pretraining stage. We developed innovative techniques to address these challenges. Our infrastructure and algorithmic optimizations, such as the Deduplicated Cross-Attention Transformer (DCAT), improved our throughput by 600% on Pinterest internal data. We demonstrate that PinFM can learn interactions between user sequences and candidate items by altering input sequences, leading to a 20% increase in engagement with new items. PinFM is now deployed to help improve the experience of more than half a billion users across various applications.
comment: clean up typos
♻ ☆ On the Comprehensibility of Multi-structured Financial Documents using LLMs and Pre-processing Tools
The proliferation of complex structured data in hybrid sources, such as PDF documents and web pages, presents unique challenges for current Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) in providing accurate answers. Despite the recent advancements of MLLMs, they still often falter when interpreting intricately structured information, such as nested tables and multi-dimensional plots, leading to hallucinations and erroneous outputs. This paper explores the capabilities of LLMs and MLLMs in understanding and answering questions from complex data structures found in PDF documents by leveraging industrial and open-source tools as part of a pre-processing pipeline. Our findings indicate that GPT-4o, a popular MLLM, achieves an accuracy of 56% on multi-structured documents when fed documents directly, and that integrating pre-processing tools raises the accuracy of LLMs to 61.3% for GPT-4o and 76% for GPT-4, and with lower overall cost. The code is publicly available at https://github.com/OGCDS/FinancialQA.
comment: 15 pages, 5 figures, 9 tables
Machine Learning 128
☆ Compute-Optimal Scaling for Value-Based Deep RL
As models grow larger and training them becomes expensive, it becomes increasingly important to scale training recipes not just to larger models and more data, but to do so in a compute-optimal manner that extracts maximal performance per unit of compute. While such scaling has been well studied for language modeling, reinforcement learning (RL) has received less attention in this regard. In this paper, we investigate compute scaling for online, value-based deep RL. These methods present two primary axes for compute allocation: model capacity and the update-to-data (UTD) ratio. Given a fixed compute budget, we ask: how should resources be partitioned across these axes to maximize sample efficiency? Our analysis reveals a nuanced interplay between model size, batch size, and UTD. In particular, we identify a phenomenon we call TD-overfitting: increasing the batch quickly harms Q-function accuracy for small models, but this effect is absent in large models, enabling effective use of large batch size at scale. We provide a mental model for understanding this phenomenon and build guidelines for choosing batch size and UTD to optimize compute usage. Our findings provide a grounded starting point for compute-optimal scaling in deep RL, mirroring studies in supervised learning but adapted to TD learning.
☆ Squeezed Diffusion Models
Diffusion models typically inject isotropic Gaussian noise, disregarding structure in the data. Motivated by the way quantum squeezed states redistribute uncertainty according to the Heisenberg uncertainty principle, we introduce Squeezed Diffusion Models (SDM), which scale noise anisotropically along the principal component of the training distribution. As squeezing enhances the signal-to-noise ratio in physics, we hypothesize that scaling noise in a data-dependent manner can better assist diffusion models in learning important data features. We study two configurations: (i) a Heisenberg diffusion model that compensates the scaling on the principal axis with inverse scaling on orthogonal directions and (ii) a standard SDM variant that scales only the principal axis. Counterintuitively, on CIFAR-10/100 and CelebA-64, mild antisqueezing - i.e. increasing variance on the principal axis - consistently improves FID by up to 15% and shifts the precision-recall frontier toward higher recall. Our results demonstrate that simple, data-aware noise shaping can deliver robust generative gains without architectural changes.
comment: 7 pages, 3 figures
☆ Graph Structure Learning with Temporal Graph Information Bottleneck for Inductive Representation Learning ECAI
Temporal graph learning is crucial for dynamic networks where nodes and edges evolve over time and new nodes continuously join the system. Inductive representation learning in such settings faces two major challenges: effectively representing unseen nodes and mitigating noisy or redundant graph information. We propose GTGIB, a versatile framework that integrates Graph Structure Learning (GSL) with Temporal Graph Information Bottleneck (TGIB). We design a novel two-step GSL-based structural enhancer to enrich and optimize node neighborhoods and demonstrate its effectiveness and efficiency through theoretical proofs and experiments. The TGIB refines the optimized graph by extending the information bottleneck principle to temporal graphs, regularizing both edges and features based on our derived tractable TGIB objective function via variational approximation, enabling stable and efficient optimization. GTGIB-based models are evaluated to predict links on four real-world datasets; they outperform existing methods in all datasets under the inductive setting, with significant and consistent improvement in the transductive setting.
comment: Accepted in the 28th European Conference on Artificial Intelligence (ECAI), 2025
☆ Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent
As large language models (LLMs) are increasingly deployed in critical applications, ensuring their robustness and safety alignment remains a major challenge. Despite the overall success of alignment techniques such as reinforcement learning from human feedback (RLHF) on typical prompts, LLMs remain vulnerable to jailbreak attacks enabled by crafted adversarial triggers appended to user prompts. Most existing jailbreak methods either rely on inefficient searches over discrete token spaces or direct optimization of continuous embeddings. While continuous embeddings can be given directly to selected open-source models as input, doing so is not feasible for proprietary models. On the other hand, projecting these embeddings back into valid discrete tokens introduces additional complexity and often reduces attack effectiveness. We propose an intrinsic optimization method which directly optimizes relaxed one-hot encodings of the adversarial suffix tokens using exponentiated gradient descent coupled with Bregman projection, ensuring that the optimized one-hot encoding of each token always remains within the probability simplex. We provide theoretical proof of convergence for our proposed method and implement an efficient algorithm that effectively jailbreaks several widely used LLMs. Our method achieves higher success rates and faster convergence compared to three state-of-the-art baselines, evaluated on five open-source LLMs and four adversarial behavior datasets curated for evaluating jailbreak methods. In addition to individual prompt attacks, we also generate universal adversarial suffixes effective across multiple prompts and demonstrate transferability of optimized suffixes to different LLMs.
☆ Multimodal Quantum Vision Transformer for Enzyme Commission Classification from Biochemical Representations
Accurately predicting enzyme functionality remains one of the major challenges in computational biology, particularly for enzymes with limited structural annotations or sequence homology. We present a novel multimodal Quantum Machine Learning (QML) framework that enhances Enzyme Commission (EC) classification by integrating four complementary biochemical modalities: protein sequence embeddings, quantum-derived electronic descriptors, molecular graph structures, and 2D molecular image representations. Quantum Vision Transformer (QVT) backbone equipped with modality-specific encoders and a unified cross-attention fusion module. By integrating graph features and spatial patterns, our method captures key stereoelectronic interactions behind enzyme function. Experimental results demonstrate that our multimodal QVT model achieves a top-1 accuracy of 85.1%, outperforming sequence-only baselines by a substantial margin and achieving better performance results compared to other QML models.
comment: Accepted at IEEE International Conference on Quantum Artificial Intelligence (QAI) 2025
☆ On Defining Neural Averaging
What does it even mean to average neural networks? We investigate the problem of synthesizing a single neural network from a collection of pretrained models, each trained on disjoint data shards, using only their final weights and no access to training data. In forming a definition of neural averaging, we take insight from model soup, which appears to aggregate multiple models into a singular model while enhancing generalization performance. In this work, we reinterpret model souping as a special case of a broader framework: Amortized Model Ensembling (AME) for neural averaging, a data-free meta-optimization approach that treats model differences as pseudogradients to guide neural weight updates. We show that this perspective not only recovers model soup but enables more expressive and adaptive ensembling strategies. Empirically, AME produces averaged neural solutions that outperform both individual experts and model soup baselines, especially in out-of-distribution settings. Our results suggest a principled and generalizable notion of data-free model weight aggregation and defines, in one sense, how to perform neural averaging.
☆ Long Chain-of-Thought Reasoning Across Languages SC
Scaling inference through long chains-of-thought (CoTs) has unlocked impressive reasoning capabilities in large language models (LLMs), yet the reasoning process remains almost exclusively English-centric. We construct translated versions of two popular English reasoning datasets, fine-tune Qwen 2.5 (7B) and Qwen 3 (8B) models, and present a systematic study of long CoT generation across French, Japanese, Latvian, and Swahili. Our experiments reveal three key findings. First, the efficacy of using English as a pivot language varies by language: it provides no benefit for French, improves performance when used as the reasoning language for Japanese and Latvian, and proves insufficient for Swahili where both task comprehension and reasoning remain poor. Second, extensive multilingual pretraining in Qwen 3 narrows but does not eliminate the cross-lingual performance gap. A lightweight fine-tune using only 1k traces still improves performance by over 30\% in Swahili. Third, data quality versus scale trade-offs are language dependent: small, carefully curated datasets suffice for English and French, whereas larger but noisier corpora prove more effective for Swahili and Latvian. Together, these results clarify when and why long CoTs transfer across languages and provide translated datasets to foster equitable multilingual reasoning research.
comment: Accepted to SCALR @ COLM 2025
☆ The C-index Multiverse
Quantifying out-of-sample discrimination performance for time-to-event outcomes is a fundamental step for model evaluation and selection in the context of predictive modelling. The concordance index, or C-index, is a widely used metric for this purpose, particularly with the growing development of machine learning methods. Beyond differences between proposed C-index estimators (e.g. Harrell's, Uno's and Antolini's), we demonstrate the existence of a C-index multiverse among available R and python software, where seemingly equal implementations can yield different results. This can undermine reproducibility and complicate fair comparisons across models and studies. Key variation sources include tie handling and adjustment to censoring. Additionally, the absence of a standardised approach to summarise risk from survival distributions, result in another source of variation dependent on input types. We demonstrate the consequences of the C-index multiverse when quantifying predictive performance for several survival models (from Cox proportional hazards to recent deep learning approaches) on publicly available breast cancer data, and semi-synthetic examples. Our work emphasises the need for better reporting to improve transparency and reproducibility. This article aims to be a useful guideline, helping analysts when navigating the multiverse, providing unified documentation and highlighting potential pitfalls of existing software. All code is publicly available at: www.github.com/BBolosSierra/CindexMultiverse.
comment: 21 pages main text with 6 figures and 3 tables. 19 pages of supplementary material
☆ Successive Halving with Learning Curve Prediction via Latent Kronecker Gaussian Processes
Successive Halving is a popular algorithm for hyperparameter optimization which allocates exponentially more resources to promising candidates. However, the algorithm typically relies on intermediate performance values to make resource allocation decisions, which can cause it to prematurely prune slow starters that would eventually become the best candidate. We investigate whether guiding Successive Halving with learning curve predictions based on Latent Kronecker Gaussian Processes can overcome this limitation. In a large-scale empirical study involving different neural network architectures and a click prediction dataset, we compare this predictive approach to the standard approach based on current performance values. Our experiments show that, although the predictive approach achieves competitive performance, it is not Pareto optimal compared to investing more resources into the standard approach, because it requires fully observed learning curves as training data. However, this downside could be mitigated by leveraging existing learning curve data.
comment: AutoML 2025 Non-Archival Track
☆ Enhancing Contrastive Link Prediction With Edge Balancing Augmentation CIKM 2025
Link prediction is one of the most fundamental tasks in graph mining, which motivates the recent studies of leveraging contrastive learning to enhance the performance. However, we observe two major weaknesses of these studies: i) the lack of theoretical analysis for contrastive learning on link prediction, and ii) inadequate consideration of node degrees in contrastive learning. To address the above weaknesses, we provide the first formal theoretical analysis for contrastive learning on link prediction, where our analysis results can generalize to the autoencoder-based link prediction models with contrastive learning. Motivated by our analysis results, we propose a new graph augmentation approach, Edge Balancing Augmentation (EBA), which adjusts the node degrees in the graph as the augmentation. We then propose a new approach, named Contrastive Link Prediction with Edge Balancing Augmentation (CoEBA), that integrates the proposed EBA and the proposed new contrastive losses to improve the model performance. We conduct experiments on 8 benchmark datasets. The results demonstrate that our proposed CoEBA significantly outperforms the other state-of-the-art link prediction models.
comment: Accepted by CIKM 2025
☆ Source-Guided Flow Matching
Guidance of generative models is typically achieved by modifying the probability flow vector field through the addition of a guidance field. In this paper, we instead propose the Source-Guided Flow Matching (SGFM) framework, which modifies the source distribution directly while keeping the pre-trained vector field intact. This reduces the guidance problem to a well-defined problem of sampling from the source distribution. We theoretically show that SGFM recovers the desired target distribution exactly. Furthermore, we provide bounds on the Wasserstein error for the generated distribution when using an approximate sampler of the source distribution and an approximate vector field. The key benefit of our approach is that it allows the user to flexibly choose the sampling method depending on their specific problem. To illustrate this, we systematically compare different sampling methods and discuss conditions for asymptotically exact guidance. Moreover, our framework integrates well with optimal flow matching models since the straight transport map generated by the vector field is preserved. Experimental results on synthetic 2D benchmarks, image datasets, and physics-informed generative tasks demonstrate the effectiveness and flexibility of the proposed framework.
☆ Learning from user's behaviour of some well-known congested traffic networks
We consider the problem of predicting users' behavior of a congested traffic network under an equilibrium condition, the traffic assignment problem. We propose a two-stage machine learning approach which couples a neural network with a fixed point algorithm, and we evaluate its performance along several classical congested traffic networks.
comment: 12 pages, 4 figures, 3 tables
☆ A Guide for Manual Annotation of Scientific Imagery: How to Prepare for Large Projects
Despite the high demand for manually annotated image data, managing complex and costly annotation projects remains under-discussed. This is partly due to the fact that leading such projects requires dealing with a set of diverse and interconnected challenges which often fall outside the expertise of specific domain experts, leaving practical guidelines scarce. These challenges range widely from data collection to resource allocation and recruitment, from mitigation of biases to effective training of the annotators. This paper provides a domain-agnostic preparation guide for annotation projects, with a focus on scientific imagery. Drawing from the authors' extensive experience in managing a large manual annotation project, it addresses fundamental concepts including success measures, annotation subjects, project goals, data availability, and essential team roles. Additionally, it discusses various human biases and recommends tools and technologies to improve annotation quality and efficiency. The goal is to encourage further research and frameworks for creating a comprehensive knowledge base to reduce the costs of manual annotation projects across various fields.
☆ Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method
Model distillation enables the transfer of knowledge from large-scale models to compact student models, facilitating deployment in resource-constrained environments. However, conventional distillation approaches often suffer from computational overhead and limited generalization. We propose a novel adaptive distillation framework that dynamically augments training data in regions of high student model loss. Using UMAP-based dimensionality reduction and nearest neighbor sampling, our method identifies underperforming regions in the embedding space and generates targeted synthetic examples to guide student learning. To further improve efficiency, we introduce a lightweight teacher-student interface that bypasses the teacher's input layer, enabling direct distillation on vectorized representations. Experiments across standard NLP benchmarks demonstrate that our 66M-parameter student model consistently matches or surpasses established baselines, achieving 91.2% on QNLI and 92.3% on SST-2, while training with fewer epochs. These results highlight the promise of loss-aware data augmentation and vectorized distillation for efficient and effective model compression.
☆ Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features
Compression-based distances (CD) offer a flexible and domain-agnostic means of measuring similarity by identifying implicit information through redundancies between data objects. However, as similarity features are derived from the data, rather than defined as an input, it often proves difficult to align with the task at hand, particularly in complex clustering or classification settings. To address this issue, we introduce "context steering," a novel methodology that actively guides the feature-shaping process. Instead of passively accepting the emergent data structure (typically a hierarchy derived from clustering CDs), our approach "steers" the process by systematically analyzing how each object influences the relational context within a clustering framework. This process generates a custom-tailored embedding that isolates and amplifies class-distinctive information. We validate the capabilities of this strategy using Normalized Compression Distance (NCD) and Relative Compression Distance (NRC) with common hierarchical clustering, providing an effective alternative to common transductive methods. Experimental results across heterogeneous datasets-from text to real-world audio-validate the robustness and generality of context steering, marking a fundamental shift in their application: from merely discovering inherent data structures to actively shaping a feature space tailored to a specific objective.
☆ Federated Distillation on Edge Devices: Efficient Client-Side Filtering for Non-IID Data
Federated distillation has emerged as a promising collaborative machine learning approach, offering enhanced privacy protection and reduced communication compared to traditional federated learning by exchanging model outputs (soft logits) rather than full model parameters. However, existing methods employ complex selective knowledge-sharing strategies that require clients to identify in-distribution proxy data through computationally expensive statistical density ratio estimators. Additionally, server-side filtering of ambiguous knowledge introduces latency to the process. To address these challenges, we propose a robust, resource-efficient EdgeFD method that reduces the complexity of the client-side density ratio estimation and removes the need for server-side filtering. EdgeFD introduces an efficient KMeans-based density ratio estimator for effectively filtering both in-distribution and out-of-distribution proxy data on clients, significantly improving the quality of knowledge sharing. We evaluate EdgeFD across diverse practical scenarios, including strong non-IID, weak non-IID, and IID data distributions on clients, without requiring a pre-trained teacher model on the server for knowledge distillation. Experimental results demonstrate that EdgeFD outperforms state-of-the-art methods, consistently achieving accuracy levels close to IID scenarios even under heterogeneous and challenging conditions. The significantly reduced computational overhead of the KMeans-based estimator is suitable for deployment on resource-constrained edge devices, thereby enhancing the scalability and real-world applicability of federated distillation. The code is available online for reproducibility.
comment: This paper was accepted at the International Conference on Federated Learning Technologies and Applications, 2025. The final version is available at IEEE Xplore
☆ PepThink-R1: LLM for Interpretable Cyclic Peptide Optimization with CoT SFT and Reinforcement Learning
Designing therapeutic peptides with tailored properties is hindered by the vastness of sequence space, limited experimental data, and poor interpretability of current generative models. To address these challenges, we introduce PepThink-R1, a generative framework that integrates large language models (LLMs) with chain-of-thought (CoT) supervised fine-tuning and reinforcement learning (RL). Unlike prior approaches, PepThink-R1 explicitly reasons about monomer-level modifications during sequence generation, enabling interpretable design choices while optimizing for multiple pharmacological properties. Guided by a tailored reward function balancing chemical validity and property improvements, the model autonomously explores diverse sequence variants. We demonstrate that PepThink-R1 generates cyclic peptides with significantly enhanced lipophilicity, stability, and exposure, outperforming existing general LLMs (e.g., GPT-5) and domain-specific baseline in both optimization success and interpretability. To our knowledge, this is the first LLM-based peptide design framework that combines explicit reasoning with RL-driven property control, marking a step toward reliable and transparent peptide optimization for therapeutic discovery.
☆ Distributional Adversarial Attacks and Training in Deep Hedging
In this paper, we study the robustness of classical deep hedging strategies under distributional shifts by leveraging the concept of adversarial attacks. We first demonstrate that standard deep hedging models are highly vulnerable to small perturbations in the input distribution, resulting in significant performance degradation. Motivated by this, we propose an adversarial training framework tailored to increase the robustness of deep hedging strategies. Our approach extends pointwise adversarial attacks to the distributional setting and introduces a computationally tractable reformulation of the adversarial optimization problem over a Wasserstein ball. This enables the efficient training of hedging strategies that are resilient to distributional perturbations. Through extensive numerical experiments, we show that adversarially trained deep hedging strategies consistently outperform their classical counterparts in terms of out-of-sample performance and resilience to model misspecification. Our findings establish a practical and effective framework for robust deep hedging under realistic market uncertainties.
comment: Preprint. Under review
☆ HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents
Open-ended AI agents need to be able to learn efficiently goals of increasing complexity, abstraction and heterogeneity over their lifetime. Beyond sampling efficiently their own goals, autotelic agents specifically need to be able to keep the growing complexity of goals under control, limiting the associated growth in sample and computational complexity. To adress this challenge, recent approaches have leveraged hierarchical reinforcement learning (HRL) and language, capitalizing on its compositional and combinatorial generalization capabilities to acquire temporally extended reusable behaviours. Existing approaches use expert defined spaces of subgoals over which they instantiate a hierarchy, and often assume pre-trained associated low-level policies. Such designs are inadequate in open-ended scenarios, where goal spaces naturally diversify across a broad spectrum of difficulties. We introduce HERAKLES, a framework that enables a two-level hierarchical autotelic agent to continuously compile mastered goals into the low-level policy, executed by a small, fast neural network, dynamically expanding the set of subgoals available to the high-level policy. We train a Large Language Model (LLM) to serve as the high-level controller, exploiting its strengths in goal decomposition and generalization to operate effectively over this evolving subgoal space. We evaluate HERAKLES in the open-ended Crafter environment and show that it scales effectively with goal complexity, improves sample efficiency through skill compilation, and enables the agent to adapt robustly to novel challenges over time.
comment: 42 pages
☆ Cross-Modality Controlled Molecule Generation with Diffusion Language Model
Current SMILES-based diffusion models for molecule generation typically support only unimodal constraint. They inject conditioning signals at the start of the training process and require retraining a new model from scratch whenever the constraint changes. However, real-world applications often involve multiple constraints across different modalities, and additional constraints may emerge over the course of a study. This raises a challenge: how to extend a pre-trained diffusion model not only to support cross-modality constraints but also to incorporate new ones without retraining. To tackle this problem, we propose the Cross-Modality Controlled Molecule Generation with Diffusion Language Model (CMCM-DLM), demonstrated by two distinct cross modalities: molecular structure and chemical properties. Our approach builds upon a pre-trained diffusion model, incorporating two trainable modules, the Structure Control Module (SCM) and the Property Control Module (PCM), and operates in two distinct phases during the generation process. In Phase I, we employs the SCM to inject structural constraints during the early diffusion steps, effectively anchoring the molecular backbone. Phase II builds on this by further introducing PCM to guide the later stages of inference to refine the generated molecules, ensuring their chemical properties match the specified targets. Experimental results on multiple datasets demonstrate the efficiency and adaptability of our approach, highlighting CMCM-DLM's significant advancement in molecular generation for drug discovery applications.
☆ MissionHD: Data-Driven Refinement of Reasoning Graph Structure through Hyperdimensional Causal Path Encoding and Decoding
Reasoning graphs from Large Language Models (LLMs) are often misaligned with downstream visual tasks such as video anomaly detection (VAD). Existing Graph Structure Refinement (GSR) methods are ill-suited for these novel, dataset-less graphs. We introduce Data-driven GSR (D-GSR), a new paradigm that directly optimizes graph structure using downstream task data, and propose MissionHD, a hyperdimensional computing (HDC) framework to operationalize it. MissionHD uses an efficient encode-decode process to refine the graph, guided by the downstream task signal. Experiments on challenging VAD and VAR benchmarks show significant performance improvements when using our refined graphs, validating our approach as an effective pre-processing step.
☆ CaTE Data Curation for Trustworthy AI
This report provides practical guidance to teams designing or developing AI-enabled systems for how to promote trustworthiness during the data curation phase of development. In this report, the authors first define data, the data curation phase, and trustworthiness. We then describe a series of steps that the development team, especially data scientists, can take to build a trustworthy AI-enabled system. We enumerate the sequence of core steps and trace parallel paths where alternatives exist. The descriptions of these steps include strengths, weaknesses, preconditions, outcomes, and relevant open-source software tool implementations. In total, this report is a synthesis of data curation tools and approaches from relevant academic literature, and our goal is to equip readers with a diverse yet coherent set of practices for improving AI trustworthiness.
☆ AFABench: A Generic Framework for Benchmarking Active Feature Acquisition
In many real-world scenarios, acquiring all features of a data instance can be expensive or impractical due to monetary cost, latency, or privacy concerns. Active Feature Acquisition (AFA) addresses this challenge by dynamically selecting a subset of informative features for each data instance, trading predictive performance against acquisition cost. While numerous methods have been proposed for AFA, ranging from greedy information-theoretic strategies to non-myopic reinforcement learning approaches, fair and systematic evaluation of these methods has been hindered by the lack of standardized benchmarks. In this paper, we introduce AFABench, the first benchmark framework for AFA. Our benchmark includes a diverse set of synthetic and real-world datasets, supports a wide range of acquisition policies, and provides a modular design that enables easy integration of new methods and tasks. We implement and evaluate representative algorithms from all major categories, including static, greedy, and reinforcement learning-based approaches. To test the lookahead capabilities of AFA policies, we introduce a novel synthetic dataset, AFAContext, designed to expose the limitations of greedy selection. Our results highlight key trade-offs between different AFA strategies and provide actionable insights for future research. The benchmark code is available at: https://github.com/Linusaronsson/AFA-Benchmark.
☆ Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis
This study presents a quantitative evaluation of the code quality and security of five prominent Large Language Models (LLMs): Claude Sonnet 4, Claude 3.7 Sonnet, GPT-4o, Llama 3.2 90B, and OpenCoder 8B. While prior research has assessed the functional performance of LLM-generated code, this research tested LLM output from 4,442 Java coding assignments through comprehensive static analysis using SonarQube. The findings suggest that although LLMs can generate functional code, they also introduce a range of software defects, including bugs, security vulnerabilities, and code smells. These defects do not appear to be isolated; rather, they may represent shared weaknesses stemming from systemic limitations within current LLM code generation methods. In particular, critically severe issues, such as hard-coded passwords and path traversal vulnerabilities, were observed across multiple models. These results indicate that LLM-generated code requires verification in order to be considered production-ready. This study found no direct correlation between a model's functional performance (measured by Pass@1 rate of unit tests) and the overall quality and security of its generated code, measured by the number of SonarQube issues in benchmark solutions that passed the functional tests. This suggests that functional benchmark performance score is not a good indicator of overall code quality and security. The goal of this study is not to rank LLM performance but to highlight that all evaluated models appear to share certain weaknesses. Consequently, these findings support the view that static analysis can be a valuable instrument for detecting latent defects and an important safeguard for organizations that deploy AI in software development.
☆ ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine
Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.
☆ ECHO: Frequency-aware Hierarchical Encoding for Variable-length Signal
Pre-trained foundation models have demonstrated remarkable success in vision and language, yet their potential for general machine signal modeling-covering acoustic, vibration, and other industrial sensor data-remains under-explored. Existing approach using sub-band-based encoders has achieved competitive results but are limited by fixed input lengths, and the absence of explicit frequency positional encoding. In this work, we propose a novel foundation model that integrates an advanced band-split architecture with relative frequency positional embeddings, enabling precise spectral localization across arbitrary sampling configurations. The model supports inputs of arbitrary length without padding or segmentation, producing a concise embedding that retains both temporal and spectral fidelity. We evaluate our method on SIREN (https://github.com/yucongzh/SIREN), a newly introduced large-scale benchmark for machine signal encoding that unifies multiple datasets, including all DCASE task 2 challenges (2020-2025) and widely-used industrial signal corpora. Experimental results demonstrate consistent state-of-the-art performance in anomaly detection and fault identification, confirming the effectiveness and generalization capability of the proposed model. We open-sourced ECHO on https://github.com/yucongzh/ECHO.
☆ Addressing Graph Anomaly Detection via Causal Edge Separation and Spectrum KDD
In the real world, anomalous entities often add more legitimate connections while hiding direct links with other anomalous entities, leading to heterophilic structures in anomalous networks that most GNN-based techniques fail to address. Several works have been proposed to tackle this issue in the spatial domain. However, these methods overlook the complex relationships between node structure encoding, node features, and their contextual environment and rely on principled guidance, research on solving spectral domain heterophilic problems remains limited. This study analyzes the spectral distribution of nodes with different heterophilic degrees and discovers that the heterophily of anomalous nodes causes the spectral energy to shift from low to high frequencies. To address the above challenges, we propose a spectral neural network CES2-GAD based on causal edge separation for anomaly detection on heterophilic graphs. Firstly, CES2-GAD will separate the original graph into homophilic and heterophilic edges using causal interventions. Subsequently, various hybrid-spectrum filters are used to capture signals from the segmented graphs. Finally, representations from multiple signals are concatenated and input into a classifier to predict anomalies. Extensive experiments with real-world datasets have proven the effectiveness of the method we proposed.
comment: Proceedings of the 2024 KDD Workshop
☆ Improving Fairness in Graph Neural Networks via Counterfactual Debiasing KDD
Graph Neural Networks (GNNs) have been successful in modeling graph-structured data. However, similar to other machine learning models, GNNs can exhibit bias in predictions based on attributes like race and gender. Moreover, bias in GNNs can be exacerbated by the graph structure and message-passing mechanisms. Recent cutting-edge methods propose mitigating bias by filtering out sensitive information from input or representations, like edge dropping or feature masking. Yet, we argue that such strategies may unintentionally eliminate non-sensitive features, leading to a compromised balance between predictive accuracy and fairness. To tackle this challenge, we present a novel approach utilizing counterfactual data augmentation for bias mitigation. This method involves creating diverse neighborhoods using counterfactuals before message passing, facilitating unbiased node representations learning from the augmented graph. Subsequently, an adversarial discriminator is employed to diminish bias in predictions by conventional GNN classifiers. Our proposed technique, Fair-ICD, ensures the fairness of GNNs under moderate conditions. Experiments on standard datasets using three GNN backbones demonstrate that Fair-ICD notably enhances fairness metrics while preserving high predictive performance.
comment: Proceedings of the 2024 KDD Workshop
☆ ELATE: Evolutionary Language model for Automated Time-series Engineering
Time-series prediction involves forecasting future values using machine learning models. Feature engineering, whereby existing features are transformed to make new ones, is critical for enhancing model performance, but is often manual and time-intensive. Existing automation attempts rely on exhaustive enumeration, which can be computationally costly and lacks domain-specific insights. We introduce ELATE (Evolutionary Language model for Automated Time-series Engineering), which leverages a language model within an evolutionary framework to automate feature engineering for time-series data. ELATE employs time-series statistical measures and feature importance metrics to guide and prune features, while the language model proposes new, contextually relevant feature transformations. Our experiments demonstrate that ELATE improves forecasting accuracy by an average of 8.4% across various domains.
comment: 27 pages, 4 figures. Comments welcome
☆ Understanding Data Influence with Differential Approximation
Data plays a pivotal role in the groundbreaking advancements in artificial intelligence. The quantitative analysis of data significantly contributes to model training, enhancing both the efficiency and quality of data utilization. However, existing data analysis tools often lag in accuracy. For instance, many of these tools even assume that the loss function of neural networks is convex. These limitations make it challenging to implement current methods effectively. In this paper, we introduce a new formulation to approximate a sample's influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.
☆ Clinical semantics for lung cancer prediction
Background: Existing clinical prediction models often represent patient data using features that ignore the semantic relationships between clinical concepts. This study integrates domain-specific semantic information by mapping the SNOMED medical term hierarchy into a low-dimensional hyperbolic space using Poincar\'e embeddings, with the aim of improving lung cancer onset prediction. Methods: Using a retrospective cohort from the Optum EHR dataset, we derived a clinical knowledge graph from the SNOMED taxonomy and generated Poincar\'e embeddings via Riemannian stochastic gradient descent. These embeddings were then incorporated into two deep learning architectures, a ResNet and a Transformer model. Models were evaluated for discrimination (area under the receiver operating characteristic curve) and calibration (average absolute difference between observed and predicted probabilities) performance. Results: Incorporating pre-trained Poincar\'e embeddings resulted in modest and consistent improvements in discrimination performance compared to baseline models using randomly initialized Euclidean embeddings. ResNet models, particularly those using a 10-dimensional Poincar\'e embedding, showed enhanced calibration, whereas Transformer models maintained stable calibration across configurations. Discussion: Embedding clinical knowledge graphs into hyperbolic space and integrating these representations into deep learning models can improve lung cancer onset prediction by preserving the hierarchical structure of clinical terminologies used for prediction. This approach demonstrates a feasible method for combining data-driven feature extraction with established clinical knowledge.
☆ A Fuzzy-Enhanced Explainable AI Framework for Flight Continuous Descent Operations Classification
Continuous Descent Operations (CDO) involve smooth, idle-thrust descents that avoid level-offs, reducing fuel burn, emissions, and noise while improving efficiency and passenger comfort. Despite its operational and environmental benefits, limited research has systematically examined the factors influencing CDO performance. Moreover, many existing methods in related areas, such as trajectory optimization, lack the transparency required in aviation, where explainability is critical for safety and stakeholder trust. This study addresses these gaps by proposing a Fuzzy-Enhanced Explainable AI (FEXAI) framework that integrates fuzzy logic with machine learning and SHapley Additive exPlanations (SHAP) analysis. For this purpose, a comprehensive dataset of 29 features, including 11 operational and 18 weather-related features, was collected from 1,094 flights using Automatic Dependent Surveillance-Broadcast (ADS-B) data. Machine learning models and SHAP were then applied to classify flights' CDO adherence levels and rank features by importance. The three most influential features, as identified by SHAP scores, were then used to construct a fuzzy rule-based classifier, enabling the extraction of interpretable fuzzy rules. All models achieved classification accuracies above 90%, with FEXAI providing meaningful, human-readable rules for operational users. Results indicated that the average descent rate within the arrival route, the number of descent segments, and the average change in directional heading during descent were the strongest predictors of CDO performance. The FEXAI method proposed in this study presents a novel pathway for operational decision support and could be integrated into aviation tools to enable real-time advisories that maintain CDO adherence under varying operational conditions.
☆ Measuring IIA Violations in Similarity Choices with Bayesian Models UAI 2025
Similarity choice data occur when humans make choices among alternatives based on their similarity to a target, e.g., in the context of information retrieval and in embedding learning settings. Classical metric-based models of similarity choice assume independence of irrelevant alternatives (IIA), a property that allows for a simpler formulation. While IIA violations have been detected in many discrete choice settings, the similarity choice setting has received scant attention. This is because the target-dependent nature of the choice complicates IIA testing. We propose two statistical methods to test for IIA: a classical goodness-of-fit test and a Bayesian counterpart based on the framework of Posterior Predictive Checks (PPC). This Bayesian approach, our main technical contribution, quantifies the degree of IIA violation beyond its mere significance. We curate two datasets: one with choice sets designed to elicit IIA violations, and another with randomly generated choice sets from the same item universe. Our tests confirmed significant IIA violations on both datasets, and notably, we find a comparable degree of violation between them. Further, we devise a new PPC test for population homogeneity. Results show that the population is indeed homogenous, suggesting that the IIA violations are driven by context effects -- specifically, interactions within the choice sets. These results highlight the need for new similarity choice models that account for such context effects.
comment: 26 pages and 34 figures, for associated code and data, see https://github.com/correahs/similarity-uai-2025, poster session in UAI 2025
☆ DualNILM: Energy Injection Identification Enabled Disaggregation with Deep Multi-Task Learning
Non-Intrusive Load Monitoring (NILM) offers a cost-effective method to obtain fine-grained appliance-level energy consumption in smart homes and building applications. However, the increasing adoption of behind-the-meter energy sources, such as solar panels and battery storage, poses new challenges for conventional NILM methods that rely solely on at-the-meter data. The injected energy from the behind-the-meter sources can obscure the power signatures of individual appliances, leading to a significant decline in NILM performance. To address this challenge, we present DualNILM, a deep multi-task learning framework designed for the dual tasks of appliance state recognition and injected energy identification in NILM. By integrating sequence-to-point and sequence-to-sequence strategies within a Transformer-based architecture, DualNILM can effectively capture multi-scale temporal dependencies in the aggregate power consumption patterns, allowing for accurate appliance state recognition and energy injection identification. We conduct validation of DualNILM using both self-collected and synthesized open NILM datasets that include both appliance-level energy consumption and energy injection. Extensive experimental results demonstrate that DualNILM maintains an excellent performance for the dual tasks in NILM, much outperforming conventional methods.
comment: Preprint
☆ A Comprehensive Evaluation of the Sensitivity of Density-Ratio Estimation Based Fairness Measurement in Regression
The prevalence of algorithmic bias in Machine Learning (ML)-driven approaches has inspired growing research on measuring and mitigating bias in the ML domain. Accordingly, prior research studied how to measure fairness in regression which is a complex problem. In particular, recent research proposed to formulate it as a density-ratio estimation problem and relied on a Logistic Regression-driven probabilistic classifier-based approach to solve it. However, there are several other methods to estimate a density ratio, and to the best of our knowledge, prior work did not study the sensitivity of such fairness measurement methods to the choice of underlying density ratio estimation algorithm. To fill this gap, this paper develops a set of fairness measurement methods with various density-ratio estimation cores and thoroughly investigates how different cores would affect the achieved level of fairness. Our experimental results show that the choice of density-ratio estimation core could significantly affect the outcome of fairness measurement method, and even, generate inconsistent results with respect to the relative fairness of various algorithms. These observations suggest major issues with density-ratio estimation based fairness measurement in regression and a need for further research to enhance their reliability.
☆ Towards Skeletal and Signer Noise Reduction in Sign Language Production via Quaternion-Based Pose Encoding and Contrastive Learning
One of the main challenges in neural sign language production (SLP) lies in the high intra-class variability of signs, arising from signer morphology and stylistic variety in the training data. To improve robustness to such variations, we propose two enhancements to the standard Progressive Transformers (PT) architecture (Saunders et al., 2020). First, we encode poses using bone rotations in quaternion space and train with a geodesic loss to improve the accuracy and clarity of angular joint movements. Second, we introduce a contrastive loss to structure decoder embeddings by semantic similarity, using either gloss overlap or SBERT-based sentence similarity, aiming to filter out anatomical and stylistic features that do not convey relevant semantic information. On the Phoenix14T dataset, the contrastive loss alone yields a 16% improvement in Probability of Correct Keypoint over the PT baseline. When combined with quaternion-based pose encoding, the model achieves a 6% reduction in Mean Bone Angle Error. These results point to the benefit of incorporating skeletal structure modeling and semantically guided contrastive objectives on sign pose representations into the training of Transformer-based SLP models.
☆ Cooperative SGD with Dynamic Mixing Matrices ECAI-2025
One of the most common methods to train machine learning algorithms today is the stochastic gradient descent (SGD). In a distributed setting, SGD-based algorithms have been shown to converge theoretically under specific circumstances. A substantial number of works in the distributed SGD setting assume a fixed topology for the edge devices. These papers also assume that the contribution of nodes to the global model is uniform. However, experiments have shown that such assumptions are suboptimal and a non uniform aggregation strategy coupled with a dynamically shifting topology and client selection can significantly improve the performance of such models. This paper details a unified framework that covers several Local-Update SGD-based distributed algorithms with dynamic topologies and provides improved or matching theoretical guarantees on convergence compared to existing work.
comment: Accepted at 28th European Conference on Artificial Intelligence (ECAI-2025) in main paper track
☆ Improving OCR using internal document redundancy
Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
comment: 28 pages, 10 figures, including supplementary material. Code: https://github.com/seginusmowlavi/ocr-using-shape-redundancy. Dataset: https://github.com/camilomarino/ocr_berrutti_dataset
☆ Adaptively Robust LLM Inference Optimization under Prediction Uncertainty
We study the problem of optimizing Large Language Model (LLM) inference scheduling to minimize total latency. LLM inference is an online and multi-task service process and also heavily energy consuming by which a pre-trained LLM processes input requests and generates output tokens sequentially. Therefore, it is vital to improve its scheduling efficiency and reduce the power consumption while a great amount of prompt requests are arriving. A key challenge in LLM inference scheduling is that while the prompt length is known upon arrival, the output length, which critically impacts memory usage and processing time, is unknown. To address this uncertainty, we propose algorithms that leverage machine learning to predict output lengths, assuming the prediction provides an interval classification (min-max range) for each request. We first design a conservative algorithm, $\mathcal{A}_{\max}$, which schedules requests based on the upper bound of predicted output lengths to prevent memory overflow. However, this approach is overly conservative: as prediction accuracy decreases, performance degrades significantly due to potential overestimation. To overcome this limitation, we propose $\mathcal{A}_{\min}$, an adaptive algorithm that initially treats the predicted lower bound as the output length and dynamically refines this estimate during inferencing. We prove that $\mathcal{A}_{\min}$ achieves a log-scale competitive ratio. Through numerical simulations, we demonstrate that $\mathcal{A}_{\min}$ often performs nearly as well as the hindsight scheduler, highlighting both its efficiency and robustness in practical scenarios. Moreover, $\mathcal{A}_{\min}$ relies solely on the lower bound of the prediction interval--an advantageous design choice since upper bounds on output length are typically more challenging to predict accurately.
☆ FedEve: On Bridging the Client Drift and Period Drift for Cross-device Federated Learning
Federated learning (FL) is a machine learning paradigm that allows multiple clients to collaboratively train a shared model without exposing their private data. Data heterogeneity is a fundamental challenge in FL, which can result in poor convergence and performance degradation. Client drift has been recognized as one of the factors contributing to this issue resulting from the multiple local updates in FedAvg. However, in cross-device FL, a different form of drift arises due to the partial client participation, but it has not been studied well. This drift, we referred as period drift, occurs as participating clients at each communication round may exhibit distinct data distribution that deviates from that of all clients. It could be more harmful than client drift since the optimization objective shifts with every round. In this paper, we investigate the interaction between period drift and client drift, finding that period drift can have a particularly detrimental effect on cross-device FL as the degree of data heterogeneity increases. To tackle these issues, we propose a predict-observe framework and present an instantiated method, FedEve, where these two types of drift can compensate each other to mitigate their overall impact. We provide theoretical evidence that our approach can reduce the variance of model updates. Extensive experiments demonstrate that our method outperforms alternatives on non-iid data in cross-device settings.
☆ Beyond ReLU: Chebyshev-DQN for Enhanced Deep Q-Networks
The performance of Deep Q-Networks (DQN) is critically dependent on the ability of its underlying neural network to accurately approximate the action-value function. Standard function approximators, such as multi-layer perceptrons, may struggle to efficiently represent the complex value landscapes inherent in many reinforcement learning problems. This paper introduces a novel architecture, the Chebyshev-DQN (Ch-DQN), which integrates a Chebyshev polynomial basis into the DQN framework to create a more effective feature representation. By leveraging the powerful function approximation properties of Chebyshev polynomials, we hypothesize that the Ch-DQN can learn more efficiently and achieve higher performance. We evaluate our proposed model on the CartPole-v1 benchmark and compare it against a standard DQN with a comparable number of parameters. Our results demonstrate that the Ch-DQN with a moderate polynomial degree (N=4) achieves significantly better asymptotic performance, outperforming the baseline by approximately 39\%. However, we also find that the choice of polynomial degree is a critical hyperparameter, as a high degree (N=8) can be detrimental to learning. This work validates the potential of using orthogonal polynomial bases in deep reinforcement learning while also highlighting the trade-offs involved in model complexity.
☆ Great GATsBi: Hybrid, Multimodal, Trajectory Forecasting for Bicycles using Anticipation Mechanism
Accurate prediction of road user movement is increasingly required by many applications ranging from advanced driver assistance systems to autonomous driving, and especially crucial for road safety. Even though most traffic accident fatalities account to bicycles, they have received little attention, as previous work focused mainly on pedestrians and motorized vehicles. In this work, we present the Great GATsBi, a domain-knowledge-based, hybrid, multimodal trajectory prediction framework for bicycles. The model incorporates both physics-based modeling (inspired by motorized vehicles) and social-based modeling (inspired by pedestrian movements) to explicitly account for the dual nature of bicycle movement. The social interactions are modeled with a graph attention network, and include decayed historical, but also anticipated, future trajectory data of a bicycles neighborhood, following recent insights from psychological and social studies. The results indicate that the proposed ensemble of physics models -- performing well in the short-term predictions -- and social models -- performing well in the long-term predictions -- exceeds state-of-the-art performance. We also conducted a controlled mass-cycling experiment to demonstrate the framework's performance when forecasting bicycle trajectories and modeling social interactions with road users.
Artificial Intelligence-Based Multiscale Temporal Modeling for Anomaly Detection in Cloud Services
This study proposes an anomaly detection method based on the Transformer architecture with integrated multiscale feature perception, aiming to address the limitations of temporal modeling and scale-aware feature representation in cloud service environments. The method first employs an improved Transformer module to perform temporal modeling on high-dimensional monitoring data, using a self-attention mechanism to capture long-range dependencies and contextual semantics. Then, a multiscale feature construction path is introduced to extract temporal features at different granularities through downsampling and parallel encoding. An attention-weighted fusion module is designed to dynamically adjust the contribution of each scale to the final decision, enhancing the model's robustness in anomaly pattern modeling. In the input modeling stage, standardized multidimensional time series are constructed, covering core signals such as CPU utilization, memory usage, and task scheduling states, while positional encoding is used to strengthen the model's temporal awareness. A systematic experimental setup is designed to evaluate performance, including comparative experiments and hyperparameter sensitivity analysis, focusing on the impact of optimizers, learning rates, anomaly ratios, and noise levels. Experimental results show that the proposed method outperforms mainstream baseline models in key metrics, including precision, recall, AUC, and F1-score, and maintains strong stability and detection performance under various perturbation conditions, demonstrating its superior capability in complex cloud environments.
☆ Exact Shapley Attributions in Quadratic-time for FANOVA Gaussian Processes
Shapley values are widely recognized as a principled method for attributing importance to input features in machine learning. However, the exact computation of Shapley values scales exponentially with the number of features, severely limiting the practical application of this powerful approach. The challenge is further compounded when the predictive model is probabilistic - as in Gaussian processes (GPs) - where the outputs are random variables rather than point estimates, necessitating additional computational effort in modeling higher-order moments. In this work, we demonstrate that for an important class of GPs known as FANOVA GP, which explicitly models all main effects and interactions, *exact* Shapley attributions for both local and global explanations can be computed in *quadratic time*. For local, instance-wise explanations, we define a stochastic cooperative game over function components and compute the exact stochastic Shapley value in quadratic time only, capturing both the expected contribution and uncertainty. For global explanations, we introduce a deterministic, variance-based value function and compute exact Shapley values that quantify each feature's contribution to the model's overall sensitivity. Our methods leverage a closed-form (stochastic) M\"{o}bius representation of the FANOVA decomposition and introduce recursive algorithms, inspired by Newton's identities, to efficiently compute the mean and variance of Shapley values. Our work enhances the utility of explainable AI, as demonstrated by empirical studies, by providing more scalable, axiomatically sound, and uncertainty-aware explanations for predictions generated by structured probabilistic models.
☆ Semantic Energy: Detecting LLM Hallucination Beyond Entropy
Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations, which produce fluent yet incorrect responses and lead to erroneous decision-making. Uncertainty estimation is a feasible approach to detect such hallucinations. For example, semantic entropy estimates uncertainty by considering the semantic diversity across multiple sampled responses, thus identifying hallucinations. However, semantic entropy relies on post-softmax probabilities and fails to capture the model's inherent uncertainty, causing it to be ineffective in certain scenarios. To address this issue, we introduce Semantic Energy, a novel uncertainty estimation framework that leverages the inherent confidence of LLMs by operating directly on logits of penultimate layer. By combining semantic clustering with a Boltzmann-inspired energy distribution, our method better captures uncertainty in cases where semantic entropy fails. Experiments across multiple benchmarks show that Semantic Energy significantly improves hallucination detection and uncertainty estimation, offering more reliable signals for downstream applications such as hallucination detection.
☆ On the notion of missingness for path attribution explainability methods in medical settings: Guiding the selection of medically meaningful baselines
The explainability of deep learning models remains a significant challenge, particularly in the medical domain where interpretable outputs are critical for clinical trust and transparency. Path attribution methods such as Integrated Gradients rely on a baseline input representing the absence of relevant features ("missingness"). Commonly used baselines, such as all-zero inputs, are often semantically meaningless, especially in medical contexts where missingness can itself be informative. While alternative baseline choices have been explored, existing methods lack a principled approach to dynamically select baselines tailored to each input. In this work, we examine the notion of missingness in the medical setting, analyze its implications for baseline selection, and introduce a counterfactual-guided approach to address the limitations of conventional baselines. We argue that a clinically normal but input-close counterfactual represents a more accurate representation of a meaningful absence of features in medical data. To implement this, we use a Variational Autoencoder to generate counterfactual baselines, though our concept is generative-model-agnostic and can be applied with any suitable counterfactual method. We evaluate the approach on three distinct medical data sets and empirically demonstrate that counterfactual baselines yield more faithful and medically relevant attributions compared to standard baseline choices.
☆ Fast Symbolic Regression Benchmarking
Symbolic regression (SR) uncovers mathematical models from data. Several benchmarks have been proposed to compare the performance of SR algorithms. However, existing ground-truth rediscovery benchmarks overemphasize the recovery of "the one" expression form or rely solely on computer algebra systems (such as SymPy) to assess success. Furthermore, existing benchmarks continue the expression search even after its discovery. We improve upon these issues by introducing curated lists of acceptable expressions, and a callback mechanism for early termination. As a starting point, we use the symbolic regression for scientific discovery (SRSD) benchmark problems proposed by Yoshitomo et al., and benchmark the two SR packages SymbolicRegression.jl and TiSR. The new benchmarking method increases the rediscovery rate of SymbolicRegression.jl from 26.7%, as reported by Yoshitomo et at., to 44.7%. Performing the benchmark takes 41.2% less computational expense. TiSR's rediscovery rate is 69.4%, while performing the benchmark saves 63% time.
☆ DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization
We present DuPO, a dual learning-based preference optimization framework that generates annotation-free feedback via a generalized duality. DuPO addresses two key limitations: Reinforcement Learning with Verifiable Rewards (RLVR)'s reliance on costly labels and applicability restricted to verifiable tasks, and traditional dual learning's restriction to strictly dual task pairs (e.g., translation and back-translation). Specifically, DuPO decomposes a primal task's input into known and unknown components, then constructs its dual task to reconstruct the unknown part using the primal output and known information (e.g., reversing math solutions to recover hidden variables), broadening applicability to non-invertible tasks. The quality of this reconstruction serves as a self-supervised reward to optimize the primal task, synergizing with LLMs' ability to instantiate both tasks via a single model. Empirically, DuPO achieves substantial gains across diverse tasks: it enhances the average translation quality by 2.13 COMET over 756 directions, boosts the mathematical reasoning accuracy by an average of 6.4 points on three challenge benchmarks, and enhances performance by 9.3 points as an inference-time reranker (trading computation for accuracy). These results position DuPO as a scalable, general, and annotation-free paradigm for LLM optimization.
comment: 18 pages, 4 figures,
☆ NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model
We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.
☆ Personalized Counterfactual Framework: Generating Potential Outcomes from Wearable Data
Wearable sensor data offer opportunities for personalized health monitoring, yet deriving actionable insights from their complex, longitudinal data streams is challenging. This paper introduces a framework to learn personalized counterfactual models from multivariate wearable data. This enables exploring what-if scenarios to understand potential individual-specific outcomes of lifestyle choices. Our approach first augments individual datasets with data from similar patients via multi-modal similarity analysis. We then use a temporal PC (Peter-Clark) algorithm adaptation to discover predictive relationships, modeling how variables at time t-1 influence physiological changes at time t. Gradient Boosting Machines are trained on these discovered relationships to quantify individual-specific effects. These models drive a counterfactual engine projecting physiological trajectories under hypothetical interventions (e.g., activity or sleep changes). We evaluate the framework via one-step-ahead predictive validation and by assessing the plausibility and impact of interventions. Evaluation showed reasonable predictive accuracy (e.g., mean heart rate MAE 4.71 bpm) and high counterfactual plausibility (median 0.9643). Crucially, these interventions highlighted significant inter-individual variability in response to hypothetical lifestyle changes, showing the framework's potential for personalized insights. This work provides a tool to explore personalized health dynamics and generate hypotheses on individual responses to lifestyle changes.
☆ Disentanglement in T-space for Faster and Distributed Training of Diffusion Models with Fewer Latent-states
We challenge a fundamental assumption of diffusion models, namely, that a large number of latent-states or time-steps is required for training so that the reverse generative process is close to a Gaussian. We first show that with careful selection of a noise schedule, diffusion models trained over a small number of latent states (i.e. $T \sim 32$) match the performance of models trained over a much large number of latent states ($T \sim 1,000$). Second, we push this limit (on the minimum number of latent states required) to a single latent-state, which we refer to as complete disentanglement in T-space. We show that high quality samples can be easily generated by the disentangled model obtained by combining several independently trained single latent-state models. We provide extensive experiments to show that the proposed disentangled model provides 4-6$\times$ faster convergence measured across a variety of metrics on two different datasets.
☆ Online Incident Response Planning under Model Misspecification through Bayesian Learning and Belief Quantization CCS
Effective responses to cyberattacks require fast decisions, even when information about the attack is incomplete or inaccurate. However, most decision-support frameworks for incident response rely on a detailed system model that describes the incident, which restricts their practical utility. In this paper, we address this limitation and present an online method for incident response planning under model misspecification, which we call MOBAL: Misspecified Online Bayesian Learning. MOBAL iteratively refines a conjecture about the model through Bayesian learning as new information becomes available, which facilitates model adaptation as the incident unfolds. To determine effective responses online, we quantize the conjectured model into a finite Markov model, which enables efficient response planning through dynamic programming. We prove that Bayesian learning is asymptotically consistent with respect to the information feedback. Additionally, we establish bounds on misspecification and quantization errors. Experiments on the CAGE-2 benchmark show that MOBAL outperforms the state of the art in terms of adaptability and robustness to model misspecification.
comment: Accepted to ACM CCS AISec2025
☆ Offline Imitation Learning upon Arbitrary Demonstrations by Pre-Training Dynamics Representations
Limited data has become a major bottleneck in scaling up offline imitation learning (IL). In this paper, we propose enhancing IL performance under limited expert data by introducing a pre-training stage that learns dynamics representations, derived from factorizations of the transition dynamics. We first theoretically justify that the optimal decision variable of offline IL lies in the representation space, significantly reducing the parameters to learn in the downstream IL. Moreover, the dynamics representations can be learned from arbitrary data collected with the same dynamics, allowing the reuse of massive non-expert data and mitigating the limited data issues. We present a tractable loss function inspired by noise contrastive estimation to learn the dynamics representations at the pre-training stage. Experiments on MuJoCo demonstrate that our proposed algorithm can mimic expert policies with as few as a single trajectory. Experiments on real quadrupeds show that we can leverage pre-trained dynamics representations from simulator data to learn to walk from a few real-world demonstrations.
comment: 7 pages, 5 figures
☆ Action-Constrained Imitation Learning ICML 2025
Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through \textit{trajectory alignment} and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency. Our code is publicly available at https://github.com/NYCU-RL-Bandits-Lab/ACRL-Baselines.
comment: Published in ICML 2025
☆ Hilbert geometry of the symmetric positive-definite bicone: Application to the geometry of the extended Gaussian family
The extended Gaussian family is the closure of the Gaussian family obtained by completing the Gaussian family with the counterpart elements induced by degenerate covariance or degenerate precision matrices, or a mix of both degeneracies. The parameter space of the extended Gaussian family forms a symmetric positive semi-definite matrix bicone, i.e. two partial symmetric positive semi-definite matrix cones joined at their bases. In this paper, we study the Hilbert geometry of such an open bounded convex symmetric positive-definite bicone. We report the closed-form formula for the corresponding Hilbert metric distance and study exhaustively its invariance properties. We also touch upon potential applications of this geometry for dealing with extended Gaussian distributions.
comment: 21 pages
☆ Evaluation and Optimization of Leave-one-out Cross-validation for the Lasso
I develop an algorithm to produce the piecewise quadratic that computes leave-one-out cross-validation for the lasso as a function of its hyperparameter. The algorithm can be used to find exact hyperparameters that optimize leave-one-out cross-validation either globally or locally, and its practicality is demonstrated on real-world data sets.
comment: 18 pages, 3 figures, 7 tables
☆ Organ-Agents: Virtual Human Physiology Simulator via LLMs
Recent advances in large language models (LLMs) have enabled new possibilities in simulating complex physiological systems. We introduce Organ-Agents, a multi-agent framework that simulates human physiology via LLM-driven agents. Each Simulator models a specific system (e.g., cardiovascular, renal, immune). Training consists of supervised fine-tuning on system-specific time-series data, followed by reinforcement-guided coordination using dynamic reference selection and error correction. We curated data from 7,134 sepsis patients and 7,895 controls, generating high-resolution trajectories across 9 systems and 125 variables. Organ-Agents achieved high simulation accuracy on 4,509 held-out patients, with per-system MSEs <0.16 and robustness across SOFA-based severity strata. External validation on 22,689 ICU patients from two hospitals showed moderate degradation under distribution shifts with stable simulation. Organ-Agents faithfully reproduces critical multi-system events (e.g., hypotension, hyperlactatemia, hypoxemia) with coherent timing and phase progression. Evaluation by 15 critical care physicians confirmed realism and physiological plausibility (mean Likert ratings 3.9 and 3.7). Organ-Agents also enables counterfactual simulations under alternative sepsis treatment strategies, generating trajectories and APACHE II scores aligned with matched real-world patients. In downstream early warning tasks, classifiers trained on synthetic data showed minimal AUROC drops (<0.04), indicating preserved decision-relevant patterns. These results position Organ-Agents as a credible, interpretable, and generalizable digital twin for precision diagnosis, treatment simulation, and hypothesis testing in critical care.
☆ SBGD: Improving Graph Diffusion Generative Model via Stochastic Block Diffusion
Graph diffusion generative models (GDGMs) have emerged as powerful tools for generating high-quality graphs. However, their broader adoption faces challenges in \emph{scalability and size generalization}. GDGMs struggle to scale to large graphs due to their high memory requirements, as they typically operate in the full graph space, requiring the entire graph to be stored in memory during training and inference. This constraint limits their feasibility for large-scale real-world graphs. GDGMs also exhibit poor size generalization, with limited ability to generate graphs of sizes different from those in the training data, restricting their adaptability across diverse applications. To address these challenges, we propose the stochastic block graph diffusion (SBGD) model, which refines graph representations into a block graph space. This space incorporates structural priors based on real-world graph patterns, significantly reducing memory complexity and enabling scalability to large graphs. The block representation also improves size generalization by capturing fundamental graph structures. Empirical results show that SBGD achieves significant memory improvements (up to 6$\times$) while maintaining comparable or even superior graph generation performance relative to state-of-the-art methods. Furthermore, experiments demonstrate that SBGD better generalizes to unseen graph sizes. The significance of SBGD extends beyond being a scalable and effective GDGM; it also exemplifies the principle of modularization in generative modeling, offering a new avenue for exploring generative models by decomposing complex tasks into more manageable components.
☆ A Non-Asymptotic Convergent Analysis for Scored-Based Graph Generative Model via a System of Stochastic Differential Equations
Score-based graph generative models (SGGMs) have proven effective in critical applications such as drug discovery and protein synthesis. However, their theoretical behavior, particularly regarding convergence, remains underexplored. Unlike common score-based generative models (SGMs), which are governed by a single stochastic differential equation (SDE), SGGMs involve a system of coupled SDEs. In SGGMs, the graph structure and node features are governed by separate but interdependent SDEs. This distinction makes existing convergence analyses from SGMs inapplicable for SGGMs. In this work, we present the first non-asymptotic convergence analysis for SGGMs, focusing on the convergence bound (the risk of generative error) across three key graph generation paradigms: (1) feature generation with a fixed graph structure, (2) graph structure generation with fixed node features, and (3) joint generation of both graph structure and node features. Our analysis reveals several unique factors specific to SGGMs (e.g., the topological properties of the graph structure) which affect the convergence bound. Additionally, we offer theoretical insights into the selection of hyperparameters (e.g., sampling steps and diffusion length) and advocate for techniques like normalization to improve convergence. To validate our theoretical findings, we conduct a controlled empirical study using synthetic graph models, and the results align with our theoretical predictions. This work deepens the theoretical understanding of SGGMs, demonstrates their applicability in critical domains, and provides practical guidance for designing effective models.
☆ HandCraft: Dynamic Sign Generation for Synthetic Data Augmentation
Sign Language Recognition (SLR) models face significant performance limitations due to insufficient training data availability. In this article, we address the challenge of limited data in SLR by introducing a novel and lightweight sign generation model based on CMLPe. This model, coupled with a synthetic data pretraining approach, consistently improves recognition accuracy, establishing new state-of-the-art results for the LSFB and DiSPLaY datasets using our Mamba-SL and Transformer-SL classifiers. Our findings reveal that synthetic data pretraining outperforms traditional augmentation methods in some cases and yields complementary benefits when implemented alongside them. Our approach democratizes sign generation and synthetic data pretraining for SLR by providing computationally efficient methods that achieve significant performance improvements across diverse datasets.
comment: 26 pages, 4 figures, 9 tables, code available at https://github.com/okason97/HandCraft
☆ Generative AI Against Poaching: Latent Composite Flow Matching for Wildlife Conservation
Poaching poses significant threats to wildlife and biodiversity. A valuable step in reducing poaching is to forecast poacher behavior, which can inform patrol planning and other conservation interventions. Existing poaching prediction methods based on linear models or decision trees lack the expressivity to capture complex, nonlinear spatiotemporal patterns. Recent advances in generative modeling, particularly flow matching, offer a more flexible alternative. However, training such models on real-world poaching data faces two central obstacles: imperfect detection of poaching events and limited data. To address imperfect detection, we integrate flow matching with an occupancy-based detection model and train the flow in latent space to infer the underlying occupancy state. To mitigate data scarcity, we adopt a composite flow initialized from a linear-model prediction rather than random noise which is the standard in diffusion models, injecting prior knowledge and improving generalization. Evaluations on datasets from two national parks in Uganda show consistent gains in predictive accuracy.
☆ A Comparative Evaluation of Teacher-Guided Reinforcement Learning Techniques for Autonomous Cyber Operations
Autonomous Cyber Operations (ACO) rely on Reinforcement Learning (RL) to train agents to make effective decisions in the cybersecurity domain. However, existing ACO applications require agents to learn from scratch, leading to slow convergence and poor early-stage performance. While teacher-guided techniques have demonstrated promise in other domains, they have not yet been applied to ACO. In this study, we implement four distinct teacher-guided techniques in the simulated CybORG environment and conduct a comparative evaluation. Our results demonstrate that teacher integration can significantly improve training efficiency in terms of early policy performance and convergence speed, highlighting its potential benefits for autonomous cybersecurity.
☆ On the Interplay between Graph Structure and Learning Algorithms in Graph Neural Networks
This paper studies the interplay between learning algorithms and graph structure for graph neural networks (GNNs). Existing theoretical studies on the learning dynamics of GNNs primarily focus on the convergence rates of learning algorithms under the interpolation regime (noise-free) and offer only a crude connection between these dynamics and the actual graph structure (e.g., maximum degree). This paper aims to bridge this gap by investigating the excessive risk (generalization performance) of learning algorithms in GNNs within the generalization regime (with noise). Specifically, we extend the conventional settings from the learning theory literature to the context of GNNs and examine how graph structure influences the performance of learning algorithms such as stochastic gradient descent (SGD) and Ridge regression. Our study makes several key contributions toward understanding the interplay between graph structure and learning in GNNs. First, we derive the excess risk profiles of SGD and Ridge regression in GNNs and connect these profiles to the graph structure through spectral graph theory. With this established framework, we further explore how different graph structures (regular vs. power-law) impact the performance of these algorithms through comparative analysis. Additionally, we extend our analysis to multi-layer linear GNNs, revealing an increasing non-isotropic effect on the excess risk profile, thereby offering new insights into the over-smoothing issue in GNNs from the perspective of learning algorithms. Our empirical results align with our theoretical predictions, \emph{collectively showcasing a coupling relation among graph structure, GNNs and learning algorithms, and providing insights on GNN algorithm design and selection in practice.}
☆ NeRC: Neural Ranging Correction through Differentiable Moving Horizon Location Estimation
GNSS localization using everyday mobile devices is challenging in urban environments, as ranging errors caused by the complex propagation of satellite signals and low-quality onboard GNSS hardware are blamed for undermining positioning accuracy. Researchers have pinned their hopes on data-driven methods to regress such ranging errors from raw measurements. However, the grueling annotation of ranging errors impedes their pace. This paper presents a robust end-to-end Neural Ranging Correction (NeRC) framework, where localization-related metrics serve as the task objective for training the neural modules. Instead of seeking impractical ranging error labels, we train the neural network using ground-truth locations that are relatively easy to obtain. This functionality is supported by differentiable moving horizon location estimation (MHE) that handles a horizon of measurements for positioning and backpropagates the gradients for training. Even better, as a blessing of end-to-end learning, we propose a new training paradigm using Euclidean Distance Field (EDF) cost maps, which alleviates the demands on labeled locations. We evaluate the proposed NeRC on public benchmarks and our collected datasets, demonstrating its distinguished improvement in positioning accuracy. We also deploy NeRC on the edge to verify its real-time performance for mobile devices.
☆ Multi-view Graph Condensation via Tensor Decomposition
Graph Neural Networks (GNNs) have demonstrated remarkable results in various real-world applications, including drug discovery, object detection, social media analysis, recommender systems, and text classification. In contrast to their vast potential, training them on large-scale graphs presents significant computational challenges due to the resources required for their storage and processing. Graph Condensation has emerged as a promising solution to reduce these demands by learning a synthetic compact graph that preserves the essential information of the original one while maintaining the GNN's predictive performance. Despite their efficacy, current graph condensation approaches frequently rely on a computationally intensive bi-level optimization. Moreover, they fail to maintain a mapping between synthetic and original nodes, limiting the interpretability of the model's decisions. In this sense, a wide range of decomposition techniques have been applied to learn linear or multi-linear functions from graph data, offering a more transparent and less resource-intensive alternative. However, their applicability to graph condensation remains unexplored. This paper addresses this gap and proposes a novel method called Multi-view Graph Condensation via Tensor Decomposition (GCTD) to investigate to what extent such techniques can synthesize an informative smaller graph and achieve comparable downstream task performance. Extensive experiments on six real-world datasets demonstrate that GCTD effectively reduces graph size while preserving GNN performance, achieving up to a 4.0\ improvement in accuracy on three out of six datasets and competitive performance on large graphs compared to existing approaches. Our code is available at https://anonymous.4open.science/r/gctd-345A.
☆ Universal Reinforcement Learning in Coalgebras: Asynchronous Stochastic Computation via Conduction
In this paper, we introduce a categorial generalization of RL, termed universal reinforcement learning (URL), building on powerful mathematical abstractions from the study of coinduction on non-well-founded sets and universal coalgebras, topos theory, and categorial models of asynchronous parallel distributed computation. In the first half of the paper, we review the basic RL framework, illustrate the use of categories and functors in RL, showing how they lead to interesting insights. In particular, we also introduce a standard model of asynchronous distributed minimization proposed by Bertsekas and Tsitsiklis, and describe the relationship between metric coinduction and their proof of the Asynchronous Convergence Theorem. The space of algorithms for MDPs or PSRs can be modeled as a functor category, where the co-domain category forms a topos, which admits all (co)limits, possesses a subobject classifier, and has exponential objects. In the second half of the paper, we move on to universal coalgebras. Dynamical system models, such as Markov decision processes (MDPs), partially observed MDPs (POMDPs), a predictive state representation (PSRs), and linear dynamical systems (LDSs) are all special types of coalgebras. We describe a broad family of universal coalgebras, extending the dynamic system models studied previously in RL. The core problem in finding fixed points in RL to determine the exact or approximate (action) value function is generalized in URL to determining the final coalgebra asynchronously in a parallel distributed manner.
comment: 45 pages
☆ Towards Source-Free Machine Unlearning CVPR 2025
As machine learning becomes more pervasive and data privacy regulations evolve, the ability to remove private or copyrighted information from trained models is becoming an increasingly critical requirement. Existing unlearning methods often rely on the assumption of having access to the entire training dataset during the forgetting process. However, this assumption may not hold true in practical scenarios where the original training data may not be accessible, i.e., the source-free setting. To address this challenge, we focus on the source-free unlearning scenario, where an unlearning algorithm must be capable of removing specific data from a trained model without requiring access to the original training dataset. Building on recent work, we present a method that can estimate the Hessian of the unknown remaining training data, a crucial component required for efficient unlearning. Leveraging this estimation technique, our method enables efficient zero-shot unlearning while providing robust theoretical guarantees on the unlearning performance, while maintaining performance on the remaining data. Extensive experiments over a wide range of datasets verify the efficacy of our method.
comment: Accepted by CVPR 2025
♻ ☆ GenVC: Self-Supervised Zero-Shot Voice Conversion
Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody and speaker-specific traits, and making it highly effective for voice anonymization.
comment: accepted by 2025 IEEE ASRU
♻ ☆ LoSiA: Efficient High-Rank Fine-Tuning via Subnet Localization and Optimization EMNLP 2025
Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, significantly reduce the number of trainable parameters by introducing low-rank decomposition matrices. However, existing methods perform extensive matrix multiplications in domain specialization tasks, resulting in computational inefficiency and sub-optimal fine-tuning performance. Hence, we propose LoSiA(Low-Resources Subnet Integration Adaptation), an innovative method that dynamically localizes and optimizes critical parameters during the training process. Specifically, it identifies a sub-network using gradient sparsity analysis and optimizes it as the trainable target. This design enables effective high-rank adaptation by updating only the sub-network parameters, reducing the additional matrix multiplication. We also present LoSiA-Pro, a faster implementation of LoSiA, which reduces the training latency by about $27\%$ compared to LoRA. Extensive evaluations show that our method achieves minimal performance drop compared to full fine-tuning, while requiring the least training time across domain specialization and common-sense reasoning tasks. Further analysis shows that LoSiA also reduces forgetting during continued training. The source code is available at https://github.com/KlozeWang/LoSiA.
comment: Accepted to EMNLP 2025 (Main Conference); 18 pages, 12 figures
♻ ☆ Action Engine: Automatic Workflow Generation in FaaS
Function as a Service (FaaS) is poised to become the foundation of the next generation of cloud systems due to its inherent advantages in scalability, cost-efficiency, and ease of use. However, challenges such as the need for specialized knowledge, platform dependence, and difficulty in scalability in building functional workflows persist for cloud-native application developers. To overcome these challenges and mitigate the burden of developing FaaS-based applications, in this paper, we propose a mechanism called Action Engine, that makes use of tool-augmented large language models (LLMs) at its kernel to interpret human language queries and automates FaaS workflow generation, thereby, reducing the need for specialized expertise and manual design. Action Engine includes modules to identify relevant functions from the FaaS repository and seamlessly manage the data dependency between them, ensuring the developer's query is processed and resolved. Beyond that, Action Engine can execute the generated workflow by injecting the user-provided arguments. On another front, this work addresses a gap in tool-augmented LLM research via adopting an Automatic FaaS Workflow Generation perspective to systematically evaluate methodologies across four fundamental sub-processes. Through benchmarking various parameters, this research provides critical insights into streamlining workflow automation for real-world applications, specifically in the FaaS continuum. Our evaluations demonstrate that the Action Engine achieves comparable performance to the few-shot learning approach while maintaining platform- and language-agnosticism, thereby, mitigating provider-specific dependencies in workflow generation. We notice that Action Engine can unlock FaaS workflow generation for non-cloud-savvy developers and expedite the development cycles of cloud-native applications.
comment: Published in the Future Generation Computer Systems (FGCS) journal; Source code is available at: https://github.com/hpcclab/action_engine
♻ ☆ TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Satellite remote sensing enables a wide range of downstream applications, including habitat mapping, carbon accounting, and strategies for conservation and sustainable land use. However, satellite time series are voluminous and often corrupted, making them challenging to use. We present TESSERA, an open, global, land-oriented remote sensing foundation model that uses self-supervised learning to generate `ready-to-use' embeddings at 10~m scale from pixel-level satellite time-series data. TESSERA uses two encoders to combine optical data with synthetic aperture radar backscatter coefficients at 10~m resolution to create embeddings that are fused with a multilayer perceptron to create annual global embedding maps. We compare our work with state-of-the-art task-specific models and other foundation models in five diverse downstream tasks and find that TESSERA closely matches or outperforms these baselines. We believe that TESSERA's ease of use, state-of-the-art performance, openness, and computation- and labelled data-efficiency will prove transformative in a wide range of ecological applications.
♻ ☆ Fragile, Robust, and Antifragile: A Perspective from Parameter Responses in Reinforcement Learning Under Stress
This paper explores Reinforcement learning (RL) policy robustness by systematically analyzing network parameters under internal and external stresses. Inspired by synaptic plasticity in neuroscience, synaptic filtering introduces internal stress by selectively perturbing parameters, while adversarial attacks apply external stress through modified agent observations. This dual approach enables the classification of parameters as fragile, robust, or antifragile, based on their influence on policy performance in clean and adversarial settings. Parameter scores are defined to quantify these characteristics, and the framework is validated on PPO-trained agents in Mujoco continuous control environments. The results highlight the presence of antifragile parameters that enhance policy performance under stress, demonstrating the potential of targeted filtering techniques to improve RL policy adaptability. These insights provide a foundation for future advancements in the design of robust and antifragile RL systems.
comment: Withdrawn pending a review of attribution and overlap with Pravin et al., Artificial Intelligence (2024), DOI: 10.1016/j.artint.2023.104060. Further dissemination is paused while we determine appropriate next steps
♻ ☆ Towards Understanding Gradient Dynamics of the Sliced-Wasserstein Distance via Critical Point Analysis
In this paper, we investigate the properties of the Sliced Wasserstein Distance (SW) when employed as an objective functional. The SW metric has gained significant interest in the optimal transport and machine learning literature, due to its ability to capture intricate geometric properties of probability distributions while remaining computationally tractable, making it a valuable tool for various applications, including generative modeling and domain adaptation. Our study aims to provide a rigorous analysis of the critical points arising from the optimization of the SW objective. By computing explicit perturbations, we establish that stable critical points of SW cannot concentrate on segments. This stability analysis is crucial for understanding the behaviour of optimization algorithms for models trained using the SW objective. Furthermore, we investigate the properties of the SW objective, shedding light on the existence and convergence behavior of critical points. We illustrate our theoretical results through numerical experiments.
comment: 32p
♻ ☆ TASER: Table Agents for Schema-guided Extraction and Recommendation
Real-world financial documents report essential information about an entity's financial holdings that can span millions of different financial instrument types. Yet, these details are often buried in messy, multi-page, fragmented tables - for example, 99.4% of the tables in our dataset have no bounding boxes with the maximum number of rows amounting to 426 per table across 44 pages. To tackle these unique challenges from real-world tables, we present a continuously learning, agentic table extraction system, TASER (Table Agents for Schema-guided Extraction and Recommendation) that extracts highly unstructured, multi-page, heterogeneous tables into normalized, schema-conforming outputs. Our table agents execute on table detection, classification, extraction, and recommendations by leveraging an initial schema. Then, our Recommender Agent reviews the outputs, recommends schema revisions, and decides on the final recommendations, enabling TASER to outperform existing table detection models such as Table Transformer by 10.1%. Within this continuous learning process, we highlight that larger batch sizes result in a 104.3% increase in schema recommendations that are actionable and utilized, resulting in a 9.8% increase in extracted holdings - highlighting the importance of a continuous learning process. To train TASER, we have manually labeled 22,584 pages (28,150,449 tokens), 3,213 tables for $731,685,511,687 of holdings culminating in one of the first real financial table datasets. We release our dataset TASERTab to enable the research community to access real-world financial tables and outputs. Our results highlight the promise of agentic, schema-guided extraction systems for robust understanding of real-world financial tables.
comment: Withdrawn due to missing key sections in the paper
♻ ☆ Coupling without Communication and Drafter-Invariant Speculative Decoding
Suppose Alice has a distribution $P$ and Bob has a distribution $Q$. Alice wants to draw a sample $a\sim P$ and Bob a sample $b \sim Q$ such that $a = b$ with as high of probability as possible. It is well-known that, by sampling from an optimal coupling between the distributions, Alice and Bob can achieve $\Pr[a = b] = 1 - D_{TV}(P,Q)$, where $D_{TV}(P,Q)$ is the total variation distance between $P$ and $Q$. What if Alice and Bob must solve this same problem \emph{without communicating at all?} Perhaps surprisingly, with access to public randomness, they can still achieve $\Pr[a = b] \geq \frac{1 - D_{TV}(P,Q)}{1 + D_{TV}(P,Q)} \geq 1-2D_{TV}(P,Q)$ using a simple protocol based on the Weighted MinHash algorithm. This bound was shown to be optimal in the worst-case by [Bavarian et al., 2020]. In this work, we revisit the communication-free coupling problem. We provide a simpler proof of the optimality result from [Bavarian et al., 2020]. We show that, while the worst-case success probability of Weighted MinHash cannot be improved, an equally simple protocol based on Gumbel sampling offers a Pareto improvement: for every pair of distributions $P, Q$, Gumbel sampling achieves an equal or higher value of $\Pr[a = b]$ than Weighted MinHash. Importantly, this improvement translates to practice. We demonstrate an application of communication-free coupling to \emph{speculative decoding}, a recent method for accelerating autoregressive large language models [Leviathan, Kalman, Matias, ICML 2023]. We show that communication-free protocols can be used to contruct \emph{\CSD{}} schemes, which have the desirable property that their output is fixed given a fixed random seed, regardless of what drafter is used for speculation. In experiments on a language generation task, Gumbel sampling outperforms Weighted MinHash. Code is available at https://github.com/majid-daliri/DISD.
comment: 18 pages
♻ ☆ Learning to Solve Related Linear Systems
Solving multiple parametrised related systems is an essential component of many numerical tasks, and learning from the already solved systems will make this process faster. In this work, we propose a novel probabilistic linear solver over the parameter space. This leverages information from the solved linear systems in a regression setting to provide an efficient posterior mean and covariance. We advocate using this as companion regression model for the preconditioned conjugate gradient method, and discuss the favourable properties of the posterior mean and covariance as the initial guess and preconditioner. We also provide several design choices for this companion solver. Numerical experiments showcase the benefits of using our novel solver in a hyperparameter optimisation problem.
comment: Accepted at the 1st International Conference on Probabilistic Numerics (ProbNum), 2025
♻ ☆ A Novel Vascular Risk Scoring Framework for Quantifying Sex-Specific Cerebral Perfusion from 3D pCASL MRI
We present a novel framework that leverages 3D pseudo-continuous arterial spin labeling (pCASL) MRI to investigate sex- and age-dependent heterogeneity in cerebral perfusion and to establish a biologically informed vascular risk quantification metric. A custom convolutional neural network was trained on ASL-derived cerebral blood flow (CBF) maps from 186 cognitively healthy individuals (89 males and 97 females, ages 8-92 years), achieving 95% accuracy in sex classification and revealing robust sex-specific perfusion signatures. Regional analyses identified significantly elevated CBF in females across medial Brodmann areas 6 and 10, the visual area of the cortex, the polar occipital cortex, and both ventral and dorsal dysgranular insula, highlighting sex-specific neurovascular specialization in motor, cognitive, sensory, and affective domains. In addition, we observed a consistent global age-related decline in CBF across both sexes, reflecting progressive cerebrovascular aging. To integrate these findings, we propose a biologically informed Vascular Risk Score (VRS) derived from age- and sex-stratified normative CBF distributions. The VRS enables individualized assessment of cerebral perfusion integrity by quantifying deviations from expected normative patterns. This metric offers a sensitive, personalized biomarker for detecting early hypoperfusion and stratifying vascular contributions to neurodegenerative diseases, including Alzheimer's disease, thereby advancing the goals of precision neurology.
comment: 18 pages, 6 figures
♻ ☆ Low-rank bias, weight decay, and model merging in neural networks
We explore the low-rank structure of the weight matrices in neural networks at the stationary points (limiting solutions of optimization algorithms) with $L2$ regularization (also known as weight decay). We show several properties of such deep neural networks, induced by $L2$ regularization. In particular, for a stationary point we show alignment of the parameters and the gradient, norm preservation across layers, and low-rank bias: properties previously known in the context of solution of gradient descent/flow type algorithms. Experiments show that the assumptions made in the analysis only mildly affect the observations. In addition, we investigate a multitask learning phenomenon enabled by $L2$ regularization and low-rank bias. In particular, we show that if two networks are trained, such that the inputs in the training set of one network are approximately orthogonal to the inputs in the training set of the other network, the new network obtained by simply summing the weights of the two networks will perform as well on both training sets as the respective individual networks. We demonstrate this for shallow ReLU neural networks trained by gradient descent, as well as deep linear networks trained by gradient flow.
♻ ☆ Identity Preserving 3D Head Stylization with Multiview Score Distillation
3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit the https://three-bee.github.io/head_stylization for more visuals.
comment: https://three-bee.github.io/head_stylization
♻ ☆ Don't Push the Button! Exploring Data Leakage Risks in Machine Learning and Transfer Learning
Machine Learning (ML) has revolutionized various domains, offering predictive capabilities in several areas. However, with the increasing accessibility of ML tools, many practitioners, lacking deep ML expertise, adopt a "push the button" approach, utilizing user-friendly interfaces without a thorough understanding of underlying algorithms. While this approach provides convenience, it raises concerns about the reliability of outcomes, leading to challenges such as incorrect performance evaluation. This paper addresses a critical issue in ML, known as data leakage, where unintended information contaminates the training data, impacting model performance evaluation. Users, due to a lack of understanding, may inadvertently overlook crucial steps, leading to optimistic performance estimates that may not hold in real-world scenarios. The discrepancy between evaluated and actual performance on new data is a significant concern. In particular, this paper categorizes data leakage in ML, discussing how certain conditions can propagate through the ML workflow. Furthermore, it explores the connection between data leakage and the specific task being addressed, investigates its occurrence in Transfer Learning, and compares standard inductive ML with transductive ML frameworks. The conclusion summarizes key findings, emphasizing the importance of addressing data leakage for robust and reliable ML applications.
comment: Published on Artificial Intelligence Review journal. The paper is in Open Access available at https://link.springer.com/article/10.1007/s10462-025-11326-3
♻ ☆ Sample Selection Bias in Machine Learning for Healthcare
While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited, partly due to biases that can compromise the reliability of predictions. In this paper, we focus on sample selection bias (SSB), a specific type of bias where the study population is less representative of the target population, leading to biased and potentially harmful decisions. Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, the existing machine learning techniques try to correct the bias mostly by balancing distributions between the study and the target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining SSB's impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction. Specifically, we propose two independent networks(T-Net) and a multitasking network (MT-Net) for addressing SSB, where one network/task identifies the target subpopulation which is representative of the study population and the second makes predictions for the identified subpopulation. Our empirical results with synthetic and semi-synthetic datasets highlight that SSB can lead to a large drop in the performance of an algorithm for the target population as compared with the study population, as well as a substantial difference in the performance for the target subpopulations that are representative of the selected and the non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates, and selection rates, outperforming the existing bias correction techniques.
comment: Accepted to ACM Transactions on Computing for Healthcare (21 pages and 11 figures)
♻ ☆ Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
comment: UQLM repository: https://github.com/cvs-health/uqlm
♻ ☆ MAVIS: Multi-Objective Alignment via Value-Guided Inference-Time Search
Large Language Models (LLMs) are increasingly deployed across diverse applications that demand balancing multiple, often conflicting, objectives -- such as helpfulness, harmlessness, or humor. Aligning outputs to user-specific preferences in such multi-objective settings typically requires fine-tuning models for each objective or preference configuration, which is computationally expensive and inflexible. We introduce MAVIS -- Multi-Objective Alignment via Value-Guided Inference-Time Search -- a lightweight inference-time alignment framework that enables dynamic control over LLM behavior without modifying the base model's weights. MAVIS trains a set of small value models, each corresponding to a distinct objective. At inference time, these value models are combined using user-specified weights to produce a tilting function that adjusts the base model's output distribution toward desired trade-offs. The value models are trained using a simple iterative algorithm that ensures monotonic improvement of the KL-regularized policy. We show empirically that MAVIS outperforms baselines that fine-tune per-objective models and combine them post hoc, and even approaches the performance of the idealized setting where models are fine-tuned for a user's exact preferences.
comment: 20 pages, 6 figures
♻ ☆ Behind the Myth of Exploration in Policy Gradients
In order to compute near-optimal policies with policy-gradient algorithms, it is common in practice to include intrinsic exploration terms in the learning objective. Although the effectiveness of these terms is usually justified by an intrinsic need to explore environments, we propose a novel analysis with the lens of numerical optimization. Two criteria are introduced on the learning objective and two others on its stochastic gradient estimates, and are afterwards used to discuss the quality of the policy after optimization. The analysis sheds light on two separate effects of exploration techniques. First, they make it possible to smooth the learning objective and to eliminate local optima while preserving the global maximum. Second, they modify the gradient estimates, increasing the probability that the stochastic parameter updates eventually provide an optimal policy. We empirically illustrate these effects with exploration strategies based on entropy bonuses, identifying limitations and suggesting directions for future work.
♻ ☆ Testing Components of the Attention Schema Theory in Artificial Neural Networks
Growing evidence suggests that the brain uses an attention schema, or a simplified model of attention, to help control what it attends to. One proposed benefit of this model is to allow agents to model the attention states of other agents, and thus predict and interact with other agents. The effects of an attention schema may be examined in artificial agents. Although attention mechanisms in artificial agents are different from in biological brains, there may be some principles in common. In both cases, select features or representations are emphasized for better performance. Here, using neural networks with transformer attention mechanisms, we asked whether the addition of an attention schema affected the ability of agents to make judgements about and cooperate with each other. First, we found that an agent with an attention schema is better at categorizing the attention states of other agents (higher accuracy). Second, an agent with an attention schema develops a pattern of attention that is easier for other agents to categorize. Third, in a joint task where two agents must predict each other to paint a scene together, adding an attention schema improves performance. Finally, the performance improvements are not caused by a general increase in network complexity. Instead, improvement is specific to tasks involving judging, categorizing, or predicting the attention of other agents. These results support the hypothesis that an attention schema has computational properties beneficial to mutual interpretability and interactive behavior. We speculate that the same principles might pertain to biological attention and attention schemas in people.
♻ ☆ Hybrid-Hierarchical Fashion Graph Attention Network for Compatibility-Oriented and Personalized Outfit Recommendation
The rapid expansion of the fashion industry and the growing variety of products have made it increasingly challenging for users to identify compatible items on e-commerce platforms. Effective fashion recommendation systems are therefore crucial for filtering irrelevant options and suggesting suitable ones. However, simultaneously addressing outfit compatibility and personalized recommendations remains a significant challenge, as these aspects are typically treated independently in existing studies, thereby overlooking the complex interactions between items and user preferences. This research introduces a new framework named FGAT, which leverages a hierarchical graph representation together with graph attention mechanisms to address this problem. The framework constructs a three-tier graph of users, outfits, and items, integrating visual and textual features to jointly model outfit compatibility and user preferences. By dynamically weighting node importance during representation propagation, the graph attention mechanism captures key interactions and produces precise embeddings for both user preferences and outfit compatibility. Evaluated on the POG dataset, FGAT outperforms strong baselines such as HFGN, achieving notable improvements in accuracy, precision, HR, recall, and NDCG. These results demonstrate that combining multimodal visual and textual features with a hierarchical graph structure and attention mechanisms significantly enhances the effectiveness and efficiency of personalized fashion recommendation systems.
comment: The corresponding author: Babak Teimourpour
♻ ☆ Multi-scale species richness estimation with deep learning
Biodiversity assessments are critically affected by the spatial scale at which species richness is measured. How species richness accumulates with sampling area depends on natural and anthropogenic processes whose effects can change depending on the spatial scale considered. These accumulation dynamics, described by the species-area relationship (SAR), are challenging to assess because most biodiversity surveys are restricted to sampling areas much smaller than the scales at which these processes operate. Here, we combine sampling theory and deep learning to predict local species richness within arbitrarily large sampling areas, enabling for the first time to estimate spatial differences in SARs. We demonstrate our approach by predicting vascular plant species richness across Europe and evaluate predictions against an independent dataset of plant community inventories. The resulting model, named deep SAR, delivers multi-scale species richness maps, improving coarse grain richness estimates by 32% compared to conventional methods, while delivering finer grain estimates. Additional to its predictive capabilities, we show how our deep SAR model can provide fundamental insights on the multi-scale effects of key biodiversity processes. The capacity of our approach to deliver comprehensive species richness estimates across the full spectrum of ecologically relevant scales is essential for robust biodiversity assessments and forecasts under global change.
comment: 36 pages
♻ ☆ MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL
Graph Few-Shot Class-Incremental Learning (GFSCIL) enables models to continually learn from limited samples of novel tasks after initial training on a large base dataset. Existing GFSCIL approaches typically utilize Prototypical Networks (PNs) for metric-based class representations and fine-tune the model during the incremental learning stage. However, these PN-based methods oversimplify learning via novel query set fine-tuning and fail to integrate Graph Continual Learning (GCL) techniques due to architectural constraints. To address these challenges, we propose a more rigorous and practical setting for GFSCIL that excludes query sets during the incremental training phase. Building on this foundation, we introduce Model-Agnostic Meta Graph Continual Learning (MEGA), aimed at effectively alleviating catastrophic forgetting for GFSCIL. Specifically, by calculating the incremental second-order gradient during the meta-training stage, we endow the model to learn high-quality priors that enhance incremental learning by aligning its behaviors across both the meta-training and incremental learning stages. Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL. We believe that our proposed MEGA serves as a model-agnostic GFSCIL paradigm, paving the way for future research.
comment: Under Review
♻ ☆ Ranking by Lifts: A Cost-Benefit Approach to Large-Scale A/B Tests
A/B testing is a core tool for decision-making in business experimentation, particularly in digital platforms and marketplaces. Practitioners often prioritize lift in performance metrics while seeking to control the costs of false discoveries. This paper develops a decision-theoretic framework for maximizing expected profit subject to a constraint on the cost-weighted false discovery rate (FDR). We propose an empirical Bayes approach that uses a greedy knapsack algorithm to rank experiments based on the ratio of expected lift to cost, incorporating the local false discovery rate (lfdr) as a key statistic. The resulting oracle rule is valid and rank-optimal. In large-scale settings, we establish the asymptotic validity of a data-driven implementation and demonstrate superior finite-sample performance over existing FDR-controlling methods. An application to A/B tests run on the Optimizely platform highlights the business value of the approach.
comment: Updated
♻ ☆ Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
comment: 27 pages, 7 figures, Nature Format
♻ ☆ Redundant feature screening method for human activity recognition based on attention purification mechanism
In the field of sensor-based Human Activity Recognition (HAR), deep neural networks provide advanced technical support. Many studies have proven that recognition accuracy can be improved by increasing the depth or width of the network. However, for wearable devices, the balance between network performance and resource consumption is crucial. With minimum resource consumption as the basic principle, we propose a universal attention feature purification mechanism, called MSAP, which is suitable for multi-scale networks. The mechanism effectively solves the feature redundancy caused by the superposition of multi-scale features by means of inter-scale attention screening and connection method. In addition, we have designed a network correction module that integrates seamlessly between layers of individual network modules to mitigate inherent problems in deep networks. We also built an embedded deployment system that is in line with the current level of wearable technology to test the practical feasibility of the HAR model, and further prove the efficiency of the method. Extensive experiments on four public datasets show that the proposed method model effectively reduces redundant features in filtered data and provides excellent performance with little resource consumption.
comment: 12 pages,7 figures
♻ ☆ Learnable Kernel Density Estimation for Graphs
This work proposes a framework LGKDE that learns kernel density estimation for graphs. The key challenge in graph density estimation lies in effectively capturing both structural patterns and semantic variations while maintaining theoretical guarantees. Combining graph kernels and kernel density estimation (KDE) is a standard approach to graph density estimation, but has unsatisfactory performance due to the handcrafted and fixed features of kernels. Our method LGKDE leverages graph neural networks to represent each graph as a discrete distribution and utilizes maximum mean discrepancy to learn the graph metric for multi-scale KDE, where all parameters are learned by maximizing the density of graphs relative to the density of their well-designed perturbed counterparts. The perturbations are conducted on both node features and graph spectra, which helps better characterize the boundary of normal density regions. Theoretically, we establish consistency and convergence guarantees for LGKDE, including bounds on the mean integrated squared error, robustness, and complexity. We validate LGKDE by demonstrating its effectiveness in recovering the underlying density of synthetic graph distributions and applying it to graph anomaly detection across diverse benchmark datasets. Extensive empirical evaluation shows that LGKDE demonstrates superior performance compared to state-of-the-art baselines on most benchmark datasets.
comment: Under Review
♻ ☆ Comparison of parallel SMC and MCMC for Bayesian deep learning
This work systematically compares parallel implementations of consistent (asymptotically unbiased) Bayesian deep learning algorithms: sequential Monte Carlo sampler (SMC$_\parallel$) or Markov chain Monte Carlo (MCMC$_\parallel$). We provide a proof of convergence for SMC$_\parallel$ showing that it theoretically achieves the same level of convergence as a single monolithic SMC sampler, while the reduced communication lowers wall-clock time. It is well-known that the first samples from MCMC need to be discarded to eliminate initialization bias, and that the number of discarded samples must grow like the logarithm of the number of parallel chains to control that bias for MCMC$_\parallel$. A systematic empirical numerical study on MNIST, CIFAR, and IMDb, reveals that parallel implementations of both methods perform comparably to non-parallel implementations in terms of performance and total cost, and also comparably to each other. However, both methods still require a large wall-clock time, and suffer from catastrophic non-convergence if they aren't run for long enough.
comment: 34 pages, 16 figures, 14 tables
♻ ☆ Towards the Use of Saliency Maps for Explaining Low-Quality Electrocardiograms to End Users ICML 2022
When using medical images for diagnosis, either by clinicians or artificial intelligence (AI) systems, it is important that the images are of high quality. When an image is of low quality, the medical exam that produced the image often needs to be redone. In telemedicine, a common problem is that the quality issue is only flagged once the patient has left the clinic, meaning they must return in order to have the exam redone. This can be especially difficult for people living in remote regions, who make up a substantial portion of the patients at Portal Telemedicina, a digital healthcare organization based in Brazil. In this paper, we report on ongoing work regarding (i) the development of an AI system for flagging and explaining low-quality medical images in real-time, (ii) an interview study to understand the explanation needs of stakeholders using the AI system at OurCompany, and, (iii) a longitudinal user study design to examine the effect of including explanations on the workflow of the technicians in our clinics. To the best of our knowledge, this would be the first longitudinal study on evaluating the effects of XAI methods on end-users -- stakeholders that use AI systems but do not have AI-specific expertise. We welcome feedback and suggestions on our experimental setup.
comment: Accepted to ICML 2022 Workshop on Interpretable ML in Healthcare
♻ ☆ MetaWild: A Multimodal Dataset for Animal Re-Identification with Environmental Metadata
Identifying individual animals within large wildlife populations is essential for effective wildlife monitoring and conservation efforts. Recent advancements in computer vision have shown promise in animal re-identification (Animal ReID) by leveraging data from camera traps. However, existing Animal ReID datasets rely exclusively on visual data, overlooking environmental metadata that ecologists have identified as highly correlated with animal behavior and identity, such as temperature and circadian rhythms. Moreover, the emergence of multimodal models capable of jointly processing visual and textual data presents new opportunities for Animal ReID, but existing datasets fail to leverage these models' text-processing capabilities, limiting their full potential. Additionally, to facilitate the use of metadata in existing ReID methods, we propose the Meta-Feature Adapter (MFA), a lightweight module that can be incorporated into existing vision-language model (VLM)-based Animal ReID methods, allowing ReID models to leverage both environmental metadata and visual information to improve ReID performance. Experiments on MetaWild show that combining baseline ReID models with MFA to incorporate metadata consistently improves performance compared to using visual information alone, validating the effectiveness of incorporating metadata in re-identification. We hope that our proposed dataset can inspire further exploration of multimodal approaches for Animal ReID.
comment: 7 pages, 6 figures
♻ ☆ STEM: Efficient Relative Capability Evaluation of LLMs through Structured Transition Samples AAAI 2026
Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. While recent models often achieve higher scores on standard benchmarks, these improvements do not consistently reflect enhanced real-world reasoning capabilities. Moreover, widespread overfitting to public benchmarks and the high computational cost of full evaluations have made it both expensive and less effective to distinguish meaningful differences between models. To address these challenges, we propose the \textbf{S}tructured \textbf{T}ransition \textbf{E}valuation \textbf{M}ethod (STEM), a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs. STEM identifies \textit{significant transition samples} (STS) by analyzing consistent performance transitions among LLMs of the same architecture but varying parameter scales. These samples enable STEM to effectively estimate the capability position of an unknown model. Qwen3 model family is applied to construct the STS pool on six diverse and representative benchmarks. To assess generalizability. Experimental results indicate that STEM reliably captures performance trends, aligns with ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs.
comment: Submit to AAAI 2026
♻ ☆ Generalizable Spectral Embedding with an Application to UMAP
Spectral Embedding (SE) is a popular method for dimensionality reduction, applicable across diverse domains. Nevertheless, its current implementations face three prominent drawbacks which curtail its broader applicability: generalizability (i.e., out-of-sample extension), scalability, and eigenvectors separation. Existing SE implementations often address two of these drawbacks; however, they fall short in addressing the remaining one. In this paper, we introduce Sep-SpectralNet (eigenvector-separated SpectralNet), a SE implementation designed to address all three limitations. Sep-SpectralNet extends SpectralNet with an efficient post-processing step to achieve eigenvectors separation, while ensuring both generalizability and scalability. This method expands the applicability of SE to a wider range of tasks and can enhance its performance in existing applications. We empirically demonstrate Sep-SpectralNet's ability to consistently approximate and generalize SE, while maintaining SpectralNet's scalability. Additionally, we show how Sep-SpectralNet can be leveraged to enable generalizable UMAP visualization. Our codes are publicly available.
♻ ☆ Causal Mechanism Estimation in Multi-Sensor Systems Across Multiple Domains
To gain deeper insights into a complex sensor system through the lens of causality, we present common and individual causal mechanism estimation (CICME), a novel three-step approach to inferring causal mechanisms from heterogeneous data collected across multiple domains. By leveraging the principle of Causal Transfer Learning (CTL), CICME is able to reliably detect domain-invariant causal mechanisms when provided with sufficient samples. The identified common causal mechanisms are further used to guide the estimation of the remaining causal mechanisms in each domain individually. The performance of CICME is evaluated on linear Gaussian models under scenarios inspired from a manufacturing process. Building upon existing continuous optimization-based causal discovery methods, we show that CICME leverages the benefits of applying causal discovery on the pooled data and repeatedly on data from individual domains, and it even outperforms both baseline methods under certain scenarios.
comment: To appear in 2025 28th International Conference on Information Fusion (FUSION)
♻ ☆ Poisson Midpoint Method for Log Concave Sampling: Beyond the Strong Error Lower Bounds
We study the problem of sampling from strongly log-concave distributions over $\mathbb{R}^d$ using the Poisson midpoint discretization (a variant of the randomized midpoint method) for overdamped/underdamped Langevin dynamics. We prove its convergence in the 2-Wasserstein distance ($W_2$), achieving a cubic speedup in dependence on the target accuracy ($\epsilon$) over the Euler-Maruyama discretization, surpassing existing bounds for randomized midpoint methods. Notably, in the case of underdamped Langevin dynamics, we demonstrate the complexity of $W_2$ convergence is much smaller than the complexity lower bounds for convergence in $L^2$ strong error established in the literature.
♻ ☆ Near Optimal Non-asymptotic Sample Complexity of 1-Identification
Motivated by an open direction in existing literature, we study the 1-identification problem, a fundamental multi-armed bandit formulation on pure exploration. The goal is to determine whether there exists an arm whose mean reward is at least a known threshold $\mu_0$, or to output None if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least $1-\delta$. Degenne & Koolen 2019 has established the asymptotically tight sample complexity for the 1-identification problem, but they commented that the non-asymptotic analysis remains unclear. We design a new algorithm Sequential-Exploration-Exploitation (SEE), and conduct theoretical analysis from the non-asymptotic perspective. Novel to the literature, we achieve near optimality, in the sense of matching upper and lower bounds on the pulling complexity. The gap between the upper and lower bounds is up to a polynomial logarithmic factor. The numerical result also indicates the effectiveness of our algorithm, compared to existing benchmarks.
♻ ☆ Evaluating Autoencoders for Parametric and Invertible Multidimensional Projections
Recently, neural networks have gained attention for creating parametric and invertible multidimensional data projections. Parametric projections allow for embedding previously unseen data without recomputing the projection as a whole, while invertible projections enable the generation of new data points. However, these properties have never been explored simultaneously for arbitrary projection methods. We evaluate three autoencoder (AE) architectures for creating parametric and invertible projections. Based on a given projection, we train AEs to learn a mapping into 2D space and an inverse mapping into the original space. We perform a quantitative and qualitative comparison on four datasets of varying dimensionality and pattern complexity using t-SNE. Our results indicate that AEs with a customized loss function can create smoother parametric and inverse projections than feed-forward neural networks while giving users control over the strength of the smoothing effect.
comment: 6 pages, 5 figures, 2 tables, LaTeX; fixed typos, added DOI
♻ ☆ AFLoRA: Adaptive Federated Fine-Tuning of Large Language Models with Resource-Aware Low-Rank Adaption
Federated fine-tuning has emerged as a promising approach to adapt foundation models to downstream tasks using decentralized data. However, real-world deployment remains challenging due to the high computational and communication demands of fine-tuning Large Language Models (LLMs) on clients with data and system resources that are heterogeneous and constrained. In such settings, the global model's performance is often bottlenecked by the weakest clients and further degraded by the non-IID nature of local data. Although existing methods leverage parameter-efficient techniques such as Low-Rank Adaptation (LoRA) to reduce communication and computation overhead, they often fail to simultaneously ensure accurate aggregation of low-rank updates and maintain low system costs, thereby hindering overall performance. To address these challenges, we propose AFLoRA, an adaptive and lightweight federated fine-tuning framework for LLMs. AFLoRA decouples shared and client-specific updates to reduce overhead and improve aggregation accuracy, incorporates diagonal matrix-based rank pruning to better utilize local resources, and employs rank-aware aggregation with public data refinement to strengthen generalization under data heterogeneity. Extensive experiments demonstrate that AFLoRA outperforms state-of-the-art methods in both accuracy and efficiency, providing a practical solution for efficient LLM adaptation in heterogeneous environments in the real world.
♻ ☆ TolerantECG: A Foundation Model for Imperfect Electrocardiogram ACM MM 2025
The electrocardiogram (ECG) is an essential and effective tool for diagnosing heart diseases. However, its effectiveness can be compromised by noise or unavailability of one or more leads of the standard 12-lead recordings, resulting in diagnostic errors or uncertainty. To address these challenges, we propose TolerantECG, a foundation model for ECG signals that is robust to noise and capable of functioning with arbitrary subsets of the standard 12-lead ECG. TolerantECG training combines contrastive and self-supervised learning frameworks to jointly learn ECG signal representations alongside their corresponding knowledge-retrieval-based text report descriptions and corrupted or lead-missing signals. Comprehensive benchmarking results demonstrate that TolerantECG consistently ranks as the best or second-best performer across various ECG signal conditions and class levels in the PTB-XL dataset, and achieves the highest performance on the MIT-BIH Arrhythmia Database.
comment: Accepted at ACM MM 2025
♻ ☆ Benchmarking Pre-Trained Time Series Models for Electricity Price Forecasting
Accurate electricity price forecasting (EPF) is crucial for effective decision-making in power trading on the spot market. While recent advances in generative artificial intelligence (GenAI) and pre-trained large language models (LLMs) have inspired the development of numerous time series foundation models (TSFMs) for time series forecasting, their effectiveness in EPF remains uncertain. To address this gap, we benchmark several state-of-the-art pretrained models--Chronos-Bolt, Chronos-T5, TimesFM, Moirai, Time-MoE, and TimeGPT--against established statistical and machine learning (ML) methods for EPF. Using 2024 day-ahead auction (DAA) electricity prices from Germany, France, the Netherlands, Austria, and Belgium, we generate daily forecasts with a one-day horizon. Chronos-Bolt and Time-MoE emerge as the strongest among the TSFMs, performing on par with traditional models. However, the biseasonal MSTL model, which captures daily and weekly seasonality, stands out for its consistent performance across countries and evaluation metrics, with no TSFM statistically outperforming it.
♻ ☆ Improving Actor-Critic Training with Steerable Action-Value Approximation Errors
Off-policy actor-critic algorithms have shown strong potential in deep reinforcement learning for continuous control tasks. Their success primarily comes from leveraging pessimistic state-action value function updates, which reduce function approximation errors and stabilize learning. However, excessive pessimism can limit exploration, preventing the agent from effectively refining its policies. Conversely, optimism can encourage exploration but may lead to high-risk behaviors and unstable learning if not carefully managed. To address this trade-off, we propose Utility Soft Actor-Critic (USAC), a novel framework that allows independent, interpretable control of pessimism and optimism for both the actor and the critic. USAC dynamically adapts its exploration strategy based on the uncertainty of critics using a utility function, enabling a task-specific balance between optimism and pessimism. This approach goes beyond binary choices of pessimism or optimism, making the method both theoretically meaningful and practically feasible. Experiments across a variety of continuous control tasks show that adjusting the degree of pessimism or optimism significantly impacts performance. When configured appropriately, USAC consistently outperforms state-of-the-art algorithms, demonstrating its practical utility and feasibility.
♻ ☆ Deep Exploration with PAC-Bayes ECAI
Reinforcement learning (RL) for continuous control under delayed rewards is an under-explored problem despite its significance in real-world applications. Many complex skills are based on intermediate ones as prerequisites. For instance, a humanoid locomotor must learn how to stand before it can learn to walk. To cope with delayed reward, an agent must perform deep exploration. However, existing deep exploration methods are designed for small discrete action spaces, and their generalization to state-of-the-art continuous control remains unproven. We address the deep exploration problem for the first time from a PAC-Bayesian perspective in the context of actor-critic learning. To do this, we quantify the error of the Bellman operator through a PAC-Bayes bound, where a bootstrapped ensemble of critic networks represents the posterior distribution, and their targets serve as a data-informed function-space prior. We derive an objective function from this bound and use it to train the critic ensemble. Each critic trains an individual soft actor network, implemented as a shared trunk and critic-specific heads. The agent performs deep exploration by acting epsilon-softly on a randomly chosen actor head. Our proposed algorithm, named {\it PAC-Bayesian Actor-Critic (PBAC)}, is the only algorithm to consistently discover delayed rewards on continuous control tasks with varying difficulty.
comment: ECAI camera-ready version; fixed acknowledgements; fixed github reference
♻ ☆ No Metric to Rule Them All: Toward Principled Evaluations of Graph-Learning Datasets ICML 2025
Benchmark datasets have proved pivotal to the success of graph learning, and good benchmark datasets are crucial to guide the development of the field. Recent research has highlighted problems with graph-learning datasets and benchmarking practices -- revealing, for example, that methods which ignore the graph structure can outperform graph-based approaches. Such findings raise two questions: (1) What makes a good graph-learning dataset, and (2) how can we evaluate dataset quality in graph learning? Our work addresses these questions. As the classic evaluation setup uses datasets to evaluate models, it does not apply to dataset evaluation. Hence, we start from first principles. Observing that graph-learning datasets uniquely combine two modes -- graph structure and node features --, we introduce Rings, a flexible and extensible mode-perturbation framework to assess the quality of graph-learning datasets based on dataset ablations -- i.e., quantifying differences between the original dataset and its perturbed representations. Within this framework, we propose two measures -- performance separability and mode complementarity -- as evaluation tools, each assessing the capacity of a graph dataset to benchmark the power and efficacy of graph-learning methods from a distinct angle. We demonstrate the utility of our framework for dataset evaluation via extensive experiments on graph-level tasks and derive actionable recommendations for improving the evaluation of graph-learning methods. Our work opens new research directions in data-centric graph learning, and it constitutes a step toward the systematic evaluation of evaluations.
comment: Accepted at ICML 2025
♻ ☆ LLM4FS: Leveraging Large Language Models for Feature Selection
Recent advances in large language models (LLMs) have provided new opportunities for decision-making, particularly in the task of automated feature selection. In this paper, we first comprehensively evaluate LLM-based feature selection methods, covering the state-of-the-art DeepSeek-R1, GPT-o3-mini, and GPT-4.5. Then, we propose a new hybrid strategy called LLM4FS that integrates LLMs with traditional data-driven methods. Specifically, input data samples into LLMs, and directly call traditional data-driven techniques such as random forest and forward sequential selection. Notably, our analysis reveals that the hybrid strategy leverages the contextual understanding of LLMs and the high statistical reliability of traditional data-driven methods to achieve excellent feature selection performance, even surpassing LLMs and traditional data-driven methods. Finally, we point out the limitations of its application in decision-making. Our code is available at https://github.com/xianchaoxiu/LLM4FS.
comment: CAC 2025
♻ ☆ DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy
Recently, autoencoders (AEs) have gained interest for creating parametric and invertible projections of multidimensional data. Parametric projections make it possible to embed new, unseen samples without recalculating the entire projection, while invertible projections allow the synthesis of new data instances. However, existing methods perform poorly when dealing with out-of-distribution samples in either the data or embedding space. Thus, we propose DE-VAE, an uncertainty-aware variational AE using differential entropy (DE) to improve the learned parametric and invertible projections. Given a fixed projection, we train DE-VAE to learn a mapping into 2D space and an inverse mapping back to the original space. We conduct quantitative and qualitative evaluations on four well-known datasets, using UMAP and t-SNE as baseline projection methods. Our findings show that DE-VAE can create parametric and inverse projections with comparable accuracy to other current AE-based approaches while enabling the analysis of embedding uncertainty.
comment: 5 pages, 3 figures, LaTeX; too appear at the 2025 IEEE Workshop on Uncertainty Visualization
♻ ☆ A Comprehensive Benchmark on Spectral GNNs: The Impact on Efficiency, Memory, and Effectiveness
With recent advancements in graph neural networks (GNNs), spectral GNNs have received increasing popularity by virtue of their ability to retrieve graph signals in the spectral domain. These models feature uniqueness in efficient computation as well as rich expressiveness, which stems from advanced management and profound understanding of graph data. However, few systematic studies have been conducted to assess spectral GNNs, particularly in benchmarking their efficiency, memory consumption, and effectiveness in a unified and fair manner. There is also a pressing need to select spectral models suitable for learning specific graph data and deploying them to massive web-scale graphs, which is currently constrained by the varied model designs and training settings. In this work, we extensively benchmark spectral GNNs with a focus on the spectral perspective, demystifying them as spectral graph filters. We analyze and categorize 35 GNNs with 27 corresponding filters, spanning diverse formulations and utilizations of the graph data. Then, we implement the filters within a unified spectral-oriented framework with dedicated graph computations and efficient training schemes. In particular, our implementation enables the deployment of spectral GNNs over million-scale graphs and various tasks with comparable performance and less overhead. Thorough experiments are conducted on the graph filters with comprehensive metrics on effectiveness and efficiency, offering novel observations and practical guidelines that are only available from our evaluations across graph scales. Different from the prevailing belief, our benchmark reveals an intricate landscape regarding the effectiveness and efficiency of spectral graph filters, demonstrating the potential to achieve desirable performance through tailored spectral manipulation of graph data.
comment: Technical Report with Appendix. Our code and evaluation is available at: https://github.com/gdmnl/Spectral-GNN-Benchmark
♻ ☆ Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
We study how gating mechanisms in recurrent neural networks (RNNs) implicitly induce adaptive learning-rate behavior, even when training is carried out with a fixed, global learning rate. This effect arises from the coupling between state-space time scales--parametrized by the gates--and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs, we obtain a first-order expansion that makes explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates not only control information flow, but also act as data-driven preconditioners that adapt optimization trajectories in parameter space. We further draw formal analogies with learning-rate schedules, momentum, and adaptive methods such as Adam, pointing to possible redundancies. Empirical simulations corroborate these claims: in canonical synthetic sequence tasks (adding, copy) we show that gates induce lag-dependent effective learning rates and directional concentration of gradient flow, with multi-gate models matching or exceeding the anisotropic structure produced by Adam. These results highlight that optimizer-driven and gate-driven adaptivity are complementary but not equivalent mechanisms. Overall, this work provides a unified dynamical-systems perspective on how gating couples state evolution with parameter updates, explaining why gated architectures achieve robust trainability and stability in practice.
♻ ☆ Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning
Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.
♻ ☆ Input Time Scaling
Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data & training scaling) and doing reasoning in test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also find a new phenomenon, training-testing co-design there. We need to apply query strategies during both training and testing. Only applying strategies on training or testing would seriously degrade the performance. We are also surprised to find that seemingly low data quality datasets can gain high performance. Adding irrelevant information to the queries, randomly selecting examples from a minimally filtered dataset, can even perform the best. These findings contradict the widely held inductive bias, "garbage in, garbage out". Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data with similar quality (15k VS 1k) perform worse, simple dataset size scaling should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon. A small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct, we are able to reach SOTA performance among 32B models on AIME24(76.7%) and AIME25(76.7%) pass@1. We can further achieve AIME24(76.7%) and AIME25(80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result would be 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-source our datasets, data pipelines, evaluation results, and checkpoints.
♻ ☆ SketchDNN: Joint Continuous-Discrete Diffusion for CAD Sketch Generation ICML2025
We present SketchDNN, a generative model for synthesizing CAD sketches that jointly models both continuous parameters and discrete class labels through a unified continuous-discrete diffusion process. Our core innovation is Gaussian-Softmax diffusion, where logits perturbed with Gaussian noise are projected onto the probability simplex via a softmax transformation, facilitating blended class labels for discrete variables. This formulation addresses 2 key challenges, namely, the heterogeneity of primitive parameterizations and the permutation invariance of primitives in CAD sketches. Our approach significantly improves generation quality, reducing Fr\'echet Inception Distance (FID) from 16.04 to 7.80 and negative log-likelihood (NLL) from 84.8 to 81.33, establishing a new state-of-the-art in CAD sketch generation on the SketchGraphs dataset.
comment: 17 pages, 63 figures, Proceedings of the 42nd International Conference on Machine Learning (ICML2025)
♻ ☆ PathGPT: Reframing Path Recommendation as a Natural Language Generation Task with Retrieval-Augmented Language Models
Path recommendation (PR) aims to generate travel paths that are customized to a user's specific preferences and constraints. Conventional approaches often employ explicit optimization objectives or specialized machine learning architectures; however, these methods typically exhibit limited flexibility and generalizability, necessitating costly retraining to accommodate new scenarios. This paper introduces an alternative paradigm that conceptualizes PR as a natural language generation task. We present PathGPT, a retrieval-augmented large language model (LLM) system that leverages historical trajectory data and natural language user constraints to generate plausible paths. The proposed methodology first converts raw trajectory data into a human-interpretable textual format, which is then stored in a database. Subsequently, a hybrid retrieval system extracts path-specific context from this database to inform a pretrained LLM. The primary contribution of this work is a novel framework that demonstrates how integrating established information retrieval and generative model components can enable adaptive, zero-shot path generation across diverse scenarios. Extensive experiments on large-scale trajectory datasets indicate that PathGPT's performance is competitive with specialized, learning-based methods, underscoring its potential as a flexible and generalizable path generation system that avoids the need for retraining inherent in previous data-driven models.
♻ ☆ LaViPlan : Language-Guided Visual Path Planning with RLVR ICCV 2025
Out-of-distribution (OOD) scenarios in autonomous driving pose critical challenges, as planners often fail to generalize beyond their training experience, leading to unsafe or unexpected behavior. Vision-Language Models (VLMs) have shown promise in handling such scenarios by providing high-level scene understanding and user-aligned decisions. However, existing VLMs often exhibit a misalignment between their language-based reasoning and the low-level trajectories required for action-level planning. In this paper, we propose LaViPlan, a framework that leverages Reinforcement Learning with Verifiable Rewards (RLVR) to fine-tune VLMs using planning-oriented metrics. Experimental results show that LaViPlan improves planning performance across both in-domain and out-of-domain datasets. While linguistic fidelity slightly decreases after RLVR-based fine-tuning, qualitative evaluation indicates that the outputs remain coherent. We also conduct ablation studies to analyze the effects of sampling ratio and reasoning guidance, highlighting how these design choices influence performance. These findings demonstrate the potential of RLVR as a post-training paradigm for aligning language-guided reasoning with action-level planning in autonomous driving.
comment: Accepted to the 2nd ICCV 2025 Workshop on the Challenge of Out-of-Label Hazards in Autonomous Driving (13 pages, 6 figures)
♻ ☆ Bi-directional Model Cascading with Proxy Confidence
Model Cascading, recently applied successfully to LLMs, is a simple but powerful technique that improves the efficiency of inference by selectively applying models of varying sizes. Models are used in sequence from smallest to largest, only deferring samples to large, costly models when smaller models are not sufficiently confident. Existing approaches to deferral use only limited small model confidence estimates because of the inaccessibility of the large model, although large model confidence is known to be important. We therefore propose a bi-directional approach to deferral that considers the confidence of small and large models in the cascade simultaneously through the use of a proxy for the large model. This requires a richer representation of model confidence to enable comparative calibration: we use an analysis of hidden states to improve post-invocation confidence of the small model, which in itself improves cascading results over prior approaches. We then combine this with a tiny proxy model to estimate pre-invocation confidence of the large model. We examine the proposed cascading system over challenging, multiple-choice datasets, finding improvements over standard cascading baselines reflected in reductions in deferrals to more costly models.
♻ ☆ Extending Foundational Monocular Depth Estimators to Fisheye Cameras with Calibration Tokens
We propose a method to extend foundational monocular depth estimators (FMDEs), trained on perspective images, to fisheye images. Despite being trained on tens of millions of images, FMDEs are susceptible to the covariate shift introduced by changes in camera calibration (intrinsic, distortion) parameters, leading to erroneous depth estimates. Our method aligns the distribution of latent embeddings encoding fisheye images to those of perspective images, enabling the reuse of FMDEs for fisheye cameras without retraining or finetuning. To this end, we introduce a set of Calibration Tokens as a light-weight adaptation mechanism that modulates the latent embeddings for alignment. By exploiting the already expressive latent space of FMDEs, we posit that modulating their embeddings avoids the negative impact of artifacts and loss introduced in conventional recalibration or map projection to a canonical reference frame in the image space. Our method is self-supervised and does not require fisheye images but leverages publicly available large-scale perspective image datasets. This is done by recalibrating perspective images to fisheye images, and enforcing consistency between their estimates during training. We evaluate our approach with several FMDEs, on both indoors and outdoors, where we consistently improve over state-of-the-art methods using a single set of tokens for both. Code available at: https://github.com/JungHeeKim29/calibration-token.
♻ ☆ ETA: Energy-based Test-time Adaptation for Depth Completion
We propose a method for test-time adaptation of pretrained depth completion models. Depth completion models, trained on some ``source'' data, often predict erroneous outputs when transferred to ``target'' data captured in novel environmental conditions due to a covariate shift. The crux of our method lies in quantifying the likelihood of depth predictions belonging to the source data distribution. The challenge is in the lack of access to out-of-distribution (target) data prior to deployment. Hence, rather than making assumptions regarding the target distribution, we utilize adversarial perturbations as a mechanism to explore the data space. This enables us to train an energy model that scores local regions of depth predictions as in- or out-of-distribution. We update the parameters of pretrained depth completion models at test time to minimize energy, effectively aligning test-time predictions to those of the source distribution. We call our method ``Energy-based Test-time Adaptation'', or ETA for short. We evaluate our method across three indoor and three outdoor datasets, where ETA improve over the previous state-of-the-art method by an average of 6.94% for outdoors and 10.23% for indoors. Project Page: https://fuzzythecat.github.io/eta.
♻ ☆ One-Layer Transformers are Provably Optimal for In-context Reasoning and Distributional Association Learning in Next-Token Prediction Tasks
We study the approximation capabilities and on-convergence behaviors of one-layer transformers on the noiseless and noisy in-context reasoning of next-token prediction. Existing theoretical results focus on understanding the in-context reasoning behaviors for either the first gradient step or when the number of samples is infinite. Furthermore, no convergence rates nor generalization abilities were known. Our work addresses these gaps by showing that there exists a class of one-layer transformers that are provably Bayes-optimal with both linear and ReLU attention. When being trained with gradient descent, we show via a finite-sample analysis that the expected loss of these transformers converges at linear rate to the Bayes risk. Moreover, we prove that the trained models generalize to unseen samples as well as exhibit learning behaviors that were empirically observed in previous works. Our theoretical findings are further supported by extensive empirical validations.
comment: V2: added minor fixes for the proof of Theorem 3.4, fixed some typos throughout the paper, adjusted Remark 3.5
♻ ☆ Adaptive Experiments Under Data Sparse Settings: Applications for Educational Platforms
Adaptive experimentation is increasingly used in educational platforms to personalize learning through dynamic content and feedback. However, standard adaptive strategies such as Thompson Sampling often underperform in real-world educational settings where content variations are numerous and student participation is limited, resulting in sparse data. In particular, Thompson Sampling can lead to imbalanced content allocation and delayed convergence on which aspects of content are most effective for student learning. To address these challenges, we introduce Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS), an algorithm that refines the sampling strategy to improve content-related decision-making in data-sparse environments. WAPTS is guided by the principle of lenient regret, allowing near-optimal allocations to accelerate learning while still exploring promising content. We evaluate WAPTS in a learnersourcing scenario where students rate peer-generated learning materials, and demonstrate that it enables earlier and more reliable identification of promising treatments.
♻ ☆ Robust Finite-Memory Policy Gradients for Hidden-Model POMDPs IJCAI 2025
Partially observable Markov decision processes (POMDPs) model specific environments in sequential decision-making under uncertainty. Critically, optimal policies for POMDPs may not be robust against perturbations in the environment. Hidden-model POMDPs (HM-POMDPs) capture sets of different environment models, that is, POMDPs with a shared action and observation space. The intuition is that the true model is hidden among a set of potential models, and it is unknown which model will be the environment at execution time. A policy is robust for a given HM-POMDP if it achieves sufficient performance for each of its POMDPs. We compute such robust policies by combining two orthogonal techniques: (1) a deductive formal verification technique that supports tractable robust policy evaluation by computing a worst-case POMDP within the HM-POMDP, and (2) subgradient ascent to optimize the candidate policy for a worst-case POMDP. The empirical evaluation shows that, compared to various baselines, our approach (1) produces policies that are more robust and generalize better to unseen POMDPs, and (2) scales to HM-POMDPs that consist of over a hundred thousand environments.
comment: Accepted for publication at IJCAI 2025
♻ ☆ Structure As Search: Unsupervised Permutation Learning for Combinatorial Optimization
We propose a non-autoregressive framework for the Travelling Salesman Problem where solutions emerge directly from learned permutations, without requiring explicit search. By applying a similarity transformation to Hamiltonian cycles, the model learns to approximate permutation matrices via continuous relaxations. Our unsupervised approach achieves competitive performance against classical heuristics, demonstrating that the inherent structure of the problem can effectively guide combinatorial optimization without sequential decision-making.
♻ ☆ A Little Human Data Goes A Long Way ACL 2025
Faced with an expensive human annotation process, creators of NLP systems increasingly turn to synthetic data generation. While this method shows promise, the extent to which synthetic data can replace human annotation is poorly understood. We investigate the use of synthetic data in Fact Verification (FV) and Question Answering (QA) by studying the effects of incrementally replacing human generated data with synthetic points on eight diverse datasets. Strikingly, replacing up to 90% of the training data only marginally decreases performance, but replacing the final 10% leads to severe declines. We find that models trained on purely synthetic data can be reliably improved by including as few as 125 human generated data points. We show that matching the performance gain of just a little additional human data (only 200 points) requires an order of magnitude more synthetic data and estimate price ratios at which human annotation would be a more cost-effective solution. Our results suggest that even when human annotation at scale is infeasible, there is great value to having a small proportion of the dataset being human generated.
comment: ACL 2025
♻ ☆ CRINN: Contrastive Reinforcement Learning for Approximate Nearest Neighbor Search
Approximate nearest-neighbor search (ANNS) algorithms have become increasingly critical for recent AI applications, particularly in retrieval-augmented generation (RAG) and agent-based LLM applications. In this paper, we present CRINN, a new paradigm for ANNS algorithms. CRINN treats ANNS optimization as a reinforcement learning problem where execution speed serves as the reward signal. This approach enables the automatic generation of progressively faster ANNS implementations while maintaining accuracy constraints. Our experimental evaluation demonstrates CRINN's effectiveness across six widely-used NNS benchmark datasets. When compared against state-of-the-art open-source ANNS algorithms, CRINN achieves best performance on three of them (GIST-960-Euclidean, MNIST-784-Euclidean, and GloVe-25-angular), and tied for first place on two of them (SIFT-128-Euclidean and GloVe-25-angular). The implications of CRINN's success reach well beyond ANNS optimization: It validates that LLMs augmented with reinforcement learning can function as an effective tool for automating sophisticated algorithmic optimizations that demand specialized knowledge and labor-intensive manual refinement. Code can be found at https://github.com/deepreinforce-ai/CRINN
comment: Preprint Version
♻ ☆ CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 against default baselines over across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. In addition to the default baseline provided by KernelBench, CUDA-L1 demonstrates x2.77 over Torch Compile, x2.88 over Torch Compile with reduce overhead, x2.81 over CUDA Graph implementations, and remarkably x7.72 over cuDNN libraries. Furthermore, the model also demonstrates portability across different GPU architectures. Beyond these benchmark results, CUDA-L1 demonstrates several properties: it 1) discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) uncovers fundamental principles of CUDA optimization, such as the multiplicative nature of optimizations; 3) identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that actually harm performance. The capabilities demonstrate that, RL can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources.
comment: Project Page: https://deepreinforce-ai.github.io/cudal1_blog/
♻ ☆ Sharp Generalization for Nonparametric Regression in Interpolation Space by Over-Parameterized Neural Networks Trained with Preconditioned Gradient Descent and Early-Stopping
We study nonparametric regression using an over-parameterized two-layer neural networks trained with algorithmic guarantees in this paper. We consider the setting where the training features are drawn uniformly from the unit sphere in $\mathbb{R}^d$, and the target function lies in an interpolation space commonly studied in statistical learning theory. We demonstrate that training the neural network with a novel Preconditioned Gradient Descent (PGD) algorithm, equipped with early stopping, achieves a sharp regression rate of $\mathcal O(n^{-\frac{2\alpha s'}{2\alpha s'+1}})$ when the target function is in the interpolation space $[\mathcal H_K]^{s'}$ with $s' \ge 3$. This rate is even sharper than the currently known nearly-optimal rate of $\mathcal O(n^{-\frac{2\alpha s'}{2\alpha s'+1}})\log^2(1/\delta)$~\citep{Li2024-edr-general-domain}, where $n$ is the size of the training data and $\delta \in (0,1)$ is a small probability. This rate is also sharper than the standard kernel regression rate of $\mathcal O(n^{-\frac{2\alpha}{2\alpha+1}})$ obtained under the regular Neural Tangent Kernel (NTK) regime when training the neural network with the vanilla gradient descent (GD), where $2\alpha = d/(d-1)$. Our analysis is based on two key technical contributions. First, we present a principled decomposition of the network output at each PGD step into a function in the reproducing kernel Hilbert space (RKHS) of a newly induced integral kernel, and a residual function with small $L^{\infty}$-norm. Second, leveraging this decomposition, we apply local Rademacher complexity theory to tightly control the complexity of the function class comprising all the neural network functions obtained in the PGD iterates. Our results further suggest that PGD enables the neural network to escape the linear NTK regime and achieve improved generalization by inducing a new integral kernel of lower kernel complexity.
♻ ☆ EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models. EgoDex is publicly available for download at https://github.com/apple/ml-egodex.
Multimedia 9
☆ ShizhenGPT: Towards Multimodal LLMs for Traditional Chinese Medicine
Despite the success of large language models (LLMs) in various domains, their potential in Traditional Chinese Medicine (TCM) remains largely underexplored due to two critical barriers: (1) the scarcity of high-quality TCM data and (2) the inherently multimodal nature of TCM diagnostics, which involve looking, listening, smelling, and pulse-taking. These sensory-rich modalities are beyond the scope of conventional LLMs. To address these challenges, we present ShizhenGPT, the first multimodal LLM tailored for TCM. To overcome data scarcity, we curate the largest TCM dataset to date, comprising 100GB+ of text and 200GB+ of multimodal data, including 1.2M images, 200 hours of audio, and physiological signals. ShizhenGPT is pretrained and instruction-tuned to achieve deep TCM knowledge and multimodal reasoning. For evaluation, we collect recent national TCM qualification exams and build a visual benchmark for Medicinal Recognition and Visual Diagnosis. Experiments demonstrate that ShizhenGPT outperforms comparable-scale LLMs and competes with larger proprietary models. Moreover, it leads in TCM visual understanding among existing multimodal LLMs and demonstrates unified perception across modalities like sound, pulse, smell, and vision, paving the way toward holistic multimodal perception and diagnosis in TCM. Datasets, models, and code are publicly available. We hope this work will inspire further exploration in this field.
☆ FakeHunter: Multimodal Step-by-Step Reasoning for Explainable Video Forensics
FakeHunter is a multimodal deepfake detection framework that combines memory-guided retrieval, chain-of-thought (Observation-Thought-Action) reasoning, and tool-augmented verification to provide accurate and interpretable video forensics. FakeHunter encodes visual content using CLIP and audio using CLAP, generating joint audio-visual embeddings that retrieve semantically similar real exemplars from a FAISS-indexed memory bank for contextual grounding. Guided by the retrieved context, the system iteratively reasons over evidence to localize manipulations and explain them. When confidence is low, it automatically invokes specialized tools-such as zoom-in image forensics or mel-spectrogram inspection-for fine-grained verification. Built on Qwen2.5-Omni-7B, FakeHunter produces structured JSON verdicts that specify what was modified, where it occurs, and why it is judged fake. We also introduce X-AVFake, a benchmark comprising 5.7k+ manipulated and real videos (950+ min) annotated with manipulation type, region/entity, violated reasoning category, and free-form justification. On X-AVFake, FakeHunter achieves an accuracy of 34.75%, outperforming the vanilla Qwen2.5-Omni-7B by 16.87 percentage points and MiniCPM-2.6 by 25.56 percentage points. Ablation studies reveal that memory retrieval contributes a 7.75 percentage point gain, and tool-based inspection improves low-confidence cases to 46.50%. Despite its multi-stage design, the pipeline processes a 10-minute clip in 8 minutes on a single NVIDIA A800 (0.8x real-time) or 2 minutes on four GPUs (0.2x), demonstrating practical deployability.
☆ Fine-grained Image Quality Assessment for Perceptual Image Restoration
Recent years have witnessed remarkable achievements in perceptual image restoration (IR), creating an urgent demand for accurate image quality assessment (IQA), which is essential for both performance comparison and algorithm optimization. Unfortunately, the existing IQA metrics exhibit inherent weakness for IR task, particularly when distinguishing fine-grained quality differences among restored images. To address this dilemma, we contribute the first-of-its-kind fine-grained image quality assessment dataset for image restoration, termed FGRestore, comprising 18,408 restored images across six common IR tasks. Beyond conventional scalar quality scores, FGRestore was also annotated with 30,886 fine-grained pairwise preferences. Based on FGRestore, a comprehensive benchmark was conducted on the existing IQA metrics, which reveal significant inconsistencies between score-based IQA evaluations and the fine-grained restoration quality. Motivated by these findings, we further propose FGResQ, a new IQA model specifically designed for image restoration, which features both coarse-grained score regression and fine-grained quality ranking. Extensive experiments and comparisons demonstrate that FGResQ significantly outperforms state-of-the-art IQA metrics. Codes and model weights have been released in https://pxf0429.github.io/FGResQ/
comment: 9 pages,6 figures
☆ Scalable Event-Based Video Streaming for Machines with MoQ
Lossy compression and rate-adaptive streaming are a mainstay in traditional video steams. However, a new class of neuromorphic ``event'' sensors records video with asynchronous pixel samples rather than image frames. These sensors are designed for computer vision applications, rather than human video consumption. Until now, researchers have focused their efforts primarily on application development, ignoring the crucial problem of data transmission. We survey the landscape of event-based video systems, discuss the technical issues with our recent scalable event streaming work, and propose a new low-latency event streaming format based on the latest additions to the Media Over QUIC protocol draft.
comment: Accepted to ACM Mile High Video 2025
☆ \textit{adder-viz}: Real-Time Visualization Software for Transcoding Event Video
Recent years have brought about a surge in neuromorphic ``event'' video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified AD{\Delta}ER representation to address these concerns. This paper introduces numerous improvements to the \textit{adder-viz} software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at \href{https://github.com/ac-freeman/adder-codec-rs}{https://github.com/ac-freeman/adder-codec-rs}.
comment: Accepted to the Open-Source Track at ACM Multimedia 2025
☆ Holo-Artisan: A Personalized Multi-User Holographic Experience for Virtual Museums on the Edge Intelligence
We present Holo-Artisan, a novel system architecture enabling immersive multi-user experiences in virtual museums through true holographic displays and personalized edge intelligence. In our design, local edge computing nodes process real-time user data -- including pose, facial expression, and voice -- for multiple visitors concurrently. Generative AI models then drive digital artworks (e.g., a volumetric Mona Lisa) to respond uniquely to each viewer. For instance, the Mona Lisa can return a smile to one visitor while engaging in a spoken Q\&A with another, all in real time. A cloud-assisted collaboration platform composes these interactions in a shared scene using a universal scene description, and employs ray tracing to render high-fidelity, personalized views with a direct pipeline to glasses-free holographic displays. To preserve user privacy and continuously improve personalization, we integrate federated learning (FL) -- edge devices locally fine-tune AI models and share only model updates for aggregation. This edge-centric approach minimizes latency and bandwidth usage, ensuring a synchronized shared experience with individual customization. Through Holo-Artisan, static museum exhibits are transformed into dynamic, living artworks that engage each visitor in a personal dialogue, heralding a new paradigm of cultural heritage interaction.
☆ Robust Symbolic Reasoning for Visual Narratives via Hierarchical and Semantically Normalized Knowledge Graphs
Understanding visual narratives such as comics requires structured representations that capture events, characters, and their relations across multiple levels of story organization. However, symbolic narrative graphs often suffer from inconsistency and redundancy, where similar actions or events are labeled differently across annotations or contexts. Such variance limits the effectiveness of reasoning and generalization. This paper introduces a semantic normalization framework for hierarchical narrative knowledge graphs. Building on cognitively grounded models of narrative comprehension, we propose methods that consolidate semantically related actions and events using lexical similarity and embedding-based clustering. The normalization process reduces annotation noise, aligns symbolic categories across narrative levels, and preserves interpretability. We demonstrate the framework on annotated manga stories from the Manga109 dataset, applying normalization to panel-, event-, and story-level graphs. Preliminary evaluations across narrative reasoning tasks, such as action retrieval, character grounding, and event summarization, show that semantic normalization improves coherence and robustness, while maintaining symbolic transparency. These findings suggest that normalization is a key step toward scalable, cognitively inspired graph models for multimodal narrative understanding.
comment: 12 pages, 4 figures, 2 tables. Extends our earlier framework on hierarchical narrative graphs with a semantic normalization module
♻ ☆ Identity Preserving 3D Head Stylization with Multiview Score Distillation
3D head stylization transforms realistic facial features into artistic representations, enhancing user engagement across gaming and virtual reality applications. While 3D-aware generators have made significant advancements, many 3D stylization methods primarily provide near-frontal views and struggle to preserve the unique identities of original subjects, often resulting in outputs that lack diversity and individuality. This paper addresses these challenges by leveraging the PanoHead model, synthesizing images from a comprehensive 360-degree perspective. We propose a novel framework that employs negative log-likelihood distillation (LD) to enhance identity preservation and improve stylization quality. By integrating multi-view grid score and mirror gradients within the 3D GAN architecture and introducing a score rank weighing technique, our approach achieves substantial qualitative and quantitative improvements. Our findings not only advance the state of 3D head stylization but also provide valuable insights into effective distillation processes between diffusion models and GANs, focusing on the critical issue of identity preservation. Please visit the https://three-bee.github.io/head_stylization for more visuals.
comment: https://three-bee.github.io/head_stylization
♻ ☆ Interpreting the linear structure of vision-language model embedding spaces
Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
comment: COLM 2025
Computation and Language 82
☆ The Promise of Large Language Models in Digital Health: Evidence from Sentiment Analysis in Online Health Communities
Digital health analytics face critical challenges nowadays. The sophisticated analysis of patient-generated health content, which contains complex emotional and medical contexts, requires scarce domain expertise, while traditional ML approaches are constrained by data shortage and privacy limitations in healthcare settings. Online Health Communities (OHCs) exemplify these challenges with mixed-sentiment posts, clinical terminology, and implicit emotional expressions that demand specialised knowledge for accurate Sentiment Analysis (SA). To address these challenges, this study explores how Large Language Models (LLMs) can integrate expert knowledge through in-context learning for SA, providing a scalable solution for sophisticated health data analysis. Specifically, we develop a structured codebook that systematically encodes expert interpretation guidelines, enabling LLMs to apply domain-specific knowledge through targeted prompting rather than extensive training. Six GPT models validated alongside DeepSeek and LLaMA 3.1 are compared with pre-trained language models (BioBERT variants) and lexicon-based methods, using 400 expert-annotated posts from two OHCs. LLMs achieve superior performance while demonstrating expert-level agreement. This high agreement, with no statistically significant difference from inter-expert agreement levels, suggests knowledge integration beyond surface-level pattern recognition. The consistent performance across diverse LLM models, supported by in-context learning, offers a promising solution for digital health analytics. This approach addresses the critical challenge of expert knowledge shortage in digital health research, enabling real-time, expert-quality analysis for patient monitoring, intervention assessment, and evidence-based health strategies.
☆ Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
comment: Source code: https://github.com/HahmDY/prefix_injection_guard
☆ Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
☆ Ask Good Questions for Large Language Models
Recent advances in large language models (LLMs) have significantly improved the performance of dialog systems, yet current approaches often fail to provide accurate guidance of topic due to their inability to discern user confusion in related concepts. To address this, we introduce the Ask-Good-Question (AGQ) framework, which features an improved Concept-Enhanced Item Response Theory (CEIRT) model to better identify users' knowledge levels. Our contributions include applying the CEIRT model along with LLMs to directly generate guiding questions based on the inspiring text, greatly improving information retrieval efficiency during the question & answer process. Through comparisons with other baseline methods, our approach outperforms by significantly enhencing the users' information retrieval experiences.
☆ Chunks as Arms: Multi-Armed Bandit-Guided Sampling for Long-Context LLM Preference Optimization
Long-context modeling is critical for a wide range of real-world tasks, including long-context question answering, summarization, and complex reasoning tasks. Recent studies have explored fine-tuning Large Language Models (LLMs) with synthetic data to enhance their long-context capabilities. However, the effectiveness of such approaches is often limited by the low diversity and factual inconsistencies in the generated data. To address these challenges, we propose LongMab-PO, a novel framework that leverages a Multi-Armed Bandit (MAB) rollout strategy to identify the most informative chunks from the given long context for sampling high-quality and diverse responses and constructing preference data pairs for Direct Preference Optimization (DPO) training. Specifically, we treat context chunks as arms of MAB, select chunks based on their expected reward scores to input into LLMs to generate responses, and iteratively update these scores based on reward feedback. This exploration and exploitation process enables the model to focus on the most relevant context segments, thereby generating and collecting high-quality and diverse responses. Finally, we collect these generated responses from the rollout process and apply the DPO method to further optimize the LLM. Experimental results show that LongMab-PO significantly improves the diversity and quality of preference data pairs, achieving state-of-the-art performance on long-context reasoning benchmarks. All code and data will be released on https://github.com/NEUIR/LongMab-PO.
☆ RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
We investigate to what extent Multimodal Large Language Models (MLLMs) can accurately identify the orientation of input images rotated 0{\deg}, 90{\deg}, 180{\deg}, and 270{\deg}. This task demands robust visual reasoning capabilities to detect rotational cues and contextualize spatial relationships within images, regardless of their orientation. To evaluate MLLMs on these abilities, we introduce RotBench -- a 350-image manually-filtered benchmark comprising lifestyle, portrait, and landscape images. Despite the relatively simple nature of this task, we show that several state-of-the-art open and proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably identify rotation in input images. Providing models with auxiliary information -- including captions, depth maps, and more -- or using chain-of-thought prompting offers only small and inconsistent improvements. Our results indicate that most models are able to reliably identify right-side-up (0{\deg}) images, while certain models are able to identify upside-down (180{\deg}) images. None can reliably distinguish between 90{\deg} and 270{\deg}. Simultaneously showing the image rotated in different orientations leads to moderate performance gains for reasoning models, while a modified setup using voting improves the performance of weaker models. We further show that fine-tuning does not improve models' ability to distinguish 90{\deg} and 270{\deg} rotations, despite substantially improving the identification of 180{\deg} images. Together, these results reveal a significant gap between MLLMs' spatial reasoning capabilities and human perception in identifying rotation.
comment: 20 pages. Code and data: https://github.com/tianyiniu/RotBench
ReviewGraph: A Knowledge Graph Embedding Based Framework for Review Rating Prediction with Sentiment Features
In the hospitality industry, understanding the factors that drive customer review ratings is critical for improving guest satisfaction and business performance. This work proposes ReviewGraph for Review Rating Prediction (RRP), a novel framework that transforms textual customer reviews into knowledge graphs by extracting (subject, predicate, object) triples and associating sentiment scores. Using graph embeddings (Node2Vec) and sentiment features, the framework predicts review rating scores through machine learning classifiers. We compare ReviewGraph performance with traditional NLP baselines (such as Bag of Words, TF-IDF, and Word2Vec) and large language models (LLMs), evaluating them in the HotelRec dataset. In comparison to the state of the art literature, our proposed model performs similar to their best performing model but with lower computational cost (without ensemble). While ReviewGraph achieves comparable predictive performance to LLMs and outperforms baselines on agreement-based metrics such as Cohen's Kappa, it offers additional advantages in interpretability, visual exploration, and potential integration into Retrieval-Augmented Generation (RAG) systems. This work highlights the potential of graph-based representations for enhancing review analytics and lays the groundwork for future research integrating advanced graph neural networks and fine-tuned LLM-based extraction methods. We will share ReviewGraph output and platform open-sourced on our GitHub page https://github.com/aaronlifenghan/ReviewGraph
☆ Query Logs Analytics: A Aystematic Literature Review
In the digital era, user interactions with various resources such as databases, data warehouses, websites, and knowledge graphs (KGs) are increasingly mediated through digital platforms. These interactions leave behind digital traces, systematically captured in the form of logs. Logs, when effectively exploited, provide high value across industry and academia, supporting critical services (e.g., recovery and security), user-centric applications (e.g., recommender systems), and quality-of-service improvements (e.g., performance optimization). Despite their importance, research on log usage remains fragmented across domains, and no comprehensive study currently consolidates existing efforts. This paper presents a systematic survey of log usage, focusing on Database (DB), Data Warehouse (DW), Web, and KG logs. More than 300 publications were analyzed to address three central questions: (1) do different types of logs share common structural and functional characteristics? (2) are there standard pipelines for their usage? (3) which constraints and non-functional requirements (NFRs) guide their exploitation?. The survey reveals a limited number of end-to-end approaches, the absence of standardization across log usage pipelines, and the existence of shared structural elements among different types of logs. By consolidating existing knowledge, identifying gaps, and highlighting opportunities, this survey provides researchers and practitioners with a comprehensive overview of log usage and sheds light on promising directions for future research, particularly regarding the exploitation and democratization of KG logs.
Prompt Orchestration Markup Language
Large Language Models (LLMs) require sophisticated prompting, yet current practices face challenges in structure, data integration, format sensitivity, and tooling. Existing methods lack comprehensive solutions for organizing complex prompts involving diverse data types (documents, tables, images) or managing presentation variations systematically. To address these gaps, we introduce POML (Prompt Orchestration Markup Language). POML employs component-based markup for logical structure (roles, tasks, examples), specialized tags for seamless data integration, and a CSS-like styling system to decouple content from presentation, reducing formatting sensitivity. It includes templating for dynamic prompts and a comprehensive developer toolkit (IDE support, SDKs) to improve version control and collaboration. We validate POML through two case studies demonstrating its impact on complex application integration (PomLink) and accuracy performance (TableQA), as well as a user study assessing its effectiveness in real-world development scenarios.
comment: All findings in this paper are derived from a POML snapshot as of February 2025
☆ MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
Recently, multimodal large language models (MLLMs) have achieved significant advancements across various domains, and corresponding evaluation benchmarks have been continuously refined and improved. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) Insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive modality coverage; 3) Lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs, which involve 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is widely challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracy of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level compared to existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. The Data and Evaluation Code are available at https://github.com/JCruan519/MME-SCI.
comment: 9 pages, 6 figures, work in progress
☆ Improved Generalized Planning with LLMs through Strategy Refinement and Reflection
LLMs have recently been used to generate Python programs representing generalized plans in PDDL planning, i.e., plans that generalize across the tasks of a given PDDL domain. Previous work proposed a framework consisting of three steps: the LLM first generates a summary and then a strategy for the domain, both in natural language, and then implements that strategy as a Python program, that gets debugged on example planning tasks. In that work, only one strategy is generated and passed directly to the program generation. If the strategy is incorrect, its implementation will therefore result in an incorrect generalized plan. Here, we introduce an approach that generates the strategy in the form of pseudocode and enables automatic debugging of the pseudocode, hence allowing us to identify and fix errors prior to the generation of the generalized plan itself. Additionally, we extend the Python debugging phase with a reflection step prompting the LLM to pinpoint the reason for the observed plan failure. Finally, we take inspiration from LLM code generation to produce several program variants and pick the best one. Running experiments on 17 benchmark domains, we show that these extensions substantially improve (and never deteriorate) the quality of the generalized plans. In 12 of the domains, our best Python programs solve all tasks that can be generated with the respective instance generator.
☆ Extracting Structured Requirements from Unstructured Building Technical Specifications for Building Information Modeling
This study explores the integration of Building Information Modeling (BIM) with Natural Language Processing (NLP) to automate the extraction of requirements from unstructured French Building Technical Specification (BTS) documents within the construction industry. Employing Named Entity Recognition (NER) and Relation Extraction (RE) techniques, the study leverages the transformer-based model CamemBERT and applies transfer learning with the French language model Fr\_core\_news\_lg, both pre-trained on a large French corpus in the general domain. To benchmark these models, additional approaches ranging from rule-based to deep learning-based methods are developed. For RE, four different supervised models, including Random Forest, are implemented using a custom feature vector. A hand-crafted annotated dataset is used to compare the effectiveness of NER approaches and RE models. Results indicate that CamemBERT and Fr\_core\_news\_lg exhibited superior performance in NER, achieving F1-scores over 90\%, while Random Forest proved most effective in RE, with an F1 score above 80\%. The outcomes are intended to be represented as a knowledge graph in future work to further enhance automatic verification systems.
☆ The illusion of a perfect metric: Why evaluating AI's words is harder than it looks
Evaluating Natural Language Generation (NLG) is crucial for the practical adoption of AI, but has been a longstanding research challenge. While human evaluation is considered the de-facto standard, it is expensive and lacks scalability. Practical applications have driven the development of various automatic evaluation metrics (AEM), designed to compare the model output with human-written references, generating a score which approximates human judgment. Over time, AEMs have evolved from simple lexical comparisons, to semantic similarity models and, more recently, to LLM-based evaluators. However, it seems that no single metric has emerged as a definitive solution, resulting in studies using different ones without fully considering the implications. This paper aims to show this by conducting a thorough examination of the methodologies of existing metrics, their documented strengths and limitations, validation methods, and correlations with human judgment. We identify several key challenges: metrics often capture only specific aspects of text quality, their effectiveness varies by task and dataset, validation practices remain unstructured, and correlations with human judgment are inconsistent. Importantly, we find that these challenges persist in the most recent type of metric, LLM-as-a-Judge, as well as in the evaluation of Retrieval Augmented Generation (RAG), an increasingly relevant task in academia and industry. Our findings challenge the quest for the 'perfect metric'. We propose selecting metrics based on task-specific needs and leveraging complementary evaluations and advocate that new metrics should focus on enhanced validation methodologies.
comment: 11 pages, 1 figure. Accepted to RANLP 2025
Prompt-Based One-Shot Exact Length-Controlled Generation with LLMs
Controlling the length of text produced by large language models (LLMs) remains challenging: models frequently overshoot or undershoot explicit length instructions because they cannot reliably keep an internal token count. We present a prompt-based, one-shot strategy that compels an off-the-shelf LLM to generate exactly a desired number of tokens - words (English) or characters (Chinese) - without any fine-tuning or iterative sampling. The prompt appends countdown markers and explicit counting rules so that the model "writes while counting." We evaluate on four settings: open-ended generation (1-1000 tokens), XSUM summarization, MT-Bench-LI instruction following, and the LIFEBENCH equal-length track. On MT-Bench-LI, strict length compliance with GPT-4.1 leaps from below 30% under naive prompts to above 95% with our countdown prompt, surpassing the popular draft-then-revise baseline, while judged answer quality is preserved. These results show that precise length control can be achieved through prompt engineering alone, offering a lightweight alternative to training- or decoding-based methods.
comment: 18 pages
☆ Beyond Human Judgment: A Bayesian Evaluation of LLMs' Moral Values Understanding
How do large language models understand moral dimensions compared to humans? This first large-scale Bayesian evaluation of market-leading language models provides the answer. In contrast to prior work using deterministic ground truth (majority or inclusion rules), we model annotator disagreements to capture both aleatoric uncertainty (inherent human disagreement) and epistemic uncertainty (model domain sensitivity). We evaluate top language models (Claude Sonnet 4, DeepSeek-V3, Llama 4 Maverick) across 250K+ annotations from ~700 annotators on 100K+ texts spanning social media, news, and forums. Our GPU-optimized Bayesian framework processed 1M+ model queries, revealing that AI models typically rank among the top 25\% of human annotators, achieving much better-than-average balanced accuracy. Importantly, we find that AI produces far fewer false negatives than humans, highlighting their more sensitive moral detection capabilities.
☆ TracSum: A New Benchmark for Aspect-Based Summarization with Sentence-Level Traceability in Medical Domain
While document summarization with LLMs has enhanced access to textual information, concerns about the factual accuracy of these summaries persist, especially in the medical domain. Tracing evidence from which summaries are derived enables users to assess their accuracy, thereby alleviating this concern. In this paper, we introduce TracSum, a novel benchmark for traceable, aspect-based summarization, in which generated summaries are paired with sentence-level citations, enabling users to trace back to the original context. First, we annotate 500 medical abstracts for seven key medical aspects, yielding 3.5K summary-citation pairs. We then propose a fine-grained evaluation framework for this new task, designed to assess the completeness and consistency of generated content using four metrics. Finally, we introduce a summarization pipeline, Track-Then-Sum, which serves as a baseline method for comparison. In experiments, we evaluate both this baseline and a set of LLMs on TracSum, and conduct a human evaluation to assess the evaluation results. The findings demonstrate that TracSum can serve as an effective benchmark for traceable, aspect-based summarization tasks. We also observe that explicitly performing sentence-level tracking prior to summarization enhances generation accuracy, while incorporating the full context further improves completeness.
comment: 8 main pages, 12 appendix pages
☆ Can Large Language Models (LLMs) Describe Pictures Like Children? A Comparative Corpus Study
The role of large language models (LLMs) in education is increasing, yet little attention has been paid to whether LLM-generated text resembles child language. This study evaluates how LLMs replicate child-like language by comparing LLM-generated texts to a collection of German children's descriptions of picture stories. We generated two LLM-based corpora using the same picture stories and two prompt types: zero-shot and few-shot prompts specifying a general age from the children corpus. We conducted a comparative analysis across psycholinguistic text properties, including word frequency, lexical richness, sentence and word length, part-of-speech tags, and semantic similarity with word embeddings. The results show that LLM-generated texts are longer but less lexically rich, rely more on high-frequency words, and under-represent nouns. Semantic vector space analysis revealed low similarity, highlighting differences between the two corpora on the level of corpus semantics. Few-shot prompt increased similarities between children and LLM text to a minor extent, but still failed to replicate lexical and semantic patterns. The findings contribute to our understanding of how LLMs approximate child language through multimodal prompting (text + image) and give insights into their use in psycholinguistic research and education while raising important questions about the appropriateness of LLM-generated language in child-directed educational tools.
☆ MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment
Large Language Models have shown growing ability to generate fluent and coherent texts that are highly similar to the writing style of humans. Current detectors for Machine-Generated Text (MGT) perform well when they are trained and tested in the same domain but generalize poorly to unseen domains, due to domain shift between data from different sources. In this work, we propose MGT-Prism, an MGT detection method from the perspective of the frequency domain for better domain generalization. Our key insight stems from analyzing text representations in the frequency domain, where we observe consistent spectral patterns across diverse domains, while significant discrepancies in magnitude emerge between MGT and human-written texts (HWTs). The observation initiates the design of a low frequency domain filtering module for filtering out the document-level features that are sensitive to domain shift, and a dynamic spectrum alignment strategy to extract the task-specific and domain-invariant features for improving the detector's performance in domain generalization. Extensive experiments demonstrate that MGT-Prism outperforms state-of-the-art baselines by an average of 0.90% in accuracy and 0.92% in F1 score on 11 test datasets across three domain-generalization scenarios.
☆ Sycophancy under Pressure: Evaluating and Mitigating Sycophantic Bias via Adversarial Dialogues in Scientific QA
Large language models (LLMs), while increasingly used in domains requiring factual rigor, often display a troubling behavior: sycophancy, the tendency to align with user beliefs regardless of correctness. This tendency is reinforced by preference-based alignment techniques that optimize for user satisfaction but can undermine truthfulness. While relatively benign in casual dialogue, sycophancy poses serious risks in high-stakes settings such as scientific question answering (QA), where model outputs may shape collaborative reasoning, decision-making, and knowledge formation. Despite its importance, this phenomenon remains underexamined in factual QA contexts. We address this gap by introducing a unified evaluation framework to quantify the impact of sycophantic context on model behavior in scientific QA, measuring how much user-imposed social pressure distorts model outputs. The framework incorporates adversarial prompting setups and targeted metrics, such as misleading resistance and sycophancy resistance, that capture a model's ability to maintain factual consistency under misleading cues. Systematic evaluations across open-source and proprietary models reveal pervasive sycophantic tendencies, driven more by alignment strategy than by model size. To mitigate this issue, we propose Pressure-Tune, a lightweight post-training method that fine-tunes models on synthetic adversarial dialogues paired with chain-of-thought rationales. These rationales reject user misinformation while reinforcing factual commitments. Experiments on challenging scientific QA benchmarks show that Pressure-Tune significantly enhances sycophancy resistance without compromising accuracy or responsiveness to valid feedback, offering a practical pathway toward more truthful and principled model behavior.
☆ EEG-MedRAG: Enhancing EEG-based Clinical Decision-Making via Hierarchical Hypergraph Retrieval-Augmented Generation
With the widespread application of electroencephalography (EEG) in neuroscience and clinical practice, efficiently retrieving and semantically interpreting large-scale, multi-source, heterogeneous EEG data has become a pressing challenge. We propose EEG-MedRAG, a three-layer hypergraph-based retrieval-augmented generation framework that unifies EEG domain knowledge, individual patient cases, and a large-scale repository into a traversable n-ary relational hypergraph, enabling joint semantic-temporal retrieval and causal-chain diagnostic generation. Concurrently, we introduce the first cross-disease, cross-role EEG clinical QA benchmark, spanning seven disorders and five authentic clinical perspectives. This benchmark allows systematic evaluation of disease-agnostic generalization and role-aware contextual understanding. Experiments show that EEG-MedRAG significantly outperforms TimeRAG and HyperGraphRAG in answer accuracy and retrieval, highlighting its strong potential for real-world clinical decision support. Our data and code are publicly available at https://github.com/yi9206413-boop/EEG-MedRAG.
☆ Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings ECAI 2025
Understanding what knowledge is implicitly encoded in deep learning models is essential for improving the interpretability of AI systems. This paper examines common methods to explain the knowledge encoded in word embeddings, which are core elements of large language models (LLMs). These methods typically involve mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Prior work assumes that accurately predicting these semantic features from the word embeddings implies that the embeddings contain the corresponding knowledge. We challenge this assumption by demonstrating that prediction accuracy alone does not reliably indicate genuine feature-based interpretability. We show that these methods can successfully predict even random information, concluding that the results are predominantly determined by an algorithmic upper bound rather than meaningful semantic representation in the word embeddings. Consequently, comparisons between datasets based solely on prediction performance do not reliably indicate which dataset is better captured by the word embeddings. Our analysis illustrates that such mappings primarily reflect geometric similarity within vector spaces rather than indicating the genuine emergence of semantic properties.
comment: 10 pages, 6 Figures. Published at ECAI 2025 in a version without the Appendix
☆ Generics and Default Reasoning in Large Language Models
This paper evaluates the capabilities of 28 large language models (LLMs) to reason with 20 defeasible reasoning patterns involving generic generalizations (e.g., 'Birds fly', 'Ravens are black') central to non-monotonic logic. Generics are of special interest to linguists, philosophers, logicians, and cognitive scientists because of their complex exception-permitting behaviour and their centrality to default reasoning, cognition, and concept acquisition. We find that while several frontier models handle many default reasoning problems well, performance varies widely across models and prompting styles. Few-shot prompting modestly improves performance for some models, but chain-of-thought (CoT) prompting often leads to serious performance degradation (mean accuracy drop -11.14%, SD 15.74% in models performing above 75% accuracy in zero-shot condition, temperature 0). Most models either struggle to distinguish between defeasible and deductive inference or misinterpret generics as universal statements. These findings underscore both the promise and limits of current LLMs for default reasoning.
comment: 33 pages, 26 figures
☆ ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams through proposing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art VLMs achieve only 57.74% while open-source models achieve 27.70% mean accuracy across 7 academic domains, including Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%), with only the thinking VLM o3 (74.07%) exceeding human average performance, yet still falling substantially short of human best performance (99.60%). Cross-lingual prompting with English instructions while maintaining Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance by 5 percentage points. Code and data are available at: https://vi-exam.github.io.
☆ Input Time Scaling
Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data & training scaling) and doing reasoning in test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also find a new phenomenon, training-testing co-design there. We need to apply query strategies during both training and testing. Only applying strategies on training or testing would seriously degrade the performance. We are also surprised to find that seemingly low data quality datasets can gain high performance. Adding irrelevant information to the queries, randomly selecting examples from a minimally filtered dataset, can even perform the best. These findings contradict the widely held inductive bias, "garbage in, garbage out". Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data with similar quality (15k VS 1k) perform worse, simple dataset size scaling should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon. A small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct, we are able to reach SOTA performance among 32B models on AIME24(76.7%) and AIME25(76.7%) pass@1. We can further achieve AIME24(76.7%) and AIME25(80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result would be 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-source our datasets, data pipelines, evaluation results, and checkpoints.
☆ CRISP: Persistent Concept Unlearning via Sparse Autoencoders
As large language models (LLMs) are increasingly deployed in real-world applications, the need to selectively remove unwanted knowledge while preserving model utility has become paramount. Recent work has explored sparse autoencoders (SAEs) to perform precise interventions on monosemantic features. However, most SAE-based methods operate at inference time, which does not create persistent changes in the model's parameters. Such interventions can be bypassed or reversed by malicious actors with parameter access. We introduce CRISP, a parameter-efficient method for persistent concept unlearning using SAEs. CRISP automatically identifies salient SAE features across multiple layers and suppresses their activations. We experiment with two LLMs and show that our method outperforms prior approaches on safety-critical unlearning tasks from the WMDP benchmark, successfully removing harmful knowledge while preserving general and in-domain capabilities. Feature-level analysis reveals that CRISP achieves semantically coherent separation between target and benign concepts, allowing precise suppression of the target features.
comment: 18 pages, 5 figures
☆ AdaDocVQA: Adaptive Framework for Long Document Visual Question Answering in Low-Resource Settings
Document Visual Question Answering (Document VQA) faces significant challenges when processing long documents in low-resource environments due to context limitations and insufficient training data. This paper presents AdaDocVQA, a unified adaptive framework addressing these challenges through three core innovations: a hybrid text retrieval architecture for effective document segmentation, an intelligent data augmentation pipeline that automatically generates high-quality reasoning question-answer pairs with multi-level verification, and adaptive ensemble inference with dynamic configuration generation and early stopping mechanisms. Experiments on Japanese document VQA benchmarks demonstrate substantial improvements with 83.04\% accuracy on Yes/No questions, 52.66\% on factual questions, and 44.12\% on numerical questions in JDocQA, and 59\% accuracy on LAVA dataset. Ablation studies confirm meaningful contributions from each component, and our framework establishes new state-of-the-art results for Japanese document VQA while providing a scalable foundation for other low-resource languages and specialized domains. Our code available at: https://github.com/Haoxuanli-Thu/AdaDocVQA.
☆ Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark's speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, featuring gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and has some gender inclinations.
☆ A Comparative Study of Decoding Strategies in Medical Text Generation
Large Language Models (LLMs) rely on various decoding strategies to generate text, and these choices can significantly affect output quality. In healthcare, where accuracy is critical, the impact of decoding strategies remains underexplored. We investigate this effect in five open-ended medical tasks, including translation, summarization, question answering, dialogue, and image captioning, evaluating 11 decoding strategies with medically specialized and general-purpose LLMs of different sizes. Our results show that deterministic strategies generally outperform stochastic ones: beam search achieves the highest scores, while {\eta} and top-k sampling perform worst. Slower decoding methods tend to yield better quality. Larger models achieve higher scores overall but have longer inference times and are no more robust to decoding. Surprisingly, while medical LLMs outperform general ones in two of the five tasks, statistical analysis shows no overall performance advantage and reveals greater sensitivity to decoding choice. We further compare multiple evaluation metrics and find that correlations vary by task, with MAUVE showing weak agreement with BERTScore and ROUGE, as well as greater sensitivity to the decoding strategy. These results highlight the need for careful selection of decoding methods in medical applications, as their influence can sometimes exceed that of model choice.
☆ Compressed Models are NOT Trust-equivalent to Their Large Counterparts
Large Deep Learning models are often compressed before being deployed in a resource-constrained environment. Can we trust the prediction of compressed models just as we trust the prediction of the original large model? Existing work has keenly studied the effect of compression on accuracy and related performance measures. However, performance parity does not guarantee trust-equivalence. We propose a two-dimensional framework for trust-equivalence evaluation. First, interpretability alignment measures whether the models base their predictions on the same input features. We use LIME and SHAP tests to measure the interpretability alignment. Second, calibration similarity measures whether the models exhibit comparable reliability in their predicted probabilities. It is assessed via ECE, MCE, Brier Score, and reliability diagrams. We conducted experiments using BERT-base as the large model and its multiple compressed variants. We focused on two text classification tasks: natural language inference and paraphrase identification. Our results reveal low interpretability alignment and significant mismatch in calibration similarity. It happens even when the accuracies are nearly identical between models. These findings show that compressed models are not trust-equivalent to their large counterparts. Deploying compressed models as a drop-in replacement for large models requires careful assessment, going beyond performance parity.
☆ MATA (māta): Mindful Assessment of the Telugu Abilities of Large Language Models
In this paper, we introduce MATA, a novel evaluation dataset to assess the ability of Large Language Models (LLMs) in Telugu language, comprising 729 carefully curated multiple-choice and open-ended questions that span diverse linguistic dimensions. We evaluate 11 open-weight and closed-source LLMs on our dataset and present a fine-grained analysis of their performance. Further, we empirically show how LLMs rely on superficial heuristics such as answer position and distractor patterns for multiple-choice questions. Finally, we also compare LLM-as-a-judge evaluation with human evaluation for open-ended questions and draw some conclusions on its reliability in a low-resource language. We argue that such fine-grained evaluation is essential for understanding model limitations and can inform the development of more linguistically capable LLMs, while also serving as a foundation for future research in Telugu NLP.
comment: Pre-print
☆ Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
comment: 7 pages, 6 figures, 2 tables. Code: https://github.com/HasanBGIt/Saudi-Dialect-ALLaM . Dataset and trained weights/adapters are not released. Primary category: cs.CL
☆ ProMed: Shapley Information Gain Guided Reinforcement Learning for Proactive Medical LLMs
Interactive medical questioning is essential in real-world clinical consultations, where physicians must actively gather information from patients. While medical Large Language Models (LLMs) have shown impressive capabilities in static medical question answering, they predominantly operate under a reactive paradigm: generating answers directly without seeking additional information, which risks incorrect diagnoses in such interactive settings. To address this limitation, we propose ProMed, a reinforcement learning (RL) framework that transitions medical LLMs toward a proactive paradigm, equipping them with the ability to ask clinically valuable questions before decision-making. At the core of ProMed is the Shapley Information Gain (SIG) reward, which quantifies the clinical utility of each question by combining the amount of newly acquired information with its contextual importance, estimated via Shapley values. We integrate SIG into a two-stage training pipeline: (1) SIG-Guided Model Initialization uses Monte Carlo Tree Search (MCTS) to construct high-reward interaction trajectories to supervise the model, and (2) SIG-Augmented Policy Optimization, which integrates SIG and enhances RL with a novel SIG-guided Reward Distribution Mechanism that assigns higher rewards to informative questions for targeted optimization. Extensive experiments on two newly curated partial-information medical benchmarks demonstrate that ProMed significantly outperforms state-of-the-art methods by an average of 6.29% and delivers a 54.45% gain over the reactive paradigm, while also generalizing robustly to out-of-domain cases.
☆ LLM-Enhanced Linear Autoencoders for Recommendation CIKM 2025
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
comment: Accepted by CIKM 2025
☆ Structured Prompting and Multi-Agent Knowledge Distillation for Traffic Video Interpretation and Risk Inference
Comprehensive highway scene understanding and robust traffic risk inference are vital for advancing Intelligent Transportation Systems (ITS) and autonomous driving. Traditional approaches often struggle with scalability and generalization, particularly under the complex and dynamic conditions of real-world environments. To address these challenges, we introduce a novel structured prompting and knowledge distillation framework that enables automatic generation of high-quality traffic scene annotations and contextual risk assessments. Our framework orchestrates two large Vision-Language Models (VLMs): GPT-4o and o3-mini, using a structured Chain-of-Thought (CoT) strategy to produce rich, multi-perspective outputs. These outputs serve as knowledge-enriched pseudo-annotations for supervised fine-tuning of a much smaller student VLM. The resulting compact 3B-scale model, named VISTA (Vision for Intelligent Scene and Traffic Analysis), is capable of understanding low-resolution traffic videos and generating semantically faithful, risk-aware captions. Despite its significantly reduced parameter count, VISTA achieves strong performance across established captioning metrics (BLEU-4, METEOR, ROUGE-L, and CIDEr) when benchmarked against its teacher models. This demonstrates that effective knowledge distillation and structured multi-agent supervision can empower lightweight VLMs to capture complex reasoning capabilities. The compact architecture of VISTA facilitates efficient deployment on edge devices, enabling real-time risk monitoring without requiring extensive infrastructure upgrades.
comment: 16 pages, 10 figures, 1 table
☆ ALIGN: Word Association Learning for Cross-Cultural Generalization in Large Language Models
As large language models (LLMs) increasingly mediate cross-cultural communication, their behavior still reflects the distributional bias of the languages and viewpoints that are over-represented in their pre-training corpora. Yet, it remains a challenge to model and align culture due to limited cultural knowledge and a lack of exploration into effective learning approaches. We introduce a cost-efficient, cognitively grounded remedy: parameter-efficient fine-tuning on native speakers' free word-association norms, which encode implicit cultural schemas. Leveraging English-US and Mandarin associations from the Small-World-of-Words project, we adapt Llama-3.1-8B and Qwen-2.5-7B via supervised fine-tuning (SFT) and PPO-based preference optimization. SFT boosts held-out association Precision at 5 by 16-20% in English and 43-165% in Mandarin, lifts median concreteness by +0.20, and attains human-level valence and arousal. These lexical gains transfer: on World-Values-Survey questions, fine-tuned models shift answer distributions toward the target culture, and on a 50-item high-tension subset, Qwen's Chinese-aligned responses double while Llama's US bias drops by one-third. Our 7-8B models rival or beat vanilla 70B baselines, showing that a few million culture-grounded associations can instill value alignment without costly retraining. Our work highlights both the promise and the need for future research grounded in human cognition in improving cultural alignment in AI models.
☆ Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations--generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages FINe-grained Cross-model consistency to detect and mitigate Hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39\% compared to existing approaches. For mitigation, Finch-Zk achieves 7-8 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation across multiple models demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
☆ A Joint Multitask Model for Morpho-Syntactic Parsing
We present a joint multitask model for the UniDive 2025 Morpho-Syntactic Parsing shared task, where systems predict both morphological and syntactic analyses following novel UD annotation scheme. Our system uses a shared XLM-RoBERTa encoder with three specialized decoders for content word identification, dependency parsing, and morphosyntactic feature prediction. Our model achieves the best overall performance on the shared task's leaderboard covering nine typologically diverse languages, with an average MSLAS score of 78.7 percent, LAS of 80.1 percent, and Feats F1 of 90.3 percent. Our ablation studies show that matching the task's gold tokenization and content word identification are crucial to model performance. Error analysis reveals that our model struggles with core grammatical cases (particularly Nom-Acc) and nominal features across languages.
comment: 8 pages, SyntaxFest, UniDive 2025 Morpho-Syntactic Parsing shared task
☆ GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail on short prompt and/or long generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
☆ MultiFuzz: A Dense Retrieval-based Multi-Agent System for Network Protocol Fuzzing
Traditional protocol fuzzing techniques, such as those employed by AFL-based systems, often lack effectiveness due to a limited semantic understanding of complex protocol grammars and rigid seed mutation strategies. Recent works, such as ChatAFL, have integrated Large Language Models (LLMs) to guide protocol fuzzing and address these limitations, pushing protocol fuzzers to wider exploration of the protocol state space. But ChatAFL still faces issues like unreliable output, LLM hallucinations, and assumptions of LLM knowledge about protocol specifications. This paper introduces MultiFuzz, a novel dense retrieval-based multi-agent system designed to overcome these limitations by integrating semantic-aware context retrieval, specialized agents, and structured tool-assisted reasoning. MultiFuzz utilizes agentic chunks of protocol documentation (RFC Documents) to build embeddings in a vector database for a retrieval-augmented generation (RAG) pipeline, enabling agents to generate more reliable and structured outputs, enhancing the fuzzer in mutating protocol messages with enhanced state coverage and adherence to syntactic constraints. The framework decomposes the fuzzing process into modular groups of agents that collaborate through chain-of-thought reasoning to dynamically adapt fuzzing strategies based on the retrieved contextual knowledge. Experimental evaluations on the Real-Time Streaming Protocol (RTSP) demonstrate that MultiFuzz significantly improves branch coverage and explores deeper protocol states and transitions over state-of-the-art (SOTA) fuzzers such as NSFuzz, AFLNet, and ChatAFL. By combining dense retrieval, agentic coordination, and language model reasoning, MultiFuzz establishes a new paradigm in autonomous protocol fuzzing, offering a scalable and extensible foundation for future research in intelligent agentic-based fuzzing systems.
☆ Tokens with Meaning: A Hybrid Tokenization Approach for NLP
Tokenization plays a pivotal role in natural language processing (NLP), shaping how text is segmented and interpreted by language models. While subword methods such as Byte Pair Encoding (BPE) and WordPiece have been effective, they often struggle with morphologically rich and agglutinative languages because they rely on frequency rather than linguistic structure. We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation. The method uses phonological normalization, root-affix dictionaries, and a novel algorithm that balances morpheme preservation with vocabulary efficiency. It assigns shared identifiers to phonologically variant affixes (e.g., -ler and -lar) and altered root forms (e.g., kitap vs. kitab{\i}), reducing redundancy while maintaining semantic integrity. Special tokens are added for whitespace and case, including an UPPERCASE marker to avoid vocabulary inflation from capitalization. BPE is integrated for out-of-vocabulary coverage without harming morphological coherence. On the TR-MMLU benchmark, the tokenizer achieves the highest Turkish Token Percentage (90.29\%) and Pure Token Percentage (85.8\%). Comparisons with tokenizers from LLaMA, Gemma, and GPT show more linguistically meaningful and coherent tokens. Although demonstrated on Turkish, the approach is language-independent and adaptable to other languages, offering a practical path toward more interpretable and effective multilingual NLP systems.
☆ Measuring LLM Code Generation Stability via Structural Entropy
Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with abstract syntax tree (AST) analysis. For any fixed prompt, we collect the multiset of depth-bounded subtrees of AST in each generated program and treat their relative frequencies as a probability distribution. We then measure stability in two complementary ways: (i) Jensen-Shannon divergence, a symmetric, bounded indicator of structural overlap, and (ii) a Structural Cross-Entropy ratio that highlights missing high-probability patterns. Both metrics admit structural-only and token-aware variants, enabling separate views on control-flow shape and identifier-level variability. Unlike pass@k, BLEU, or CodeBLEU, our metrics are reference-free, language-agnostic, and execution-independent. We benchmark several leading LLMs on standard code generation tasks, demonstrating that AST-driven structural entropy reveals nuances in model consistency and robustness. The method runs in O(n,d) time with no external tests, providing a lightweight addition to the code-generation evaluation toolkit.
comment: ASE-NIER
☆ GRILE: A Benchmark for Grammar Reasoning and Explanation in Romanian LLMs
LLMs (Large language models) have revolutionized NLP (Natural Language Processing), yet their pedagogical value for low-resource languages remains unclear. We present GRILE (Grammar Romanian Inference and Language Explanations) , the first open benchmark of 1,151 multiple-choice questions harvested from Romanian high-stakes exams (National Evaluation, Baccalaureate, university admissions). GRILE enables us to probe two complementary abilities of seven state-of-the-art multilingual and Romanian-specific LLMs: (i) selecting the correct answer, and (ii) producing linguistically accurate explanations. While Gemini 2.5 Pro reaches 83% accuracy, most open-weight models stay below 65%, and 48% of their explanations contain factual or pedagogical flaws according to expert review. A detailed error analysis pinpoints systematic weaknesses in morphology and in applying the latest DOOM3 orthographic norms. All data, code and a public web demo are released to catalyze future research. Our findings expose open challenges for trustworthy educational NLP in low-resource settings and establish GRILE as a new test-bed for controllable explanation generation and evaluation.
comment: Accepted as long paper @RANLP2025
☆ Disentangling concept semantics via multilingual averaging in Sparse Autoencoders
Connecting LLMs with formal knowledge representation and reasoning is a promising approach to address their shortcomings. Embeddings and sparse autoencoders are widely used to represent textual content, but the semantics are entangled with syntactic and language-specific information. We propose a method that isolates concept semantics in Large Langue Models by averaging concept activations derived via Sparse Autoencoders. We create English text representations from OWL ontology classes, translate the English into French and Chinese and then pass these texts as prompts to the Gemma 2B LLM. Using the open source Gemma Scope suite of Sparse Autoencoders, we obtain concept activations for each class and language version. We average the different language activations to derive a conceptual average. We then correlate the conceptual averages with a ground truth mapping between ontology classes. Our results give a strong indication that the conceptual average aligns to the true relationship between classes when compared with a single language by itself. The result hints at a new technique which enables mechanistic interpretation of internal network states with higher accuracy.
☆ Let's Use ChatGPT To Write Our Paper! Benchmarking LLMs To Write the Introduction of a Research Paper
As researchers increasingly adopt LLMs as writing assistants, generating high-quality research paper introductions remains both challenging and essential. We introduce Scientific Introduction Generation (SciIG), a task that evaluates LLMs' ability to produce coherent introductions from titles, abstracts, and related works. Curating new datasets from NAACL 2025 and ICLR 2025 papers, we assess five state-of-the-art models, including both open-source (DeepSeek-v3, Gemma-3-12B, LLaMA 4-Maverick, MistralAI Small 3.1) and closed-source GPT-4o systems, across multiple dimensions: lexical overlap, semantic similarity, content coverage, faithfulness, consistency, citation correctness, and narrative quality. Our comprehensive framework combines automated metrics with LLM-as-a-judge evaluations. Results demonstrate LLaMA-4 Maverick's superior performance on most metrics, particularly in semantic similarity and faithfulness. Moreover, three-shot prompting consistently outperforms fewer-shot approaches. These findings provide practical insights into developing effective research writing assistants and set realistic expectations for LLM-assisted academic writing. To foster reproducibility and future research, we will publicly release all code and datasets.
comment: 20 pages, 15 figures
☆ Two Birds with One Stone: Multi-Task Detection and Attribution of LLM-Generated Text
Large Language Models (LLMs), such as GPT-4 and Llama, have demonstrated remarkable abilities in generating natural language. However, they also pose security and integrity challenges. Existing countermeasures primarily focus on distinguishing AI-generated content from human-written text, with most solutions tailored for English. Meanwhile, authorship attribution--determining which specific LLM produced a given text--has received comparatively little attention despite its importance in forensic analysis. In this paper, we present DA-MTL, a multi-task learning framework that simultaneously addresses both text detection and authorship attribution. We evaluate DA-MTL on nine datasets and four backbone models, demonstrating its strong performance across multiple languages and LLM sources. Our framework captures each task's unique characteristics and shares insights between them, which boosts performance in both tasks. Additionally, we conduct a thorough analysis of cross-modal and cross-lingual patterns and assess the framework's robustness against adversarial obfuscation techniques. Our findings offer valuable insights into LLM behavior and the generalization of both detection and authorship attribution.
comment: Securecomm 2025
☆ Comparing energy consumption and accuracy in text classification inference
The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that the best-performing model in terms of accuracy can also be energy-efficient, while larger LLMs tend to consume significantly more energy with lower classification accuracy. We observe substantial variability in inference energy consumption ($<$mWh to $>$kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. These findings have implications for sustainable AI development, providing actionable insights for researchers, industry practitioners, and policymakers seeking to balance performance and resource efficiency in NLP applications.
comment: Key results in Figure 1, submitted to Nature Communications, 25 pages
☆ DPad: Efficient Diffusion Language Models with Suffix Dropout
Diffusion-based Large Language Models (dLLMs) parallelize text generation by framing decoding as a denoising process, but suffer from high computational overhead since they predict all future suffix tokens at each step while retaining only a small fraction. We propose Diffusion Scratchpad (DPad), a training-free method that restricts attention to a small set of nearby suffix tokens, preserving fidelity while eliminating redundancy. DPad integrates two strategies: (i) a sliding window, which maintains a fixed-length suffix window, and (ii) distance-decay dropout, which deterministically removes distant suffix tokens before attention computation. This simple design is compatible with existing optimizations such as prefix caching and can be implemented with only a few lines of code. Comprehensive evaluations across multiple benchmarks on LLaDA-1.5 and Dream models demonstrate that DPad delivers up to $\mathbf{61.4\times}$ speedup over vanilla dLLMs while maintaining comparable accuracy, highlighting its potential for efficient and scalable long-sequence inference. Our code is available at https://github.com/Crys-Chen/DPad.
☆ MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation
With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models' ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose \textbf{MMReview}, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.
comment: Work in progress
♻ ☆ Hallucinations and Key Information Extraction in Medical Texts: A Comprehensive Assessment of Open-Source Large Language Models
Clinical summarization is crucial in healthcare as it distills complex medical data into digestible information, enhancing patient understanding and care management. Large language models (LLMs) have shown significant potential in automating and improving the accuracy of such summarizations due to their advanced natural language understanding capabilities. These models are particularly applicable in the context of summarizing medical/clinical texts, where precise and concise information transfer is essential. In this paper, we investigate the effectiveness of open-source LLMs in extracting key events from discharge reports, including admission reasons, major in-hospital events, and critical follow-up actions. In addition, we also assess the prevalence of various types of hallucinations in the summaries produced by these models. Detecting hallucinations is vital as it directly influences the reliability of the information, potentially affecting patient care and treatment outcomes. We conduct comprehensive simulations to rigorously evaluate the performance of these models, further probing the accuracy and fidelity of the extracted content in clinical summarization. Our results reveal that while the LLMs (e.g., Qwen2.5 and DeepSeek-v2) perform quite well in capturing admission reasons and hospitalization events, they are generally less consistent when it comes to identifying follow-up recommendations, highlighting broader challenges in leveraging LLMs for comprehensive summarization.
♻ ☆ Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration ICML 2025
Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution. The code is released at https://github.com/alickzhu/Soft-Reasoning.
comment: Accepted as a Spotlight at ICML 2025
♻ ☆ PlantDeBERTa: An Open Source Language Model for Plant Science
The rapid advancement of transformer-based language models has catalyzed breakthroughs in biomedical and clinical natural language processing; however, plant science remains markedly underserved by such domain-adapted tools. In this work, we present PlantDeBERTa, a high-performance, open-source language model specifically tailored for extracting structured knowledge from plant stress-response literature. Built upon the DeBERTa architecture-known for its disentangled attention and robust contextual encoding-PlantDeBERTa is fine-tuned on a meticulously curated corpus of expert-annotated abstracts, with a primary focus on lentil (Lens culinaris) responses to diverse abiotic and biotic stressors. Our methodology combines transformer-based modeling with rule-enhanced linguistic post-processing and ontology-grounded entity normalization, enabling PlantDeBERTa to capture biologically meaningful relationships with precision and semantic fidelity. The underlying corpus is annotated using a hierarchical schema aligned with the Crop Ontology, encompassing molecular, physiological, biochemical, and agronomic dimensions of plant adaptation. PlantDeBERTa exhibits strong generalization capabilities across entity types and demonstrates the feasibility of robust domain adaptation in low-resource scientific fields.By providing a scalable and reproducible framework for high-resolution entity recognition, PlantDeBERTa bridges a critical gap in agricultural NLP and paves the way for intelligent, data-driven systems in plant genomics, phenomics, and agronomic knowledge discovery. Our model is publicly released to promote transparency and accelerate cross-disciplinary innovation in computational plant science.
♻ ☆ CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description
Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation-Execution Description Language (EDL)-to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL, leveraging LLMs' strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks-SpiderUnion and BirdUnion-demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at https://github.com/smduan/CRED-SQL.git
♻ ☆ GoAI: Enhancing AI Students' Learning Paths and Idea Generation via Graph of AI Ideas
With the rapid advancement of artificial intelligence technology, AI students are confronted with a significant "information-to-innovation" gap: they must navigate through the rapidly expanding body of literature, trace the development of a specific research field, and synthesize various techniques into feasible innovative concepts. An additional critical step for students is to identify the necessary prerequisite knowledge and learning paths. Although many approaches based on large language models (LLMs) can summarize the content of papers and trace the development of a field through citations, these methods often overlook the prerequisite knowledge involved in the papers and the rich semantic information embedded in the citation relationships between papers. Such information reveals how methods are interrelated, built upon, extended, or challenged. To address these limitations, we propose GoAI, a tool for constructing educational knowledge graphs from AI research papers that leverages these graphs to plan personalized learning paths and support creative ideation. The nodes in the knowledge graph we have built include papers and the prerequisite knowledge, such as concepts, skills, and tools, that they involve; the edges record the semantic information of citations. When a student queries a specific paper, a beam search-based path search method can trace the current development trends of the field from the queried paper and plan a learning path toward cutting-edge objectives. The integrated Idea Studio guides students to clarify problem statements, compare alternative designs, and provide formative feedback on novelty, clarity, feasibility, and alignment with learning objectives.
comment: Work in progress
♻ ☆ "Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs
Recently released LLMs have strong multilingual \& multimodal capabilities. Model vulnerabilities are exposed using audits and red-teaming efforts. Existing efforts have focused primarily on the English language; thus, models continue to be susceptible to multilingual jailbreaking strategies, especially for multimodal contexts. In this study, we introduce a novel strategy that leverages code-mixing and phonetic perturbations to jailbreak LLMs for both text and image generation tasks. We also introduce \textit{two new} jailbreak strategies that show higher effectiveness than baselines. Our work presents a method to effectively bypass safety filters in LLMs while maintaining interpretability by applying phonetic misspellings to sensitive words in code-mixed prompts. We achieve a 99\% Attack Success Rate for text generation and 78\% for image generation, with Attack Relevance Rate of 100\% for text generation and 95\% for image generation for the phonetically perturbed code-mixed prompts. Our interpretability experiments reveal that phonetic perturbations impact word tokenization, leading to jailbreak success. Our study motivates increasing the focus towards more generalizable safety alignment for multilingual multimodal models, especially in real-world settings wherein prompts can have misspelt words. \textit{\textbf{Warning: This paper contains examples of potentially harmful and offensive content.}}
♻ ☆ Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule, prioritizing process-level ATR early and transitioning to outcome rewards, accelerating convergence on effective reasoning paths. Experiments on seven benchmarks show consistent improvements over the state-of-the-art. Key advantages include: (1) Atom-Searcher scales computation at test-time. (2) Atomic Thought provides supervision anchors for RRMs, bridging deep research tasks and RRMs. (3) Atom-Searcher exhibits more interpretable, human-like reasoning patterns.
♻ ☆ Development of Pre-Trained Transformer-based Models for the Nepali Language
Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.
♻ ☆ Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks
The Nepali language has distinct linguistic features, especially its complex script (Devanagari script), morphology, and various dialects,which pose a unique challenge for Natural Language Understanding (NLU) tasks. While the Nepali Language Understanding Evaluation (Nep-gLUE) benchmark provides a foundation for evaluating models, it remains limited in scope, covering four tasks. This restricts their utility for comprehensive assessments of Natural Language Processing (NLP) models. To address this limitation, we introduce twelve new datasets, creating a new benchmark, the Nepali /Language Understanding Evaluation (NLUE) benchmark for evaluating the performance of models across a diverse set of Natural Language Understanding (NLU) tasks. The added tasks include Single-Sentence Classification, Similarity and Paraphrase Tasks, Natural Language Inference (NLI), and General Masked Evaluation Task (GMET). Through extensive experiments, we demonstrate that existing top models struggle with the added complexity of these tasks. We also find that the best multilingual model outperforms the best monolingual models across most tasks, highlighting the need for more robust solutions tailored to the Nepali language. This expanded benchmark sets a new standard for evaluating, comparing, and advancing models, contributing significantly to the broader goal of advancing NLP research for low-resource languages.
♻ ☆ Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy
Relevance and utility are two frequently used measures to evaluate the effectiveness of an information retrieval (IR) system. Relevance emphasizes the aboutness of a result to a query, while utility refers to the result's usefulness or value to an information seeker. In Retrieval-Augmented Generation (RAG), high-utility results should be prioritized to feed to LLMs due to their limited input bandwidth. Re-examining RAG's three core components -- relevance ranking derived from retrieval models, utility judgments, and answer generation -- aligns with Schutz's philosophical system of relevances, which encompasses three types of relevance representing different levels of human cognition that enhance each other. These three RAG components also reflect three cognitive levels for LLMs in question-answering. Therefore, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in RAG. We conducted extensive experiments on retrieval (TREC DL, WebAP), utility judgment task (GTI-NQ), and factoid question-answering (NQ) datasets. Experimental results demonstrate significant improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.
comment: 22 pages
♻ ☆ LEGO-GraphRAG: Modularizing Graph-based Retrieval-Augmented Generation for Design Space Exploration VLDB'2025
GraphRAG integrates (knowledge) graphs with large language models (LLMs) to improve reasoning accuracy and contextual relevance. Despite its promising applications and strong relevance to multiple research communities, such as databases and natural language processing, GraphRAG currently lacks modular workflow analysis, systematic solution frameworks, and insightful empirical studies. To bridge these gaps, we propose LEGO-GraphRAG, a modular framework that enables: 1) fine-grained decomposition of the GraphRAG workflow, 2) systematic classification of existing techniques and implemented GraphRAG instances, and 3) creation of new GraphRAG instances. Our framework facilitates comprehensive empirical studies of GraphRAG on large-scale real-world graphs and diverse query sets, revealing insights into balancing reasoning quality, runtime efficiency, and token or GPU cost, that are essential for building advanced GraphRAG systems.
comment: VLDB'2025 [Experiment, Analysis & Benchmark]
♻ ☆ Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling
Dense retrieval is a crucial task in Information Retrieval (IR), serving as the basis for downstream tasks such as re-ranking and augmenting generation. Recently, large language models (LLMs) have demonstrated impressive semantic understanding capabilities, making them attractive to researchers focusing on dense retrieval. While LLMs, as decoder-style generative models, excel in language generation, they often fall short in modeling global information due to a lack of attention to subsequent tokens. Drawing inspiration from the classical word-based language modeling approach for IR, specifically the query likelihood (QL) model, we aim to leverage the generative strengths of LLMs through QL maximization. Rather than employing QL estimation for document ranking, we propose an auxiliary task of QL maximization to enhance the backbone for subsequent contrastive learning of the retriever. We introduce our model, LLM-QL, which incorporates two key components: Attention Block (AB) and Document Corruption (DC). AB blocks the attention of predictive tokens to the document tokens before the document's ending token, while DC corrupts a document by masking a portion of its tokens during prediction. Evaluations on the in-domain (MS MARCO) and out-of-domain dataset (BEIR) indicate LLM-QL's superiority over other LLM-based retrievers. Furthermore, comprehensive analyses also validate the efficacy of LLM-QL and its components.
comment: 12 pages, 3 figures
♻ ☆ Harnessing Structured Knowledge: A Concept Map-Based Approach for High-Quality Multiple Choice Question Generation with Effective Distractors ECAI 2025
Generating high-quality MCQs, especially those targeting diverse cognitive levels and incorporating common misconceptions into distractor design, is time-consuming and expertise-intensive, making manual creation impractical at scale. Current automated approaches typically generate questions at lower cognitive levels and fail to incorporate domain-specific misconceptions. This paper presents a hierarchical concept map-based framework that provides structured knowledge to guide LLMs in generating MCQs with distractors. We chose high-school physics as our test domain and began by developing a hierarchical concept map covering major Physics topics and their interconnections with an efficient database design. Next, through an automated pipeline, topic-relevant sections of these concept maps are retrieved to serve as a structured context for the LLM to generate questions and distractors that specifically target common misconceptions. Lastly, an automated validation is completed to ensure that the generated MCQs meet the requirements provided. We evaluate our framework against two baseline approaches: a base LLM and a RAG-based generation. We conducted expert evaluations and student assessments of the generated MCQs. Expert evaluation shows that our method significantly outperforms the baseline approaches, achieving a success rate of 75.20% in meeting all quality criteria compared to approximately 37% for both baseline methods. Student assessment data reveal that our concept map-driven approach achieved a significantly lower guess success rate of 28.05% compared to 37.10% for the baselines, indicating a more effective assessment of conceptual understanding. The results demonstrate that our concept map-based approach enables robust assessment across cognitive levels and instant identification of conceptual gaps, facilitating faster feedback loops and targeted interventions at scale.
comment: Accepted to ECAI 2025
♻ ☆ PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
Evaluating LLMs with a single prompt has proven unreliable, with small changes leading to significant performance differences. However, generating the prompt variations needed for a more robust multi-prompt evaluation is challenging, limiting its adoption in practice. To address this, we introduce PromptSuite, a framework that enables the automatic generation of various prompts. PromptSuite is flexible - working out of the box on a wide range of tasks and benchmarks. It follows a modular prompt design, allowing controlled perturbations to each component, and is extensible, supporting the addition of new components and perturbation types. Through a series of case studies, we show that PromptSuite provides meaningful variations to support strong evaluation practices. All resources, including the Python API, source code, user-friendly web interface, and demonstration video, are available at: https://eliyahabba.github.io/PromptSuite/.
comment: Eliya Habba and Noam Dahan contributed equally to this work
♻ ☆ MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graph
The rapid expansion of medical literature presents growing challenges for structuring and integrating domain knowledge at scale. Knowledge Graphs (KGs) offer a promising solution by enabling efficient retrieval, automated reasoning, and knowledge discovery. However, current KG construction methods often rely on supervised pipelines with limited generalizability or naively aggregate outputs from Large Language Models (LLMs), treating biomedical corpora as static and ignoring the temporal dynamics and contextual uncertainty of evolving knowledge. To address these limitations, we introduce MedKGent, a LLM agent framework for constructing temporally evolving medical KGs. Leveraging over 10 million PubMed abstracts published between 1975 and 2023, we simulate the emergence of biomedical knowledge via a fine-grained daily time series. MedKGent incrementally builds the KG in a day-by-day manner using two specialized agents powered by the Qwen2.5-32B-Instruct model. The Extractor Agent identifies knowledge triples and assigns confidence scores via sampling-based estimation, which are used to filter low-confidence extractions and inform downstream processing. The Constructor Agent incrementally integrates the retained triples into a temporally evolving graph, guided by confidence scores and timestamps to reinforce recurring knowledge and resolve conflicts. The resulting KG contains 156,275 entities and 2,971,384 relational triples. Quality assessments by two SOTA LLMs and three domain experts demonstrate an accuracy approaching 90%, with strong inter-rater agreement. To evaluate downstream utility, we conduct RAG across seven medical question answering benchmarks using five leading LLMs, consistently observing significant improvements over non-augmented baselines. Case studies further demonstrate the KG's value in literature-based drug repurposing via confidence-aware causal inference.
♻ ☆ A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges
The advent of Large Language Models (LLMs) has significantly revolutionized web search. The emergence of LLM-based Search Agents marks a pivotal shift towards deeper, dynamic, autonomous information seeking. These agents can comprehend user intentions and environmental context and execute multi-turn retrieval with dynamic planning, extending search capabilities far beyond the web. Leading examples like OpenAI's Deep Research highlight their potential for deep information mining and real-world applications. This survey provides the first systematic analysis of search agents. We comprehensively analyze and categorize existing works from the perspectives of architecture, optimization, application, and evaluation, ultimately identifying critical open challenges and outlining promising future research directions in this rapidly evolving field. Our repository is available on https://github.com/YunjiaXi/Awesome-Search-Agent-Papers.
♻ ☆ SEA-LION: Southeast Asian Languages in One Network
Recently, Large Language Models (LLMs) have dominated much of the artificial intelligence scene with their ability to process and generate natural languages. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance across LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.
comment: We released our model at https://huggingface.co/collections/aisingapore/sea-lionv3-672589a39cdadd6a5b199581
♻ ☆ MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation ICCV
Large Language Models (LLMs), known for their versatility in textual data, are increasingly being explored for their potential to enhance medical image segmentation, a crucial task for accurate diagnostic imaging. This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance across various medical imaging modalities. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index. These results demonstrate the effectiveness of LLM-based transformers in refining medical image segmentation, highlighting their potential to significantly boost model accuracy and robustness. The source code and our implementation are available at: https://github.com/AS-Lab/Marthi-et-al-2025-MedVisionLlama-Pre-Trained-LLM-Layers-to-Enhance-Medical-Image-Segmentation
comment: Accepted to the CVAMD Workshop (Computer Vision for Automated Medical Diagnosis) at the 2025 IEEE/CVF International Conference on Computer Vision (ICCVW 2025)
♻ ☆ Legal$Δ$: Enhancing Legal Reasoning in LLMs via Reinforcement Learning with Chain-of-Thought Guided Information Gain
Legal Artificial Intelligence (LegalAI) has achieved notable advances in automating judicial decision-making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast-thinking behavior by producing direct answers without explicit multi-step reasoning, limiting their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose Legal$\Delta$, a reinforcement learning framework designed to enhance legal reasoning through chain-of-thought guided information gain. During training, Legal$\Delta$ employs a dual-mode input setup-comprising direct answer and reasoning-augmented modes-and maximizes the information gain between them. This encourages the model to acquire meaningful reasoning patterns rather than generating superficial or redundant explanations. Legal$\Delta$ follows a two-stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek-R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal-domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that Legal$\Delta$ outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at https://github.com/NEUIR/LegalDelta.
♻ ☆ PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models
Recent advances in masked diffusion models (MDMs) have established them as powerful non-autoregressive alternatives for sequence generation. Nevertheless, our preliminary experiments reveal that the generation quality of MDMs is still highly sensitive to the choice of decoding strategy. In particular, widely adopted uncertainty-based samplers suffer from two key limitations: a lack of global trajectory control and a pronounced bias toward trivial tokens in the early stages of decoding. These shortcomings restrict the full potential of MDMs. In this work, we introduce Position-Aware Confidence-Calibrated Sampling (PC-Sampler), a novel decoding strategy that unifies global trajectory planning with content-aware informativeness maximization. PC-Sampler incorporates a position-aware weighting mechanism to regulate the decoding path and a calibrated confidence score to suppress the premature selection of trivial tokens. Extensive experiments on three advanced MDMs across seven challenging benchmarks-including logical reasoning and planning tasks-demonstrate that PC-Sampler consistently outperforms existing MDM decoding strategies by more than 10% on average, significantly narrowing the performance gap with state-of-the-art autoregressive models. All codes are available at https://github.com/NEUIR/PC-Sampler.
comment: 17 pages,13 figures
♻ ☆ Crossing Borders Without Crossing Boundaries: How Sociolinguistic Awareness Can Optimize User Engagement with Localized Spanish AI Models Across Hispanophone Countries
Large language models are, by definition, based on language. In an effort to underscore the critical need for regional localized models, this paper examines primary differences between variants of written Spanish across Latin America and Spain, with an in-depth sociocultural and linguistic contextualization therein. We argue that these differences effectively constitute significant gaps in the quotidian use of Spanish among dialectal groups by creating sociolinguistic dissonances, to the extent that locale-sensitive AI models would play a pivotal role in bridging these divides. In doing so, this approach informs better and more efficient localization strategies that also serve to more adequately meet inclusivity goals, while securing sustainable active daily user growth in a major low-risk investment geographic area. Therefore, implementing at least the proposed five sub variants of Spanish addresses two lines of action: to foment user trust and reliance on AI language models while also demonstrating a level of cultural, historical, and sociolinguistic awareness that reflects positively on any internationalization strategy.
♻ ☆ iTBLS: A Dataset of Interactive Conversations Over Tabular Information
This paper introduces Interactive Tables (iTBLS), a dataset of interactive conversations that focuses on natural-language manipulation of tabular information sourced from academic pre-prints on ArXiv. The iTBLS dataset consists of three types of tabular tasks -- interpretation, modification, and generation. Interpretation focuses on tabular understanding, modification focuses on manipulating tabular information, and generation focuses on the addition of new natural-language evidence. In addition, the paper presents a novel framework that reformulates tabular operations as question-answering, where an appropriate question is formulated based on the nature of interaction and the question is answered using the user request as evidence. The developed approach results in an improvement on all tasks on a sequence-to-sequence modeling baseline on iTBLS. In addition, the question-answering-based reformulation is applied to datasets from prior work for the text-to-table task where textual paragraphs are summarized into tables. The novel approach results in up to 13% improvement in Exact-Match accuracy and up to 16% improvement in BERTScores compared to the prior state-of-the-art.
comment: 15 pages, 4 figures
♻ ☆ A Study of the Framework and Real-World Applications of Language Embedding for 3D Scene Understanding
Gaussian Splatting has rapidly emerged as a transformative technique for real-time 3D scene representation, offering a highly efficient and expressive alternative to Neural Radiance Fields (NeRF). Its ability to render complex scenes with high fidelity has enabled progress across domains such as scene reconstruction, robotics, and interactive content creation. More recently, the integration of Large Language Models (LLMs) and language embeddings into Gaussian Splatting pipelines has opened new possibilities for text-conditioned generation, editing, and semantic scene understanding. Despite these advances, a comprehensive overview of this emerging intersection has been lacking. This survey presents a structured review of current research efforts that combine language guidance with 3D Gaussian Splatting, detailing theoretical foundations, integration strategies, and real-world use cases. We highlight key limitations such as computational bottlenecks, generalizability, and the scarcity of semantically annotated 3D Gaussian data and outline open challenges and future directions for advancing language-guided 3D scene understanding using Gaussian Splatting.
♻ ☆ Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale
A significant fraction of real-world patient information resides in unstructured clinical text. Medical abstraction extracts and normalizes key structured attributes from free-text clinical notes, which is the prerequisite for a variety of important downstream applications, including registry curation, clinical trial operations, and real-world evidence generation. Prior medical abstraction methods typically resort to building attribute-specific models, each of which requires extensive manual effort such as rule creation or supervised label annotation for the individual attribute, thus limiting scalability. In this paper, we show that existing frontier models already possess the universal abstraction capability for scaling medical abstraction to a wide range of clinical attributes. We present UniMedAbstractor (UMA), a unifying framework for zero-shot medical abstraction with a modular, customizable prompt template and the selection of any frontier large language models. Given a new attribute for abstraction, users only need to conduct lightweight prompt adaptation in UMA to adjust the specification in natural languages. Compared to traditional methods, UMA eliminates the need for attribute-specific training labels or handcrafted rules, thus substantially reducing the development time and cost. We conducted a comprehensive evaluation of UMA in oncology using a wide range of marquee attributes representing the cancer patient journey. These include relatively simple attributes typically specified within a single clinical note (e.g. performance status), as well as complex attributes requiring sophisticated reasoning across multiple notes at various time points (e.g. tumor staging). Based on a single frontier model such as GPT-4o, UMA matched or even exceeded the performance of state-of-the-art attribute-specific methods, each of which was tailored to the individual attribute.
♻ ☆ Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments
Large language models (LLMs) have been widely adopted in various downstream task domains. However, their abilities to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. Given the high-stakes nature of medical applications, where incorrect information can have critical consequences, it is essential to evaluate the factuality of LLMs to retain medical knowledge. To address this challenge, we introduce the Medical Knowledge Judgment Dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized biomedical vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements, enabling direct measurement of their knowledge retention capabilities. Our experiments reveal that LLMs have difficulty accurately recalling medical facts, with performances varying substantially across semantic types and showing notable weakness in uncommon medical conditions. Furthermore, LLMs show poor calibration, often being overconfident in incorrect answers. To mitigate these issues, we explore retrieval-augmented generation, demonstrating its effectiveness in improving factual accuracy and reducing uncertainty in medical decision-making.
comment: 15 pages, 11 figures
♻ ☆ ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs' fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5.
comment: 8 pages for the main paper
♻ ☆ Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks which are common in most real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, thus eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors emerge from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
♻ ☆ SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models NeurIPS 2024
Large language models (LLMs) have demonstrated remarkable capabilities, but their outputs can sometimes be unreliable or factually incorrect. To address this, we introduce Self Logits Evolution Decoding (SLED), a novel decoding framework that enhances the truthfulness of LLMs without relying on external knowledge bases or requiring further fine-tuning. From an optimization perspective, our SLED framework leverages the latent knowledge embedded within the LLM by contrasting the output logits from the final layer with those from early layers. It then utilizes an approximate gradient approach to enable latent knowledge to guide the self-refinement of outputs, thereby effectively improving factual accuracy. Extensive experiments have been conducted on established benchmarks across a diverse range of model families (Gemma, Qwen, Mixtral, gpt-oss) and scales (from 1B to 45B), including more advanced architectural configurations such as the mixture of experts (MoE). Our evaluation spans a wide variety of tasks and the results demonstrate that SLED consistently improves factual accuracy compared to existing decoding methods while maintaining natural language fluency and negligible latency overhead. Furthermore, it can be flexibly combined with other decoding methods to further enhance their performance.
comment: Accepted at NeurIPS 2024; project page is available at https://jayzhang42.github.io/sled_page/
♻ ☆ Customizing Speech Recognition Model with Large Language Model Feedback
Automatic speech recognition (ASR) systems have achieved strong performance on general transcription tasks. However, they continue to struggle with recognizing rare named entities and adapting to domain mismatches. In contrast, large language models (LLMs), trained on massive internet-scale datasets, are often more effective across a wide range of domains. In this work, we propose a reinforcement learning based approach for unsupervised domain adaptation, leveraging unlabeled data to enhance transcription quality, particularly the named entities affected by domain mismatch, through feedback from a LLM. Given contextual information, our framework employs a LLM as the reward model to score the hypotheses from the ASR model. These scores serve as reward signals to fine-tune the ASR model via reinforcement learning. Our method achieves a 21\% improvement on entity word error rate over conventional self-training methods.
♻ ☆ Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
In recent years, a substantial body of work in visually grounded natural language processing has focused on real-life multimodal scenarios such as describing content depicted in images or videos. However, comparatively less attention has been devoted to study the nature and degree of interaction between the different modalities in these scenarios. In this paper, we argue that any task dealing with natural language generation from sequences of images or frames is an instance of the broader, more general problem of modeling the intricate relationships between visual events unfolding over time and the features of the language used to interpret, describe, or narrate them. Therefore, solving these tasks requires models to be capable of identifying and managing such intricacies. We consider five seemingly different tasks, which we argue are compelling instances of this broader multimodal problem. Subsequently, we survey the modeling and evaluation approaches adopted for these tasks in recent years and examine the common set of challenges these tasks pose. Building on this perspective, we identify key open questions and propose several research directions for future investigation.
♻ ☆ LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
The growth of large language models underscores the need for parameter-efficient fine-tuning. Despite its popularity, LoRA encounters storage and computational challenges when deploying multiple task- or user-specific modules. To address this, we introduce LoRA-XS, a novel fine-tuning method backed by a theoretical derivation. LoRA-XS drastically reduces trainable parameters by incorporating a small, trainable weight matrix between frozen low-rank matrices derived from the Singular Value Decomposition of pre-trained weights. This design enables LoRA-XS to reduce storage requirements by over 100x in 7B models compared to LoRA. Additionally, unlike other methods, LoRA-XS imposes no lower bound on trainable parameters - it can scale from a single parameter per module to arbitrarily large values, adapting to any storage or computational constraint. Evaluations on GLUE, GSM8K, MATH, and commonsense reasoning benchmarks across different model scales reveal that LoRA-XS consistently outperforms or matches LoRA and VeRA in accuracy, offering unmatched parameter efficiency. Our ablation studies highlight the significance of singular vectors in transformer weights, establishing LoRA-XS as a powerful, storage-efficient solution for scaling and personalizing large language models.
♻ ☆ LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning
Large language models (LLMs) have shown remarkable abilities in logical reasoning, in-context learning, and code generation. However, translating natural language instructions into effective robotic control policies remains a significant challenge, especially for tasks requiring long-horizon planning and operating under sparse reward conditions. Hierarchical Reinforcement Learning (HRL) provides a natural framework to address this challenge in robotics; however, it typically suffers from non-stationarity caused by the changing behavior of the lower-level policy during training, destabilizing higher-level policy learning. We introduce LGR2, a novel HRL framework that leverages LLMs to generate language-guided reward functions for the higher-level policy. By decoupling high-level reward generation from low-level policy changes, LGR2 fundamentally mitigates the non-stationarity problem in off-policy HRL, enabling stable and efficient learning. To further enhance sample efficiency in sparse environments, we integrate goal-conditioned hindsight experience relabeling. Extensive experiments across simulated and real-world robotic navigation and manipulation tasks demonstrate LGR2 outperforms both hierarchical and non-hierarchical baselines, achieving over 55% success rates on challenging tasks and robust transfer to real robots, without additional fine-tuning.
♻ ☆ DeepRetro: Retrosynthetic Pathway Discovery using Iterative LLM Reasoning
The synthesis of complex natural products remains one of the grand challenges of organic chemistry. We present DeepRetro, a major advancement in computational retrosynthesis that enables the discovery of viable synthetic routes for complex molecules typically considered beyond the reach of existing retrosynthetic methods. DeepRetro is a novel, open-source framework that tightly integrates large language models (LLMs), traditional retrosynthetic engines, and expert human feedback in an iterative design loop. Prior approaches rely solely on template-based methods or unconstrained LLM outputs. In contrast, DeepRetro combines the precision of template-based methods with the generative flexibility of LLMs, controlled by rigorous chemical validity checks and enhanced by recursive refinement. This hybrid system dynamically explores and revises synthetic pathways, guided by both algorithmic checks and expert chemist feedback through an interactive user interface. While DeepRetro achieves strong performance on standard retrosynthesis benchmarks, its true strength lies in its ability to propose novel, viable pathways to highly complex natural products-targets that have historically eluded automated planning. Through detailed case studies, we illustrate how this approach enables new routes for total synthesis and facilitates human-machine collaboration in organic chemistry. Beyond retrosynthesis, DeepRetro represents a working model for how to leverage LLMs in scientific discovery. We provide a transparent account of the system's design, algorithms, and human-feedback loop, enabling broad adaptation across scientific domains. By releasing DeepRetro as an open-source tool, we aim to empower chemists to tackle increasingly ambitious synthetic targets, accelerating progress in drug discovery, materials design, and beyond.
comment: 64 pages,
♻ ☆ BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining
Recent advances in large language model (LLM) pretraining have shown that simply scaling data quantity eventually leads to diminishing returns, hitting a data wall. In response, the use of synthetic data for pretraining has emerged as a promising paradigm for pushing the frontier of performance. Despite this, the factors affecting synthetic data quality remain poorly understood. In this work, we introduce BeyondWeb, a synthetic data generation framework that produces high-quality synthetic data for pretraining. BeyondWeb significantly extends the capabilities of traditional web-scale datasets, outperforming state-of-the-art synthetic pretraining datasets such as Cosmopedia and Nemotron-CC's high-quality synthetic subset (Nemotron-Synth) by up to 5.1 percentage points (pp) and 2.6pp, respectively, when averaged across a suite of 14 benchmark evaluations. It delivers up to 7.7x faster training than open web data and 2.7x faster than Nemotron-Synth. Remarkably, a 3B model trained for 180B tokens on BeyondWeb outperforms an 8B model trained for the same token budget on Cosmopedia. We also present several insights from BeyondWeb on synthetic data for pretraining: what drives its benefits, which data to rephrase and how, and the impact of model size and family on data quality. Overall, our work shows that there's no silver bullet for generating high-quality synthetic pretraining data. The best outcomes require jointly optimizing many factors, a challenging task that requires rigorous science and practical expertise. Naive approaches can yield modest improvements, potentially at great cost, while well-executed methods can yield transformative improvements, as exemplified by BeyondWeb.
comment: Blog version can be viewed at: http://blog.datologyai.com/beyondweb
Information Retrieval 25
☆ Trust and Reputation in Data Sharing: A Survey
Data sharing is the fuel of the galloping artificial intelligence economy, providing diverse datasets for training robust models. Trust between data providers and data consumers is widely considered one of the most important factors for enabling data sharing initiatives. Concerns about data sensitivity, privacy breaches, and misuse contribute to reluctance in sharing data across various domains. In recent years, there has been a rise in technological and algorithmic solutions to measure, capture and manage trust, trustworthiness, and reputation in what we collectively refer to as Trust and Reputation Management Systems (TRMSs). Such approaches have been developed and applied to different domains of computer science, such as autonomous vehicles, or IoT networks, but there have not been dedicated approaches to data sharing and its unique characteristics. In this survey, we examine TRMSs from a data-sharing perspective, analyzing how they assess the trustworthiness of both data and entities across different environments. We develop novel taxonomies for system designs, trust evaluation framework, and evaluation metrics for both data and entity, and we systematically analyze the applicability of existing TRMSs in data sharing. Finally, we identify open challenges and propose future research directions to enhance the explainability, comprehensiveness, and accuracy of TRMSs in large-scale data-sharing ecosystems.
☆ Democratizing News Recommenders: Modeling Multiple Perspectives for News Candidate Generation with VQ-VAE
Current News Recommender Systems based on past clicks are designed for engagement, but come at the cost of limiting diversity in the suggested content. While diversity-aware algorithms exist, they suffer from two major limitations. First, they fail to account for normative diversity, which requires fair access to a broad range of perspectives. Second, they typically apply diversity late in the system's pipeline, after a lot of content has already been filtered out. Both limitations confine their effectiveness and prevent them from promoting true normative diversity in news recommendations. We propose Aspect-Aware Candidate Generation (A2CG) to address these limitations. Our framework introduces diversity into the earliest pipeline stage and uses a configurable mechanism to align diversity with specific democratic goals. A2CG represents each news article using multiple aspects of perspectives (e.g., sentiment, political leaning, frame) and uses a Vector Quantized Variational Autoencoder (VQ-VAE) to create a discrete, multi-faceted representation. A decoder-only model then learns user preferences over these aspect codes. We then inject diversity directly by reversing the sign on some of the query vector's aspects during the candidate retrieval process, ensuring a more diverse set of candidates. Our method, evaluated on the MIND dataset, enables a flexible trade-off between personalization and diversity early in the recommendation pipeline. It also generates more novel, diverse, and serendipitous candidates while effectively taking into account aspects that strengthen democratic values. These empirical results make it a promising approach for downstream democratized news recommendation systems.
☆ InPars+: Supercharging Synthetic Data Generation for Information Retrieval Systems
This work revisits and extends synthetic query generation pipelines for Neural Information Retrieval (NIR) by leveraging the InPars Toolkit, a reproducible, end-to-end framework for generating training data using large language models (LLMs). We first assess the reproducibility of the original InPars, InPars-V2, and Promptagator pipelines on the SciFact benchmark and validate their effectiveness using open-source reranker and generator models. Building on this foundation, we introduce two key extensions to the pipeline: (1) fine-tuning a query generator LLM via Contrastive Preference Optimization (CPO) to improve the signal quality in generated queries, and (2) replacing static prompt templates with dynamic, Chain-of-Thought (CoT) optimized prompts using the DSPy framework. Our results show that both extensions reduce the need for aggressive filtering while improving retrieval performance. All code, models, and synthetic datasets are publicly released to support further research at: \href{https://github.com/danilotpnta/IR2-project}{this https URL}.
☆ CARE: Contextual Adaptation of Recommenders for LLM-based Conversational Recommendation
We tackle the challenge of integrating large language models (LLMs) with external recommender systems to enhance domain expertise in conversational recommendation (CRS). Current LLM-based CRS approaches primarily rely on zero- or few-shot methods for generating item recommendations based on user queries, but this method faces two significant challenges: (1) without domain-specific adaptation, LLMs frequently recommend items not in the target item space, resulting in low recommendation accuracy; and (2) LLMs largely rely on dialogue context for content-based recommendations, neglecting the collaborative relationships among entities or item sequences. To address these limitations, we introduce the CARE (Contextual Adaptation of Recommenders) framework. CARE customizes LLMs for CRS tasks, and synergizes them with external recommendation systems. CARE (a) integrates external recommender systems as domain experts, producing recommendations through entity-level insights, and (b) enhances those recommendations by leveraging contextual information for more accurate and unbiased final recommendations using LLMs. Our results demonstrate that incorporating external recommender systems with entity-level information significantly enhances recommendation accuracy of LLM-based CRS by an average of 54% and 25% for ReDial and INSPIRED datasets. The most effective strategy in the CARE framework involves LLMs selecting and reranking candidate items that external recommenders provide based on contextual insights. Our analysis indicates that the CARE framework effectively addresses the identified challenges and mitigates the popularity bias in the external recommender.
☆ Bites of Tomorrow: Personalized Recommendations for a Healthier and Greener Plate
The recent emergence of extreme climate events has significantly raised awareness about sustainable living. In addition to developing energy-saving materials and technologies, existing research mainly relies on traditional methods that encourage behavioral shifts towards sustainability, which can be overly demanding or only passively engaging. In this work, we propose to employ recommendation systems to actively nudge users toward more sustainable choices. We introduce Green Recommender Aligned with Personalized Eating (GRAPE), which is designed to prioritize and recommend sustainable food options that align with users' evolving preferences. We also design two innovative Green Loss functions that cater to green indicators with either uniform or differentiated priorities, thereby enhancing adaptability across a range of scenarios. Extensive experiments on a real-world dataset demonstrate the effectiveness of our GRAPE.
☆ UniECS: Unified Multimodal E-Commerce Search Framework with Gated Cross-modal Fusion CIKM2025
Current e-commerce multimodal retrieval systems face two key limitations: they optimize for specific tasks with fixed modality pairings, and lack comprehensive benchmarks for evaluating unified retrieval approaches. To address these challenges, we introduce UniECS, a unified multimodal e-commerce search framework that handles all retrieval scenarios across image, text, and their combinations. Our work makes three key contributions. First, we propose a flexible architecture with a novel gated multimodal encoder that uses adaptive fusion mechanisms. This encoder integrates different modality representations while handling missing modalities. Second, we develop a comprehensive training strategy to optimize learning. It combines cross-modal alignment loss (CMAL), cohesive local alignment loss (CLAL), intra-modal contrastive loss (IMCL), and adaptive loss weighting. Third, we create M-BEER, a carefully curated multimodal benchmark containing 50K product pairs for e-commerce search evaluation. Extensive experiments demonstrate that UniECS consistently outperforms existing methods across four e-commerce benchmarks with fine-tuning or zero-shot evaluation. On our M-BEER bench, UniECS achieves substantial improvements in cross-modal tasks (up to 28\% gain in R@10 for text-to-image retrieval) while maintaining parameter efficiency (0.2B parameters) compared to larger models like GME-Qwen2VL (2B) and MM-Embed (8B). Furthermore, we deploy UniECS in the e-commerce search platform of Kuaishou Inc. across two search scenarios, achieving notable improvements in Click-Through Rate (+2.74\%) and Revenue (+8.33\%). The comprehensive evaluation demonstrates the effectiveness of our approach in both experimental and real-world settings. Corresponding codes, models and datasets will be made publicly available at https://github.com/qzp2018/UniECS.
comment: Accepted at CIKM2025 as a long paper
☆ Refining Contrastive Learning and Homography Relations for Multi-Modal Recommendation ACM MM 2025
Multi-modal recommender system focuses on utilizing rich modal information ( i.e., images and textual descriptions) of items to improve recommendation performance. The current methods have achieved remarkable success with the powerful structure modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography ( i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer two main limitations: 1) Simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) The lack of exploration of the homograph relations between user interests and item co-occurrence results in incomplete mining of user-item interplay. To address the above limitations, we propose a novel framework for \textbf{R}\textbf{E}fining multi-mod\textbf{A}l cont\textbf{R}astive learning and ho\textbf{M}ography relations (\textbf{REARM}). Specifically, we complement multi-modal contrastive learning by employing meta-network and orthogonal constraint strategies, which filter out noise in modal-shared features and retain recommendation-relevant information in modal-unique features. To mine homogeneous relationships effectively, we integrate a newly constructed user interest graph and an item co-occurrence graph with the existing user co-occurrence and item semantic graphs for graph learning. The extensive experiments on three real-world datasets demonstrate the superiority of REARM to various state-of-the-art baselines. Our visualization further shows an improvement made by REARM in distinguishing between modal-shared and modal-unique features. Code is available \href{https://github.com/MrShouxingMa/REARM}{here}.
comment: This paper has been accepted as a full paper at ACM MM 2025
☆ MUFFIN: Mixture of User-Adaptive Frequency Filtering for Sequential Recommendation CIKM 2025
Sequential recommendation (SR) aims to predict users' subsequent interactions by modeling their sequential behaviors. Recent studies have explored frequency domain analysis, which effectively models periodic patterns in user sequences. However, existing frequency-domain SR models still face two major drawbacks: (i) limited frequency band coverage, often missing critical behavioral patterns in a specific frequency range, and (ii) lack of personalized frequency filtering, as they apply an identical filter for all users regardless of their distinct frequency characteristics. To address these challenges, we propose a novel frequency-domain model, Mixture of User-adaptive Frequency FIlteriNg (MUFFIN), operating through two complementary modules. (i) The global filtering module (GFM) handles the entire frequency spectrum to capture comprehensive behavioral patterns. (ii) The local filtering module (LFM) selectively emphasizes important frequency bands without excluding information from other ranges. (iii) In both modules, the user-adaptive filter (UAF) is adopted to generate user-specific frequency filters tailored to individual unique characteristics. Finally, by aggregating both modules, MUFFIN captures diverse user behavioral patterns across the full frequency spectrum. Extensive experiments show that MUFFIN consistently outperforms state-of-the-art frequency-domain SR models over five benchmark datasets. The source code is available at https://github.com/ilwoong100/MUFFIN.
comment: Accepted by CIKM 2025
☆ Understanding Distribution Structure on Calibrated Recommendation Systems
Traditional recommender systems aim to generate a recommendation list comprising the most relevant or similar items to the user's profile. These approaches can create recommendation lists that omit item genres from the less prominent areas of a user's profile, thereby undermining the user's experience. To solve this problem, the calibrated recommendation system provides a guarantee of including less representative areas in the recommended list. The calibrated context works with three distributions. The first is from the user's profile, the second is from the candidate items, and the last is from the recommendation list. These distributions are G-dimensional, where G is the total number of genres in the system. This high dimensionality requires a different evaluation method, considering that traditional recommenders operate in a one-dimensional data space. In this sense, we implement fifteen models that help to understand how these distributions are structured. We evaluate the users' patterns in three datasets from the movie domain. The results indicate that the models of outlier detection provide a better understanding of the structures. The calibrated system creates recommendation lists that act similarly to traditional recommendation lists, allowing users to change their groups of preferences to the same degree.
☆ ENCODE: Breaking the Trade-Off Between Performance and Efficiency in Long-Term User Behavior Modeling
Long-term user behavior sequences are a goldmine for businesses to explore users' interests to improve Click-Through Rate. However, it is very challenging to accurately capture users' long-term interests from their long-term behavior sequences and give quick responses from the online serving systems. To meet such requirements, existing methods "inadvertently" destroy two basic requirements in long-term sequence modeling: R1) make full use of the entire sequence to keep the information as much as possible; R2) extract information from the most relevant behaviors to keep high relevance between learned interests and current target items. The performance of online serving systems is significantly affected by incomplete and inaccurate user interest information obtained by existing methods. To this end, we propose an efficient two-stage long-term sequence modeling approach, named as EfficieNt Clustering based twO-stage interest moDEling (ENCODE), consisting of offline extraction stage and online inference stage. It not only meets the aforementioned two basic requirements but also achieves a desirable balance between online service efficiency and precision. Specifically, in the offline extraction stage, ENCODE clusters the entire behavior sequence and extracts accurate interests. To reduce the overhead of the clustering process, we design a metric learning-based dimension reduction algorithm that preserves the relative pairwise distances of behaviors in the new feature space. While in the online inference stage, ENCODE takes the off-the-shelf user interests to predict the associations with target items. Besides, to further ensure the relevance between user interests and target items, we adopt the same relevance metric throughout the whole pipeline of ENCODE. The extensive experiment and comparison with SOTA have demonstrated the effectiveness and efficiency of our proposed ENCODE.
comment: Accepted to TKDE
☆ Heterogeneous Influence Maximization in User Recommendation CIKM 2025
User recommendation systems enhance user engagement by encouraging users to act as inviters to interact with other users (invitees), potentially fostering information propagation. Conventional recommendation methods typically focus on modeling interaction willingness. Influence-Maximization (IM) methods focus on identifying a set of users to maximize the information propagation. However, existing methods face two significant challenges. First, recommendation methods fail to unleash the candidates' spread capability. Second, IM methods fail to account for the willingness to interact. To solve these issues, we propose two models named HeteroIR and HeteroIM. HeteroIR provides an intuitive solution to unleash the dissemination potential of user recommendation systems. HeteroIM fills the gap between the IM method and the recommendation task, improving interaction willingness and maximizing spread coverage. The HeteroIR introduces a two-stage framework to estimate the spread profits. The HeteroIM incrementally selects the most influential invitee to recommend and rerank based on the number of reverse reachable (RR) sets containing inviters and invitees. RR set denotes a set of nodes that can reach a target via propagation. Extensive experiments show that HeteroIR and HeteroIM significantly outperform the state-of-the-art baselines with the p-value < 0.05. Furthermore, we have deployed HeteroIR and HeteroIM in Tencent's online gaming platforms and gained an 8.5\% and 10\% improvement in the online A/B test, respectively. Implementation codes are available at https://github.com/socialalgo/HIM.
comment: Accepted in CIKM 2025
☆ LLM-Enhanced Linear Autoencoders for Recommendation CIKM 2025
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
comment: Accepted by CIKM 2025
☆ AdaptJobRec: Enhancing Conversational Career Recommendation through an LLM-Powered Agentic System
In recent years, recommendation systems have evolved from providing a single list of recommendations to offering a comprehensive suite of topic focused services. To better accomplish this task, conversational recommendation systems (CRS) have progressed from basic retrieval augmented LLM generation to agentic systems with advanced reasoning and self correction capabilities. However, agentic systems come with notable response latency, a longstanding challenge for conversational recommendation systems. To balance the trade off between handling complex queries and minimizing latency, we propose AdaptJobRec, the first conversational job recommendation system that leverages autonomous agent to integrate personalized recommendation algorithm tools. The system employs a user query complexity identification mechanism to minimize response latency. For straightforward queries, the agent directly selects the appropriate tool for rapid responses. For complex queries, the agent uses the memory processing module to filter chat history for relevant content, then passes the results to the intelligent task decomposition planner, and finally executes the tasks using personalized recommendation tools. Evaluation on Walmart's real world career recommendation scenarios demonstrates that AdaptJobRec reduces average response latency by up to 53.3% compared to competitive baselines, while significantly improving recommendation accuracy.
☆ RewardRank: Optimizing True Learning-to-Rank Utility
Traditional ranking systems rely on proxy loss functions that assume simplistic user behavior, such as users preferring a rank list where items are sorted by hand-crafted relevance. However, real-world user interactions are influenced by complex behavioral biases, including position bias, brand affinity, decoy effects, and similarity aversion, which these objectives fail to capture. As a result, models trained on such losses often misalign with actual user utility, such as the probability of any click or purchase across the ranked list. In this work, we propose a data-driven framework for modeling user behavior through counterfactual reward learning. Our method, RewardRank, first trains a deep utility model to estimate user engagement for entire item permutations using logged data. Then, a ranking policy is optimized to maximize predicted utility via differentiable soft permutation operators, enabling end-to-end training over the space of factual and counterfactual rankings. To address the challenge of evaluation without ground-truth for unseen permutations, we introduce two automated protocols: (i) $\textit{KD-Eval}$, using a position-aware oracle for counterfactual reward estimation, and (ii) $\textit{LLM-Eval}$, which simulates user preferences via large language models. Experiments on large-scale benchmarks, including Baidu-ULTR and the Amazon KDD Cup datasets, demonstrate that our approach consistently outperforms strong baselines, highlighting the effectiveness of modeling user behavior dynamics for utility-optimized ranking. Our code is available at: https://github.com/GauravBh1010tt/RewardRank
☆ Unveiling Unicode's Unseen Underpinnings in Undermining Authorship Attribution
When using a public communication channel -- whether formal or informal, such as commenting or posting on social media -- end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes utmost precautions to anonymize their online presence -- using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting -- one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to validate their actual identity, right? Wrong. The content of their message -- necessarily open for public consumption -- exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss an antithetical counter-strategy in adversarial stylometry, and devise enhancements through Unicode steganography.
☆ Scalable Scientific Interest Profiling Using Large Language Models
Research profiles help surface scientists' expertise but are often outdated. We develop and evaluate two large language model-based methods to generate scientific interest profiles: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compare them with researchers' self-written profiles. We assembled titles, MeSH terms, and abstracts for 595 faculty at Columbia University Irving Medical Center; self-authored profiles were available for 167. Using GPT-4o-mini, we generated profiles and assessed them with automatic metrics and blinded human review. Lexical overlap with self-written profiles was low (ROUGE-L, BLEU, METEOR), while BERTScore indicated moderate semantic similarity (F1: 0.542 for MeSH-based; 0.555 for abstract-based). Paraphrased references yielded 0.851, highlighting metric sensitivity. TF-IDF Kullback-Leibler divergence (8.56 for MeSH-based; 8.58 for abstract-based) suggested distinct keyword choices. In manual review, 77.78 percent of MeSH-based profiles were rated good or excellent, readability was favored in 93.44 percent of cases, and panelists preferred MeSH-based over abstract-based profiles in 67.86 percent of comparisons. Overall, large language models can generate researcher profiles at scale; MeSH-derived profiles tend to be more readable than abstract-derived ones. Machine-generated and self-written profiles differ conceptually, with human summaries introducing more novel ideas.
♻ ☆ Modeling the Diachronic Evolution of Legal Norms: An LRMoo-Based, Component-Level Approach
Effectively representing the temporal evolution of legal norms at the component level is a critical challenge. While frameworks like IFLA LRMoo and standards like Akoma Ntoso provide generic toolkits, a dedicated pattern for granular versioning is needed to enable the deterministic point-in-time reconstruction of legal texts required by reliable AI applications. This paper proposes a temporal modeling pattern grounded in the LRMoo ontology that models a norm's evolution as a diachronic chain of F2 Expressions. We introduce a key distinction between a language-agnostic Temporal Version (TV) - a semantic snapshot of the norm's structure - and its concrete monolingual realizations, the Language Versions (LV). Both are modeled as F2 Expressions linked by the canonical R76 is derivative of property. The model applies this paradigm recursively, representing the legal text's internal structure as a parallel hierarchy of abstract Component Works (F1 Work) and their versioned Component Expressions (F2 Expression). Furthermore, we formalize the amendment process using the F28 Expression Creation event, allowing changes to be traced from a specific provision in an amending act to its precise effect on the amended norm. A case study on the Brazilian Federal Constitution demonstrates how this fine-grained, event-centric architecture enables the precise, deterministic retrieval and reconstruction of any part of a legal text at a specific date. The model provides a robust foundation for building verifiable knowledge graphs and advanced AI tools, overcoming the limitations of current generative models.
comment: Major revision. The model is now fully migrated from the superseded FRBRoo to the current IFLA LRMoo v1.0 standard, ensuring strict formal compliance. The event-centric and component-level modeling patterns have been significantly refined for greater precision and rigor
♻ ☆ What Matters for Bioacoustic Encoding
Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
♻ ☆ Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy
Relevance and utility are two frequently used measures to evaluate the effectiveness of an information retrieval (IR) system. Relevance emphasizes the aboutness of a result to a query, while utility refers to the result's usefulness or value to an information seeker. In Retrieval-Augmented Generation (RAG), high-utility results should be prioritized to feed to LLMs due to their limited input bandwidth. Re-examining RAG's three core components -- relevance ranking derived from retrieval models, utility judgments, and answer generation -- aligns with Schutz's philosophical system of relevances, which encompasses three types of relevance representing different levels of human cognition that enhance each other. These three RAG components also reflect three cognitive levels for LLMs in question-answering. Therefore, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in RAG. We conducted extensive experiments on retrieval (TREC DL, WebAP), utility judgment task (GTI-NQ), and factoid question-answering (NQ) datasets. Experimental results demonstrate significant improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.
comment: 22 pages
♻ ☆ Unleashing the Power of LLMs in Dense Retrieval with Query Likelihood Modeling
Dense retrieval is a crucial task in Information Retrieval (IR), serving as the basis for downstream tasks such as re-ranking and augmenting generation. Recently, large language models (LLMs) have demonstrated impressive semantic understanding capabilities, making them attractive to researchers focusing on dense retrieval. While LLMs, as decoder-style generative models, excel in language generation, they often fall short in modeling global information due to a lack of attention to subsequent tokens. Drawing inspiration from the classical word-based language modeling approach for IR, specifically the query likelihood (QL) model, we aim to leverage the generative strengths of LLMs through QL maximization. Rather than employing QL estimation for document ranking, we propose an auxiliary task of QL maximization to enhance the backbone for subsequent contrastive learning of the retriever. We introduce our model, LLM-QL, which incorporates two key components: Attention Block (AB) and Document Corruption (DC). AB blocks the attention of predictive tokens to the document tokens before the document's ending token, while DC corrupts a document by masking a portion of its tokens during prediction. Evaluations on the in-domain (MS MARCO) and out-of-domain dataset (BEIR) indicate LLM-QL's superiority over other LLM-based retrievers. Furthermore, comprehensive analyses also validate the efficacy of LLM-QL and its components.
comment: 12 pages, 3 figures
♻ ☆ A Survey of LLM-based Deep Search Agents: Paradigm, Optimization, Evaluation, and Challenges
The advent of Large Language Models (LLMs) has significantly revolutionized web search. The emergence of LLM-based Search Agents marks a pivotal shift towards deeper, dynamic, autonomous information seeking. These agents can comprehend user intentions and environmental context and execute multi-turn retrieval with dynamic planning, extending search capabilities far beyond the web. Leading examples like OpenAI's Deep Research highlight their potential for deep information mining and real-world applications. This survey provides the first systematic analysis of search agents. We comprehensively analyze and categorize existing works from the perspectives of architecture, optimization, application, and evaluation, ultimately identifying critical open challenges and outlining promising future research directions in this rapidly evolving field. Our repository is available on https://github.com/YunjiaXi/Awesome-Search-Agent-Papers.
♻ ☆ TFRank: Think-Free Reasoning Enables Practical Pointwise LLM Ranking
Reasoning-intensive ranking models built on Large Language Models (LLMs) have made notable progress, but existing approaches often rely on large-scale LLMs and explicit Chain-of-Thought (CoT) reasoning, resulting in high computational cost and latency that limit real-world use. To address this, we propose \textbf{TFRank}, an efficient pointwise reasoning ranker based on small-scale LLMs. To improve ranking performance, TFRank effectively integrates CoT data, fine-grained score supervision, and multi-task training. Furthermore, it achieves an efficient ``\textbf{T}hink-\textbf{F}ree" reasoning capability by employing a ``think-mode switch'' and pointwise format constraints. Specifically, this allows the model to leverage explicit reasoning during training while delivering precise relevance scores for complex queries at inference without generating any reasoning chains. Experiments show that TFRank (e.g., 1.7B) achieves performance comparable to models with four times more parameters on the BRIGHT benchmark, and demonstrates strong competitiveness on the BEIR benchmark. Further analysis shows that TFRank achieves an effective balance between performance and efficiency, providing a practical solution for integrating advanced reasoning into real-world systems. Our code and data are released in the repository: https://github.com/JOHNNY-fans/TFRank.
♻ ☆ Diagnostic-Guided Dynamic Profile Optimization for LLM-based User Simulators in Sequential Recommendation
Recent advances in large language models (LLMs) have enabled realistic user simulators for developing and evaluating recommender systems (RSs). However, existing LLM-based simulators for RSs face two major limitations: (1) static and single-step prompt-based inference that leads to inaccurate and incomplete user profile construction; (2) unrealistic and single-round recommendation-feedback interaction pattern that fails to capture real-world scenarios. To address these limitations, we propose DGDPO (Diagnostic-Guided Dynamic Profile Optimization), a novel framework that constructs user profile through a dynamic and iterative optimization process to enhance the simulation fidelity. Specifically, DGDPO incorporates two core modules within each optimization loop: firstly, a specialized LLM-based diagnostic module, calibrated through our novel training strategy, accurately identifies specific defects in the user profile. Subsequently, a generalized LLM-based treatment module analyzes the diagnosed defect and generates targeted suggestions to refine the profile. Furthermore, unlike existing LLM-based user simulators that are limited to single-round interactions, we are the first to integrate DGDPO with sequential recommenders, enabling a bidirectional evolution where user profiles and recommendation strategies adapt to each other over multi-round interactions. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness of our proposed framework.
♻ ☆ Reference-Aligned Retrieval-Augmented Question Answering over Heterogeneous Proprietary Documents CIKM 2025
Proprietary corporate documents contain rich domain-specific knowledge, but their overwhelming volume and disorganized structure make it difficult even for employees to access the right information when needed. For example, in the automotive industry, vehicle crash-collision tests, each costing hundreds of thousands of dollars, produce highly detailed documentation. However, retrieving relevant content during decision-making remains time-consuming due to the scale and complexity of the material. While Retrieval-Augmented Generation (RAG)-based Question Answering (QA) systems offer a promising solution, building an internal RAG-QA system poses several challenges: (1) handling heterogeneous multi-modal data sources, (2) preserving data confidentiality, and (3) enabling traceability between each piece of information in the generated answer and its original source document. To address these, we propose a RAG-QA framework for internal enterprise use, consisting of: (1) a data pipeline that converts raw multi-modal documents into a structured corpus and QA pairs, (2) a fully on-premise, privacy-preserving architecture, and (3) a lightweight reference matcher that links answer segments to supporting content. Applied to the automotive domain, our system improves factual correctness (+1.79, +1.94), informativeness (+1.33, +1.16), and helpfulness (+1.08, +1.67) over a non-RAG baseline, based on 1-5 scale ratings from both human and LLM judge.
comment: Accepted to CIKM 2025 Applied Research Track
♻ ☆ Reinforcement Learning to Rank Using Coarse-grained Rewards
Learning to rank (LTR) plays a crucial role in various Information Retrieval (IR) tasks. Although supervised LTR methods based on fine-grained relevance labels (e.g., document-level annotations) have achieved significant success, their reliance on costly and potentially biased annotations limits scalability and alignment with realistic goals. In contrast, coarse-grained feedback signals, such as duration time and session-level engagement, are more accessible and affordable. Reinforcement Learning (RL) offers a promising framework to directly optimize these objectives using reward signals, but most existing Reinforcement Learning to Rank (RLTR) approaches suffer from high variance and low sample efficiency. Motivated by recent advances in large language models (LLMs), we re-examine the problem of RLTR with coarse-grained rewards and propose new RLTR methods based on widely used RL algorithms for LLMs. We systematically compare supervised learning and RL-based methods across various model architectures and coarse-grained reward functions on large-scale LTR benchmarks. Experimental results demonstrate that advanced RL methods can directly learn from coarse-grained rewards and outperform strong supervised learning baselines even with fine-grained labels. This shows the great potential of RLTR for metric-agnostic ranking optimization.
Machine Learning 175
☆ Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
comment: Source code: https://github.com/HahmDY/prefix_injection_guard
☆ Learning from Preferences and Mixed Demonstrations in General Settings
Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are often ad-hoc, rely on domain-specific properties, or won't scale. We develop a new framing for learning from human data, \emph{reward-rational partial orderings over observations}, designed to be flexible and scalable. Based on this we introduce a practical algorithm, LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can learn from a broad range of data, including negative demonstrations, to efficiently learn reward functions across a wide range of domains. We find that when a limited amount of preference and demonstration feedback is available, LEOPARD outperforms existing baselines by a significant margin. Furthermore, we use LEOPARD to investigate learning from many types of feedback compared to just a single one, and find that combining feedback types is often beneficial.
☆ BLIPs: Bayesian Learned Interatomic Potentials
Machine Learning Interatomic Potentials (MLIPs) are becoming a central tool in simulation-based chemistry. However, like most deep learning models, MLIPs struggle to make accurate predictions on out-of-distribution data or when trained in a data-scarce regime, both common scenarios in simulation-based chemistry. Moreover, MLIPs do not provide uncertainty estimates by construction, which are fundamental to guide active learning pipelines and to ensure the accuracy of simulation results compared to quantum calculations. To address this shortcoming, we propose BLIPs: Bayesian Learned Interatomic Potentials. BLIP is a scalable, architecture-agnostic variational Bayesian framework for training or fine-tuning MLIPs, built on an adaptive version of Variational Dropout. BLIP delivers well-calibrated uncertainty estimates and minimal computational overhead for energy and forces prediction at inference time, while integrating seamlessly with (equivariant) message-passing architectures. Empirical results on simulation-based computational chemistry tasks demonstrate improved predictive accuracy with respect to standard MLIPs, and trustworthy uncertainty estimates, especially in data-scarse or heavy out-of-distribution regimes. Moreover, fine-tuning pretrained MLIPs with BLIP yields consistent performance gains and calibrated uncertainties.
☆ Efficient Knowledge Graph Unlearning with Zeroth-order Information
Due to regulations like the Right to be Forgotten, there is growing demand for removing training data and its influence from models. Since full retraining is costly, various machine unlearning methods have been proposed. In this paper, we firstly present an efficient knowledge graph (KG) unlearning algorithm. We remark that KG unlearning is nontrivial due to the distinctive structure of KG and the semantic relations between entities. Also, unlearning by estimating the influence of removed components incurs significant computational overhead when applied to large-scale knowledge graphs. To this end, we define an influence function for KG unlearning and propose to approximate the model's sensitivity without expensive computation of first-order and second-order derivatives for parameter updates. Specifically, we use Taylor expansion to estimate the parameter changes caused by data removal. Given that the first-order gradients and second-order derivatives dominate the computational load, we use the Fisher matrices and zeroth-order optimization to approximate the inverse-Hessian vector product without constructing the computational graphs. Our experimental results demonstrate that the proposed method outperforms other state-of-the-art graph unlearning baselines significantly in terms of unlearning efficiency and unlearning quality. Our code is released at https://github.com/NKUShaw/ZOWFKGIF.
comment: 9 page
☆ Typed Topological Structures Of Datasets
A datatset $X$ on $R^2$ is a finite topological space. Current research of a dataset focuses on statistical methods and the algebraic topological method \cite{carlsson}. In \cite{hu}, the concept of typed topological space was introduced and showed to have the potential for studying finite topological spaces, such as a dataset. It is a new method from the general topology perspective. A typed topological space is a topological space whose open sets are assigned types. Topological concepts and methods can be redefined using open sets of certain types. In this article, we develop a special set of types and its related typed topology on a dataset $X$. Using it, we can investigate the inner structure of $X$. In particular, $R^2$ has a natural quotient space, in which $X$ is organized into tracks, and each track is split into components. Those components are in a order. Further, they can be represented by an integer sequence. Components crossing tracks form branches, and the relationship can be well represented by a type of pseudotree (called typed-II pseudotree). Such structures provide a platform for new algorithms for problems such as calculating convex hull, holes, clustering and anomaly detection.
comment: 14 pages 5 figures
☆ ASDFormer: A Transformer with Mixtures of Pooling-Classifier Experts for Robust Autism Diagnosis and Biomarker Discovery
Autism Spectrum Disorder (ASD) is a complex neurodevelopmental condition marked by disruptions in brain connectivity. Functional MRI (fMRI) offers a non-invasive window into large-scale neural dynamics by measuring blood-oxygen-level-dependent (BOLD) signals across the brain. These signals can be modeled as interactions among Regions of Interest (ROIs), which are grouped into functional communities based on their underlying roles in brain function. Emerging evidence suggests that connectivity patterns within and between these communities are particularly sensitive to ASD-related alterations. Effectively capturing these patterns and identifying interactions that deviate from typical development is essential for improving ASD diagnosis and enabling biomarker discovery. In this work, we introduce ASDFormer, a Transformer-based architecture that incorporates a Mixture of Pooling-Classifier Experts (MoE) to capture neural signatures associated with ASD. By integrating multiple specialized expert branches with attention mechanisms, ASDFormer adaptively emphasizes different brain regions and connectivity patterns relevant to autism. This enables both improved classification performance and more interpretable identification of disorder-related biomarkers. Applied to the ABIDE dataset, ASDFormer achieves state-of-the-art diagnostic accuracy and reveals robust insights into functional connectivity disruptions linked to ASD, highlighting its potential as a tool for biomarker discovery.
☆ GDNSQ: Gradual Differentiable Noise Scale Quantization for Low-bit Neural Networks
Quantized neural networks can be viewed as a chain of noisy channels, where rounding in each layer reduces capacity as bit-width shrinks; the floating-point (FP) checkpoint sets the maximum input rate. We track capacity dynamics as the average bit-width decreases and identify resulting quantization bottlenecks by casting fine-tuning as a smooth, constrained optimization problem. Our approach employs a fully differentiable Straight-Through Estimator (STE) with learnable bit-width, noise scale and clamp bounds, and enforces a target bit-width via an exterior-point penalty; mild metric smoothing (via distillation) stabilizes training. Despite its simplicity, the method attains competitive accuracy down to the extreme W1A1 setting while retaining the efficiency of STE.
comment: 9 pages, 6 figures, 7 tables
☆ Machine Learning H-theorem
H-theorem provides a microscopic foundation of the Second Law of Thermodynamics and is therefore essential to establishing statistical physics, but at the same time, H-theorem has been subject to controversy that in part persists till this day. To better understand H-theorem and its relation to the arrow of time, we study the equilibration of randomly oriented and positioned hard disks with periodic boundary conditions. Using a model based on the DeepSets architecture, which imposes permutation invariance of the particle labels, we train a model to capture the irreversibility of the H-functional.
☆ Formal Algorithms for Model Efficiency
We introduce the Knob-Meter-Rule (KMR) framework, a unified formalism for representing and reasoning about model efficiency techniques in deep learning. By abstracting diverse methods, including pruning, quantization, knowledge distillation, and parameter-efficient architectures, into a consistent set of controllable knobs, deterministic rules, and measurable meters, KMR provides a mathematically precise and modular perspective on efficiency optimization. The framework enables systematic composition of multiple techniques, flexible policy-driven application, and iterative budgeted optimization through the Budgeted-KMR algorithm. We demonstrate how well-known efficiency methods can be instantiated as KMR triples and present concise algorithmic templates for each. The framework highlights underlying relationships between methods, facilitates hybrid pipelines, and lays the foundation for future research in automated policy learning, dynamic adaptation, and theoretical analysis of cost-quality trade-offs. Overall, KMR offers both a conceptual and practical tool for unifying and advancing model efficiency research.
comment: 17 pages, 0 figures
☆ Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation
Generalization in embodied AI is hindered by the "seeing-to-doing gap," which stems from data scarcity and embodiment heterogeneity. To address this, we pioneer "pointing" as a unified, embodiment-agnostic intermediate representation, defining four core embodied pointing abilities that bridge high-level vision-language comprehension with low-level action primitives. We introduce Embodied-R1, a 3B Vision-Language Model (VLM) specifically designed for embodied reasoning and pointing. We use a wide range of embodied and general visual reasoning datasets as sources to construct a large-scale dataset, Embodied-Points-200K, which supports key embodied pointing capabilities. We then train Embodied-R1 using a two-stage Reinforced Fine-tuning (RFT) curriculum with a specialized multi-task reward design. Embodied-R1 achieves state-of-the-art performance on 11 embodied spatial and pointing benchmarks. Critically, it demonstrates robust zero-shot generalization by achieving a 56.2% success rate in the SIMPLEREnv and 87.5% across 8 real-world XArm tasks without any task-specific fine-tuning, representing a 62% improvement over strong baselines. Furthermore, the model exhibits high robustness against diverse visual disturbances. Our work shows that a pointing-centric representation, combined with an RFT training paradigm, offers an effective and generalizable pathway to closing the perception-action gap in robotics.
comment: Embodied-R1 technical report
☆ Uncertainty-Aware PCA for Arbitrarily Distributed Data Modeled by Gaussian Mixture Models
Multidimensional data is often associated with uncertainties that are not well-described by normal distributions. In this work, we describe how such distributions can be projected to a low-dimensional space using uncertainty-aware principal component analysis (UAPCA). We propose to model multidimensional distributions using Gaussian mixture models (GMMs) and derive the projection from a general formulation that allows projecting arbitrary probability density functions. The low-dimensional projections of the densities exhibit more details about the distributions and represent them more faithfully compared to UAPCA mappings. Further, we support including user-defined weights between the different distributions, which allows for varying the importance of the multidimensional distributions. We evaluate our approach by comparing the distributions in low-dimensional space obtained by our method and UAPCA to those obtained by sample-based projections.
comment: 10 pages, 6 figures
☆ Multi-User Contextual Cascading Bandits for Personalized Recommendation
We introduce a Multi-User Contextual Cascading Bandit model, a new combinatorial bandit framework that captures realistic online advertising scenarios where multiple users interact with sequentially displayed items simultaneously. Unlike classical contextual bandits, MCCB integrates three key structural elements: (i) cascading feedback based on sequential arm exposure, (ii) parallel context sessions enabling selective exploration, and (iii) heterogeneous arm-level rewards. We first propose Upper Confidence Bound with Backward Planning (UCBBP), a UCB-style algorithm tailored to this setting, and prove that it achieves a regret bound of $\widetilde{O}(\sqrt{THN})$ over $T$ episodes, $H$ session steps, and $N$ contexts per episode. Motivated by the fact that many users interact with the system simultaneously, we introduce a second algorithm, termed Active Upper Confidence Bound with Backward Planning (AUCBBP), which shows a strict efficiency improvement in context scaling, i.e., user scaling, with a regret bound of $\widetilde{O}(\sqrt{T+HN})$. We validate our theoretical findings via numerical experiments, demonstrating the empirical effectiveness of both algorithms under various settings.
comment: 35 pages, 5 figures
☆ AutoScale: Linear Scalarization Guided by Multi-Task Optimization Metrics
Recent multi-task learning studies suggest that linear scalarization, when using well-chosen fixed task weights, can achieve comparable to or even better performance than complex multi-task optimization (MTO) methods. It remains unclear why certain weights yield optimal performance and how to determine these weights without relying on exhaustive hyperparameter search. This paper establishes a direct connection between linear scalarization and MTO methods, revealing through extensive experiments that well-performing scalarization weights exhibit specific trends in key MTO metrics, such as high gradient magnitude similarity. Building on this insight, we introduce AutoScale, a simple yet effective two-phase framework that uses these MTO metrics to guide weight selection for linear scalarization, without expensive weight search. AutoScale consistently shows superior performance with high efficiency across diverse datasets including a new large-scale benchmark.
comment: The first two authors hold equal contribution. 10 pages, 6 figures
☆ A PC Algorithm for Max-Linear Bayesian Networks
Max-linear Bayesian networks (MLBNs) are a relatively recent class of structural equation models which arise when the random variables involved have heavy-tailed distributions. Unlike most directed graphical models, MLBNs are typically not faithful to d-separation and thus classical causal discovery algorithms such as the PC algorithm or greedy equivalence search can not be used to accurately recover the true graph structure. In this paper, we begin the study of constraint-based discovery algorithms for MLBNs given an oracle for testing conditional independence in the true, unknown graph. We show that if the oracle is given by the $\ast$-separation criteria in the true graph, then the PC algorithm remains consistent despite the presence of additional CI statements implied by $\ast$-separation. We also introduce a new causal discovery algorithm named "PCstar" which assumes faithfulness to $C^\ast$-separation and is able to orient additional edges which cannot be oriented with only d- or $\ast$-separation.
comment: 24 pages, 7 figures, 1 table
☆ Convergent Reinforcement Learning Algorithms for Stochastic Shortest Path Problem
In this paper we propose two algorithms in the tabular setting and an algorithm for the function approximation setting for the Stochastic Shortest Path (SSP) problem. SSP problems form an important class of problems in Reinforcement Learning (RL), as other types of cost-criteria in RL can be formulated in the setting of SSP. We show asymptotic almost-sure convergence for all our algorithms. We observe superior performance of our tabular algorithms compared to other well-known convergent RL algorithms. We further observe reliable performance of our function approximation algorithm compared to other algorithms in the function approximation setting.
☆ How Usable is Automated Feature Engineering for Tabular Data?
Tabular data, consisting of rows and columns, is omnipresent across various machine learning applications. Each column represents a feature, and features can be combined or transformed to create new, more informative features. Such feature engineering is essential to achieve peak performance in machine learning. Since manual feature engineering is expensive and time-consuming, a substantial effort has been put into automating it. Yet, existing automated feature engineering (AutoFE) methods have never been investigated regarding their usability for practitioners. Thus, we investigated 53 AutoFE methods. We found that these methods are, in general, hard to use, lack documentation, and have no active communities. Furthermore, no method allows users to set time and memory constraints, which we see as a necessity for usable automation. Our survey highlights the need for future work on usable, well-engineered AutoFE methods.
comment: Accepted as a short paper at the non-archival content track of AutoML 2025
☆ Categorical Policies: Multimodal Policy Learning and Exploration in Continuous Control
A policy in deep reinforcement learning (RL), either deterministic or stochastic, is commonly parameterized as a Gaussian distribution alone, limiting the learned behavior to be unimodal. However, the nature of many practical decision-making problems favors a multimodal policy that facilitates robust exploration of the environment and thus to address learning challenges arising from sparse rewards, complex dynamics, or the need for strategic adaptation to varying contexts. This issue is exacerbated in continuous control domains where exploration usually takes place in the vicinity of the predicted optimal action, either through an additive Gaussian noise or the sampling process of a stochastic policy. In this paper, we introduce Categorical Policies to model multimodal behavior modes with an intermediate categorical distribution, and then generate output action that is conditioned on the sampled mode. We explore two sampling schemes that ensure differentiable discrete latent structure while maintaining efficient gradient-based optimization. By utilizing a latent categorical distribution to select the behavior mode, our approach naturally expresses multimodality while remaining fully differentiable via the sampling tricks. We evaluate our multimodal policy on a set of DeepMind Control Suite environments, demonstrating that through better exploration, our learned policies converge faster and outperform standard Gaussian policies. Our results indicate that the Categorical distribution serves as a powerful tool for structured exploration and multimodal behavior representation in continuous control.
comment: 6 pages, 4 figures; Has been submitted and accepted at IEEE SMC, 2025
☆ Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow Management
Extreme weather events, intensified by climate change, increasingly challenge aging combined sewer systems, raising the risk of untreated wastewater overflow. Accurate forecasting of sewer overflow basin filling levels can provide actionable insights for early intervention, helping mitigating uncontrolled discharge. In recent years, AI-based forecasting methods have offered scalable alternatives to traditional physics-based models, but their reliance on cloud computing limits their reliability during communication outages. To address this, we propose an end-to-end forecasting framework that enables energy-efficient inference directly on edge devices. Our solution integrates lightweight Transformer and Long Short-Term Memory (LSTM) models, compressed via integer-only quantization for efficient on-device execution. Moreover, an automated hardware-aware deployment pipeline is used to search for optimal model configurations by jointly minimizing prediction error and energy consumption on an AMD Spartan-7 XC7S15 FPGA. Evaluated on real-world sewer data, the selected 8-bit Transformer model, trained on 24 hours of historical measurements, achieves high accuracy (MSE 0.0376) at an energy cost of 0.370 mJ per inference. In contrast, the optimal 8-bit LSTM model requires significantly less energy (0.009 mJ, over 40x lower) but yields 14.89% worse accuracy (MSE 0.0432) and much longer training time. This trade-off highlights the need to align model selection with deployment priorities, favoring LSTM for ultra-low energy consumption or Transformer for higher predictive accuracy. In general, our work enables local, energy-efficient forecasting, contributing to more resilient combined sewer systems. All code can be found in the GitHub Repository (https://github.com/tianheng-ling/EdgeOverflowForecast).
comment: 6 pages, 6 figures, 1 table, accepted by the 11th IEEE International Smart Cities Conference
☆ Revisiting Diffusion Q-Learning: From Iterative Denoising to One-Step Action Generation
The generative power of diffusion models (DMs) has recently enabled high-performing decision-making algorithms in offline reinforcement learning (RL), achieving state-of-the-art results across standard benchmarks. Among them, Diffusion Q-Learning (DQL) stands out as a leading method for its consistently strong performance. Nevertheless, DQL remains limited in practice due to its reliance on multi-step denoising for action generation during both training and inference. Although one-step denoising is desirable, simply applying it to DQL leads to a drastic performance drop. In this work, we revisit DQL and identify its core limitations. We then propose One-Step Flow Q-Learning (OFQL), a novel framework that enables efficient one-step action generation during both training and inference, without requiring auxiliary models, distillation, or multi-phase training. Specifically, OFQL reformulates DQL within the sample-efficient Flow Matching (FM) framework. While conventional FM induces curved generative trajectories that impede one-step generation, OFQL instead learns an average velocity field that facilitates direct, accurate action generation. Collectively, OFQL eliminates the need for multi-step sampling and recursive gradient updates in DQL, resulting in faster and more robust training and inference. Extensive experiments on the D4RL benchmark demonstrate that OFQL outperforms DQL and other diffusion-based baselines, while substantially reducing both training and inference time compared to DQL.
☆ Fisher-Orthogonal Projection Methods for Natural Gradient Descent with Large Batches
Modern GPUs are equipped with large amounts of high-bandwidth memory, enabling them to support mini-batch sizes of up to tens of thousands of training samples. However, most existing optimizers struggle to perform effectively at such a large batch size. As batch size increases, gradient noise decreases due to averaging over many samples, limiting the ability of first-order methods to escape sharp or suboptimal minima and reach the global minimum. Meanwhile, second-order methods like the natural gradient with Kronecker-Factored Approximate Curvature (KFAC) often require excessively high damping to remain stable at large batch sizes. This high damping effectively washes out the curvature information that gives these methods their advantage, reducing their performance to that of simple gradient descent. In this paper, we introduce Fisher-Orthogonal Projection (FOP), a novel technique that restores the effectiveness of the second-order method at very large batch sizes, enabling scalable training with improved generalization and faster convergence. FOP constructs a variance-aware update direction by leveraging gradients from two sub-batches, enhancing the average gradient with a component of the gradient difference that is orthogonal to the average under the Fisher-metric.
☆ Generalisation and benign over-fitting for linear regression onto random functional covariates
We study theoretical predictive performance of ridge and ridge-less least-squares regression when covariate vectors arise from evaluating $p$ random, means-square continuous functions over a latent metric space at $n$ random and unobserved locations, subject to additive noise. This leads us away from the standard assumption of i.i.d. data to a setting in which the $n$ covariate vectors are exchangeable but not independent in general. Under an assumption of independence across dimensions, $4$-th order moment, and other regularity conditions, we obtain probabilistic bounds on a notion of predictive excess risk adapted to our random functional covariate setting, making use of recent results of Barzilai and Shamir. We derive convergence rates in regimes where $p$ grows suitably fast relative to $n$, illustrating interplay between ingredients of the model in determining convergence behaviour and the role of additive covariate noise in benign-overfitting.
☆ A Comprehensive Re-Evaluation of Biometric Modality Properties in the Modern Era
The rapid advancement of authentication systems and their increasing reliance on biometrics for faster and more accurate user verification experience, highlight the critical need for a reliable framework to evaluate the suitability of biometric modalities for specific applications. Currently, the most widely known evaluation framework is a comparative table from 1998, which no longer adequately captures recent technological developments or emerging vulnerabilities in biometric systems. To address these challenges, this work revisits the evaluation of biometric modalities through an expert survey involving 24 biometric specialists. The findings indicate substantial shifts in property ratings across modalities. For example, face recognition, shows improved ratings due to technological progress, while fingerprint, shows decreased reliability because of emerging vulnerabilities and attacks. Further analysis of expert agreement levels across rated properties highlighted the consistency of the provided evaluations and ensured the reliability of the ratings. Finally, expert assessments are compared with dataset-level uncertainty across 55 biometric datasets, revealing strong alignment in most modalities and underscoring the importance of integrating empirical evidence with expert insight. Moreover, the identified expert disagreements reveal key open challenges and help guide future research toward resolving them.
☆ FedUP: Efficient Pruning-based Federated Unlearning for Model Poisoning Attacks
Federated Learning (FL) can be vulnerable to attacks, such as model poisoning, where adversaries send malicious local weights to compromise the global model. Federated Unlearning (FU) is emerging as a solution to address such vulnerabilities by selectively removing the influence of detected malicious contributors on the global model without complete retraining. However, unlike typical FU scenarios where clients are trusted and cooperative, applying FU with malicious and possibly colluding clients is challenging because their collaboration in unlearning their data cannot be assumed. This work presents FedUP, a lightweight FU algorithm designed to efficiently mitigate malicious clients' influence by pruning specific connections within the attacked model. Our approach achieves efficiency by relying only on clients' weights from the last training round before unlearning to identify which connections to inhibit. Isolating malicious influence is non-trivial due to overlapping updates from benign and malicious clients. FedUP addresses this by carefully selecting and zeroing the highest magnitude weights that diverge the most between the latest updates from benign and malicious clients while preserving benign information. FedUP is evaluated under a strong adversarial threat model, where up to 50%-1 of the clients could be malicious and have full knowledge of the aggregation process. We demonstrate the effectiveness, robustness, and efficiency of our solution through experiments across IID and Non-IID data, under label-flipping and backdoor attacks, and by comparing it with state-of-the-art (SOTA) FU solutions. In all scenarios, FedUP reduces malicious influence, lowering accuracy on malicious data to match that of a model retrained from scratch while preserving performance on benign data. FedUP achieves effective unlearning while consistently being faster and saving storage compared to the SOTA.
comment: 15 pages, 5 figures, 7 tables
☆ Online Conformal Selection with Accept-to-Reject Changes
Selecting a subset of promising candidates from a large pool is crucial across various scientific and real-world applications. Conformal selection offers a distribution-free and model-agnostic framework for candidate selection with uncertainty quantification. While effective in offline settings, its application to online scenarios, where data arrives sequentially, poses challenges. Notably, conformal selection permits the deselection of previously selected candidates, which is incompatible with applications requiring irreversible selection decisions. This limitation is particularly evident in resource-intensive sequential processes, such as drug discovery, where advancing a compound to subsequent stages renders reversal impractical. To address this issue, we extend conformal selection to an online Accept-to-Reject Changes (ARC) procedure: non-selected data points can be reconsidered for selection later, and once a candidate is selected, the decision is irreversible. Specifically, we propose a novel conformal selection method, Online Conformal Selection with Accept-to-Reject Changes (dubbed OCS-ARC), which incorporates online Benjamini-Hochberg procedure into the candidate selection process. We provide theoretical guarantees that OCS-ARC controls the false discovery rate (FDR) at or below the nominal level at any timestep under both i.i.d. and exchangeable data assumptions. Additionally, we theoretically show that our approach naturally extends to multivariate response settings. Extensive experiments on synthetic and real-world datasets demonstrate that OCS-ARC significantly improves selection power over the baseline while maintaining valid FDR control across all examined timesteps.
☆ One Shot vs. Iterative: Rethinking Pruning Strategies for Model Compression
Pruning is a core technique for compressing neural networks to improve computational efficiency. This process is typically approached in two ways: one-shot pruning, which involves a single pass of training and pruning, and iterative pruning, where pruning is performed over multiple cycles for potentially finer network refinement. Although iterative pruning has historically seen broader adoption, this preference is often assumed rather than rigorously tested. Our study presents one of the first systematic and comprehensive comparisons of these methods, providing rigorous definitions, benchmarking both across structured and unstructured settings, and applying different pruning criteria and modalities. We find that each method has specific advantages: one-shot pruning proves more effective at lower pruning ratios, while iterative pruning performs better at higher ratios. Building on these findings, we advocate for patience-based pruning and introduce a hybrid approach that can outperform traditional methods in certain scenarios, providing valuable insights for practitioners selecting a pruning strategy tailored to their goals and constraints. Source code is available at https://github.com/janumiko/pruning-benchmark.
☆ Smooth Flow Matching
Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis for functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from the MIMIC-IV patient electronic health records (EHR) longitudinal database. Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.
comment: 86 pages, 7 figures
☆ Disentangled Deep Smoothed Bootstrap for Fair Imbalanced Regression
Imbalanced distribution learning is a common and significant challenge in predictive modeling, often reducing the performance of standard algorithms. Although various approaches address this issue, most are tailored to classification problems, with a limited focus on regression. This paper introduces a novel method to improve learning on tabular data within the Imbalanced Regression (IR) framework, which is a critical problem. We propose using Variational Autoencoders (VAEs) to model and define a latent representation of data distributions. However, VAEs can be inefficient with imbalanced data like other standard approaches. To address this, we develop an innovative data generation method that combines a disentangled VAE with a Smoothed Bootstrap applied in the latent space. We evaluate the efficiency of this method through numerical comparisons with competitors on benchmark datasets for IR.
☆ Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
Urban tree biodiversity is critical for climate resilience, ecological stability, and livability in cities, yet most municipalities lack detailed knowledge of their canopies. Field-based inventories provide reliable estimates of Shannon and Simpson diversity but are costly and time-consuming, while supervised AI methods require labeled data that often fail to generalize across regions. We introduce an unsupervised clustering framework that integrates visual embeddings from street-level imagery with spatial planting patterns to estimate biodiversity without labels. Applied to eight North American cities, the method recovers genus-level diversity patterns with high fidelity, achieving low Wasserstein distances to ground truth for Shannon and Simpson indices and preserving spatial autocorrelation. This scalable, fine-grained approach enables biodiversity mapping in cities lacking detailed inventories and offers a pathway for continuous, low-cost monitoring to support equitable access to greenery and adaptive management of urban ecosystems.
comment: 26 pages, 7 figures, Nature Format
☆ Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias ECML
As AI systems increasingly rely on training data, assessing dataset trustworthiness has become critical, particularly for properties like fairness or bias that emerge at the dataset level. Prior work has used Subjective Logic to assess trustworthiness of individual data, but not to evaluate trustworthiness properties that emerge only at the level of the dataset as a whole. This paper introduces the first formal framework for assessing the trustworthiness of AI training datasets, enabling uncertainty-aware evaluations of global properties such as bias. Built on Subjective Logic, our approach supports trust propositions and quantifies uncertainty in scenarios where evidence is incomplete, distributed, and/or conflicting. We instantiate this framework on the trustworthiness property of bias, and we experimentally evaluate it based on a traffic sign recognition dataset. The results demonstrate that our method captures class imbalance and remains interpretable and robust in both centralized and federated contexts.
comment: Accepted at ECML PKDD Bias Workshop '25
☆ Reinforcement Learning-based Adaptive Path Selection for Programmable Networks
This work presents a proof-of-concept implementation of a distributed, in-network reinforcement learning (IN-RL) framework for adaptive path selection in programmable networks. By combining Stochastic Learning Automata (SLA) with real-time telemetry data collected via In-Band Network Telemetry (INT), the proposed system enables local, data-driven forwarding decisions that adapt dynamically to congestion conditions. The system is evaluated on a Mininet-based testbed using P4-programmable BMv2 switches, demonstrating how our SLA-based mechanism converges to effective path selections and adapts to shifting network conditions at line rate.
☆ Communication-Efficient Federated Learning with Adaptive Number of Participants
Rapid scaling of deep learning models has enabled performance gains across domains, yet it introduced several challenges. Federated Learning (FL) has emerged as a promising framework to address these concerns by enabling decentralized training. Nevertheless, communication efficiency remains a key bottleneck in FL, particularly under heterogeneous and dynamic client participation. Existing methods, such as FedAvg and FedProx, or other approaches, including client selection strategies, attempt to mitigate communication costs. However, the problem of choosing the number of clients in a training round remains extremely underexplored. We introduce Intelligent Selection of Participants (ISP), an adaptive mechanism that dynamically determines the optimal number of clients per round to enhance communication efficiency without compromising model accuracy. We validate the effectiveness of ISP across diverse setups, including vision transformers, real-world ECG classification, and training with gradient compression. Our results show consistent communication savings of up to 30\% without losing the final quality. Applying ISP to different real-world ECG classification setups highlighted the selection of the number of clients as a separate task of federated learning.
☆ PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting
Long-term time series forecasting (LTSF) is a fundamental task with wide-ranging applications. Although Transformer-based models have made significant breakthroughs in forecasting, their effectiveness for time series forecasting remains debatable. In this paper, we revisit the significance of self-attention and propose a simple yet effective mechanism, Periodic-Nested Group Attention, namely PENGUIN. Our approach highlights the importance of explicitly modeling periodic patterns and incorporating relative attention bias for effective time series modeling. To this end, we introduce a periodic-nested relative attention bias that captures periodic structures directly. To handle multiple coexisting periodicities (e.g., daily and weekly cycles), we design a grouped attention mechanism, where each group targets a specific periodicity using a multi-query attention mechanism. Extensive experiments across diverse benchmarks demonstrate that PENGUIN consistently outperforms both MLP-based and Transformer-based models.
☆ Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest problem a model can sample; Breadth-the number of instances consumed in a single iteration. We dissect the popular GRPO algorithm and reveal a systematic bias: the cumulative-advantage disproportionately weights samples with medium accuracy, while down-weighting the low-accuracy instances that are crucial for pushing reasoning boundaries. To rectify the depth neglect, we introduce Difficulty Adaptive Rollout Sampling (DARS), which re-weights hard problems through targeted multi-stage rollouts, thereby increasing the number of positive rollouts for hard problems. Empirically, naively enlarging rollout size only accelerates convergence and even hurts Pass@K. Our DARS, in contrast, delivers consistent Pass@K gains without extra inference cost at convergence. Just as we adaptively expanded the depth of exploration, we now ask whether aggressively scaling the breadth of training data can further amplify reasoning gains. To this end, we intensely scale batch size and replace PPO's mini-batch iterations with full-batch updates over multiple epochs. Increasing breadth significantly enhances Pass@1 performance. Large-breadth training sustains high token-level entropy, indicating continued exploration and reduced gradient noise. We further present DARS-B, which augments DARS with large breadth, and demonstrate simultaneous gains in Pass@K and Pass@1. The results confirm that breadth and adaptive exploration across depth operate as orthogonal dimensions in RLVR, which are key to unleashing the reasoning power of RLVR.
comment: 11pages, 9 figures
☆ Order Optimal Regret Bounds for Sharpe Ratio Optimization in the Bandit Setting
In this paper, we investigate the problem of sequential decision-making for Sharpe ratio (SR) maximization in a stochastic bandit setting. We focus on the Thompson Sampling (TS) algorithm, a Bayesian approach celebrated for its empirical performance and exploration efficiency, under the assumption of Gaussian rewards with unknown parameters. Unlike conventional bandit objectives focusing on maximizing cumulative reward, Sharpe ratio optimization instead introduces an inherent tradeoff between achieving high returns and controlling risk, demanding careful exploration of both mean and variance. Our theoretical contributions include a novel regret decomposition specifically designed for the Sharpe ratio, highlighting the role of information acquisition about the reward distribution in driving learning efficiency. Then, we establish fundamental performance limits for the proposed algorithm \texttt{SRTS} in terms of an upper bound on regret. We also derive the matching lower bound and show the order-optimality. Our results show that Thompson Sampling achieves logarithmic regret over time, with distribution-dependent factors capturing the difficulty of distinguishing arms based on risk-adjusted performance. Empirical simulations show that our algorithm significantly outperforms existing algorithms.
☆ DREAMS: Preserving both Local and Global Structure in Dimensionality Reduction
Dimensionality reduction techniques are widely used for visualizing high-dimensional data in two dimensions. Existing methods are typically designed to preserve either local (e.g. $t$-SNE, UMAP) or global (e.g. MDS, PCA) structure of the data, but none of the established methods can represent both aspects well. In this paper, we present DREAMS (Dimensionality Reduction Enhanced Across Multiple Scales), a method that combines the local structure preservation of $t$-SNE with the global structure preservation of PCA via a simple regularization term. Our approach generates a spectrum of embeddings between the locally well-structured $t$-SNE embedding and the globally well-structured PCA embedding, efficiently balancing both local and global structure preservation. We benchmark DREAMS across seven real-world datasets, including five from single-cell transcriptomics and one from population genetics, showcasing qualitatively and quantitatively its superior ability to preserve structure across multiple scales compared to previous approaches.
☆ Prediction is not Explanation: Revisiting the Explanatory Capacity of Mapping Embeddings ECAI 2025
Understanding what knowledge is implicitly encoded in deep learning models is essential for improving the interpretability of AI systems. This paper examines common methods to explain the knowledge encoded in word embeddings, which are core elements of large language models (LLMs). These methods typically involve mapping embeddings onto collections of human-interpretable semantic features, known as feature norms. Prior work assumes that accurately predicting these semantic features from the word embeddings implies that the embeddings contain the corresponding knowledge. We challenge this assumption by demonstrating that prediction accuracy alone does not reliably indicate genuine feature-based interpretability. We show that these methods can successfully predict even random information, concluding that the results are predominantly determined by an algorithmic upper bound rather than meaningful semantic representation in the word embeddings. Consequently, comparisons between datasets based solely on prediction performance do not reliably indicate which dataset is better captured by the word embeddings. Our analysis illustrates that such mappings primarily reflect geometric similarity within vector spaces rather than indicating the genuine emergence of semantic properties.
comment: 10 pages, 6 Figures. Published at ECAI 2025 in a version without the Appendix
☆ Trans-XFed: An Explainable Federated Learning for Supply Chain Credit Assessment
This paper proposes a Trans-XFed architecture that combines federated learning with explainable AI techniques for supply chain credit assessment. The proposed model aims to address several key challenges, including privacy, information silos, class imbalance, non-identically and independently distributed (Non-IID) data, and model interpretability in supply chain credit assessment. We introduce a performance-based client selection strategy (PBCS) to tackle class imbalance and Non-IID problems. This strategy achieves faster convergence by selecting clients with higher local F1 scores. The FedProx architecture, enhanced with homomorphic encryption, is used as the core model, and further incorporates a transformer encoder. The transformer encoder block provides insights into the learned features. Additionally, we employ the integrated gradient explainable AI technique to offer insights into decision-making. We demonstrate the effectiveness of Trans-XFed through experimental evaluations on real-world supply chain datasets. The obtained results show its ability to deliver accurate credit assessments compared to several baselines, while maintaining transparency and privacy.
comment: Accepted by FLTA 2025
☆ Optimizing Region of Interest Selection for Effective Embedding in Video Steganography Based on Genetic Algorithms
With the widespread use of the internet, there is an increasing need to ensure the security and privacy of transmitted data. This has led to an intensified focus on the study of video steganography, which is a technique that hides data within a video cover to avoid detection. The effectiveness of any steganography method depends on its ability to embed data without altering the original video quality while maintaining high efficiency. This paper proposes a new method to video steganography, which involves utilizing a Genetic Algorithm (GA) for identifying the Region of Interest (ROI) in the cover video. The ROI is the area in the video that is the most suitable for data embedding. The secret data is encrypted using the Advanced Encryption Standard (AES), which is a widely accepted encryption standard, before being embedded into the cover video, utilizing up to 10% of the cover video. This process ensures the security and confidentiality of the embedded data. The performance metrics for assessing the proposed method are the Peak Signal to Noise Ratio (PSNR) and the encoding and decoding time. The results show that the proposed method has a high embedding capacity and efficiency, with a PSNR ranging between 64 and 75 dBs, which indicates that the embedded data is almost indistinguishable from the original video. Additionally, the method can encode and decode data quickly, making it efficient for real time applications.
comment: 19 Pages, 7 Figures, 4 Tables
☆ Minimizing the Weighted Number of Tardy Jobs: Data-Driven Heuristic for Single-Machine Scheduling
Existing research on single-machine scheduling is largely focused on exact algorithms, which perform well on typical instances but can significantly deteriorate on certain regions of the problem space. In contrast, data-driven approaches provide strong and scalable performance when tailored to the structure of specific datasets. Leveraging this idea, we focus on a single-machine scheduling problem where each job is defined by its weight, duration, due date, and deadline, aiming to minimize the total weight of tardy jobs. We introduce a novel data-driven scheduling heuristic that combines machine learning with problem-specific characteristics, ensuring feasible solutions, which is a common challenge for ML-based algorithms. Experimental results demonstrate that our approach significantly outperforms the state-of-the-art in terms of optimality gap, number of optimal solutions, and adaptability across varied data scenarios, highlighting its flexibility for practical applications. In addition, we conduct a systematic exploration of ML models, addressing a common gap in similar studies by offering a detailed model selection process and providing insights into why the chosen model is the best fit.
comment: Manuscript submitted for review to Computers & Operations Research
☆ Know Me by My Pulse: Toward Practical Continuous Authentication on Wearable Devices via Wrist-Worn PPG NDSS
Biometric authentication using physiological signals offers a promising path toward secure and user-friendly access control in wearable devices. While electrocardiogram (ECG) signals have shown high discriminability, their intrusive sensing requirements and discontinuous acquisition limit practicality. Photoplethysmography (PPG), on the other hand, enables continuous, non-intrusive authentication with seamless integration into wrist-worn wearable devices. However, most prior work relies on high-frequency PPG (e.g., 75 - 500 Hz) and complex deep models, which incur significant energy and computational overhead, impeding deployment in power-constrained real-world systems. In this paper, we present the first real-world implementation and evaluation of a continuous authentication system on a smartwatch, We-Be Band, using low-frequency (25 Hz) multi-channel PPG signals. Our method employs a Bi-LSTM with attention mechanism to extract identity-specific features from short (4 s) windows of 4-channel PPG. Through extensive evaluations on both public datasets (PTTPPG) and our We-Be Dataset (26 subjects), we demonstrate strong classification performance with an average test accuracy of 88.11%, macro F1-score of 0.88, False Acceptance Rate (FAR) of 0.48%, False Rejection Rate (FRR) of 11.77%, and Equal Error Rate (EER) of 2.76%. Our 25 Hz system reduces sensor power consumption by 53% compared to 512 Hz and 19% compared to 128 Hz setups without compromising performance. We find that sampling at 25 Hz preserves authentication accuracy, whereas performance drops sharply at 20 Hz while offering only trivial additional power savings, underscoring 25 Hz as the practical lower bound. Additionally, we find that models trained exclusively on resting data fail under motion, while activity-diverse training improves robustness across physiological states.
comment: To be published in Network and Distributed System Security (NDSS) Symposium 2026
☆ ViExam: Are Vision Language Models Better than Humans on Vietnamese Multimodal Exam Questions?
Vision language models (VLMs) demonstrate remarkable capabilities on English multimodal tasks, but their performance on low-resource languages with genuinely multimodal educational content remains largely unexplored. In this work, we test how VLMs perform on Vietnamese educational assessments, investigating whether VLMs trained predominantly on English data can handle real-world cross-lingual multimodal reasoning. Our work presents the first comprehensive evaluation of VLM capabilities on multimodal Vietnamese exams through proposing ViExam, a benchmark containing 2,548 multimodal questions. We find that state-of-the-art VLMs achieve only 57.74% while open-source models achieve 27.70% mean accuracy across 7 academic domains, including Mathematics, Physics, Chemistry, Biology, Geography, Driving Test, and IQ Test. Most VLMs underperform average human test-takers (66.54%), with only the thinking VLM o3 (74.07%) exceeding human average performance, yet still falling substantially short of human best performance (99.60%). Cross-lingual prompting with English instructions while maintaining Vietnamese content fails to improve performance, decreasing accuracy by 1 percentage point for SOTA VLMs. Human-in-the-loop collaboration can partially improve VLM performance by 5 percentage points. Code and data are available at: https://vi-exam.github.io.
☆ Heavy-tailed Linear Bandits: Adversarial Robustness, Best-of-both-worlds, and Beyond
Heavy-tailed bandits have been extensively studied since the seminal work of \citet{Bubeck2012BanditsWH}. In particular, heavy-tailed linear bandits, enabling efficient learning with both a large number of arms and heavy-tailed noises, have recently attracted significant attention \citep{ShaoYKL18,XueWWZ20,ZhongHYW21,Wang2025heavy,tajdini2025improved}. However, prior studies focus almost exclusively on stochastic regimes, with few exceptions limited to the special case of heavy-tailed multi-armed bandits (MABs) \citep{Huang0H22,ChengZ024,Chen2024uniINF}. In this work, we propose a general framework for adversarial heavy-tailed bandit problems, which performs follow-the-regularized-leader (FTRL) over the loss estimates shifted by a bonus function. Via a delicate setup of the bonus function, we devise the first FTRL-type best-of-both-worlds (BOBW) algorithm for heavy-tailed MABs, which does not require the truncated non-negativity assumption and achieves an $\widetilde{O}(T^{\frac{1}{\varepsilon}})$ worst-case regret in the adversarial regime as well as an $\widetilde{O}(\log T)$ gap-dependent regret in the stochastic regime. We then extend our framework to the linear case, proposing the first algorithm for adversarial heavy-tailed linear bandits with finite arm sets. This algorithm achieves an $\widetilde{O}(d^{\frac{1}{2}}T^{\frac{1}{\varepsilon}})$ regret, matching the best-known worst-case regret bound in stochastic regimes. Moreover, we propose a general data-dependent learning rate, termed \textit{heavy-tailed noise aware stability-penalty matching} (HT-SPM). We prove that HT-SPM guarantees BOBW regret bounds for general heavy-tailed bandit problems once certain conditions are satisfied. By using HT-SPM and, in particular, a variance-reduced linear loss estimator, we obtain the first BOBW result for heavy-tailed linear bandits.
☆ Neuro-Symbolic Artificial Intelligence: Towards Improving the Reasoning Abilities of Large Language Models IJCAI 2025
Large Language Models (LLMs) have shown promising results across various tasks, yet their reasoning capabilities remain a fundamental challenge. Developing AI systems with strong reasoning capabilities is regarded as a crucial milestone in the pursuit of Artificial General Intelligence (AGI) and has garnered considerable attention from both academia and industry. Various techniques have been explored to enhance the reasoning capabilities of LLMs, with neuro-symbolic approaches being a particularly promising way. This paper comprehensively reviews recent developments in neuro-symbolic approaches for enhancing LLM reasoning. We first present a formalization of reasoning tasks and give a brief introduction to the neurosymbolic learning paradigm. Then, we discuss neuro-symbolic methods for improving the reasoning capabilities of LLMs from three perspectives: Symbolic->LLM, LLM->Symbolic, and LLM+Symbolic. Finally, we discuss several key challenges and promising future directions. We have also released a GitHub repository including papers and resources related to this survey: https://github.com/LAMDASZ-ML/Awesome-LLM-Reasoning-with-NeSy.
comment: 9 pages, 3 figures, IJCAI 2025 Survey Track
☆ Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints
Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order-logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We propose a Neural Query Reranker (NQR) designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. NQR operates interactively, refining answers based on incremental examples of preferred and non-preferred entities. We extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that NQR can capture soft constraints while maintaining robust query answering performance.
☆ MACTAS: Self-Attention-Based Module for Inter-Agent Communication in Multi-Agent Reinforcement Learning AAAI 2026
Communication is essential for the collective execution of complex tasks by human agents, motivating interest in communication mechanisms for multi-agent reinforcement learning (MARL). However, existing communication protocols in MARL are often complex and non-differentiable. In this work, we introduce a self-attention-based communication module that exchanges information between the agents in MARL. Our proposed approach is fully differentiable, allowing agents to learn to generate messages in a reward-driven manner. The module can be seamlessly integrated with any action-value function decomposition method and can be viewed as an extension of such decompositions. Notably, it includes a fixed number of trainable parameters, independent of the number of agents. Experimental results on the SMAC benchmark demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on several maps.
comment: Submitted for AAAI 2026
☆ In-Context Decision Making for Optimizing Complex AutoML Pipelines
Combined Algorithm Selection and Hyperparameter Optimization (CASH) has been fundamental to traditional AutoML systems. However, with the advancements of pre-trained models, modern ML workflows go beyond hyperparameter optimization and often require fine-tuning, ensembling, and other adaptation techniques. While the core challenge of identifying the best-performing model for a downstream task remains, the increasing heterogeneity of ML pipelines demands novel AutoML approaches. This work extends the CASH framework to select and adapt modern ML pipelines. We propose PS-PFN to efficiently explore and exploit adapting ML pipelines by extending Posterior Sampling (PS) to the max k-armed bandit problem setup. PS-PFN leverages prior-data fitted networks (PFNs) to efficiently estimate the posterior distribution of the maximal value via in-context learning. We show how to extend this method to consider varying costs of pulling arms and to use different PFNs to model reward distributions individually per arm. Experimental results on one novel and two existing standard benchmark tasks demonstrate the superior performance of PS-PFN compared to other bandit and AutoML strategies. We make our code and data available at https://github.com/amirbalef/CASHPlus.
☆ Input Time Scaling
Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data & training scaling) and doing reasoning in test time (inference time scaling). In this work, we present a new scaling paradigm, Input Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we combine meta-knowledge from LLMs to refine inputs with different strategies. We also find a new phenomenon, training-testing co-design there. We need to apply query strategies during both training and testing. Only applying strategies on training or testing would seriously degrade the performance. We are also surprised to find that seemingly low data quality datasets can gain high performance. Adding irrelevant information to the queries, randomly selecting examples from a minimally filtered dataset, can even perform the best. These findings contradict the widely held inductive bias, "garbage in, garbage out". Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data with similar quality (15k VS 1k) perform worse, simple dataset size scaling should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon. A small set of examples is enough to evoke high-level reasoning ability. With experiments on models trained on Qwen2.5-32B-Instruct, we are able to reach SOTA performance among 32B models on AIME24(76.7%) and AIME25(76.7%) pass@1. We can further achieve AIME24(76.7%) and AIME25(80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the best result would be 86.7% on AIME24 and 76.7% on AIME25. To facilitate reproducibility and further research, we are working on open-source our datasets, data pipelines, evaluation results, and checkpoints.
☆ GRAFT: Gradient-Aware Fast MaxVol Technique for Dynamic Data Sampling
Training modern neural networks on large datasets is computationally and environmentally costly. We introduce GRAFT, a scalable in-training subset selection method that (i) extracts a low-rank feature representation for each batch, (ii) applies a Fast MaxVol sampler to select a small, diverse subset that spans the batch's dominant subspace, and (iii) dynamically adjusts the subset size using a gradient-approximation criterion. By operating in low-rank subspaces and training on carefully chosen examples instead of full batches, GRAFT preserves the training trajectory while reducing wall-clock time, energy consumption, and $\mathrm{CO}_2$ emissions. Across multiple benchmarks, GRAFT matches or exceeds recent selection baselines in both accuracy and efficiency, providing a favorable trade-off between accuracy, efficiency, and emissions.
☆ Personalized Subgraph Federated Learning with Sheaf Collaboration ECAI 2025
Graph-structured data is prevalent in many applications. In subgraph federated learning (FL), this data is distributed across clients, each with a local subgraph. Personalized subgraph FL aims to develop a customized model for each client to handle diverse data distributions. However, performance variation across clients remains a key issue due to the heterogeneity of local subgraphs. To overcome the challenge, we propose FedSheafHN, a novel framework built on a sheaf collaboration mechanism to unify enhanced client descriptors with efficient personalized model generation. Specifically, FedSheafHN embeds each client's local subgraph into a server-constructed collaboration graph by leveraging graph-level embeddings and employing sheaf diffusion within the collaboration graph to enrich client representations. Subsequently, FedSheafHN generates customized client models via a server-optimized hypernetwork. Empirical evaluations demonstrate that FedSheafHN outperforms existing personalized subgraph FL methods on various graph datasets. Additionally, it exhibits fast model convergence and effectively generalizes to new clients.
comment: Full version of our ECAI 2025 accepted paper
☆ Explainable Learning Rate Regimes for Stochastic Optimization
Modern machine learning is trained by stochastic gradient descent (SGD), whose performance critically depends on how the learning rate (LR) is adjusted and decreased over time. Yet existing LR regimes may be intricate, or need to tune one or more additional hyper-parameters manually whose bottlenecks include huge computational expenditure, time and power in practice. This work, in a natural and direct manner, clarifies how LR should be updated automatically only according to the intrinsic variation of stochastic gradients. An explainable LR regime by leveraging stochastic second-order algorithms is developed, behaving a similar pattern to heuristic algorithms but implemented simply without any parameter tuning requirement, where it is of an automatic procedure that LR should increase (decrease) as the norm of stochastic gradients decreases (increases). The resulting LR regime shows its efficiency, robustness, and scalability in different classical stochastic algorithms, containing SGD, SGDM, and SIGNSGD, on machine learning tasks.
☆ Text2Weight: Bridging Natural Language and Neural Network Weight Spaces ACM MM 2025
How far are we really from automatically generating neural networks? While neural network weight generation shows promise, current approaches struggle with generalization to unseen tasks and practical application exploration. To address this, we propose T2W, a diffusion transformer framework that generates task-specific weights conditioned on natural language descriptions. T2W hierarchically processes network parameters into uniform blocks, integrates text embeddings from CLIP via a prior attention mechanism, and employs adversarial training with weight-space augmentation to enhance generalization. Experiments on Cifar100, Caltech256, and TinyImageNet demonstrate T2W's ability to produce high-quality weights for unseen tasks, outperforming optimization-based initialization and enabling novel applications such as weight enhancement and text-guided model fusion. Our work bridges textual semantics with weight-space dynamics, supported by an open-source dataset of text-weight pairs, advancing the practicality of generative models in neural network parameter synthesis. Our code is available on Github.
comment: Accepted By ACM MM 2025 Main Track
☆ Towards a Larger Model via One-Shot Federated Learning on Heterogeneous Client Models
Large models, renowned for superior performance, outperform smaller ones even without billion-parameter scales. While mobile network servers have ample computational resources to support larger models than client devices, privacy constraints prevent clients from directly sharing their raw data. Federated Learning (FL) enables decentralized clients to collaboratively train a shared model by exchanging model parameters instead of transmitting raw data. Yet, it requires a uniform model architecture and multiple communication rounds, which neglect resource heterogeneity, impose heavy computational demands on clients, and increase communication overhead. To address these challenges, we propose FedOL, to construct a larger and more comprehensive server model in one-shot settings (i.e., in a single communication round). Instead of model parameter sharing, FedOL employs knowledge distillation, where clients only exchange model prediction outputs on an unlabeled public dataset. This reduces communication overhead by transmitting compact predictions instead of full model weights and enables model customization by allowing heterogeneous model architectures. A key challenge in this setting is that client predictions may be biased due to skewed local data distributions, and the lack of ground-truth labels in the public dataset further complicates reliable learning. To mitigate these issues, FedOL introduces a specialized objective function that iteratively refines pseudo-labels and the server model, improving learning reliability. To complement this, FedOL incorporates a tailored pseudo-label generation and knowledge distillation strategy that effectively integrates diverse knowledge. Simulation results show that FedOL significantly outperforms existing baselines, offering a cost-effective solution for mobile networks where clients possess valuable private data but limited computational resources.
comment: Accepted to Globecom 2025
☆ Towards safe control parameter tuning in distributed multi-agent systems
Many safety-critical real-world problems, such as autonomous driving and collaborative robots, are of a distributed multi-agent nature. To optimize the performance of these systems while ensuring safety, we can cast them as distributed optimization problems, where each agent aims to optimize their parameters to maximize a coupled reward function subject to coupled constraints. Prior work either studies a centralized setting, does not consider safety, or struggles with sample efficiency. Since we require sample efficiency and work with unknown and nonconvex rewards and constraints, we solve this optimization problem using safe Bayesian optimization with Gaussian process regression. Moreover, we consider nearest-neighbor communication between the agents. To capture the behavior of non-neighboring agents, we reformulate the static global optimization problem as a time-varying local optimization problem for each agent, essentially introducing time as a latent variable. To this end, we propose a custom spatio-temporal kernel to integrate prior knowledge. We show the successful deployment of our algorithm in simulations.
comment: Accepted to CDC 2025
☆ Bounding Causal Effects and Counterfactuals
Causal inference often hinges on strong assumptions - such as no unmeasured confounding or perfect compliance - that are rarely satisfied in practice. Partial identification offers a principled alternative: instead of relying on unverifiable assumptions to estimate causal effects precisely, it derives bounds that reflect the uncertainty inherent in the data. Despite its theoretical appeal, partial identification remains underutilized in applied work, in part due to the fragmented nature of existing methods and the lack of practical guidance. This thesis addresses these challenges by systematically comparing a diverse set of bounding algorithms across multiple causal scenarios. We implement, extend, and unify state-of-the-art methods - including symbolic, optimization-based, and information-theoretic approaches - within a common evaluation framework. In particular, we propose an extension of a recently introduced entropy-bounded method, making it applicable to counterfactual queries such as the Probability of Necessity and Sufficiency (PNS). Our empirical study spans thousands of randomized simulations involving both discrete and continuous data-generating processes. We assess each method in terms of bound tightness, computational efficiency, and robustness to assumption violations. To support practitioners, we distill our findings into a practical decision tree for algorithm selection and train a machine learning model to predict the best-performing method based on observable data characteristics. All implementations are released as part of an open-source Python package, CausalBoundingEngine, which enables users to apply and compare bounding methods through a unified interface.
comment: Bachelor's thesis, Technical University of Munich, 2025. 102 pages, 20 figures
☆ Approximate Bayesian Inference via Bitstring Representations UAI 2025
The machine learning community has recently put effort into quantized or low-precision arithmetics to scale large models. This paper proposes performing probabilistic inference in the quantized, discrete parameter space created by these representations, effectively enabling us to learn a continuous distribution using discrete parameters. We consider both 2D densities and quantized neural networks, where we introduce a tractable learning approach using probabilistic circuits. This method offers a scalable solution to manage complex distributions and provides clear insights into model behavior. We validate our approach with various models, demonstrating inference efficiency without sacrificing accuracy. This work advances scalable, interpretable machine learning by utilizing discrete approximations for probabilistic computations.
comment: Published at Uncertainty in Artificial Intelligence (UAI 2025)
☆ A Generalized Learning Framework for Self-Supervised Contrastive Learning
Self-supervised contrastive learning (SSCL) has recently demonstrated superiority in multiple downstream tasks. In this paper, we generalize the standard SSCL methods to a Generalized Learning Framework (GLF) consisting of two parts: the aligning part and the constraining part. We analyze three existing SSCL methods: BYOL, Barlow Twins, and SwAV, and show that they can be unified under GLF with different choices of the constraining part. We further propose empirical and theoretical analyses providing two insights into designing the constraining part of GLF: intra-class compactness and inter-class separability, which measure how well the feature space preserves the class information of the inputs. However, since SSCL can not use labels, it is challenging to design a constraining part that satisfies these properties. To address this issue, we consider inducing intra-class compactness and inter-class separability by iteratively capturing the dynamic relationship between anchor and other samples and propose a plug-and-play method called Adaptive Distribution Calibration (ADC) to ensure that samples that are near or far from the anchor point in the original input space are closer or further away from the anchor point in the feature space. Both the theoretical analysis and the empirical evaluation demonstrate the superiority of ADC.
☆ Understanding Distribution Structure on Calibrated Recommendation Systems
Traditional recommender systems aim to generate a recommendation list comprising the most relevant or similar items to the user's profile. These approaches can create recommendation lists that omit item genres from the less prominent areas of a user's profile, thereby undermining the user's experience. To solve this problem, the calibrated recommendation system provides a guarantee of including less representative areas in the recommended list. The calibrated context works with three distributions. The first is from the user's profile, the second is from the candidate items, and the last is from the recommendation list. These distributions are G-dimensional, where G is the total number of genres in the system. This high dimensionality requires a different evaluation method, considering that traditional recommenders operate in a one-dimensional data space. In this sense, we implement fifteen models that help to understand how these distributions are structured. We evaluate the users' patterns in three datasets from the movie domain. The results indicate that the models of outlier detection provide a better understanding of the structures. The calibrated system creates recommendation lists that act similarly to traditional recommendation lists, allowing users to change their groups of preferences to the same degree.
☆ The 9th AI City Challenge ICCV 2025
The ninth AI City Challenge continues to advance real-world applications of computer vision and AI in transportation, industrial automation, and public safety. The 2025 edition featured four tracks and saw a 17% increase in participation, with 245 teams from 15 countries registered on the evaluation server. Public release of challenge datasets led to over 30,000 downloads to date. Track 1 focused on multi-class 3D multi-camera tracking, involving people, humanoids, autonomous mobile robots, and forklifts, using detailed calibration and 3D bounding box annotations. Track 2 tackled video question answering in traffic safety, with multi-camera incident understanding enriched by 3D gaze labels. Track 3 addressed fine-grained spatial reasoning in dynamic warehouse environments, requiring AI systems to interpret RGB-D inputs and answer spatial questions that combine perception, geometry, and language. Both Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4 emphasized efficient road object detection from fisheye cameras, supporting lightweight, real-time deployment on edge devices. The evaluation framework enforced submission limits and used a partially held-out test set to ensure fair benchmarking. Final rankings were revealed after the competition concluded, fostering reproducibility and mitigating overfitting. Several teams achieved top-tier results, setting new benchmarks in multiple tasks.
comment: Summary of the 9th AI City Challenge Workshop in conjunction with ICCV 2025
☆ Prediction of Hospital Associated Infections During Continuous Hospital Stays
The US Centers for Disease Control and Prevention (CDC), in 2019, designated Methicillin-resistant Staphylococcus aureus (MRSA) as a serious antimicrobial resistance threat. The risk of acquiring MRSA and suffering life-threatening consequences due to it remains especially high for hospitalized patients due to a unique combination of factors, including: co-morbid conditions, immuno suppression, antibiotic use, and risk of contact with contaminated hospital workers and equipment. In this paper, we present a novel generative probabilistic model, GenHAI, for modeling sequences of MRSA test results outcomes for patients during a single hospitalization. This model can be used to answer many important questions from the perspectives of hospital administrators for mitigating the risk of MRSA infections. Our model is based on the probabilistic programming paradigm, and can be used to approximately answer a variety of predictive, causal, and counterfactual questions. We demonstrate the efficacy of our model by comparing it against discriminative and generative machine learning models using two real-world datasets.
☆ Collapsing ROC approach for risk prediction research on both common and rare variants
Risk prediction that capitalizes on emerging genetic findings holds great promise for improving public health and clinical care. However, recent risk prediction research has shown that predictive tests formed on existing common genetic loci, including those from genome-wide association studies, have lacked sufficient accuracy for clinical use. Because most rare variants on the genome have not yet been studied for their role in risk prediction, future disease prediction discoveries should shift toward a more comprehensive risk prediction strategy that takes into account both common and rare variants. We are proposing a collapsing receiver operating characteristic CROC approach for risk prediction research on both common and rare variants. The new approach is an extension of a previously developed forward ROC FROC approach, with additional procedures for handling rare variants. The approach was evaluated through the use of 533 single-nucleotide polymorphisms SNPs in 37 candidate genes from the Genetic Analysis Workshop 17 mini-exome data set. We found that a prediction model built on all SNPs gained more accuracy AUC = 0.605 than one built on common variants alone AUC = 0.585. We further evaluated the performance of two approaches by gradually reducing the number of common variants in the analysis. We found that the CROC method attained more accuracy than the FROC method when the number of common variants in the data decreased. In an extreme scenario, when there are only rare variants in the data, the CROC reached an AUC value of 0.603, whereas the FROC had an AUC value of 0.524.
☆ CALYPSO: Forecasting and Analyzing MRSA Infection Patterns with Community and Healthcare Transmission Dynamics
Methicillin-resistant Staphylococcus aureus (MRSA) is a critical public health threat within hospitals as well as long-term care facilities. Better understanding of MRSA risks, evaluation of interventions and forecasting MRSA rates are important public health problems. Existing forecasting models rely on statistical or neural network approaches, which lack epidemiological interpretability, and have limited performance. Mechanistic epidemic models are difficult to calibrate and limited in incorporating diverse datasets. We present CALYPSO, a hybrid framework that integrates neural networks with mechanistic metapopulation models to capture the spread dynamics of infectious diseases (i.e., MRSA) across healthcare and community settings. Our model leverages patient-level insurance claims, commuting data, and healthcare transfer patterns to learn region- and time-specific parameters governing MRSA spread. This enables accurate, interpretable forecasts at multiple spatial resolutions (county, healthcare facility, region, state) and supports counterfactual analyses of infection control policies and outbreak risks. We also show that CALYPSO improves statewide forecasting performance by over 4.5% compared to machine learning baselines, while also identifying high-risk regions and cost-effective strategies for allocating infection prevention resources.
☆ Compressed Models are NOT Trust-equivalent to Their Large Counterparts
Large Deep Learning models are often compressed before being deployed in a resource-constrained environment. Can we trust the prediction of compressed models just as we trust the prediction of the original large model? Existing work has keenly studied the effect of compression on accuracy and related performance measures. However, performance parity does not guarantee trust-equivalence. We propose a two-dimensional framework for trust-equivalence evaluation. First, interpretability alignment measures whether the models base their predictions on the same input features. We use LIME and SHAP tests to measure the interpretability alignment. Second, calibration similarity measures whether the models exhibit comparable reliability in their predicted probabilities. It is assessed via ECE, MCE, Brier Score, and reliability diagrams. We conducted experiments using BERT-base as the large model and its multiple compressed variants. We focused on two text classification tasks: natural language inference and paraphrase identification. Our results reveal low interpretability alignment and significant mismatch in calibration similarity. It happens even when the accuracies are nearly identical between models. These findings show that compressed models are not trust-equivalent to their large counterparts. Deploying compressed models as a drop-in replacement for large models requires careful assessment, going beyond performance parity.
☆ MuFlex: A Scalable, Physics-based Platform for Multi-Building Flexibility Analysis and Coordination
With the increasing penetration of renewable generation on the power grid, maintaining system balance requires coordinated demand flexibility from aggregations of buildings. Reinforcement learning (RL) has been widely explored for building controls because of its model-free nature. Open-source simulation testbeds are essential not only for training RL agents but also for fairly benchmarking control strategies. However, most building-sector testbeds target single buildings; multi-building platforms are relatively limited and typically rely on simplified models (e.g., Resistance-Capacitance) or data-driven approaches, which lack the ability to fully capture the physical intricacies and intermediate variables necessary for interpreting control performance. Moreover, these platforms often impose fixed inputs, outputs, and model formats, restricting their applicability as benchmarking tools across diverse control scenarios. To address these gaps, MuFlex, a scalable, open-source platform for benchmarking and testing control strategies for multi-building flexibility coordination, was developed in this study. MuFlex enables synchronous information exchange across EnergyPlus building models and adheres to the latest OpenAI Gym interface, providing a modular, standardized RL implementation. The platform capabilities were demonstrated in a case study coordinating demand flexibility across four office buildings using the Soft Actor-Critic algorithm with carefully fine-tuned hyperparameters. The results show that aggregating the four buildings flexibility reduced total peak demand below a specified threshold while maintaining indoor environmental quality.
comment: The platform will be released open-source on GitHub: https://github.com/BuildNexusX/MuFlex once pre-printed
☆ Explainability of Algorithms
The opaqueness of many complex machine learning algorithms is often mentioned as one of the main obstacles to the ethical development of artificial intelligence (AI). But what does it mean for an algorithm to be opaque? Highly complex algorithms such as artificial neural networks process enormous volumes of data in parallel along multiple hidden layers of interconnected nodes, rendering their inner workings epistemically inaccessible to any human being, including their designers and developers; they are "black boxes" for all their stakeholders. But opaqueness is not always the inevitable result of technical complexity. Sometimes, the way an algorithm works is intentionally hidden from view for proprietary reasons, especially in commercial automated decision systems, creating an entirely different type of opaqueness. In the first part of the chapter, we will examine these two ways of understanding opacity and the ethical implications that stem from each of them. In the second part, we explore the different explanatory methods that have been developed in computer science to overcome an AI system's technical opaqueness. As the analysis shows, explainable AI (XAI) still faces numerous challenges.
☆ Saudi-Dialect-ALLaM: LoRA Fine-Tuning for Dialectal Arabic Generation
Large language models (LLMs) for Arabic are still dominated by Modern Standard Arabic (MSA), with limited support for Saudi dialects such as Najdi and Hijazi. This underrepresentation hinders their ability to capture authentic dialectal variation. Using a privately curated Saudi Dialect Instruction dataset (Hijazi and Najdi; 5,466 synthetic instruction-response pairs; 50/50 split), we LoRA-tune ALLaM-7B-Instruct-preview, the first foundation model developed in Saudi Arabia, for Saudi dialect generation. We investigate two variants: (i) Dialect-Token training, which prepends an explicit dialect tag to the instruction, and (ii) No-Token training, which omits the tag at formatting time. Evaluation on a held-out test set combines an external dialect classifier with text fidelity metrics (chrF++ and BERTScore) and diversity measures. The Dialect-Token model achieves the best control, raising the Saudi rate from 47.97% to 84.21% and reducing MSA leakage from 32.63% to 6.21%; fidelity also improves (chrF++ +3.53, BERTScore +0.059). Both LoRA variants outperform strong generic instruction models (Falcon-7B-Instruct, Llama-3.1-8B-Instruct, Qwen-2.5-7B-Instruct, AceGPT-v2-8B-Chat, JAIS-13B-Chat) in dialect control and fidelity, while avoiding metadata-tag echoing that these baselines frequently exhibit. We do not release the dataset or any model weights/adapters; instead, we release training/evaluation/inference code and a detailed datasheet (schema and aggregate statistics) to support independent verification.
comment: 7 pages, 6 figures, 2 tables. Code: https://github.com/HasanBGIt/Saudi-Dialect-ALLaM . Dataset and trained weights/adapters are not released. Primary category: cs.CL
☆ Heterogeneous Influence Maximization in User Recommendation CIKM 2025
User recommendation systems enhance user engagement by encouraging users to act as inviters to interact with other users (invitees), potentially fostering information propagation. Conventional recommendation methods typically focus on modeling interaction willingness. Influence-Maximization (IM) methods focus on identifying a set of users to maximize the information propagation. However, existing methods face two significant challenges. First, recommendation methods fail to unleash the candidates' spread capability. Second, IM methods fail to account for the willingness to interact. To solve these issues, we propose two models named HeteroIR and HeteroIM. HeteroIR provides an intuitive solution to unleash the dissemination potential of user recommendation systems. HeteroIM fills the gap between the IM method and the recommendation task, improving interaction willingness and maximizing spread coverage. The HeteroIR introduces a two-stage framework to estimate the spread profits. The HeteroIM incrementally selects the most influential invitee to recommend and rerank based on the number of reverse reachable (RR) sets containing inviters and invitees. RR set denotes a set of nodes that can reach a target via propagation. Extensive experiments show that HeteroIR and HeteroIM significantly outperform the state-of-the-art baselines with the p-value < 0.05. Furthermore, we have deployed HeteroIR and HeteroIM in Tencent's online gaming platforms and gained an 8.5\% and 10\% improvement in the online A/B test, respectively. Implementation codes are available at https://github.com/socialalgo/HIM.
comment: Accepted in CIKM 2025
☆ Uncertainty Tube Visualization of Particle Trajectories
Predicting particle trajectories with neural networks (NNs) has substantially enhanced many scientific and engineering domains. However, effectively quantifying and visualizing the inherent uncertainty in predictions remains challenging. Without an understanding of the uncertainty, the reliability of NN models in applications where trustworthiness is paramount is significantly compromised. This paper introduces the uncertainty tube, a novel, computationally efficient visualization method designed to represent this uncertainty in NN-derived particle paths. Our key innovation is the design and implementation of a superelliptical tube that accurately captures and intuitively conveys nonsymmetric uncertainty. By integrating well-established uncertainty quantification techniques, such as Deep Ensembles, Monte Carlo Dropout (MC Dropout), and Stochastic Weight Averaging-Gaussian (SWAG), we demonstrate the practical utility of the uncertainty tube, showcasing its application on both synthetic and simulation datasets.
☆ LLM-Enhanced Linear Autoencoders for Recommendation CIKM 2025
Large language models (LLMs) have been widely adopted to enrich the semantic representation of textual item information in recommender systems. However, existing linear autoencoders (LAEs) that incorporate textual information rely on sparse word co-occurrence patterns, limiting their ability to capture rich textual semantics. To address this, we propose L3AE, the first integration of LLMs into the LAE framework. L3AE effectively integrates the heterogeneous knowledge of textual semantics and user-item interactions through a two-phase optimization strategy. (i) L3AE first constructs a semantic item-to-item correlation matrix from LLM-derived item representations. (ii) It then learns an item-to-item weight matrix from collaborative signals while distilling semantic item correlations as regularization. Notably, each phase of L3AE is optimized through closed-form solutions, ensuring global optimality and computational efficiency. Extensive experiments demonstrate that L3AE consistently outperforms state-of-the-art LLM-enhanced models on three benchmark datasets, achieving gains of 27.6% in Recall@20 and 39.3% in NDCG@20. The source code is available at https://github.com/jaewan7599/L3AE_CIKM2025.
comment: Accepted by CIKM 2025
☆ Multi-view Clustering via Bi-level Decoupling and Consistency Learning
Multi-view clustering has shown to be an effective method for analyzing underlying patterns in multi-view data. The performance of clustering can be improved by learning the consistency and complementarity between multi-view features, however, cluster-oriented representation learning is often overlooked. In this paper, we propose a novel Bi-level Decoupling and Consistency Learning framework (BDCL) to further explore the effective representation for multi-view data to enhance inter-cluster discriminability and intra-cluster compactness of features in multi-view clustering. Our framework comprises three modules: 1) The multi-view instance learning module aligns the consistent information while preserving the private features between views through reconstruction autoencoder and contrastive learning. 2) The bi-level decoupling of features and clusters enhances the discriminability of feature space and cluster space. 3) The consistency learning module treats the different views of the sample and their neighbors as positive pairs, learns the consistency of their clustering assignments, and further compresses the intra-cluster space. Experimental results on five benchmark datasets demonstrate the superiority of the proposed method compared with the SOTA methods. Our code is published on https://github.com/LouisDong95/BDCL.
☆ DyMixOp: Guiding Neural Operator Design for PDEs from a Complex Dynamics Perspective with Local-Global-Mixing
A primary challenge in using neural networks to approximate nonlinear dynamical systems governed by partial differential equations (PDEs) is transforming these systems into a suitable format, especially when dealing with non-linearizable dynamics or the need for infinite-dimensional spaces for linearization. This paper introduces DyMixOp, a novel neural operator framework for PDEs that integrates insights from complex dynamical systems to address this challenge. Grounded in inertial manifold theory, DyMixOp transforms infinite-dimensional nonlinear PDE dynamics into a finite-dimensional latent space, establishing a structured foundation that maintains essential nonlinear interactions and enhances physical interpretability. A key innovation is the Local-Global-Mixing (LGM) transformation, inspired by convection dynamics in turbulence. This transformation effectively captures both fine-scale details and nonlinear interactions, while mitigating spectral bias commonly found in existing neural operators. The framework is further strengthened by a dynamics-informed architecture that connects multiple LGM layers to approximate linear and nonlinear dynamics, reflecting the temporal evolution of dynamical systems. Experimental results across diverse PDE benchmarks demonstrate that DyMixOp achieves state-of-the-art performance, significantly reducing prediction errors, particularly in convection-dominated scenarios reaching up to 86.7\%, while maintaining computational efficiency and scalability.
☆ Classifying Clinical Outcome of Epilepsy Patients with Ictal Chirp Embeddings
This study presents a pipeline leveraging t-Distributed Stochastic Neighbor Embedding (t-SNE) for interpretable visualizations of chirp features across diverse outcome scenarios. The dataset, comprising chirp-based temporal, spectral, and frequency metrics. Using t-SNE, local neighborhood relationships were preserved while addressing the crowding problem through Student t-distribution-based similarity optimization. Three classification tasks were formulated on the 2D t-SNE embeddings: (1) distinguishing clinical success from failure/no-resection, (2) separating high-difficulty from low-difficulty cases, and (3) identifying optimal cases, defined as successful outcomes with minimal clinical difficulty. Four classifiers, namely, Random Forests, Support Vector Machines, Logistic Regression, and k-Nearest Neighbors, were trained and evaluated using stratified 5-fold cross-validation. Across tasks, the Random Forest and k-NN classifiers demonstrated superior performance, achieving up to 88.8% accuracy in optimal case detection (successful outcomes with minimal clinical difficulty). Additionally, feature influence sensitivity maps were generated using SHAP explanations applied to model predicting t-SNE coordinates, revealing spatially localized feature importance within the embedding space. These maps highlighted how specific chirp attributes drive regional clustering and class separation, offering insights into the latent structure of the data. The integrated framework showcases the potential of interpretable embeddings and local feature attribution for clinical stratification and decision support.
comment: 21 pages, 10 figures
☆ Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
Kidney stone classification from endoscopic images is critical for personalized treatment and recurrence prevention. While convolutional neural networks (CNNs) have shown promise in this task, their limited ability to capture long-range dependencies can hinder performance under variable imaging conditions. This study presents a comparative analysis between Vision Transformers (ViTs) and CNN-based models, evaluating their performance on two ex vivo datasets comprising CCD camera and flexible ureteroscope images. The ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50 baseline across multiple imaging conditions. For instance, in the most visually complex subset (Section patches from endoscopic images), the ViT model achieved 95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50. In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy versus 78.4% with CNN. These improvements extend across precision and recall as well. The results demonstrate that ViT-based architectures provide superior classification performance and offer a scalable alternative to conventional CNNs for kidney stone image analysis.
☆ Hierarchy-Consistent Learning and Adaptive Loss Balancing for Hierarchical Multi-Label Classification CIKM 2025
Hierarchical Multi-Label Classification (HMC) faces critical challenges in maintaining structural consistency and balancing loss weighting in Multi-Task Learning (MTL). In order to address these issues, we propose a classifier called HCAL based on MTL integrated with prototype contrastive learning and adaptive task-weighting mechanisms. The most significant advantage of our classifier is semantic consistency including both prototype with explicitly modeling label and feature aggregation from child classes to parent classes. The other important advantage is an adaptive loss-weighting mechanism that dynamically allocates optimization resources by monitoring task-specific convergence rates. It effectively resolves the "one-strong-many-weak" optimization bias inherent in traditional MTL approaches. To further enhance robustness, a prototype perturbation mechanism is formulated by injecting controlled noise into prototype to expand decision boundaries. Additionally, we formalize a quantitative metric called Hierarchical Violation Rate (HVR) as to evaluate hierarchical consistency and generalization. Extensive experiments across three datasets demonstrate both the higher classification accuracy and reduced hierarchical violation rate of the proposed classifier over baseline models.
comment: 10 pages, 7 figures, accepted by CIKM 2025
☆ ASAP: Unsupervised Post-training with Label Distribution Shift Adaptive Learning Rate CIKM 2025
In real-world applications, machine learning models face online label shift, where label distributions change over time. Effective adaptation requires careful learning rate selection: too low slows adaptation and too high causes instability. We propose ASAP (Adaptive Shift Aware Post-training), which dynamically adjusts the learning rate by computing the cosine distance between current and previous unlabeled outputs and mapping it within a bounded range. ASAP requires no labels, model ensembles, or past inputs, using only the previous softmax output for fast, lightweight adaptation. Experiments across multiple datasets and shift scenarios show ASAP consistently improves accuracy and efficiency, making it practical for unsupervised model adaptation.
comment: 5 pages, 3 figures, accepted for ACM CIKM 2025
☆ FedRAIN-Lite: Federated Reinforcement Algorithms for Improving Idealised Numerical Weather and Climate Models
Sub-grid parameterisations in climate models are traditionally static and tuned offline, limiting adaptability to evolving states. This work introduces FedRAIN-Lite, a federated reinforcement learning (FedRL) framework that mirrors the spatial decomposition used in general circulation models (GCMs) by assigning agents to latitude bands, enabling local parameter learning with periodic global aggregation. Using a hierarchy of simplified energy-balance climate models, from a single-agent baseline (ebm-v1) to multi-agent ensemble (ebm-v2) and GCM-like (ebm-v3) setups, we benchmark three RL algorithms under different FedRL configurations. Results show that Deep Deterministic Policy Gradient (DDPG) consistently outperforms both static and single-agent baselines, with faster convergence and lower area-weighted RMSE in tropical and mid-latitude zones across both ebm-v2 and ebm-v3 setups. DDPG's ability to transfer across hyperparameters and low computational cost make it well-suited for geographically adaptive parameter learning. This capability offers a scalable pathway towards high-complexity GCMs and provides a prototype for physically aligned, online-learning climate models that can evolve with a changing climate. Code accessible at https://github.com/p3jitnath/climate-rl-fedrl.
comment: 21 pages, 6 figures
☆ Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency
Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, but they remain susceptible to hallucinations--generating content that appears plausible but contains factual inaccuracies. We present Finch-Zk, a black-box framework that leverages FINe-grained Cross-model consistency to detect and mitigate Hallucinations in LLM outputs without requiring external knowledge sources. Finch-Zk introduces two key innovations: 1) a cross-model consistency checking strategy that reveals fine-grained inaccuracies by comparing responses generated by diverse models from semantically-equivalent prompts, and 2) a targeted mitigation technique that applies precise corrections to problematic segments while preserving accurate content. Experiments on the FELM dataset show Finch-Zk improves hallucination detection F1 scores by 6-39\% compared to existing approaches. For mitigation, Finch-Zk achieves 7-8 absolute percentage points improvement in answer accuracy on the GPQA-diamond dataset when applied to state-of-the-art models like Llama 4 Maverick and Claude 4 Sonnet. Extensive evaluation across multiple models demonstrates that Finch-Zk provides a practical, deployment-ready safeguard for enhancing factual reliability in production LLM systems.
☆ Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS
Test-time scaling (TTS) for large language models (LLMs) has thus far fallen into two largely separate paradigms: (1) reinforcement learning (RL) methods that optimize sparse outcome-based rewards, yet suffer from instability and low sample efficiency; and (2) search-based techniques guided by independently trained, static process reward models (PRMs), which require expensive human- or LLM-generated labels and often degrade under distribution shifts. In this paper, we introduce AIRL-S, the first natural unification of RL-based and search-based TTS. Central to AIRL-S is the insight that the reward function learned during RL training inherently represents the ideal PRM for guiding downstream search. Specifically, we leverage adversarial inverse reinforcement learning (AIRL) combined with group relative policy optimization (GRPO) to learn a dense, dynamic PRM directly from correct reasoning traces, entirely eliminating the need for labeled intermediate process data. At inference, the resulting PRM simultaneously serves as the critic for RL rollouts and as a heuristic to effectively guide search procedures, facilitating robust reasoning chain extension, mitigating reward hacking, and enhancing cross-task generalization. Experimental results across eight benchmarks, including mathematics, scientific reasoning, and code generation, demonstrate that our unified approach improves performance by 9 % on average over the base model, matching GPT-4o. Furthermore, when integrated into multiple search algorithms, our PRM consistently outperforms all baseline PRMs trained with labeled data. These results underscore that, indeed, your reward function for RL is your best PRM for search, providing a robust and cost-effective solution to complex reasoning tasks in LLMs.
☆ Learning Time-Varying Convexifications of Multiple Fairness Measures
There is an increasing appreciation that one may need to consider multiple measures of fairness, e.g., considering multiple group and individual fairness notions. The relative weights of the fairness regularisers are a priori unknown, may be time varying, and need to be learned on the fly. We consider the learning of time-varying convexifications of multiple fairness measures with limited graph-structured feedback.
☆ GLASS: Test-Time Acceleration for LLMs via Global-Local Neural Importance Aggregation
Deploying Large Language Models (LLMs) on edge hardware demands aggressive, prompt-aware dynamic pruning to reduce computation without degrading quality. Static or predictor-based schemes either lock in a single sparsity pattern or incur extra runtime overhead, and recent zero-shot methods that rely on statistics from a single prompt fail on short prompt and/or long generation scenarios. We introduce A/I-GLASS: Activation- and Impact-based Global-Local neural importance Aggregation for feed-forward network SparSification, two training-free methods that dynamically select FFN units using a rank-aggregation of prompt local and model-intrinsic global neuron statistics. Empirical results across multiple LLMs and benchmarks demonstrate that GLASS significantly outperforms prior training-free methods, particularly in challenging long-form generation scenarios, without relying on auxiliary predictors or adding any inference overhead.
☆ Pixels to Play: A Foundation Model for 3D Gameplay
We introduce Pixels2Play-0.1 (P2P0.1), a foundation model that learns to play a wide range of 3D video games with recognizable human-like behavior. Motivated by emerging consumer and developer use cases - AI teammates, controllable NPCs, personalized live-streamers, assistive testers - we argue that an agent must rely on the same pixel stream available to players and generalize to new titles with minimal game-specific engineering. P2P0.1 is trained end-to-end with behavior cloning: labeled demonstrations collected from instrumented human game-play are complemented by unlabeled public videos, to which we impute actions via an inverse-dynamics model. A decoder-only transformer with auto-regressive action output handles the large action space while remaining latency-friendly on a single consumer GPU. We report qualitative results showing competent play across simple Roblox and classic MS-DOS titles, ablations on unlabeled data, and outline the scaling and evaluation steps required to reach expert-level, text-conditioned control.
☆ Amortized Bayesian Meta-Learning for Low-Rank Adaptation of Large Language Models
Fine-tuning large language models (LLMs) with low-rank adaptaion (LoRA) is a cost-effective way to incorporate information from a specific dataset. However, it is often unclear how well the fine-tuned LLM will generalize, i.e., how well it will perform on unseen datasets. Methods have been proposed to improve generalization by optimizing with in-context prompts, or by using meta-learning to fine-tune LLMs. However, these methods are expensive in memory and computation, requiring either long-context prompts or saving copies of parameters and using second-order gradient updates. To address these challenges, we propose Amortized Bayesian Meta-Learning for LoRA (ABMLL). This method builds on amortized Bayesian meta-learning for smaller models, adapting this approach to LLMs while maintaining its computational efficiency. We reframe task-specific and global parameters in the context of LoRA and use a set of new hyperparameters to balance reconstruction accuracy and the fidelity of task-specific parameters to the global ones. ABMLL provides effective generalization and scales to large models such as Llama3-8B. Furthermore, as a result of using a Bayesian framework, ABMLL provides improved uncertainty quantification. We test ABMLL on Unified-QA and CrossFit datasets and find that it outperforms existing methods on these benchmarks in terms of both accuracy and expected calibration error.
comment: 6 pages, 2 figures
☆ Comparing Model-agnostic Feature Selection Methods through Relative Efficiency
Feature selection and importance estimation in a model-agnostic setting is an ongoing challenge of significant interest. Wrapper methods are commonly used because they are typically model-agnostic, even though they are computationally intensive. In this paper, we focus on feature selection methods related to the Generalized Covariance Measure (GCM) and Leave-One-Covariate-Out (LOCO) estimation, and provide a comparison based on relative efficiency. In particular, we present a theoretical comparison under three model settings: linear models, non-linear additive models, and single index models that mimic a single-layer neural network. We complement this with extensive simulations and real data examples. Our theoretical results, along with empirical findings, demonstrate that GCM-related methods generally outperform LOCO under suitable regularity conditions. Furthermore, we quantify the asymptotic relative efficiency of these approaches. Our simulations and real data analysis include widely used machine learning methods such as neural networks and gradient boosting trees.
☆ Graph Concept Bottleneck Models
Concept Bottleneck Models (CBMs) provide explicit interpretations for deep neural networks through concepts and allow intervention with concepts to adjust final predictions. Existing CBMs assume concepts are conditionally independent given labels and isolated from each other, ignoring the hidden relationships among concepts. However, the set of concepts in CBMs often has an intrinsic structure where concepts are generally correlated: changing one concept will inherently impact its related concepts. To mitigate this limitation, we propose GraphCBMs: a new variant of CBM that facilitates concept relationships by constructing latent concept graphs, which can be combined with CBMs to enhance model performance while retaining their interpretability. Our experiment results on real-world image classification tasks demonstrate Graph CBMs offer the following benefits: (1) superior in image classification tasks while providing more concept structure information for interpretability; (2) able to utilize latent concept graphs for more effective interventions; and (3) robust in performance across different training and architecture settings.
☆ Optimal Subspace Embeddings: Resolving Nelson-Nguyen Conjecture Up to Sub-Polylogarithmic Factors
We give a proof of the conjecture of Nelson and Nguyen [FOCS 2013] on the optimal dimension and sparsity of oblivious subspace embeddings, up to sub-polylogarithmic factors: For any $n\geq d$ and $\epsilon\geq d^{-O(1)}$, there is a random $\tilde O(d/\epsilon^2)\times n$ matrix $\Pi$ with $\tilde O(\log(d)/\epsilon)$ non-zeros per column such that for any $A\in\mathbb{R}^{n\times d}$, with high probability, $(1-\epsilon)\|Ax\|\leq\|\Pi Ax\|\leq(1+\epsilon)\|Ax\|$ for all $x\in\mathbb{R}^d$, where $\tilde O(\cdot)$ hides only sub-polylogarithmic factors in $d$. Our result in particular implies a new fastest sub-current matrix multiplication time reduction of size $\tilde O(d/\epsilon^2)$ for a broad class of $n\times d$ linear regression tasks. A key novelty in our analysis is a matrix concentration technique we call iterative decoupling, which we use to fine-tune the higher-order trace moment bounds attainable via existing random matrix universality tools [Brailovskaya and van Handel, GAFA 2024].
☆ Accelerating Image Classification with Graph Convolutional Neural Networks using Voronoi Diagrams
Recent advances in image classification have been significantly propelled by the integration of Graph Convolutional Networks (GCNs), offering a novel paradigm for handling complex data structures. This study introduces an innovative framework that employs GCNs in conjunction with Voronoi diagrams to peform image classification, leveraging their exceptional capability to model relational data. Unlike conventional convolutional neural networks, our approach utilizes a graph-based representation of images, where pixels or regions are treated as vertices of a graph, which are then simplified in the form of the corresponding Delaunay triangulations. Our model yields significant improvement in pre-processing time and classification accuracy on several benchmark datasets, surpassing existing state-of-the-art models, especially in scenarios that involve complex scenes and fine-grained categories. The experimental results, validated via cross-validation, underscore the potential of integrating GCNs with Voronoi diagrams in advancing image classification tasks. This research contributes to the field by introducing a novel approach to image classification, while opening new avenues for developing graph-based learning paradigms in other domains of computer vision and non-structured data. In particular, we have proposed a new version of the GCN in this paper, namely normalized Voronoi Graph Convolution Network (NVGCN), which is faster than the regular GCN.
comment: 14 pages, 13 figures
♻ ☆ POPri: Private Federated Learning using Preference-Optimized Synthetic Data ICML 2025
In practical settings, differentially private Federated learning (DP-FL) is the dominant method for training models from private, on-device client data. Recent work has suggested that DP-FL may be enhanced or outperformed by methods that use DP synthetic data (Wu et al., 2024; Hou et al., 2024). The primary algorithms for generating DP synthetic data for FL applications require careful prompt engineering based on public information and/or iterative private client feedback. Our key insight is that the private client feedback collected by prior DP synthetic data methods (Hou et al., 2024; Xie et al., 2024) can be viewed as an RL (reinforcement learning) reward. Our algorithm, Policy Optimization for Private Data (POPri) harnesses client feedback using policy optimization algorithms such as Direct Preference Optimization (DPO) to fine-tune LLMs to generate high-quality DP synthetic data. To evaluate POPri, we release LargeFedBench, a new federated text benchmark for uncontaminated LLM evaluations on federated client data. POPri substantially improves the utility of DP synthetic data relative to prior work on LargeFedBench datasets and an existing benchmark from Xie et al. (2024). POPri closes the gap between next-token prediction accuracy in the fully-private and non-private settings by up to 58%, compared to 28% for prior synthetic data methods, and 3% for state-of-the-art DP federated learning methods. The code and data are available at https://github.com/meiyuw/POPri.
comment: ICML 2025 camera-ready
♻ ☆ Closed-Form Feedback-Free Learning with Forward Projection
State-of-the-art methods for backpropagation-free learning employ local error feedback to direct iterative optimisation via gradient descent. In this study, we examine the more restrictive setting where retrograde communication from neuronal outputs is unavailable for pre-synaptic weight optimisation. To address this challenge, we propose Forward Projection (FP). This novel randomised closed-form training method requires only a single forward pass over the entire dataset for model fitting, without retrograde communication. Target values for pre-activation membrane potentials are generated layer-wise via nonlinear projections of pre-synaptic inputs and the labels. Local loss functions are optimised over pre-synaptic inputs using closed-form regression, without feedback from neuronal outputs or downstream layers. Interpretability is a key advantage of FP training; membrane potentials of hidden neurons in FP-trained networks encode information which is interpretable layer-wise as label predictions. We demonstrate the effectiveness of FP across four biomedical datasets. In few-shot learning tasks, FP yielded more generalisable models than those optimised via backpropagation. In large-sample tasks, FP-based models achieve generalisation comparable to gradient descent-based local learning methods while requiring only a single forward propagation step, achieving significant speed up for training. Interpretation functions defined on local neuronal activity in FP-based models successfully identified clinically salient features for diagnosis in two biomedical datasets. Forward Projection is a computationally efficient machine learning approach that yields interpretable neural network models without retrograde communication of neuronal activity during training.
comment: 26 pages, 5 figures. Study code available at https://github.com/robertoshea/forward_projection. Study data available at https://data.mendeley.com/datasets/fb7xddyxs4/2
♻ ☆ Bidirectional Information Flow (BIF) -- A Sample Efficient Hierarchical Gaussian Process for Bayesian Optimization
Hierarchical Gaussian Process (H-GP) models divide problems into different subtasks, allowing for different models to address each part, making them well-suited for problems with inherent hierarchical structure. However, typical H-GP models do not fully take advantage of this structure, only sending information up or down the hierarchy. This one-way coupling limits sample efficiency and slows convergence. We propose Bidirectional Information Flow (BIF), an efficient H-GP framework that establishes bidirectional information exchange between parent and child models in H-GPs for online training. BIF retains the modular structure of hierarchical models - the parent combines subtask knowledge from children GPs - while introducing top-down feedback to continually refine children models during online learning. This mutual exchange improves sample efficiency, enables robust training, and allows modular reuse of learned subtask models. BIF outperforms conventional H-GP Bayesian Optimization methods, achieving up to 4x and 3x higher $R^2$ scores for the parent and children respectively, on synthetic and real-world neurostimulation optimization tasks.
♻ ☆ A kinetic-based regularization method for data science applications
We propose a physics-based regularization technique for function learning, inspired by statistical mechanics. By drawing an analogy between optimizing the parameters of an interpolator and minimizing the energy of a system, we introduce corrections that impose constraints on the lower-order moments of the data distribution. This minimizes the discrepancy between the discrete and continuum representations of the data, in turn allowing to access more favorable energy landscapes, thus improving the accuracy of the interpolator. Our approach improves performance in both interpolation and regression tasks, even in high-dimensional spaces. Unlike traditional methods, it does not require empirical parameter tuning, making it particularly effective for handling noisy data. We also show that thanks to its local nature, the method offers computational and memory efficiency advantages over Radial Basis Function interpolators, especially for large datasets.
♻ ☆ Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data ICML2025
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
comment: Accepted to ICML2025 (spotlight)
♻ ☆ Langevin Monte-Carlo Provably Learns Depth Two Neural Nets at Any Size and Data
In this work, we will establish that the Langevin Monte-Carlo algorithm can learn depth-2 neural nets of any size and for any data and we give non-asymptotic convergence rates for it. We achieve this via showing that in q-Renyi divergence, the iterates of Langevin Monte Carlo converge to the Gibbs distribution of Frobenius norm regularized losses for any of these nets, when using smooth activations and in both classification and regression settings. Most critically, the amount of regularization needed for our results is independent of the size of the net. This result achieves a synthesis of several recent observations about isoperimetry conditions under which LMC converges and that two-layer neural loss functions can always be regularized by a certain constant amount such that they satisfy the Villani conditions, and thus their Gibbs measures satisfy a Poincare inequality.
♻ ☆ Fully Automated Segmentation of Fiber Bundles in Anatomic Tracing Data MICCAI 2025
Anatomic tracer studies are critical for validating and improving diffusion MRI (dMRI) tractography. However, large-scale analysis of data from such studies is hampered by the labor-intensive process of annotating fiber bundles manually on histological slides. Existing automated methods often miss sparse bundles or require complex post-processing across consecutive sections, limiting their flexibility and generalizability. We present a streamlined, fully automated framework for fiber bundle segmentation in macaque tracer data, based on a U-Net architecture with large patch sizes, foreground aware sampling, and semisupervised pre-training. Our approach eliminates common errors such as mislabeling terminals as bundles, improves detection of sparse bundles by over 20% and reduces the False Discovery Rate (FDR) by 40% compared to the state-of-the-art, all while enabling analysis of standalone slices. This new framework will facilitate the automated analysis of anatomic tracing data at a large scale, generating more ground-truth data that can be used to validate and optimize dMRI tractography methods.
comment: Accepted at CDMRI, MICCAI 2025
♻ ☆ Finite Expression Method for Solving High-Dimensional Partial Differential Equations
Designing efficient and accurate numerical solvers for high-dimensional partial differential equations (PDEs) remains a challenging and important topic in computational science and engineering, mainly due to the "curse of dimensionality" in designing numerical schemes that scale in dimension. This paper introduces a new methodology that seeks an approximate PDE solution in the space of functions with finitely many analytic expressions and, hence, this methodology is named the finite expression method (FEX). It is proved in approximation theory that FEX can avoid the curse of dimensionality. As a proof of concept, a deep reinforcement learning method is proposed to implement FEX for various high-dimensional PDEs in different dimensions, achieving high and even machine accuracy with a memory complexity polynomial in dimension and an amenable time complexity. An approximate solution with finite analytic expressions also provides interpretable insights into the ground truth PDE solution, which can further help to advance the understanding of physical systems and design postprocessing techniques for a refined solution.
♻ ☆ Spatial-Temporal Transformer with Curriculum Learning for EEG-Based Emotion Recognition
EEG-based emotion recognition plays an important role in developing adaptive brain-computer communication systems, yet faces two fundamental challenges in practical implementations: (1) effective integration of non-stationary spatial-temporal neural patterns, (2) robust adaptation to dynamic emotional intensity variations in real-world scenarios. This paper proposes SST-CL, a novel framework integrating spatial-temporal transformers with curriculum learning. Our method introduces two core components: a spatial encoder that models inter-channel relationships and a temporal encoder that captures multi-scale dependencies through windowed attention mechanisms, enabling simultaneous extraction of spatial correlations and temporal dynamics from EEG signals. Complementing this architecture, an intensity-aware curriculum learning strategy progressively guides training from high-intensity to low-intensity emotional states through dynamic sample scheduling based on a dual difficulty assessment. Comprehensive experiments on three benchmark datasets demonstrate state-of-the-art performance across various emotional intensity levels, with ablation studies confirming the necessity of both architectural components and the curriculum learning mechanism.
♻ ☆ Clus-UCB: A Near-Optimal Algorithm for Clustered Bandits
We study a stochastic multi-armed bandit setting where arms are partitioned into known clusters, such that the mean rewards of arms within a cluster differ by at most a known threshold. While the clustering structure is known a priori, the arm means are unknown. We derive an asymptotic lower bound on the regret that improves upon the classical bound of Lai & Robbins (1985). We then propose Clus-UCB, an efficient algorithm that closely matches this lower bound asymptotically. Clus-UCB is designed to exploit the clustering structure and introduces a new index to evaluate an arm, which depends on other arms within the cluster. In this way, arms share information among each other. We present simulation results of our algorithm and compare its performance against KL-UCB and other wellknown algorithms for bandits with dependent arms. Finally, we address some limitations of this work and conclude by mentioning some possible future research.
♻ ☆ Vision Backbone Efficient Selection for Image Classification in Low-Data Regimes BMVC 2025
Transfer learning has become an essential tool in modern computer vision, allowing practitioners to leverage backbones, pretrained on large datasets, to train successful models from limited annotated data. Choosing the right backbone is crucial, especially for small datasets, since final performance depends heavily on the quality of the initial feature representations. While prior work has conducted benchmarks across various datasets to identify universal top-performing backbones, we demonstrate that backbone effectiveness is highly dataset-dependent, especially in low-data scenarios where no single backbone consistently excels. To overcome this limitation, we introduce dataset-specific backbone selection as a new research direction and investigate its practical viability in low-data regimes. Since exhaustive evaluation is computationally impractical for large backbone pools, we formalize Vision Backbone Efficient Selection (VIBES) as the problem of searching for high-performing backbones under computational constraints. We define the solution space, propose several heuristics, and demonstrate VIBES feasibility for low-data image classification by performing experiments on four diverse datasets. Our results show that even simple search strategies can find well-suited backbones within a pool of over $1300$ pretrained models, outperforming generic benchmark recommendations within just ten minutes of search time on a single GPU (NVIDIA RTX A5000).
comment: 16 pages, 8 figures, Accepted at BMVC 2025
♻ ☆ Blending 3D Geometry and Machine Learning for Multi-View Stereopsis
Traditional multi-view stereo (MVS) methods primarily depend on photometric and geometric consistency constraints. In contrast, modern learning-based algorithms often rely on the plane sweep algorithm to infer 3D geometry, applying explicit geometric consistency (GC) checks only as a post-processing step, with no impact on the learning process itself. In this work, we introduce GC MVSNet plus plus, a novel approach that actively enforces geometric consistency of reference view depth maps across multiple source views (multi view) and at various scales (multi scale) during the learning phase (see Fig. 1). This integrated GC check significantly accelerates the learning process by directly penalizing geometrically inconsistent pixels, effectively halving the number of training iterations compared to other MVS methods. Furthermore, we introduce a densely connected cost regularization network with two distinct block designs simple and feature dense optimized to harness dense feature connections for enhanced regularization. Extensive experiments demonstrate that our approach achieves a new state of the art on the DTU and BlendedMVS datasets and secures second place on the Tanks and Temples benchmark. To our knowledge, GC MVSNet plus plus is the first method to enforce multi-view, multi-scale supervised geometric consistency during learning. Our code is available.
comment: A pre-print -- accepted at Neurocomputing. arXiv admin note: substantial text overlap with arXiv:2310.19583
♻ ☆ Joint Problems in Learning Multiple Dynamical Systems
Clustering of time series is a well-studied problem, with applications ranging from quantitative, personalized models of metabolism obtained from metabolite concentrations to state discrimination in quantum information theory. We consider a variant, where given a set of trajectories and a number of parts, we jointly partition the set of trajectories and learn linear dynamical system (LDS) models for each part, so as to minimize the maximum error across all the models. We present globally convergent methods and EM heuristics, accompanied by promising computational results. The key highlight of this method is that it does not require a predefined hidden state dimension but instead provides an upper bound. Additionally, it offers guidance for determining regularization in the system identification.
♻ ☆ ConTextTab: A Semantics-Aware Tabular In-Context Learner
Tabular in-context learning (ICL) has recently achieved state-of-the-art (SOTA) performance on several tabular prediction tasks. Previously restricted to classification problems on small tables, recent advances such as TabPFN and TabICL have extended its use to larger datasets. Although current table-native ICL architectures are architecturally efficient and well-adapted to tabular data structures, their exclusive training on synthetic data limits their ability to fully leverage the rich semantics and world knowledge contained in real-world tabular data. At the other end of the spectrum, tabular ICL models based on pretrained large language models such as TabuLa-8B integrate deep semantic understanding and world knowledge but are only able to make use of a small amount of context due to inherent architectural limitations. With the aim to combine the best of both these worlds, we introduce ConTextTab, integrating semantic understanding and alignment into a table-native ICL framework. By employing specialized embeddings for different data modalities and by training on large-scale real-world tabular data, our model is competitive with SOTA across a broad set of benchmarks while setting a new standard on the semantically rich CARTE benchmark. Code and model checkpoints are available at: https://github.com/SAP-samples/contexttab
♻ ☆ Incorporating Attributes and Multi-Scale Structures for Heterogeneous Graph Contrastive Learning
Heterogeneous graphs (HGs) are composed of multiple types of nodes and edges, making it more effective in capturing the complex relational structures inherent in the real world. However, in real-world scenarios, labeled data is often difficult to obtain, which limits the applicability of semi-supervised approaches. Self-supervised learning aims to enable models to automatically learn useful features from data, effectively addressing the challenge of limited labeling data. In this paper, we propose a novel contrastive learning framework for heterogeneous graphs (ASHGCL), which incorporates three distinct views, each focusing on node attributes, high-order and low-order structural information, respectively, to effectively capture attribute information, high-order structures, and low-order structures for node representation learning. Furthermore, we introduce an attribute-enhanced positive sample selection strategy that combines both structural information and attribute information, effectively addressing the issue of sampling bias. Extensive experiments on four real-world datasets show that ASHGCL outperforms state-of-the-art unsupervised baselines and even surpasses some supervised benchmarks.
♻ ☆ SSD-TS: Exploring the Potential of Linear State Space Models for Diffusion Models in Time Series Imputation KDD' 25
Probabilistic time series imputation has been widely applied in real-world scenarios due to its ability for uncertainty estimation and denoising diffusion probabilistic models~(DDPMs) have achieved great success in probabilistic time series imputation tasks with its power to model complex distributions. However, current DDPM-based probabilistic time series imputation methodologies are confronted with two types of challenges: 1)\textit{The backbone modules of the denoising parts are not capable of achieving sequence modeling with low time complexity.} 2)~\textit{The architecture of denoising modules can not handle the dependencies in the time series data effectively.} To address the first challenge, we explore the potential of state space model, namely Mamba, as the backbone denoising module for DDPMs. To tackle the second challenge, we carefully devise several SSM-based blocks for time series data modeling. Experimental results demonstrate that our approach can achieve state-of-the-art time series imputation results on multiple real-world datasets. Our datasets and code are available at \href{https://github.com/decisionintelligence/SSD-TS/}{https://github.com/decisionintelligence/SSD-TS/}
comment: KDD' 25
♻ ☆ Can Masked Autoencoders Also Listen to Birds?
Masked Autoencoders (MAEs) learn rich semantic representations in audio classification through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, revealing the performance limitations of general-domain Audio-MAEs. This work demonstrates that bridging this domain gap domain gap requires full-pipeline adaptation, not just domain-specific pretraining data. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using BirdSet, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results in BirdSet's multi-label classification benchmark. Additionally, we introduce the parameter-efficient prototypical probing, enhancing the utility of frozen MAE representations and closely approaching fine-tuning performance in low-resource settings. Bird-MAE's prototypical probes outperform linear probing by up to 37 percentage points in mean average precision and narrow the gap to fine-tuning across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.
comment: accepted @TMLR: https://openreview.net/forum?id=GIBWR0Xo2J
♻ ☆ Flexible Operator Fusion for Fast Sparse Transformer with Diverse Masking on GPU
Large language models are popular around the world due to their powerful understanding capabilities. As the core component of LLMs, accelerating Transformer through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into Transformer to reduce calculations. However, previous works rarely focus on the performance optimization of sparse Transformer. Moreover, rule-based mechanisms ignore the fusion opportunities of mixed-type operators and fail to adapt to various sequence lengths. To address the above problems, we propose STOF, a framework that incorporates optimizations for Sparse Transformer via flexible masking and operator fusion on GPU. We firstly unify the storage format and kernel implementation for the multi-head attention. Then, we map fusion schemes to compilation templates and determine the optimal parameter setting through a two-stage search engine. The experimental results show that compared to the state-of-the-art work, STOF achieves maximum speedups of 1.7x in MHA computation and 1.5x in end-to-end inference.
♻ ☆ Measurement as Bricolage: Examining How Data Scientists Construct Target Variables for Predictive Modeling Tasks SC
Data scientists often formulate predictive modeling tasks involving fuzzy, hard-to-define concepts, such as the "authenticity" of student writing or the "healthcare need" of a patient. Yet the process by which data scientists translate fuzzy concepts into a concrete, proxy target variable remains poorly understood. We interview fifteen data scientists in education (N=8) and healthcare (N=7) to understand how they construct target variables for predictive modeling tasks. Our findings suggest that data scientists construct target variables through a bricolage process, in which they use creative and pragmatic approaches to make do with the limited data at hand. Data scientists attempt to satisfy five major criteria for a target variable through bricolage: validity, simplicity, predictability, portability, and resource requirements. To achieve this, data scientists adaptively apply problem (re)formulation strategies, such as swapping out one candidate target variable for another when the first fails to meet certain criteria (e.g., predictability), or composing multiple outcomes into a single target variable to capture a more holistic set of modeling objectives. Based on our findings, we present opportunities for future HCI, CSCW, and ML research to better support the art and science of target variable construction.
comment: CSCW 2025
♻ ☆ Age-Normalized HRV Features for Non-Invasive Glucose Prediction: A Pilot Sleep-Aware Machine Learning Study
Non-invasive glucose monitoring remains a critical challenge in the management of diabetes. HRV during sleep shows promise for glucose prediction however, age-related autonomic changes significantly confound traditional HRV analyses. We analyzed 43 subjects with multi-modal data including sleep-stage specific ECG, HRV features, and clinical measurements. A novel age-normalization technique was applied to the HRV features by, dividing the raw values by age-scaled factors. BayesianRidge regression with 5-fold cross-validation was employed for log-glucose prediction. Age-normalized HRV features achieved R2 = 0.161 (MAE = 0.182) for log-glucose prediction, representing a 25.6% improvement over non-normalized features (R2 = 0.132). The top predictive features were hrv rem mean rr age normalized (r = 0.443, p = 0.004), hrv ds mean rr age normalized (r = 0.438, p = 0.005), and diastolic blood pressure (r = 0.437, p = 0.005). Systematic ablation studies confirmed age-normalization as the critical component, with sleep-stage specific features providing additional predictive value. Age-normalized HRV features significantly enhance glucose prediction accuracy compared with traditional approaches. This sleep-aware methodology addresses fundamental limitations in autonomic function assessment and suggests a preliminary feasibility for non-invasive glucose monitoring applications. However, these results require validation in larger cohorts before clinical consideration.
♻ ☆ Active Learning of Mealy Machines with Timers
We present the first algorithm for query learning Mealy machines with timers in a black-box context. Our algorithm is an extension of the L# algorithm of Vaandrager et al. to a timed setting. We rely on symbolic queries which empower us to reason on untimed executions while learning. Similarly to the algorithm for learning timed automata of Waga, these symbolic queries can be realized using finitely many concrete queries. Experiments with a prototype implementation show that our algorithm is able to efficiently learn realistic benchmarks.
comment: 57 pages, 13 figures. Published at QEST+FORMATS 2025
♻ ☆ MEGA: Second-Order Gradient Alignment for Catastrophic Forgetting Mitigation in GFSCIL
Graph Few-Shot Class-Incremental Learning (GFSCIL) enables models to continually learn from limited samples of novel tasks after initial training on a large base dataset. Existing GFSCIL approaches typically utilize Prototypical Networks (PNs) for metric-based class representations and fine-tune the model during the incremental learning stage. However, these PN-based methods oversimplify learning via novel query set fine-tuning and fail to integrate Graph Continual Learning (GCL) techniques due to architectural constraints. To address these challenges, we propose a more rigorous and practical setting for GFSCIL that excludes query sets during the incremental training phase. Building on this foundation, we introduce Model-Agnostic Meta Graph Continual Learning (MEGA), aimed at effectively alleviating catastrophic forgetting for GFSCIL. Specifically, by calculating the incremental second-order gradient during the meta-training stage, we endow the model to learn high-quality priors that enhance incremental learning by aligning its behaviors across both the meta-training and incremental learning stages. Extensive experiments on four mainstream graph datasets demonstrate that MEGA achieves state-of-the-art results and enhances the effectiveness of various GCL methods in GFSCIL. We believe that our proposed MEGA serves as a model-agnostic GFSCIL paradigm, paving the way for future research.
comment: Under Review
♻ ☆ Rectifying Conformity Scores for Better Conditional Coverage
We present a new method for generating confidence sets within the split conformal prediction framework. Our method performs a trainable transformation of any given conformity score to improve conditional coverage while ensuring exact marginal coverage. The transformation is based on an estimate of the conditional quantile of conformity scores. The resulting method is particularly beneficial for constructing adaptive confidence sets in multi-output problems where standard conformal quantile regression approaches have limited applicability. We develop a theoretical bound that captures the influence of the accuracy of the quantile estimate on the approximate conditional validity, unlike classical bounds for conformal prediction methods that only offer marginal coverage. We experimentally show that our method is highly adaptive to the local data structure and outperforms existing methods in terms of conditional coverage, improving the reliability of statistical inference in various applications.
♻ ☆ Disentangled Representation Learning with the Gromov-Monge Gap ICLR 2025
Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that fully preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them as much as can be achieved. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.
comment: ICLR 2025
♻ ☆ FlowState: Sampling Rate Invariant Time Series Forecasting
Foundation models (FMs) have transformed natural language processing, but their success has not yet translated to time series forecasting. Existing time series foundation models (TSFMs), often based on transformer variants, struggle with generalization across varying context and target lengths, lack adaptability to different sampling rates, and are computationally inefficient. We introduce FlowState, a novel TSFM architecture that addresses these challenges through two key innovations: a state space model (SSM) based encoder and a functional basis decoder. This design enables continuous-time modeling and dynamic time-scale adjustment, allowing FlowState to inherently generalize across all possible temporal resolutions, and dynamically adjust the forecasting horizons. In contrast to other state-of-the-art TSFMs, which require training data across all possible sampling rates to memorize patterns at each scale, FlowState inherently adapts its internal dynamics to the input scale, enabling smaller models, reduced data requirements, and improved efficiency. We further propose an efficient pretraining strategy that improves robustness and accelerates training. Despite being the smallest model, FlowState outperforms all other models and is state-of-the-art for the GIFT-ZS and the Chronos-ZS benchmarks. Ablation studies confirm the effectiveness of its components, and we demonstrate its unique ability to adapt online to varying input sampling rates.
comment: Currently under review
♻ ☆ Environmental Feature Engineering and Statistical Validation for ML-Based Path Loss Prediction
Wireless communications rely on path loss modeling, which is most effective when it includes the physical details of the propagation environment. Acquiring this data has historically been challenging, but geographic information systems data is becoming increasingly available with higher resolution and accuracy. Access to such details enables propagation models to more accurately predict coverage and account for interference in wireless deployments. Machine learning-based modeling can significantly support this effort, with feature based approaches allowing for accurate, efficient, and scalable propagation modeling. Building on previous work, we introduce an extended set of features that improves prediction accuracy while, most importantly, proving model generalization through rigorous statistical assessment and the use of test set holdouts.
comment: 4 pages, 4 figures, AWPL paper
♻ ☆ Rethinking Weight-Averaged Model-merging
Model merging, particularly through weight averaging, has shown surprising effectiveness in saving computations and improving model performance without any additional training. However, the interpretability of why and how this technique works remains unclear. In this work, we reinterpret weight-averaged model merging through the lens of interpretability and provide empirical insights into the underlying mechanisms that govern its behavior. We approach the problem from three perspectives: (1) we analyze the learned weight structures and demonstrate that model weights encode structured representations that help explain the compatibility of weight averaging; (2) we compare averaging in weight space and feature space across diverse model architectures (CNNs and ViTs) and datasets, aiming to expose under which circumstances what combination paradigm will work more effectively; (3) we study the effect of parameter scaling on prediction stability, highlighting how weight averaging acts as a form of regularization that contributes to robustness. By framing these analyses in an interpretability context, our work contributes to a more transparent and systematic understanding of model merging for stakeholders interested in the safety and reliability of untrained model combination methods. The code is available at https://github.com/billhhh/Rethink-Merge.
♻ ☆ What Matters for Bioacoustic Encoding
Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
♻ ☆ Recommendations with Sparse Comparison Data: Provably Fast Convergence for Nonconvex Matrix Factorization ICML 2025
This paper provides a theoretical analysis of a new learning problem for recommender systems where users provide feedback by comparing pairs of items instead of rating them individually. We assume that comparisons stem from latent user and item features, which reduces the task of predicting preferences to learning these features from comparison data. Similar to the classical matrix factorization problem, the main challenge in this learning task is that the resulting loss function is nonconvex. Our analysis shows that the loss function exhibits (restricted) strong convexity near the true solution, which ensures gradient-based methods converge exponentially, given an appropriate warm start. Importantly, this result holds in a sparse data regime, where each user compares only a few pairs of items. Our main technical contribution is to extend certain concentration inequalities commonly used in matrix completion to our model. Our work demonstrates that learning personalized recommendations from comparison data is computationally and statistically efficient.
comment: Accepted for publication in ICML 2025
♻ ☆ Towards Theoretical Understanding of Transformer Test-Time Computing: Investigation on In-Context Linear Regression
Using more test-time computation during language model inference, such as generating more intermediate thoughts or sampling multiple candidate answers, has proven effective in significantly improving model performance. This paper takes an initial step toward bridging the gap between practical language model inference and theoretical transformer analysis by incorporating randomness and sampling. We focus on in-context linear regression with continuous/binary coefficients, where our framework simulates language model decoding through noise injection and binary coefficient sampling. Through this framework, we provide detailed analyses of widely adopted inference techniques. Supported by empirical results, our theoretical framework and analysis demonstrate the potential for offering new insights into understanding inference behaviors in real-world language models.
♻ ☆ Enhancing Cost Efficiency in Active Learning with Candidate Set Query
This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 48% on ImageNet64x64. The project page can be found at https://yehogwon.github.io/csq-al.
comment: Accepted to TMLR
♻ ☆ HRS: Hybrid Representation Framework with Scheduling Awareness for Time Series Forecasting in Crowdsourced Cloud-Edge Platforms ECAI2025
With the rapid proliferation of streaming services, network load exhibits highly time-varying and bursty behavior, posing serious challenges for maintaining Quality of Service (QoS) in Crowdsourced Cloud-Edge Platforms (CCPs). While CCPs leverage Predict-then-Schedule architecture to improve QoS and profitability, accurate load forecasting remains challenging under traffic surges. Existing methods either minimize mean absolute error, resulting in underprovisioning and potential Service Level Agreement (SLA) violations during peak periods, or adopt conservative overprovisioning strategies, which mitigate SLA risks at the expense of increased resource expenditure. To address this dilemma, we propose HRS, a hybrid representation framework with scheduling awareness that integrates numerical and image-based representations to better capture extreme load dynamics. We further introduce a Scheduling-Aware Loss (SAL) that captures the asymmetric impact of prediction errors, guiding predictions that better support scheduling decisions. Extensive experiments on four real-world datasets demonstrate that HRS consistently outperforms ten baselines and achieves state-of-the-art performance, reducing SLA violation rates by 63.1% and total profit loss by 32.3%.
comment: 10 pages, 14 figures, ECAI2025
♻ ☆ Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent
In this paper, we establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing the confidence sets using the Stochastic Gradient Descent (SGD) algorithm. Under appropriate regularity conditions, our approach avoids the need to approximate the limiting covariance of Polyak-Ruppert SGD iterates, which allows us to derive approximation rates in convex distance of order up to $1/\sqrt{n}$. Notably, this rate can be faster than the one that can be proven in the Polyak-Juditsky central limit theorem. To our knowledge, this provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. Our analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.
comment: Added a lower bound for the accuracy of Gaussian approximation and relaxed the conditions of the main theorem
♻ ☆ Development of Pre-Trained Transformer-based Models for the Nepali Language
Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali Language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on Nep-gLUE benchmark, scoring 95.60 and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.
♻ ☆ RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening
Pansharpening refers to the process of integrating a high resolution panchromatic (PAN) image with a lower resolution multispectral (MS) image to generate a fused product, which is pivotal in remote sensing. Despite the effectiveness of CNNs in addressing this challenge, they are inherently constrained by the uniform application of convolutional kernels across all spatial positions, overlooking local content variations. To overcome this issue, we introduce RAPNet, a new architecture that leverages content-adaptive convolution. At its core, RAPNet employs the Receptive-field Adaptive Pansharpening Convolution (RAPConv), designed to produce spatially adaptive kernels responsive to local feature context, thereby enhancing the precision of spatial detail extraction. Additionally, the network integrates the Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an attention mechanism to achieve an optimal balance between spatial detail enhancement and spectral fidelity. Comprehensive evaluations on publicly available datasets confirm that RAPNet delivers superior performance compared to existing approaches, as demonstrated by both quantitative metrics and qualitative assessments. Ablation analyses further substantiate the effectiveness of the proposed adaptive components.
comment: Accepted by the 6th International Conference on Artificial Intelligence and Electromechanical Automation (AIEA 2025). 5 pages, 6 figures
♻ ☆ Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines SC'21
Training large deep learning models at scale is very challenging. This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models. Chimera is a synchronous approach and therefore no loss of accuracy, which is more convergence-friendly than asynchronous approaches. Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%; benefiting from the sophisticated scheduling of bidirectional pipelines, Chimera has a more balanced activation memory consumption. Evaluations are conducted on Transformer based language models. For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
comment: Published in Proceedings of the 2021 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'21), November 2021, Article No.: 27, Pages 1-14. Best Paper Finalist
♻ ☆ Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy
Relevance and utility are two frequently used measures to evaluate the effectiveness of an information retrieval (IR) system. Relevance emphasizes the aboutness of a result to a query, while utility refers to the result's usefulness or value to an information seeker. In Retrieval-Augmented Generation (RAG), high-utility results should be prioritized to feed to LLMs due to their limited input bandwidth. Re-examining RAG's three core components -- relevance ranking derived from retrieval models, utility judgments, and answer generation -- aligns with Schutz's philosophical system of relevances, which encompasses three types of relevance representing different levels of human cognition that enhance each other. These three RAG components also reflect three cognitive levels for LLMs in question-answering. Therefore, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step in RAG. We conducted extensive experiments on retrieval (TREC DL, WebAP), utility judgment task (GTI-NQ), and factoid question-answering (NQ) datasets. Experimental results demonstrate significant improvements of ITEM in utility judgments, ranking, and answer generation upon representative baselines.
comment: 22 pages
♻ ☆ Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs
RISC-V provides a flexible and scalable platform for applications ranging from embedded devices to high-performance computing clusters. Particularly, its RISC-V Vector Extension (RVV) becomes of interest for the acceleration of AI workloads. But writing software that efficiently utilizes the vector units of RISC-V CPUs without expert knowledge requires the programmer to rely on the autovectorization features of compilers or hand-crafted libraries like muRISCV-NN. Smarter approaches, like autotuning frameworks, have been missing the integration with the RISC-V RVV extension, thus heavily limiting the efficient deployment of complex AI workloads. In this paper, we present a workflow based on the TVM compiler to efficiently map AI workloads onto RISC-V vector units. Instead of relying on hand-crafted libraries, we integrated the RVV extension into TVM's MetaSchedule framework, a probabilistic program framework for tensor operation tuning. We implemented different RISC-V SoCs on an FPGA and tuned a wide range of AI workloads on them. We found that our proposal shows a mean improvement of 46% in execution latency when compared against the autovectorization feature of GCC, and 29% against muRISCV-NN. Moreover, the binary resulting from our proposal has a smaller code memory footprint, making it more suitable for embedded devices. Finally, we also evaluated our solution on a commercially available RISC-V SoC implementing the RVV 1.0 Vector Extension and found our solution is able to find mappings that are 35% faster on average than the ones proposed by LLVM. We open-sourced our proposal for the community to expand it to target other RISC-V extensions.
comment: 9 pages, 10 figures, 2 algorithms
♻ ☆ Predictable Scale: Part I, Step Law -- Optimal Hyperparameter Scaling Law in Large Language Model Pretraining
The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well established, yet their effective deployment necessitates careful hyperparameter optimization. Although existing methods have explored the influence of hyperparameters on model performance, a principled and generalizable framework across model architectures and data recipes remains absent. In this study, we conduct an unprecedented empirical investigation training over 3,700 LLMs from scratch across 100 trillion tokens, consuming nearly one million NVIDIA H800 GPU hours to establish a universal Scaling Law for hyperparameter optimization in LLM Pre-training, called Step Law. We empirically observe that, under fixed model size ($N$) and dataset size ($D$), the hyperparameter landscape exhibits convexity with a broad optimum, substantially reducing the complexity of hyperparameter search. Building on this insight, we formally define and empirically validate the Step Law: The optimal learning rate follows a power-law relationship with $N$ and $D$, while the optimal batch size is primarily influenced by $D$ and remains largely invariant to $N$.Notably, our estimated optima deviate from the global best performance found via exhaustive search by merely 0.094\% on the test set. To our best known, Step Law is the first that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, as well as establishes optimal hyperparameter scaling laws across diverse data recipes. We contribute a universal, plug-and-play optimal hyperparameter tool for the community, which is expected to advance efficient LLM training at scale. All experimental code, data and checkpoints are publicly available at https://github.com/step-law/steplaw
comment: 22 pages
♻ ☆ Learning In-context n-grams with Transformers: Sub-n-grams Are Near-stationary Points ICML2025
Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: {sub-$n$-grams are near-stationary points of the population cross-entropy loss}, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
comment: ICML2025
♻ ☆ Parameter-Efficient Continual Fine-Tuning: A Survey
The emergence of large pre-trained networks has revolutionized the AI field, unlocking new possibilities and achieving unprecedented performance. However, these models inherit a fundamental limitation from traditional Machine Learning approaches: their strong dependence on the \textit{i.i.d.} assumption hinders their adaptability to dynamic learning scenarios. We believe the next breakthrough in AI lies in enabling efficient adaptation to evolving environments -- such as the real world -- where new data and tasks arrive sequentially. This challenge defines the field of Continual Learning (CL), a Machine Learning paradigm focused on developing lifelong learning neural models. One alternative to efficiently adapt these large-scale models is known Parameter-Efficient Fine-Tuning (PEFT). These methods tackle the issue of adapting the model to a particular data or scenario by performing small and efficient modifications, achieving similar performance to full fine-tuning. However, these techniques still lack the ability to adjust the model to multiple tasks continually, as they suffer from the issue of Catastrophic Forgetting. In this survey, we first provide an overview of CL algorithms and PEFT methods before reviewing the state-of-the-art on Parameter-Efficient Continual Fine-Tuning (PECFT). We examine various approaches, discuss evaluation metrics, and explore potential future research directions. Our goal is to highlight the synergy between CL and Parameter-Efficient Fine-Tuning, guide researchers in this field, and pave the way for novel future research directions.
♻ ☆ Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021
♻ ☆ Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations PPoPP'20
Load imbalance pervasively exists in distributed deep learning training systems, either caused by the inherent imbalance in learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes the global synchronization for decentralized accumulation. To implement eager-SGD, we propose to use two partial collectives: solo and majority. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute gradients before continuing, all without using a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results on load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves 1.27x speedup over the state-of-the-art synchronous SGD, without losing accuracy.
comment: Published in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'20), pp. 45-61. 2020
♻ ☆ Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. These complementary characteristics underscore the potential of integrating frame and event data for optical flow estimation. However, most cross-modal approaches fail to fully utilize the complementary advantages, relying instead on simply stacking information. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality, achieving effective cross-modal fusion. Specifically, we propose an event-enhanced frame representation that preserves the rich texture of frames and the basic structure of events. We use the enhanced representation as the guiding modality and employ events to capture temporally dense motion information. The robust motion features derived from the guiding modality direct the aggregation of motion information from events. To further enhance fusion, we propose a transformer-based module that complements sparse event motion features with spatially rich frame information and enhances global information propagation. Additionally, a mix-fusion encoder is designed to extract comprehensive spatiotemporal contextual features from both modalities. Extensive experiments on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our framework. Leveraging the complementary strengths of frames and events, our method achieves leading performance on the DSEC-Flow dataset. Compared to the event-only model, frame guidance improves accuracy by 10\%. Furthermore, it outperforms the state-of-the-art fusion-based method with a 4\% accuracy gain and a 45\% reduction in inference time.
comment: 11 pages, 8 figures, under review
♻ ☆ Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks
Fine-tuning as service injects domain-specific knowledge into large language models (LLMs), while challenging the original alignment mechanisms and introducing safety risks. A series of defense strategies have been proposed for the alignment, fine-tuning, and post-fine-tuning phases, where most post-fine-tuning defenses rely on coarse-grained safety layer mapping. These methods lack a comprehensive consideration of both safety layers and fine-grained neurons, limiting their ability to efficiently balance safety and utility. To address this, we propose the Fine-Grained Safety Neurons (FGSN) with Training-Free Continual Projection method to reduce the fine-tuning safety risks. FGSN inherently integrates the multi-scale interactions between safety layers and neurons, localizing sparser and more precise fine-grained safety neurons while minimizing interference with downstream task neurons. We then project the safety neuron parameters onto safety directions, improving model safety while aligning more closely with human preferences. Extensive experiments across multiple fine-tuned LLM models demonstrate that our method significantly reduce harmfulness scores and attack success rates with minimal parameter modifications, while preserving the model's utility. Furthermore, by introducing a task-specific, multi-dimensional heterogeneous safety neuron cluster optimization mechanism, we achieve continual defense and generalization capability against unforeseen emerging safety concerns.
♻ ☆ Reinforcement Learning for Solving the Pricing Problem in Column Generation: Applications to Vehicle Routing
In this paper, we address the problem of Column Generation (CG) using Reinforcement Learning (RL). Specifically, we use a RL model based on the attention-mechanism architecture to find the columns with most negative reduced cost in the Pricing Problem (PP). Unlike previous Machine Learning (ML) applications for CG, our model deploys an end-to-end mechanism as it independently solves the pricing problem without the help of any heuristic. We consider a variant of Vehicle Routing Problem (VRP) as a case study for our method. Through a set of experiments where our method is compared against a Dynamic Programming (DP)-based heuristic for solving the PP, we show that our method solves the linear relaxation up to a reasonable objective gap in significantly shorter running times.
comment: 24 pages, 7 figures, 5 tables, Journal Submission
♻ ☆ Joint Learning of Energy-based Models and their Partition Function
Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function (normalization constant). In this paper, we propose a novel formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, both parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multilabel classification and label ranking.
♻ ☆ Correlations Are Ruining Your Gradient Descent
Herein the topics of (natural) gradient descent, data decorrelation, and approximate methods for backpropagation are brought into a common discussion. Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes. We extend this perspective and show that to fully solve the problem illuminated by natural gradients in neural networks, one must recognise that correlations in the data at any linear transformation, including node responses at every layer of a neural network, cause a non-orthonormal relationship between the model's parameters. To solve this requires a method for decorrelating inputs at each individual layer of a neural network. We describe a range of methods which have been proposed for decorrelation and whitening of node output, and expand on these to provide a novel method specifically useful for distributed computing and computational neuroscience. Implementing decorrelation within multi-layer neural networks, we can show that not only is training via backpropagation sped up significantly but also existing approximations of backpropagation, which have failed catastrophically in the past, benefit significantly in their accuracy and convergence speed. This has the potential to provide a route forward for approximate gradient descent methods which have previously been discarded, training approaches for analogue and neuromorphic hardware, and potentially insights as to the efficacy and utility of decorrelation processes in the brain.
comment: 15 pages, 4 figures
♻ ☆ Dynamics-Informed Reservoir Computing with Visibility Graphs
Accurate prediction of complex and nonlinear time series remains a challenging problem across engineering and scientific disciplines. Reservoir computing (RC) offers a computationally efficient alternative to traditional deep learning by training only the read-out layer while employing a randomly structured and fixed reservoir network. Despite its advantages, the largely random reservoir graph architecture often results in suboptimal and oversized networks with poorly understood dynamics. Addressing this issue, we propose a novel Dynamics-Informed Reservoir Computing (DyRC) framework that systematically infers the reservoir network structure directly from the input training sequence. This work proposes to employ the visibility graph (VG) technique, which converts time series data into networks by representing measurement points as nodes linked by mutual visibility. The reservoir network is constructed by directly adopting the VG network from a training data sequence, leveraging the parameter-free visibility graph approach to avoid expensive hyperparameter tuning. This process results in a reservoir that is directly informed by the specific dynamics of the prediction task under study. We assess the DyRC-VG method through prediction tasks involving the canonical nonlinear Duffing oscillator, evaluating prediction accuracy and consistency. Compared to an Erd\H{o}s-R\'enyi (ER) graph of the same size, spectral radius, and fixed density, we observe higher prediction quality and more consistent performance over repeated implementations in the DyRC-VG. An ER graph with density matched to the DyRC-VG can in some conditions outperform both approaches.
comment: 7 pages, 5 figures. The following article has been submitted to by Chaos: An Interdisciplinary Journal of Nonlinear Science
♻ ☆ Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
Large Vision-Language Models (LVLMs) have demonstrated remarkable advancements in numerous areas such as multimedia. However, hallucination issues significantly limit their credibility and application potential. Existing mitigation methods typically rely on external tools or the comparison of multi-round inference, which significantly increase inference time. In this paper, we propose \textbf{SE}lf-\textbf{E}volving \textbf{D}istillation (\textbf{SEED}), which identifies hallucinations within the inner knowledge of LVLMs, isolates and purges them, and then distills the purified knowledge back into the model, enabling self-evolution. Furthermore, we identified that traditional distillation methods are prone to inducing void spaces in the output space of LVLMs. To address this issue, we propose a Mode-Seeking Evolving approach, which performs distillation to capture the dominant modes of the purified knowledge distribution, thereby avoiding the chaotic results that could emerge from void spaces. Moreover, we introduce a Hallucination Elimination Adapter, which corrects the dark knowledge of the original model by learning purified knowledge. Extensive experiments on multiple benchmarks validate the superiority of our SEED, demonstrating substantial improvements in mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination evaluation metric POPE-Random improved from 81.3 to 88.3.
comment: In Figure 2, the correlation coefficient and the scatter plot do not match. I calculated this correlation using two sets of settings. I used the scatter plot from setting A, but accidentally wrote the correlation coefficient, r, from setting B
♻ ☆ LaDi-WM: A Latent Diffusion-based World Model for Predictive Manipulation
Predictive manipulation has recently gained considerable attention in the Embodied AI community due to its potential to improve robot policy performance by leveraging predicted states. However, generating accurate future visual states of robot-object interactions from world models remains a well-known challenge, particularly in achieving high-quality pixel-level representations. To this end, we propose LaDi-WM, a world model that predicts the latent space of future states using diffusion modeling. Specifically, LaDi-WM leverages the well-established latent space aligned with pre-trained Visual Foundation Models (VFMs), which comprises both geometric features (DINO-based) and semantic features (CLIP-based). We find that predicting the evolution of the latent space is easier to learn and more generalizable than directly predicting pixel-level images. Building on LaDi-WM, we design a diffusion policy that iteratively refines output actions by incorporating forecasted states, thereby generating more consistent and accurate results. Extensive experiments on both synthetic and real-world benchmarks demonstrate that LaDi-WM significantly enhances policy performance by 27.9\% on the LIBERO-LONG benchmark and 20\% on the real-world scenario. Furthermore, our world model and policies achieve impressive generalizability in real-world experiments.
comment: CoRL 2025
♻ ☆ Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks
Image colorization, the task of adding colors to grayscale images, has been the focus of significant research efforts in computer vision in recent years for its various application areas such as color restoration and automatic animation colorization [15, 1]. The colorization problem is challenging as it is highly ill-posed with two out of three image dimensions lost, resulting in large degrees of freedom. However, semantics of the scene as well as the surface texture could provide important cues for colors: the sky is typically blue, the clouds are typically white and the grass is typically green, and there are huge amounts of training data available for learning such priors since any colored image could serve as a training data point [20]. Colorization is initially formulated as a regression task[5], which ignores the multi-modal nature of color prediction. In this project, we explore automatic image colorization via classification and adversarial learning. We will build our models on prior works, apply modifications for our specific scenario and make comparisons.
comment: All authors have equal authorship and equal contribution, ranked in alphabetic order. First version of this paper was completed and published in 2021
♻ ☆ Hybrid Machine Learning Model with a Constrained Action Space for Trajectory Prediction
Trajectory prediction is crucial to advance autonomous driving, improving safety, and efficiency. Although end-to-end models based on deep learning have great potential, they often do not consider vehicle dynamic limitations, leading to unrealistic predictions. To address this problem, this work introduces a novel hybrid model that combines deep learning with a kinematic motion model. It is able to predict object attributes such as acceleration and yaw rate and generate trajectories based on them. A key contribution is the incorporation of expert knowledge into the learning objective of the deep learning model. This results in the constraint of the available action space, thus enabling the prediction of physically feasible object attributes and trajectories, thereby increasing safety and robustness. The proposed hybrid model facilitates enhanced interpretability, thereby reinforcing the trustworthiness of deep learning methods and promoting the development of safe planning solutions. Experiments conducted on the publicly available real-world Argoverse dataset demonstrate realistic driving behaviour, with benchmark comparisons and ablation studies showing promising results.
comment: Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works
♻ ☆ Adaptation and Optimization of Automatic Speech Recognition (ASR) for the Maritime Domain in the Field of VHF Communication
This paper introduces a multilingual automatic speech recognizer (ASR) for maritime radio communi-cation that automatically converts received VHF radio signals into text. The challenges of maritime radio communication are described at first, and the deep learning architecture of marFM consisting of audio processing techniques and machine learning algorithms is presented. Subsequently, maritime radio data of interest is analyzed and then used to evaluate the transcription performance of our ASR model for various maritime radio data.
♻ ☆ Comparative Analysis of Time Series Foundation Models for Demographic Forecasting: Enhancing Predictive Accuracy in US Population Dynamics
Demographic shifts, influenced by globalization, economic conditions, geopolitical events, and environmental factors, pose significant challenges for policymakers and researchers. Accurate demographic forecasting is essential for informed decision-making in areas such as urban planning, healthcare, and economic policy. This study explores the application of time series foundation models to predict demographic changes in the United States using datasets from the U.S. Census Bureau and Federal Reserve Economic Data (FRED). We evaluate the performance of the Time Series Foundation Model (TimesFM) against traditional baselines including Long Short-Term Memory (LSTM) networks, Autoregressive Integrated Moving Average (ARIMA), and Linear Regression. Our experiments across six demographically diverse states demonstrate that TimesFM achieves the lowest Mean Squared Error (MSE) in 86.67% of test cases, with particularly strong performance on minority populations with sparse historical data. These findings highlight the potential of pre-trained foundation models to enhance demographic analysis and inform proactive policy interventions without requiring extensive task-specific fine-tuning.
comment: 6 pages, 4 figures, 3 tables
♻ ☆ Efficient and Verifiable Privacy-Preserving Convolutional Computation for CNN Inference with Untrusted Clouds
The widespread adoption of convolutional neural networks (CNNs) in resource-constrained scenarios has driven the development of Machine Learning as a Service (MLaaS) system. However, this approach is susceptible to privacy leakage, as the data sent from the client to the untrusted cloud server often contains sensitive information. Existing CNN privacy-preserving schemes, while effective in ensuring data confidentiality through homomorphic encryption and secret sharing, face efficiency bottlenecks, particularly in convolution operations. In this paper, we propose a novel verifiable privacy-preserving scheme tailored for CNN convolutional layers. Our scheme enables efficient encryption and decryption, allowing resource-constrained clients to securely offload computations to the untrusted cloud server. Additionally, we present a verification mechanism capable of detecting the correctness of the results with a success probability of at least $1-\frac{1}{\left|Z\right|}$. Extensive experiments conducted on 10 datasets and various CNN models demonstrate that our scheme achieves speedups ranging $26 \times$ ~ $\ 87\times$ compared to the original plaintext model while maintaining accuracy.
comment: Conference link: [ICIC 2025](http://www.ic-icc.cn/2025/index.php) will provide further details
♻ ☆ Efficient Network Automatic Relevance Determination ICML 2025
We propose Network Automatic Relevance Determination (NARD), an extension of ARD for linearly probabilistic models, to simultaneously model sparse relationships between inputs $X \in \mathbb R^{d \times N}$ and outputs $Y \in \mathbb R^{m \times N}$, while capturing the correlation structure among the $Y$. NARD employs a matrix normal prior which contains a sparsity-inducing parameter to identify and discard irrelevant features, thereby promoting sparsity in the model. Algorithmically, it iteratively updates both the precision matrix and the relationship between $Y$ and the refined inputs. To mitigate the computational inefficiencies of the $\mathcal O(m^3 + d^3)$ cost per iteration, we introduce Sequential NARD, which evaluates features sequentially, and a Surrogate Function Method, leveraging an efficient approximation of the marginal likelihood and simplifying the calculation of determinant and inverse of an intermediate matrix. Combining the Sequential update with the Surrogate Function method further reduces computational costs. The computational complexity per iteration for these three methods is reduced to $\mathcal O(m^3+p^3)$, $\mathcal O(m^3 + d^2)$, $\mathcal O(m^3+p^2)$, respectively, where $p \ll d$ is the final number of features in the model. Our methods demonstrate significant improvements in computational efficiency with comparable performance on both synthetic and real-world datasets.
comment: ICML 2025
♻ ☆ A Hierarchical Surrogate Model for Efficient Multi-Task Parameter Learning in Closed-Loop Control
Many control problems require repeated tuning and adaptation of controllers across distinct closed-loop tasks, where data efficiency and adaptability are critical. We propose a hierarchical Bayesian optimization (BO) framework that is tailored to efficient controller parameter learning in sequential decision-making and control scenarios for distinct tasks. Instead of treating the closed-loop cost as a black-box, our method exploits structural knowledge of the underlying problem, consisting of a dynamical system, a control law, and an associated closed-loop cost function. We construct a hierarchical surrogate model using Gaussian processes that capture the closed-loop state evolution under different parameterizations, while the task-specific weighting and accumulation into the closed-loop cost are computed exactly via known closed-form expressions. This allows knowledge transfer and enhanced data efficiency between different closed-loop tasks. The proposed framework retains sublinear regret guarantees on par with standard black-box BO, while enabling multi-task or transfer learning. Simulation experiments with model predictive control demonstrate substantial benefits in both sample efficiency and adaptability when compared to purely black-box BO approaches.
comment: 8 pages, 4 figures, accepted for CDC 2025
♻ ☆ Contrastive Learning on Multimodal Analysis of Electronic Health Records
Electronic health record (EHR) systems contain a wealth of multimodal clinical data including structured data like clinical codes and unstructured data such as clinical notes. However, many existing EHR-focused studies has traditionally either concentrated on an individual modality or merged different modalities in a rather rudimentary fashion. This approach often results in the perception of structured and unstructured data as separate entities, neglecting the inherent synergy between them. Specifically, the two important modalities contain clinically relevant, inextricably linked and complementary health information. A more complete picture of a patient's medical history is captured by the joint analysis of the two modalities of data. Despite the great success of multimodal contrastive learning on vision-language, its potential remains under-explored in the realm of multimodal EHR, particularly in terms of its theoretical understanding. To accommodate the statistical analysis of multimodal EHR data, in this paper, we propose a novel multimodal feature embedding generative model and design a multimodal contrastive loss to obtain the multimodal EHR feature representation. Our theoretical analysis demonstrates the effectiveness of multimodal learning compared to single-modality learning and connects the solution of the loss function to the singular value decomposition of a pointwise mutual information matrix. This connection paves the way for a privacy-preserving algorithm tailored for multimodal EHR feature representation learning. Simulation studies show that the proposed algorithm performs well under a variety of configurations. We further validate the clinical utility of the proposed algorithm in real-world EHR data.
comment: 34 pages
♻ ☆ Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder
Deep neural networks are vulnerable to backdoor attacks, where an adversary manipulates the model behavior through overlaying images with special triggers. Existing backdoor defense methods often require accessing a few validation data and model parameters, which is impractical in many real-world applications, e.g., when the model is provided as a cloud service. In this paper, we address the practical task of blind backdoor defense at test time, in particular for local attacks and black-box models. The true label of every test image needs to be recovered on the fly from a suspicious model regardless of image benignity. We consider test-time image purification that incapacitates local triggers while keeping semantic contents intact. Due to diverse trigger patterns and sizes, the heuristic trigger search can be unscalable. We circumvent such barrier by leveraging the strong reconstruction power of generative models, and propose Blind Defense with Masked AutoEncoder (BDMAE). BDMAE detects possible local triggers using image structural similarity and label consistency between the test image and MAE restorations. The detection results are then refined by considering trigger topology. Finally, we fuse MAE restorations adaptively into a purified image for making prediction. Extensive experiments under different backdoor settings validate its effectiveness and generalizability.
♻ ☆ Improving DAPO from a Mixed-Policy Perspective
This paper introduces two novel modifications to the Dynamic sAmpling Policy Optimization (DAPO) algorithm [1], approached from a mixed-policy perspective. Standard policy gradient methods can suffer from instability and sample inefficiency, particularly in sparse reward settings. To address this, we first propose a method that incorporates a pre-trained, stable guiding policy ($\piphi$) to provide off-policy experience, thereby regularizing the training of the target policy ($\pion$). This approach improves training stability and convergence speed by adaptively adjusting the learning step size. Secondly, we extend this idea to re-utilize zero-reward samples, which are often discarded by dynamic sampling strategies like DAPO's. By treating these samples as a distinct batch guided by the expert policy, we further enhance sample efficiency. We provide a theoretical analysis for both methods, demonstrating that their objective functions converge to the optimal solution within the established theoretical framework of reinforcement learning. The proposed mixed-policy framework effectively balances exploration and exploitation, promising more stable and efficient policy optimization.
♻ ☆ Robustly estimating heterogeneity in factorial data using Rashomon Partitions
In both observational data and randomized control trials, researchers select statistical models to articulate how the outcome of interest varies with combinations of observable covariates. Choosing a model that is too simple can obfuscate important heterogeneity in outcomes between covariate groups, while too much complexity risks identifying spurious patterns. In this paper, we propose a novel Bayesian framework for model uncertainty called Rashomon Partition Sets (RPSs). The RPS consists of all models that have posterior density close to the maximum a posteriori (MAP) model. We construct the RPS by enumeration, rather than sampling, which ensures that we explore all models models with high evidence in the data, even if they offer dramatically different substantive explanations. We use a l0 prior, which allows the allows us to capture complex heterogeneity without imposing strong assumptions about the associations between effects, showing this prior is minimax optimal from an information-theoretic perspective. We characterize the approximation error of (functions of) parameters computed conditional on being in the RPS relative to the entire posterior. We propose an algorithm to enumerate the RPS from the class of models that are interpretable and unique, then provide bounds on the size of the RPS. We give simulation evidence along with three empirical examples: price effects on charitable giving, heterogeneity in chromosomal structure, and the introduction of microfinance.
♻ ☆ Sample Complexity of Diffusion Model Training Without Empirical Risk Minimizer Access
Diffusion models have demonstrated state-of-the-art performance across vision, language, and scientific domains. Despite their empirical success, prior theoretical analyses of the sample complexity suffer from poor scaling with input data dimension or rely on unrealistic assumptions such as access to exact empirical risk minimizers. In this work, we provide a principled analysis of score estimation, establishing a sample complexity bound of $\widetilde{\mathcal{O}}(\epsilon^{-6})$. Our approach leverages a structured decomposition of the score estimation error into statistical, approximation, and optimization errors, enabling us to eliminate the exponential dependence on neural network parameters that arises in prior analyses. It is the first such result which achieves sample complexity bounds without assuming access to the empirical risk minimizer of score function estimation loss.
♻ ☆ Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.
comment: 16 pages, 5 figures
♻ ☆ Position: We Need Responsible, Application-Driven (RAD) AI Research
This position paper argues that achieving meaningful scientific and societal advances with artificial intelligence (AI) requires a responsible, application-driven approach (RAD) to AI research. As AI is increasingly integrated into society, AI researchers must engage with the specific contexts where AI is being applied. This includes being responsive to ethical and legal considerations, technical and societal constraints, and public discourse. We present the case for RAD-AI to drive research through a three-staged approach: (1) building transdisciplinary teams and people-centred studies; (2) addressing context-specific methods, ethical commitments, assumptions, and metrics; and (3) testing and sustaining efficacy through staged testbeds and a community of practice. We present a vision for the future of application-driven AI research to unlock new value through technically feasible methods that are adaptive to the contextual needs and values of the communities they ultimately serve.
comment: 12 pages, 1 figure, Final Camera Ready version for proceedings with updated figure color and reference, Accepted to Proceedings of the 41 st International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025
♻ ☆ MR-EEGWaveNet: Multiresolutional EEGWaveNet for Seizure Detection from Long EEG Recordings
Feature engineering for generalized seizure detection models remains a significant challenge. Recently proposed models show variable performance depending on the training data and remain ineffective at accurately distinguishing artifacts from seizure data. In this study, we propose a novel end-to-end model, "Multiresolutional EEGWaveNet (MR-EEGWaveNet)," which efficiently distinguishes seizure events from background electroencephalogram (EEG) and artifacts/noise by capturing both temporal dependencies across different time frames and spatial relationships between channels. The model has three modules: convolution, feature extraction, and predictor. The convolution module extracts features through depth-wise and spatio-temporal convolution. The feature extraction module individually reduces the feature dimension extracted from EEG segments and their sub-segments. Subsequently, the extracted features are concatenated into a single vector for classification using a fully connected classifier called the predictor module. In addition, an anomaly score-based post-classification processing technique is introduced to reduce the false-positive rates of the model. Experimental results are reported and analyzed using different parameter settings and datasets (Siena (public) and Juntendo (private)). The proposed MR-EEGWaveNet significantly outperformed the conventional non-multiresolution approach, improving the F1 scores from 0.177 to 0.336 on Siena and 0.327 to 0.488 on Juntendo, with precision gains of 15.9% and 20.62%, respectively.
comment: 33 pages, 10 figures, 18 tables
♻ ☆ Disciplined Geodesically Convex Programming
Convex programming plays a fundamental role in machine learning, data science, and engineering. Testing convexity structure in nonlinear programs relies on verifying the convexity of objectives and constraints. Grant et al. (2006) introduced a framework, Disciplined Convex Programming (DCP), for automating this verification task for a wide range of convex functions that can be decomposed into basic convex functions (atoms) using convexity-preserving compositions and transformations (rules). Here, we extend this framework to functions defined on manifolds with non-positive curvature (Hadamard manifolds) by introducing Disciplined Geodesically Convex Programming (DGCP). In particular, this allows for verifying a broader range of convexity notions. For instance, many notable instances of statistical estimators and matrix-valued (sub)routines in machine learning applications are Euclidean non-convex, but exhibit geodesic convexity through a more general Riemannian lens. To define the DGCP framework, we determine convexity-preserving compositions and transformations for geodesically convex functions on general Hadamard manifolds, as well as for the special case of symmetric positive definite matrices, a common setting in matrix-valued optimization. For the latter, we also define a basic set of atoms. Our paper is accompanied by a Julia package SymbolicAnalysis.jl, which provides functionality for testing and certifying DGCP-compliant expressions. Our library interfaces with manifold optimization software, which allows for directly solving verified geodesically convex programs.
♻ ☆ Always Skip Attention ICCV 2025
We highlight a curious empirical result within modern Vision Transformers (ViTs). Specifically, self-attention catastrophically fails to train unless it is used in conjunction with a skip connection. This is in contrast to other elements of a ViT that continue to exhibit good performance (albeit suboptimal) when skip connections are removed. Further, we show that this critical dependence on skip connections is a relatively new phenomenon, with previous deep architectures (\eg, CNNs) exhibiting good performance in their absence. In this paper, we theoretically characterize that the self-attention mechanism is fundamentally ill-conditioned and is, therefore, uniquely dependent on skip connections for regularization. Additionally, we propose Token Graying -- a simple yet effective complement (to skip connections) that further improves the condition of input tokens. We validate our approach in both supervised and self-supervised training methods.
comment: This work has just been accepted by ICCV 2025
♻ ☆ A Causal Graph-Enhanced Gaussian Process Regression for Modeling Engine-out NOx
The stringent regulatory requirements on nitrogen oxides (NOx) emissions from diesel compression ignition engines require accurate and reliable models for real-time monitoring and diagnostics. Although traditional methods such as physical sensors and virtual engine control module (ECM) sensors provide essential data, they are only used for estimation. Ubiquitous literature primarily focuses on deterministic models with little emphasis on capturing the various uncertainties. The lack of probabilistic frameworks restricts the applicability of these models for robust diagnostics. The objective of this paper is to develop and validate a probabilistic model to predict engine-out NOx emissions using Gaussian process regression. Our approach is as follows. We employ three variants of Gaussian process models: the first with a standard radial basis function kernel with input window, the second incorporating a deep kernel using convolutional neural networks to capture temporal dependencies, and the third enriching the deep kernel with a causal graph derived via graph convolutional networks. The causal graph embeds physics knowledge into the learning process. All models are compared against a virtual ECM sensor using both quantitative and qualitative metrics. We conclude that our model provides an improvement in predictive performance when using an input window and a deep kernel structure. Even more compelling is the further enhancement achieved by the incorporation of a causal graph into the deep kernel. These findings are corroborated across different verification and validation datasets.
♻ ☆ Dataset Condensation with Color Compensation
Dataset condensation always faces a constitutive trade-off: balancing performance and fidelity under extreme compression. Existing methods struggle with two bottlenecks: image-level selection methods (Coreset Selection, Dataset Quantization) suffer from inefficiency condensation, while pixel-level optimization (Dataset Distillation) introduces semantic distortion due to over-parameterization. With empirical observations, we find that a critical problem in dataset condensation is the oversight of color's dual role as an information carrier and a basic semantic representation unit. We argue that improving the colorfulness of condensed images is beneficial for representation learning. Motivated by this, we propose DC3: a Dataset Condensation framework with Color Compensation. After a calibrated selection strategy, DC3 utilizes the latent diffusion model to enhance the color diversity of an image rather than creating a brand-new one. Extensive experiments demonstrate the superior performance and generalization of DC3 that outperforms SOTA methods across multiple benchmarks. To the best of our knowledge, besides focusing on downstream tasks, DC3 is the first research to fine-tune pre-trained diffusion models with condensed datasets. The FID results prove that training networks with our high-quality datasets is feasible without model collapse or other degradation issues. Code and generated data are available at https://github.com/528why/Dataset-Condensation-with-Color-Compensation.
♻ ☆ Training Machine Learning Models on Human Spatio-temporal Mobility Data: An Experimental Study [Experiment Paper]
Individual-level human mobility prediction has emerged as a significant topic of research with applications in infectious disease monitoring, child, and elderly care. Existing studies predominantly focus on the microscopic aspects of human trajectories: such as predicting short-term trajectories or the next location visited, while offering limited attention to macro-level mobility patterns and the corresponding life routines. In this paper, we focus on an underexplored problem in human mobility prediction: determining the best practices to train a machine learning model using historical data to forecast an individuals complete trajectory over the next days and weeks. In this experiment paper, we undertake a comprehensive experimental analysis of diverse models, parameter configurations, and training strategies, accompanied by an in-depth examination of the statistical distribution inherent in human mobility patterns. Our empirical evaluations encompass both Long Short-Term Memory and Transformer-based architectures, and further investigate how incorporating individual life patterns can enhance the effectiveness of the prediction. We show that explicitly including semantic information such as day-of-the-week and user-specific historical information can help the model better understand individual patterns of life and improve predictions. Moreover, since the absence of explicit user information is often missing due to user privacy, we show that the sampling of users may exacerbate data skewness and result in a substantial loss in predictive accuracy. To mitigate data imbalance and preserve diversity, we apply user semantic clustering with stratified sampling to ensure that the sampled dataset remains representative. Our results further show that small-batch stochastic gradient optimization improves model performance, especially when human mobility training data is limited.
comment: This paper is the extended version of our work accepted at the 33rd ACM International Conference on Advances in Geographic Information Systems
♻ ☆ Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation
Federated learning (FL) enables decentralized training while preserving data privacy, yet existing FL benchmarks address relatively simple classification tasks, where each sample is annotated with a one-hot label. However, little attention has been paid to demonstrating an FL benchmark that handles complicated semantics, where each sample encompasses diverse semantic information, such as relations between objects. Because the existing benchmarks are designed to distribute data in a narrow view of a single semantic, managing the complicated semantic heterogeneity across clients when formalizing FL benchmarks is non-trivial. In this paper, we propose a benchmark process to establish an FL benchmark with controllable semantic heterogeneity across clients: two key steps are (i) data clustering with semantics and (ii) data distributing via controllable semantic heterogeneity across clients. As a proof of concept, we construct a federated PSG benchmark, demonstrating the efficacy of the existing PSG methods in an FL setting with controllable semantic heterogeneity of scene graphs. We also present the effectiveness of our benchmark by applying robust federated learning algorithms to data heterogeneity to show increased performance. To our knowledge, this is the first benchmark framework that enables federated learning and its evaluation for multi-semantic vision tasks under the controlled semantic heterogeneity. Our code is available at https://github.com/Seung-B/FL-PSG.
comment: This work has been accepted for publication in Pattern Recognition Letters
♻ ☆ FreeGAD: A Training-Free yet Effective Approach for Graph Anomaly Detection
Graph Anomaly Detection (GAD) aims to identify nodes that deviate from the majority within a graph, playing a crucial role in applications such as social networks and e-commerce. Despite the current advancements in deep learning-based GAD, existing approaches often suffer from high deployment costs and poor scalability due to their complex and resource-intensive training processes. Surprisingly, our empirical findings suggest that the training phase of deep GAD methods, commonly perceived as crucial, may actually contribute less to anomaly detection performance than expected. Inspired by this, we propose FreeGAD, a novel training-free yet effective GAD method. Specifically, it leverages an affinity-gated residual encoder to generate anomaly-aware representations. Meanwhile, FreeGAD identifies anchor nodes as pseudo-normal and anomalous guides, followed by calculating anomaly scores through anchor-guided statistical deviations. Extensive experiments demonstrate that FreeGAD achieves superior anomaly detection performance, efficiency, and scalability on multiple benchmark datasets from diverse domains, without any training or iterative optimization.
comment: Accepted by ClKM 2025
♻ ☆ FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce $\textbf{FutureX}$, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents' adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents' failure modes and performance pitfalls in future-oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.
comment: Technical report, 51 pages. Update the results
♻ ☆ iTBLS: A Dataset of Interactive Conversations Over Tabular Information
This paper introduces Interactive Tables (iTBLS), a dataset of interactive conversations that focuses on natural-language manipulation of tabular information sourced from academic pre-prints on ArXiv. The iTBLS dataset consists of three types of tabular tasks -- interpretation, modification, and generation. Interpretation focuses on tabular understanding, modification focuses on manipulating tabular information, and generation focuses on the addition of new natural-language evidence. In addition, the paper presents a novel framework that reformulates tabular operations as question-answering, where an appropriate question is formulated based on the nature of interaction and the question is answered using the user request as evidence. The developed approach results in an improvement on all tasks on a sequence-to-sequence modeling baseline on iTBLS. In addition, the question-answering-based reformulation is applied to datasets from prior work for the text-to-table task where textual paragraphs are summarized into tables. The novel approach results in up to 13% improvement in Exact-Match accuracy and up to 16% improvement in BERTScores compared to the prior state-of-the-art.
comment: 15 pages, 4 figures
♻ ☆ G1: Teaching LLMs to Reason on Graphs with Reinforcement Learning
Although Large Language Models (LLMs) have demonstrated remarkable progress, their proficiency in graph-related tasks remains notably limited, hindering the development of truly general-purpose models. Previous attempts, including pretraining graph foundation models or employing supervised fine-tuning, often face challenges such as the scarcity of large-scale, universally represented graph data. We introduce G1, a simple yet effective approach demonstrating that Reinforcement Learning (RL) on synthetic graph-theoretic tasks can significantly scale LLMs' graph reasoning abilities. To enable RL training, we curate Erd\~os, the largest graph reasoning dataset to date comprising 50 diverse graph-theoretic tasks of varying difficulty levels, 100k training data and 5k test data, all drived from real-world graphs. With RL on Erd\~os, G1 obtains substantial improvements in graph reasoning, where our finetuned 3B model even outperforms Qwen2.5-72B-Instruct (24x size). RL-trained models also show strong zero-shot generalization to unseen tasks, domains, and graph encoding schemes, including other graph-theoretic benchmarks as well as real-world node classification and link prediction tasks, without compromising general reasoning abilities. Our findings offer an efficient, scalable path for building strong graph reasoners by finetuning LLMs with RL on graph-theoretic tasks, which combines the strengths of pretrained LLM capabilities with abundant, automatically generated synthetic data, suggesting that LLMs possess graph understanding abilities that RL can elicit successfully. Our implementation is open-sourced at https://github.com/PKU-ML/G1, with models and datasets hosted on Hugging Face collections https://huggingface.co/collections/PKU-ML/g1-683d659e992794fc99618cf2 for broader accessibility.
♻ ☆ ExpVG: Investigating the Design Space of Visual Grounding in Multimodal Large Language Model
Fine-grained multimodal capability in Multimodal Large Language Models (MLLMs) has emerged as a critical research direction, particularly for tackling the visual grounding (VG) problem. Despite the strong performance achieved by existing approaches, they often employ disparate design choices when fine-tuning MLLMs for VG, lacking systematic verification to support these designs. To bridge this gap, this paper presents a comprehensive study of various design choices that impact the VG performance of MLLMs. We conduct our analysis using LLaVA-1.5, which has been widely adopted in prior empirical studies of MLLMs. While more recent models exist, we follow this convention to ensure our findings remain broadly applicable and extendable to other architectures. We cover two key aspects: (1) exploring different visual grounding paradigms in MLLMs, identifying the most effective design, and providing our insights; and (2) conducting ablation studies on the design of grounding data to optimize MLLMs' fine-tuning for the VG task. Finally, our findings contribute to a stronger MLLM for VG, achieving improvements of +5.6% / +6.9% / +7.0% on RefCOCO/+/g over the LLaVA-1.5.
comment: 8 pages for the main paper
♻ ☆ Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7$% to $\approx 90$%, even outperforming ARMs with $7\times$ as many parameters and that were explicitly trained via teacher forcing to learn the right order of decoding.
♻ ☆ Diffusion MRI with Machine Learning
\hspace{2mm} Diffusion-weighted magnetic resonance imaging (dMRI) of the brain offers unique capabilities including noninvasive probing of tissue microstructure and structural connectivity. It is widely used for clinical assessment of disease and injury, and for neuroscience research. Analyzing the dMRI data to extract useful information for medical and scientific purposes can be challenging. The dMRI measurements may suffer from strong noise and artifacts, and may exhibit high inter-session and inter-scanner variability in the data, as well as inter-subject heterogeneity in brain structure. Moreover, the relationship between measurements and the phenomena of interest can be highly complex. Recent years have witnessed increasing use of machine learning methods for dMRI analysis. This manuscript aims to assess these efforts, with a focus on methods that have addressed data preprocessing and harmonization, microstructure mapping, tractography, and white matter tract analysis. We study the main findings, strengths, and weaknesses of the existing methods and suggest topics for future research. We find that machine learning may be exceptionally suited to tackle some of the difficult tasks in dMRI analysis. However, for this to happen, several shortcomings of existing methods and critical unresolved issues need to be addressed. There is a pressing need to improve evaluation practices, to increase the availability of rich training datasets and validation benchmarks, as well as model generalizability, reliability, and explainability concerns.
♻ ☆ Parallelly Tempered Generative Adversarial Nets: Toward Stabilized Gradients
A generative adversarial network (GAN) has been a representative backbone model in generative artificial intelligence (AI) because of its powerful performance in capturing intricate data-generating processes. However, the GAN training is well-known for its notorious training instability, usually characterized by the occurrence of mode collapse. Through the lens of gradients' variance, this work particularly analyzes the training instability and inefficiency in the presence of mode collapse by linking it to multimodality in the target distribution. To ease the raised training issues from severe multimodality, we introduce a novel GAN training framework that leverages a series of tempered distributions produced via convex interpolation. With our newly developed GAN objective function, the generator can learn all the tempered distributions simultaneously, conceptually resonating with the parallel tempering in statistics. Our simulation studies demonstrate the superiority of our approach over existing popular training strategies in both image and tabular data synthesis. We theoretically analyze that such significant improvement can arise from reducing the variance of gradient estimates by using the tempered distributions. Finally, we further develop a variant of the proposed framework aimed at generating fair synthetic data which is one of the growing interests in the field of trustworthy AI.
♻ ☆ Estimation of Energy-dissipation Lower-bounds for Neuromorphic Learning-in-memory
Neuromorphic or neurally-inspired optimizers rely on local but parallel parameter updates to solve problems that range from quadratic programming to Ising machines. An ideal realization of such an optimizer not only uses a compute-in-memory (CIM) paradigm to address the so-called memory-wall (i.e. energy dissipated due to repeated memory read access), but also uses a learning-in-memory (LIM) paradigm to address the energy bottlenecks due to repeated memory writes at the precision required for optimization (the update-wall), and to address the energy bottleneck due to the repeated transfer of information between short-term and long-term memories (the consolidation-wall). In this paper, we derive theoretical estimates for the energy-to-solution metric that can be achieved by this ideal neuromorphic optimizer which is realized by modulating the energy-barrier of the physical memories such that the dynamics of memory updates and memory consolidation matches the optimization or the annealing dynamics. The analysis presented in this paper captures the out-of-equilibrium thermodynamics of learning and the resulting energy-efficiency estimates are model-agnostic which only depend on the number of model-update operations (OPS), the model-size in terms of number of parameters, the speed of convergence, and the precision of the solution. To show the practical applicability of our results, we apply our analysis for estimating the lower-bound on the energy-to-solution metrics for large-scale AI workloads.
comment: 16 pages, 6 figures
♻ ☆ The calculus of variations of the Transformer on the hyperspherical tangent bundle
We offer a theoretical mathematical background to Transformers through Lagrangian optimization across the token space. The Transformer, as a flow map, exists in the tangent fiber for each token along the high-dimensional unit sphere. The circumstance of the hypersphere across the latent data is reasonable due to the trained diagonal matrix equal to the identity, which has various empirical justifications. Thus, under the continuum limit of the dynamics, the latent vectors flow among the tangent bundle. Using these facts, we devise a mathematical framework for the Transformer through calculus of variations. We develop a functional and show that the continuous flow map induced by the Transformer satisfies this functional, therefore the Transformer can be viewed as a natural solver of a calculus of variations problem. We invent new scenarios of when our methods are applicable based on loss optimization with respect to path optimality. We derive the Euler-Lagrange equation for the Transformer. The variant of the Euler-Lagrange equation we present has various appearances in literature, but, to our understanding, oftentimes not foundationally proven or under other specialized cases. Our overarching proof is new: our techniques are classical and the use of the flow map object is original. We provide several other relevant results, primarily ones specific to neural scenarios. In particular, much of our analysis will be attempting to quantify Transformer data in variational contexts under neural approximations. Calculus of variations on manifolds is a well-nourished research area, but for the Transformer specifically, it is uncharted: we lay the foundation for this area through an introduction to the Lagrangian for the Transformer.
comment: Small corrections and improvements, primarily to Theorem 4
♻ ☆ Natural Language Generation from Visual Events: State-of-the-Art and Key Open Questions
In recent years, a substantial body of work in visually grounded natural language processing has focused on real-life multimodal scenarios such as describing content depicted in images or videos. However, comparatively less attention has been devoted to study the nature and degree of interaction between the different modalities in these scenarios. In this paper, we argue that any task dealing with natural language generation from sequences of images or frames is an instance of the broader, more general problem of modeling the intricate relationships between visual events unfolding over time and the features of the language used to interpret, describe, or narrate them. Therefore, solving these tasks requires models to be capable of identifying and managing such intricacies. We consider five seemingly different tasks, which we argue are compelling instances of this broader multimodal problem. Subsequently, we survey the modeling and evaluation approaches adopted for these tasks in recent years and examine the common set of challenges these tasks pose. Building on this perspective, we identify key open questions and propose several research directions for future investigation.
♻ ☆ Fluorescence molecular optomic signatures improve identification of tumors in head and neck specimens
In this study, a radiomics approach was extended to optical fluorescence molecular imaging data for tissue classification, termed 'optomics'. Fluorescence molecular imaging is emerging for precise surgical guidance during head and neck squamous cell carcinoma (HNSCC) resection. However, the tumor-to-normal tissue contrast is confounded by intrinsic physiological limitations of heterogeneous expression of the target molecule, epidermal growth factor receptor (EGFR). Optomics seek to improve tumor identification by probing textural pattern differences in EGFR expression conveyed by fluorescence. A total of 1,472 standardized optomic features were extracted from fluorescence image samples. A supervised machine learning pipeline involving a support vector machine classifier was trained with 25 top-ranked features selected by minimum redundancy maximum relevance criterion. Model predictive performance was compared to fluorescence intensity thresholding method by classifying testing set image patches of resected tissue with histologically confirmed malignancy status. The optomics approach provided consistent improvement in prediction accuracy on all test set samples, irrespective of dose, compared to fluorescence intensity thresholding method (mean accuracies of 89% vs. 81%; P = 0.0072). The improved performance demonstrates that extending the radiomics approach to fluorescence molecular imaging data offers a promising image analysis technique for cancer detection in fluorescence-guided surgery.
comment: 21 pages, 8 figures, 1 table, submitted as a manuscript at Frontiers in Medical Technology
♻ ☆ Unsupervised Learning for Quadratic Assignment
We introduce PLUME search, a data-driven framework that enhances search efficiency in combinatorial optimization through unsupervised learning. Unlike supervised or reinforcement learning, PLUME search learns directly from problem instances using a permutation-based loss with a non-autoregressive approach. We evaluate its performance on the quadratic assignment problem, a fundamental NP-hard problem that encompasses various combinatorial optimization problems. Experimental results demonstrate that PLUME search consistently improves solution quality. Furthermore, we study the generalization behavior and show that the learned model generalizes across different densities and sizes.
comment: preprint
♻ ☆ Nash Convergence of Mean-Based Learning Algorithms in First-Price Auctions WWW
The convergence properties of learning dynamics in repeated auctions is a timely and important question, with numerous applications in, e.g., online advertising markets. This work focuses on repeated first-price auctions where bidders with fixed values learn to bid using mean-based algorithms -- a large class of online learning algorithms that include popular no-regret algorithms such as Multiplicative Weights Update and Follow the Perturbed Leader. We completely characterize the learning dynamics of mean-based algorithms, under two notions of convergence: (1) time-average: the fraction of rounds where bidders play a Nash equilibrium converges to 1; (2) last-iterate: the mixed strategy profile of bidders converges to a Nash equilibrium. Specifically, the results depend on the number of bidders with the highest value: - If the number is at least three, the dynamics almost surely converges to a Nash equilibrium of the auction, in both time-average and last-iterate. - If the number is two, the dynamics almost surely converges to a Nash equilibrium in time-average but not necessarily last-iterate. - If the number is one, the dynamics may not converge to a Nash equilibrium in time-average or last-iterate. Our discovery opens up new possibilities in the study of the convergence of learning dynamics.
comment: A preliminary version was published at the Web Conference (WWW) 2022. This version updates references and figures
♻ ☆ Is The Watermarking Of LLM-Generated Code Robust?
We present the first in depth study on the robustness of existing watermarking techniques applied to code generated by large language models (LLMs). As LLMs increasingly contribute to software development, watermarking has emerged as a potential solution for detecting AI generated code and mitigating misuse, such as plagiarism or the automated generation of malicious programs. While previous research has demonstrated the resilience of watermarking in the text setting, our work reveals that watermarking techniques are significantly more fragile in code-based contexts. Specifically, we show that simple semantic-preserving transformations, such as variable renaming and dead code insertion, can effectively erase watermarks without altering the program's functionality. To systematically evaluate watermark robustness, we develop an algorithm that traverses the Abstract Syntax Tree (AST) of a watermarked program and applies a sequence of randomized, semantics-preserving transformations. Our experimental results, conducted on Python code generated by different LLMs, indicate that even minor modifications can drastically reduce watermark detectability, with true positive rates (TPR) dropping below 50% in many cases. Our code is publicly available at https://github.com/uiuc-arc/llm-code-watermark.
♻ ☆ LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters
The growth of large language models underscores the need for parameter-efficient fine-tuning. Despite its popularity, LoRA encounters storage and computational challenges when deploying multiple task- or user-specific modules. To address this, we introduce LoRA-XS, a novel fine-tuning method backed by a theoretical derivation. LoRA-XS drastically reduces trainable parameters by incorporating a small, trainable weight matrix between frozen low-rank matrices derived from the Singular Value Decomposition of pre-trained weights. This design enables LoRA-XS to reduce storage requirements by over 100x in 7B models compared to LoRA. Additionally, unlike other methods, LoRA-XS imposes no lower bound on trainable parameters - it can scale from a single parameter per module to arbitrarily large values, adapting to any storage or computational constraint. Evaluations on GLUE, GSM8K, MATH, and commonsense reasoning benchmarks across different model scales reveal that LoRA-XS consistently outperforms or matches LoRA and VeRA in accuracy, offering unmatched parameter efficiency. Our ablation studies highlight the significance of singular vectors in transformer weights, establishing LoRA-XS as a powerful, storage-efficient solution for scaling and personalizing large language models.
♻ ☆ LGR2: Language Guided Reward Relabeling for Accelerating Hierarchical Reinforcement Learning
Large language models (LLMs) have shown remarkable abilities in logical reasoning, in-context learning, and code generation. However, translating natural language instructions into effective robotic control policies remains a significant challenge, especially for tasks requiring long-horizon planning and operating under sparse reward conditions. Hierarchical Reinforcement Learning (HRL) provides a natural framework to address this challenge in robotics; however, it typically suffers from non-stationarity caused by the changing behavior of the lower-level policy during training, destabilizing higher-level policy learning. We introduce LGR2, a novel HRL framework that leverages LLMs to generate language-guided reward functions for the higher-level policy. By decoupling high-level reward generation from low-level policy changes, LGR2 fundamentally mitigates the non-stationarity problem in off-policy HRL, enabling stable and efficient learning. To further enhance sample efficiency in sparse environments, we integrate goal-conditioned hindsight experience relabeling. Extensive experiments across simulated and real-world robotic navigation and manipulation tasks demonstrate LGR2 outperforms both hierarchical and non-hierarchical baselines, achieving over 55% success rates on challenging tasks and robust transfer to real robots, without additional fine-tuning.
♻ ☆ The Spectral Barycentre of a Set of Graphs with Community Structure
The notion of barycentre graph is of crucial importance for machine learning algorithms that process graph-valued data. The barycentre graph is a "summary graph" that captures the mean topology and connectivity structure of a training dataset of graphs. The construction of a barycentre requires the definition of a metric to quantify distances between pairs of graphs. In this work, we use a multiscale spectral distance that is defined using the eigenvalues of the normalized graph Laplacian. The eigenvalues -- but not the eigenvectors -- of the normalized Laplacian of the barycentre graph can be determined from the optimization problem that defines the barycentre. In this work, we propose a structural constraint on the eigenvectors of the normalized graph Laplacian of the barycentre graph that guarantees that the barycentre inherits the topological structure of the graphs in the sample dataset. The eigenvectors can be computed using an algorithm that explores the large library of Soules bases. When the graphs are random realizations of a balanced stochastic block model, then our algorithm returns a barycentre that converges asymptotically (in the limit of large graph size) almost-surely to the population mean of the graphs. We perform Monte Carlo simulations to validate the theoretical properties of the estimator; we conduct experiments on real-life graphs that suggest that our approach works beyond the controlled environment of stochastic block models.
comment: 28 pages
Multimedia 8
☆ INDS: Incremental Named Data Streaming for Real-Time Point Cloud Video
Real-time streaming of point cloud video, characterized by massive data volumes and high sensitivity to packet loss, remains a key challenge for immersive applications under dynamic network conditions. While connection-oriented protocols such as TCP and more modern alternatives like QUIC alleviate some transport-layer inefficiencies, including head-of-line blocking, they still retain a coarse-grained, segment-based delivery model and a centralized control loop that limit fine-grained adaptation and effective caching. We introduce INDS (Incremental Named Data Streaming), an adaptive streaming framework based on Information-Centric Networking (ICN) that rethinks delivery for hierarchical, layered media. INDS leverages the Octree structure of point cloud video and expressive content naming to support progressive, partial retrieval of enhancement layers based on consumer bandwidth and decoding capability. By combining time-windows with Group-of-Frames (GoF), INDS's naming scheme supports fine-grained in-network caching and facilitates efficient multi-user data reuse. INDS can be deployed as an overlay, remaining compatible with QUIC-based transport infrastructure as well as future Media-over-QUIC (MoQ) architectures, without requiring changes to underlying IP networks. Our prototype implementation shows up to 80% lower delay, 15-50% higher throughput, and 20-30% increased cache hit rates compared to state-of-the-art DASH-style systems. Together, these results establish INDS as a scalable, cache-friendly solution for real-time point cloud streaming under variable and lossy conditions, while its compatibility with MoQ overlays further positions it as a practical, forward-compatible architecture for emerging immersive media systems.
comment: 9 pages, 9 figures, 2 tables. To appear in Proc. of the 33rd ACM International Conference on Multimedia (MM '25), October 27--31, 2025, Dublin, Ireland
☆ Optimizing Region of Interest Selection for Effective Embedding in Video Steganography Based on Genetic Algorithms
With the widespread use of the internet, there is an increasing need to ensure the security and privacy of transmitted data. This has led to an intensified focus on the study of video steganography, which is a technique that hides data within a video cover to avoid detection. The effectiveness of any steganography method depends on its ability to embed data without altering the original video quality while maintaining high efficiency. This paper proposes a new method to video steganography, which involves utilizing a Genetic Algorithm (GA) for identifying the Region of Interest (ROI) in the cover video. The ROI is the area in the video that is the most suitable for data embedding. The secret data is encrypted using the Advanced Encryption Standard (AES), which is a widely accepted encryption standard, before being embedded into the cover video, utilizing up to 10% of the cover video. This process ensures the security and confidentiality of the embedded data. The performance metrics for assessing the proposed method are the Peak Signal to Noise Ratio (PSNR) and the encoding and decoding time. The results show that the proposed method has a high embedding capacity and efficiency, with a PSNR ranging between 64 and 75 dBs, which indicates that the embedded data is almost indistinguishable from the original video. Additionally, the method can encode and decode data quickly, making it efficient for real time applications.
comment: 19 Pages, 7 Figures, 4 Tables
☆ Mitigating Easy Option Bias in Multiple-Choice Question Answering
In this early study, we observe an Easy-Options Bias (EOB) issue in some multiple-choice Visual Question Answering (VQA) benchmarks such as MMStar, RealWorldQA, SEED-Bench, Next-QA, STAR benchmark and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the need for the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual contents than the negative options in feature space, creating a shortcut for VLMs to infer the answer via simply vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these EOB-free annotations, current VLMs approach to random accuracies under (V+O) settings, and drop to non-saturated accuracies under (V+Q+O) settings, providing a more realistic evaluation of VLMs' QA ability. Codes and new annotations will be released soon.
comment: Under review
☆ OmniSense: Towards Edge-Assisted Online Analytics for 360-Degree Videos
With the reduced hardware costs of omnidirectional cameras and the proliferation of various extended reality applications, more and more $360^\circ$ videos are being captured. To fully unleash their potential, advanced video analytics is expected to extract actionable insights and situational knowledge without blind spots from the videos. In this paper, we present OmniSense, a novel edge-assisted framework for online immersive video analytics. OmniSense achieves both low latency and high accuracy, combating the significant computation and network resource challenges of analyzing $360^\circ$ videos. Motivated by our measurement insights into $360^\circ$ videos, OmniSense introduces a lightweight spherical region of interest (SRoI) prediction algorithm to prune redundant information in $360^\circ$ frames. Incorporating the video content and network dynamics, it then smartly scales vision models to analyze the predicted SRoIs with optimized resource utilization. We implement a prototype of OmniSense with commodity devices and evaluate it on diverse real-world collected $360^\circ$ videos. Extensive evaluation results show that compared to resource-agnostic baselines, it improves the accuracy by $19.8\%$ -- $114.6\%$ with similar end-to-end latencies. Meanwhile, it hits $2.0\times$ -- $2.4\times$ speedups while keeping the accuracy on par with the highest accuracy of baselines.
comment: 10 pages; Accepted by INFOCOM'23
☆ StarStream: Live Video Analytics over Space Networking
Streaming videos from resource-constrained front-end devices over networks to resource-rich cloud servers has long been a common practice for surveillance and analytics. Most existing live video analytics (LVA) systems, however, have been built over terrestrial networks, limiting their applications during natural disasters and in remote areas that desperately call for real-time visual data delivery and scene analysis. With the recent advent of space networking, in particular, Low Earth Orbit (LEO) satellite constellations such as Starlink, high-speed truly global Internet access is becoming available and affordable. This paper examines the challenges and potentials of LVA over modern LEO satellite networking (LSN). Using Starlink as the testbed, we have carried out extensive in-the-wild measurements to gain insights into its achievable performance for LVA. The results reveal that the uplink bottleneck in today's LSN, together with the volatile network conditions, can significantly affect the service quality of LVA and necessitate prompt adaptation. We accordingly develop StarStream, a novel LSN-adaptive streaming framework for LVA. At its core, StarStream is empowered by a Transformer-based network performance predictor tailored for LSN and a content-aware configuration optimizer. We discuss a series of key design and implementation issues of StarStream and demonstrate its effectiveness and superiority through trace-driven experiments with real-world network and video processing data.
comment: Accepted by MM'24
♻ ☆ RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening
Pansharpening refers to the process of integrating a high resolution panchromatic (PAN) image with a lower resolution multispectral (MS) image to generate a fused product, which is pivotal in remote sensing. Despite the effectiveness of CNNs in addressing this challenge, they are inherently constrained by the uniform application of convolutional kernels across all spatial positions, overlooking local content variations. To overcome this issue, we introduce RAPNet, a new architecture that leverages content-adaptive convolution. At its core, RAPNet employs the Receptive-field Adaptive Pansharpening Convolution (RAPConv), designed to produce spatially adaptive kernels responsive to local feature context, thereby enhancing the precision of spatial detail extraction. Additionally, the network integrates the Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an attention mechanism to achieve an optimal balance between spatial detail enhancement and spectral fidelity. Comprehensive evaluations on publicly available datasets confirm that RAPNet delivers superior performance compared to existing approaches, as demonstrated by both quantitative metrics and qualitative assessments. Ablation analyses further substantiate the effectiveness of the proposed adaptive components.
comment: Accepted by the 6th International Conference on Artificial Intelligence and Electromechanical Automation (AIEA 2025). 5 pages, 6 figures
♻ ☆ MAGNeT: Multimodal Adaptive Gaussian Networks for Intent Inference in Moving Target Selection across Complex Scenarios ACM MM 2025
Moving target selection in multimedia interactive systems faces unprecedented challenges as users increasingly interact across diverse and dynamic contexts-from live streaming in moving vehicles to VR gaming in varying environments. Existing approaches rely on probabilistic models that relate endpoint distribution to target properties such as size and speed. However, these methods require substantial training data for each new context and lack transferability across scenarios, limiting their practical deployment in diverse multimedia environments where rich multimodal contextual information is readily available. This paper introduces MAGNeT (Multimodal Adaptive Gaussian Networks), which addresses these problems by combining classical statistical modeling with a context-aware multimodal method. MAGNeT dynamically fuses pre-fitted Ternary-Gaussian models from various scenarios based on real-time contextual cues, enabling effective adaptation with minimal training data while preserving model interpretability. We conduct experiments on self-constructed 2D and 3D moving target selection datasets under in-vehicle vibration conditions. Extensive experiments demonstrate that MAGNeT achieves lower error rates with few-shot samples by applying context-aware fusion of Gaussian experts from multi-factor conditions.
comment: Accepted by ACM MM 2025
♻ ☆ VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning
Diffusion Models (DMs) have achieved remarkable success in realistic voice cloning (VC), while they also increase the risk of malicious misuse. Existing proactive defenses designed for traditional VC models aim to disrupt the forgery process, but they have been proven incompatible with DMs due to the intricate generative mechanisms of diffusion. To bridge this gap, we introduce VoiceCloak, a multi-dimensional proactive defense framework with the goal of obfuscating speaker identity and degrading perceptual quality in potential unauthorized VC. To achieve these goals, we conduct a focused analysis to identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt the cloning process by introducing adversarial perturbations into the reference audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets speaker identity by distorting representation learning embeddings to maximize identity variation, which is guided by auditory perception principles. Additionally, VoiceCloak disrupts crucial conditional guidance processes, particularly attention context, thereby preventing the alignment of vocal characteristics that are essential for achieving convincing cloning. Then, to address the second objective, VoiceCloak introduces score magnitude amplification to actively steer the reverse trajectory away from the generation of high-quality speech. Noise-guided semantic corruption is further employed to disrupt structural speech semantics captured by DMs, degrading output quality. Extensive experiments highlight VoiceCloak's outstanding defense success rate against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak are available at https://voice-cloak.github.io/VoiceCloak/.