ARCE: Augmented RoBERTa with Contextualized Elucidations for NER in Automated Rule Checking
Abstract
Accurate information extraction from specialized texts is a critical challenge for automated rule checking (ARC) in the architecture, engineering, and construction (AEC) domain. While large language models (LLMs) possess strong reasoning capabilities, their deployment in resource-constrained AEC environments is often impractical. Conversely, standard efficient models struggle with the significant domain gap. Although this gap can be mitigated by pre-training on large, human-curated corpora, such approaches are labor-intensive and costly. To address this, we propose ARCE (Augmented RoBERTa with Contextualized Elucidations), a novel knowledge distillation framework that leverages LLMs to synthesize a task-oriented corpus, termed Cote, for incrementally pre-training smaller models. ARCE systematically explores strategies for knowledge transfer to identify the most effective one. Our extensive experiments demonstrate that ARCE establishes a new state of the art on a benchmark AEC dataset, achieving a Macro-F1 score of 77.20% and outperforming both domain-specific baselines and fine-tuned LLMs. Crucially, our study reveals a "less is more" principle: for the NER task, simple, direct explanations prove significantly more effective for domain adaptation than complex, role-based rationales, which tend to introduce semantic noise. The source code will be made publicly available upon acceptance.
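For readers unfamiliar with the reported metric, the following is a minimal sketch of how an entity-level Macro-F1 score can be computed for NER: the per-type F1 scores are averaged with equal weight, so rare entity types count as much as frequent ones. The exact-match criterion, the tuple layout, and the entity type names used here are assumptions for illustration; the evaluation protocol used for the 77.20% figure is not specified in this abstract.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Entity-level Macro-F1 under exact-match scoring (an assumed protocol).

    gold, pred: sets of (sentence_id, start, end, entity_type) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for ent in pred:
        if ent in gold:
            tp[ent[3]] += 1   # span and type both match a gold entity
        else:
            fp[ent[3]] += 1   # predicted entity with no exact gold match
    for ent in gold:
        if ent not in pred:
            fn[ent[3]] += 1   # gold entity the model missed

    # Average F1 over every type seen in either gold or predictions.
    types = sorted({e[3] for e in gold} | {e[3] for e in pred})
    scores = []
    for t in types:
        denom = 2 * tp[t] + fp[t] + fn[t]
        scores.append(2 * tp[t] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical entity types, for illustration only.
gold = {(0, 0, 2, "object"), (0, 3, 4, "property"), (1, 0, 1, "object")}
pred = {(0, 0, 2, "object"), (0, 3, 4, "object"), (1, 0, 1, "object")}
score = macro_f1(gold, pred)
```

In this toy example the "object" type scores an F1 of 0.8 and the "property" type scores 0, so the macro average is 0.4 even though two of three predictions are correct, which shows how the metric penalizes weak performance on any single entity type.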