Abstract: This study addresses two challenges in the geological domain: the low accuracy of recognizing complex geological entities and the high cost of sample annotation. It aims to improve the precision of geological named entity recognition and to streamline geological text analysis, both of which are essential for constructing geological knowledge graphs and advancing intelligent geological information processing.

Methods: To meet these objectives, we introduce the BERTwwm-BiLSTM-Attention-CRF model, which combines an improved BERTwwm pre-training layer with a Self-Attention module to raise accuracy on complex geological entities. To reduce the cost and labor of sample annotation and to improve accuracy on small-scale datasets, we also optimize the model construction process: a model-assisted annotation strategy speeds up dataset annotation, and a refined Easy Data Augmentation (EDA) strategy, together with a geological dictionary, expands the dataset while minimizing manual annotation.

Results: The BERTwwm-BiLSTM-Attention-CRF model achieved a precision of 92.67%, a recall of 94.21%, and an F1-score of 93.29%. Comparative experiments and ablation studies confirmed the effectiveness of the proposed enhancements for geological entity recognition.

Conclusions: The proposed model substantially improves accuracy in geological named entity recognition. Model-assisted annotation and optimized data augmentation reduce annotation costs and improve performance, especially on smaller datasets. Together, they offer a cost-effective and efficient approach to geological text analysis that supports the construction of geological knowledge graphs and more sophisticated processing of geological data.
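
As a rough illustration of how a geological dictionary might be used to expand an annotated dataset without new manual labels, the sketch below swaps each labeled entity for another dictionary term of the same type and regenerates the tags. It is only a minimal sketch under stated assumptions: the BIO tagging scheme, the `ROCK` entity type, the sample terms, and the function name `replace_entities` are all hypothetical, and the refined EDA strategy used in the study may differ.

```python
import random


def replace_entities(tokens, tags, dictionary, rng=None):
    """Swap each BIO-tagged entity span for a same-type dictionary term.

    `dictionary` is a hypothetical mapping from entity type to a list of
    candidate terms, each given as a non-empty list of tokens.
    """
    if rng is None:
        rng = random.Random()
    new_tokens, new_tags = [], []
    i = 0
    while i < len(tokens):
        tag = tags[i]
        if tag.startswith("B-"):
            etype = tag[2:]
            # Find the end of the current entity span.
            j = i + 1
            while j < len(tokens) and tags[j] == "I-" + etype:
                j += 1
            candidates = dictionary.get(etype)
            # Keep the original span if the dictionary has no term of this type.
            replacement = rng.choice(candidates) if candidates else tokens[i:j]
            new_tokens.extend(replacement)
            # Regenerate tags so the labels stay aligned with the new tokens.
            new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(replacement) - 1))
            i = j
        else:
            # Non-entity tokens are copied unchanged.
            new_tokens.append(tokens[i])
            new_tags.append(tag)
            i += 1
    return new_tokens, new_tags


# Illustrative usage with made-up data.
tokens = ["The", "granite", "intrusion", "cuts", "the", "limestone"]
tags = ["O", "B-ROCK", "O", "O", "O", "B-ROCK"]
geo_dict = {"ROCK": [["basalt"], ["quartz", "diorite"]]}
aug_tokens, aug_tags = replace_entities(tokens, tags, geo_dict, random.Random(0))
```

Because the tags are rebuilt from the replacement term, each augmented sentence comes out already labeled, which is why this kind of augmentation adds training data without adding annotation effort.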