Research on geological named entity recognition based on BERTwwm and data augmentation
Authors: ZHANG Wenqi, LIU Yuangang, LI Shaohua, YU Jinbiao, SHI Jinghua, ZHANG Changmin
Funding:

This work was supported by the National Natural Science Foundation of China (Grant Nos. 42172172 and 42130813) and the Open Fund of the Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education (Grant No. PI2023-04).

Affiliations:
  • 1) School of Geosciences, Yangtze University, Wuhan, 430100
  • 1) School of Geosciences, Yangtze University, Wuhan, 430100
  • 1) School of Geosciences, Yangtze University, Wuhan, 430100; 3) Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education, Wuhan, 430100
  • 2) SINOPEC Shengli Oilfield Company, Dongying, Shandong, 257000
  • 2) SINOPEC Shengli Oilfield Company, Dongying, Shandong, 257000
  • 1) School of Geosciences, Yangtze University, Wuhan, 430100; 3) Key Laboratory of Exploration Technologies for Oil and Gas Resources (Yangtze University), Ministry of Education, Wuhan, 430100
    Abstract:

    Objectives: Geological named entity recognition (NER) identifies geological entities in geological text and assigns them to the correct geological concepts. It is a task in the intelligent extraction of geological knowledge and one of the key technologies for building knowledge graphs in the geological domain. This study addresses two challenges in geological NER: insufficient accuracy in recognizing complex geological entities and the high cost of sample annotation.

    Methods: We construct a geological entity recognition model, BERTwwm-BiLSTM-Attention-CRF, which uses an improved BERTwwm pre-training layer and adds a Self-Attention module to improve accuracy on complex geological entities. To reduce annotation cost and improve accuracy on small-scale datasets, we optimize the model construction process: a model-assisted annotation method speeds up dataset annotation, and an improved Easy Data Augmentation (EDA) method, together with a geological dictionary, expands the dataset while reducing the manual annotation effort.

    Results: The BERTwwm-BiLSTM-Attention-CRF model achieves a precision of 92.67%, a recall of 94.21%, and an F1 score of 93.29% on geological entity recognition. Comparative experiments and ablation studies support the effectiveness of the proposed improvements.

    Conclusions: The proposed improvements raise geological entity recognition performance. Model-assisted annotation and the improved data augmentation lower annotation cost and improve performance, especially on smaller datasets. Together they offer an efficient and economical approach to geological text analysis, which helps advance the construction of geological knowledge graphs and the intelligent processing of geological information.
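The abstract mentions expanding the dataset with a geological dictionary but does not spell out the procedure. One plausible reading is entity replacement: each annotated mention is swapped for another dictionary entry of the same type, and the BIO tags are regenerated to match the new length. The sketch below illustrates that idea only. The function names, the `ROCK` label, the sample sentence, and the dictionary entries are all hypothetical, and character-level tokens are assumed; it is not the authors' implementation.

```python
import random


def extract_spans(tags):
    """Return (start, end, type) entity spans from a BIO tag list; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def replace_entities(tokens, tags, dictionary, rng, p=1.0):
    """Swap each entity for a same-type dictionary entry with probability p.

    Assumes character-level tokens, so a dictionary entry (a string) is
    split into characters and re-tagged as B-<type>, I-<type>, ...
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_spans(tags):
        # Copy the non-entity text before this span unchanged.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        candidates = dictionary.get(etype, [])
        if candidates and rng.random() < p:
            mention = list(rng.choice(candidates))
        else:
            mention = tokens[start:end]
        new_tokens += mention
        new_tags += ["B-" + etype] + ["I-" + etype] * (len(mention) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical example: "granite is developed in this area".
tokens = list("该区发育花岗岩")
tags = ["O", "O", "O", "O", "B-ROCK", "I-ROCK", "I-ROCK"]
# Hypothetical dictionary entries: basalt, quartz sandstone.
dictionary = {"ROCK": ["玄武岩", "石英砂岩"]}

rng = random.Random(0)  # seeded so augmentation runs are repeatable
new_tokens, new_tags = replace_entities(tokens, tags, dictionary, rng)
print("".join(new_tokens), new_tags)
```

Because the tags are rebuilt from the replacement's length, the augmented sentence stays correctly labeled without manual re-annotation, which is the property that makes dictionary-driven augmentation cheap. Setting `p` below 1.0 would keep some original mentions, so the augmented set does not drift too far from the source distribution.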

Cite this article

ZHANG Wenqi, LIU Yuangang, LI Shaohua, YU Jinbiao, SHI Jinghua, ZHANG Changmin. 2024. Research on geological named entity recognition based on BERTwwm and data augmentation[J]. Geological Review, 70(3): 2024030034.

History
  • Published online: 2024-06-20