DL-HLVD: Deep Learning-based Hybrid Language Source Code Vulnerability Detection
Affiliation:

Lanzhou Jiaotong University

Fund Project:

National Natural Science Foundation of China (No. 61762058), Natural Science Foundation of Gansu Province (No. 21JR7RA282), Industrial Support Project of Gansu Provincial Department of Education (No. 2022CYZC-38), State Grid Science and Technology Project (No. W32KJ2722010, No. 522722220013)

Abstract:

To improve development efficiency and widen the options available when building software systems, many open-source systems are now written in several programming languages. Code units in different languages are usually linked through associations and call relationships, and the security vulnerabilities that arise from these interactions are more common in real environments. Existing vulnerability detection techniques mainly learn features from a single programming language and struggle to detect vulnerabilities in mixed-language software projects. This paper therefore proposes DL-HLVD (Deep Learning-based Hybrid Language Source Code Vulnerability Detection), a method based on deep learning model fusion. DL-HLVD first uses a BERT layer to convert code text into low-dimensional vectors, feeds these vectors into a Bidirectional Gated Recurrent Unit (BGRU) to capture contextual features, and then uses a Conditional Random Field (CRF) to capture the dependencies between adjacent labels. This model performs named entity recognition on the functions written in the different programming languages of a mixed-language project. The recognized entities are then recombined with program slicing results, which reduces the loss of syntactic and semantic information during code representation. Finally, a bidirectional long short-term memory (BiLSTM) network extracts vulnerability code features to detect vulnerabilities in mixed-language software. Experiments on the SARD and CrossVul datasets show that DL-HLVD achieves an overall recall of 95.0% and an F1 score of 93.6% across the two vulnerability datasets, improving on the recent deep learning methods VulDeePecker, SySeVR, and Project Achilles on every metric. These results indicate that DL-HLVD improves the overall performance of source code vulnerability detection in mixed-language scenarios.
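
The role of the CRF layer described in the abstract, capturing dependencies between adjacent labels, can be illustrated with a minimal Viterbi decoding sketch. This is not the authors' implementation. The BIO label set, the token sequence, the emission scores (standing in for BGRU outputs), and the transition penalty are all hypothetical values chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical BIO label set for tagging function names in a token stream.
LABELS = ["O", "B-FUNC", "I-FUNC"]


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain CRF.

    emissions[t][y]     : per-token score for label y (assumed to come from
                          an upstream encoder such as a BGRU)
    transitions[(a, b)] : score for moving from label a to label b
                          (missing pairs default to 0.0)
    """
    if not emissions:
        return []
    # best[t][y] = best score of any label path ending in y at position t
    best = [{y: emissions[0][y] for y in LABELS}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for y in LABELS:
            # Pick the previous label that maximizes path score into y.
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + transitions.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev] + transitions.get((prev, y), 0.0) + emissions[t][y]
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for the tokens ["int", "py_call", "("].
emissions = [
    {"O": 2.0, "B-FUNC": 0.1, "I-FUNC": 0.0},
    {"O": 0.2, "B-FUNC": 1.0, "I-FUNC": 1.2},
    {"O": 1.5, "B-FUNC": 0.0, "I-FUNC": 0.3},
]
# An entity may not start with I-FUNC directly after O, so penalize it.
transitions = {("O", "I-FUNC"): -10.0}

print(viterbi_decode(emissions, transitions))  # ['O', 'B-FUNC', 'O']
```

Picking the best label per token independently would tag the second token as I-FUNC, which is an invalid start of an entity. The transition penalty steers the decoder to B-FUNC instead, which is the kind of adjacent-label constraint a CRF layer contributes on top of per-token scores.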

History
  • Received: 2024-01-04
  • Last revised: 2024-03-15
  • Accepted: 2024-05-13