BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching

AI-generated keywords: BinaryAI

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • The paper addresses challenges and risks of using third-party libraries in software development
  • Software composition analysis (SCA) techniques have been developed to mitigate these risks
  • Existing techniques for binary-to-source SCA have limitations, leading to false positives and compromised recall
  • The authors propose a novel technique called BinaryAI that utilizes a two-phase binary source code matching approach
  • BinaryAI trains a transformer-based model to generate function-level embeddings and find similar source functions for each binary function
  • Experimental results show that BinaryAI outperforms existing models in terms of binary source code matching and downstream SCA tasks
  • BinaryAI achieves higher recall@1 and MRR compared to the state-of-the-art model CodeCMR
  • BinaryAI also surpasses existing binary-to-source SCA tools in detecting third-party libraries, increasing precision and recall compared to the commercial SCA product Black Duck
  • Overall, BinaryAI provides an innovative approach for improving the accuracy and effectiveness of binary-to-source SCA techniques
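The retrieval step named in the key points (finding similar source functions for each binary function from function-level embeddings) can be illustrated with a small sketch. This is not the authors' implementation: the use of cosine similarity, the three-dimensional toy vectors, the function names, and the helper `top_k_source_functions` are all assumptions made for illustration. The page does not say which similarity measure or embedding dimension BinaryAI uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_source_functions(binary_emb, source_embs, k=3):
    """Rank source functions by similarity to one binary-function embedding.

    source_embs maps a (hypothetical) source function name to its embedding.
    Returns the k best (name, score) pairs, highest score first.
    """
    scored = [(name, cosine(binary_emb, vec)) for name, vec in source_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; a real model would emit much larger vectors.
source_embs = {
    "zlib::inflate": [0.9, 0.1, 0.0],
    "zlib::deflate": [0.1, 0.9, 0.0],
    "libpng::png_read_row": [0.0, 0.2, 0.9],
}
binary_emb = [0.8, 0.2, 0.1]  # embedding of one function from a stripped binary

candidates = top_k_source_functions(binary_emb, source_embs, k=2)
for name, score in candidates:
    print(f"{name}: {score:.3f}")
```

At the scale of a large TPL dataset, a brute-force scan like this would normally be replaced by an approximate nearest-neighbour index; the ranking logic stays the same.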

Authors: Ling Jiang, Junwen An, Huihui Huang, Qiyi Tang, Sen Nie, Shi Wu, Yuqun Zhang

Abstract: While third-party libraries are extensively reused to enhance productivity during software development, they can also introduce potential security risks such as vulnerability propagation. Software composition analysis, proposed to identify reused TPLs for reducing such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering since binary and source code exhibit substantial disparities after compilation. The existing binary-to-source SCA techniques leverage basic syntactic features that suffer from redundancy and lack robustness in the large-scale TPL dataset, leading to inevitable false positives and compromised recall. To mitigate these limitations, we introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching to capture both syntactic and semantic code features. First, BinaryAI trains a transformer-based model to produce function-level embeddings and obtain similar source functions for each binary function accordingly. Then by applying the link-time locality to facilitate function matching, BinaryAI detects the reused TPLs based on the ratio of matched source functions. Our experimental results demonstrate the superior performance of BinaryAI in terms of binary source code matching and the downstream SCA task. Specifically, our embedding model outperforms the state-of-the-art model CodeCMR, i.e., achieving 22.54% recall@1 and 0.34 MRR compared with 10.75% and 0.17 respectively. Additionally, BinaryAI outperforms all existing binary-to-source SCA tools in TPL detection, increasing the precision from 73.36% to 85.84% and recall from 59.81% to 64.98% compared with the well-recognized commercial SCA product Black Duck.
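For readers unfamiliar with the two retrieval metrics quoted in the abstract, the sketch below shows how recall@1 and MRR (mean reciprocal rank) are conventionally computed. The query names, ranked lists, and ground-truth labels are invented toy data, and the helper names are hypothetical; the paper's actual evaluation set is not reproduced here.

```python
def recall_at_k(ranked_lists, ground_truth, k=1):
    """Fraction of queries whose correct item appears in the top k results."""
    hits = sum(
        1 for query, ranked in ranked_lists.items()
        if ground_truth[query] in ranked[:k]
    )
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, ground_truth):
    """Average of 1/rank of the correct item; a miss contributes 0."""
    total = 0.0
    for query, ranked in ranked_lists.items():
        target = ground_truth[query]
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(ranked_lists)


# Toy data: each binary function (query) has a ranked list of source candidates.
ranked_lists = {
    "bin_f1": ["a", "b", "c"],       # correct answer at rank 1
    "bin_f2": ["x", "b", "c"],       # correct answer at rank 2
    "bin_f3": ["x", "y", "z", "c"],  # correct answer at rank 4
    "bin_f4": ["x", "y"],            # correct answer not retrieved
}
ground_truth = {"bin_f1": "a", "bin_f2": "b", "bin_f3": "c", "bin_f4": "d"}

r1 = recall_at_k(ranked_lists, ground_truth, k=1)
mrr = mean_reciprocal_rank(ranked_lists, ground_truth)
print(f"recall@1 = {r1:.4f}, MRR = {mrr:.4f}")
```

Read this way, a recall@1 of 22.54% means the correct source function is ranked first for roughly one query in four or five, and an MRR of 0.34 reflects that the correct match often sits near, but not at, the top of the list.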

Submitted to arXiv on 20 Jan. 2024


Results of the summarizing process for the arXiv paper: 2401.11161v1


The paper titled "BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching" addresses the challenges and risks associated with using third-party libraries in software development. While these libraries can enhance productivity, they also introduce potential security vulnerabilities. To mitigate these risks, software composition analysis (SCA) techniques have been developed, with binary-to-source SCA being a popular approach. However, this process is challenging due to disparities between binary and source code after compilation. Existing techniques rely on basic syntactic features, leading to false positives and compromised recall in large-scale third-party library datasets.

To overcome these limitations, the authors propose a novel technique called BinaryAI that utilizes a two-phase binary source code matching approach to capture both syntactic and semantic code features. In the first phase, BinaryAI trains a transformer-based model to generate function-level embeddings and find similar source functions for each binary function. The second phase employs link-time locality to facilitate function matching and detects reused third-party libraries based on the ratio of matched source functions.

Experimental results demonstrate that BinaryAI outperforms existing models in terms of binary source code matching and downstream SCA tasks. The embedding model of BinaryAI achieves 22.54% recall@1 and 0.34 MRR compared to 10.75% and 0.17 respectively for the state-of-the-art model CodeCMR. Additionally, BinaryAI surpasses all existing binary-to-source SCA tools in detecting third-party libraries by increasing precision from 73.36% to 85.84% and recall from 59.81% to 64.98% compared to the commercial SCA product Black Duck. Overall, BinaryAI presents an innovative approach for improving the accuracy and effectiveness of binary-to-source SCA techniques. By considering both syntactic and semantic code features, it provides a more robust solution for identifying reused third-party libraries and reducing security risks in software development.
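The final detection step described in the summary, deciding that a third-party library is reused based on the ratio of matched source functions, can be sketched as follows. Everything specific here is an assumption for illustration: the 0.5 threshold, the library sizes, the function names, and the helper `detect_tpls` do not come from the paper, and the link-time locality refinement is not modelled at all.

```python
from collections import defaultdict


def detect_tpls(matched_functions, tpl_function_counts, threshold=0.5):
    """Report libraries whose share of matched source functions reaches a threshold.

    matched_functions: iterable of (tpl_name, source_function_name) pairs, one per
        binary function that was matched to a source function.
    tpl_function_counts: total number of source functions known for each library.
    threshold: assumed cut-off on the matched ratio (hypothetical value).
    Returns (sorted list of detected libraries, dict of ratios per library).
    """
    matched_per_tpl = defaultdict(set)
    for tpl, func in matched_functions:
        matched_per_tpl[tpl].add(func)  # a set, so duplicate matches count once

    ratios = {
        tpl: len(funcs) / tpl_function_counts[tpl]
        for tpl, funcs in matched_per_tpl.items()
        if tpl_function_counts.get(tpl)  # skip unknown or empty libraries
    }
    detected = sorted(tpl for tpl, ratio in ratios.items() if ratio >= threshold)
    return detected, ratios


# Toy input: 6 of 10 zlib functions matched, 2 of 20 libpng functions matched.
matched = [("zlib", f"zlib_fn_{i}") for i in range(6)]
matched += [("libpng", "png_fn_0"), ("libpng", "png_fn_1")]
counts = {"zlib": 10, "libpng": 20}

detected, ratios = detect_tpls(matched, counts, threshold=0.5)
print(detected, ratios)
```

A ratio-based rule makes the precision and recall trade-off explicit: raising the threshold suppresses libraries that share only a few generic functions with the binary, at the cost of missing libraries that were only partially linked in.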
Created on 01 Feb. 2024

