Moving Faster and Reducing Risk: Using LLMs in Release Deployment

AI-generated keywords: Release engineering

AI-generated Key Points

  • Traditional focus in release engineering: delivering features and bug fixes continuously
  • Challenge at scale for release engineering teams to determine which changes should be released
  • Development of diff risk score (DRS) models to assess likelihood of severe faults caused by diffs
  • Gating risky code changes based on risk thresholds (green, weekend, yellow, red)
  • Research approaches explored: logistic regression models, BERT-based models like StarBERT, generative LLMs like iCodeLlama-34B and iDiffLlama-13B
  • Performance comparison: generative LLMs show better performance in capturing severe faults compared to regression and BERT-based models
  • Providing developers with detailed information on risk scores in Phabricator UI for feedback and actions to reduce risks
  • Use of predictors in models to prioritize actionable factors while maintaining developer understanding
  • Future potential benefits of exploring large language models for incorporating content-based features into risk assessment

Authors: Rui Abreu, Vijayaraghavan Murali, Peter C Rigby, Chandra Maddila, Weiyan Sun, Jun Ge, Kaavya Chinniah, Audris Mockus, Megh Mehta, Nachiappan Nagappan

License: CC BY 4.0

Abstract: Release engineering has traditionally focused on continuously delivering features and bug fixes to users, but at a certain scale, it becomes impossible for a release engineering team to determine what should be released. At Meta's scale, the responsibility appropriately and necessarily falls back on the engineer writing and reviewing the code. To address this challenge, we developed models of diff risk scores (DRS) to determine how likely a diff is to cause a SEV, i.e., a severe fault that impacts end-users. Assuming that SEVs are only caused by diffs, a naive model could randomly gate X% of diffs from landing, which would automatically catch X% of SEVs on average. However, we aimed to build a model that can capture Y% of SEVs by gating X% of diffs, where Y >> X. By training the model on historical data on diffs that have caused SEVs in the past, we can predict the riskiness of an outgoing diff to cause a SEV. Diffs that are beyond a particular threshold of risk can then be gated. We have four types of gating: no gating (green), weekend gating (weekend), medium impact on end-users (yellow), and high impact on end-users (red). The input parameter for our models is the level of gating, and the outcome measure is the number of captured SEVs. Our research approaches include a logistic regression model, a BERT-based model, and generative LLMs. Our baseline regression model captures 18.7%, 27.9%, and 84.6% of SEVs while respectively gating the top 5% (weekend), 10% (yellow), and 50% (red) of risky diffs. The BERT-based model, StarBERT, only captures 0.61x, 0.85x, and 0.81x as many SEVs as the logistic regression for the weekend, yellow, and red gating zones, respectively. The generative LLMs, iCodeLlama-34B and iDiffLlama-13B, when risk-aligned, capture more SEVs than the logistic regression model in production: 1.40x, 1.52x, 1.05x, respectively.
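The "capture Y% of SEVs by gating X% of diffs" measure from the abstract can be made concrete with a small sketch. This is an illustration only, not the authors' evaluation code: the function name `capture_rate`, the toy scores, and the toy labels are all hypothetical. It ranks diffs by risk score, gates the top fraction, and reports what share of SEV-causing diffs fell inside the gated set. A random gate would capture about the same fraction as it gates, so a useful model needs to score well above that.

```python
from typing import Sequence


def capture_rate(
    scores: Sequence[float],
    caused_sev: Sequence[bool],
    gate_fraction: float,
) -> float:
    """Share of SEV-causing diffs found in the top `gate_fraction` of diffs.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(scores) != len(caused_sev):
        raise ValueError("scores and caused_sev must have the same length")
    total_sevs = sum(1 for flag in caused_sev if flag)
    if total_sevs == 0:
        return 0.0
    # Number of diffs to gate (X% of all diffs).
    n_gated = int(round(gate_fraction * len(scores)))
    # Rank diff indices from riskiest to least risky.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    captured = sum(1 for i in ranked[:n_gated] if caused_sev[i])
    return captured / total_sevs


# Toy data (made up): ten diffs, three of which caused a SEV.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [True, False, True, False, False, False, False, True, False, False]

gated = 0.3  # gate the riskiest 30% of diffs (X)
captured = capture_rate(toy_scores, toy_labels, gated)  # Y
print(f"gating {gated:.0%} of diffs captures {captured:.0%} of SEVs")
```

On this toy data, gating 30% of diffs captures two of the three SEV-causing diffs, whereas a random gate of the same size would be expected to capture about 30% of them.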

Submitted to arXiv on 08 Oct. 2024

Results of the summarizing process for the arXiv paper: 2410.06351v1

Release engineering has traditionally focused on continuously delivering features and bug fixes to users. At the scale of an organization like Meta, however, a release engineering team can no longer determine which changes should be released, so that responsibility shifts back to the engineers who write and review the code. To address this, the authors developed diff risk score (DRS) models that estimate how likely a diff is to cause a SEV, a severe fault that impacts end-users. The goal is a model that captures a high percentage of SEVs while gating only a small fraction of diffs. Trained on historical data about diffs that caused SEVs, the model predicts the riskiness of an outgoing diff, and diffs above a risk threshold are gated. There are four gating levels: no gating (green), weekend gating (weekend), medium impact on end-users (yellow), and high impact on end-users (red).
The approaches explored are a logistic regression model, the BERT-based model StarBERT, and the generative LLMs iCodeLlama-34B and iDiffLlama-13B. The baseline regression model captures 18.7%, 27.9%, and 84.6% of SEVs while gating the top 5% (weekend), 10% (yellow), and 50% (red) of risky diffs. StarBERT performs worse than this baseline, capturing only 0.61x, 0.85x, and 0.81x as many SEVs in the weekend, yellow, and red zones. The generative LLMs, when risk-aligned, outperform the baseline, capturing 1.40x, 1.52x, and 1.05x as many SEVs.
Developers see the risk score of their diffs in the Phabricator UI, along with a way to give feedback on the score, the reasons a diff is considered risky, and actions they can take to reduce the score and land the change. The choice of predictors in the models prioritizes actionable factors while balancing performance against ease of understanding for developers.
Large language models offer a further benefit: they can incorporate content-based features such as the code changes and test plans into risk assessment, giving a more complete picture of diff risk than metadata analysis alone. Overall, the paper covers the complexities of software development at Meta, the evolution of code freeze practices, how gated diff risk is presented to developers through UI elements such as risk scores and feedback mechanisms, and a comparison of risk modeling techniques aimed at improving release engineering.
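The zone-based gating and the relative capture figures reported in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not Meta's implementation: the names `should_gate` and `implied_capture_pct` are hypothetical, the risk score is assumed to be available as a percentile rank between 0 and 1 (1.0 being the riskiest), and the three LLM multipliers are assumed to map to the weekend, yellow, and red zones in that order. The implied absolute percentages are simple products of the abstract's numbers and are not reported in the source.

```python
# Share of the riskiest diffs gated in each zone (from the abstract).
GATED_FRACTION = {"green": 0.0, "weekend": 0.05, "yellow": 0.10, "red": 0.50}

# % of SEVs captured by the baseline logistic regression (from the abstract).
BASELINE_CAPTURE_PCT = {"weekend": 18.7, "yellow": 27.9, "red": 84.6}

# Capture relative to the baseline (from the abstract). Mapping the LLM
# figures to zones in this order is an assumption.
RELATIVE_CAPTURE = {
    "StarBERT": {"weekend": 0.61, "yellow": 0.85, "red": 0.81},
    "risk-aligned LLMs": {"weekend": 1.40, "yellow": 1.52, "red": 1.05},
}


def should_gate(risk_percentile: float, zone: str) -> bool:
    """Gate a diff if it falls in the zone's top gated fraction.

    `risk_percentile` is a hypothetical input: 0.0 (safest) to 1.0 (riskiest).
    """
    return risk_percentile > 1.0 - GATED_FRACTION[zone]


def implied_capture_pct(model: str, zone: str) -> float:
    """Baseline capture rate scaled by the model's reported multiplier."""
    return BASELINE_CAPTURE_PCT[zone] * RELATIVE_CAPTURE[model][zone]


# A diff at the 93rd risk percentile lands in green and weekend zones,
# but is gated in yellow and red zones.
for zone in GATED_FRACTION:
    print(zone, should_gate(0.93, zone))

for model in RELATIVE_CAPTURE:
    for zone in BASELINE_CAPTURE_PCT:
        print(f"{model} / {zone}: ~{implied_capture_pct(model, zone):.1f}% of SEVs")
```

Under these assumptions, the multipliers imply that the risk-aligned LLMs would capture roughly 26% of SEVs in the weekend zone versus 18.7% for the baseline, while StarBERT would capture roughly 11%.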
Created on 07 Apr. 2025

