Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding

AI-generated keywords: Large Language Models

AI-generated Key Points

  • Large Language Models (LLMs) are used in software engineering for generating UML class diagrams from natural language descriptions
  • LLMs can produce syntactically correct diagrams but may lack meaningful design synthesis
  • A preference-based few-shot prompting approach is introduced to bias LLM outputs towards object-oriented principles and pattern-consistent structures
  • Evaluation involves three LLMs and three modeling strategies, showing that preference-based alignment improves adherence to design intent but does not eliminate non-determinism
  • Model behavior significantly influences the reliability of generated designs, highlighting the importance of effective prompting techniques and understanding model behavior for dependable LLM-assisted software design
  • LLMs may not capture essential design principles like abstraction and encapsulation, leading to inconsistency and unreliability in downstream implementation processes

Authors: Rabia Iftikhar, Andreas Rausch

International Conference on Software Architecture 2026
License: CC BY 4.0

Abstract: Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce syntactically valid diagrams, syntactic correctness alone does not guarantee meaningful design. This study investigates whether LLMs can move beyond diagram translation to perform design synthesis, and how reliably they maintain design-oriented reasoning under variation. We introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs satisfying object-oriented principles and pattern-consistent structures. Two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs, are used to evaluate three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting, totaling 540 experiments (i.e., 2 × 3 × 10 × 3 × 3). Results indicate that while preference-based alignment improves adherence to design intent, it does not eliminate non-determinism, and model-level behavior strongly influences design reliability. These findings highlight that achieving dependable LLM-assisted software design requires not only effective prompting but also careful consideration of model behavior and robustness.
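
The experiment count in the abstract follows from a full factorial design over five factors. The sketch below enumerates that grid to confirm the total of 540. The model and strategy names come from the abstract; the benchmark and prompt labels are placeholders, since the abstract does not name them, and the function name `experiment_grid` is hypothetical.

```python
from itertools import product

# Factor levels as described in the abstract.
# Benchmark and prompt-variant labels are placeholders (assumed, not from the paper).
BENCHMARKS = ["benchmark_1", "benchmark_2"]
PROMPT_VARIANTS = ["paraphrase_1", "paraphrase_2", "paraphrase_3"]
RUNS = range(1, 11)  # 10 repeated runs per configuration
MODELS = ["ChatGPT 4o-mini", "Claude 3.5 Sonnet", "Gemini 2.5 Flash"]
STRATEGIES = ["standard", "rule-injection", "preference-based"]


def experiment_grid():
    """Return every (benchmark, prompt, run, model, strategy) combination."""
    return list(product(BENCHMARKS, PROMPT_VARIANTS, RUNS, MODELS, STRATEGIES))


grid = experiment_grid()
print(len(grid))  # 2 * 3 * 10 * 3 * 3 = 540
```

Each tuple in `grid` identifies one generation run, so results can be grouped by any single factor (for example, by model or by strategy) when comparing reliability.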

Submitted to arXiv on 01 Apr. 2026


Results of the summarizing process for the arXiv paper: 2604.00851v1

Large Language Models (LLMs) are increasingly used in software engineering tasks, including the automated generation of UML class diagrams from natural language descriptions. Previous research has shown that LLMs can produce syntactically correct diagrams, but syntactic correctness alone does not ensure meaningful design. This study examines whether LLMs can go beyond diagram translation to perform design synthesis, and how consistently they maintain design-oriented reasoning under variation.

To improve the quality of generated designs, the authors introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs that satisfy object-oriented principles and exhibit pattern-consistent structures. The evaluation covers three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting. In total, 540 experiments are conducted using two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs. Results indicate that preference-based alignment improves adherence to design intent but does not eliminate non-determinism, and that model-level behavior strongly influences the reliability of the generated designs.

Although LLMs can produce syntactically valid diagrams from natural language descriptions, they may fail to capture essential design principles such as abstraction and encapsulation, which experienced designers apply to achieve extensibility and reusability. Without this design knowledge embedded in the generated diagrams, there is a risk of inconsistency and unreliability in downstream implementation. Overall, dependable LLM-assisted software design requires both effective prompting and careful attention to model behavior and robustness.
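
One simple way to quantify the run-to-run non-determinism discussed above is to compare the sets of class names produced by repeated runs of the same prompt. The sketch below uses mean pairwise Jaccard similarity for this. It is an illustrative measure only, not the metric used in the paper, and the function names and sample class names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all pairs of runs (1.0 = fully consistent)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # zero or one run: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical class-name sets extracted from three repeated runs.
runs = [
    {"Order", "Customer", "PaymentStrategy"},
    {"Order", "Customer", "PaymentStrategy"},
    {"Order", "Customer", "Payment"},
]
score = mean_pairwise_jaccard(runs)
print(round(score, 3))
```

A score near 1.0 would suggest that repeated runs converge on the same set of classes, while lower scores indicate structural drift between runs. Comparing only class names ignores relationships and attributes, so this is a coarse first check rather than a full design-consistency measure.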
Created on 22 Apr. 2026


