Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding

AI-generated keywords: Large Language Models

AI-generated Key Points

Large Language Models (LLMs) are used in software engineering for generating UML class diagrams from natural language descriptions
LLMs can produce syntactically correct diagrams but may lack meaningful design synthesis
A preference-based few-shot prompting approach is introduced to bias LLM outputs towards object-oriented principles and pattern-consistent structures
Evaluation involves three LLMs and three modeling strategies, showing that preference-based alignment improves adherence to design intent but does not eliminate non-determinism
Model behavior significantly influences the reliability of generated designs, highlighting the importance of effective prompting techniques and understanding model behavior for dependable LLM-assisted software design
LLMs may not capture essential design principles like abstraction and encapsulation, leading to inconsistency and unreliability in downstream implementation processes


Authors: Rabia Iftikhar, Andreas Rausch

International Conference on Software Architecture 2026

arXiv: 2604.00851v1 (cs.SE)

License: CC BY 4.0

Abstract: Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce syntactically valid diagrams, syntactic correctness alone does not guarantee meaningful design. This study investigates whether LLMs can move beyond diagram translation to perform design synthesis, and how reliably they maintain design-oriented reasoning under variation. We introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs satisfying object-oriented principles and pattern-consistent structures. Two design-intent benchmarks, each with three domain-only, paraphrased prompts and 10 repeated runs, are used to evaluate three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) across three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting, totaling 540 experiments (i.e., 2 × 3 × 10 × 3 × 3). Results indicate that while preference-based alignment improves adherence to design intent, it does not eliminate non-determinism, and model-level behavior strongly influences design reliability. These findings highlight that achieving dependable LLM-assisted software design requires not only effective prompting but also careful consideration of model behavior and robustness.
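The experiment count in the abstract can be checked by enumerating the grid. The sketch below is only an illustration: the model and strategy names come from the abstract, while the benchmark and prompt labels (`benchmark_A`, `paraphrase_1`, and so on) are placeholder names assumed here, not identifiers from the paper.

```python
from itertools import product

# Benchmark and prompt labels are hypothetical placeholders; the abstract
# only states how many of each exist, not what they are called.
BENCHMARKS = ["benchmark_A", "benchmark_B"]
PROMPT_VARIANTS = ["paraphrase_1", "paraphrase_2", "paraphrase_3"]
RUNS = range(1, 11)  # 10 repeated runs per configuration

# Model and strategy names as listed in the abstract.
MODELS = ["ChatGPT 4o-mini", "Claude 3.5 Sonnet", "Gemini 2.5 Flash"]
STRATEGIES = ["standard", "rule-injection", "preference-based"]


def experiment_grid():
    """Return every (benchmark, prompt, run, model, strategy) combination."""
    return list(product(BENCHMARKS, PROMPT_VARIANTS, RUNS, MODELS, STRATEGIES))


grid = experiment_grid()
print(len(grid))  # 2 * 3 * 10 * 3 * 3 = 540
```

Each tuple in `grid` corresponds to one generation request, which is why the total matches the 540 experiments reported.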

Submitted to arXiv on 01 Apr. 2026


Results of the summarizing process for the arXiv paper: 2604.00851v1


Large Language Models (LLMs) are increasingly used in software engineering, notably to generate UML class diagrams from natural language descriptions. Previous research has shown that LLMs can produce syntactically correct diagrams, but syntactic correctness alone does not ensure meaningful design. This study examines whether LLMs can go beyond diagram translation to perform design synthesis, and how consistently they maintain design-oriented reasoning under variation.

To improve the quality of generated designs, the authors introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs that follow object-oriented principles and exhibit pattern-consistent structures. The evaluation covers three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash) and three modeling strategies: standard prompting, rule-injection prompting, and preference-based prompting. Two design-intent benchmarks, each with three paraphrased, domain-only prompts and 10 repeated runs, yield 540 experiments in total.

Results indicate that preference-based alignment improves adherence to design intent but does not eliminate non-determinism, and that model-level behavior strongly influences the reliability of the generated designs. The summary also notes that syntactically valid diagrams can still miss design principles such as abstraction and encapsulation, which experienced designers apply for extensibility and reusability. Without that knowledge embedded in the generated diagrams, downstream implementation risks becoming inconsistent and unreliable. Dependable LLM-assisted software design therefore requires effective prompting together with careful attention to model behavior and robustness.


Summary

Large Language Models (LLMs) are like smart tools used in computer work to make pictures from talking. They can make correct pictures but might not be very good at making them special. A new way of telling LLMs what to do is introduced to help them make better pictures that follow certain rules. Testing with different methods shows that this new way helps LLMs follow the rules better, but they can still sometimes be unpredictable. How the models act affects how good the pictures turn out, so it's important to use good ways of telling them what to do.

Definitions

- Large Language Models (LLMs): Big computer programs that understand and generate human language.
- UML class diagrams: Pictures used in software design to show how different parts of a program relate to each other.
- Object-oriented principles: Rules for organizing and designing software based on real-world objects.
- Pattern-consistent structures: Following a set way of arranging things in software design.
- Few-shot prompting approach: Giving small amounts of specific guidance or instructions to LLMs.
- Design intent: The original idea or plan behind creating something, like software designs.
- Non-determinism: When results are not always predictable or consistent.
- Abstraction and encapsulation: Concepts in software design for hiding complex details and protecting data.

Introduction

Large Language Models (LLMs) have attracted significant attention in recent years for their capabilities in natural language processing. Models such as GPT-3 and BERT have performed well in applications including text generation and translation. In software engineering, LLMs are increasingly used to automate tasks such as generating UML class diagrams from natural language descriptions. While previous research has demonstrated that LLMs can produce syntactically correct diagrams, it remains unclear whether they can go beyond diagram translation and engage in design synthesis. This study addresses that gap by examining how consistently LLMs maintain design-oriented reasoning under variation and whether they can generate meaningful designs that align with object-oriented principles.

The Research Paper

The research paper, titled "Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding", investigates the use of LLMs for generating UML class diagrams from natural language descriptions. The authors introduce a preference-based few-shot prompting approach that biases LLM outputs toward designs that follow object-oriented principles and exhibit pattern-consistent structures. They compare it with standard prompting and rule-injection prompting across three LLMs (ChatGPT 4o-mini, Claude 3.5 Sonnet, Gemini 2.5 Flash). The evaluation uses two design-intent benchmarks, each with three paraphrased, domain-only prompts and 10 repeated runs, so that both prompt sensitivity and run-to-run variation can be observed.
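To make the idea of preference-based few-shot prompting more concrete, here is a minimal sketch of how such a prompt might be assembled. The summary does not describe the paper's actual prompt format, so everything here is an assumption for illustration: the `PreferencePair` type, the `build_preference_prompt` function, the field labels, and the example designs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One few-shot example (hypothetical format): a requirement with a
    preferred design, a rejected design, and the reason for the preference."""
    requirement: str
    preferred: str
    rejected: str
    rationale: str


def build_preference_prompt(pairs, task):
    """Assemble a prompt that shows preferred vs. rejected designs before the task."""
    parts = [
        "You are designing a UML class diagram. "
        "Prefer designs like the 'Preferred' examples."
    ]
    for i, pair in enumerate(pairs, start=1):
        parts.append(
            f"Example {i}\n"
            f"Requirement: {pair.requirement}\n"
            f"Preferred: {pair.preferred}\n"
            f"Rejected: {pair.rejected}\n"
            f"Why: {pair.rationale}"
        )
    parts.append(f"Task\nRequirement: {task}\nDesign:")
    return "\n\n".join(parts)


# Illustrative example pair, invented for this sketch.
pairs = [
    PreferencePair(
        requirement="Support several payment methods.",
        preferred="Order depends on a PaymentStrategy interface with one class per method.",
        rejected="Order contains one method per payment type.",
        rationale="New payment methods can be added without changing Order.",
    )
]
prompt = build_preference_prompt(pairs, "Model a library lending system.")
print(prompt)
```

The resulting string would then be sent to whichever model is under test. Pairing each preferred design with a rejected one and a short rationale is one plausible way to express a preference in-context; the paper may structure its examples differently.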

Results

The results of the experiments indicate that the preference-based alignment improves adherence to design intent but does not completely eliminate non-determinism. Moreover, the behavior of the models significantly influences the reliability of the generated designs. This highlights the importance of considering model behavior and robustness when aiming for dependable LLM-assisted software design.
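One simple way to quantify the non-determinism mentioned above is to compare the outputs of repeated runs for the same prompt. The sketch below uses the mean pairwise Jaccard similarity of the class names found in each run. This metric is an assumption chosen for illustration, not the measure used in the paper, and the class-name sets are invented sample data.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of class names (1.0 if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs):
    """Average Jaccard similarity over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented sample data: class names extracted from three repeated runs.
runs = [
    {"Order", "Customer", "PaymentStrategy"},
    {"Order", "Customer", "PaymentStrategy"},
    {"Order", "Customer", "Payment"},
]
consistency = mean_pairwise_jaccard(runs)
print(round(consistency, 3))
```

A value of 1.0 would mean every run produced the same set of classes; lower values indicate more run-to-run variation. Class names alone ignore relationships and pattern structure, so a fuller comparison would also need to look at associations and inheritance.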

Discussion

The findings of this study underscore that reliable LLM-assisted software design requires more than effective prompting techniques: it also requires an understanding of model behavior and robustness, so that design intent and principles are followed consistently throughout the generation process. The research paper also highlights a gap in current LLM capabilities: generated diagrams may fail to capture essential design principles such as abstraction and encapsulation. These principles are vital for extensibility and reusability in software development, and without them embedded in the generated diagrams, there is a risk of inconsistency and unreliability in downstream implementation processes.

Conclusion

In conclusion, this research paper provides insights into using LLMs for software engineering tasks, specifically UML class diagram generation from natural language descriptions. The study introduces a preference-based few-shot prompting approach that improves the quality of generated designs by biasing outputs toward object-oriented principles. It also shows that model behavior and robustness must be considered when aiming for dependable LLM-assisted software design.

For future work, the study points to the need for models that capture essential design principles more accurately, and for evaluation methods that consider both syntactic correctness and adherence to design intent when assessing LLM-generated designs.

Created on 22 Apr. 2026


