Efficient Knowledge Graph Construction and Retrieval from Unstructured Text for Large-Scale RAG Systems

AI-generated keywords: Graph-based Retrieval Augmented Generation, Enterprise Environments, Knowledge Graph Construction, Lightweight Subgraph Retrieval, Large Language Models

AI-generated Key Points

  • Proposed a scalable and cost-efficient framework for deploying GraphRAG in enterprise environments
  • Introduced two core innovations:
      • Dependency-based knowledge graph construction pipeline that eliminates reliance on large language models (LLMs)
      • Lightweight graph retrieval strategy for high-recall, low-latency subgraph extraction
  • Evaluated the framework on SAP datasets for legacy code migration with strong empirical performance improvements over traditional RAG baselines
  • Dependency-based construction approach achieved comparable performance to LLM-generated knowledge graphs while reducing costs and improving scalability
  • Highlighted scalability by eliminating dependence on large language models for knowledge graph construction
  • Future investigations needed to address limitations such as missing context-dependent or implicit relations not directly expressed in surface syntax
  • Plan to evaluate generalizability of the method to other settings beyond SAP-specific domains by testing on broader public benchmarks like HotpotQA
  • Study presents a promising path for scaling GraphRAG systems in real-world enterprise applications without prohibitive resource requirements, enabling practical, explainable, and domain-adaptable retrieval-augmented reasoning systems.

Authors: Congmin Min, Rhea Mathew, Joyce Pan, Sahil Bansal, Abbas Keshavarzi, Amar Viswanathan Kannan

License: CC BY-NC-SA 4.0

Abstract: We propose a scalable and cost-efficient framework for deploying Graph-based Retrieval Augmented Generation (GraphRAG) in enterprise environments. While GraphRAG has shown promise for multi-hop reasoning and structured retrieval, its adoption has been limited by the high computational cost of constructing knowledge graphs using large language models (LLMs) and the latency of graph-based retrieval. To address these challenges, we introduce two core innovations: (1) a dependency-based knowledge graph construction pipeline that leverages industrial-grade NLP libraries to extract entities and relations from unstructured text, completely eliminating reliance on LLMs; and (2) a lightweight graph retrieval strategy that combines hybrid query node identification with efficient one-hop traversal for high-recall, low-latency subgraph extraction. We evaluate our framework on two SAP datasets focused on legacy code migration and demonstrate strong empirical performance. Our system achieves up to 15% and 4.35% improvements over traditional RAG baselines based on LLM-as-Judge and RAGAS metrics, respectively. Moreover, our dependency-based construction approach attains 94% of the performance of LLM-generated knowledge graphs (61.87% vs. 65.83%) while significantly reducing cost and improving scalability. These results validate the feasibility of deploying GraphRAG systems in real-world, large-scale enterprise applications without incurring prohibitive resource requirements, paving the way for practical, explainable, and domain-adaptable retrieval-augmented reasoning.
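To make the first innovation more concrete, here is a minimal sketch of how subject-verb-object triples might be read off a dependency parse without calling an LLM. It is not the authors' pipeline: the Token class, the label sets, and the extract_triples function are hypothetical names introduced for illustration, and the parse is hand-annotated here, where a real system would take it from a dependency parser.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Hypothetical stand-in for one token of a dependency parser's output."""
    text: str  # surface form
    head: int  # index of the syntactic head (the root points to itself)
    dep: str   # dependency label, e.g. "nsubj" or "dobj"
    pos: str   # coarse part-of-speech tag, e.g. "VERB"


Triple = Tuple[str, str, str]

# Assumed label sets; actual labels depend on the parser and its scheme.
SUBJECT_LABELS = {"nsubj", "nsubjpass"}
OBJECT_LABELS = {"dobj", "obj"}


def extract_triples(tokens: List[Token]) -> List[Triple]:
    """Emit (subject, verb, object) for every verb that has both
    a subject child and an object child in the dependency tree."""
    triples: List[Triple] = []
    for i, tok in enumerate(tokens):
        if tok.pos != "VERB":
            continue
        subjects = [t.text for t in tokens
                    if t.head == i and t.dep in SUBJECT_LABELS]
        objects = [t.text for t in tokens
                   if t.head == i and t.dep in OBJECT_LABELS]
        for subj in subjects:
            for obj in objects:
                triples.append((subj, tok.text, obj))
    return triples


# Hand-annotated parse of "The function calls the module."
sentence = [
    Token("The", 1, "det", "DET"),
    Token("function", 2, "nsubj", "NOUN"),
    Token("calls", 2, "ROOT", "VERB"),
    Token("the", 4, "det", "DET"),
    Token("module", 2, "dobj", "NOUN"),
]

print(extract_triples(sentence))
```

A sketch like this only captures relations that are explicit in the surface syntax, which matches the limitation the paper itself names: context-dependent or implicit relations are missed.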

Submitted to arXiv on 04 Jul. 2025


Results of the summarizing process for the arXiv paper: 2507.03226v1

In this study, we propose a scalable and cost-efficient framework for deploying GraphRAG in enterprise environments. The adoption of GraphRAG has been limited by the high computational cost of constructing knowledge graphs using large language models (LLMs) and by the latency of graph-based retrieval. To address these challenges, we introduce two core innovations: a dependency-based knowledge graph construction pipeline that eliminates reliance on LLMs by leveraging industrial-grade NLP libraries for entity and relation extraction from unstructured text, and a lightweight graph retrieval strategy that combines hybrid query node identification with efficient one-hop traversal for high-recall, low-latency subgraph extraction. We evaluate our framework on two SAP datasets focused on legacy code migration and demonstrate strong empirical performance. Our system achieves improvements of up to 15% and 4.35% over traditional RAG baselines based on LLM-as-Judge and RAGAS metrics, respectively. Additionally, our dependency-based construction approach attains 94% of the performance of LLM-generated knowledge graphs (61.87% vs. 65.83%) while reducing costs and improving scalability, since knowledge graph construction no longer depends on LLMs. However, future investigations are needed to address limitations such as missing context-dependent or implicit relations that are not directly expressed in surface syntax. We also plan to evaluate the generalizability of our method beyond SAP-specific domains by testing it on broader public benchmarks such as HotpotQA. In conclusion, our study presents a promising path for scaling GraphRAG systems in real-world enterprise applications without prohibitive resource requirements. By combining efficient knowledge graph construction from unstructured text with lightweight subgraph retrieval strategies, we pave the way for practical, explainable, and domain-adaptable retrieval-augmented reasoning systems in large-scale enterprise environments.
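The retrieval side can be sketched in the same spirit. The example below is an illustration under stated assumptions, not the paper's implementation: "hybrid" query node identification is interpreted here as exact substring matching plus a fuzzy string match (Python's difflib, standing in for whatever similarity measure the authors use), and the function names, the 0.8 threshold, and the toy triples are all hypothetical.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def build_index(triples: Iterable[Triple]) -> Dict[str, List[Triple]]:
    """Map every node to the triples it takes part in (as subject or object)."""
    index: Dict[str, List[Triple]] = defaultdict(list)
    for subj, rel, obj in triples:
        index[subj].append((subj, rel, obj))
        index[obj].append((subj, rel, obj))
    return index


def find_query_nodes(query: str, nodes: Iterable[str],
                     threshold: float = 0.8) -> Set[str]:
    """Hybrid identification (assumed): a node is selected if its name occurs
    in the query, or if it is a close fuzzy match to some query word."""
    q = query.lower()
    words = re.findall(r"\w+", q)
    selected: Set[str] = set()
    for node in nodes:
        name = node.lower()
        if name in q:
            selected.add(node)  # exact (substring) match
        elif any(SequenceMatcher(None, name, w).ratio() >= threshold
                 for w in words):
            selected.add(node)  # fuzzy match, tolerates small typos
    return selected


def one_hop_subgraph(query_nodes: Iterable[str],
                     index: Dict[str, List[Triple]]) -> List[Triple]:
    """Collect every triple touching a query node, without duplicates."""
    seen: Set[Triple] = set()
    subgraph: List[Triple] = []
    for node in sorted(query_nodes):
        for triple in index.get(node, []):
            if triple not in seen:
                seen.add(triple)
                subgraph.append(triple)
    return subgraph


# Toy graph, invented for illustration only.
triples = [
    ("function", "calls", "module"),
    ("module", "reads", "table"),
    ("report", "writes", "log"),
]
index = build_index(triples)
nodes = find_query_nodes("Which module does the function call?", index.keys())
print(one_hop_subgraph(nodes, index))
```

Restricting traversal to one hop keeps the lookup proportional to the number of edges around the matched nodes, which is one plausible reading of why the strategy is described as low-latency; the triples returned would then be passed to the generator as context.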
Created on 01 Aug. 2025

