Ontology Matching with Large Language Models and Prioritized Depth-First Search

AI-generated keywords: Ontology Matching, Large Language Models, MILA, Retrieve-Identify-Prompt Pipeline, Semantic Correspondences

AI-generated Key Points

Ontology matching (OM) is crucial for data interoperability and knowledge sharing.
Challenges in OM include the need for large training datasets and limited vocabulary processing in machine learning approaches.
Recent advancements in Large Language Models (LLMs) have shown promise in OM through a retrieve-then-prompt pipeline.
A novel approach called MILA embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy to efficiently identify semantic correspondences with high accuracy.
MILA achieved the highest F-Measure in four out of five unsupervised tasks, outperforming existing OM systems by up to 17% and performing comparably or better than leading supervised OM systems.
MILA exhibited task-agnostic performance across all tasks and settings while significantly reducing LLM requests.


Authors: Maria Taboada, Diego Martinez, Mohammed Arideh, Rosa Mosquera

arXiv: 2501.11441v2 (cs.IR)

License: CC BY 4.0

Abstract: Ontology matching (OM) plays a key role in enabling data interoperability and knowledge sharing, but it remains challenging due to the need for large training datasets and limited vocabulary processing in machine learning approaches. Recently, methods based on Large Language Models (LLMs) have shown great promise in OM, particularly through the use of a retrieve-then-prompt pipeline. In this approach, relevant target entities are first retrieved and then used to prompt the LLM to predict the final matches. Despite their potential, these systems still present limited performance and high computational overhead. To address these issues, we introduce MILA, a novel approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy. This approach efficiently identifies a large number of semantic correspondences with high accuracy, limiting LLM requests to only the most borderline cases. We evaluated MILA using the biomedical challenge proposed in the 2023 and 2024 editions of the Ontology Alignment Evaluation Initiative. Our method achieved the highest F-Measure in four of the five unsupervised tasks, outperforming state-of-the-art OM systems by up to 17%. It also performed better than or comparable to the leading supervised OM systems. MILA further exhibited task-agnostic performance, remaining stable across all tasks and settings, while significantly reducing LLM requests. These findings highlight that high-performance LLM-based OM can be achieved through a combination of programmed (PDFS), learned (embedding vectors), and prompting-based heuristics, without the need for domain-specific heuristics or fine-tuning.
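
The retrieve-then-prompt pipeline mentioned in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the token-overlap function `jaccard` stands in for embedding-based retrieval, and `ask_llm` (here the toy `fake_llm`) is a hypothetical callable standing in for a real LLM request.

```python
from typing import Callable, Optional


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(source: str, targets: list[str], k: int = 3) -> list[str]:
    """Step 1: retrieve the k target labels most similar to the source label."""
    return sorted(targets, key=lambda t: jaccard(source, t), reverse=True)[:k]


def retrieve_then_prompt(
    source: str,
    targets: list[str],
    ask_llm: Callable[[str], str],
    k: int = 3,
) -> Optional[str]:
    """Step 2: prompt the (hypothetical) LLM once per retrieved candidate."""
    for candidate in retrieve(source, targets, k):
        prompt = (
            f"Do '{source}' and '{candidate}' denote the same concept? "
            "Answer yes or no."
        )
        if ask_llm(prompt).strip().lower().startswith("yes"):
            return candidate
    return None


def fake_llm(prompt: str) -> str:
    """Toy stand-in for an LLM: accepts exactly one known synonym pair."""
    known = "'heart attack'" in prompt and "'myocardial infarction'" in prompt
    return "yes" if known else "no"


targets = ["myocardial infarction", "heart failure", "panic attack", "renal failure"]
match = retrieve_then_prompt("heart attack", targets, fake_llm)
print(match)
```

Note that every retrieved candidate costs one LLM request in this baseline, which is the overhead the abstract says MILA aims to reduce.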

Submitted to arXiv on 20 Jan. 2025


Results of the summarizing process for the arXiv paper: 2501.11441v2

Comprehensive Summary

Ontology matching (OM) is crucial for data interoperability and knowledge sharing. However, it faces challenges such as the need for large training datasets and limited vocabulary processing in machine learning approaches. Methods based on Large Language Models (LLMs) have recently shown promise in OM through a retrieve-then-prompt pipeline, in which relevant target entities are first retrieved and then used to prompt the LLM to predict the final matches. Despite this progress, these systems still show limited performance and high computational overhead.

To address these issues, the authors introduce MILA, an approach that embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy to identify semantic correspondences efficiently and with high accuracy. The approach sends only the most borderline cases to the LLM, which reduces the computational burden.

MILA was evaluated on the biomedical challenge of the 2023 and 2024 editions of the Ontology Alignment Evaluation Initiative. It achieved the highest F-Measure in four out of five unsupervised tasks, outperforming existing OM systems by up to 17%, and performed comparably to or better than leading supervised OM systems. MILA also exhibited task-agnostic performance, remaining stable across all tasks and settings while significantly reducing LLM requests. These findings indicate that high-performance LLM-based OM can be achieved through a combination of programmed (PDFS), learned (embedding vectors), and prompting-based heuristics, without relying on domain-specific heuristics or fine-tuning.

The authors acknowledge support from the University of Santiago de Compostela and from the projects AF4EU and SUS-SOIL, funded by the European Union's Horizon Europe program. They also thank Diego Martinez-Taboada for insightful conversations. The data for MILA are available at https://github.com/mariatab/MILA. A detailed overview of the state-of-the-art OM systems used for comparison is provided in Table A.10 in the appendix.
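
The idea of sending only borderline cases to the LLM can be illustrated with a simple two-threshold rule. The thresholds `accept` and `reject`, the function names, and the toy similarity scores below are assumptions made for illustration; this summary does not specify how MILA decides which cases are borderline.

```python
from typing import Callable

Pair = tuple[str, str]


def identify(score: float, accept: float = 0.9, reject: float = 0.5) -> str:
    """Classify a candidate pair by its similarity score (assumed thresholds)."""
    if score >= accept:
        return "match"
    if score < reject:
        return "no_match"
    return "borderline"


def match_with_gating(
    scored_pairs: list[tuple[str, str, float]],
    ask_llm: Callable[[str, str], bool],
) -> tuple[list[Pair], int]:
    """Accept or reject confident pairs directly; ask the LLM only when unsure."""
    matches: list[Pair] = []
    llm_requests = 0
    for source, target, score in scored_pairs:
        decision = identify(score)
        if decision == "borderline":
            llm_requests += 1  # the only place an LLM request is issued
            decision = "match" if ask_llm(source, target) else "no_match"
        if decision == "match":
            matches.append((source, target))
    return matches, llm_requests


def toy_oracle(source: str, target: str) -> bool:
    """Hypothetical LLM stand-in that knows one synonym pair."""
    return (source, target) == ("kidney", "renal organ")


pairs = [
    ("heart", "heart", 0.98),         # confident match, no LLM request
    ("kidney", "renal organ", 0.72),  # borderline, one LLM request
    ("lung", "liver", 0.20),          # confident non-match, no LLM request
]
matches, llm_requests = match_with_gating(pairs, toy_oracle)
print(matches, llm_requests)
```

In this toy run only one of the three candidate pairs triggers an LLM request, which is the kind of saving the summary attributes to the identify step.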

Summary

Ontology matching (OM) helps different computers understand and share information. It can be hard because we need big sets of examples and some machines don't know many words. New technology called Large Language Models (LLMs) is making OM better by using a special process. One new way, MILA, uses a smart strategy to find the right answers quickly and accurately. MILA is very good at its job, doing better than other systems in many tests.

Definitions

- Ontology matching (OM): Making sure different computers can understand and share information.
- Large Language Models (LLMs): Advanced technology that helps computers understand language better.
- Retrieve-then-prompt pipeline: A step-by-step process where the computer looks for information before asking questions.
- Semantic correspondences: Finding connections between words or ideas that have similar meanings.
- F-Measure: A way to measure how well a system performs in finding the right answers.
- Task-agnostic performance: Being able to do well on different jobs without needing special instructions.
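
For readers who want the F-Measure made concrete, here is a small sketch using the standard definition: the harmonic mean of precision and recall, computed over sets of correspondences. The function name `f_measure` and the toy alignments are illustrative assumptions and do not come from the paper's evaluation.

```python
def f_measure(predicted: set, reference: set) -> tuple[float, float, float]:
    """Return (precision, recall, F-Measure) for a predicted alignment."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Toy alignments: each correspondence is a (source, target) pair.
predicted = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b9")}
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a5", "b5"), ("a6", "b6")}

precision, recall, f = f_measure(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} F={f:.2f}")
```

Here three of the four predicted pairs are correct (precision 0.75) and three of the five reference pairs are found (recall 0.60), so the F-Measure lies between the two, at about 0.67.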

Ontology matching (OM) is a crucial aspect of data interoperability and knowledge sharing. It involves identifying semantic correspondences between different ontologies, which are formal representations of knowledge in a specific domain. OM plays a vital role in fields such as healthcare, e-commerce, and information retrieval. However, it faces challenges such as the need for large training datasets and limited vocabulary processing in machine learning approaches.

In recent years, there has been significant progress in using Large Language Models (LLMs) for ontology matching. LLMs are language models pre-trained on large amounts of text. They have shown promise in OM through a retrieve-then-prompt pipeline, in which relevant target entities are first retrieved and then used to prompt the LLM to predict the final matches. Despite this progress, LLM-based OM systems still face limitations in performance and computational overhead.

To address these issues, researchers at the University of Santiago de Compostela have introduced a novel approach called MILA. This approach embeds a retrieve-identify-prompt pipeline within a prioritized depth-first search (PDFS) strategy to efficiently identify semantic correspondences with high accuracy. The key advantage of MILA is that it sends only the most borderline cases to the LLM, which significantly reduces the computational burden. It does so by combining programmed strategies (PDFS), learned embedding vectors, and prompting-based heuristics, without relying on domain-specific heuristics or fine-tuning.

To evaluate its effectiveness, MILA was tested on the biomedical challenge from the Ontology Alignment Evaluation Initiative. It achieved the highest F-Measure in four out of five unsupervised tasks, outperforming existing OM systems by up to 17%. It also performed comparably to or better than leading supervised OM systems while significantly reducing LLM requests. These findings indicate that high-performance LLM-based OM is achievable without domain-specific heuristics or fine-tuning.

The authors acknowledge support from the University of Santiago de Compostela and from the projects AF4EU and SUS-SOIL, funded by the European Union's Horizon Europe program. They also thank Diego Martinez-Taboada for insightful conversations. The data used in this study are publicly available at https://github.com/mariatab/MILA. For a more comprehensive picture, Table A.10 in the appendix provides a detailed overview of the state-of-the-art OM systems used for comparison in the study. This includes information such as system name, type (supervised or unsupervised), and performance metrics.

In conclusion, MILA offers a promising way to address the challenges faced by LLM-based OM systems. Its retrieve-identify-prompt pipeline, embedded in a prioritized depth-first search strategy, outperformed existing approaches on most of the evaluated tasks while significantly reducing computational burden. With further development and refinement, MILA could substantially improve ontology matching and contribute to better data interoperability and knowledge sharing across domains.
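
To give a feel for what a prioritized depth-first search looks like, the following sketch aligns source entities to target entities by always trying the highest-scoring unused candidate first and backtracking on dead ends. It is a generic illustration and not the PDFS algorithm from the paper, whose details are not given in this summary: the score table, the one-to-one constraint, and the function name `pdfs_align` are all assumptions.

```python
from typing import Optional

Scores = dict[str, dict[str, float]]


def pdfs_align(
    scores: Scores,
    sources: list[str],
    used: frozenset = frozenset(),
) -> Optional[dict[str, str]]:
    """Assign each source a distinct target, best-scoring candidates first."""
    if not sources:
        return {}  # every source has been aligned
    first, rest = sources[0], sources[1:]
    # Prioritize: order the unused candidate targets by descending score.
    candidates = sorted(
        (t for t in scores[first] if t not in used),
        key=lambda t: scores[first][t],
        reverse=True,
    )
    for target in candidates:
        # Depth-first: commit to this pair, then align the remaining sources.
        tail = pdfs_align(scores, rest, used | {target})
        if tail is not None:
            return {first: target, **tail}
    return None  # dead end: the caller backtracks to its next candidate


# Toy score table (assumed values). "A" prefers "x", but "B" can only use "x".
scores = {
    "A": {"x": 0.9, "y": 0.8},
    "B": {"x": 0.7},
}
alignment = pdfs_align(scores, ["A", "B"])
print(alignment)
```

In the toy example the search first tries the pair A and x, finds that B is then left without a candidate, backtracks, and settles on A with y and B with x. A purely greedy matcher would have stopped at the first choice and left B unmatched.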

Created on 25 Apr. 2025


