Comparing Lexical and Semantic Vector Search Methods When Classifying Medical Documents

AI-generated keywords: Classification

AI-generated Key Points

Text categorization is a common AI problem aiming to group data into distinct categories.
Popular solution: Using embeddings to transform text into numerical representations.
Recent advancements in vector search focus on optimizing speed and predictive accuracy by learning language semantics.
Traditional methods such as the BM25 formula, with its hyperparameters k1 and b, still hold merit in the information retrieval toolkit.
Popular neural methods for vector search include Word2Vec and Language Models, which excel at handling various input formats and storing learned knowledge efficiently.
Methodology: Data extracted from 1472 medical documents involving personal details and medical histories from 100 cases over a 4-month period. Task involved assigning one of seven classes to each document based on explicit filename content or visual artifacts present.
Ethical considerations taken into account; data stored following ISO 12007 standards on the company's cloud storage.
Results presented using RGB bar colors to distinguish among different embedding methods; observed variation in number of words before and after preprocessing across different document classes.
Study suggests that off-the-shelf semantic vector search may not always yield optimal results compared to bespoke lexical vector search models for classifying rigidly structured medical documents.
Conclusion: While neural methods are popular, traditional methods still hold value in certain contexts within information retrieval processes. Important to consider data and task when choosing the most suitable approach for classification.


Authors: Lee Harris

arXiv: 2505.11582v2 - DOI (cs.IR)

This project was funded by a UKRI grant, number: 10048265

License: CC BY-SA 4.0

Abstract: Classification is a common AI problem, and vector search is a typical solution. This transforms a given body of text into a numerical representation, known as an embedding, and modern improvements to vector search focus on optimising speed and predictive accuracy. This is often achieved through neural methods that aim to learn language semantics. However, our results suggest that these are not always the best solution. Our task was to classify rigidly-structured medical documents according to their content, and we found that using off-the-shelf semantic vector search produced slightly worse predictive accuracy than creating a bespoke lexical vector search model, and that it required significantly more time to execute. These findings suggest that traditional methods deserve to be contenders in the information retrieval toolkit, despite the prevalence and success of neural models.

Submitted to arXiv on 16 May 2025


Results of the summarizing process for the arXiv paper: 2505.11582v2

Comprehensive Summary

Classification is a common AI problem that aims to categorize data into distinct groups. One popular solution for this task is vector search, which transforms text into numerical representations known as embeddings. Recent advancements in vector search have focused on using neural methods to optimize speed and predictive accuracy by learning language semantics. Despite the success of these modern techniques, there is evidence that traditional methods still hold merit in the information retrieval toolkit. For instance, the BM25 formula, which has been utilized for text analysis, introduces two hyperparameters, k1 and b, and BM25+ adds a third, δ (the default values given are 1, 0.75, and 1).

Some popular neural methods used for vector search include Word2Vec and Language Models. These models excel at handling various input formats and have shown exceptional performance on a wide range of tasks. They also store learned knowledge in model parameters, making them highly efficient.

In terms of methodology, data was extracted from 1472 medical documents containing personal details and medical histories from 100 cases collected over a 4-month period. The task involved assigning one of seven classes to each document based on explicit filename content assigned during data collection or visual artifacts present. Ethical considerations were taken into account during this process: document classification was used solely for internal company purposes, without consequences to stakeholders if classification results were inaccurate. To ensure security, the data was stored following ISO 12007 standards on the company's cloud storage.

The results from the classification experiments were presented using RGB bar colors to distinguish among different embedding methods. It was observed that the number of words in document texts before and after preprocessing varied across different document classes.

Overall, the study suggests that off-the-shelf semantic vector search may not always yield optimal results compared to bespoke lexical vector search models when classifying rigidly-structured medical documents. In conclusion, while neural methods have gained popularity in recent years, traditional methods still hold value in certain contexts within information retrieval processes. It is important to carefully consider the data and task at hand when choosing the most suitable approach for classification.
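
For readers unfamiliar with the hyperparameters mentioned in this summary, the BM25+ scoring function is commonly written as below. This is the textbook form, not necessarily the exact variant used in the paper. The symbols tf(t, d) (term frequency), |d| (document length), avgdl (average document length), N (number of documents), and df(t) (number of documents containing t) do not appear in the summary, and the IDF definition shown is one common choice, assumed here for illustration.

```latex
\mathrm{score}(q, d) = \sum_{t \in q \cap d} \mathrm{IDF}(t)
  \left[ \frac{(k_1 + 1)\,\mathrm{tf}(t, d)}
              {k_1 \left( 1 - b + b\,\frac{|d|}{\mathrm{avgdl}} \right) + \mathrm{tf}(t, d)}
         + \delta \right],
\qquad
\mathrm{IDF}(t) = \ln \frac{N + 1}{\mathrm{df}(t)}
```

Here k1 controls how quickly repeated occurrences of a term saturate, b controls how strongly long documents are penalised, and δ gives every matched term a minimum contribution regardless of document length.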

Summary

Text categorization is like sorting things into different groups using AI. One way to do this is by changing words into numbers. New ways to search for words quickly and accurately are being developed. Some old methods are still useful, like the BM25 formula with extra settings. There are popular ways, like Word2Vec, to search for words efficiently.

Definitions

- Text categorization: Sorting data into different categories.
- Embeddings: Changing text into numerical representations.
- Vector search: Searching for words quickly and accurately based on language meanings.
- Hyperparameters: Extra settings used in formulas.
- Neural methods: Techniques using artificial neural networks for processing data.

Introduction

Text classification is a common problem in the field of artificial intelligence (AI) that involves categorizing data into distinct groups. One popular solution for this task is using embeddings, which transform text into numerical representations. However, recent advancements in vector search have focused on using semantic vector search to optimize speed and predictive accuracy by learning language semantics. Despite the success of these modern techniques, there is evidence that suggests traditional methods still hold merit in the information retrieval toolkit. This research paper explores the use of traditional lexical vector search models compared to off-the-shelf semantic vector search methods for classifying rigidly-structured medical documents.

Background

In recent years, neural methods such as Word2Vec and Language Models have gained popularity for their ability to handle various input formats and achieve exceptional performance on a wide range of tasks. These models also store learned knowledge in model parameters, making them highly efficient. On the other hand, traditional methods like BM25 have long been utilized for text analysis. BM25 introduces two hyperparameters, k1 and b, and BM25+ adds a third, δ (the default values given are 1, 0.75, and 1). These methods have shown success in information retrieval processes but may not be as widely used due to the rise of neural approaches.
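
To make the role of k1, b, and δ concrete, here is a minimal sketch of BM25+ scoring in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `bm25_plus_scores`, the IDF form ln((N + 1) / df), the mapping of the defaults 1, 0.75, and 1 onto k1, b, and δ in that order, and the toy corpus are all hypothetical.

```python
import math
from collections import Counter


def bm25_plus_scores(query_tokens, corpus_tokens, k1=1.0, b=0.75, delta=1.0):
    """Score every document in a tokenised corpus against a tokenised query.

    Assumes a non-empty corpus with at least one non-empty document.
    The default values are an assumed reading of "1, 0.75, and 1".
    """
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        # Length normalisation: b = 0 ignores length, b = 1 applies it fully.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # only terms present in the document contribute
            idf = math.log((n_docs + 1) / df[term])  # assumed IDF variant
            # delta lower-bounds the contribution of each matched term.
            score += idf * ((k1 + 1) * tf[term] / (norm + tf[term]) + delta)
        scores.append(score)
    return scores


# Made-up toy corpus, for illustration only.
corpus = [
    ["discharge", "summary", "patient"],
    ["invoice", "payment", "patient"],
    ["referral", "letter"],
]
scores = bm25_plus_scores(["discharge", "patient"], corpus)
```

In this toy run the first document matches both query terms and should rank highest, while the third matches neither and scores zero.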

Methodology

The study extracted data from 1472 medical documents containing personal details and medical histories from 100 cases collected over a 4-month period. The task involved assigning one of seven classes to each document based on explicit filename content assigned during data collection or visual artifacts present. Ethical considerations were taken into account during this process, with document classification being used solely for internal company purposes without consequences to stakeholders if classification results were inaccurate. To ensure security, the data was stored following ISO 12007 standards on the company's cloud storage. The results from various classification experiments were presented using RGB bar colors to distinguish among different embedding methods. It was observed that the number of words in document texts before and after preprocessing varied across different document classes.
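
The classification task described above can be sketched as a nearest-neighbour lookup over lexical vectors: each document becomes a bag-of-words vector, and a new document receives the class of its most similar labelled neighbour. This is a minimal sketch of that general technique, not the paper's bespoke model. The stop-word list, class labels, example texts, and function names are all hypothetical, and plain term counts stand in for whatever weighting the authors used.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "for", "is", "in", "on"}


def preprocess(text):
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(text, labelled_docs):
    """Return the label of the labelled document most similar to `text`."""
    query_vec = Counter(preprocess(text))
    best_label, best_sim = None, -1.0
    for doc_text, label in labelled_docs:
        sim = cosine(query_vec, Counter(preprocess(doc_text)))
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label


# Made-up examples with hypothetical class names.
TRAIN = [
    ("Discharge summary for the patient following surgery", "discharge_summary"),
    ("Referral letter to the cardiology department", "referral_letter"),
    ("Invoice for physiotherapy treatment sessions", "invoice"),
]

predicted = classify("Letter of referral to cardiology", TRAIN)

# Word counts before and after preprocessing for one document.
words_before = len(TRAIN[0][0].split())
words_after = len(preprocess(TRAIN[0][0]))
```

Comparing `words_before` with `words_after` per class gives the kind of before-and-after word count the study reports. Swapping the term counts for BM25-style weights, or for embeddings from a neural model, changes only how the vectors are built; the nearest-neighbour step stays the same.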

Results

The results showed that off-the-shelf semantic vector search may not always yield optimal results compared to bespoke lexical vector search models when classifying rigidly-structured medical documents. This suggests that traditional methods still hold value in certain contexts within information retrieval processes.

Conclusion

In conclusion, while neural methods have gained popularity in recent years, traditional methods still hold value in certain contexts within information retrieval processes. It is important to carefully consider the data and task at hand when choosing the most suitable approach for classification. The study highlights the importance of understanding the strengths and limitations of both traditional and modern techniques for text classification tasks. Further research could explore combining these approaches for even better performance on specific datasets or tasks.

Created on 08 Jul. 2025
