Comparing Lexical and Semantic Vector Search Methods When Classifying Medical Documents

AI-generated keywords: Classification

AI-generated Key Points

  • Text categorization is a common AI problem aiming to group data into distinct categories.
  • Popular solution: Using embeddings to transform text into numerical representations.
  • Recent advancements in vector search focus on optimizing speed and predictive accuracy by learning language semantics.
  • Traditional methods, such as the BM25 formula with its hyperparameters k1 and b, still hold merit in the information retrieval toolkit.
  • Popular neural methods for vector search include Word2Vec and Language Models, which excel at handling various input formats and storing learned knowledge efficiently.
  • Methodology: Data extracted from 1472 medical documents involving personal details and medical histories from 100 cases over a 4-month period. Task involved assigning one of seven classes to each document based on explicit filename content or visual artifacts present.
  • Ethical considerations were taken into account; data was stored on the company's cloud storage following ISO 12007 standards.
  • Results presented using RGB bar colors to distinguish among different embedding methods; observed variation in number of words before and after preprocessing across different document classes.
  • Study suggests that off-the-shelf semantic vector search may not always yield optimal results compared to bespoke lexical vector search models for classifying rigidly structured medical documents.
  • Conclusion: While neural methods are popular, traditional methods still hold value in certain contexts within information retrieval processes. Important to consider data and task when choosing the most suitable approach for classification.

Authors: Lee Harris

This project was funded by a UKRI grant, number: 10048265
License: CC BY-SA 4.0

Abstract: Classification is a common AI problem, and vector search is a typical solution. This transforms a given body of text into a numerical representation, known as an embedding, and modern improvements to vector search focus on optimising speed and predictive accuracy. This is often achieved through neural methods that aim to learn language semantics. However, our results suggest that these are not always the best solution. Our task was to classify rigidly-structured medical documents according to their content, and we found that using off-the-shelf semantic vector search produced slightly worse predictive accuracy than creating a bespoke lexical vector search model, and that it required significantly more time to execute. These findings suggest that traditional methods deserve to be contenders in the information retrieval toolkit, despite the prevalence and success of neural models.
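
As a rough illustration of the idea in the abstract, the sketch below turns each text into a sparse lexical vector and assigns a query the label of its most similar labelled document. It is a minimal example under stated assumptions, not the paper's method: the function names (`embed`, `cosine`, `classify`), the raw-count weighting, the cosine similarity measure, and the toy corpus and labels are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Map text to a sparse lexical vector: token -> raw count."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(token, 0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(query, labelled_docs):
    """Return the label of the labelled document most similar to the query."""
    query_vec = embed(query)
    best_label, _ = max(
        ((label, cosine(query_vec, embed(text))) for text, label in labelled_docs),
        key=lambda pair: pair[1],
    )
    return best_label


# Hypothetical toy corpus; the paper's documents and classes are not reproduced here.
labelled_docs = [
    ("patient referral letter from general practitioner", "referral"),
    ("invoice for treatment total amount due", "invoice"),
    ("blood test laboratory results report", "lab_report"),
]

print(classify("laboratory report with blood results", labelled_docs))
```

A semantic variant would keep `classify` unchanged and swap `embed` for a neural encoder that returns dense vectors, which is where the extra execution time reported in the abstract would be incurred.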

Submitted to arXiv on 16 May. 2025

Results of the summarizing process for the arXiv paper: 2505.11582v2

Classification is a common AI problem that aims to categorize data into distinct groups. One popular solution for this task is vector search, which transforms text into numerical representations known as embeddings. Recent advancements in vector search have focused on using neural methods to optimize speed and predictive accuracy by learning language semantics. Despite the success of these modern techniques, there is evidence that traditional methods still hold merit in the information retrieval toolkit. For instance, the BM25 formula, which has been used for text analysis, introduces two hyperparameters, k1 and b, and BM25+ adds a third, δ (default values of 1, 0.75, and 1).
Some popular neural methods used for vector search include Word2Vec and language models. These models excel at handling various input formats and have shown exceptional performance on a wide range of tasks. They also store learned knowledge in model parameters, making them highly efficient.
In terms of methodology, data was extracted from 1472 medical documents containing personal details and medical histories from 100 cases collected over a 4-month period. The task involved assigning one of seven classes to each document, based either on explicit filename content assigned during data collection or on visual artifacts present. Ethical considerations were taken into account during this process: document classification was used solely for internal company purposes, with no consequences to stakeholders if classification results were inaccurate. To ensure security, the data was stored following ISO 12007 standards on the company's cloud storage.
The results from the classification experiments were presented using RGB bar colors to distinguish among different embedding methods. It was observed that the number of words in document texts before and after preprocessing varied across different document classes.
Overall, the study suggests that off-the-shelf semantic vector search may not always yield optimal results compared to bespoke lexical vector search models when classifying rigidly-structured medical documents. In conclusion, while neural methods have gained popularity in recent years, traditional methods still hold value in certain contexts within information retrieval processes. It is important to carefully consider the data and task at hand when choosing the most suitable approach for classification.
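
The BM25+ weighting mentioned in the summary can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `bm25_plus_scores` is hypothetical, the IDF form log((N + 1) / df) is an assumption taken from the usual BM25+ formulation, and mapping the quoted defaults of 1, 0.75, and 1 to k1, b, and δ in that order is also an assumption. Documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter


def bm25_plus_scores(query, docs, k1=1.0, b=0.75, delta=1.0):
    """Score each tokenised document in docs against the tokenised query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs + 1) / doc_freq[term])  # assumed IDF form
            # b controls how strongly long documents are penalised.
            length_norm = 1 - b + b * len(d) / avg_len
            # k1 controls term-frequency saturation; delta is the BM25+ lower bound.
            score += idf * (tf * (k1 + 1) / (tf + k1 * length_norm) + delta)
        scores.append(score)
    return scores


# Hypothetical toy corpus for illustration only.
docs = [
    ["blood", "test", "results"],
    ["invoice", "amount", "due"],
    ["blood", "pressure", "chart", "blood"],
]
scores = bm25_plus_scores(["blood", "results"], docs)
print(scores)
```

In this toy corpus the first document matches both query terms and ranks highest, while the invoice matches neither and scores zero. A bespoke lexical model of the kind the study describes would tune k1, b, and δ to the document collection rather than rely on these defaults.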
Created on 08 Jul. 2025
