EnterpriseEM: Fine-tuned Embeddings for Enterprise Semantic Search

AI-generated keywords: enterprise data management, AI-driven information retrieval, pre-trained embedding models, fine-tuning methodology, enterprise-specific data

AI-generated Key Points

  • Challenge of handling proprietary unstructured data in enterprise data management
  • AI-driven solutions emerging to extract insights that answer employee inquiries effectively
  • Reliance on pre-trained embeddings and generative models in these solutions
  • Suboptimal alignment of pre-trained embeddings with unique characteristics of enterprise data
  • Proposal of fine-tuning pre-trained embedding models specifically for enterprise settings
  • Dataset used includes a diverse range of internal Infosys Ltd. data sources
  • Extraction process involving various file formats and specialized tools like PDFMiner and BeautifulSoup
  • Preprocessing steps such as masking PII, cleaning undesirable elements, and chunking cleaned data into relevant units for question generation
  • Findings suggest that fine-tuning embedding models can enhance information retrieval performance in enterprises

Authors: Kamalkumar Rathinasamy, Jayarama Nettar, Amit Kumar, Vishal Manchanda, Arun Vijayakumar, Ayush Kataria, Venkateshprasanna Manjunath, Chidambaram GS, Jaskirat Singh Sodhi, Shoeb Shaikh, Wasim Akhtar Khan, Prashant Singh, Tanishq Dattatray Ige, Vipin Tiwari, Rajab Ali Mondal, Harshini K, S Reka, Chetana Amancharla, Faiz ur Rahman, Harikrishnan P A, Indraneel Saha, Bhavya Tiwary, Navin Shankar Patel, Pradeep T S, Balaji A J, Priyapravas, Mohammed Rafee Tarafdar

License: CC BY 4.0

Abstract: Enterprises grapple with the significant challenge of managing proprietary unstructured data, hindering efficient information retrieval. This has led to the emergence of AI-driven information retrieval solutions, designed to adeptly extract relevant insights to address employee inquiries. These solutions often leverage pre-trained embedding models and generative models as foundational components. While pre-trained embeddings may exhibit proximity or disparity based on their original training objectives, they might not fully align with the unique characteristics of enterprise-specific data, leading to suboptimal alignment with the retrieval goals of enterprise environments. In this paper, we propose a methodology to fine-tune pre-trained embedding models specifically for enterprise environments. By adapting the embeddings to better suit the retrieval tasks prevalent in enterprises, we aim to enhance the performance of information retrieval solutions. We discuss the process of fine-tuning, its effect on retrieval accuracy, and the potential benefits for enterprise information management. Our findings demonstrate the efficacy of fine-tuned embedding models in improving the precision and relevance of search results in enterprise settings.

Submitted to arXiv on 18 May 2024

Results of the summarizing process for the arXiv paper: 2406.00010v1

In enterprise data management, handling proprietary unstructured data has become a significant obstacle to efficient information retrieval. To address this issue, AI-driven information retrieval solutions have emerged, aiming to extract pertinent insights that answer employee inquiries effectively. These solutions often rely on pre-trained embedding models and generative models as foundational components. However, while pre-trained embeddings may demonstrate proximity or disparity based on their original training objectives, they may not align perfectly with the unique characteristics of enterprise-specific data. This can lead to suboptimal alignment with the retrieval goals of enterprise environments. In response, this paper proposes a fine-tuning methodology that adapts pre-trained embedding models specifically to enterprise settings.

The dataset used for this study includes a diverse range of internal Infosys Ltd. data such as technical course contents, internal knowledge base articles, standard operating procedures for technical tasks, a repository of internal technical queries with resolutions, sales data, and employee blogs. Text data was extracted from various formats including PDFs, MS Word documents, Excel sheets, PowerPoint presentations, and web pages. The extraction process used LangChain's document loaders and specialized tools like PDFMiner and BeautifulSoup to extract data from the different file formats.

The collected data then underwent preprocessing steps, including masking personally identifiable information (PII) and cleaning undesirable elements like XML tags and non-ASCII characters using Python libraries like lxml and clean-text. Furthermore, the chunking process segmented the cleaned data into contextually relevant units suitable for synthetic question generation. This involved denormalizing structured data into plain text key-value pairs and separating contextually independent paragraphs from unstructured text streams.
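The preprocessing and chunking steps described above can be illustrated with a minimal sketch. It uses only the Python standard library as a stand-in for lxml and clean-text, which the summary names but does not show in use. The regular expressions, placeholder tokens, and function names are all hypothetical, since the summary does not say how PII was detected or how chunk boundaries were chosen.

```python
import re

# Hypothetical patterns: the summary does not describe the PII detection rules.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")
TAG_RE = re.compile(r"<[^>]+>")


def mask_pii(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def strip_markup_and_non_ascii(text):
    """Drop XML/HTML tags and non-ASCII characters (regex stand-in for lxml/clean-text)."""
    text = TAG_RE.sub(" ", text)
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of spaces and tabs but keep newlines, which mark paragraphs.
    return re.sub(r"[ \t]+", " ", text).strip()


def denormalize_row(row):
    """Turn one structured record into plain-text key-value lines."""
    return "\n".join(f"{key}: {value}" for key, value in row.items())


def split_paragraphs(text):
    """Split an unstructured text stream on blank lines into candidate chunks."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]


if __name__ == "__main__":
    raw = (
        "<p>Contact jane.doe@example.com or +91 98765 43210.</p>\n\n"
        "<p>Reset the VPN token \u2013 then retry.</p>"
    )
    for chunk in split_paragraphs(strip_markup_and_non_ascii(mask_pii(raw))):
        print(repr(chunk))
    print(denormalize_row({"ticket": "T-1", "status": "resolved"}))
```

Masking runs before cleaning here so that the placeholders survive tag removal; the order used in the paper is not stated in this summary.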
Overall, the findings suggest that fine-tuning embedding models for enterprise environments can enhance information retrieval performance by improving precision and relevance in search results. By adapting embeddings to better suit retrieval tasks prevalent in enterprises, this methodology holds promise for enhancing enterprise information management.
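To make the notion of retrieval relevance concrete, the sketch below ranks chunks by cosine similarity between a query vector and chunk vectors, then measures how often the relevant chunk lands in the top k results. This is a generic illustration, not the evaluation protocol of the paper. The toy two-dimensional vectors, the chunk identifiers, and the `hit_rate_at_k` metric are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Return (chunk_id, score) pairs sorted from most to least similar."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def hit_rate_at_k(queries, chunk_vecs, k):
    """Fraction of queries whose relevant chunk appears in the top k results."""
    if not queries:
        return 0.0
    hits = 0
    for query_vec, relevant_id in queries:
        top_ids = [cid for cid, _ in rank_chunks(query_vec, chunk_vecs)[:k]]
        hits += relevant_id in top_ids
    return hits / len(queries)


if __name__ == "__main__":
    # Toy embeddings; a real system would obtain these from an embedding model.
    chunk_vecs = {"vpn_reset": [1.0, 0.0], "leave_policy": [0.0, 1.0]}
    queries = [([0.9, 0.1], "vpn_reset"), ([0.8, 0.2], "leave_policy")]
    print(rank_chunks([0.9, 0.1], chunk_vecs))
    print(hit_rate_at_k(queries, chunk_vecs, k=1))
```

Under this framing, fine-tuning aims to move query vectors closer to the vectors of the chunks that answer them, which would raise a metric of this kind when computed on the same query set before and after tuning.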
Created on 24 Sep. 2024
