Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark

AI-generated keywords: Query plan recommendation, Apache Hadoop, Apache Spark, MapReduce, Distributed computing

AI-generated Key Points

  • Study focuses on evaluating query plan recommendation performance using Apache Hadoop and Apache Spark
  • Approach involves executing new queries based on previously created query execution plans (QEPs) and clustering the query space for optimization
  • Researchers leveraged MapReduce distributed computing model to address time-consuming nature of traditional clustering algorithms for large datasets
  • Author contributions covered software development, validation, resource management, data curation, original draft preparation, and review and editing
  • Mapper and Reducer classes compute the total number of terms in each query and derive feature weights from term frequencies
  • Similarity measurement using MapReduce discussed to assess similarities between queries efficiently
  • Parallel query clustering significantly enhances scalability in query optimization processes
  • Apache Spark outperformed Apache Hadoop in performance metrics with an average speedup of 2x
  • Research highlights benefits of leveraging distributed computing frameworks like Apache Spark and Apache Hadoop for effective query execution plan optimization

Authors: Elham Azhir, Mehdi Hosseinzadeh, Faheem Khan, Amir Mosavi

11 pages, 4 figures
License: CC BY 4.0

Abstract: Access plan recommendation is a query optimization approach that executes new queries using previously created query execution plans (QEPs). In this method, the query optimizer divides the query space into clusters. However, traditional clustering algorithms take a significant amount of execution time when clustering such large datasets. The MapReduce distributed computing model provides efficient solutions for storing and processing vast quantities of data. In the present investigation, the Apache Spark and Apache Hadoop frameworks are used to cluster query datasets of different sizes in the MapReduce-based access plan recommendation method. Performance is evaluated based on execution time. The experimental results demonstrate the effectiveness of parallel query clustering in achieving high scalability. Furthermore, Apache Spark achieved better performance than Apache Hadoop, reaching an average speedup of 2x.

Submitted to arXiv on 17 Sep. 2022

Results of the summarizing process for the arXiv paper: 2210.07143v1

The study evaluates the performance of query plan recommendation using Apache Hadoop and Apache Spark. The approach executes new queries using previously created query execution plans (QEPs) and clusters the query space for optimization. Because traditional clustering algorithms are time-consuming on large datasets, the researchers used the MapReduce distributed computing model, clustering query datasets of various sizes with both Apache Spark and Apache Hadoop within the MapReduce-based access plan recommendation method.

The paper describes Mapper and Reducer classes that calculate the total number of terms in each query and determine a weight for each feature from term frequencies. It also discusses a MapReduce-based similarity measurement for assessing similarities between queries efficiently. Performance was evaluated based on execution time. Parallel query clustering was found to achieve high scalability, and Apache Spark outperformed Apache Hadoop with an average speedup of 2x. The research highlights the potential benefits of distributed computing frameworks such as Apache Spark and Apache Hadoop for query execution plan optimization.

The authors' contributions covered software development, validation, resource management, data curation, original draft preparation, and review and editing. The research received no external funding, the authors declare no conflicts of interest, and all authors approved the final version of the manuscript for publication.
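The term-frequency and similarity steps described above can be illustrated with a small, single-process sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and the choice of cosine similarity are all assumptions made for illustration, and plain Python functions stand in for the Hadoop or Spark Mapper and Reducer classes.

```python
import math
from collections import defaultdict
from itertools import combinations


def map_terms(query_id, query_text):
    """Mapper (hypothetical): emit ((query_id, term), 1) for every term."""
    # Assumption: terms are whitespace-separated, lowercased tokens.
    for term in query_text.lower().split():
        yield (query_id, term), 1


def reduce_counts(pairs):
    """Shuffle and reduce: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def term_frequencies(queries):
    """Return {query_id: {term: tf}} with tf = term count / total terms."""
    emitted = [pair for qid, text in queries.items()
               for pair in map_terms(qid, text)]
    counts = reduce_counts(emitted)
    # Second reduce: total number of terms in each query.
    totals = reduce_counts((qid, n) for (qid, _term), n in counts.items())
    tf = defaultdict(dict)
    for (qid, term), n in counts.items():
        tf[qid][term] = n / totals[qid]
    return dict(tf)


def cosine_similarity(a, b):
    """Cosine similarity of two sparse term-weight vectors (an assumption;
    the summary does not name the similarity measure)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(tf):
    """Compute the similarity of every unordered pair of queries."""
    return {(q1, q2): cosine_similarity(tf[q1], tf[q2])
            for q1, q2 in combinations(sorted(tf), 2)}
```

In a real deployment the map and reduce steps would run as distributed jobs over partitions of the query dataset; the sketch only shows the data flow from raw query text to per-query term weights and then to pairwise similarities, which a clustering step could consume.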
Created on 26 Oct. 2024


