Large scale link based latent Dirichlet allocation for web document classification

AI-generated keywords: Latent Dirichlet Allocation, Web Document Classification, Influence Model, Gibbs Samplers, Boosted LDA

AI-generated Key Points

The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.

  • Authors explore the applicability of latent Dirichlet allocation (LDA) for classifying large collections of web documents
  • Introduce a novel influence model that considers the linkage between documents, allowing topics to propagate along links
  • Develop LDA-specific boosting of Gibbs samplers, resulting in significant speedup in experiments
  • The inferred LDA model can be used for classification as a form of dimensionality reduction, similarly to latent semantic indexing, and yields link weights for processing the web graph
  • Deploy LDA link weights in stacked graphical learning as an example
  • Achieve a 4% improvement over plain LDA with BayesNet and an 18% improvement over tf-idf with SVM in terms of AUC of classification
  • Gibbs sampling strategies yield about a 5-10 times speedup with less than a 1% decrease in accuracy, measured by likelihood and AUC of classification
  • Overall, the paper explores the use of LDA for classifying large web document collections and reports better classification quality than baselines such as tf-idf with SVM
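
The key points above rest on Gibbs sampling for LDA and on using the inferred topic proportions as low-dimensional features. As a rough illustration only, here is a minimal sketch of a plain collapsed Gibbs sampler for standard LDA. It is not the authors' influence model and contains none of their speedup strategies, which are not described in the metadata. The function name `lda_gibbs`, its parameters, and the default hyperparameters are assumptions made for this example.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Plain collapsed Gibbs sampling for standard LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, n_dk): smoothed document-topic proportions and raw counts.
    """
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    n_dk = [[0] * num_topics for _ in docs]
    n_kw = [[0] * vocab_size for _ in range(num_topics)]
    n_k = [0] * num_topics

    # Assign every token a random initial topic and record it in the counts.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Full conditional for each topic, up to a normalizing constant.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Smoothed topic proportions per document: the reduced feature vectors.
    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return theta, n_dk
```

Each row of `theta` has `num_topics` entries regardless of vocabulary size, which is the sense in which the topic model acts as dimensionality reduction; a downstream classifier would take these rows as input features.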

Authors: István Bíró, Jácint Szabó

16 pages

Abstract: In this paper we demonstrate the applicability of latent Dirichlet allocation (LDA) for classifying large Web document collections. One of our main results is a novel influence model that gives a fully generative model of the document content taking linkage into account. In our setup, topics propagate along links in such a way that linked documents directly influence the words in the linking document. As another main contribution we develop LDA specific boosting of Gibbs samplers resulting in a significant speedup in our experiments. The inferred LDA model can be applied for classification as dimensionality reduction similarly to latent semantic indexing. In addition, the model yields link weights that can be applied in algorithms to process the Web graph; as an example we deploy LDA link weights in stacked graphical learning. By using Weka's BayesNet classifier, in terms of the AUC of classification, we achieve 4% improvement over plain LDA with BayesNet and 18% over tf.idf with SVM. Our Gibbs sampling strategies yield about 5-10 times speedup with less than 1% decrease in accuracy in terms of likelihood and AUC of classification.
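
The abstract reports its classification results in terms of AUC. For readers unfamiliar with the metric, the following sketch computes AUC with the standard rank-based (Mann-Whitney) formulation, where tied scores share an average rank. The function name `auc` and the 0/1 label convention are assumptions for this example; the abstract does not say how the authors computed the metric beyond using Weka.

```python
def auc(scores, labels):
    """Area under the ROC curve via rank sums; labels are 1 (positive) or 0 (negative)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        # Every member of a tied block gets the block's average 1-based rank.
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    # Mann-Whitney U statistic for the positives, normalized to [0, 1].
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

The value can be read as the probability that a randomly chosen positive document receives a higher score than a randomly chosen negative one, with ties counted as one half.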

Submitted to arXiv on 25 Jun. 2010


Results of the summarizing process for the arXiv paper: 1006.4953v1


In the paper titled "Large scale link based latent Dirichlet allocation for web document classification," authors István Bíró and Jácint Szabó explore the applicability of latent Dirichlet allocation (LDA) for classifying large collections of web documents. The authors introduce a novel influence model that takes the linkage between documents into account, giving a fully generative model of the document content. In this setup, topics propagate along links, so linked documents directly influence the words in the linking document.

The authors also develop LDA-specific boosting of Gibbs samplers, which yields a significant speedup in their experiments. The inferred LDA model can be applied for classification as dimensionality reduction, similarly to latent semantic indexing. The model also yields link weights that can be used in algorithms that process the web graph; as an example, the authors deploy LDA link weights in stacked graphical learning.

In the evaluation, using Weka's BayesNet classifier, the authors report a 4% improvement over plain LDA with BayesNet and an 18% improvement over tf-idf with SVM in terms of the AUC of classification. Their Gibbs sampling strategies give about a 5-10 times speedup with less than a 1% decrease in accuracy, measured by likelihood and AUC of classification. Overall, the influence model improves classification quality over baselines such as tf-idf with SVM, and the boosted Gibbs samplers make inference considerably faster.
Created on 08 Sep. 2023
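
The summary above mentions deploying link weights in stacked graphical learning. In its usual form, stacked graphical learning augments each document's features with an aggregate of a first-stage classifier's predictions on linked documents. The sketch below shows one such aggregate, a link-weighted average of neighbor scores. The function name `neighbor_feature`, the dictionary layout, and the fallback for unlinked documents are assumptions for illustration; how the paper derives its LDA link weights and which aggregation it uses cannot be determined from the metadata.

```python
def neighbor_feature(base_scores, weighted_links):
    """Link-weighted average of neighbors' first-stage scores for each document.

    base_scores: dict mapping doc id -> score from a first-stage classifier.
    weighted_links: dict mapping doc id -> list of (neighbor id, weight) pairs.
    Documents with no usable links fall back to their own base score.
    """
    feature = {}
    for doc, score in base_scores.items():
        total = 0.0
        weight_sum = 0.0
        for neighbor, weight in weighted_links.get(doc, []):
            # Skip neighbors without a score and links with non-positive weight.
            if neighbor in base_scores and weight > 0:
                total += weight * base_scores[neighbor]
                weight_sum += weight
        feature[doc] = total / weight_sum if weight_sum > 0 else score
    return feature
```

A second-stage classifier would then be trained on the original features plus this extra column, so that strongly weighted links pull a document's prediction toward those of its neighbors.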

