Large scale link based latent Dirichlet allocation for web document classification
AI-generated Key Points
⚠The license of the paper does not allow us to build upon its content and the key points are generated using the paper metadata rather than the full article.
- Authors explore the applicability of latent Dirichlet allocation (LDA) for classifying large collections of web documents
- Introduce a novel influence model that considers the linkage between documents, allowing topics to propagate along links
- Develop LDA-specific boosting of Gibbs samplers, resulting in significant speedup in experiments
- The inferred LDA model can be used for classification as a dimensionality reduction step, similarly to latent semantic indexing, and also yields link weights for processing the web graph
- Deploy LDA link weights in stacked graphical learning as an example
- Using Weka's BayesNet classifier, achieve a 4% improvement over plain LDA with BayesNet and an 18% improvement over tf-idf with SVM in terms of AUC of classification
- Gibbs sampling strategies yield about 5-10 times speedup with less than 1% decrease in accuracy, measured by likelihood and AUC of classification
- Overall, the paper shows that link-aware LDA can classify large web document collections more accurately than plain LDA or tf-idf with SVM, and that the boosted Gibbs samplers make inference several times faster.
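The AUC figures in the key points measure how well a classifier ranks positive documents above negative ones. As a minimal sketch of that metric (not the authors' evaluation code; the function name `auc` and the toy labels and scores are illustrative assumptions), AUC can be computed as the fraction of positive/negative pairs that are ranked correctly, counting ties as half:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a correct pair
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): one positive is ranked below one negative.
toy_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

This pairwise form is quadratic in the number of examples, so it is only suitable for illustration; a sort-based computation would be the usual choice for large collections.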
Authors: István Bíró, Jácint Szabó
Abstract: In this paper we demonstrate the applicability of latent Dirichlet allocation (LDA) for classifying large Web document collections. One of our main results is a novel influence model that gives a fully generative model of the document content taking linkage into account. In our setup, topics propagate along links in such a way that linked documents directly influence the words in the linking document. As another main contribution we develop LDA specific boosting of Gibbs samplers resulting in a significant speedup in our experiments. The inferred LDA model can be applied for classification as dimensionality reduction similarly to latent semantic indexing. In addition, the model yields link weights that can be applied in algorithms to process the Web graph; as an example we deploy LDA link weights in stacked graphical learning. By using Weka's BayesNet classifier, in terms of the AUC of classification, we achieve 4% improvement over plain LDA with BayesNet and 18% over tf.idf with SVM. Our Gibbs sampling strategies yield about 5-10 times speedup with less than 1% decrease in accuracy in terms of likelihood and AUC of classification.
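To make the abstract's building blocks concrete, the following is a minimal sketch of collapsed Gibbs sampling for plain LDA, assuming documents are given as lists of integer word ids. It does not implement the paper's linked influence model or its sampler speedups, which cannot be reconstructed from the abstract alone. The function name `lda_gibbs`, the hyperparameter defaults, and the toy corpus are all illustrative assumptions. The returned document-topic distributions are the kind of low-dimensional features that could then be passed to a classifier.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (no link model, no speedups)."""
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Random initial topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(z_doc)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized full conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed document-topic distributions: the reduced-dimension features.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta


# Toy corpus (hypothetical): word ids 0-1 and 2-3 tend to co-occur.
docs = [[0, 1, 0, 1, 0], [1, 0, 1, 1], [2, 3, 2, 3, 3], [3, 2, 2, 3]]
theta = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

Each row of `theta` is a probability distribution over topics for one document, so a collection with a large vocabulary is reduced to `num_topics` features per document.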