Generating Cyber Threat Intelligence to Discover Potential Security Threats Using Classification and Topic Modeling
AI-generated Key Points
- Authors explore the use of Cyber Threat Intelligence (CTI) from hacker forums to detect security threats
- Two datasets constructed: binary dataset and multi-class dataset
- Supervised learning techniques used, including classification algorithms and deep neural network-based classifiers
- Unsupervised techniques employed: topic modeling with Latent Dirichlet Allocation (LDA) on term-frequency weights and Non-negative Matrix Factorization (NMF) on TF-IDF weights
- Data cleaned with regular expressions and lemmatized with the spaCy library
- Open-source Python libraries utilized, including scikit-learn, Keras, and gensim
- Feature engineering techniques discussed, including bag-of-words (BOW), TF-IDF-based weights, Word2Vec, and Doc2Vec
- Classification tasks performed to separate cybersecurity-relevant posts from non-security posts
- Comparison of different classifiers' performances on datasets, including deep neural network-based classifiers
- Focus on utilizing CTI from hacker forums for threat detection through supervised learning (classification) and unsupervised learning (topic modeling)
- Experimental results provide insight into the effectiveness of the different techniques for analyzing and predicting cyber threats
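The classification step described in the key points (separating security-relevant posts from non-security posts) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the toy posts, the labels, the `clean` helper, and the choice of logistic regression are all assumptions made for the example, and spaCy lemmatization is left out to keep the sketch dependency-light.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def clean(post: str) -> str:
    """Hypothetical regex-only cleanup: lowercase, drop URLs, keep letters."""
    post = re.sub(r"https?://\S+", " ", post.lower())
    post = re.sub(r"[^a-z\s]", " ", post)
    return re.sub(r"\s+", " ", post).strip()


# Made-up forum posts; 1 = security-relevant, 0 = not relevant.
posts = [
    "New SQL injection exploit for the login form, dump at http://example.com/dump",
    "Selling zero-day exploit and ransomware builder",
    "Leaked password database with 10k credentials",
    "Anyone watching the football match tonight?",
    "Best pizza recipe for the weekend",
    "Looking for movie recommendations",
]
labels = [1, 1, 1, 0, 0, 0]

# TF-IDF features feed a linear classifier; `clean` replaces the default
# preprocessing step of the vectorizer.
model = make_pipeline(
    TfidfVectorizer(preprocessor=clean),
    LogisticRegression(max_iter=1000),
)
model.fit(posts, labels)

new_posts = ["Fresh exploit for a password leak", "Great recipe for movie night"]
predictions = model.predict(new_posts)
print(list(zip(new_posts, predictions)))
```

Swapping `LogisticRegression` for another scikit-learn estimator is one way to compare several classifiers on the same features, which is the kind of comparison the key points describe.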
Authors: Md Imran Hossen, Ashraful Islam, Farzana Anowar, Eshtiak Ahmed, Mohammad Masudur Rahman, Xiali (Sharon) Hei
Abstract: Due to the variety of cyber-attacks or threats, the cybersecurity community enhances the traditional security control mechanisms to an advanced level so that automated tools can counter potential security threats. Very recently, Cyber Threat Intelligence (CTI) has been presented as one of the proactive and robust mechanisms because of its automated cybersecurity threat prediction. Generally, CTI collects and analyzes data from various sources, e.g., online security forums and social media, where cyber enthusiasts, analysts, and even cybercriminals discuss cyber or computer security-related topics, and discovers potential threats based on the analysis. As the manual analysis of every such discussion (posts on online platforms) is time-consuming, inefficient, and susceptible to errors, CTI as an automated tool can perform uniquely to detect cyber threats. In this paper, we identify and explore relevant CTI from hacker forums utilizing different supervised (classification) and unsupervised learning (topic modeling) techniques. To this end, we collect data from a real hacker forum and construct two datasets: a binary dataset and a multi-class dataset. We then apply several classifiers along with deep neural network-based classifiers and use them on the datasets to compare their performances. We also employ the classifiers on a labeled leaked dataset as our ground truth. We further explore the datasets using unsupervised techniques. For this purpose, we leverage two topic modeling algorithms, namely Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).
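The two topic modeling algorithms named in the abstract can be illustrated with a short sketch: LDA fitted on raw term counts and NMF fitted on TF-IDF weights. Everything specific here is an assumption for illustration: the toy documents, the number of topics, the `top_terms` helper, and the use of scikit-learn's implementations rather than whichever library the authors used for this step.

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up posts standing in for a hacker-forum corpus.
docs = [
    "ransomware builder encrypts files and demands bitcoin payment",
    "new ransomware strain spreads through phishing email attachments",
    "malware loader drops ransomware payload after phishing click",
    "leaked database contains email and password credentials",
    "credential stuffing with leaked password lists against login pages",
    "selling combo list of email password pairs from database breach",
]


def top_terms(model, vectorizer, n=3):
    """Return the n highest-weighted terms for each topic (hypothetical helper)."""
    # Order vocabulary terms by their column index in the document-term matrix.
    terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return [
        [terms[i] for i in topic.argsort()[::-1][:n]]
        for topic in model.components_
    ]


# LDA on term-frequency (count) features.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# NMF on TF-IDF features.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500).fit(tfidf)

lda_topics = top_terms(lda, count_vec)
nmf_topics = top_terms(nmf, tfidf_vec)
print("LDA topics:", lda_topics)
print("NMF topics:", nmf_topics)
```

On a corpus this small the topics are not meaningful; the point is only the pairing of each algorithm with its input weighting.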