Generating Cyber Threat Intelligence to Discover Potential Security Threats Using Classification and Topic Modeling

AI-generated keywords: Cyber Threat Intelligence (CTI), Supervised Learning, Unsupervised Learning, Feature Engineering, Classification

AI-generated Key Points

  • Authors explore the use of Cyber Threat Intelligence (CTI) from hacker forums to detect security threats
  • Two datasets constructed: binary dataset and multi-class dataset
  • Supervised learning techniques used, including classification algorithms and deep neural network-based classifiers
  • Unsupervised techniques employed, specifically topic modeling algorithms like Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF)
  • Term-frequency (count) weights used as features for LDA; TF-IDF weights used for NMF
  • Data processed using regular expressions and lemmatization with spaCy library
  • Open-source Python libraries utilized, including scikit-learn, Keras, and gensim, along with the Word2Vec model
  • Feature engineering techniques discussed, including bag-of-words (BOW), TF-IDF-based weights, Word2Vec, and Doc2Vec
  • Classification tasks performed to separate cybersecurity-relevant posts from non-security posts
  • Comparison of different classifiers' performances on datasets, including deep neural network-based classifiers
  • Focus on utilizing CTI from hacker forums for threat detection through supervised learning (classification) and unsupervised learning (topic modeling)
  • Experimental results provide insights into the effectiveness of different techniques in analyzing and predicting cyber threats
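
The preprocessing and feature-engineering points above can be sketched in a few lines. This is a minimal, standard-library-only illustration, not the authors' pipeline: the sample posts, the `clean` and `tfidf` helpers, and the particular IDF variant are all assumptions, and the spaCy lemmatization step mentioned in the key points is omitted to keep the example self-contained.

```python
import math
import re
from collections import Counter

# Hypothetical example posts; NOT taken from the paper's dataset.
POSTS = [
    "Selling fresh RAT builds, see http://example.com/rat for details!",
    "New SQL injection exploit for the login form, exploit code inside",
    "Anyone want to trade game accounts? Cheap game keys here",
]


def clean(post):
    """Lowercase the post, drop URLs, and keep alphabetic tokens only."""
    post = re.sub(r"https?://\S+", " ", post.lower())
    return re.findall(r"[a-z]+", post)


def bag_of_words(tokens):
    """Raw term counts (bag-of-words) for one document."""
    return Counter(tokens)


def tfidf(docs):
    """TF-IDF weights per document, using idf = ln(N / df) + 1.

    This is one common IDF variant, chosen here as an assumption.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: c * (math.log(n / df[t]) + 1.0) for t, c in counts.items()}
        )
    return weights


docs = [clean(p) for p in POSTS]
bow = [bag_of_words(d) for d in docs]
weights = tfidf(docs)
```

Count-based features like `bow` are the kind of input the key points associate with LDA, while TF-IDF weights like `weights` are the kind associated with NMF. Terms that appear in fewer posts receive a larger IDF factor.
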

Authors: Md Imran Hossen, Ashraful Islam, Farzana Anowar, Eshtiak Ahmed, Mohammad Masudur Rahman, Xiali (Sharon) Hei

License: CC BY 4.0

Abstract: Due to the variety of cyber-attacks or threats, the cybersecurity community enhances the traditional security control mechanisms to an advanced level so that automated tools can counter potential security threats. Very recently, Cyber Threat Intelligence (CTI) has been presented as one of the proactive and robust mechanisms because of its automated cybersecurity threat prediction. Generally, CTI collects and analyses data from various sources (e.g., online security forums and social media) where cyber enthusiasts, analysts, and even cybercriminals discuss cyber or computer security-related topics, and discovers potential threats based on the analysis. As the manual analysis of every such discussion (posts on online platforms) is time-consuming, inefficient, and susceptible to errors, CTI as an automated tool can perform uniquely to detect cyber threats. In this paper, we identify and explore relevant CTI from hacker forums utilizing different supervised (classification) and unsupervised learning (topic modeling) techniques. To this end, we collect data from a real hacker forum and construct two datasets: a binary dataset and a multi-class dataset. We then apply several classifiers along with deep neural network-based classifiers and use them on the datasets to compare their performances. We also employ the classifiers on a labeled leaked dataset as our ground truth. We further explore the datasets using unsupervised techniques. For this purpose, we leverage two topic modeling algorithms, namely Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF).

Submitted to arXiv on 16 Aug. 2021

Results of the summarizing process for the arXiv paper: 2108.06862v3

In this paper, the authors explore the use of Cyber Threat Intelligence (CTI) from hacker forums to detect potential security threats. They collect data from a real hacker forum and construct two datasets: a binary dataset and a multi-class dataset. To analyze the data, they employ supervised learning techniques such as classification algorithms and deep neural network-based classifiers. They also utilize unsupervised techniques, specifically topic modeling algorithms like Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), to further explore the datasets.

Topic modeling is an unsupervised learning approach that identifies the topics present in a set of documents, which makes it a powerful tool for finding latent topics in large unlabeled datasets. The authors use the LDA and NMF algorithms for topic modeling in this study. They feed term-frequency weights to LDA as features, while TF-IDF weights are used for the NMF algorithm.

The experimental setup involves processing the data using regular expressions and lemmatization with the spaCy library. Open-source Python libraries such as scikit-learn, Keras, and gensim are utilized for various tasks including classifier development, model training, and topic modeling.

The authors also discuss the feature engineering techniques used in their study. They employ standard techniques like bag-of-words (BOW) and term frequency-inverse document frequency (TF-IDF)-based weights. Additionally, they experiment with more advanced feature engineering techniques based on neural embedding models such as Word2Vec and Doc2Vec.

In terms of supervised methods, the authors perform classification tasks to accurately separate cybersecurity-relevant posts from non-security posts. They compare the performances of different classifiers on their datasets, including deep neural network-based classifiers.

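
To make the binary classification task concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over bag-of-words counts. The summary does not say which classifiers the authors used, so Naive Bayes is an assumption chosen for brevity. The training posts, labels, and the `NaiveBayes` class are hypothetical and written from scratch rather than taken from scikit-learn or Keras.

```python
import math
from collections import Counter

# Hypothetical labeled posts (1 = security-relevant, 0 = not); illustrative only.
TRAIN = [
    ("new exploit for sql injection on login page", 1),
    ("selling malware builder with keylogger", 1),
    ("ransomware source code leaked exploit included", 1),
    ("looking for cheap game keys", 0),
    ("best movies to watch this weekend", 0),
    ("trade game accounts and music playlists", 0),
]


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with Laplace (add-one) smoothing."""

    def fit(self, samples):
        self.class_counts = Counter(label for _, label in samples)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in samples:
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the known words in text."""
        total = sum(self.word_counts[label].values())
        n_samples = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_samples)
        for word in text.split():
            if word in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text):
        """Return the label with the highest log score."""
        return max(self.class_counts, key=lambda label: self.log_score(text, label))


model = NaiveBayes().fit(TRAIN)
```

Calling `model.predict("exploit for login page")` scores the post under each class and returns the more likely label. A multi-class dataset would work the same way with more than two label values.
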
Overall, this paper focuses on utilizing CTI from hacker forums to detect potential security threats through both supervised learning (classification) and unsupervised learning (topic modeling) approaches. The experimental results and methodologies provide insights into the effectiveness of different techniques in analyzing and predicting cyber threats.
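
For the unsupervised side, the idea behind NMF topic modeling can be sketched as factoring a document-term weight matrix V into two non-negative matrices, W (documents by topics) and H (topics by terms). The sketch below uses the classic multiplicative update rules in plain Python. It is an assumption-laden illustration: the authors relied on existing libraries, and the matrix `V`, the `TERMS` list, and all helper names here are hypothetical. LDA is not shown.

```python
import random

# Hypothetical vocabulary and TF-IDF-like weights (rows = posts, columns = terms).
TERMS = ["exploit", "malware", "sql", "game", "music", "movie"]
V = [
    [2.0, 1.0, 1.5, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 1.0, 1.5],
    [0.0, 0.0, 0.0, 1.0, 2.0, 0.5],
]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor v into w (docs x k) and h (k x terms) by multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h


def top_terms(h, terms, n=3):
    """Highest-weighted terms for each topic row of h."""
    return [
        [terms[j] for j in sorted(range(len(terms)), key=lambda j: -row[j])[:n]]
        for row in h
    ]


w, h = nmf(V, k=2)
topics = top_terms(h, TERMS)
```

Each row of `h` is a topic expressed as term weights, and `top_terms` lists the strongest terms per topic. On a toy matrix like this one, each topic would typically concentrate on one group of terms, although the outcome depends on the random initialization.
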
Created on 27 Oct. 2023
