Research checkpoint #02: first implementations
In the last post, I reported on my studies of feature engineering techniques and defined some steps to keep preparing myself to evaluate the datasets considered for this project. However, following suggestions from professor Daniel, we decided to start taking a closer look at CIC-IoT-2023, a recent dataset containing traffic records from an IoT network under a large number of attacks, aiming to understand whether what I have been studying can overcome the issues we observed in it. So, throughout this post, I will describe what I have done in the last weeks and the motivations for doing so.
Materials studied
During this time studying how to manipulate data, feature engineering has been my main topic of concern, and feature selection in particular is a subject I needed to delve deeper into, as we will probably use it regardless of the dataset we choose to work with. In the last post, I cited the article “An Introduction to Variable and Feature Selection”, by Guyon and Elisseeff, as a good reference on the topic, and in the last weeks I took some time to read it, taking notes of the key concepts.
Variable and feature selection offers potentially great benefits:
- Improve the understanding and visualization of the data and the comprehension of the process that generated it.
- Improve model training and prediction performance.
- Reduce the storage and computational requirements to deal with the dataset.
To achieve this, the article explores some techniques for variable subset selection:
Wrapper methods: use the learning machine as a black box and score different feature subsets according to a chosen criterion. An exhaustive search over all subsets is usually computationally infeasible, but there exist search strategies that can be used as heuristics.
Embedded methods: methods that perform variable selection as part of the training process, depending more on the chosen predictor model.
Filters: a fast option that can consider many criteria, such as variable ranking:
- Variable ranking: aiming to select a subset of the features to work with, variable ranking mechanisms proved to be a great starting point, as they are simple yet effective. Ranking is done in the preprocessing phase, independently of the choice of the predictor, and several criteria can be used: correlation methods (particularly the Pearson correlation coefficient), the individual predictive power of each feature, or even the joint probability density of the variables. It’s also worth noting that it’s hard to classify a variable as redundant or useless without considering the others; the article gives good examples of how variable dependencies cannot be ignored. A minimal sketch of correlation-based ranking follows this list.
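Just to make the idea concrete, this is roughly what a Pearson-correlation ranking could look like in pandas. The file name and the “label” column are hypothetical placeholders, not a reference to any specific dataset:

```python
import pandas as pd

# Hypothetical input: a CSV whose "label" column is the binary target
# and whose remaining columns are numeric features.
df = pd.read_csv("dataset.csv")
features = df.drop(columns=["label"])
target = df["label"]

# Filter criterion: Pearson correlation of each feature with the target,
# computed independently of any predictor.
ranking = (
    features.corrwith(target)      # Pearson correlation coefficient per feature
    .abs()                         # only the strength of the association matters
    .sort_values(ascending=False)  # highest-ranked features first
)
print(ranking.head(10))
```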
The article also considers feature construction techniques:
Clustering: replace a group of “similar” features by a cluster centroid.
Singular Value Decomposition (SVD): forms a set of features that are linear combinations of the original variables; it is widely used to reduce the dimensionality of the model (see the sketch after this list).
Other supervised and unsupervised methods discussed throughout the article.
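As an illustration of the SVD idea, the sketch below uses scikit-learn’s TruncatedSVD on the same hypothetical CSV as before; the number of components is an arbitrary choice, not a recommendation:

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

# Same hypothetical CSV and "label" column as in the previous sketch.
df = pd.read_csv("dataset.csv")
features = df.drop(columns=["label"])

# Standardize, then build 10 new features as linear combinations of the
# original variables (10 is an arbitrary choice for illustration).
scaled = StandardScaler().fit_transform(features)
svd = TruncatedSVD(n_components=10, random_state=42)
constructed = svd.fit_transform(scaled)

# Fraction of the variance retained by each constructed feature.
print(svd.explained_variance_ratio_)
```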
It’s important to design appropriate data representations, and beyond the objective methods cited above, this stage is also a great opportunity to incorporate domain knowledge into the data.
Finally, validation methods are also considered, to help with model selection and with the evaluation of the final performance. The article mentions statistical comparisons and cross-validation methods.
There are many techniques that help with the goals listed at the beginning, and the article also points to many other materials that may be helpful in future studies. A great recommendation to take from all of this is to try the simplest things first.
Implementing a logistic regression model
During the studies of João Gabriel Josephik, also mentored by professor Daniel, he observed that the way the data in the CIC-IoT-2023 dataset was collected could pose a problem for us: the attacks were captured at a different moment than the benign traffic, so the data is unintentionally separable by time period rather than by its nature, as we wanted. One way to avoid this problem is to remove the feature that indicates the time at which the traffic was collected, but how would this affect the dataset and the information it can provide? By studying feature selection, my main goal was to be able to address these questions.
To evaluate modifications to the dataset structure in an objective way, I would need a machine learning model to appraise how they affect the predictions. While studying the article “Applying Hoeffding tree algorithms for effective stream learning in IoT DDoS detection”, by J. G. A. de Araújo Josephik, Y. Siqueira, K. G. Machado, R. Terada, A. L. dos Santos, M. Nogueira, and D. M. Batista, I noticed that the authors used logistic regression and perceptron models as baselines for the evaluation of the Hoeffding Tree, and the logistic regression results surprised me, considering its simplicity and the fact that it was being used in a continuous data stream context, for which it isn’t designed. This made logistic regression seem like a good choice for evaluating my modifications to the dataset.
Using the scikit-learn library, which provides useful machine learning tools, I wrote a simple Jupyter Notebook that implements a logistic regression model, without much concern for its performance or hyperparameter fine-tuning, just with the goal of using it as a black box. It would be good to think a little more about the validation method; cross-validation would probably be a good option to consider, depending on the proportion of the dataset chosen for these tests.
However, the data in the CIC-IoT-2023 dataset is stored as PCAP files, in the form in which they were collected from the IoT network. I still need to figure out how to work with this format, probably by converting it to CSV files (or by checking whether anyone has already done that). So, just to check whether my implementation was working, I used the Pima Indians Diabetes Database, published on Kaggle by UC Irvine, a widely used dataset with a binary classification target. It seems to work!
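For reference, a minimal sketch of that kind of notebook could look like this (not the exact implementation, just the general shape): the model is treated as a black box with default hyperparameters, and cross-validation is used as the validation strategy. It assumes the Pima Indians Diabetes CSV was downloaded locally, with the “Outcome” column as in the Kaggle version:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical local copy of the Pima Indians Diabetes Database;
# "Outcome" is the binary target column in the Kaggle version.
df = pd.read_csv("diabetes.csv")
X = df.drop(columns=["Outcome"])
y = df["Outcome"]

# Black-box model: default hyperparameters, no fine-tuning.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation as one possible validation strategy.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```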
Now, the next step is to be able to use this model to evaluate the CIC-IoT-2023 dataset with my proposed modifications and see if they perform well or not.
Implementing a Hoeffding Tree model
In the last section, I decided to use the logistic regression model as a black box to evaluate my modifications to the dataset, largely motivated by the article that used this model as a baseline against the Hoeffding Tree. But if the article was appraising the Hoeffding Tree itself for the very context I’m working on, why not use it too? So, I decided to also implement a Hoeffding Tree model for the same purposes as the logistic regression one.
The Hoeffding Tree is a machine learning model aimed at continuous learning over data streams, exactly what we expect to deal with when analyzing traffic in computer networks. I implemented it using the River Python library, which provides a set of tools and models for online machine learning. You can find my implementation here, also using the Diabetes Dataset for an initial validation (note that this dataset wasn’t designed to be used in a stream context).
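For illustration, a minimal prequential (test-then-train) loop with River could look like the sketch below, again feeding the Diabetes CSV one record at a time to mimic a stream; it is not the exact notebook, just the general idea:

```python
import pandas as pd
from river import metrics, stream, tree

# Hypothetical local copy of the Diabetes CSV, replayed record by record
# to mimic a data stream (the dataset itself is static).
df = pd.read_csv("diabetes.csv")
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

model = tree.HoeffdingTreeClassifier()
accuracy = metrics.Accuracy()

# Prequential evaluation: predict on each record first, then learn from it.
for features, label in stream.iter_pandas(X, y):
    prediction = model.predict_one(features)
    if prediction is not None:  # no prediction before the first learning step
        accuracy.update(label, prediction)
    model.learn_one(features, label)

print(accuracy)
```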
Next steps
For the next weeks, my expectations are to:
Convert the PCAP files to CSV, either by writing preprocessing code to generate them or by using an existing alternative (one possible starting point is sketched after this list).
Clearly define the strategies I will use to manipulate the CIC-IoT-2023 dataset, considering my studies in feature engineering and especially in feature selection.
Improve and use the logistic regression and Hoeffding Tree models to evaluate the dataset as modifications are applied to it.
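For the first of these steps, one possible starting point (an assumption, not a decision yet) is a small script based on the scapy library; the file names and the extracted fields below are placeholders, not the final feature set:

```python
import csv

from scapy.all import IP, PcapReader

# Placeholder file names; the extracted per-packet fields are only examples.
with PcapReader("capture.pcap") as packets, open("capture.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["timestamp", "src", "dst", "proto", "length"])
    for pkt in packets:
        if pkt.haslayer(IP):
            ip = pkt[IP]
            writer.writerow([float(pkt.time), ip.src, ip.dst, ip.proto, len(pkt)])
```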
It would also be good to exercise what I have learned in Kaggle competitions, as I planned in the next steps of the last post, maybe on this one!
This post was made as a record of the progress of the research project “DDoS Detection in the Internet of Things using Machine Learning Methods with Retraining”, supervised by professor Daniel Batista, BCC - IME - USP. The project is supported by the São Paulo Research Foundation (FAPESP), process nº 2024/10240-3. The opinions, hypotheses and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of FAPESP.