Research checkpoint #15: performing the training and inference apart

In the last post, we discussed dimensionality reduction techniques, intending to reduce the amount of memory required to run a Hoeffding Tree model over the CIC-IoT-2023 dataset using a Raspberry Pi, which has very limited resources. In that scenario, this little machine was performing both the training and the inference over the data. But what if it performed just the inference? To overcome the performance limitations of this device, we could have a centralized computer capable of regularly training a model and then distributing it over the network, so that the Raspberry Pi (or another Internet of Things device) would only have the job of watching its network traffic and performing the inference. This is the idea we explore in this post, assessing the viability of the approach. But first, some remarks about the last post.

Discussions regarding post #14

The content of the last post raised some interesting points of discussion that are worth noting, which can be studied with more attention in the future, especially through related literature. They are:

The impact of unbalanced data

One major problem that we faced in the last experiments was data imbalance. For example, we achieved an accuracy of 98.26% with the Mutual Information (MI) technique, but only 2.3% of the data had the “Benign” label, with the remaining 97.7% of the entries labelled as attacks. This means that we could simply declare that all of our traffic was produced by attackers and still achieve a high accuracy. This is an issue we already faced in post #06, in our first experiments running an ML model over the CIC-IoT-2023 dataset, and it is understandable: part of the attacks we are dealing with are volume-based (like DoS and DDoS), so it is entirely expected to have large amounts of data labelled in their categories. The dataset authors even try to minimize this effect by aggregating these kinds of attacks into groups of 100 packets per CSV entry, instead of the regular value of 10, but our slice of the data still had a significant discrepancy.

Some points for attention are:

  • How does our chosen model deal with unbalanced data? Considering that we are working with the Hoeffding Tree, a decision tree model, how well does it handle this issue? This can be used as a factor to determine whether our choice of model was a good one for our scenario. It would probably also require a closer look at how the features of the dataset interact with each other, taking into account how decision trees work. A good idea may be to use decision forests.
  • This is an old problem; what techniques have been developed to tackle it? There are algorithms to perform data augmentation, for example, but would they also be suitable for our context here?

Note that this is a problem only because of the nature of our study: since the goal is to use Machine Learning (ML) models as Intrusion Detection Systems (IDS), we are not only worried about the global accuracy, but much more about the performance on each of the classes, both the attack and the benign data. We want the lowest possible number of false positives and false negatives, and, in practice, we saw that metrics like precision and recall were clearly below what we would like for our context.
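To make this concrete, here is a minimal sketch (with synthetic labels that only mimic the 2.3%/97.7% proportion reported above, not the actual experiment outputs) showing that a “classifier” that marks everything as an attack still reaches roughly 97.7% accuracy while completely missing the benign class:

```python
# Synthetic illustration of the imbalance issue: the labels below only mimic
# the 2.3% benign / 97.7% attack proportion mentioned in this post.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 9770 + [0] * 230  # 1 = attack, 0 = benign
y_pred = [1] * len(y_true)       # trivially label everything as an attack

print(f"accuracy:      {accuracy_score(y_true, y_pred):.3f}")             # ~0.977
print(f"benign recall: {recall_score(y_true, y_pred, pos_label=0):.3f}")  # 0.000
```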

What should be our concerns when producing data from network traffic to feed an IDS?

To feed an IDS, it is natural to use network data, as the dataset we are working with proposes. But, thinking of an online and adaptive system, how should we collect this information from the real world? Two aspects especially need to be considered:

  • At what frequency should we capture network data? In the scenario of unbalanced data discussed in the previous point, one question that arises is whether we need to collect every packet traveling on the network. For example, if we are handling a DDoS attack, we will have a huge volume of packets that are probably very similar; is it necessary to store all of them and feed them to the IDS? Wouldn't this help to bias the model? One idea is to sample packets from the network periodically instead of recording all the traffic, but the impact of this decision needs to be measured.

  • Which features should we use? This is a common question in any machine learning study. We saw with MI that not all features have a big individual impact on the label (although they may have an impact together), so are all of them necessary? Considering that memory is a limited resource for us, a deeper study of feature engineering would be very interesting here.

Note that both of these points relate to the discussion of quality versus quantity of data. When training an ML model, I think both deserve our attention: for a good inference, our data needs to be good and well treated, but we also need a good volume of it to teach the model well. Whether one of these points deserves more attention than the other is a discussion that can be very extensive and will depend on our goals.

The CSV size on the filesystem and as a Pandas DataFrame

Finally, we had another interesting observation in the last post: the original dataset files used add up to 586MB, with the script that runs the model reaching a memory peak of around 1.94GB on the reference computer; however, the CSV file containing the features in the PCA format has 609MB, with the script reaching a 680MB memory peak. It is very curious that the PCA file, which is theoretically a reduction of the original dataset, is more expensive in disk space than the original data, but not in memory.

The difference on disk is perhaps due to the representation of the data: PCA produces only high-precision floats, which can take a considerable number of characters to represent a single value, while the original dataset contains a great volume of zeros (it is sparse) and numbers with lower precision. So, even with fewer features, the PCA file can still use more disk space. Understanding the memory difference, on the other hand, would require a deeper knowledge of how Pandas stores DataFrames under the hood. Still interesting!
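For reference, here is a minimal sketch of how the two quantities can be compared; the file names are hypothetical placeholders for the original and PCA CSVs:

```python
# Compare the size of a CSV on disk with its size when loaded as a DataFrame.
# The file names are placeholders, not the actual files from the experiments.
import os
import pandas as pd

for path in ["original_features.csv", "pca_features.csv"]:
    df = pd.read_csv(path)
    disk_mb = os.path.getsize(path) / 1024 ** 2
    mem_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
    print(f"{path}: {disk_mb:.1f}MB on disk, {mem_mb:.1f}MB as a DataFrame")
```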

Experiment

In post #13, we observed that improvements in hardware have a significant impact on the capability of training a machine learning model, more specifically the Hoeffding Tree model, the one we are working with. One possibility raised was to perform the training phase on this more powerful machine and then perform the inference on edge devices, such as a Raspberry Pi, which are more limited in terms of performance. As discussed, this can be done through serialization, using libraries such as Joblib. In this experiment, we evaluate this possibility.

The procedure adopted is:

  • On the more powerful computer, we train the Hoeffding Tree model;
  • After the training is complete, we serialize the model using the Joblib library and save it as a .pkl file;
  • We transfer this file to the edge device through SCP;
  • The edge device loads the model and performs the inference.

These steps are orchestrated by a Bash script available in the project's GitHub repository, together with the training and inference Python scripts. The dataset used in both of these phases is also available. We decided to use two scenarios from the CIC-IoT-2023 data: the benign traffic and the Slowloris DDoS attack, as it is aligned with our topic of study and has an adequate amount of data for our executions. One of the targets of the Slowloris attack in this dataset is a Raspberry Pi, with MAC address DC:A6:32:C9:E6:F4. In the inference CSVs, we kept only the traffic concerning this device to simulate its connection to the network (final file with 2.2MB), while in the training CSVs we used the rest of the data, without it (around 270MB).
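For illustration, here is a minimal sketch of the two Python ends of this procedure; the file names, the label column name, and the load_stream() helper are hypothetical placeholders, and the actual scripts are the ones in the repository:

```python
# Sketch of the train-on-one-machine, infer-on-another flow. File names and
# the "label" column are placeholders for the real CSVs used in the project.
import joblib
import pandas as pd
from river import tree


def load_stream(path, label_col="label"):
    """Yield (feature dict, label) pairs from a CSV, one entry at a time."""
    df = pd.read_csv(path)
    for _, row in df.iterrows():
        yield row.drop(label_col).to_dict(), row[label_col]


# --- training computer ---
model = tree.HoeffdingTreeClassifier()
for x, y in load_stream("train.csv"):
    model.learn_one(x, y)
joblib.dump(model, "hoeffding_tree.pkl")  # this .pkl file is the one sent via SCP

# --- edge device (Raspberry Pi) ---
model = joblib.load("hoeffding_tree.pkl")
correct = total = 0
for x, y in load_stream("inference.csv"):
    correct += int(model.predict_one(x) == y)
    total += 1
print(f"inference accuracy: {correct / total:.4f}")
```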

During the training and the inference, we measured the total time consumed (using the time command) and the memory peak (using Python's resource module) over five executions of the procedure; a sketch of the memory measurement is shown right after the hardware list. The computers used were the usual ones, connected through a CAT 5e Ethernet cable on a local network without internet access:

  • Raspberry Pi Model 3 B
    • Quad Core 1.2GHz Broadcom BCM2837 64bit CPU
    • 1GB RAM
    • Debian GNU/Linux 12 (bookworm) OS
    • 100 Mbits/sec network interface card (Fast Ethernet)
  • Acer Aspire 3 A315-41-R4RB
    • AMD Ryzen 5 2500U 2.0GHz 64bit CPU
    • 12GB DDR4 2667 MHz RAM
    • Fedora Linux 41 (Silverblue) OS
    • 1000 Mbits/sec network interface card (Gigabit Ethernet)
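As a reference for the memory measurement mentioned above, here is a minimal sketch of how a script can report its own memory peak with the resource module; this mirrors the idea, not necessarily the exact code of the repository's scripts:

```python
# Report the peak resident set size of the current process at the end of a run.
# On Linux, ru_maxrss is given in kilobytes.
import resource
import sys

peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"memory peak: {peak_kb / 1024:.1f}MB", file=sys.stderr)
```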

Results

Here are the final collected metrics:

| Metric | Training computer | Inference computer | Total |
| --- | --- | --- | --- |
| Average runtime | 17m24.77s | 2m0.80s | 19m35.81s |
| Memory peak | 2.25GB | 0.72GB | - |

The runtime standard deviation observed across the five executions was small, amounting to around 1.7% of the average total execution time.

We saw in post #06 that the Raspberry Pi has difficulties dealing with high volumes of data, mainly due to its 1GB RAM limitation. Here, however, we have a great outcome: the most memory-demanding part is handled by the training computer, which reaches a peak that the Raspberry Pi wouldn't be able to sustain. This is exactly what we expected to observe: the most computationally heavy part is managed by our central computer, while the edge device deals with a smaller load and still succeeds.

This success, however, is accompanied by some remarks:

  • The Raspberry Pi was able to do its assignment, but it required some adjustments to the model size. We are using the Hoeffding Tree model provided by the river library, and, by default, its maximum size is 100MB. With this value, however, our constrained device wasn't able to load the serialized parameters, with the process being killed. To achieve these results, I had to disable poor attributes (the remove_poor_attrs parameter) and reduce the size limit to 65MB, a significant reduction that can have impacts we aren't considering here and that should be taken into account in further studies (see the configuration sketch after this list). Note that our inference dataset has only 2.2MB (around 3.3MB when loaded in memory as a Pandas DataFrame), so most of the 0.72GB peak observed comes from loading the model alone!

  • Again, we had problems with data balancing, in a way that makes it hard to evaluate the effectiveness of the model. In the inference phase, we achieved an accuracy of 85.25%, but only 6% of this data was benign, resulting in an F1-score of 44.12% for this label, versus 91.50% for the attack label. Considering also that the precision for the benign label was very low (28.86%), a large share of the entries classified as benign were actually attacks, which shows that the model is still far from separating the two classes reliably.

  • We didn't make any measurements of the network impact of sharing the model in this way. Considering the runtime averages observed, around 19m25.57s were spent in the training + inference phases, with an additional ~10s in the total time, corresponding to the transfer of the model from the notebook to the Raspberry Pi. But how does this impact the network?
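For reference, here is a minimal sketch of the memory-related adjustments mentioned in the first remark above, using the parameters exposed by river's HoeffdingTreeClassifier (all other hyperparameters keep their defaults):

```python
# Memory-related adjustments reported in this post; every other
# hyperparameter of the Hoeffding Tree keeps its default value.
from river import tree

model = tree.HoeffdingTreeClassifier(
    max_size=65,             # maximum tree size in MB (the default is 100)
    remove_poor_attrs=True,  # disable poor attributes to reduce memory usage
)
```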

To conclude, with this experiment we saw that this way of sharing the model is viable! But there is more to investigate.

Next steps

A lot of questions can be raised considering what we have studied so far! As we are completing one year of this project, the next post will be dedicated to organizing and connecting our results, as well as shedding light on the next interesting points of study.

This post was made as a record of the progress of the research project “DDoS Detection in the Internet of Things using Machine Learning Methods with Retraining”, supervised by professor Daniel Batista, BCC - IME - USP. The project is supported by the São Paulo Research Foundation (FAPESP), process nº 2024/10240-3. The opinions, hypotheses and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of FAPESP.

This post is licensed under CC BY 4.0 by the author.