Research checkpoint #00: a new project

Over the next two semesters, I’ll be working on a new project: a scientific initiation (commonly referred to as “IC”). I’ll be advised by Professor Daniel Batista, from the Software Systems Group of the Computer Science Department at IME-USP. This will be a great opportunity to learn a lot of new things, delve deeper into topics I enjoyed in my undergraduate classes and get in contact with how scientific research works. In this post, I briefly describe the project and the progress I’ve made in these first few weeks!

The project

The Internet of Things (IoT) represents a new reality, attracting the attention not only of legitimate users but also of attackers. The evolution of botnets used for Distributed Denial of Service (DDoS) attacks, combined with the limited processing power and storage capacity of IoT devices, justifies the need for a centralized Intrusion Detection System (IDS). One possible scenario is the one in which the centralized IDS is constantly retrained on all network traffic while inference is performed on the IoT devices themselves, each considering only its own traffic. How efficient is this approach? And which retraining, distributed inference and feature engineering techniques give the best results for models already known to perform well? These are some of the questions we aim to answer through experiments.
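To make the scenario above more concrete, here is a minimal sketch of the idea, not the model the project will actually build. It assumes a toy nearest-centroid classifier in place of a neural network, and the names `Flow`, `train_central` and `Device`, as well as the two features and all the numbers, are made up for illustration.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Flow:
    device_id: str
    features: list  # hypothetical: [packets per second, mean packet size]
    label: str = ""  # only known for training data


def train_central(flows):
    """Fit a nearest-centroid model on labelled flows from the whole network."""
    grouped = {}
    for flow in flows:
        grouped.setdefault(flow.label, []).append(flow.features)
    return {
        label: [sum(col) / len(col) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


class Device:
    """An IoT node that keeps a copy of the model and scores only its own flows."""

    def __init__(self, device_id):
        self.device_id = device_id
        self.centroids = {}

    def receive_model(self, centroids):
        self.centroids = dict(centroids)

    def infer(self, flow):
        if flow.device_id != self.device_id:
            return None  # traffic from another device is ignored
        return min(
            self.centroids,
            key=lambda label: dist(self.centroids[label], flow.features),
        )


# Made-up labelled traffic collected centrally from two devices.
history = [
    Flow("cam", [10, 500], "benign"),
    Flow("lamp", [12, 480], "benign"),
    Flow("cam", [900, 60], "attack"),
    Flow("lamp", [950, 64], "attack"),
]

cam = Device("cam")
cam.receive_model(train_central(history))

suspicious = Flow("cam", [350, 250])
before = cam.infer(suspicious)
ignored = cam.infer(Flow("lamp", [850, 70]))

# Central retraining after a new kind of attack is labelled, then redeployment.
history.append(Flow("cam", [300, 200], "attack"))
cam.receive_model(train_central(history))
after = cam.infer(suspicious)

print(before, after, ignored)
```

In this toy setup the same flow is classified differently before and after the central retraining step, which is the kind of effect the experiments are meant to measure on real traffic and real models.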

The expected general schedule is as follows:

  • First trimester: Read related works, decide which dataset we’ll work with during the project and begin creating the centralized model with distributed inference using neural networks.
  • Second trimester: Keep working on the model and write scientific reports.
  • Third trimester: Finish the model, run experiments on it to generate results and evaluate them. Also start writing didactic material describing all the experimentation procedures.
  • Fourth trimester: Study the model’s explainability and finish the didactic material, as well as the last scientific report.

Building the foundations

Until now, I had never studied anything in the AI field in depth. Since this will be one of the main pillars of this research (along with other topics like computer networks, which I’ve already taken a course in), over the last few weeks I’ve invested some time in learning and consolidating my foundations in this topic, so that I can effectively handle the questions raised in this project. All of this was something of a crash course, and I’m sure these foundations will need to be expanded during the semester, but that’s also part of research: learning new things as they need to be learned.

Materials studied

These were the theoretical materials that I read and studied during this period:

Kaggle courses

A great discovery was Kaggle, a platform supported by Google that connects a community of researchers and students in AI, ML and data science. It also offers great courses on many important topics in these fields. I have completed three of them (“Intro to Machine Learning”, “Pandas” and “Intermediate Machine Learning”) and I’m finishing a fourth (“Feature Engineering”). They are great because they also provide Jupyter Notebooks with exercises to put into practice what was taught. It’s been a good way to see a clear application of the topics I read about in the articles mentioned earlier.

The three Kaggle course certificates

Classes in the university

This semester I’m also taking the course “PMR3508 - Machine Learning and Pattern Recognition”, taught by Professor Fábio G. Cozman. It aims to be an introductory course in the machine learning field, also providing a statistical basis and notions of data science.

So far the classes have been great: defining what machine learning is, differentiating supervised and unsupervised learning, covering data preparation techniques (handling missing data, feature engineering, normalization, unbalanced datasets etc.), defining the Bayes classifier (and its theoretical optimality) and introducing our first ML model (kNN). It has been a great way to delve deeper into the theory behind the concepts I was already studying and also an incredible opportunity to see them applied in the real world, as the professor’s experience adds a lot to the classes when he shares problems he has faced during his career.
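As a small illustration of two of these topics working together, here is a minimal kNN sketch with min-max normalization, written with only the Python standard library. It is not code from the course: the helper names, the two features and the data points are all made up.

```python
from collections import Counter
from math import dist


def min_max_fit(rows):
    """Return per-feature (min, max) computed on the training rows."""
    return [(min(col), max(col)) for col in zip(*rows)]


def min_max_apply(row, bounds):
    """Scale each feature using the training bounds; constant features map to 0.0."""
    return [
        (value - low) / (high - low) if high > low else 0.0
        for value, (low, high) in zip(row, bounds)
    ]


def knn_predict(train_x, train_y, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    neighbours = sorted(
        zip(train_x, train_y), key=lambda pair: dist(pair[0], query)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Made-up data: [packets per second, mean packet size in bytes]
raw_x = [[10, 500], [12, 480], [900, 60], [950, 64], [15, 520], [870, 70]]
labels = ["benign", "benign", "attack", "attack", "benign", "attack"]

bounds = min_max_fit(raw_x)
train_x = [min_max_apply(row, bounds) for row in raw_x]

query = min_max_apply([800, 80], bounds)
prediction = knn_predict(train_x, labels, query, k=3)
print(prediction)
```

Normalizing first matters here because kNN relies on distances: without it, the feature with the largest numeric range would dominate the vote regardless of how informative it is.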

An experiment

One of the steps we’ll take in this project is to conduct experiments on a local network of Raspberry Pi devices at IME-USP, simulating an IoT environment. One of the motivations for later using neural networks is that they perform well on these devices, as shown in this PyTorch tutorial. I found the tutorial very interesting and, since I have a Raspberry Pi 3 Model B, I wondered how it would perform, as the tutorial targets the Raspberry Pi 4, which has considerably better hardware. I reproduced the experiment and also got great results! I’ll share them in more detail in a future post.

Next steps

In the next two weeks, I expect to:

  • Continue to study and practice feature engineering techniques and also improve my skills in data visualization and manipulation.

  • Use what I’ve learned to take a closer look at dataset alternatives for the project. We’re considering using the CIC-IoT-2023 dataset, which is fairly recent and has extensive IoT attack data, but previous studies found some problems with it and we need to understand whether they make the dataset unusable.

In the long term, I expect to start studying distributed inference and retraining techniques, already thinking about the model we will develop in the future.

This post was made as a record of the progress of the research project “DDoS Detection in the Internet of Things using Machine Learning Methods with Retraining”, supervised by Professor Daniel Batista, BCC - IME - USP. Project supported by the São Paulo Research Foundation (FAPESP), process nº 2024/10240-3. The opinions, hypotheses and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of FAPESP.

This post is licensed under CC BY 4.0 by the author.