Research checkpoint #01: diving into feature engineering

Posted Aug 26, 2024 Updated Jan 2, 2025

By Otávio de Oliveira Silva

4 min read

It’s been a week since the last post in which I introduce this new project that I’ll be working on across the next months, and this is the first checkpoint post! This week I kept studying feature engineering techniques, with emphasis in some theoretical material, following the next steps planned in the previous post.

Kaggle

Kaggle really is proving to be a great tool to learn and exercise topics related to machine learning and data analysis, this week I dedicated some time on it with the activities I list below.

Feature Engineering course

Another course completed: the Feature Engineering one. I think this was one of the bests of the ones I completed so far in the platform, covering from introductory topics to others more advanced. The first lesson defined clearly that “The goal of feature engineering is simply to make your data better suited to the problem at hand”. All the work on the data need to have this in mind, to make it easier for the model to infer patterns.

Feature Engineering course certificate

The lessons, together with exercises on real datasets, covered the following topics:

Mutual information: a way to quantify how two random variables depends on each other, very useful to determine which features have the most influence in the target.
Creating features: as well as removing features that aren’t useful to the model, creating features is another classical technique and very useful.
Clustering with K-means: this technique uses of unsupervised learning algorithms to find clusters in the data that may not be noticed by the supervised model.
Principal Component Analysis (PCA): looking at the axes where there is greater variation in the data, this technique can reveal how the points establish relationships with each other. This can also be used to reduce the dimensionality of the data.
Target Encoding: a more complex technique, this uses the target to help encoding categorical features.

As this course can not cover deeply all the thematic, they also suggest some great references that can be useful in the context of the project:

The Art of Feature Engineering, a book by Pablo Duboue.
An Empirical Analysis of Feature Engineering for Predictive Modeling, an article by Jeff Heaton.
Feature Engineering for Machine Learning, a book by Alice Zheng and Amanda Casari. The tutorial on clustering was inspired by this excellent book.
Feature Engineering and Selection, a book by Max Kuhn and Kjell Johnson.

Competitions involving data from computer networks

In addition to the courses, Kaggle also offers competitions where users can test their knowledge in real datasets and compare their results, being ranked as their model inference gets closer to the true target value. Searching for competitions related to computer networks, some of them looked to be interesting for future analysis (both of them accept late submissions):

More search for datasets

A closer analysis still has to be made in the CIC-IoT-2023 to determine if we can use it, in special some tools can be used to it: reading the “Feature Selection” section of the second chapter from the book Feature Engineering for Machine Learning, by Zheng and Casari, recommended by the Kaggle’s course on Feature Engineering, some techniques are described to determine which features can be pruned away from the data (and this is a concern in this dataset): filtering, wrapper methods and embedded methods. They are deeply explored in the paper “An Introduction to Variable and Feature Selection”, by Guyon and Elisseeff.

Also, other place to look after in future researches for datasets is the IEEE DataPort, that contains a large dataset storage that can be useful, like these two (a close look into them has to be made to conclude if they are truly useful for our purposes or not):

Next steps

For the next weeks:

Take a look into the Data Cleaning and Data Visualization courses on Kaggle, preparing myself to be able to review the datasets we are considering for work.
Also on Kaggle, consider to exercise my learnings on feature engineering in competitions, specially the ones related to computer networks.
Invest some time also in more of the theoretical material cited across this post, like the paper from Guyon and Elisseeff.
And finally, a question to be considered after all this is study is: automated feature engineering tools achieve reasonable results? Maybe they can give good hints in our data preparation.

This post was made as a record of the progress of the research project “DDoS Detection in the Internet of Things using Machine Learning Methods with Retraining”, supervised by professor Daniel Batista, BCC - IME - USP. Project supported by the São Paulo Research Foundation (FAPESP), process nº 2024/10240-3. The opinions, hypothesis and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of FAPESP.

Undergraduate Research Project

This post is licensed under CC BY 4.0 by the author.