Research checkpoint #04: learning to manipulate network packets in Python
During my studies on the CIC-IoT-2023 dataset, which provides PCAP files containing IoT network traffic data under benign and malign scenarios, as reported in the last post, I discovered new Python libraries that could be really useful for our work, like dpkt and PySpark. In this post, I report my studies and tests in Python libraries that are capable to deal with PCAP files, permitting the user to manipulate them and extract information. Also, the scripts in the supplementary material of the dataset I cited before were helpful to give some hints, however I think they will be more useful just as a reference than a basis to be adapted depending on the approach we choose for the data, I will comment more on this later.
Python libraries to manipulate PCAP files
The Packet Capture (PCAP) file format is an standard to save captured network traffic data. However, Python doesn’t provide by default methods to deal with these files, being necessary the use of external libraries to parse them to Python data structures so they can be manipulated. As I commented before, during my studies on the CIC-IoT-2023 dataset scripts I found dpkt, which aims this exactly objective, but there are some other libraries that can do this same function: PyShark (suggested by professor Daniel) and Scapy. They have some differences between them and what each one can offer, but all of them can be used to parse PCAP files. In the next subsections, I describe more about each one of them and also report the results of a small performance test I conducted just to check how they handled with 10MB of data. The notebook containing the tests can be found here.
Until this moment, my main goal in studying these libraries was to be able to parse the PCAP files to CSV ones, that can be easily interpreted by Pandas and, consequently, be used by the majority of the machine learning models available on public libraries. However, after observing the tests below I think this decision needs to be discussed further: we really need to convert all of the data to CSV files? During the training phase of the ML model we need to find a way to parse the PCAP files to Pandas dataframes, and I agree that CSV files would be a great solution, as we have an enormous amount of data and we would also need to flush benign and malign traffic. However, when dealing with real network traffic after this first phase, the pipeline of data preparation will change drastically, as we will receive packets from the network sequentially, analyze them to make predictions and use for them for stream learning. I think this process can be done in a cleaner way than creating CSV files for each packet we receive from the network. This is something to keep in mind when creating the scripts that will manipulate the data before feeding it to the model.
dpkt
dpkt is a Python library that permits the creation and visualization of network packets, also being able to parse PCAP files. The biggest advantage of this module is its speed: in my tests, it was able to load 10MB of network traffic data in an average time of only 0.12 seconds, which is amazing! To achieve this speed, dpkt only provides a simple parsing of the raw package to Python structures, not doing any additional work or analysis on them.
One thing to notice is that dpkt can only handle PCAP files, not PCAPNG ones, the standard used by Wireshark. The prefix of the binary files in these two formats are different, with PCAPNG having a different first block that is used as its magic number. This is one thing to observe when dealing with the CIC-IoT-2023 dataset: I don’t know if this is a pattern across all of the type of attacks, but in my reduced downloaded dataset the attacks with only one file were on the PCAPNG format (even though using the ‘.pcap’ extension) and the ones with more than one file were really PCAP files. I think this is due to the tool used to brake the original traffic collected in Wireshark into smaller files of at most 2GB, tcpdump.
PyShark
PyShark is a Python wrapper to tshark, a CLI network protocol analyzer, a terminal version of Wireshark. In this way, this module not only provides the raw data of the packets in both PCAP and PCAPNG file formats, but also interprets them using the Wireshark dissectors, simplifying the process of information extraction. It’s also capable to sniff the network, one thing that dpkt can’t do and would require the use of another library. This is an amazing tool to understand the packets and manipulating them, however this comes with a cost: speed. In the tests I conducted, in its best performance this library took an average of 141.52 seconds to parse the exactly same 10MB of network traffic data parsed by dpkt in less than one second. Also, it doesn’t goes well with Jupyter Notebooks because of some asynchronous functions.
One advantage of using it is the stream index information, which can be useful as I discussed in the last post. With the stream index of the TCP/UDP packets, its easy to identify the packets that are in the same flow, making it easier to flush the benign and malign traffic data. Notice that this information is not in the packet, it’s an attribute calculated by tshark based on the IP/port pair of both ends of the communication and, in the case of the TCP packets, probably some analysis to understand when the conversation starts and ends using the TCP flags (like FIN/RST).
Scapy
Scapy is an interactive packet manipulation library based on Python that is capable of creating, parsing, modifying and sniffing packets from the network, working with both PCAP and PCAPNG file formats. It can be used also as a Python library, being the middle ground in the tradeoff between speed and functionality, as we seen in dpkt and PyShark. In my tests, it took an average of 12.28 seconds to handle with the 10MB of network traffic data. It also has the biggest documentation out of these three options.
Next steps
In my understanding, some possible next steps after these two weeks and these two months of project are:
Design how will be the pipeline of data preparation, based on the libraries above and the studies of feature engineering I’ve done so far;
Try to implement the basis for it, the baseline models will be a great addition to verify if the process is working properly.
This post was made as a record of the progress of the research project “DDoS Detection in the Internet of Things using Machine Learning Methods with Retraining”, supervised by professor Daniel Batista, BCC - IME - USP. Project supported by the São Paulo Research Foundation (FAPESP), process nº 2024/10240-3. The opinions, hypothesis and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of FAPESP.