Abstract:
Almost every aspect of human life is now affected by the Internet, and incidents of cyberattacks
and intrusions have therefore become regular news. Among the many attack types, denial-of-service
(DoS) attacks remain the most devastating and severe because of their potential impact. Within
this family, attacks at the application layer are particularly challenging to identify since they
are stealthy by nature. HTTP flooding is an application-layer attack that is extremely dangerous
and damaging: because the attacker uses seemingly legitimate HTTP GET or POST requests, a targeted
site or server can be brought down simply by flooding it with a large number of such requests.
Machine learning and artificial intelligence research has expanded rapidly in recent years, offering
new opportunities for intrusion detection solutions. However, data availability continues to greatly
affect the success of such systems, as high-quality intrusion detection system (IDS) datasets are
scarce. This study introduces a solution that contributes to the detection of HTTP flood attacks
using five machine learning approaches. Because the dataset is an important part of building
machine learning-based IDS models, the process starts with generating one. Normal HTTP traffic was
generated with Selenium, a web browser automation tool; HTTP flood attack traffic was generated
with tools such as slowhttptest and HOIC. Wireshark was used to capture the network traffic and
save it as a pcap file, and CICFlowMeter was then used to convert the pcap file to CSV format,
extracting 84 features. After both manual and automatic feature selection, 30 features were
selected and used as input to the machine learning models for further experimentation.
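The two-stage feature selection described above can be illustrated with a minimal sketch. The abstract does not state which automatic method was used, so the variance filter below is only a stand-in; the column names, the `MANUAL_DROP` set, and the `select_features` helper are hypothetical, and real CICFlowMeter headers differ between versions.

```python
import statistics

# Assumption: identifier-like columns removed by hand (the "manual" pass).
MANUAL_DROP = {"Flow ID", "Src IP", "Dst IP", "Timestamp"}
LABEL = "Label"


def select_features(rows, min_variance=0.0):
    """Return the feature names that survive a manual and an automatic pass."""
    # Manual pass: drop identifier columns and the class label.
    candidates = [c for c in rows[0] if c not in MANUAL_DROP and c != LABEL]
    kept = []
    for name in candidates:
        values = [float(r[name]) for r in rows]
        # Automatic pass (stand-in): discard features that never vary.
        if statistics.pvariance(values) > min_variance:
            kept.append(name)
    return kept


# Three made-up flow records in the shape of a CSV export read as dicts.
rows = [
    {"Flow ID": "a", "Src IP": "10.0.0.1", "Dst IP": "10.0.0.9", "Timestamp": "t0",
     "Flow Duration": "120", "Tot Fwd Pkts": "4", "Bwd PSH Flags": "0", "Label": "normal"},
    {"Flow ID": "b", "Src IP": "10.0.0.2", "Dst IP": "10.0.0.9", "Timestamp": "t1",
     "Flow Duration": "3", "Tot Fwd Pkts": "250", "Bwd PSH Flags": "0", "Label": "attack"},
    {"Flow ID": "c", "Src IP": "10.0.0.3", "Dst IP": "10.0.0.9", "Timestamp": "t2",
     "Flow Duration": "95", "Tot Fwd Pkts": "6", "Bwd PSH Flags": "0", "Label": "normal"},
]

selected = select_features(rows)
print(selected)
```

In this toy input the constant `Bwd PSH Flags` column is discarded, while the two columns that vary across flows are kept.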
This study analyzes a machine learning-based HTTP flood attack detection system. Five supervised
machine learning classifiers are evaluated: Random Forest (RF), AdaBoost, Naive Bayes (NB),
multi-layer perceptron (MLP), and a long short-term memory recurrent neural network (RNN-LSTM).
They are compared using seven performance evaluation metrics, namely accuracy, precision, recall,
F-measure, false positive rate, false negative rate, and training time (in seconds). With a test
size of 20%, the Random Forest algorithm produced the best results on four of these metrics,
accuracy, recall, F-measure, and false negative rate, with values of 98.30%, 97.03%, 97.98%, and
2.96%, respectively. The Naive Bayes algorithm, on the other hand, was comparatively the worst
performer for the detection of HTTP flood attacks in this study. Although the ranking of the
classifiers varies slightly from metric to metric, ordering them by accuracy from best to worst
gives Random Forest, MLP, RNN-LSTM, AdaBoost, and finally Naive Bayes, with corresponding values of
98.30%, 97.75%, 97.70%, 95.48%, and 93.54%, respectively.
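For reference, the six rate-based metrics named above follow from the four confusion-matrix counts. The sketch below uses the standard definitions with attack traffic as the positive class; the `detection_metrics` helper and the counts passed to it are illustrative assumptions, not figures from this study.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return the six rate metrics as percentages (attack = positive class).

    Assumes every denominator is non-zero, i.e. both classes occur and at
    least one sample is predicted positive.
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    false_positive_rate = fp / (fp + tn)
    false_negative_rate = fn / (fn + tp)
    rates = {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_positive_rate": false_positive_rate,
        "false_negative_rate": false_negative_rate,
    }
    return {name: round(100 * value, 2) for name, value in rates.items()}


# Made-up counts for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Because recall and false negative rate share the same denominator, they always sum to 100%, which matches the reported Random Forest values of 97.03 and 2.96 up to rounding. Training time, the seventh metric, is measured separately around the model fitting step.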