Bachelor Thesis

My Bachelor assignment was the perfect way of deciding at which chair I wanted to follow the Electrical Engineering Master. I was interested in both the Design and Analysis of Communication Systems (DACS) chair and the Telecommunication Engineering (TE) chair. Ultimately I decided to do my Bachelor assignment at the DACS chair, so I met with all the AIOs working at DACS to see what kind of assignments they could offer me. All of the assignments were on state-of-the-art topics: cloud networks for mobile providers, dimensioning link capacity, etc. I chose to work with Rick Hofstede, whose assignment was about HTTP(S) intrusion detection. His previous research was in the field of SSH intrusion detection, and he wanted to see if the same could be done with HTTP(S).

What does ‘HTTPS Intrusion Detection’ mean? Even though the term HTTP(S) is clear to most of you, I am going to give an analogy anyway, as it will help explain other things too. Suppose that instead of visiting Web sites with your browser you want to physically hold the Web pages in your hand. You send a postcard to Scintilla requesting their home page. The Web server reads this request, prints the page, puts it in a package and sends it to your door through PostNL. You open the package and view the Web page. In this scenario PostNL can be compared to HTTP. For HTTPS, imagine that the package is given a lock to which only the sender and receiver have the key. With ‘intrusion’ most people think of a burglar breaking into their home and stealing all their valuable items. For the Web this is quite similar, but instead of a home there is the back end, or control panel, of a Web site, and instead of a door with a lock there is an authentication mechanism. In my research I have looked at three authentication mechanisms: HTTP Basic Authentication (BA), Form-based Authentication (FA) and XMLRPC. There are several ways an attacker can try to gain unauthorized access to such a back end; the one we have researched is the brute-force attack. A brute-force attack is simply trying every combination of username and password you can think of. These brute-force attacks are usually based on a list of commonly used login credentials called a dictionary, hence these attacks are also known as dictionary attacks.

Figure 1: Attack phases

Dictionary attacks typically feature three phases. The first phase is the ‘scan phase’, in which an attacker scans the network for the targeted services. The second phase is the ‘brute-force phase’, in which all the login credentials are tried. This can end in two ways: either no valid credentials are found and the attack ceases, or the last phase follows, the ‘compromise phase’, in which the attacker has gained entry to the back end and is, for example, able to upload illegal content. Brute-force attacks are usually detected by analysing access logs, if they are detected at all. This host-based approach hardly scales in larger networks, since access to the logs of every host is required. Besides the host-based approach, a network-based approach can be taken. This approach can be divided into two categories, namely packet-based and flow-based. To explain these two categories we take another look at our PostNL analogy. Packet-based intrusion detection systems (IDSs) can be seen as systems that open each and every package that passes by to analyse its contents for malicious traffic. As you have likely realized, if the packages are encrypted (if a lock is added), the IDS is no longer able to open the package and analyse its contents. The flow-based approach does not face this problem, as it looks at packet headers, not their payload. A flow can be seen as the label that is attached to each package. It lists the sender and destination, the weight of the package and, if the package is part of a sequence, how long the sequence is, etc. This analogy is not very accurate, but it gives you an idea of what a flow is. Analysing the traffic generated by dictionary attack tools allowed us to develop signatures. These signatures, as shown in Figure X, can be used to detect dictionary attacks from flow data. As can be seen, each signature defines two ranges: the packets per flow (PPF) and the bytes per flow (BPF).
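To make the idea of a flow and a signature a bit more concrete, here is a minimal Python sketch. The class names, field names and numeric ranges are made up for illustration; they are not the signatures from our paper, and a real flow record contains more fields than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One flow record: header-level metadata only, no payload."""
    src_ip: str
    dst_ip: str
    packets: int  # packets per flow (PPF)
    octets: int   # bytes per flow (BPF)


@dataclass(frozen=True)
class Signature:
    """Hypothetical signature: inclusive PPF and BPF ranges for one attack tool."""
    name: str
    ppf_min: int
    ppf_max: int
    bpf_min: int
    bpf_max: int

    def matches(self, flow: Flow) -> bool:
        # A flow matches when both its PPF and its BPF fall inside the ranges.
        return (self.ppf_min <= flow.packets <= self.ppf_max
                and self.bpf_min <= flow.octets <= self.bpf_max)


# Made-up ranges, for illustration only.
example_signature = Signature("example-tool", ppf_min=8, ppf_max=12,
                              bpf_min=900, bpf_max=1400)

login_attempt = Flow("198.51.100.7", "192.0.2.10", packets=10, octets=1100)
page_visit = Flow("203.0.113.5", "192.0.2.10", packets=40, octets=52000)

print(example_signature.matches(login_attempt))  # True
print(example_signature.matches(page_visit))     # False
```

Note that nothing in this check needs the contents of the packets, which is why encryption does not get in the way.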

During my Bachelor assignment much effort was put into developing a flow-based prototype IDS. This prototype uses the signatures we have developed to detect dictionary attacks in given flow data. It detects attacks in three stages. First comes the preselection stage, in which the data is filtered to generate a list of source and destination IP address tuples with at least one flow matching at least one signature. Second comes the detection stage, which is where the detection algorithm comes in. Every flow between the preselected IP address tuples is checked against the signatures. As the signatures define different ranges, there are also different modes of operation: either only the PPF or only the BPF is used, or both the PPF and BPF are used for the signature matching. If a tuple shows a number of consecutive matching flows higher than a given threshold, it is marked as being an attack. Third comes the signature matching stage. This stage is necessary because multiple signatures can be used in the detection stage. The signature matching algorithm finds its roots in the field of digital communication, namely in signal space concepts, where received symbols are mapped to a signal space to determine whether a one or a zero was sent. Instead of using bits in a constellation diagram, we use the PPF and BPF on the axes of an imaginary constellation diagram, and the Pythagorean theorem to find the signature that is closest to the analysed traffic.
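The three stages can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the prototype itself: the signature ranges are made up, flows are assumed to arrive in time order, and for the last stage I assume the traffic is represented by the mean PPF and BPF of the matching run and each signature by the centre of its ranges.

```python
from collections import defaultdict
from math import hypot

# Hypothetical signatures: name -> ((ppf_min, ppf_max), (bpf_min, bpf_max)).
SIGNATURES = {
    "tool-a": ((8, 12), (900, 1400)),
    "tool-b": ((10, 16), (1200, 2000)),
}


def matches(flow, sig, mode="both"):
    """Check a (src, dst, ppf, bpf) flow against one signature."""
    (ppf_lo, ppf_hi), (bpf_lo, bpf_hi) = sig
    ppf_ok = ppf_lo <= flow[2] <= ppf_hi
    bpf_ok = bpf_lo <= flow[3] <= bpf_hi
    return {"ppf": ppf_ok, "bpf": bpf_ok, "both": ppf_ok and bpf_ok}[mode]


def matches_any(flow, mode):
    return any(matches(flow, sig, mode) for sig in SIGNATURES.values())


def centre(sig):
    """Centre of a signature's ranges in the (PPF, BPF) plane (an assumption)."""
    (ppf_lo, ppf_hi), (bpf_lo, bpf_hi) = sig
    return (ppf_lo + ppf_hi) / 2, (bpf_lo + bpf_hi) / 2


def detect(flows, threshold=3, mode="both"):
    """Return {(src, dst): closest signature name} for tuples marked as attacks."""
    # Stage 1, preselection: keep tuples with at least one matching flow.
    per_tuple = defaultdict(list)
    for flow in flows:
        per_tuple[(flow[0], flow[1])].append(flow)
    candidates = {pair: fl for pair, fl in per_tuple.items()
                  if any(matches_any(f, mode) for f in fl)}

    attacks = {}
    for pair, pair_flows in candidates.items():
        # Stage 2, detection: longest run of consecutive matching flows.
        run, best_run = [], []
        for flow in pair_flows:
            if matches_any(flow, mode):
                run.append(flow)
                if len(run) > len(best_run):
                    best_run = list(run)
            else:
                run = []
        if len(best_run) <= threshold:
            continue
        # Stage 3, signature matching: nearest signature by Euclidean distance.
        avg_ppf = sum(f[2] for f in best_run) / len(best_run)
        avg_bpf = sum(f[3] for f in best_run) / len(best_run)
        attacks[pair] = min(
            SIGNATURES,
            key=lambda name: hypot(avg_ppf - centre(SIGNATURES[name])[0],
                                   avg_bpf - centre(SIGNATURES[name])[1]),
        )
    return attacks


flows = [("198.51.100.7", "192.0.2.10", 10, 1100)] * 5
flows.append(("203.0.113.5", "192.0.2.10", 40, 52000))
print(detect(flows))  # {('198.51.100.7', '192.0.2.10'): 'tool-a'}
```

One thing this sketch glosses over is scaling: byte counts are much larger than packet counts, so in a raw distance the BPF axis dominates. Whether and how to normalise the two axes is a design choice that is left open here.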

Figure 2: Detection accuracy under different flow record thresholds

We are number one! That was rather difficult, seeing that we were the only ones around. But in all seriousness, accuracies close to 100% are achievable with the prototype. However, we must acknowledge that there are false positives, that is, normal traffic that is marked as being an attack. These false positives are mainly caused by (legitimate) automated traffic, such as RSS parsers and Web calendar fetchers, and by spam being posted on blogs. This gives a false positive rate of around 10%.

The results of this assignment were documented in a conference paper, in which we presented the first steps in the field of flow-based HTTP(S) intrusion detection. We have shown that the developed prototype, in combination with the signatures, is capable of achieving accuracies close to 100%. However, there are false positives, which are mainly caused by legitimate automated traffic. We realize these types of traffic can be of great importance to Web site owners, as they often rely on search engine rankings for their income, for example. Further investigation of this traffic will therefore be part of our future work. In talks with Antagonist, we have learned that a system as presented in the paper may prove very useful. For example, it could be integrated with an automated system that blocks attackers based on the detection results of our IDS. Requests from blocked IP addresses could be forwarded to a static landing page, from which one can choose to be unblocked. Since attack tools do not understand such a page, humans can easily be unblocked while automated attacks are mitigated.

As this is only the first step in intrusion detection for HTTP(S), there remains a lot of work to be done. If you are interested in continuing where I left off, contact Rick Hofstede from the DACS chair.