Bas de Heer - this article examines the current state of incident response to the disclosure of a widespread, major vulnerability.
As an industry, this is something we are not yet prepared for. This article begins by looking at the situation in our company and then discusses the research we have conducted into finding the best method for dealing with this problem.
Working in the financial sector, we at the Security Testing Team test the security of internal and external applications. With hundreds of external and numerous internal websites, the recent announcement of Heartbleed caused quite a stir here. Checking everything was actually a lot of fun. Until a few days ago.
The day following the disclosure, we gathered all known external applications (we won’t include the internal applications in this article), resolved their IP addresses, checked whether port 443 was open and started testing the resulting list of about 200 IP addresses. Other ports were tested later.
This was done because the Heartbleed vulnerability can expose critical information about a system, enabling an attacker to access encrypted information, source code, the private keys of the server and much more. Heartbleed is a vulnerability in OpenSSL 1.0.1 up to 1.0.1f, a library used on a very large number of servers. Port 443 was scanned first to get quick results before conducting a slower, more accurate test.
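As a rough sketch, the triage step above can be automated along these lines. The helper names and the use of Python's socket module are our illustration, not the exact script we ran, and the hostnames fed into it would come from the application inventory:

```python
# Sketch of the triage step: resolve each known external hostname to its
# IPv4 addresses and keep the addresses that accept connections on port 443.
import socket

def resolve(hostname):
    """Return the set of IPv4 addresses a hostname resolves to (empty on failure)."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(hostname, 443, socket.AF_INET)}
    except socket.gaierror:
        return set()

def port_open(ip, port=443, timeout=3.0):
    """True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def build_target_list(hostnames):
    """Resolve all hostnames and return the sorted list of IPs with 443 open."""
    ips = set()
    for name in hostnames:
        ips |= resolve(name)
    return sorted(ip for ip in ips if port_open(ip))
```

The same skeleton works for the later scans of other ports by passing a different port number to `port_open`.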
We initially used the Python script ssltest.py locally, wrapped in a script that automated testing of the whole list. We also tried the remote scanners (Qualys, filippo) and various other tests, but these were either found to be unreliable due to heavy use or difficult to automate. We found a small number of vulnerable servers and these were upgraded, thus mitigating the vulnerability.
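A minimal sketch of that wrapper, assuming the checker prints a line containing "vulnerable" or "not vulnerable" — the script path and output format are assumptions, since the ssltest.py variants in circulation differ slightly:

```python
# Hedged sketch: run a command-line Heartbleed checker against each IP in a
# list and collect a simple verdict per host.
import subprocess
import sys

def check_host(ip, script="ssltest.py", timeout=30):
    """Run the checker against one IP; return True, False, or None (error/timeout)."""
    try:
        result = subprocess.run(
            [sys.executable, script, ip],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    out = result.stdout.lower()
    if "not vulnerable" in out:
        return False
    if "vulnerable" in out:
        return True
    return None  # output not recognised; treat as inconclusive

def scan(ips):
    """Map every IP in the list to its verdict."""
    return {ip: check_host(ip) for ip in ips}
```

Treating unrecognised output as inconclusive rather than negative matters: as the rest of this article shows, silently mapping "no finding" to "not vulnerable" is exactly how false negatives slip through.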
Fast-forward a week
A week later, we found a false negative from our testing. This meant that we would have to retest the entire list. By now, more tools had become available and we mustered all tools we could find:
- McAfee online Heartbleed tester
- Qualys SSL Labs SSL server test
- SSLyze 0.9
- Cardiac Arrest
There are obviously more tools, but we considered that ten different tools would be enough to get an idea of how vulnerable our servers were to Heartbleed. The idea was to run some of the tools on the full list of IP addresses and build a matrix with a simple 0 (negative) or 1 (positive) score per tool. Wherever a tool scored a 1 for a host, we would test that host with all the other tools manually.
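The scoring matrix can be sketched as follows; the IP addresses, tool names as keys, and scores are illustrative only:

```python
# Sketch of the 0/1 scoring matrix: rows are hosts, columns are tools,
# cells are 0 (not vulnerable) or 1 (vulnerable). Any host on which the
# tools disagree is queued for manual retesting.
results = {
    "198.51.100.10": {"nmap": 0, "qualys": 0, "cardiac_arrest": 0},
    "198.51.100.11": {"nmap": 1, "qualys": 0, "cardiac_arrest": 1},
}

def needs_manual_check(row):
    """True if the tools did not return a unanimous verdict for this host."""
    return len(set(row.values())) > 1

disputed = [ip for ip, row in results.items() if needs_manual_check(row)]
```

With perfect tools `disputed` would always be empty; the next sections describe how far reality fell short of that.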
What we found was messy. Nexpose performed two scans, a short one and a long one, which disagreed with each other on the first test: the first scan reported that the system was not vulnerable, while the second scan reported that it was vulnerable. We ran NMAP multiple times with results interesting enough that we gave it its own table.
Nessus did not scan the full list of IP addresses — it scanned 68 machines at most — so we separately scanned three hosts it had not yet covered but which had positives in other scanners.
- Two tests within half an hour, first on a list of 15 and then on the full list of IPs
- Two tests a week apart; this machine was updated between the tests
- ? Due to hostname resolution problems, host C wasn't testable with filippo or Qualys
NMAP, Qualys, and Cardiac Arrest seem to agree with each other. Apart from these, the outputs of the other tools seem unrelated. Running NMAP on one OS X 10.9 machine gave different results than on the other OS X 10.9 machine; more on that later. Site C had another problem with its SSL configuration, but it was included because it gave both positive and negative findings in other tools. On two occasions the expensive tools agreed with each other, and the free tools agreed with each other too, but with opposite results to the expensive ones.
Why is detecting a specific vulnerability so difficult? Why can’t we do better? Which tool is correct?
Adrian Hayter performed another test, in which a proof-of-concept server was tested with his own script, Cardiac Arrest. His results agreed with ours. He studied the detection methods and TLS configurations to find the cause, but the picture only became more complicated and uncertain across different servers.
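For context on what these detectors actually send: a Heartbleed probe is a TLS heartbeat request (RFC 6520) whose declared payload length exceeds the payload actually sent; a vulnerable server echoes back the missing bytes from its own memory. A hedged sketch, building only the probe bytes and sending nothing over the network — the TLS 1.1 version bytes used here are one choice among several, and that kind of variation between tools is exactly the sort of detail Hayter found matters:

```python
# Build the malformed TLS heartbeat record used by Heartbleed detectors.
# Record layout: content type 0x18 (heartbeat), 2 version bytes, 2-byte
# record length; then message type 0x01 (request), 2-byte *claimed* payload
# length, and the (shorter) actual payload.
import struct

def heartbeat_record(claimed_len, payload=b"", tls_version=(3, 2)):
    """Build a TLS heartbeat request record with a lying payload length."""
    msg = struct.pack(">BH", 0x01, claimed_len) + payload     # request + claimed length
    hdr = struct.pack(">BBBH", 0x18, *tls_version, len(msg))  # heartbeat record header
    return hdr + msg

# Claim 16 KiB of payload while sending none at all.
probe = heartbeat_record(claimed_len=0x4000)
```

Whether a given server even answers this probe depends on the handshake state and TLS version the tool negotiated first, which helps explain why the tools in our matrix disagreed so often.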
As we mentioned, NMAP deserves its own table. J and M are the two Mac OS X 10.9 laptops used, machine J being a MacBook Pro 2011 and machine M a MacBook Pro Retina 2014. The numbers represent the runs: M1 and M2 were automatic scans of the list, and the other scans were started manually against each of the six IPs.
The J machine was upgraded from 6.40 to 6.45 between the two runs in an attempt to see whether differing versions might produce different results. The M machine ran 6.40 in all runs. These results show that it does matter which test machine is used or whether the test is run automatically or manually.
We failed. At the Security Testing Team, we put too much trust in a single tool that appeared to work well. This is also a failure on the part of the security industry: we still do not know with certainty which machines are vulnerable. One thing this exercise does show clearly is that the differences between tools should be taken into account when designing an automated test process. Trusting one tool is obviously not a good idea, but we should not have to resort to using ten different tools either.
What did we learn?
We managed to reduce the list of 200 to just a few machines that might be vulnerable. Having thrown multiple testing applications at all the servers, we are fairly confident that the machines which all the testing applications agreed were safe are actually safe. The remaining servers must undergo different tests to establish whether they are really vulnerable, so that we can act accordingly.
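That "safe only on unanimous agreement" rule can be sketched as follows — a hypothetical helper using the same 0/1 scores as the matrix, with None for tools that could not test a host (as happened with host C):

```python
# Consensus rule: a host is treated as safe only when every tool that
# produced a result reported it as not vulnerable; anything else, including
# hosts no tool could test, stays on the follow-up list.
def classify(tool_results):
    """Map {host: {tool: 0|1|None}} to a 'safe' or 'retest' verdict per host."""
    verdicts = {}
    for host, row in tool_results.items():
        scores = [v for v in row.values() if v is not None]
        verdicts[host] = "safe" if scores and all(s == 0 for s in scores) else "retest"
    return verdicts
```

The rule is deliberately pessimistic: a single positive, a single disagreement, or a host with no results at all is enough to keep it on the list for deeper testing.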
This is obviously not the end of the story. Which servers had been vulnerable, but were fixed prior to these tests? What extra care is needed after fixing the versions?