A new standard? CISPA researchers test utility of web archives for live analyses of web security

Studies on website security have an important place in the research field of information security. To this day, the research standard is live analysis. This means that website security parameters are measured at the moment when the researchers access a website. The problem is that this always represents only a snapshot: What is "live" one moment may be out of date a moment later. "The web is so random that it is extremely complex to reproduce experiments," says CISPA researcher Florian Hantke. That's why it's almost impossible to repeat experiments under the same conditions in live analyses.

For Hantke, this poses a fundamental problem: "Experiments should always be reproducible, because otherwise an experiment loses relevance. Otherwise, anyone could simply claim that the Internet is safe." According to Hantke, one alternative that could, in theory, satisfy the criterion of reproducibility is the use of web archives. At regular intervals, web archives store copies of existing websites, so-called "snapshots," on external servers. There, they can be retrieved by their date and time stamp. Unlike live websites, the stored copies are no longer subject to change. The best-known web archive is the Internet Archive. In research, web archives have so far been used mainly for historical analyses, not for live analyses. Hantke explains this by saying that "many people think that archives do not contain all the important data."
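To illustrate how a snapshot can be addressed by date and time, here is a minimal sketch in Python. It assumes the URL scheme of the Internet Archive's Wayback Machine, in which a 14-digit timestamp (YYYYMMDDhhmmss) precedes the original address. The helper name wayback_url is hypothetical and is not taken from the researchers' work.

```python
from datetime import datetime

# Assumption: base address of the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(url: str, when: datetime) -> str:
    """Build the address of an archived copy of `url` for a given moment.

    The timestamp is formatted as YYYYMMDDhhmmss. If no capture exists at
    exactly that moment, the archive typically serves the closest one.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{url}"


if __name__ == "__main__":
    print(wayback_url("https://example.com/", datetime(2022, 7, 1, 12, 0, 0)))
```

Because the timestamp is part of the address, two researchers who request the same address at different times should be looking at the same stored copy, which is the property a live measurement cannot offer.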

Internet Archive superior to other web archives

Hantke and his colleagues wanted to know how well web archives were suited for live analyses of website security mechanisms. To do this, they had to find out which of the existing web archives stored the most accurate copies. They examined a set of public web archives in terms of the volume and quality of the deposited data for the 5,000 most important websites in the period from January 2016 to July 2022. In a comparison of these web archives, the Internet Archive (IA) showed the best results. The quality of the archive is so good that, under certain circumstances, Hantke and his co-authors even recommend working with IA as the sole source.

The researchers verified the data quality of IA in a case study of two mechanisms that are standard on many websites: security headers and JavaScript inclusions. They were also able to show that IA stores copies of websites with such regularity that even more detailed analyses are possible, the quality of which is equal to that of live analyses. In addition, IA allows for the analysis of multiple snapshots of a website at the same time, which Hantke calls "neighborhooding." This allows any short-term outliers in the data, such as a website's server problems, to be smoothed out. Working with publicly available web archives in this way makes studies easier to reproduce. In the long term, this can increase the quality of research and make it easier to check security mechanisms of websites.
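The idea behind neighborhooding can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' implementation: snapshots are assumed to be available as pairs of capture time and a dictionary of lower-cased response headers, the seven-day window is an arbitrary choice, and a simple majority vote stands in for whatever smoothing the paper actually applies. The function name neighborhood_value is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Assumed data shape: (capture time, response headers with lower-cased names).
Snapshot = Tuple[datetime, Dict[str, str]]


def neighborhood_value(
    snapshots: List[Snapshot],
    target: datetime,
    header: str,
    window: timedelta = timedelta(days=7),  # arbitrary illustrative window
) -> Optional[str]:
    """Return the most common value of `header` among snapshots near `target`.

    A snapshot that lacks the header contributes None, so a single broken
    capture (for example a server error page) is outvoted by its neighbors.
    Returns None if no snapshot falls inside the window.
    """
    values = [
        headers.get(header.lower())
        for taken_at, headers in snapshots
        if abs(taken_at - target) <= window
    ]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]


if __name__ == "__main__":
    hsts = "max-age=31536000"
    snaps = [
        (datetime(2022, 6, 28), {"strict-transport-security": hsts}),
        (datetime(2022, 7, 1), {}),  # outlier: capture without the header
        (datetime(2022, 7, 3), {"strict-transport-security": hsts}),
        (datetime(2021, 1, 1), {"strict-transport-security": "max-age=0"}),
    ]
    print(neighborhood_value(snaps, datetime(2022, 7, 1), "Strict-Transport-Security"))
```

In this toy example the capture from July 1 lacks the header, but the two neighboring captures agree, so the outlier does not change the result. The capture from 2021 lies outside the window and is ignored.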

The challenges of using web archives

Nevertheless, there are also some things to consider when using web archives for live analyses. "One major disadvantage is the slow speed," explains Hantke. For example, processing large amounts of data is much faster in a classic live analysis because access to data stored in web archives is very slow. However, this could be solved by establishing collaborations with the archives favored by the researchers in order to get better access to the data. "The different vantage points also need to be considered," Hantke continues. These are the access points from which websites are accessed around the world. These access locations determine exactly what the copy of a website stored in the archive looks like. "For security issues, the differences tend to be negligible, but for analyses of the implementation of the GDPR, for example, the access location is important," he explains. This is because specific features relevant to the General Data Protection Regulation (GDPR) are often only displayed on European websites. So a copy stored in the U.S. would not be of help here. This is why, for each new research question, it has to be ascertained whether working with web archives is an option.

Productive PhD research

Florian Hantke is a PhD student and has been working at CISPA for a year now. He lives in Erlangen with his wife, so he works from home a lot. Asked whether he needs any special research equipment at home, he explains that a secure VPN connection to the CISPA server in Saarbrücken is completely sufficient. "I can simply send an instruction to the server and run the analyses there," Hantke says. He can then retrieve the results at a later point. The paper on web archives is already his second publication. "I'm quite happy with my output," he admits with a laugh. For the summer, he is already planning another paper. But before that, he hopes there will be more interest in his findings on using web archives for security analyses. In any case, the Internet Archive's management has already signaled interest. Together with his co-authors from Ca' Foscari University in Venice, he is also planning a publicly accessible project for web security analyses, which other researchers will also be able to use.
