
2024-11-29
Felix Koltermann

Study of web crawlers reveals shortcomings

CISPA researcher Aleksei Stafeev presents the first study to systematize knowledge about tools for the automated analysis of websites, so-called web crawlers, in the field of web security measurement. He examined hundreds of papers published at the most important international conferences over the last twelve years. The results show that many papers describe their crawlers inadequately and that randomized algorithms perform best at navigating websites. The full results are published in the paper “SoK: State of the Krawlers - Evaluating the Effectiveness of Crawling Algorithms for Web Security Measurements”, which Stafeev presented at the USENIX Security Symposium 2024 in August. The paper was written as part of the TESTABLE project led by CISPA faculty Dr. Giancarlo Pellegrino.

Studies that measure the web, for example with regard to the implementation of data protection measures or the security of websites, are very popular in security research, and crawlers are the tool of choice for carrying them out. “Crawlers aim to automate data collection on a website,” explains CISPA researcher Aleksei Stafeev. They are based on an algorithm that controls how the crawler automatically scans a website, visits its various pages, and collects data from them. “But web crawling is not as simple as it sounds,” Stafeev continues. “In theory, these tools simply visit websites. But in reality, the internet is very complex: there are a lot of different buttons on every website, and each of them may or may not lead to a different page. You have an exponential growth of different pages, and you have to figure out which ones you actually need to visit to get the data relevant to your research question.” Despite the great importance of web crawlers, their performance has so far only been studied to a very limited extent. Stafeev is now closing this gap with his study.
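
As a rough illustration of the navigation problem Stafeev describes, the following minimal Python sketch crawls a site breadth-first, following every link it finds on the same host. It is not the tooling from the study; the libraries (requests, BeautifulSoup), the cap of 50 pages, and the same-host restriction are assumptions chosen purely for illustration.

# Minimal breadth-first crawler sketch (illustrative only, not the study's tooling).
# Assumes the 'requests' and 'beautifulsoup4' packages are installed.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_pages=50):
    """Visit pages reachable from start_url, staying on the same host."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        # Every link is a potential new page: the frontier grows quickly.
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == host and target not in visited:
                queue.append(target)
    return visited

Even on a small site, the frontier of unvisited links tends to grow much faster than the set of visited pages, which is exactly the blow-up Stafeev points to.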

The CISPA researcher took a two-step approach. “First, we conducted an overview of the current work on web measurements that use crawlers,” explains Stafeev. The result was a data corpus of 407 papers published between 2010 and 2022. “We tried to extract information about which crawlers are used and how, in order to get a general picture of what is used in web measurements,” says the CISPA researcher. For the second part, Stafeev examined papers from the last three years that propose new crawlers. “We evaluated the crawlers in terms of what data they collect for the purpose of web security measurement,” Stafeev continues. To examine the crawlers in terms of code coverage, source coverage, and JavaScript collection, Stafeev developed an experimental setup called Arachnarium.

Insufficient descriptions and the randomization paradox

One of the key findings of the first part of the study was that most papers described their web crawlers inadequately. “It was really difficult to extract and understand the information about what technology they use to crawl and what techniques they use. And there were usually not enough details about the code and algorithms used. Often it was just 'we use crawling' and that was it. One of the key learnings was that we can do better as a community by providing more information about the crawlers we use and how they are configured.” This is particularly important for guaranteeing the reproducibility of studies, a key criterion of scientific quality.

The second part of the study also produced an astonishing result. “According to our data, web crawlers that use randomized algorithms seem to perform best,” explains Stafeev. “This is actually quite surprising, as it means that no matter what navigation strategies we've developed, we still haven't found a better solution than just clicking on things at random.” The CISPA researcher evaluated the crawlers against all three metrics and found that no single crawler came out on top across them. “So we can't give a one-size-fits-all recommendation that says: 'Everyone should use this crawler',” the CISPA researcher continues. Which crawler is suitable therefore depends crucially on the context and the exact objective.
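
To make the contrast concrete, the following hedged Python sketch shows how little separates a randomized navigation strategy from a deterministic one: both operate on the same list of not-yet-visited URLs, and only the choice of which URL to dequeue next differs. Neither function is taken from the crawlers evaluated in the paper; the names and the list-based frontier are assumptions for illustration.

import random

def pick_next_random(frontier):
    """Randomized strategy: choose the next page to visit uniformly at random."""
    index = random.randrange(len(frontier))
    # Swap-and-pop keeps removal O(1); frontier is a plain list of URLs.
    frontier[index], frontier[-1] = frontier[-1], frontier[index]
    return frontier.pop()

def pick_next_bfs(frontier):
    """Deterministic strategy: always visit the oldest discovered page first."""
    return frontier.pop(0)

In a measurement run, swapping one picker for the other is the entire change in navigation strategy, which is what makes the strong showing of the random variant in Stafeev's data so striking.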

Takeaways and the future of the research data

To carry out the study, Stafeev created a huge data set. “We believe that we can learn a lot more from it,” he says. “And it would be really nice if others could gain more insights from the data we have collected.” For this reason, Stafeev has made the complete data set freely accessible online. In the future, he wants to return to his real passion: developing new crawlers. Stafeev had not originally planned to carry out such a large study; he only wanted to improve his own crawler and see how others had dealt with the problem. “Systematizing knowledge, as this study does, is quite an undertaking,” he says. “But I learned a lot from this project about how to carry out such experiments and work with such large data sets. I will capitalize on this knowledge in my future work,” concludes the CISPA researcher.