2022-10-25

HTML Violations and Where to Find Them: A Longitudinal Analysis of Specification Violations in HTML

Summary

With the increased interest in the web in the 90s, everyone wanted to have their own website. However, given the lack of knowledge, such pages contained numerous HTML specification violations. This was when browser vendors came up with a new feature – error tolerance. This feature, part of browsers ever since, makes the HTML parsers tolerate and instead fix violations temporarily. On the downside, it risks security issues like Mutation XSS and Dangling Markup. In this paper, we asked ourselves, do we still need to rely on this error tolerance, or can we abandon this security issue? To answer this question, we study the evolution of HTML violations over the past eight years. To this end, we identify security-relevant violations and leverage Common Crawl to check archived pages for these. Using this framework, we automatically analyze over 23K popular domains over time. This analysis reveals that while the number of violations has decreased over the years, more than 68% of all domains still contain at least one HTML violation today. While this number is obviously too high for browser vendors to tighten the parsing process immediately, we show that automatic approaches could quickly correct up to 46% of today’s violations. Based on our findings, we propose a roadmap for how we could tighten this process to improve the quality of HTML markup in the long run.