InputLab: Data protection-compliant test data for error-free software
Hello Dominic. Can you briefly explain what InputLab does and what problem you want to solve?
Hi, sure. We produce fully synthetic data that can be used to systematically test software - in compliance with data protection regulations. Until now, many manufacturers have faced the problem that they can only test software with production data, i.e. data generated during live operation of the software, such as address data. Because this data contains personal information, the GDPR prohibits manufacturers from using it in its original form. At the same time, they need data that is reasonably similar to production data in order to test their systems in a meaningful way. As a result, many companies take the production data and alter it. Unfortunately, this often means that not all possible errors can be triggered, because special characters, for example, do not appear or appear too rarely. It also frequently happens that errors occurring with real inputs in live operation cannot be reproduced in tests with the altered data, so the problem cannot be fixed. This is why we have developed a demonstrator that generates fully synthetic test data precisely tailored to the software under test. Anyone who tests their system with our data can no longer be caught off guard.
How do you produce this data? With the help of artificial intelligence?
No, because that approach has a problem of its own: an AI system that is supposed to produce truly meaningful data must first be fed with real data. Companies that handle sensitive data, such as banks, insurance companies or public authorities, cannot do that under any circumstances. Our approach is unique and patent pending: we work with the descriptions of the data formats that companies use and usually have documented anyway. These can be schema-based formats such as XML, JSON or database tables. We convert these descriptions into a grammar, i.e. we describe them formally, and can then generate really meaningful input data from these rules using the ISLa specification language we developed. We also had to invest a lot of research and development work to be able to handle complex data formats from practice.
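The core idea of generating data from a formal grammar can be illustrated with a minimal sketch in plain Python. This is a toy, not InputLab's actual ISLa-based pipeline: the grammar, the field names and the character sets are invented for illustration only.

```python
import random

# Toy context-free grammar for a tiny address record, as it might be
# derived from a JSON schema. Nonterminals are written as <...>;
# everything else is literal text. (Illustrative assumption, not the
# real ISLa grammar format.)
GRAMMAR = {
    "<record>": ['{"name": "<name>", "zip": "<zip>"}'],
    "<name>": ["<letter><name>", "<letter>"],
    "<letter>": list("abcdefäöüß"),  # special characters included on purpose
    "<zip>": ["<digit><digit><digit><digit><digit>"],
    "<digit>": list("0123456789"),
}

def generate(symbol: str = "<record>") -> str:
    """Recursively expand a nonterminal into a concrete string."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    expansion = random.choice(GRAMMAR[symbol])
    out = []
    i = 0
    while i < len(expansion):
        if expansion[i] == "<":
            j = expansion.index(">", i) + 1  # find end of nonterminal
            out.append(generate(expansion[i:j]))
            i = j
        else:
            out.append(expansion[i])
            i += 1
    return "".join(out)

print(generate())  # prints one random, structurally valid record
```

Every string produced this way is valid JSON with the right structure by construction, so it passes parsing and reaches the program logic behind it - including records with special characters that altered production data might never contain.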
Aren't “fuzzers” tools that do exactly that? Produce random data to test software?
Yes, that's true. The problem with fuzzers, however, is that they often generate data that makes no sense in terms of content. Such inputs cannot be used to test programs in depth, because they are discarded as garbage in the “first round”.
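The “first round” effect is easy to demonstrate with a hypothetical toy experiment (my own illustration, not an InputLab benchmark): a naive fuzzer emitting random printable characters almost never produces input that even survives a JSON parser, so the deeper program logic is never exercised.

```python
import json
import random
import string

def random_fuzz(n: int = 20) -> str:
    """A naive fuzzer: just n random printable characters."""
    return "".join(random.choice(string.printable) for _ in range(n))

def parses_as_json(data: str) -> bool:
    """The 'first round': inputs that are not valid JSON are discarded."""
    try:
        json.loads(data)
        return True
    except json.JSONDecodeError:
        return False

random.seed(1)
trials = 1000
survivors = sum(parses_as_json(random_fuzz()) for _ in range(trials))
print(f"{survivors} of {trials} random inputs survived the first round")
# Virtually all purely random inputs are rejected by the parser before
# they can exercise any deeper logic; grammar-based inputs, by
# construction, always pass this stage.
```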
Do you have to manually convert the customers' descriptions of the data formats into a grammar?
No, this is already fully automated in our demonstrator. We are currently working with three pilot partners who receive test data from us and use it to test their systems. Two of them use the schema-based data exchange format XML; another works with an OpenAPI interface. They give us feedback along the way, and we work with them to refine the data for their specific purposes. We still have capacity for two or three more pilot partners. Later in the year, we hope to take on paying customers.
How do you know that your approach really works?
On the one hand, from this collaboration with our pilot partners. On the other hand, we recently fed several popular programs and code libraries with our automatically generated test data and found errors in every program we tested. For example, one graphics library tried to reserve 133 terabytes of memory when given a small vector graphics file as input. That can be a potential security risk if such a library runs on a web server, for example. This library has many users and has even been tested relatively well. A program for processing electronic invoices and an e-book reader also showed faulty behavior in our tests. In practice, this can lead to real problems.
Here's a little anecdote: I spoke to a potential pilot partner who works with publishers. He told me about a publishing house that creates and sends electronic invoices. One such invoice went to a library that had bought books from this publisher. The title of a book by an international author contained special characters, so the invoice was not read correctly by the library's system and consequently was not processed. The library only noticed this when the payment reminder arrived. For smaller companies that have to wait a long time for their money as a result, this can be a real threat to their existence. With our test data, providers of invoicing software can find such errors and offer their customers software that works. That's a real competitive advantage.
For whom exactly is your solution interesting?
For all companies that produce software, really - especially those where a failure of the software would cause major financial losses. This could be the case for software used in the automotive industry, for example. Our work could also be of particular interest to the aforementioned banks or insurance companies, which work with sensitive data and need to ensure the quality of their software. It is important that companies have a sufficient testing infrastructure and can actually process the data we supply. Unfortunately, this is often not the case for very small companies.
How did the idea for InputLab come about?
I did postdoctoral research at CISPA in Andreas Zeller's group between 2021 and August 2024. I developed the ISLa system, which is now one of our technical foundations. Andreas had the idea that we could found a company based on technologies like ISLa that sells XML test data. It was only during the process that we realized how big the market demand actually is.
What's next for you?
We are currently still in the development phase, in which we are working out how our idea can be technically implemented and, in the long term, brought to market maturity. We will continue to work as researchers at CISPA until then. The Federal Ministry of Education and Research is already supporting us as part of the StartUpSecure program with funding for personnel, equipment and other costs. The CISPA Incubator is also supporting us with advice, pitch training and the like. The GmbH will be founded in spring, but we won't leave CISPA and stand on our own two feet until fall 2025.
What motivates you to keep driving this innovation forward?
To be honest, first and foremost my team motivates me. It's so great to see that we can achieve so much more together than we can individually. We are now six full-time employees and three research assistants - unfortunately only men so far. This team is really great and the work is really fun. I'm constantly learning new things, which is very rewarding. Of course there are highs and lows, but at the moment the highs clearly outweigh the lows.
That's how it should be. I look forward to hearing more from you. Thank you for the interview, Dominic.