YARA is the industry standard for searching for patterns in malware data sets. Malware analysts rely heavily on YARA rules to identify specific threats, e.g., by scanning unknown malware samples for patterns that are characteristic of a certain malware strain. While YARA is tremendously useful for inspecting individual files, its run time grows linearly with the number of input files, which makes scans of large malware corpora prohibitively slow. We present YarIx, a methodology to efficiently find the files that match arbitrary YARA rules. To scale to large malware corpora, YarIx uses an inverted n-gram index that maps fixed-length byte sequences to lists of files in which they appear. To query such corpora efficiently, YarIx transforms YARA rules into index lookups that yield a set of candidate files that potentially match the rule. This candidate set is a completeness-preserving approximation: it contains every true match but may include false positives, so it is then scanned with YARA to obtain the actual set of matching files. Given the storage demands that arise when indexing binary files, YarIx reduces the disk footprint by using variable-byte delta encoding, by abstracting from file offsets, and by leveraging a novel grouping-based compression methodology. Using 32M malware samples and 1404 YARA rules, we show that YarIx scales in both disk footprint and search performance. The index requires just 74% of the space required for storing the malware samples. Querying YarIx with a YARA rule in our test setup is five orders of magnitude faster than using standard sequential YARA scans.
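The index-then-verify idea can be illustrated with a minimal sketch. It is not the YarIx implementation: the n-gram length of 4, the in-memory dictionary, the function names, and the LEB128-style byte layout are all assumptions made for illustration, and a plain substring test stands in for the final YARA scan. The grouping-based compression is not shown.

```python
from collections import defaultdict

N = 4  # n-gram length in bytes; an assumption for this sketch


def ngrams(data: bytes, n: int = N) -> set[bytes]:
    """Return the set of all n-byte substrings of data (offsets are discarded)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def build_index(files: dict[int, bytes]) -> dict[bytes, list[int]]:
    """Map each n-gram to the sorted list of file IDs that contain it."""
    index = defaultdict(list)
    for file_id in sorted(files):
        for gram in ngrams(files[file_id]):
            index[gram].append(file_id)
    return dict(index)


def candidates(index: dict[bytes, list[int]], pattern: bytes) -> set[int]:
    """File IDs containing every n-gram of pattern: a superset of the true matches."""
    grams = ngrams(pattern)
    if not grams:
        raise ValueError("pattern is shorter than the n-gram length")
    return set.intersection(*(set(index.get(g, ())) for g in grams))


def encode_postings(ids: list[int]) -> bytes:
    """Variable-byte encode the gaps between sorted file IDs (7 payload bits per byte)."""
    out = bytearray()
    prev = 0
    for file_id in ids:
        delta = file_id - prev
        prev = file_id
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)  # high bit set: more bytes follow
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_postings(blob: bytes) -> list[int]:
    """Invert encode_postings: rebuild the sorted file IDs from the encoded gaps."""
    ids, value, shift, prev = [], 0, 0, 0
    for byte in blob:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            ids.append(prev)
            value, shift = 0, 0
    return ids


# Hypothetical toy corpus: file 2 contains both 4-grams of the pattern,
# but not the pattern itself, so it is a false positive of the index lookup.
files = {1: b"xxabcdexx", 2: b"abcd--bcde", 3: b"nothing here"}
index = build_index(files)
pattern = b"abcde"
cands = candidates(index, pattern)
# Verification step; in the setting described above this would be a YARA scan.
matches = {fid for fid in cands if pattern in files[fid]}
```

In this toy corpus the lookup returns files 1 and 2, and the verification step keeps only file 1. Because offsets are dropped, the index cannot tell adjacent n-grams from scattered ones, which is why the candidate set can contain false positives but never misses a true match. Storing gaps instead of absolute IDs keeps most encoded values small, so dense posting lists take roughly one byte per entry.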
USENIX Security Symposium