Language-Independent Fuzz Testing with Probabilistic, Generative Models
Fuzzing is a popular technique to create inputs for testing software that processes structured data. It has been successfully applied in various domains, ranging from compilers and interpreters over program analyses to rendering engines, image manipulation tools, and word processors. Existing fuzzing techniques create data in a particular target language in two ways. First, grammar-based approaches build on top of a formal grammar that describes the input format and create data that complies with this grammar. Second, whitebox approaches analyze the program under test to generate input that triggers particular paths. An important limitation of grammar-based approaches is that a grammar alone is insufficient to create input data that is realistic and effective for testing. Reasons include that simply expanding recursive grammar rules can lead to infinitely large data and that treating all possible grammar expansions as equally likely often creates unrealistic data. Whitebox approaches assume that fuzz testing targets one particular program under test and that this program is available at input generation time. These conditions are not always given, e.g., when using fuzzing to create inputs for differential testing across multiple supposedly equivalent programs or when fuzz testing remote web applications. Moreover, whitebox techniques often suffer from scalability issues.
We present Genesis, a generic approach for generating structured data without an a priori known model. The key idea is to exploit a given corpus of example data to automatically infer a probabilistic, generative model that creates new data with properties similar to the corpus. The model is generative in the sense that it allows for generating new example data that is compatible with the model. It is probabilistic in the sense that the inferred properties are guarded by probabilities, allowing Genesis to generate different data by taking different random decisions. As a language-independent representation of data, Genesis uses trees with labeled nodes and edges. Such trees can represent a wide range of structured data, e.g, the abstract syntax trees of programs in arbitrary programming languages. The only two requirements to add support for a new data format to Genesis are 1) a way to transform data into trees and vice versa, and 2) a set of example data to learn from. To support a wide range of different properties, Genesis is designed as a framework with an extensible set of techniques to infer a generative model.
As two examples of target languages that Genesis is useful for, we apply it to JavaScript programs and HTML documents. Our evaluation shows that, even though we do not provide a model of the target language, the approach generates input data that mostly complies with the target language. Specifically, given a corpus of less than 100 HTML documents, the approach creates HTML documents that have only 2.06 validation errors per generated kilobyte of HTML. Given a corpus of 10,000 JavaScript programs, 93.9% of the created programs are syntactically valid and 20.7% of them execute without any runtime errors. Both values are higher than for programs generated by existing fuzz testing approaches that are specifically designed for a particular target language. Moreover, using the Genesis-generated JavaScript programs for differential testing of web browsers, we find various inconsistencies, including browser bugs, unimplemented language features, and browser-specific behavior that developers should be aware of.