Language-Independent Fuzz Testing with Probabilistic, Generative Models
Fuzzing is a popular technique to create inputs for testing software that processes structured data. It has been successfully applied in various domains, ranging from compilers and interpreters over program analyses to rendering engines, image manipulation tools, and word processors. Existing fuzzing techniques create data in a particular target language in two ways. First, grammar-based approaches build on top of a formal grammar that describes the input format and create data that complies with this grammar. Second, whitebox approaches analyze the program under test to generate input that triggers particular paths. An important limitation of grammar-based approaches is that a grammar alone is insufficient to create input data that is realistic and effective for testing. Reasons include that simply expanding recursive grammar rules can lead to infinitely large data and that treating all possible grammar expansions as equally likely often creates unrealistic data. Whitebox approaches assume that fuzz testing targets one particular program under test and that this program is available at input generation time. These conditions are not always given, e.g., when using fuzzing to create inputs for differential testing across multiple supposedly equivalent programs or when fuzz testing remote web applications. Moreover, whitebox techniques often suffer from scalability issues.
We present Genesis, a generic approach for generating structured data without an a priori known model. The key idea is to exploit a given corpus of example data to automatically infer a probabilistic, generative model that creates new data with properties similar to the corpus. The model is generative in the sense that it allows for generating new example data that is compatible with the model. It is probabilistic in the sense that the inferred properties are guarded by probabilities, allowing Genesis to generate different data by taking different random decisions. As a language-independent representation of data, Genesis uses trees with labeled nodes and edges. Such trees can represent a wide range of structured data, e.g, the abstract syntax trees of programs in arbitrary programming languages. The only two requirements to add support for a new data format to Genesis are 1) a way to transform data into trees and vice versa, and 2) a set of example data to learn from. To support a wide range of different properties, Genesis is designed as a framework with an extensible set of techniques to infer a generative model.