The task of a reference data set is to demonstrate the compression performance of different data formats and applications as objectively as possible. To that end, it must cover the potential range of uses accurately while remaining independent of any particular implementation.

Most data formats and applications respond differently to the original data. The results depend heavily on the local distribution of redundancy, the distances between redundant parts, recurring symbols, the set of symbols used, and many other characteristics. Any suitable file collection should therefore consist of files whose contents reflect these characteristics.
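As a rough illustration of how strongly such characteristics affect the result, the following sketch compares two inputs that use the same symbols with the same frequencies but differ in how the redundancy is arranged. The function name `profile`, the sample inputs, and the choice of Deflate (via Python's `zlib`) as the compressor are assumptions made for this example only; they are not part of any corpus definition.

```python
import math
import random
import zlib
from collections import Counter


def profile(data: bytes) -> dict:
    """Return a few simple statistics that influence compressibility."""
    counts = Counter(data)
    total = len(data)
    # Zero-order entropy in bits per symbol; it ignores symbol positions.
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values())
    return {
        "size": total,
        "distinct_symbols": len(counts),
        "entropy_bits_per_symbol": entropy,
        # Deflate at maximum level, used here only as an example compressor.
        "deflate_size": len(zlib.compress(data, 9)),
    }


# Hypothetical sample inputs: identical symbol counts, different structure.
repetitive = b"abcd" * 1024          # short pattern repeated at a fixed distance
scrambled = bytearray(repetitive)
random.Random(0).shuffle(scrambled)  # same bytes, positions randomized
scrambled = bytes(scrambled)

stats_repetitive = profile(repetitive)
stats_scrambled = profile(scrambled)

for name, stats in (("repetitive", stats_repetitive), ("scrambled", stats_scrambled)):
    print(name, stats)
```

Both inputs contain 4 distinct symbols at 2 bits of zero-order entropy per symbol, yet the scrambled variant should come out clearly larger after Deflate, because the compressor can no longer exploit repeated sequences. A single file type therefore says little about general compression performance.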

The most common file collections are:

Calgary Corpus

Canterbury Corpus

