Ian H. Witten and Timothy C. Bell compiled the so-called "Calgary Text Compression Corpus" and first published it in 1989. The large version consists of 18 files representing 9 different data types.

All text files are in English and are encoded in the ASCII character set. Despite its name, the "Calgary Text Compression Corpus" also contains machine code as well as scientific and graphic data (about 27%).

File    Size (bytes)  Contents
bib          111,261  structured text (bibliography), structure well suited to import into a database
book1        768,771  text, novel
book2        610,856  formatted text, scientific
geo          102,400  geophysical data
news         377,109  formatted text, script with news
obj1          21,504  program code (object file), executable machine code
obj2         246,814  program code (object file), executable machine code
paper1        53,161  formatted text, scientific
paper2        82,199  formatted text, scientific
paper3        46,526  formatted text, scientific
paper4        13,286  formatted text, scientific
paper5        11,954  formatted text, scientific
paper6        38,105  formatted text, scientific
pic          513,216  image data (black and white)
progc         39,611  source code
progl         71,646  source code
progp         49,379  source code
trans         93,695  transcript of terminal data
           3,251,493  Sum
           3,265,024  TAR
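
The sum and the share of roughly 27% quoted above can be checked directly from the file sizes in the table. The following is a minimal sketch; the choice of geo, obj1, obj2, and pic as the "machine code, scientific, and graphic data" is an assumption made here, since the text does not name the files it counts.

```python
# File sizes in bytes, copied from the table above.
SIZES = {
    "bib": 111261, "book1": 768771, "book2": 610856, "geo": 102400,
    "news": 377109, "obj1": 21504, "obj2": 246814, "paper1": 53161,
    "paper2": 82199, "paper3": 46526, "paper4": 13286, "paper5": 11954,
    "paper6": 38105, "pic": 513216, "progc": 39611, "progl": 71646,
    "progp": 49379, "trans": 93695,
}

# Assumption: these four files make up the non-text part of the corpus.
NON_TEXT = ("geo", "obj1", "obj2", "pic")

total = sum(SIZES.values())
non_text = sum(SIZES[name] for name in NON_TEXT)
share = 100 * non_text / total

print(f"{len(SIZES)} files, {total} bytes in total")
print(f"non-text data: {non_text} bytes ({share:.1f}%)")
```

Under that assumption the result agrees with both the sum given in the table and the stated share of about 27%.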

The Calgary Corpus has since come to be treated as a quasi-standard for comparing lossless compression methods and formats. It is named after the University of Calgary, where one of the authors, Ian Witten, was employed at the time.
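
A comparison of this kind might look like the sketch below, which runs three compressors from the Python standard library over the same input and reports compressed bits per input byte. The function name `bits_per_byte` and the choice of codecs are illustrative assumptions, not part of the corpus or of any published benchmark, and the sample input is a stand-in so that the code runs without the corpus files.

```python
import bz2
import lzma
import zlib


def bits_per_byte(data: bytes) -> dict[str, float]:
    """Return compressed bits per input byte for three stdlib codecs."""
    if not data:
        raise ValueError("data must not be empty")
    codecs = {
        "zlib": lambda d: zlib.compress(d, 9),
        "bz2": lambda d: bz2.compress(d, 9),
        "lzma": lzma.compress,
    }
    return {name: 8 * len(fn(data)) / len(data) for name, fn in codecs.items()}


# Stand-in input (hypothetical); in practice one would pass the bytes of a
# corpus file, read with open(path, "rb").read().
sample = b"the quick brown fox jumps over the lazy dog\n" * 500
results = bits_per_byte(sample)
for name, value in results.items():
    print(f"{name}: {value:.3f} bits per byte")
```

Lower values mean better compression. Because every method is measured on the same fixed files, results from different authors can be compared directly.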

