SNPsyn binary formats

Data and results are stored in a compact binary format for faster transfer, processing and easier handling of the various files used by SNPsyn.

All binary data is stored in little-endian format. This means that the least significant byte is written first (at the lowest address). SNPsyn handles this by automatically detecting the byte order of the host architecture (little- or big-endian). The following convention is used throughout this document to denote the various types of binary data:

When a field is defined by more than one consecutive value of the same type, an "xNumber" is appended (e.g., [B] x3 indicates three consecutive bytes).
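As a minimal sketch of these two conventions, the Python snippet below uses the standard struct module, where the "<" prefix selects little-endian byte order. The values and variable names are illustrative only and are not taken from any SNPsyn file; treating [B] as an unsigned byte is an assumption.

```python
import struct

# Pack a 32-bit unsigned integer in little-endian order ("<").
# The least significant byte (0x04) is written first.
packed = struct.pack("<I", 0x01020304)
print(packed.hex())  # 04030201

# A "[B] x3" style field: three consecutive bytes
# (read here as unsigned, which is an assumption).
three_bytes = struct.unpack("<3B", b"\x01\x02\x03")
print(three_bytes)  # (1, 2, 3)
```

Spelling out "<" in every format string keeps the result independent of the byte order of the machine that runs the code.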

Header

All SNPsyn binary files begin with a header. The "type" field in the header indicates the kind of data stored in the file.

Structure:

SNP definition record

This structure is used in many places (GWAS data, lists of top single SNPs, lists of top SNP pairs). It defines the name of the SNP (or other marker) and its valid genotype values. It is also used to map genotype index values to their string representations. Genotypes must be indexed (1-based) in the same order as they are listed in the [S] field(s).

Missing values must not be listed in the definition, since SNPsyn uses index 0 to denote a missing genotype and represents it with a "?". For example, using the definition of rs999988 in the example below, a genotype index value of 2 maps to genotype AA, while a genotype index value of 0 is always mapped to "?".
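The index-to-genotype mapping described above can be sketched in Python as follows. The function name and the genotype list are hypothetical: only the fact that index 2 of rs999988 maps to AA comes from this document, and the other two entries are placeholders.

```python
MISSING = "?"  # index 0 always denotes a missing genotype


def genotype_label(genotypes, index):
    """Map a genotype index to its string representation.

    `genotypes` lists the valid genotypes in definition order;
    valid indexes are 1-based, and 0 means "missing".
    """
    if index == 0:
        return MISSING
    if not 1 <= index <= len(genotypes):
        raise ValueError(f"genotype index out of range: {index}")
    return genotypes[index - 1]


# Hypothetical definition: only "AA" at index 2 is given in the text.
rs999988 = ["AG", "AA", "GG"]

print(genotype_label(rs999988, 2))  # AA
print(genotype_label(rs999988, 0))  # ?
```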

Structure:

SNP definition map

This structure is used in many places to map SNP indexes to their definition records.

Structure:

Genotype distribution vector

This structure records the distribution of genotype values of a single SNP. The number of samples with a missing value (index zero) is recorded in the first element, followed by the number of samples with each genotype value. Genotypes are listed in the same order as given in the SNP definition record.
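A minimal Python sketch of how such a vector could be built from per-sample genotype indexes is shown below. The function name and sample data are illustrative assumptions; the snippet only demonstrates the element order (missing count first, then one count per genotype), not the on-disk encoding.

```python
def genotype_distribution_vector(sample_indexes, n_genotypes):
    """Count samples per genotype index; element 0 holds the missing count."""
    counts = [0] * (n_genotypes + 1)
    for idx in sample_indexes:
        if not 0 <= idx <= n_genotypes:
            raise ValueError(f"genotype index out of range: {idx}")
        counts[idx] += 1
    return counts


# Eight hypothetical samples for a SNP with three valid genotypes.
samples = [1, 2, 2, 0, 3, 1, 0, 2]
gdvec = genotype_distribution_vector(samples, 3)
print(gdvec)  # [2, 2, 3, 1]
```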

Structure:

Genotype distribution matrix (contingency table)

This structure records the distribution of genotype values for pairs of SNPs. The number of samples with a missing value (index zero) is recorded in the first element, followed by the distribution of samples with genotype values. Genotypes are listed in the same order as given in the SNP definition record.
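The contingency table for a SNP pair can be sketched in Python as below. This is an illustration under stated assumptions, not the SNPsyn layout: rows are taken to index the genotype of the first SNP and columns the genotype of the second, with row 0 and column 0 reserved for missing values. The function name and sample data are hypothetical.

```python
def genotype_distribution_matrix(pairs, n_a, n_b):
    """Build an (n_a + 1) x (n_b + 1) table of genotype-pair counts.

    Assumed layout: table[a][b] counts samples whose first SNP has
    genotype index a and whose second SNP has genotype index b,
    with index 0 meaning "missing".
    """
    table = [[0] * (n_b + 1) for _ in range(n_a + 1)]
    for a, b in pairs:
        if not (0 <= a <= n_a and 0 <= b <= n_b):
            raise ValueError(f"genotype index out of range: {(a, b)}")
        table[a][b] += 1
    return table


# Six hypothetical samples; each SNP has two valid genotypes.
pairs = [(1, 1), (1, 2), (0, 2), (2, 2), (1, 1), (2, 0)]
gdmat = genotype_distribution_matrix(pairs, 2, 2)
for row in gdmat:
    print(row)
```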

Structure:

SNPsyn's GWAS data format

This is a compact binary file format used by SNPsyn to store and retrieve large GWAS data sets. Thanks to the binary encoding, a large number of SNPs and samples can be stored quickly. Names of files of this type should end with ".syn" (e.g., GWASdata.syn).

The same order of SNPs and samples must be used when writing the different parts of the data file. SNPs are indexed starting with 0. These indexes are reported in the results file and in derived results files.

Structure:

Single SNPs results

This structure stores the results of the analysis of single SNPs. It includes an optional part "[GDvec] x (number of records)" where the distribution of genotype (including missing) values is stored.

Structure:

SNP pairs results

This structure stores the results of the analysis of SNP pairs. It includes an optional part "[GDmat] x (number of records)" where the distribution of combinations of genotype (including missing) values of the SNP pair is stored.

Structure:

Histogram - 1D

Structure:

Histogram - 2D

Structure: