Documentation
SNPsyn is an interactive software tool for effective exploration and discovery of synergistic pairs of SNPs. The Flash-based application consists of components for GWAS data preparation, visual exploration of results and details on SNP pairs and single SNPs, Gene Ontology term enrichment analysis, and display of gene interaction networks. The components are connected such that a selection change in one component is propagated to subsequent components and their visualizations. A typical workflow consists of steps 1-4 detailed below.
- Distributions of scores: selection of pairs of SNPs based on the distribution of various measures (Synergy, Information Gain, ratio of Synergy to Information Gain, False Discovery Rate). SNP pairs within the selected region on the graph are sent to the window on the right for finer selection of SNP pairs.
- Selection of SNP pairs: finer selection of a subset of SNPs for further analysis, and detailed information on the selected pairs of SNPs. SNP pairs selected from this list can be sent to the Gene Ontology and Interaction network components for further analysis.
- Gene Ontology: results of term-enrichment analysis for genes annotated to SNPs in selected pairs. Analyses on the three Gene Ontology aspects can be performed. The user can select the p-value and fold-enrichment thresholds to use for the analysis.
- Interaction network: displays a network formed by the selected SNP pairs. The user can interact with the network (change the layout criteria, the subset of pairs to display, and the type of score used to determine the weights of the network).
- Menu bars: commands for data upload, selection of analysis results to explore, creating a report, bookmarks, visualization of single SNP results, and settings.
Distributions of scores and initial selection
In the analysis of typical GWAS data, many billions of SNP pairs may be evaluated. This component provides an overview of all results and allows an initial selection from a much smaller subset (the 10000 best pairs of SNPs) for further exploration. The selection is made by comparing the distribution on true data (blue dots) with the distribution obtained by random permutation analysis (red dots).
- Information gain (I) vs. Synergy plot, and marginal distributions of Synergy and Information gain scores for all possible pairs of SNPs. Information gain (I, x-axis) is a measure of the total information that a pair of SNPs carries on the disease. Synergy (Syn, y-axis, also called Interaction gain) measures the amount of non-additive (extra) information obtained by jointly observing the two SNPs. Positive Synergy is an indication of true interaction, while negative Synergy is an indication of correlation between the two SNPs forming a pair.
- Use various criteria to select the most promising SNP pairs from the graph. If you aim for highly synergistic pairs, the Synergy to Information gain ratio should be at least 0.5. To reduce the number of false-positive pairs, additionally filter by False Discovery Rate (FDR) and set minimum acceptable values for Information gain and Synergy so that most of the SNP pairs falling in non-significant (red) regions are not selected (as shown).
- SNP pairs in the selected region are automatically sent to the Selection component on the right.
- Change the intensity of the random background distribution of scores (reference) and the scale of the Information Gain vs. Synergy plot.
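The two scores described above can be sketched in a few lines of Python. This is a minimal illustration, not SNPsyn's C++ implementation: it assumes plain relative frequencies for probability estimates, the standard interaction-gain definition Syn = I(SNP1, SNP2; class) - I(SNP1; class) - I(SNP2; class), and invented toy genotypes coded as 0/1. The function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(attribute, phenotype):
    """Mutual information I(A; C) = H(A) + H(C) - H(A, C)."""
    joint = list(zip(attribute, phenotype))
    return entropy(attribute) + entropy(phenotype) - entropy(joint)


def synergy(snp1, snp2, phenotype):
    """Interaction gain: I(SNP1, SNP2; C) - I(SNP1; C) - I(SNP2; C)."""
    pair = list(zip(snp1, snp2))
    return (information_gain(pair, phenotype)
            - information_gain(snp1, phenotype)
            - information_gain(snp2, phenotype))


# Toy XOR-like data (invented): neither SNP alone is informative,
# but the pair determines the phenotype completely.
snp1 = [0, 0, 1, 1, 0, 0, 1, 1]
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]
phenotype = [a ^ b for a, b in zip(snp1, snp2)]

pair_gain = information_gain(list(zip(snp1, snp2)), phenotype)
syn = synergy(snp1, snp2, phenotype)
print(f"I(pair) = {pair_gain:.3f} bits, Syn = {syn:.3f} bits")
print("Syn / I ratio at least 0.5:", syn / pair_gain >= 0.5)
```

In this toy case all of the pair's information is synergistic, so the pair would pass the 0.5 ratio criterion mentioned above.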
Selection of SNP pairs for further analysis
After the initial selection of a subset of all the scored SNP pairs in the distribution component, a finer selection can be performed in this component. This selection is then sent to subsequent components for Gene Ontology enrichment analysis and Interaction network exploration.
- Tabular display of detailed information on each pair: user bookmark (star), the level of Synergy (Syn), Information gain (I), p-value, False Discovery Rate, first and second SNP in pair (SNP1, SNP2, names of genes annotated to SNPs are displayed when available) and their genomic position (pos1, pos2), Information gains of individual SNPs (I1 and I2), number of samples on which these statistics were calculated, and SNP pair interaction model index number (as defined by Li and Reich, Hum Hered 2000).
- Type a search string. Pairs that match this string in SNP id, gene name or genomic position will be displayed.
- Select pairs of interest for further analysis. Double-click on row to get details on a pair of SNPs.
- Get details on the SNP pair in the selected row.
- Send list of selected SNP pairs for further analysis in Gene Ontology enrichment and Interaction network components.
SNP pairs
With this component the user can explore all the details of a SNP pair.
- Information gain (I), Synergy (percentage of I), and interaction model index of the SNP pair.
- Gene names associated with the two SNPs (with links to NCBI's Entrez Gene pages), SNP ids, links to NCBI's dbSNP pages and HapMap's SNP information pages, Information gain of individual SNPs forming the pair.
- Contingency table showing the number of cases and controls for each combination of genotype values of the two SNPs, and the ratio of cases against all samples within a cell. The numbers of samples with missing genotypes (if any) are also displayed in the last column and last row.
- Select how to render the table.
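As an illustration of what such a contingency table contains, the sketch below counts cases and controls per genotype combination and adds the ratio of cases in each cell. The genotype codes, the "case"/"control" labels, and the function name are assumptions made for this example; missing genotypes are represented by None and collected under a "missing" key.

```python
from collections import defaultdict


def contingency_table(snp1, snp2, phenotype, case_label="case"):
    """Count cases and controls for each combination of two genotypes."""
    cells = defaultdict(lambda: {"cases": 0, "controls": 0})
    for g1, g2, label in zip(snp1, snp2, phenotype):
        # Missing genotypes (None) are collected under a "missing" key.
        key = (g1 if g1 is not None else "missing",
               g2 if g2 is not None else "missing")
        cells[key]["cases" if label == case_label else "controls"] += 1
    table = {}
    for key, counts in cells.items():
        total = counts["cases"] + counts["controls"]
        # Ratio of cases against all samples within the cell.
        table[key] = {**counts, "case_ratio": counts["cases"] / total}
    return table


# Invented toy data: six samples, one with a missing genotype for SNP1.
snp1 = ["AA", "AA", "AG", "AG", "GG", None]
snp2 = ["CC", "CC", "CT", "CT", "TT", "CC"]
phenotype = ["case", "control", "case", "case", "control", "case"]

table = contingency_table(snp1, snp2, phenotype)
for (g1, g2), cell in sorted(table.items()):
    print(g1, g2, cell["cases"], cell["controls"], f"{cell['case_ratio']:.2f}")
```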
Single SNPs
A prerequisite for the computation of Synergy is the computation of Information gain (I) scores of individual SNPs. This component shows the distribution of Information Gain scores of single SNPs and allows the user to identify "main-effect" SNPs, i.e., those SNPs that by themselves carry most information on the disease under study.
- Tabular display of detailed information on each single SNP: user bookmark (star), p-value, False Discovery Rate (FDR), SNP id (annotated gene name when available), genomic position (chr), Information Gain (I), number of samples from which I was calculated.
- Distribution of Information gain scores on true (blue) and randomly permuted data (red): Information gain (x-axis) and frequency of SNPs with specific Information gain (y-axis). Move the slider to select a subset of single SNPs to be displayed in the table.
- Change the intensity of the reference (random) distribution and scale of the y-axis.
- Double-click to get details on SNP.
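The idea of comparing true scores with scores on randomly permuted data can be sketched as follows. This is only an illustration under stated assumptions: phenotype labels are shuffled to break any association, Information gain is recomputed for each shuffle, and an empirical p-value is taken as the fraction of permuted scores at least as large as the observed one. How SNPsyn itself derives p-values and FDR is not described here, so the function names and the p-value formula are hypothetical.

```python
import random
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(attribute, phenotype):
    """Mutual information I(A; C) = H(A) + H(C) - H(A, C)."""
    joint = list(zip(attribute, phenotype))
    return entropy(attribute) + entropy(phenotype) - entropy(joint)


def permutation_scores(genotype, phenotype, permutations=200, seed=0):
    """Information gain on data with randomly shuffled phenotype labels."""
    rng = random.Random(seed)
    shuffled = list(phenotype)
    scores = []
    for _ in range(permutations):
        rng.shuffle(shuffled)
        scores.append(information_gain(genotype, shuffled))
    return scores


def empirical_pvalue(observed, null_scores):
    """Fraction of permuted scores at least as large as the observed one."""
    exceed = sum(1 for s in null_scores if s >= observed)
    return (exceed + 1) / (len(null_scores) + 1)


# Invented toy data: cases are enriched among "GG" carriers.
genotype = ["AA"] * 10 + ["AG"] * 10 + ["GG"] * 10
phenotype = (["control"] * 8 + ["case"] * 2
             + ["control"] * 5 + ["case"] * 5
             + ["control"] * 2 + ["case"] * 8)

observed = information_gain(genotype, phenotype)
null_scores = permutation_scores(genotype, phenotype)
p_value = empirical_pvalue(observed, null_scores)
print(f"I = {observed:.3f} bits, empirical p-value = {p_value:.3f}")
```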
In this window the user can explore all the details on a single SNP.
- Gene name associated with the SNP (with link to NCBI's Entrez Gene page), SNP id, link to NCBI's dbSNP page and HapMap's SNP information page, Information gain of SNP.
- Contingency table showing the number of cases and controls for each genotype value of the SNP, and the ratio of cases against all samples for each genotype. The number of samples with missing genotype (if any) is also displayed in the last column.
- Select how to render the table.
Gene ontology (GO)
Genes associated with SNPs selected in Selection are used as a cluster (query) set in this gene ontology term enrichment analysis. All SNPs associated to all annotated genes in GO are used as the reference set. Enriched terms are reported in a gene ontology tree. The user can select which aspect to display. Network nodes of SNPs associated with selected GO terms get highlighted (in green) in the Interaction network component.
- Select the Gene Ontology aspect: biological process, molecular function, cellular component.
- Report only GO terms with p-value lower than the selected threshold.
- Scale of the bar for fold-enrichment display.
- Number of cluster (query) SNPs associated with the term.
- Term enrichment p-value computed using the hypergeometric distribution.
- Term fold enrichment, computed as a ratio between the query and the reference SNP set frequencies. SNP set frequency is defined as the ratio between the number of term-annotated SNPs and the total number of SNPs in the set.
- Number of SNPs from the cluster set annotated with the term vs. number of SNPs from the genome (reference set) with the annotation.
- Click on rows to select the term and related SNPs. This selection is propagated to the Interaction network component, and selected SNPs get highlighted in green.
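The fold enrichment and p-value listed above can be reproduced with a short calculation. The sketch below follows the fold-enrichment definition given in the list and assumes a one-sided (upper-tail) hypergeometric test, which is a common choice but is not confirmed by this documentation. The counts are invented and the function names are hypothetical.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Query frequency (k / n) divided by reference frequency (K / N)."""
    return (k / n) / (K / N)


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k) when n SNPs are drawn from N, of which K carry the term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Invented counts: 5 of 20 query SNPs and 50 of 1000 reference SNPs
# are annotated with the term.
k, n, K, N = 5, 20, 50, 1000
print(f"fold enrichment = {fold_enrichment(k, n, K, N):.2f}")
print(f"p-value = {hypergeom_pvalue(k, n, K, N):.4g}")
```

Exact integer arithmetic with `math.comb` avoids rounding problems for large reference sets, at the cost of speed; a statistics library would be preferable for many terms.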
Interaction network
Nodes in the network are SNPs, edges connect SNPs forming the selected pairs.
- Select the layout criterion: group SNPs on the same chromosome ("group by chromosome"), group SNPs on the same gene ("group by gene"), or use no preference on genomic location ("basic").
- Toggle the display of gene names and SNP id of network nodes. If both are checked, gene name will be shown if defined, otherwise the SNP id is shown.
- Change the numeric display on network edges. Choose from: empty (none), pair's Synergy (Syn), pair's Information gain (I), False Discovery Rate (FDR).
- Move the slider to further reduce the network to a subset of best scored pairs of SNPs. Choose the number of best pairs to display and the criterion (Synergy, Information gain, FDR) used to determine the best pairs.
- Zoom in and out from the network. Click and drag on any empty space between the nodes and edges to move the network and bring interesting SNPs into view.
- Move the mouse over a node to get additional information on the SNP and to see neighboring SNPs on the same chromosome or gene (these get highlighted in red, depending on the selected layout). Interact with the layout by moving nodes (click, hold, and move the node). SNPs selected in the Gene ontology component get highlighted in green here.
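The network structure and the "best pairs" reduction described above can be pictured with a small sketch: SNPs are nodes, each selected pair is an edge, and the slider keeps only the top-scoring pairs for a chosen criterion. The SNP ids and scores below are invented, the function names are hypothetical, and the assumption that lower FDR ranks as better is ours.

```python
def best_pairs(pairs, criterion="Syn", count=2):
    """Keep the `count` best pairs; lower is better for FDR, higher otherwise."""
    higher_is_better = criterion != "FDR"
    ranked = sorted(pairs, key=lambda p: p[criterion], reverse=higher_is_better)
    return ranked[:count]


def build_network(pairs):
    """Adjacency sets: nodes are SNPs, edges connect SNPs forming a pair."""
    network = {}
    for pair in pairs:
        network.setdefault(pair["SNP1"], set()).add(pair["SNP2"])
        network.setdefault(pair["SNP2"], set()).add(pair["SNP1"])
    return network


# Invented example pairs with hypothetical SNP ids and scores.
pairs = [
    {"SNP1": "rsA", "SNP2": "rsB", "Syn": 0.030, "I": 0.050, "FDR": 0.01},
    {"SNP1": "rsB", "SNP2": "rsC", "Syn": 0.020, "I": 0.060, "FDR": 0.20},
    {"SNP1": "rsC", "SNP2": "rsD", "Syn": 0.010, "I": 0.015, "FDR": 0.05},
]

network = build_network(best_pairs(pairs, criterion="Syn", count=2))
for snp, neighbours in sorted(network.items()):
    print(snp, "->", sorted(neighbours))
```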
Bookmarks
This window stores the lists of selected (starred) SNP pairs and single SNPs for quick reference and access to details. SNP pairs can be added to these lists by marking a star in their respective rows in the Selection component. Similarly, single SNPs can be marked in the Single SNPs window.
- List of starred SNP pairs. Double click on row to get details on pair.
- List of starred single SNPs. Double click on row to get details on single SNPs.
- Select rows in the two lists and click "Remove star" to remove the associated pairs and single SNPs from the lists.
Report
Click on the "Report" button (top right on main screen) to generate a report in HTML format. The current state of all the main window components is recorded, including the selection of SNP pairs, starred SNP pairs and single SNPs, the selection in the score distribution graph, gene ontology, interaction network, etc. The content is linked to NCBI's dbSNP and Entrez Gene information pages. See an example report here.
Data preparation and upload
Submitting data for interaction analysis involves several steps: selection of case and control groups of samples, selection of SNPs to analyze, and selection of interaction analysis parameters. All steps are detailed below.
The tab-delimited data must follow the individual-major order: the first row lists SNP ids, each subsequent row lists the genotype data of one individual (sample), and each column corresponds to one SNP. One special column (named "class" in the first row) gives the phenotype values of the samples. More than two phenotype values (beyond case and control) are allowed and can be used to form various groups of samples for analysis.
For examples, please see the tab-delimited files provided in the Examples section, such as the file gse8054.tab.gz from the first example.
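To make the layout concrete, the following sketch reads a small table in the individual-major format described above. The SNP ids, genotype codes, and phenotype labels are invented for illustration (the actual genotype encoding should be taken from the example files), and the function name is hypothetical. A gzipped file could be opened in text mode with Python's `gzip.open(path, "rt")` and passed to the same function.

```python
import csv
import io


def read_snp_table(handle):
    """Read an individual-major, tab-delimited table with a "class" column."""
    reader = csv.DictReader(handle, delimiter="\t")
    snp_ids = [name for name in reader.fieldnames if name != "class"]
    genotypes = {snp: [] for snp in snp_ids}
    phenotype = []
    for row in reader:  # one row per individual (sample)
        phenotype.append(row["class"])
        for snp in snp_ids:
            genotypes[snp].append(row[snp])
    return snp_ids, genotypes, phenotype


# Invented example content: two SNP columns and the "class" column.
example = (
    "rs0001\trs0002\tclass\n"
    "AA\tCT\tcase\n"
    "AG\tCC\tcontrol\n"
    "GG\tCT\tcase\n"
)

snp_ids, genotypes, phenotype = read_snp_table(io.StringIO(example))
print(snp_ids)
print(genotypes["rs0001"], phenotype)
```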
In more hypothesis driven research and due to high computational and processing time demands associated with exhaustive search, the user can decide not to explore all possible combinations of pairs and instead focus on a subset of SNPs.
When done with group and SNP set selection click on "3. Prepare." The data will be formatted into SNPsyn's binary format and a confirmation window will appear (shown below).
- combine main effect SNPs: first, individual SNPs are scored for Information Gain, all possible pairs among 22000 best ranked SNPs are checked for synergy
- approximate screening of all pairs: first, a fast-to-calculate approximate upper bound on Synergy is used to score all pairs, then 242M best-ranked pairs are checked for exact synergy
Heuristic search runtime is restricted to five hours (242M pairs). This restriction can be lifted in a local installation of SNPsyn.
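The two limits quoted above are consistent with each other: the number of unordered pairs among 22000 SNPs is just under 242 million. A quick check, with the one-million-SNP figure included only to show why exhaustive search over a full GWAS is so much more expensive:

```python
from math import comb

# Unordered pairs among the 22000 best-ranked SNPs.
pairs_main_effects = comb(22000, 2)
print(pairs_main_effects)  # 241989000, i.e. about 242M

# For comparison: all pairs among one million SNPs.
pairs_full_gwas = comb(1_000_000, 2)
print(pairs_full_gwas)  # 499999500000, i.e. about 5 * 10**11
```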
Analysis status and selection
Click on the "Open" button (menu on top-left in main window) to get a list of analyses submitted to the server and to start exploring the results.
Menu bars

Manuals and data formats
Implementation details
Synergy among SNPs is estimated by an information-theoretic approach called interaction analysis [Jakulin and Bratko 2003, Anastassiou 2007]. Discovering synergy is an inherently combinatorial problem and it requires an exhaustive exploration of all combinations of SNPs. Current genome-wide case-control association studies (GWAS) include over one million SNPs, which in practice limits SNPsyn's exhaustive enumeration and discovery of SNP synergy to investigation of pairs of SNPs.
The application is optimized for speed and designed for an interactive visual exploration of results. The computationally intensive part is implemented in C++ and can run in parallel on a dedicated cluster or grid. The graphical user interface was developed in Adobe Flex and it can run in standard web browsers that support Flash, or as a stand-alone application using Adobe's AIR run-time environment.
The web-server is needed to host the Flash-based application, receive data files, serve results files, and to invoke the command-line SNPsyn program that performs the actual data analysis. The analysis program may run on a dedicated cluster or grid.
SNPsyn command-line
Running the command 'SNPsyn man' will display the complete man page.
Scripts that invoke SNPsyn's command-line commands can be found here.
Binary data formats
SNPsyn uses a number of compact binary data structures for efficient storage of GWAS data and results of analysis. Find a detailed description of the data formats here.
Installation
Because of limited computational resources, the freely available web version of SNPsyn is limited to the analysis of at most 22000 SNPs or 242M heuristically scored pairs (depending on the chosen heuristic). If you want to handle larger datasets you must use SNPsyn's AIR version. This version provides you with instructions on how to run the analysis on your own computational cluster or grid.
Another possibility is to install your own local server version of SNPsyn.
SNPsyn AIR installation
Download the SNPsyn AIR package and follow these steps:
- Click on the icon to start the installation.
- Click the button "Install." Because we are using a self-signed certificate, please ignore the warning.
- Here you may change the destination folder and other options, then click "Continue."
- Download SNPsyn command-line tool and add executables folder to your system path. SNPsyn AIR creates a data processing script (process.sh/process.bat) which calls the command-line tool.
After the installation is finished and you run SNPsyn AIR, you need to download additional data for mapping human/mouse SNP markers to genes/chromosomes:
- Enter the SNPsyn web server address (http://snpsyn.biolab.si or the address of your SNPsyn web server if you installed one). The AIR application connects with the server to download prepared SNP mappings and perform GO enrichment analysis.
- Select the desired SNP mapping (human/mouse) depending on the data you use.
- Select the storage folder for the SNP mapping data.
- Download data for SNP mapping from SNPsyn web server into selected folder (this step needs to be taken only once).
- Check if GO enrichment analysis is working on SNPsyn web server.
Using SNPsyn AIR application
The SNPsyn AIR version of the interface is intended for local use. All the required SNP mappings are downloaded to the local computer. Only GO enrichment analysis is performed on the SNPsyn server, requiring minimal data transfer.
A typical workflow involves 3 steps:
- Import data (from local file) and select SNPs with the help of GO enrichment analysis (same way as in web interface). Then store the prepared .syn file with the dialog shown below.
- Select location of new .syn file (output of preparation step and input to analysis script).
- Select your operating system in order for the process script to be generated correctly.
- Select frequency measure (relative or laplace).
- Select heuristic (off, main effects, screen).
- Select number of chunks.
- Select number of random permutations.
- Select ID of GPU to use for computation (first GPU has index 1). Uncheck if no GPU available.
When processing is finished, the .syn file is saved in the selected folder and a "process" script is created.
- Start data analysis by running the "process" script. Make sure the "SNPsyn" command-line utility is in the system's path.
- After the process script finishes, return to the interface and choose "Open" (simply choose the folder with the results).
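Before running the "process" script, it can help to confirm that the command-line tool is actually reachable through the system path. The sketch below uses Python's standard `shutil.which` for this; the executable name "SNPsyn" is taken from the command shown in this documentation, and the exact file name on your platform may differ.

```python
import shutil


def find_tool(name="SNPsyn"):
    """Return the full path of an executable on the system path, or None."""
    return shutil.which(name)


location = find_tool()
if location is None:
    print("SNPsyn was not found; add its executables folder to the system path.")
else:
    print("SNPsyn found at", location)
```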
Server setup
A detailed description of how to set up a SNPsyn web server from scratch is available here, and a PDF version here.