Documentation

SNPsyn is an interactive software tool for effective exploration and discovery of synergistic pairs of SNPs. The Flash-based application consists of components for GWAS data preparation, visual exploration of results and details on SNP pairs and single SNPs, Gene Ontology term enrichment analysis, and display of gene interaction networks. The components are connected such that a selection change in one component is propagated to subsequent components and their visualizations. A typical workflow consists of steps 1-4 detailed below.

Distributions of scores and initial selection

In the analysis of a typical GWAS data set, many billions of SNP pairs may be evaluated. This component gives an overview of all results and allows an initial selection from a much smaller subset (the 10000 best pairs of SNPs) for further exploration. The selection is made by comparing the distribution of scores on the true data (blue dots) with the distribution obtained by random permutation analysis (red dots).
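The comparison of true and permuted score distributions can be sketched as follows. This is only an illustration of the general idea of permutation analysis, not SNPsyn's implementation: the function names are hypothetical, and the toy score (absolute difference in mean genotype between the two classes) is an assumption standing in for the real synergy score.

```python
import random

def mean_diff_score(genotypes, labels):
    """Toy score (assumption): absolute difference in mean genotype between classes 1 and 0."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permuted_scores(genotypes, labels, n_permutations, seed=0):
    """Score the same genotypes against randomly shuffled class labels."""
    rng = random.Random(seed)
    shuffled = list(labels)
    scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # keeps the number of cases and controls unchanged
        scores.append(mean_diff_score(genotypes, shuffled))
    return scores

genotypes = [0, 0, 1, 0, 2, 2, 1, 2]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
true_score = mean_diff_score(genotypes, labels)
null_scores = permuted_scores(genotypes, labels, n_permutations=200)
# Fraction of permutations that score at least as high as the true data.
empirical_p = sum(s >= true_score for s in null_scores) / len(null_scores)
```

Scores on the true data that lie well above the permuted scores are the ones worth selecting for further exploration.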

Selection of SNP pairs for further analysis

After the initial selection of a subset of the scored SNP pairs in the distribution component, a finer selection can be made in this component. This selection is then sent to the subsequent components for Gene Ontology enrichment analysis and interaction network exploration.

SNP pairs

With this component the user can explore all the details of a SNP pair.

Single SNPs

A prerequisite for the computation of synergy is the computation of information gain (I) scores of individual SNPs. This component shows the distribution of information gain scores of single SNPs and allows the user to identify "main-effect" SNPs, i.e., those SNPs that by themselves carry the most information on the disease under study.
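As an illustration, the information gain of a single SNP can be computed with the textbook definition I(SNP; class) = H(class) - H(class | SNP). The sketch below uses plain relative frequencies and hypothetical genotype and phenotype labels; it is not taken from SNPsyn's code, and the exact estimator SNPsyn applies depends on the probability estimate selected at submission.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(snp, phenotype):
    """I(SNP; class) = H(class) - H(class | SNP), in bits."""
    n = len(snp)
    conditional = 0.0
    for genotype in set(snp):
        subset = [y for g, y in zip(snp, phenotype) if g == genotype]
        conditional += len(subset) / n * entropy(subset)
    return entropy(phenotype) - conditional

phenotype = ["case", "case", "control", "control"]
informative = ["AA", "AA", "GG", "GG"]    # genotype determines the phenotype
uninformative = ["AA", "GG", "AA", "GG"]  # genotype is independent of the phenotype

gain_informative = information_gain(informative, phenotype)
gain_uninformative = information_gain(uninformative, phenotype)
```

A "main-effect" SNP behaves like the first toy example (high gain on its own), while the second carries no information by itself.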

In this window the user can explore all the details on a single SNP.

Gene ontology (GO)

Genes associated with the SNPs selected in the Selection component are used as the cluster (query) set in this gene ontology term enrichment analysis. All SNPs associated with all annotated genes in GO are used as the reference set. Enriched terms are reported in a gene ontology tree. The user can select which aspect to display. Network nodes of SNPs associated with selected GO terms are highlighted (in green) in the Interaction network component.
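The documentation does not state which statistical test is used for enrichment. A common choice for comparing a query set against a reference set is the one-sided hypergeometric test, sketched below purely as an assumption; the function and parameter names are hypothetical.

```python
from math import comb

def enrichment_p_value(reference_size, reference_hits, query_size, query_hits):
    """P(X >= query_hits) when query_size items are drawn without replacement
    from a reference of reference_size items, reference_hits of which carry the term."""
    upper = min(query_size, reference_hits)
    total = comb(reference_size, query_size)
    tail = sum(
        comb(reference_hits, k) * comb(reference_size - reference_hits, query_size - k)
        for k in range(query_hits, upper + 1)
    )
    return tail / total

# Toy numbers: 20 reference items, 5 annotated to the term; all 5 query items carry it.
p_enriched = enrichment_p_value(reference_size=20, reference_hits=5, query_size=5, query_hits=5)
# Observing zero or more annotated items is certain.
p_trivial = enrichment_p_value(reference_size=20, reference_hits=5, query_size=5, query_hits=0)
```

A small p-value suggests that the term is over-represented in the query set relative to the reference set.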

Interaction network

Nodes in the network are SNPs; edges connect the SNPs that form the selected pairs.
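The structure of such a network can be illustrated with a small adjacency map built from a list of selected pairs. The SNP ids and the helper name `build_network` are hypothetical; this is a sketch of the data structure, not of SNPsyn's rendering code.

```python
from collections import defaultdict

def build_network(snp_pairs):
    """Adjacency map: each SNP maps to the set of SNPs it is paired with."""
    neighbours = defaultdict(set)
    for a, b in snp_pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return dict(neighbours)

selected_pairs = [("rs101", "rs202"), ("rs101", "rs303"), ("rs404", "rs505")]
network = build_network(selected_pairs)
# SNPs that take part in more than one selected pair show up as hubs.
hubs = sorted(snp for snp, linked in network.items() if len(linked) > 1)
```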

Bookmarks

This window stores the lists of selected (starred) SNP pairs and single SNPs for quick reference and access to their details. SNP pairs can be added to these lists by marking a star in their respective rows in the Selection component. Similarly, single SNPs can be marked in the Single SNPs window.

Report

Click on the "Report" button (top right on main screen) to generate a report in HTML format. The current state of all the main window components is recorded, including the selection of SNP pairs, starred SNP pairs and single SNPs, the selection in the score distribution graph, gene ontology, interaction network, etc. The content is linked to NCBI's dbSNP and Entrez Gene information pages. See an example report here.

Data preparation and upload

Submitting data for interaction analysis involves several steps: selection of case and control groups of samples, selection of SNPs to analyze, and selection of interaction analysis parameters. All steps are detailed below.

  • Select GWAS data file on your local computer for upload to server. Supported formats are tab-delimited (.tab or .txt).
  • Click to start the upload and data preparation steps.
  • The tab-delimited data must follow the individual-major order: the first row lists SNP ids, each subsequent row lists the genotype data of one individual (sample), and each column is a SNP. One special column (named "class" in the first row) gives the phenotype values of the samples. More than two phenotype values (i.e., more than just case and control) are allowed and can be used to form various groups of samples for analysis.

    For a few examples, please see the tab-delimited files provided in the Examples section, such as gse8054.tab.gz from the first example.

    Once the data is uploaded (see progress window above), the user may start to group samples.
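The individual-major layout described above can be read with a few lines of Python. This is a minimal sketch: the SNP ids, the genotype coding ("AA", "AG", ...) and the function name are assumptions made for illustration, since the documentation only fixes the row and column layout and the "class" column name.

```python
import csv
import io

def read_individual_major(handle, class_column="class"):
    """Return (snp_ids, samples); each sample is (phenotype, {snp_id: genotype})."""
    reader = csv.DictReader(handle, delimiter="\t")
    snp_ids = [name for name in reader.fieldnames if name != class_column]
    samples = []
    for row in reader:
        genotypes = {snp: row[snp] for snp in snp_ids}
        samples.append((row[class_column], genotypes))
    return snp_ids, samples

# Hypothetical two-sample file: first row holds the SNP ids and the "class" column.
example = io.StringIO(
    "rs101\trs202\tclass\n"
    "AA\tAG\tcase\n"
    "AG\tGG\tcontrol\n"
)
snp_ids, samples = read_individual_major(example)
```

For a real file, replace the `io.StringIO` object with an open file handle (for a .gz file, `gzip.open(path, "rt")`).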

  • All columns that may potentially hold information on phenotype are listed here. These are attributes (columns), whose names do not follow the SNP naming convention (rsNNNNN, where N is a digit 0-9).
  • For a selected phenotype attribute, the attribute values and the number of samples with each value are listed. The user can assign samples with a specific phenotype value to the "cases" or "controls" group. Not all samples need to be assigned, which is useful for comparison and exploration of results obtained on selected subsets of samples.
  • A subset of data is shown in a preview so that the user can check if the data was read correctly.
  • In more hypothesis-driven research, or because of the high computational and processing time demands of exhaustive search, the user can decide not to explore all possible combinations of pairs and instead focus on a subset of SNPs.
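The rule for detecting candidate phenotype columns (names that do not follow the rsNNNNN convention) and the per-value sample counts can be sketched as follows. The column names and rows are hypothetical, and matching lower-case "rs" followed by digits only is an assumption about how strictly the convention is applied.

```python
import re
from collections import Counter

SNP_NAME = re.compile(r"rs\d+")  # rsNNNNN, where N is a digit 0-9

def phenotype_columns(header):
    """Columns whose names do not match the SNP naming convention."""
    return [name for name in header if not SNP_NAME.fullmatch(name)]

def value_counts(rows, column):
    """Number of samples per value of one phenotype attribute."""
    return Counter(row[column] for row in rows)

header = ["rs101", "rs202", "class", "sex"]
rows = [
    {"rs101": "AA", "rs202": "AG", "class": "case", "sex": "F"},
    {"rs101": "AG", "rs202": "GG", "class": "control", "sex": "M"},
    {"rs101": "GG", "rs202": "GG", "class": "case", "sex": "M"},
]
candidates = phenotype_columns(header)
class_counts = value_counts(rows, "class")
```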

  • List of available SNPs which can be considered in the analysis. Select a row to see distribution of genotype values in samples (histogram on right).
  • List of SNPs that were selected for analysis. Select a row to see distribution of genotype values in samples (histogram on left). Due to limited computational resources on our site, heuristic search will be used when you select more than 22000 SNPs. To analyze more SNPs exhaustively using your own computational resources, please use the AIR and command-line versions of SNPsyn.
  • Click "Add >", "Del <", "All >", "All <" to move selected SNPs from one list into the other.
  • Use Gene ontology annotation to select SNPs annotated to genes from various GO terms.
  • Select GO terms and use "Include" or "Exclude" to modify the list of SNPs selected for analysis.
  • When done with group and SNP set selection click on "3. Prepare." The data will be formatted into SNPsyn's binary format and a confirmation window will appear (shown below).

  • Name your analysis. The default value of this field is in this format: name of phenotype attribute used, phenotype values assigned to the cases group vs. values assigned to the controls group. We suggest you carefully name your analyses, because later you will be able to search for them easily.
  • Select the probability estimate to be used in interaction analysis: relative frequency or Laplace estimate. The latter is a more robust estimate of probabilities and is more suitable for datasets with a small number of samples.
  • Select heuristics (off, main effects, screen).
  • Submit GWAS data in SNPsyn's binary format to server for processing and interaction analysis.
  • Download GWAS data in SNPsyn's binary format to your local computer. You may run the analysis on your computer (cluster or grid) using SNPsyn's command-line programs.
  • SNPsyn implements two two-stage heuristics:

    Heuristic search runtime is restricted to five hours (242M pairs). This restriction can be lifted in a local installation of SNPsyn.
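The difference between the two probability estimates offered at submission can be illustrated with their standard textbook forms. Whether SNPsyn uses exactly these formulas is not stated in this documentation, so treat the sketch (and its function names) as an assumption.

```python
def relative_frequency(count, total):
    """Plain maximum-likelihood estimate: count / total."""
    return count / total

def laplace_estimate(count, total, n_values):
    """(count + 1) / (total + n_values): one pseudo-count per possible value."""
    return (count + 1) / (total + n_values)

# A genotype that was never observed among 4 samples, with 3 possible genotypes.
p_relative = relative_frequency(0, 4)
p_laplace = laplace_estimate(0, 4, n_values=3)
```

With few samples, the relative frequency assigns zero probability to unseen values, while the Laplace estimate keeps a small non-zero probability, which is why it is the more robust choice for small datasets.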

    Analysis status and selection

    Click on the "Open" button (menu on top-left in main window) to get a list of analyses submitted to the server and to start exploring the results.

  • Type text that at least partially matches the description of the analysis given at time of submission.
  • Click "Open" to load and display the results of a selected analysis.
  • Click "Delete" to remove the data files and results of a selected analysis.

    Menu bars

  • Submit your data for analysis.
  • Open and start exploring the results of a submitted analysis.
  • When exploring an analysis, generate a report based on current selection and content of the main components.
  • Get lists of bookmarked SNP pairs and single SNPs.
  • Select colors used to render the visualizations (graph, gene ontology, interaction network).
  • Explore the results of single SNP analysis.
  • Manuals and data formats

    Implementation details

    Synergy among SNPs is estimated by an information-theoretic approach called interaction analysis [Jakulin and Bratko 2003, Anastassiou 2007]. Discovering synergy is an inherently combinatorial problem and it requires an exhaustive exploration of all combinations of SNPs. Current genome-wide case-control association studies (GWAS) include over one million SNPs, which in practice limits SNPsyn's exhaustive enumeration and discovery of SNP synergy to investigation of pairs of SNPs.
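In the cited interaction analysis literature, the synergy of two attributes A and B with respect to a class C is commonly written as I(A, B; C) - I(A; C) - I(B; C). The sketch below applies that definition to a toy XOR example; it is an illustration with hypothetical data and helper names and does not reproduce SNPsyn's C++ implementation.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature, phenotype):
    """Mutual information I(feature; phenotype) = H(X) + H(Y) - H(X, Y)."""
    joint = list(zip(feature, phenotype))
    return entropy(feature) + entropy(phenotype) - entropy(joint)

def synergy(snp_a, snp_b, phenotype):
    """Information the pair carries beyond the sum of its single-SNP gains."""
    pair = list(zip(snp_a, snp_b))
    return (information_gain(pair, phenotype)
            - information_gain(snp_a, phenotype)
            - information_gain(snp_b, phenotype))

# XOR pattern: neither SNP is informative alone, the pair is fully informative.
snp_a = [0, 0, 1, 1]
snp_b = [0, 1, 0, 1]
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]
xor_synergy = synergy(snp_a, snp_b, phenotype)
```

In this toy case each SNP alone has zero information gain while the pair determines the phenotype, so the synergy is one bit. Such pairs cannot be found by screening single SNPs, which is why all pairs must be enumerated.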

    The application is optimized for speed and designed for an interactive visual exploration of results. The computationally intensive part is implemented in C++ and can run in parallel on a dedicated cluster or grid. The graphical user interface was developed in Adobe Flex and it can run in standard web browsers that support Flash, or as a stand-alone application using Adobe's AIR run-time environment.

    A web server is needed to host the Flash-based application, receive data files, serve result files, and invoke the command-line SNPsyn program that performs the actual data analysis. The analysis program may run on a dedicated cluster or grid.

    SNPsyn command-line

    Running the command 'SNPsyn man' will display the complete man page.

    Scripts that invoke SNPsyn's command-line commands can be found here.

    Binary data formats

    SNPsyn uses a number of compact binary data structures for efficient storage of GWAS data and results of analysis. Find a detailed description of the data formats here.

    Installation

    Because of limited computational resources, the freely available web version of SNPsyn is limited to the analysis of at most 22000 SNPs or 242M heuristically scored pairs (depending on the chosen heuristic). If you want to handle larger datasets you must use SNPsyn's AIR version. This version provides you with instructions on how to run the analysis on your own computational cluster or grid.
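The two limits quoted above appear to describe the same computational budget: the number of distinct pairs among n SNPs is n(n-1)/2, which for 22000 SNPs comes to about 242 million. This reading is an inference from the numbers, not something the documentation states explicitly. A quick check:

```python
def pair_count(n_snps):
    """Number of distinct unordered SNP pairs: n * (n - 1) / 2."""
    return n_snps * (n_snps - 1) // 2

web_limit_pairs = pair_count(22_000)       # 241_989_000, roughly 242M
genome_wide_pairs = pair_count(1_000_000)  # 499_999_500_000
```

The second figure shows why a genome-wide study with a million SNPs calls for heuristics or a dedicated cluster or grid: it involves roughly two thousand times more pairs than the web version allows.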

    Another possibility is to install your own local server version of SNPsyn.

    SNPsyn AIR installation

    Download the SNPsyn AIR package and follow these steps:

    1. Click on the icon to start the installation.



    2. Click the button "Install." Because we are using a self-signed certificate, please ignore the warning.



    3. Here you may change the destination folder and other options, then click "Continue."



    4. Download the SNPsyn command-line tool and add its executables folder to your system path. SNPsyn AIR creates a data processing script (process.sh/process.bat) which calls the command-line tool.

    5. After the installation is finished and you run SNPsyn AIR, you need to download additional data for mapping human/mouse SNP markers to genes/chromosomes:

    6. Enter the SNPsyn web server address (http://snpsyn.biolab.si or the address of your SNPsyn web server if you installed one). The AIR application connects with the server to download prepared SNP mappings and perform GO enrichment analysis.
    7. Select the desired SNP mapping (human/mouse) depending on the data you use.
    8. Select the storage folder for the SNP mapping data.
    9. Download data for SNP mapping from SNPsyn web server into selected folder (this step needs to be taken only once).
    10. Check if GO enrichment analysis is working on SNPsyn web server.

    Using SNPsyn AIR application

    The SNPsyn AIR version of the interface is intended for local use. All the required SNP mappings are downloaded to the local computer. Only GO enrichment analysis is performed on the SNPsyn server, requiring minimal data transfer.

    A typical workflow involves 3 steps:

    1. Import data (from local file) and select SNPs with the help of GO enrichment analysis (same way as in web interface). Then store the prepared .syn file with the dialog shown below.

      • Select location of new .syn file (output of preparation step and input to analysis script).
      • Select your operating system in order for the process script to be generated correctly.
      • Select frequency measure (relative or Laplace).
      • Select heuristic (off, main effects, screen).
      • Select number of chunks.
      • Select number of random permutations.
      • Select ID of GPU to use for computation (first GPU has index 1). Uncheck if no GPU available.

      When processing is finished, the .syn file is saved in the selected folder and a "process" script is created.

    2. Start data analysis by running the "process" script. Make sure the "SNPsyn" command-line utility is in the system's path.
    3. After the process script finishes, return to the interface and choose "Open" (simply choose the folder with the results).
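Before running the "process" script, it can be useful to verify that the command-line utility is reachable. The sketch below uses Python's standard `shutil.which`; the executable name "SNPsyn" is taken from the 'SNPsyn man' command mentioned in this documentation, but the exact file name on a given operating system is an assumption.

```python
import shutil

def find_executable(name="SNPsyn"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)

snpsyn_path = find_executable()
if snpsyn_path is None:
    print("SNPsyn not found on PATH; add its executables folder before running the process script.")
else:
    print(f"SNPsyn found at {snpsyn_path}")
```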

    Server setup

      A detailed description of how to set up a SNPsyn web server from scratch is available here, and a PDF version here.


    © 2011-2014 Bioinformatics Laboratory and Laboratory for Adaptive Systems and Parallel Processing
    University of Ljubljana, Faculty of Computer and Information Science