public data mining resources

By Tidyomics Team | December 25, 2018

With the advance of sequencing technologies, more and more public data are available for you to mine. You do not have to produce your own data; mining public data sets can help you generate hypotheses and even publish decent papers if done properly.

In this blog post, I am going to list some of the public data resources one can take advantage of.

Gene Expression Omnibus

Gene Expression Omnibus (GEO) is an NCBI-supported public functional genomics data repository. Array- and sequence-based data are deposited by researchers. Many journals require authors to provide a GEO link to their data along with the published paper. Sequencing files are deposited in SRA format, and NCBI provides the SRA Toolkit to work with those files.

You can use ascp within sratoolkit’s prefetch for way faster downloads:

prefetch -t ascp -a "${ASCP_PATH}/connect/bin/ascp|${ASCP_PATH}/connect/etc/asperaweb_id_dsa.openssh" --max-size 1000GB "${SRA_ACCESSION_ID}"

Or you can use fasterq-dump, the multi-threaded successor to fastq-dump, to get the fastq files:

time fasterq-dump SRR000001 -t /dev/shm -e 8

Another option is parallel-fastq-dump: https://github.com/rvalieris/parallel-fastq-dump
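When a study has many runs, the two commands above can be wrapped in a loop. The following is only a minimal sketch: the function name fetch_runs, the DRY_RUN switch and the one-accession-per-line list file are assumptions made for illustration, not part of the SRA Toolkit. It relies only on prefetch taking an accession and on the -e (threads) and -O (output directory) options of fasterq-dump.

```shell
#!/usr/bin/env bash
# Sketch: fetch and convert every run listed in a text file.
# fetch_runs and DRY_RUN are hypothetical names used for illustration only.
# usage: fetch_runs <accession_list> [outdir] [threads]

fetch_runs() {
    local list="$1" outdir="${2:-fastq}" threads="${3:-8}"
    local acc
    # the "|| [[ -n ... ]]" part keeps a last line without a trailing newline
    while read -r acc || [[ -n "$acc" ]]; do
        # skip blank lines and lines starting with '#'
        if [[ -z "$acc" || "$acc" == \#* ]]; then
            continue
        fi
        if [[ -n "${DRY_RUN:-}" ]]; then
            # only print what would be run
            echo "prefetch $acc"
            echo "fasterq-dump $acc -e $threads -O $outdir"
        else
            # download the .sra file, then convert it to fastq
            prefetch "$acc" && fasterq-dump "$acc" -e "$threads" -O "$outdir"
        fi
    done < "$list"
}
```

Running it with DRY_RUN=1 first, for example DRY_RUN=1 fetch_runs runs.txt fastq 8, prints the commands without downloading anything, which is a cheap way to check the accession list before committing to a long download.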

European Nucleotide Archive

The European Nucleotide Archive (ENA) provides a comprehensive record of the world’s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.

The good part of ENA is that fastq files are available for direct download. With GEO, one has to convert the SRA files to fastq files, which can take a long time for big files. For this reason, I always go to the ENA ftp to find the fastq files for the same study. To understand the structure of the ftp, see a gist from Mike Love: https://gist.github.com/mikelove/f539631f9e187a8931d34779436a1c01

Archive generated fastq files are organised by run accession number under vol1/fastq directory in ftp.sra.ebi.ac.uk:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/<dir1>[/<dir2>]/<run accession>

<dir1> is the first 6 letters and numbers of the run accession (e.g. ERR000 for ERR000916),

<dir2> does not exist if the run accession has six digits.

For example, fastq files for run ERR000916 are in directory: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000916/.

If the run accession has seven digits then <dir2> is 00 + the last digit of the run accession.

For example, fastq files for run SRR1016916 are in directory: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916/.

If the run accession has eight digits then <dir2> is 0 + the last two digits of the run accession.

If the run accession has nine digits then <dir2> is the last three digits of the run accession.
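The directory rules above can be sketched as a small bash function. This is an illustration, not an ENA tool: the name ena_fastq_dir is made up here, and the function assumes the accession starts with a three-letter prefix such as SRR, ERR or DRR followed by six to nine digits.

```shell
#!/usr/bin/env bash
# Sketch: print the ENA ftp directory that holds the fastq files of a run.
# ena_fastq_dir is a hypothetical helper; it assumes a 3-letter prefix
# (SRR/ERR/DRR) followed by 6 to 9 digits.

ena_fastq_dir() {
    local acc="$1"
    local prefix="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    local dir1="${acc:0:6}"   # first 6 characters, e.g. SRR101
    local digits="${acc:3}"   # everything after the 3-letter prefix
    local dir2=""
    case "${#digits}" in
        6) dir2="" ;;                      # no extra directory level
        7) dir2="00${digits: -1}/" ;;      # 00 + last digit
        8) dir2="0${digits: -2}/" ;;       # 0 + last two digits
        9) dir2="${digits: -3}/" ;;        # last three digits
        *) echo "unexpected accession: $acc" >&2; return 1 ;;
    esac
    echo "${prefix}/${dir1}/${dir2}${acc}/"
}
```

With the two runs used as examples above, ena_fastq_dir ERR000916 prints ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR000/ERR000916/ and ena_fastq_dir SRR1016916 prints ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR101/006/SRR1016916/. The printed directory can then be handed to any ftp-capable downloader.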

Even better, without downloading the fastqs, one can stream the ENA fastq files with stream_ena, say for RNAseq quantification with salmon:

#!/bin/bash
# from http://www.nxn.se/valent/streaming-rna-seq-data-from-ena
fastq="$1"

prefix=ftp://ftp.sra.ebi.ac.uk/vol1/fastq

accession=$(echo $fastq | tr '.' '_' | cut -d'_' -f 1)

dir1=${accession:0:6}

a_len=${#accession}
if (( $a_len == 9 )); then
    dir2="";
elif (( $a_len == 10 )); then
    dir2=00${accession:9:1};
elif (( $a_len == 11)); then
    dir2=0${accession:9:2};
else
    dir2=${accession:9:3};
fi

url=$prefix/$dir1/$dir2/$accession/$fastq.gz

curl --keepalive-time 4 -s "$url" | zcat

Read more at http://www.ebi.ac.uk/ena/browse/read-download#downloading_files_ena_browser

How to use it:

./stream_ena SRR3185782.fastq | head
@SRR3185782.1 HWI-D00361:180:HJG3GADXX:2:1101:1460:2181/1
AGTGTGTTCATCAGTGTGGATTTGCCAATGCCGGTCTCCCCCACACAGAG
+
BBBFFBFFFB<FFFFFBFF<FFFFFFFFFFFFFIIIIFFFFFFFFIFFFF
@SRR3185782.2 HWI-D00361:180:HJG3GADXX:2:1101:1613:2218/1
GCCAATTTTCTTAATGTAAGTGCTGACTTCCTTAACAATTTCCTCATATC
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@SRR3185782.3 HWI-D00361:180:HJG3GADXX:2:1101:2089:2243/1
CGGGTTCTTGGACTTCAGCCAGTTGAGCAGGGCATCCTTGTTGAAGGCGG


# single-end run, so the library type is U (unstranded) rather than the paired-end IU
salmon quant -l U \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-r <(./stream_ena SRR3185782.fastq) -o SRR3185782

salmon quant -l IU \
-i Homo_sapiens.GRCh38.78.cdna_ERCC_repbase.fa \
-1 <(./stream_ena SRR1274127_1.fastq) \
-2 <(./stream_ena SRR1274127_2.fastq) -o SRR1274127

./stream_ena SRR1274127_1.fastq | fastqc -o SRR1274127_1_fastqc -f fastq stdin

RNAseq/microarray specific databases

ChIPseq specific databases

  • ENCODE
  • Cistrome: The best place for wet lab scientists to check binding sites. Developed by Shirley Liu's lab at Harvard.
  • ChIP-Atlas is an integrative and comprehensive database for visualizing and making use of public ChIP-seq data. ChIP-Atlas covers almost all public ChIP-seq data submitted to the SRA (Sequence Read Archives) in NCBI, DDBJ, or ENA, and is based on over 78,000 experiments.
  • ReMap: an integrative analysis of transcriptional regulator ChIP-seq experiments from both public and ENCODE datasets. The ReMap atlas consists of 80 million peaks from 485 transcription factors (TFs), transcription coactivators (TCAs) and chromatin-remodeling factors (CRFs).
  • UniBind: a map of direct TF-DNA interactions in the human genome. UniBind is a comprehensive map of direct interactions between transcription factors (TFs) and DNA. High-confidence TF binding site predictions were obtained from uniform processing of thousands of ChIP-seq data sets using the ChIP-eat software.

Other field-specific resources

  • Genotype-Tissue Expression (GTEx)
  • TCGA: The Cancer Genome Atlas.
  • CCLE: Broad Institute Cancer Cell Line Encyclopedia.
  • TARGET: Therapeutically Applicable Research To Generate Effective Treatments.
  • The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators, with the goal of aggregating and harmonizing both exome and genome sequencing data from a wide variety of large-scale sequencing projects, and making summary data available for the wider scientific community.
  • 1000 genomes project

There are many other databases that I may have missed here. As you can see, the amount of data available is immense. It is a good time to be a research parasite :)
