资源描述
二代测序数据分析简介,童春发
2013.12.23,主要内容,重测序的原理及流程
数据结构与质量评估
SRA数据库及数据获取
Bowtie2、BWA和SAMtools软件使用,重测序的原理及流程,,数据结构与质量评估,Fastq格式
FastQC,FASTQ format,http://en.wikipedia.org,A FASTQ file containing a single sequence might look like this,@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65,Illumina sequence identifiers,@HWUSI-EAS100R:6:73:941:1973#0/1,Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.,With Casava 1.8 the format of the '@' line has changed,@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG,Quality,A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect).
Phred quality score:
The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used,Quality,Encoding,Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126
Illumina's newest version (1.8) of their pipeline CASAVA will directly produce fastq in Sanger format
Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126
Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126
Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning,American Standard Code for Information Interchange (ASCII),,FastQC,http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Double click “run_fastqc.bat” to run FastQC
The analysis results for 11 modules
Green tick for normal
Orange triangle for slightly abnormal
Red cross for very unusual,,,Basic Statistics,Filename NHS066-47_L4_1.fq.gz
File type Conventional base calls
Encoding Sanger / Illumina 1.9
Total Sequences 3992798
Filtered Sequences 0
Sequence length 100
%GC 37,Per Base Sequence Quality,•The central red line is the median value
•The yellow box represents the inter-quartile range (25-75%)
•The upper and lower whiskers represent the 10% and 90% points
•The blue line represents the mean quality,Per Sequence Quality Scores,A warning is raised if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate.
An error is raised if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.,Per Base Sequence Content,This module issues a warning if the difference between A and T, or G and C is greater than 10% in any position.
This module will fail if the difference between A and T, or G and C is greater than 20% in any position.,Per Base GC Content,This module issues a warning it the GC content of any base strays more than 5% from the mean GC content.
This module will fail if the GC content of any base strays more than 10% from the mean GC content.,Per Sequence GC Content,A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads
This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads,Per Base N Content,This module raises a warning if any position shows an N content of >5%
This module will raise an error if any position shows an N content of >20%,Sequence Length Distribution,This module will raise a warning if all sequences are not the same length
This module will raise an error if any of the sequences have zero length,Duplicate Sequences,This module will issue a warning if non-unique sequences make up more than 20% of the total
This module will issue a error if non-unique sequences make up more than 50% of the total,Overrepresented Sequences,AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCG 65311 1.636 TruSeq Adapter, Index 10 (97% over 36bp)
ATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT 6464 0.162 TruSeq Adapter, Index 10 (97% over 36bp)
AATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT 4633 0.116 TruSeq Adapter, Index 10 (97% over 36bp)
AATTAGTCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT 4463 0.112 TruSeq Adapter, Index 10 (97% over 34bp)
AATTATGGATAATTAAAGTATTCCCCCCTTTTTTTTATGATATTTTTGAC 3994 0.100 No Hit
Warning: >0.1%
Failure: >1%,Overrepresented Kmers,This module will issue a warning if any k-mer is enriched more than 3 fold overall, or more than 5 fold at any individual position
This module will issue a error if any k-mer is enriched more than 10 fold at any individual base position,Saving a Report,NHS066-47_L4_1.fq_fastqc.zip,SRA数据库及数据获取,,,SRA数据库及数据获取,,SRA数据库及数据获取,,SRA数据库及数据获取,,查看和下载SRR576183,,Fastq-dum将SRA文件转化成FASTQ格式,fastq-dump --split-files -DQ “+” ./SRR576183.sra
fastq-dump --split-files -DQ “+” --gzip ./SRR576183.sra,直接下载FASTQ格式数据,ftp://ftp.era.ebi.ac.uk/vol1/fastq/SRR576/SRR576183,将Reads比对到参考序列,BWA
Bowtie2
Soap
Samtools,BWA,http://bio-bwa.sourceforge.net/
https://github.com/lh3/bwa
wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2
tar -xjvf bwa-0.7.5a.tar.bz2
cd bwa-0.7.5a
make
Dowload test.tar.gz from ftp://202.119.214.193,BWA,../bwa-0.7.5a/bwa index ref.fa
../bwa-0.7.5a/bwa mem ref.fa test_PE1.fa > aln-se.sam
../bwa-0.7.5a/bwa mem ref.fa test_PE1.fa test_PE2.fa > aln-se.sam,Bowtie2,http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
下载 bowtie2-2.1.0-linux-x86_64.zip
unzip bowtie2-2.1.0-linux-x86_64.zip
mv bowtie2-2.1.0 bowtie2
cd bowtie2/example
mkdir work
cd work,Bowtie2,Index a reference genome
../../bowtie2-build ../reference/lambda_virus.fa lambda_virus
Aligning single-end reads
../../bowtie2 -x lambda_virus -U ../reads/reads_1.fq
-S eg1.sam
Aligning paired-end reads
../../bowtie2 -x lambda_virus -1 ../reads/reads_1.fq
-2 ../reads/reads_2.fq -S eg2.sam
-U: unpaired reads -S: sam format,SAM output,Name of read that aligned
Sum of all applicable flags. Flags relevant to Bowtie are:
1: The read is one of a pair
2: The alignment is one end of a proper paired-end alignment
4: The read has no reported alignments
8: The read is one of a pair and has no reported alignments
16: The alignment is to the reverse reference strand,SAM output,32: The other mate in the paired-end alignment is aligned to the reverse reference strand
64: The read is mate 1 in a pair
128: The read is mate 2 in a pair
Name of reference sequence where alignment occurs
1-based offset into the forward reference strand where leftmost character of the alignment occurs,SAM output,Mapping quality
CIGAR string representation of alignment
Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate.
1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate,SAM output,Inferred fragment size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate.
Read sequence (reverse-complemented if aligned to the reverse strand)
ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file.
Optional fields. Fields are tab-separated. bowtie2 outputs zero or more of these optional fields for each alignment, depending on the type of the alignment:,SAM output,Optional fields:
AS:i: Alignment score. Only present if SAM record is for an aligned read
XS:i: Alignment score for second-best alignment. Only present if the SAM record is for an aligned read and more than one alignment was found for the read
YS:i: Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.,SAM output,Optional fields:
XN:i: The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read
XM:i: The number of mismatches in the alignment. Only present if SAM record is for an aligned read
XO:i: The number of gap opens, for both read and reference gaps, in the aligment. Only present if SAM record is for an aligned read,SAM output,Optional fields:
XG:i: The number of gap extensions, for both read and reference gaps, in the aligment. Only present if SAM record is for an aligned read
NM:i: The edit distance; that is, the minimal number of one-necleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read,SAM output,Optional fields:
YP:i: Equals 1 if the read is part of a pair that has at least N concordant alignments, where N is the argument specified to –M plus one. Equals 0 if the read is part of pair that has fewer than N alignments. E.g. if –M 2 is specified and 3 distinct, concordant paired-end alignments are found, YP:i:1 will be printed. If fewer than 3 are found, YP:i:0 is printed. Only present if SAM record is for a read that aligned as part of a paired-end alignment.,SAM output,Optional fields:
YM:i: Equals 1 if the read aligned with at least N unpaired alignments, where N is the argument specified to –M plus one. Equals 0 if the read aligned with fewer than N unpaired alignments. E.g. if –M 2 is specified and 3 distinct, valid, unpaired alignments are found, YM:i:1 is printed. If fewer than 3 are found, YM:i:0 is printed. Only present if SAM record is for a read that Bowtie 2 attempted to align in an unpaired fashion.,SAM output,Optional fields:
YF:Z: String indicating reason why the read was filtered out. Only appears for reads that were filtered out.
MD:Z: A string representation of the mismatched reference bases in the alignment. Only present if SAM record is for an aligned read.,SAMtools,http://samtools.sourceforge.net/
Install SAMtools:
Dowload samtools-0.1.19.tar.bz2
tar –xjvf samtools-0.1.19.tar.bz2
Or:
git clone git://github.com/samtools/samtools.git
cd samtools-0.1.19
make,SAMtools: Primer Tutorial,http://biobits.org/samtools_primer.html
Sample Data Files
Aligning Reads Using Bowtie2
Converting SAM to BAM
Sorting and Indexing
Identifying Genomic Variants
Understanding the VCF Format
Visualizing Reads,SAMtools: Primer Tutorial,Sample Data Files
unzip samtools_primer-master.zip
Aligning Reads Using Bowtie2
cd samtools_primer-master
~/bowtie2/bowtie2 -x indexes/e_coli -U simulated_reads/sim_reads.fq -S sim_reads_aligned.sam,SAMtools: Primer Tutorial,Converting SAM to BAM
~/samtools-0.1.19/samtools view -b -S -o sim_reads_aligned.bam sim_reads_aligned.sam
Sorting and Indexing
~/samtools-0.1.19/samtools sort sim_reads_aligned.bam sim_reads_aligned.sorted
~/samtools-0.1.19/samtools index sim_reads_aligned.sorted.bam,SAMtools: Primer Tutorial,Identifying Genomic Variants
~/samtools-0.1.19/samtools mpileup -g -f genomes/NC_008253.fna sim_reads_aligned.sorted.bam > sim_variants.bcf
~/samtools-0.1.19/bcftools/bcftools view -c -v sim_variants.bcf > sim_variants.vcf,SAMtools: Primer Tutorial,Understanding the VCF Format
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41,SAMtools: Primer Tutorial,Visualizing Reads
~/samtools-0.1.19/samtools tview sim_reads_aligned.sorted.bam genomes/NC_008253.fna,
展开阅读全文
相关搜索