二代测序数据分析简介.ppt

举报
资源描述
二代测序数据分析简介,童春发 2013.12.23,主要内容,重测序的原理及流程 数据结构与质量评估 SRA数据库及数据获取 Bowtie2、BWA和SAMtools软件使用,重测序的原理及流程,,数据结构与质量评估,Fastq格式 FastQC,FASTQ format,http://en.wikipedia.org,A FASTQ file containing a single sequence might look like this,@SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65,Illumina sequence identifiers,@HWUSI-EAS100R:6:73:941:1973#0/1,Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.,With Casava 1.8 the format of the '@' line has changed,@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG,Quality,A quality value Q is an integer mapping of p (i.e., the probability that the corresponding base call is incorrect). Phred quality score: The Solexa pipeline (i.e., the software delivered with the Illumina Genome Analyzer) earlier used,Quality,Encoding,Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126 Illumina's newest version (1.8) of their pipeline CASAVA will directly produce fastq in Sanger format Solexa/Illumina 1.0 format can encode a Solexa/Illumina quality score from -5 to 62 using ASCII 59 to 126 Starting with Illumina 1.3 and before Illumina 1.8, the format encoded a Phred quality score from 0 to 62 using ASCII 64 to 126 Starting in Illumina 1.5 and before Illumina 1.8, the Phred scores 0 to 2 have a slightly different meaning,American Standard Code for Information Interchange (ASCII),,FastQC,http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ Double click “run_fastqc.bat” to run FastQC The analysis results for 11 modules Green tick for normal Orange triangle for slightly abnormal Red cross for very unusual,,,Basic Statistics,Filename NHS066-47_L4_1.fq.gz File type Conventional base calls Encoding Sanger / Illumina 1.9 Total Sequences 3992798 Filtered Sequences 0 Sequence length 100 %GC 37,Per Base Sequence Quality,•The central red line is the median value •The yellow box represents the inter-quartile range (25-75%) •The upper and lower whiskers represent the 10% and 90% points •The blue line represents the mean quality,Per Sequence Quality Scores,A warning is raised if the most frequently observed mean quality is below 27 - this equates to a 0.2% error rate. An error is raised if the most frequently observed mean quality is below 20 - this equates to a 1% error rate.,Per Base Sequence Content,This module issues a warning if the difference between A and T, or G and C is greater than 10% in any position. This module will fail if the difference between A and T, or G and C is greater than 20% in any position.,Per Base GC Content,This module issues a warning it the GC content of any base strays more than 5% from the mean GC content. This module will fail if the GC content of any base strays more than 10% from the mean GC content.,Per Sequence GC Content,A warning is raised if the sum of the deviations from the normal distribution represents more than 15% of the reads This module will indicate a failure if the sum of the deviations from the normal distribution represents more than 30% of the reads,Per Base N Content,This module raises a warning if any position shows an N content of >5% This module will raise an error if any position shows an N content of >20%,Sequence Length Distribution,This module will raise a warning if all sequences are not the same length This module will raise an error if any of the sequences have zero length,Duplicate Sequences,This module will issue a warning if non-unique sequences make up more than 20% of the total This module will issue a error if non-unique sequences make up more than 50% of the total,Overrepresented Sequences,AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCG 65311 1.636 TruSeq Adapter, Index 10 (97% over 36bp) ATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT 6464 0.162 TruSeq Adapter, Index 10 (97% over 36bp) AATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT 4633 0.116 TruSeq Adapter, Index 10 (97% over 36bp) AATTAGTCGGAAGAGCACACGTCTGAACTCCAGTCACTCGAAGATCTCGT 4463 0.112 TruSeq Adapter, Index 10 (97% over 34bp) AATTATGGATAATTAAAGTATTCCCCCCTTTTTTTTATGATATTTTTGAC 3994 0.100 No Hit Warning: >0.1% Failure: >1%,Overrepresented Kmers,This module will issue a warning if any k-mer is enriched more than 3 fold overall, or more than 5 fold at any individual position This module will issue a error if any k-mer is enriched more than 10 fold at any individual base position,Saving a Report,NHS066-47_L4_1.fq_fastqc.zip,SRA数据库及数据获取,,,SRA数据库及数据获取,,SRA数据库及数据获取,,SRA数据库及数据获取,,查看和下载SRR576183,,Fastq-dum将SRA文件转化成FASTQ格式,fastq-dump --split-files -DQ “+” ./SRR576183.sra fastq-dump --split-files -DQ “+” --gzip ./SRR576183.sra,直接下载FASTQ格式数据,ftp://ftp.era.ebi.ac.uk/vol1/fastq/SRR576/SRR576183,将Reads比对到参考序列,BWA Bowtie2 Soap Samtools,BWA,http://bio-bwa.sourceforge.net/ https://github.com/lh3/bwa wget http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2 tar -xjvf bwa-0.7.5a.tar.bz2 cd bwa-0.7.5a make Dowload test.tar.gz from ftp://202.119.214.193,BWA,../bwa-0.7.5a/bwa index ref.fa ../bwa-0.7.5a/bwa mem ref.fa test_PE1.fa > aln-se.sam ../bwa-0.7.5a/bwa mem ref.fa test_PE1.fa test_PE2.fa > aln-se.sam,Bowtie2,http://bowtie-bio.sourceforge.net/bowtie2/index.shtml 下载 bowtie2-2.1.0-linux-x86_64.zip unzip bowtie2-2.1.0-linux-x86_64.zip mv bowtie2-2.1.0 bowtie2 cd bowtie2/example mkdir work cd work,Bowtie2,Index a reference genome ../../bowtie2-build ../reference/lambda_virus.fa lambda_virus Aligning single-end reads ../../bowtie2 -x lambda_virus -U ../reads/reads_1.fq -S eg1.sam Aligning paired-end reads ../../bowtie2 -x lambda_virus -1 ../reads/reads_1.fq -2 ../reads/reads_2.fq -S eg2.sam -U: unpaired reads -S: sam format,SAM output,Name of read that aligned Sum of all applicable flags. Flags relevant to Bowtie are: 1: The read is one of a pair 2: The alignment is one end of a proper paired-end alignment 4: The read has no reported alignments 8: The read is one of a pair and has no reported alignments 16: The alignment is to the reverse reference strand,SAM output,32: The other mate in the paired-end alignment is aligned to the reverse reference strand 64: The read is mate 1 in a pair 128: The read is mate 2 in a pair Name of reference sequence where alignment occurs 1-based offset into the forward reference strand where leftmost character of the alignment occurs,SAM output,Mapping quality CIGAR string representation of alignment Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate,SAM output,Inferred fragment size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate. Read sequence (reverse-complemented if aligned to the reverse strand) ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file. Optional fields. Fields are tab-separated. bowtie2 outputs zero or more of these optional fields for each alignment, depending on the type of the alignment:,SAM output,Optional fields: AS:i: Alignment score. Only present if SAM record is for an aligned read XS:i: Alignment score for second-best alignment. Only present if the SAM record is for an aligned read and more than one alignment was found for the read YS:i: Alignment score for opposite mate in the paired-end alignment. Only present if the SAM record is for a read that aligned as part of a paired-end alignment.,SAM output,Optional fields: XN:i: The number of ambiguous bases in the reference covering this alignment. Only present if SAM record is for an aligned read XM:i: The number of mismatches in the alignment. Only present if SAM record is for an aligned read XO:i: The number of gap opens, for both read and reference gaps, in the aligment. Only present if SAM record is for an aligned read,SAM output,Optional fields: XG:i: The number of gap extensions, for both read and reference gaps, in the aligment. Only present if SAM record is for an aligned read NM:i: The edit distance; that is, the minimal number of one-necleotide edits (substitutions, insertions and deletions) needed to transform the read string into the reference string. Only present if SAM record is for an aligned read,SAM output,Optional fields: YP:i: Equals 1 if the read is part of a pair that has at least N concordant alignments, where N is the argument specified to –M plus one. Equals 0 if the read is part of pair that has fewer than N alignments. E.g. if –M 2 is specified and 3 distinct, concordant paired-end alignments are found, YP:i:1 will be printed. If fewer than 3 are found, YP:i:0 is printed. Only present if SAM record is for a read that aligned as part of a paired-end alignment.,SAM output,Optional fields: YM:i: Equals 1 if the read aligned with at least N unpaired alignments, where N is the argument specified to –M plus one. Equals 0 if the read aligned with fewer than N unpaired alignments. E.g. if –M 2 is specified and 3 distinct, valid, unpaired alignments are found, YM:i:1 is printed. If fewer than 3 are found, YM:i:0 is printed. Only present if SAM record is for a read that Bowtie 2 attempted to align in an unpaired fashion.,SAM output,Optional fields: YF:Z: String indicating reason why the read was filtered out. Only appears for reads that were filtered out. MD:Z: A string representation of the mismatched reference bases in the alignment. Only present if SAM record is for an aligned read.,SAMtools,http://samtools.sourceforge.net/ Install SAMtools: Dowload samtools-0.1.19.tar.bz2 tar –xjvf samtools-0.1.19.tar.bz2 Or: git clone git://github.com/samtools/samtools.git cd samtools-0.1.19 make,SAMtools: Primer Tutorial,http://biobits.org/samtools_primer.html Sample Data Files Aligning Reads Using Bowtie2 Converting SAM to BAM Sorting and Indexing Identifying Genomic Variants Understanding the VCF Format Visualizing Reads,SAMtools: Primer Tutorial,Sample Data Files unzip samtools_primer-master.zip Aligning Reads Using Bowtie2 cd samtools_primer-master ~/bowtie2/bowtie2 -x indexes/e_coli -U simulated_reads/sim_reads.fq -S sim_reads_aligned.sam,SAMtools: Primer Tutorial,Converting SAM to BAM ~/samtools-0.1.19/samtools view -b -S -o sim_reads_aligned.bam sim_reads_aligned.sam Sorting and Indexing ~/samtools-0.1.19/samtools sort sim_reads_aligned.bam sim_reads_aligned.sorted ~/samtools-0.1.19/samtools index sim_reads_aligned.sorted.bam,SAMtools: Primer Tutorial,Identifying Genomic Variants ~/samtools-0.1.19/samtools mpileup -g -f genomes/NC_008253.fna sim_reads_aligned.sorted.bam > sim_variants.bcf ~/samtools-0.1.19/bcftools/bcftools view -c -v sim_variants.bcf > sim_variants.vcf,SAMtools: Primer Tutorial,Understanding the VCF Format http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41,SAMtools: Primer Tutorial,Visualizing Reads ~/samtools-0.1.19/samtools tview sim_reads_aligned.sorted.bam genomes/NC_008253.fna,
展开阅读全文
相关搜索
温馨提示:
文客久久所有资源均是用户自行上传分享,仅供网友学习交流,未经上传用户书面授权,请勿作他用。

当前位置:首页 > 教育教学资料库 > 课件讲义


Copyright © 2018-2021 Wenke99.com All rights reserved

工信部备案号浙ICP备20026746号-2  

公安局备案号:浙公网安备33038302330469号

本站为C2C交文档易平台,即用户上传的文档直接卖给下载用户,本站只是网络服务中间平台,所有原创文档下载所得归上传人所有,若您发现上传作品侵犯了您的权利,请立刻联系网站客服并提供证据,平台将在3个工作日内予以改正。