RNAseq分析流程

Posted by Adobe on August 15, 2016 | - view

RNAseq建库流程

建库测序流程

RNAseq 分析流程

RNAseq分析流程


参考序列的下载(数据库的准备)

  1. fasta下载:UCSC-Downloads
    output: hg19.fasta
# 人类参考基因组下载
for i in $(seq 1 22) X Y M
do wget wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz
done
gunzip chr*.fa.gz
for i in $(seq 1 22) X Y M
do cat chr${i}.fa>>hg19.fasta
done
  1. GTF下载:UCSC-Tools-Table Browser
    output: hg19

Table Browser
ucsc table browser

References:
Alleine|reference genome
基因组各种版本对应关系 | 生信菜鸟团

下载fastq数据

NCBI-SRA
可通过命令行软件SRA Toolkits获取数据;或者直接根据浏览器界面下载。
EBI


数据质控、过滤

  1. 碱基含量分布 ( Per base sequence content )
  2. 碱基质量分布 ( Per base sequence quality ) Q20、Q30百分比 85%以上
  3. GC含量

数据质量评估标准

  1. 去除adapter
  2. 去除N碱基过多的reads (>10% or 5%)
  3. 去除低质量 (Q20<80%)
  4. 去除duplication ( 转录本拼接 )
  5. 保留duplication ( 基因表达定量 )

软件安装

[QC] software: FastQC(fastqc)+Cutadapt(cutadapt)+FASTX-Toolkit(fastq_quality_filter)+bbmap(repair.sh)

# Install FastQC
download fastqc_v0.11.2.zip
unzip fastqc_v0.11.2.zip
cd FastQC/
chmod u+x fastqc #将fastqc修改为可运行的主权限
ll
fastqc -h

# Install setuptools
wget --no-check-certificate https://bootstrap.pypa.io/ez_setup.py  
sudo python ez_setup.py --insecure  

# Install pip  
wget https://pypi.python.org/packages/e7/a8/7556133689add8d1a54c0b14aeff0acb03c64707ce100ecd53934da1aa13/pip-8.1.2.tar.gz --no-check-certificate  
tar -xvf pip-8.1.2.tar.gz  
cd pip-8.1.2/  
sudo python setup.py install

# Install cutadapt
sudo pip install cutadapt  

# Install FASTX-Toolkit

## Prerequisites
sudo yum install pkgconfig.x86_64 gcc.x86_64 gcc-c++.x86_64 wget.x86_64  

## Install libgtextutils
wget http://cancan.cshl.edu/labmembers/gordon/files/libgtextutils-0.6.tar.bz2  
tar -xjf libgtextutils-0.6.tar.bz2  
cd libgtextutils-0.6  
./configure  
sudo make  
sudo make install  
cd ..  
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH  

## Install fastx-toolkit
wget http://cancan.cshl.edu/labmembers/gordon/files/fastx_toolkit-0.0.12.tar.bz2   
tar -xjf fastx_toolkit-0.0.12.tar.bz2   
cd fastx_toolkit-0.0.12  
./configure  
make  
sudo make install  

## Sanity check:
fastx_uncollapser -h

# Install bbmap
wget http://sourceforge.mirrorservice.org/b/bb/bbmap/BBMap_36.30.tar.gz  
tar -xvf BBMap_36.30.tar.gz
cd bbmap

References:
Linux 下安装 python 软件包(pip)
cutadapt 安装
Fastx-toolkit installation on CentOS

软件运行

mkdir qcOutdir
fastqc ERR500.R1.fq -t 2 -o qcOutdir
fastqc ERR500.R2.fq -t 2 -o qcOutdir

cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTC ERR500.R1.fq -m 30 --output=ERR500.cutAd1.fq
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTC ERR500.R2.fq -m 30 --output=ERR500.cutAd2.fq

perl ./removeN ERR500.cutAd1.fq ERR500.unknowNul1.fq
perl ./removeN ERR500.cutAd2.fq ERR500.unknowNul2.fq

fastq_quality_filter -q 20 -p 70 -i ERR500.unknowNul1.fq -o ERR500.qual1.fq -Q 33
fastq_quality_filter -q 20 -p 70 -i ERR500.unknowNul2.fq -o ERR500.qual2.fq -Q 33

~/taobao/bbmap/repair.sh in=ERR500.qual1.fq in2=ERR500.qual2.fq out=ERR500.clean1.fq out2=ERR500.clean2.fq

fastqc ERR500.clean1.fq -t 2 -o qcOutdir
fastqc ERR500.clean2.fq -t 2 -o qcOutdir

过滤统计图对比

wc -l xx.fq

得到数值后再除以4 过滤统计图


[mapping] software: TopHat2(tophat2)+SAMtools(samtools)+RSeQC(read_distribution.py|geneBody_coverage.py|junction_saturation.py)

tophat2 -o mappingOut500 -p 10 --read-mismatches 2 -r 50 --library-type fr-unstranded /data/lb/software/rnaSoft/database/hg19 ERR500.clean1.fq ERR500.clean2.fq  
samtools view -h accepted_hits.bam |awk '$1~/^@/||$5==50{print $0}' |samtools view -bhS - >ERR500.unique.bam  
samtools index ERR500.unique.bam
read_distribution.py -i ERR500.unique.bam -r /data/lb/software/bed/hg19.bed
geneBody_coverage.py -i ERR500.unique.bam -r /data/lb/software/bed/hg19.bed -o ERR500
junction_saturation.py -i ERR500.unique.bam -r /data/lb/software/bed/hg19.bed -o ERR500

Bowtie

bowtie2-build -f reference/lambda_virus.fa lambda_virus #建立索引
bowtie2 -x lambda_virus -1 reads/reads_1.fq -2 reads/reads_2.fq -S bowtie.sam
le bowtie.sam

samtools

samtools view -bS test/all.sam -o all.bam
samtools sort all.bam all.sort
samtools index all.sort.bam
samtools faidx ref.fna
samtools tview all.sort.bam
samtools tview all.sort.bam ref.fna
samtools mpileup -ugf ref.fna all.sort.bam -Q 13 -q 0 -F 0.002 |../bcftools/bcftools view -bvcg -t 0.001 -i -1 -p 0.5 -X 0.01 -b - > all.bcf
../bcftools/bcftools view all.bcf |perl ../bcftools/vcfutils.pl varFilter -Q 10 -d 2 -D 100 -a 2 -w 5 -W 5 -1 0.0001 > all.vcf

tophat2

tophat2 -o A -p 4 -G /ifs1/Database/Human/hg38.gtf --read-mismatches 2 -r 50 --library-type fr-unstranded /ifs1/Database/Human/hg38 /ifs1/Bio/Data/A.1.fq.gz /ifs1/Bio/Data/A.2.fq.gz  

tophat2 -o B -p 4 -G /ifs1/Database/Human/hg38.gtf --read-mismatches 2 -r 50 --library-type fr-unstranded /ifs1/Database/Human/hg38 /ifs1/Bio/Data/B.1.fq.gz /ifs1/Bio/Data/B.2.fq.gz  

samtools view accepted_hits.bam |le
samtools view accepted_hits.bam |grep "NH:i:1$" |le
le deletions.bed
cat align_summary.txt

grep "ERR" bowtie.sam |wc
grep "ERR" tophat.sam |wc

比对完数据评估

测序饱和度
测序随机性


[expression] software: HTSeq-count(htseq-count)+Cufflinks(cufflinks,cuffdiff)

#edgeR DEseq DEGSeq
htseq-count --stranded=no --format=bam --order=pos --idattr=gene_id --mode=intersection-nonempty -q /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam /data/lb/software/rnaSoft/database/hg19.gtf >ERR500.htseq_count.xls
#DEGseq voom+limma
cufflinks -o cufflinks500 --library-type fr-unstranded -G /data/lb/software/rnaSoft/database/hg19.gtf /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam
cuffdiff -b /data/lb/software/vcf/ucsc.hg19.fasta -L S499,S500 -o diff --library-type fr-unstranded -u /data/lb/software/rnaSoft/database/hg19.gtf /data/lb/software/rnaSoft/test/mappingOut499/ERR499.unique.bam /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam

表达量计算

RPKM一直只适合原核生物,因为真核生物可变剪切Reads无法比对回去的原因
RPKM
FPKM
FPKM

rpkmforgene.py

python rpkmforgenes.py -i ~/RNAseq/2.tophat/A/accepted_hits.bam -o A.rpkm.txt -a ~/Data/ref/hg38.bed -fulltranscript -rmnameoverlap -bamu -mRNAnorm -readcount -sortpos -bothends
le A.rpkm.txt

基因表达差异

  1. Fold Change
  2. FDR校正

tophat+cufflinks

cufflinks

cufflinks分析流程 cufflinks常规分析 cufflinks新转录本分析

cufflinks -p 4 -o A/ --library-type fr-unstranded -G /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam -u
cufflinks -p 4 -o A/ --library-type fr-unstranded -g /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam -u
le transcripts.gtf
grep "CUFF" ../../4.cufflinks-g/A/transcripts.gtf #CUFF为新转录本

cuffmerge

cuffmerge -g /ifs1/Database/Human/hg38.gtf -s /ifs1/Database/Human/hg38.fa -p 4 assemblies.txt
le merged.gtf
wc merged.gtf

cuffcompare

cuffcompare -s /ifs1/Database/Human/hg38.fa -r /ifs1/Database/Human/hg38.gtf -R -o compare ../5.cuffmerge/merged_asm/merged.gtf

cuffdiff

cuffdiff -o diff -b /ifs1/Database/Human/hg38.fa -p 4 -L A,B -u /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam ../2.tophat/B/accepted_hits.bam
cd diff/
ll gene_exp.diff

cummeRbund

RNAseq数据可视化R包 $ cat cummeRbund.R

#source("http://bioconductor.org/biocLite.R")
#biocLite('cummeRbund')
library (cummeRbund)
setwd ("/ifs1/Bio/data/RNAseq/9.cummeRbund/")
cuff <- readCufflinks()
cuff
dispersionPlot(genes(cuff))
csDensity(genes(cuff))
csDensity(genes(cuff),replicates=T)
csBoxplot(genes(cuff))
csDensity(genes(cuff),replicates=T)
csScatter(genes(cuff),"hESC","Fibroblasts",smooth=T)
csVolcano(genes(cuff),"hESC","Fibroblasts")
csVolcanoMatrix(genes(cuff))

[lncRNA] software: Cufflinks(cufflinks|cuffcompare)+TopHat2(gtf_to_fasta)

cufflinks -o novel500 --library-type fr-unstranded -g /data/lb/software/rnaSoft/database/hg19.gtf /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam
cuffcompare -s /data/lb/software/vcf/ucsc.hg19.fasta -r /data/lb/software/rnaSoft/database/hg19.gtf -R -o compareGtf /data/lb/software/rnaSoft/test/novel500/transcripts.gtf
grep "class_code \"u\"" compareGtf.combined.gtf >novel.gtf
gtf_to_fasta novel.gtf /data/lb/software/vcf/ucsc.hg19.fasta novel.fa

CPC Coding Potential Calculation

新转录本分析

$ cat novel.sh
cufflinks -p 4 -o A/ --library-type fr-unstranded -g /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam
cufflinks -p 4 -o B/ --library-type fr-unstranded -g /ifs1/Database/Human/hg38.gtf ../2.tophat/B/accepted_hits.bam
cuffmerge -g /ifs1/Database/Human/hg38.gtf -s /ifs1/Database/Human/hg38.fa -p 4 assemblies.txt
cuffcompare -s /ifs1/Database/Human/hg38.fa -r /ifs1/Database/Human/hg38.gtf -R -o compare ../5.cuffmerge/merged_asm/merged.gtf
grep "class_code \"u\"" compare.combined.gtf >novel.gtf
gtf_to_fasta novel.gtf  /ifs1/Database/Human/hg38.fa novel.fa

LncRNA分析

采取去除核糖体的方法才能保留所有LncRNA.

  1. 利用新转录本分析
  2. 去除长度小于200bp序列
  3. 编码潜能预测(TPC、PhyloCSF、iSeeRNA、Rfam、CPAT)