RNAseq建库流程
RNAseq 分析流程
参考序列的下载(数据库的准备)
- fasta下载:UCSC-Downloads
output: hg19.fasta
# 人类参考基因组下载
for i in $(seq 1 22) X Y M
do wget wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz
done
gunzip chr*.fa.gz
for i in $(seq 1 22) X Y M
do cat chr${i}.fa>>hg19.fasta
done
- GTF下载:UCSC-Tools-Table Browser
output: hg19
References:
Alleine|reference genome
基因组各种版本对应关系 | 生信菜鸟团
下载fastq数据
NCBI-SRA
可通过命令行软件SRA Toolkits获取数据;或者直接根据浏览器界面下载。
EBI
数据质控、过滤
- 碱基含量分布 ( Per base sequence content )
- 碱基质量分布 ( Per base sequence quality ) Q20、Q30百分比 85%以上
- GC含量
- 去除adapter
- 去除N碱基过多的reads (>10% or 5%)
- 去除低质量 (Q20<80%)
- 去除duplication ( 转录本拼接 )
- 保留duplication ( 基因表达定量 )
软件安装
[QC] software: FastQC(fastqc)+Cutadapt(cutadapt)+FASTX-Toolkit(fastq_quality_filter)+bbmap(repair.sh)
# Install FastQC
download fastqc_v0.11.2.zip
unzip fastqc_v0.11.2.zip
cd FastQC/
chmod u+x fastqc #将fastqc修改为可运行的主权限
ll
fastqc -h
# Install setuptools
wget --no-check-certificate https://bootstrap.pypa.io/ez_setup.py
sudo python ez_setup.py --insecure
# Install pip
wget https://pypi.python.org/packages/e7/a8/7556133689add8d1a54c0b14aeff0acb03c64707ce100ecd53934da1aa13/pip-8.1.2.tar.gz --no-check-certificate
tar -xvf pip-8.1.2.tar.gz
cd pip-8.1.2/
sudo python setup.py install
# Install cutadapt
sudo pip install cutadapt
# Install FASTX-Toolkit
## Prerequisites
sudo yum install pkgconfig.x86_64 gcc.x86_64 gcc-c++.x86_64 wget.x86_64
## Install libgtextutils
wget http://cancan.cshl.edu/labmembers/gordon/files/libgtextutils-0.6.tar.bz2
tar -xjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6
./configure
sudo make
sudo make install
cd ..
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
## Install fastx-toolkit
wget http://cancan.cshl.edu/labmembers/gordon/files/fastx_toolkit-0.0.12.tar.bz2
tar -xjf fastx_toolkit-0.0.12.tar.bz2
cd fastx_toolkit-0.0.12
./configure
make
sudo make install
## Sanity check:
fastx_uncollapser -h
# Install bbmap
wget http://sourceforge.mirrorservice.org/b/bb/bbmap/BBMap_36.30.tar.gz
tar -xvf BBMap_36.30.tar.gz
cd bbmap
References:
Linux 下安装 python 软件包(pip)
cutadapt 安装
Fastx-toolkit installation on CentOS
软件运行
mkdir qcOutdir
fastqc ERR500.R1.fq -t 2 -o qcOutdir
fastqc ERR500.R2.fq -t 2 -o qcOutdir
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTC ERR500.R1.fq -m 30 --output=ERR500.cutAd1.fq
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTC ERR500.R2.fq -m 30 --output=ERR500.cutAd2.fq
perl ./removeN ERR500.cutAd1.fq ERR500.unknowNul1.fq
perl ./removeN ERR500.cutAd2.fq ERR500.unknowNul2.fq
fastq_quality_filter -q 20 -p 70 -i ERR500.unknowNul1.fq -o ERR500.qual1.fq -Q 33
fastq_quality_filter -q 20 -p 70 -i ERR500.unknowNul2.fq -o ERR500.qual2.fq -Q 33
~/taobao/bbmap/repair.sh in=ERR500.qual1.fq in2=ERR500.qual2.fq out=ERR500.clean1.fq out2=ERR500.clean2.fq
fastqc ERR500.clean1.fq -t 2 -o qcOutdir
fastqc ERR500.clean2.fq -t 2 -o qcOutdir
过滤统计图对比
wc -l xx.fq
得到数值后再除以4
[mapping] software: TopHat2(tophat2)+SAMtools(samtools)+RSeQC(read_distribution.py|geneBody_coverage.py|junction_saturation.py)
tophat2 -o mappingOut500 -p 10 --read-mismatches 2 -r 50 --library-type fr-unstranded /data/lb/software/rnaSoft/database/hg19 ERR500.clean1.fq ERR500.clean2.fq
samtools view -h accepted_hits.bam |awk '$1~/^@/||$5==50{print $0}' |samtools view -bhS - >ERR500.unique.bam
samtools index ERR500.unique.bam
read_distribution.py -i ERR500.unique.bam -r /data/lb/software/bed/hg19.bed
geneBody_coverage.py -i ERR500.unique.bam -r /data/lb/software/bed/hg19.bed -o ERR500
junction_saturation.py -i ERR500.unique.bam -r /data/lb/software/bed/hg19.bed -o ERR500
Bowtie
bowtie2-build -f reference/lambda_virus.fa lambda_virus #建立索引
bowtie2 -x lambda_virus -1 reads/reads_1.fq -2 reads/reads_2.fq -S bowtie.sam
le bowtie.sam
samtools
samtools view -bS test/all.sam -o all.bam
samtools sort all.bam all.sort
samtools index all.sort.bam
samtools faidx ref.fna
samtools tview all.sort.bam
samtools tview all.sort.bam ref.fna
samtools mpileup -ugf ref.fna all.sort.bam -Q 13 -q 0 -F 0.002 |../bcftools/bcftools view -bvcg -t 0.001 -i -1 -p 0.5 -X 0.01 -b - > all.bcf
../bcftools/bcftools view all.bcf |perl ../bcftools/vcfutils.pl varFilter -Q 10 -d 2 -D 100 -a 2 -w 5 -W 5 -1 0.0001 > all.vcf
tophat2
tophat2 -o A -p 4 -G /ifs1/Database/Human/hg38.gtf --read-mismatches 2 -r 50 --library-type fr-unstranded /ifs1/Database/Human/hg38 /ifs1/Bio/Data/A.1.fq.gz /ifs1/Bio/Data/A.2.fq.gz
tophat2 -o B -p 4 -G /ifs1/Database/Human/hg38.gtf --read-mismatches 2 -r 50 --library-type fr-unstranded /ifs1/Database/Human/hg38 /ifs1/Bio/Data/B.1.fq.gz /ifs1/Bio/Data/B.2.fq.gz
samtools view accepted_hits.bam |le
samtools view accepted_hits.bam |grep "NH:i:1$" |le
le deletions.bed
cat align_summary.txt
grep "ERR" bowtie.sam |wc
grep "ERR" tophat.sam |wc
比对完数据评估
测序饱和度
测序随机性
[expression] software: HTSeq-count(htseq-count)+Cufflinks(cufflinks,cuffdiff)
#edgeR DEseq DEGSeq
htseq-count --stranded=no --format=bam --order=pos --idattr=gene_id --mode=intersection-nonempty -q /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam /data/lb/software/rnaSoft/database/hg19.gtf >ERR500.htseq_count.xls
#DEGseq voom+limma
cufflinks -o cufflinks500 --library-type fr-unstranded -G /data/lb/software/rnaSoft/database/hg19.gtf /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam
cuffdiff -b /data/lb/software/vcf/ucsc.hg19.fasta -L S499,S500 -o diff --library-type fr-unstranded -u /data/lb/software/rnaSoft/database/hg19.gtf /data/lb/software/rnaSoft/test/mappingOut499/ERR499.unique.bam /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam
表达量计算
RPKM一直只适合原核生物,因为真核生物可变剪切Reads无法比对回去的原因
FPKM
rpkmforgene.py
python rpkmforgenes.py -i ~/RNAseq/2.tophat/A/accepted_hits.bam -o A.rpkm.txt -a ~/Data/ref/hg38.bed -fulltranscript -rmnameoverlap -bamu -mRNAnorm -readcount -sortpos -bothends
le A.rpkm.txt
基因表达差异
- Fold Change
- FDR校正
cufflinks
cufflinks -p 4 -o A/ --library-type fr-unstranded -G /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam -u
cufflinks -p 4 -o A/ --library-type fr-unstranded -g /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam -u
le transcripts.gtf
grep "CUFF" ../../4.cufflinks-g/A/transcripts.gtf #CUFF为新转录本
cuffmerge
cuffmerge -g /ifs1/Database/Human/hg38.gtf -s /ifs1/Database/Human/hg38.fa -p 4 assemblies.txt
le merged.gtf
wc merged.gtf
cuffcompare
cuffcompare -s /ifs1/Database/Human/hg38.fa -r /ifs1/Database/Human/hg38.gtf -R -o compare ../5.cuffmerge/merged_asm/merged.gtf
cuffdiff
cuffdiff -o diff -b /ifs1/Database/Human/hg38.fa -p 4 -L A,B -u /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam ../2.tophat/B/accepted_hits.bam
cd diff/
ll gene_exp.diff
cummeRbund
RNAseq数据可视化R包 $ cat cummeRbund.R
#source("http://bioconductor.org/biocLite.R")
#biocLite('cummeRbund')
library (cummeRbund)
setwd ("/ifs1/Bio/data/RNAseq/9.cummeRbund/")
cuff <- readCufflinks()
cuff
dispersionPlot(genes(cuff))
csDensity(genes(cuff))
csDensity(genes(cuff),replicates=T)
csBoxplot(genes(cuff))
csDensity(genes(cuff),replicates=T)
csScatter(genes(cuff),"hESC","Fibroblasts",smooth=T)
csVolcano(genes(cuff),"hESC","Fibroblasts")
csVolcanoMatrix(genes(cuff))
[lncRNA] software: Cufflinks(cufflinks|cuffcompare)+TopHat2(gtf_to_fasta)
cufflinks -o novel500 --library-type fr-unstranded -g /data/lb/software/rnaSoft/database/hg19.gtf /data/lb/software/rnaSoft/test/mappingOut500/ERR500.unique.bam
cuffcompare -s /data/lb/software/vcf/ucsc.hg19.fasta -r /data/lb/software/rnaSoft/database/hg19.gtf -R -o compareGtf /data/lb/software/rnaSoft/test/novel500/transcripts.gtf
grep "class_code \"u\"" compareGtf.combined.gtf >novel.gtf
gtf_to_fasta novel.gtf /data/lb/software/vcf/ucsc.hg19.fasta novel.fa
CPC Coding Potential Calculation
新转录本分析
$ cat novel.sh
cufflinks -p 4 -o A/ --library-type fr-unstranded -g /ifs1/Database/Human/hg38.gtf ../2.tophat/A/accepted_hits.bam
cufflinks -p 4 -o B/ --library-type fr-unstranded -g /ifs1/Database/Human/hg38.gtf ../2.tophat/B/accepted_hits.bam
cuffmerge -g /ifs1/Database/Human/hg38.gtf -s /ifs1/Database/Human/hg38.fa -p 4 assemblies.txt
cuffcompare -s /ifs1/Database/Human/hg38.fa -r /ifs1/Database/Human/hg38.gtf -R -o compare ../5.cuffmerge/merged_asm/merged.gtf
grep "class_code \"u\"" compare.combined.gtf >novel.gtf
gtf_to_fasta novel.gtf /ifs1/Database/Human/hg38.fa novel.fa
LncRNA分析
采取去除核糖体的方法才能保留所有LncRNA.
- 利用新转录本分析
- 去除长度小于200bp序列
- 编码潜能预测(TPC、PhyloCSF、iSeeRNA、Rfam、CPAT)