您所在的位置：一氧化碳中毒 >> 病种百科>> >> TCGA文献复现系列快检查你的GDC

TCGA文献复现系列快检查你的GDC

文章来源：一氧化碳中毒发布时间:2021-6-24 18:47:56 点击数：次

复现文献：doi:10./ijbs.

gdc数据整理1.表达矩阵整理step1.提取数据

#step0.加载R包library(tibble)library(dplyr)library(stringr)#step1.读取metadata数据----rm(list=ls())metaDat-jsonlite::fromJSON("../01_exprCount/metadata.cart.-09-21.json")#step2.获取样本名与文件名的对应关系----##file_id是文件夹的名字##file_name是文件夹压缩包的名字,即表达数据压缩包##associated_entities里面包含对应的样本名filesPath-paste("../01_exprCount",metaDat$file_id,metaDat$file_name,sep="/")#创建表达文件路径,../当前目录上一级filesPath[1]#查看是否正确#"../01_exprCount/80fda--47a4-b-c9a5/7f3b65a3-ad59-49f1-92fa-f86e.htseq.counts.gz"tmp-data.table::fread(filesPath[1]);head(tmp)#测试是否成功读取数据rm(tmp)#节省空间，耳目一新class(metaDat$associated_entities)#数据结构是listsampleName-metaDat$associated_entities%%unlist()%%#破坏list结构.[grep("^TCGA.*",.)]%%#获取样本名as.character();length(sampleName)#将样本名变为字符串,确定字符串长度metaDat$associated_entities[[1]];sampleName[1]#查看顺序是否改变#step3.测试，准备批量处理----test1-data.table::fread(filesPath[1]);head(test1)test2-data.table::fread(filesPath[2]);head(test2)#identical(test1$V1,test2$V1)#可以鉴定，也可以不鉴定，因为后面是mergetest-merge(test1,test2,by="V1");head(test)rm(test,test1,test2)#节省空间，耳目一新#step4.批量读取整理数据----n=0for(iinfilesPath){n=n+1text=data.table::fread(i)colnames(text)[2]=sampleName[n]if(n2){exprDat=text}else{exprDat=merge(exprDat,text,by="V1")}}exprDat[1:4,1:4]#查看是否成功提取colnames(exprDat)[1]-"ensemblID"#整理列名str_remove_all("ENSG.13","\\..*")#不确定时，先测试一下exprDat$ensemblID-str_remove_all(exprDat$ensemblID,"\\..*");exprDat$ensemblID[1:3]exprDat-column_to_rownames(exprDat,var="ensemblID")save(exprDat,file="exprDat_read.Rdata")#对于处理时间较长的数据都建议先保存，避免下面出bug

一定要验证！！！

小坑：误以为以manifest为基准进行Merge，得到的数据与manifest数据排序一样，不会改变

filenames-paste(manifest$id,manifest$filename,sep="/")

错误结果：

step2.滤过低表达基因image-

#step5.去除低表达量基因----count0-apply(exprDat,1,function(x)sum(x==0))#统计count为0的情况hist(count0,breaks=,col="#efc",main="count为0分布情况")#可视化dim(exprDat)#lowLevel-floor(ncol(exprDat)*0.1)#确定低表达的阈值，ceiling也可以exprDat-exprDat[apply(exprDat,1,function(x)sum(x==0)lowLevel),]dim(exprDat)#

Q1:为什么要滤过低表达基因

A1：

从生物学的角度来看，在任何情况下都没有以生物学意义表达的基因不会引起人们的

转载请注明：http://www.lwblm.com/bzbk/12038.html

上一篇文章： SQL常用语句

下一篇文章：没有了

自助服务

热点文章

最近更新