pRRophetic 通过基因表达水平预测临床化疗反应的R包

编程入门 行业动态 更新时间:2024-10-15 12:34:11

一、pRRophetic 是干什么用的?

pRRophetic是明尼苏达大学Paul Geeleher, Nancy Cox, R. Stephanie Huang做的包,主要算法是基于自己课题组2014年的发的Genome Biology,Clinical drug response can be predicted using baseline gene expression levels and in vitro drug sensitivity in cell lines (https://link.springer/article/10.1186/gb-2014-15-3-r47)。这个R包的算法的输入主要是较大的Project的cell line expression profiles 与对应的IC50信息,然后通过ridge regression建立一个模型,再用这个模型去预测临床样本的Chemotherapeutic Response。2021年,这个课题组又出了
oncoPredict (oncoPredict: an R package for predicting in vivo or cancer patient drug response and biomarkers from cell line screening data), 算是对pRRophetic的一次升级。
(注:R 4.2.1 版本会报错,建议使用4.0.2以下版本)

二、使用指南

1. 包的安装

github地址:https://github/paulgeeleher/pRRophetic
但是大家好像下载tar本地安装

OSF | pRRophetic R package Wiki
https://osf.io/5xvsg/wiki/home/


需要 “car”, “ridge”, “preprocessCore”, “genefilter”, “sva” 四个前置包。

# 如果不缺环境 直接安装就行,不过我喜欢Rstudio,点点鼠标就完事了。
install.packages("pRRophetic_0.5.tar.gz", repos = NULL, dependencies = TRUE)

值得一提的是,这个包里除了源代码,包含很多现成的数据集,这个就很人性化。

# 数据集在这里
find.package("pRRophetic") %>% paste0("/data") %>% dir()
[1] "bortezomibData.RData"                      
[2] "ccleData.RData"                            
[3] "Cell_line_RMA_proc_basalExp.txt"           
[4] "Cell_Lines_Details-1.csv"                  
[5] "cgp2016ExprRma.RData"                      
[6] "datalist"                                  
[7] "drugAndPhenoCgp.RData"                     
[8] "PANCANCER_IC_Tue Aug  9 15_28_57 2016.csv" 
[9] "PANCANCER_IC_Tue_Aug_9_15_28_57_2016.RData"

2. 使用范例

核心的函数 pRRopheticPredict

pRRopheticPredict {pRRophetic}	R Documentation
Given a gene expression matrix, predict drug senstivity for a drug in CGP
Description
Given a gene expression matrix, predict drug senstivity for a drug in CGP.

Usage
pRRopheticPredict(testMatrix, drug, tissueType = "all", batchCorrect = "eb",
  powerTransformPhenotype = TRUE, removeLowVaryingGenes = 0.2,
  minNumSamples = 10, selection = -1, printOutput = TRUE,
  removeLowVaringGenesFrom = "homogenizeData", dataset = "cgp2014")
Arguments
testMatrix	
a gene expression matrix with gene names as row ids and sample names as column ids.

drug	
the name of the drug for which you would like to predict sensitivity, one of A.443654, A.770041, ABT.263, ABT.888, AG.014699, AICAR, AKT.inhibitor.VIII, AMG.706, AP.24534, AS601245, ATRA, AUY922, Axitinib, AZ628, AZD.0530, AZD.2281, AZD6244, AZD6482, AZD7762, AZD8055, BAY.61.3606, Bexarotene, BI.2536, BIBW2992, Bicalutamide, BI.D1870, BIRB.0796, Bleomycin, BMS.509744, BMS.536924, BMS.708163, BMS.754807, Bortezomib, Bosutinib, Bryostatin.1, BX.795, Camptothecin, CCT007093, CCT018159, CEP.701, CGP.082996, CGP.60474, CHIR.99021, CI.1040, Cisplatin, CMK, Cyclopamine, Cytarabine, Dasatinib, DMOG, Docetaxel, Doxorubicin, EHT.1864, Elesclomol, Embelin, Epothilone.B, Erlotinib, Etoposide, FH535, FTI.277, GDC.0449, GDC0941, Gefitinib, Gemcitabine, GNF.2, GSK269962A, GSK.650394, GW.441756, GW843682X, Imatinib, IPA.3, JNJ.26854165, JNK.9L, JNK.Inhibitor.VIII, JW.7.52.1, KIN001.135, KU.55933, Lapatinib, Lenalidomide, LFM.A13, Metformin, Methotrexate, MG.132, Midostaurin, Mitomycin.C, MK.2206, MS.275, Nilotinib, NSC.87877, NU.7441, Nutlin.3a, NVP.BEZ235, NVP.TAE684, Obatoclax.Mesylate, OSI.906, PAC.1, Paclitaxel, Parthenolide, Pazopanib, PD.0325901, PD.0332991, PD.173074, PF.02341066, PF.4708671, PF.562271, PHA.665752, PLX4720, Pyrimethamine, QS11, Rapamycin, RDEA119, RO.3306, Roscovitine, Salubrinal, SB.216763, SB590885, Shikonin, SL.0101.1, Sorafenib, S.Trityl.L.cysteine, Sunitinib, Temsirolimus, Thapsigargin, Tipifarnib, TW.37, Vinblastine, Vinorelbine, Vorinostat, VX.680, VX.702, WH.4.023, WO2009093972, WZ.1.84, X17.AAG, X681640, XMD8.85, Z.LLNle.CHO, ZM.447439.

tissueType	
specify if you would like to traing the models on only a subset of the CGP cell lines (based on the tissue type from which the cell lines originated). This be one any of "all" (for everything, default option), "allSolidTumors" (everything except for blood), "blood", "breast", "CNS", "GI tract" ,"lung", "skin", "upper aerodigestive"

batchCorrect	
How should training and test data matrices be homogenized. Choices are "eb" (default) for ComBat, "qn" for quantiles normalization or "none" for no homogenization.

powerTransformPhenotype	
Should the phenotype be power transformed before we fit the regression model? Default to TRUE, set to FALSE if the phenotype is already known to be highly normal.

removeLowVaryingGenes	
What proportion of low varying genes should be removed? 20 percent be default

minNumSamples	
How many training and test samples are requried. Print an error if below this threshold

selection	
How should duplicate gene ids be handled. Default is -1 which asks the user. 1 to summarize by their or 2 to disguard all duplicates.

printOutput	
Set to FALSE to supress output

Value
a gene expression matrix that does not contain duplicate gene ids
# 加载数据集 硼替佐米(Bortezomib)
data(bortezomibData)

这个数据集来自与GSE9782。exprDataBortezomib数据集里面包含264的样本,pRRphetic里预处理后有22283个基因变量。另一个人重要的输入,就是drug response了。

# studyResponse的level 有三个 Levels: PGx_Responder = IE PGx_Responder = NR PGx_Responder = R
table(studyResponse)

PGx_Responder = IE PGx_Responder = NR  PGx_Responder = R 
25                126                113

首先QQplot看一下IC50是否符合正态分布。

pRRopheticQQplot("Bortezomib")


然后执行了一个 5-fold cross-validation,验证一下是否可以预测临床药物敏感性。可能朋友们会有疑问,为啥没有train dataset,只有test dataset,因为train dataset其实是默认好的,是CGP cell lines,可以用tissueType执行肿瘤类型(This be one any of “all” (for everything, default option), “allSolidTumors” (everything except for blood), “blood”, “breast”, “CNS”, “GI tract” ,“lung”, “skin”, “upper aerodigestive”)。

cvOut <- pRRopheticCV("Bortezomib", cvFold=5, testExprData=exprDataBortezomib)
summary(cvOut)

看看结果

Summary of cross-validation results:

Pearsons correlation: 0.43 , P =  3.57572505890629e-14 
R-squared value: 0.19
Estimated 95% confidence intervals: -4.41, 3.56
Mean prediction error: 1.61
#Plot the cross validation predicted phenotype against the measured IC50s.
plot(cvOut)


pRRopheticPredict
Given a gene expression matrix, predict drug senstivity for a drug in CGP

Based on the qqplot it is likely acceptable to use these data for prediction of
bortezomib sensitivity. Predict bortezomib sensitivity using all cell lines, then
only cell lines from hematological cancers and then only cell lines from derived from solid tumors. (selection = ?,How should duplicate gene ids be handled. Default is -1 which asks the user. 1 to summarize by their or 2 to disguard all duplicates.)

predictedPtype <- pRRopheticPredict(exprDataBortezomib, "Bortezomib", 
                                    selection=1)
predictedPtype_blood <- pRRopheticPredict(exprDataBortezomib, "Bortezomib", 
                                          "blood", selection=1)
predictedPtype_solid <- pRRopheticPredict(exprDataBortezomib, "Bortezomib",
                                          "allSolidTumors", selection=1)
t.test(predictedPtype[((studyResponse == "PGx_Responder = NR") & bortIndex)], 
       predictedPtype[((studyResponse == "PGx_Responder = R") & bortIndex)], 
       alternative="greater")

	Welch Two Sample t-test

data:  predictedPtype[((studyResponse == "PGx_Responder = NR") & bortIndex)] and predictedPtype[((studyResponse == "PGx_Responder = R") & bortIndex)]
t = 3.0109, df = 163.75, p-value = 0.001509
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.1826465       Inf
sample estimates:
mean of x mean of y 
-4.926576 -5.331926 

t.test(predictedPtype_blood[((studyResponse == "PGx_Responder = NR") & bortIndex)], 
       predictedPtype_blood[((studyResponse == "PGx_Responder = R") & bortIndex)],
       alternative="greater")

	Welch Two Sample t-test

data:  predictedPtype_blood[((studyResponse == "PGx_Responder = NR") & bortIndex)] and predictedPtype_blood[((studyResponse == "PGx_Responder = R") & bortIndex)]
t = 4.1204, df = 165.24, p-value = 2.984e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.3975589       Inf
sample estimates:
mean of x mean of y 
-4.372173 -5.036370 


t.test(predictedPtype_solid[((studyResponse == "PGx_Responder = NR") & bortIndex)], 
       predictedPtype_solid[((studyResponse == "PGx_Responder = R") & bortIndex)],
       alternative="greater")

Welch Two Sample t-test

data:  predictedPtype_solid[((studyResponse == "PGx_Responder = NR") & bortIndex)] and predictedPtype_solid[((studyResponse == "PGx_Responder = R") & bortIndex)]
t = 1.1167, df = 165.87, p-value = 0.1329
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -0.07325769         Inf
sample estimates:
mean of x mean of y 
-5.381191 -5.533412 


allTissuesPpv <- getPPV(predResponders=predictedPtype[((studyResponse == "PGx_Responder = R") & bortIndex)], predNonResponders=predictedPtype[((studyResponse == "PGx_Responder = NR") & bortIndex)], drug="Bortezomib")

PPV:  0.57 
NPV:  0.59 
Cutpoint:  -5.04 

bloodPpv <- getPPV(predResponders=predictedPtype_blood[((studyResponse == "PGx_Responder = R") & bortIndex)], predNonResponders=predictedPtype_blood[((studyResponse == "PGx_Responder = NR") & bortIndex)], drug="Bortezomib", tissue="blood")

PPV:  0.63 
NPV:  0.69 
Cutpoint:  -4.54 
df <- stack(list(NR=predictedPtype_blood[((studyResponse == "PGx_Responder = NR")
                                          & bortIndex)], R=predictedPtype_blood[((studyResponse == "PGx_Responder = R") & 
                                                                                   bortIndex)]))
ggplot(data=df, aes(y=values, x=ind)) + geom_boxplot(alpha=.3, fill=c("#CC0033", "#006633")) + 
  theme_bw() + ylab("Predicted Bortezomib Sensitivity") + xlab("Clinical Response")

然后是CCLE数据库,

data(ccleData) 

exprMatCcle是表达矩阵,里面有1037个cell line,18988个基因信息。
sensDataCcle里面有504的cell line的各种信息,包括“”CLE.Cell.Line.Name" ,"Primary.Cell.Line.Name”,“Compound”,还有最最重要的“IC50…uM.”

cvOut_pd <- pRRopheticCV("PD.0325901", cvFold=5, testExprData=exprMatCcle)
summary(cvOut_pd)
plot(cvOut_pd)

predictedPtype_ccle <- pRRopheticPredict(exprMatCcle, "PD.0325901", selection=1)

 11897  gene identifiers overlap between the supplied expression matrices... 
 
Found2batches
Adjusting for0covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding parametric adjustments
Adjusting the Data


 2380 low variabilty genes filtered.
Fitting Ridge Regression model... Done

Calculating predicted phenotype...Done
cellLinesWithCcleIc50s <- names(predictedPtype_ccle)[names(predictedPtype_ccle) %in%
                                                       sensDataCcle$CCLE.Cell.Line.Name]
predCcleOrd <- predictedPtype_ccle[names(predictedPtype_ccle)]
ccleActArea_pd <- -sensDataCcle$"ActArea"[sensDataCcle$Compound == "PD-0325901"]
names(ccleActArea_pd) <- sensDataCcle$"CCLE.Cell.Line.Name"[sensDataCcle$Compound ==
                                                              "PD-0325901"]
ccleActAreaord <- ccleActArea_pd[cellLinesWithCcleIc50s]
cor.test(predictedPtype_ccle[cellLinesWithCcleIc50s], ccleActAreaord, 
         method="spearman")

Spearman's rank correlation rho

data:  predictedPtype_ccle[cellLinesWithCcleIc50s] and ccleActAreaord
S = 7728298, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.5807112 
df2 <- data.frame(predCcle=predictedPtype_ccle[cellLinesWithCcleIc50s], 
                  actAreaCcle=ccleActAreaord)
ggplot(data=df2, aes(y=predCcle, x=actAreaCcle)) + geom_point(alpha=0.5) + 
  geom_smooth(method=lm) + theme_bw() + xlab("Measured Activity Area") +
  ylab("Predicted Drug Sensitivity")

predictedPtype_ccle_erlotinib <- pRRopheticLogisticPredict(exprMatCcle, "Erlotinib", selection=1)

Get the ActArea for the CCLE cell lines for which we have just predicted
IC50.

cellLinesWithCcleIc50s <- 
  names(predictedPtype_ccle_erlotinib)[names(predictedPtype_ccle_erlotinib) %in%
                                         sensDataCcle$CCLE.Cell.Line.Name]
predCcleOrd <- predictedPtype_ccle_erlotinib[names(predictedPtype_ccle_erlotinib)]
ccleActArea_pd <- sensDataCcle$"ActArea"[sensDataCcle$Compound == "Erlotinib"]
names(ccleActArea_pd) <- sensDataCcle$"CCLE.Cell.Line.Name"[sensDataCcle$Compound ==
                                                              "Erlotinib"]
ccleActAreaord <- ccleActArea_pd[cellLinesWithCcleIc50s]

There are a very large number of cell lines resistant to Erlotinib (within
the drug screening window), so a correlation is not an appropriate measure of concordance. So lets do a t-test between some of the most sensitive and resistant cell lines to assess whether signal is being captured by the predictions.

resistant <- names(sort(ccleActAreaord))[1:55] #55 highly resistant cell lines.
sensitive <- names(sort(ccleActAreaord, decreasing=TRUE))[1:15] #15 sensitive
t.test(predictedPtype_ccle_erlotinib[resistant], 
       predictedPtype_ccle_erlotinib[sensitive])

	Welch Two Sample t-test

data:  predictedPtype_ccle_erlotinib[resistant] and predictedPtype_ccle_erlotinib[sensitive]
t = 2.431, df = 32.127, p-value = 0.02081
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2421538 2.7431351
sample estimates:
mean of x mean of y 
-5.184405 -6.677049

Despite the fact that IC50 values are not correlated for this drug between
these studies, the most sensitive/resistant samples are separated highly significantly with this logistic models.

boxplot(list(Resistant=predictedPtype_ccle_erlotinib[resistant], 
             Sensitive=predictedPtype_ccle_erlotinib[sensitive]), pch=20, 
        vertical=TRUE, method="jitter", ylab="Log-odds of sensitivity")


Include, is an example, prediction from the bortezomib clinical data where we try to predict CR, PR, MR, NC, PD from CR, PR, MR, NC, PD. This serves as both an example of prediction directly from clinical data and of using a dataset other than the CGP from which to predict.
First, prepare the training data and test expression data.

trainExpr <- exprDataBortezomib[, (detailedResponse %in% c(1,2,3,4,5)) & 
                                  studyIndex %in% c("studyCode = 25", "studyCode = 40")]
trainPtype <- detailedResponse[(detailedResponse %in% c(1,2,3,4,5)) & 
                                 studyIndex %in% c("studyCode = 25", "studyCode = 40")]
testExpr <- exprDataBortezomib[, (detailedResponse %in% c(1,2,3,4,5)) & 
                                 studyIndex %in% c("studyCode = 39")]

Calculate the predicted phenotype.

ptypeOut <- calcPhenotype(trainExpr, trainPtype, testExpr, selection=1)
testPtype <- detailedResponse[(detailedResponse %in% c(1,2,3,4,5)) & 
                                studyIndex %in% c("studyCode = 39")]
cor.test(testPtype, ptypeOut, alternative="greater")
t.test(ptypeOut[testPtype %in% c(3,4,5)], ptypeOut[testPtype %in% c(1,2)], 
       alternative="greater")

更多推荐

pRRophetic 通过基因表达水平预测临床化疗反应的R包

本文发布于:2023-06-11 01:37:00,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1368779.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:基因   水平   pRRophetic

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!