Find CNH+ and the associated tumor purity and ploidy for samples from a TCGA study

analyze_TCGA_study takes copy number data from a TCGA study and, for each sample, finds CNH+ together with the associated tumor purity and tumor ploidy. Optionally, it finds CNH in addition to CNH+.

Usage

analyze_TCGA_study(study_name, da, grid, k, both = FALSE)

Arguments

study_name: TCGA study name
da: data frame containing Sample, Chromosome, Start, End, Num_Probes, Segment_Mean
grid: matrix of purity and ploidy values over which to search for CNH+ (see make_grid in the example)
k: number of candidate solutions to return
both: whether to find CNH as well as CNH+ (default is FALSE, i.e., CNH+ only)
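To make the grid argument more concrete, here is a minimal sketch of what a purity/ploidy search grid could look like. It assumes the grid enumerates every purity/ploidy combination as a two-column matrix; that layout and the name grid_sketch are assumptions for illustration, not the documented output of make_grid.

```r
# Hypothetical stand-in for make_grid(): every (purity, ploidy) combination
# as a two-column matrix. The real function may use a different layout.
purity <- seq(0.2, 1, by = 0.2)    # 5 candidate purities
ploidy <- seq(1.5, 5, by = 0.5)    # 8 candidate ploidies

grid_sketch <- as.matrix(expand.grid(purity = purity, ploidy = ploidy))

dim(grid_sketch)      # 5 purities x 8 ploidies = 40 rows, 2 columns
head(grid_sketch, 3)  # first few candidate (purity, ploidy) pairs
```

Under the same assumption, the grid in the example below (steps of 0.01) would hold 81 x 351 = 28,431 candidate pairs, so a coarser step is a reasonable choice for a quick trial run.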

Value

Saves a CSV results file named after the study (read in the example as <study_name>_results.csv), containing CNH+, purity, and ploidy for each sample.

Details

The example shows how to: 1) download RCN profiles for samples from a TCGA study; 2) read in a Supplementary file from van Dijk et al., which contains the sample names of the TCGA tumor samples analyzed by the authors, as well as other data (CNH, survival data); 3) select from the downloaded TCGA RCN data only those samples that were analyzed by van Dijk et al.; 4) make the grid of purities and ploidies over which the CNH+ solution is searched; 5) analyze the TCGA study, i.e., for each sample find CNH+ and the associated purity/ploidy pair; 6) compare survival curves of subjects with CNH+ below/above the median and make a Kaplan-Meier (KM) plot.
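Step 6 can be illustrated with a small sketch of a below/above median split. The data frame toy and its values are made up for illustration (they are not TCGA results), and whether plot_survival assigns samples equal to the median to the upper or lower group is not stated here, so the strict ">" used below is an assumption. The Kaplan-Meier fit runs only if the survival package is installed.

```r
# Toy data, invented for illustration only
toy <- data.frame(
  cnh_plus = c(0.02, 0.05, 0.08, 0.11, 0.15, 0.21),
  os       = c(60, 48, 52, 30, 24, 12),  # overall survival time
  os_event = c(0, 0, 1, 1, 1, 1)         # 1 = death observed, 0 = censored
)

# Split samples at the median CNH+ (assumption: strictly above vs. the rest)
cutoff <- median(toy$cnh_plus, na.rm = TRUE)
toy$group <- ifelse(toy$cnh_plus > cutoff, "above median", "below median")
table(toy$group)

# Kaplan-Meier curves and a log-rank test for the two groups
if (requireNamespace("survival", quietly = TRUE)) {
  fit <- survival::survfit(survival::Surv(os, os_event) ~ group, data = toy)
  print(fit)
  lr <- survival::survdiff(survival::Surv(os, os_event) ~ group, data = toy)
  print(lr)
}
```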

References

van Dijk E, van den Bosch T, Lenos KJ, El Makrini K, Nijman LE, van Essen HF, Lansu N, Boekhout M, Hageman JH, Fitzgerald RC, et al. (2021). “Chromosomal copy number heterogeneity predicts survival rates across cancers.” Nature Communications, 12(1), 1-12.

Examples

if (FALSE) {
# Ex:
library('TCGAbiolinks')
library('stringi')
library('openxlsx')
library('dplyr')
#
# TCGA study
study_name = 'READ'
#
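# note: legacy = TRUE targets the GDC legacy (hg19) archive; whether this
# argument is still accepted depends on the installed TCGAbiolinks version
query = TCGAbiolinks::GDCquery(legacy = TRUE,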
                               project = paste0('TCGA-', study_name),
                               data.category = "Copy number variation",
                               file.type = 'hg19.seg',
                               platform = "Affymetrix SNP Array 6.0",
                               sample.type = 'Primary Tumor')
# download segmented_scna_hg19 data
TCGAbiolinks::GDCdownload(query = query,
                          method = "api", files.per.chunk = 10) # alternative: method = "client"
data = TCGAbiolinks::GDCprepare(query = query)
da = as.data.frame(data)
#
# read in the Supplement to van Dijk et al. with results for tumor samples from TCGA studies
vD = openxlsx::read.xlsx("https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-021-23384-6/MediaObjects/41467_2021_23384_MOESM4_ESM.xlsx")
#
# filter samples from the study
vDx = vD %>% filter(Type == study_name)
#
# keep only the RCN segments of samples that were considered by van Dijk et al.
# (samples listed in vDx but absent from the download are simply not selected)
da_vD = da %>% filter(Sample %in% vDx$Samplename)
#
# grid
grid = make_grid(purity = seq(0.2, 1, 0.01), ploidy = seq(1.5, 5, 0.01))
#
# analyze TCGA study (writes the results CSV that is read below)
oo = analyze_TCGA_study(study_name, da_vD, grid, k = 2)
#
# read the saved results
res = read.csv(paste0(study_name, '_results.csv'))
# align the results with the survival data in vDx
# (NA where a vDx sample has no row in the results file)
im_vDx_res = match(vDx$Samplename, res$sample)
#
# survival analysis (below/above median CNH+)
gg_cnhplus = plot_survival(study_name,
                           vDx$OS, vDx$OS_event,
                           res$cnh_plus[im_vDx_res],
                           type = 'unfiltered CNH+', ylim = c(0, 1))
gg_cnhplus
}