br To identify the subset of secretome genes with
To identify the subset of secretome genes with substantial paired normal versus primary tumor NPS-2143 changes across many cancer types, a rank-based metric was used. The rationale of implementing a relative metric rather than directly using the fold-change and significance values from the DE analyses was that their ranges, especially those of the p values, vary widely across cancer types due to differences in the number of samples for each. We therefore ranked the genes within each cancer type by p value, and by absolute log2(fold-change) value, then averaged the two ranks to yield a combined ‘‘PF-rank.’’ To identify the genes that exhibited the greatest and most significant changes across all included cancer types (regardless of fold-change direction), the PF-ranks for each cancer were averaged to yield a pan-cancer PF-rank.
Directional PF-ranks were also generated, where the direction of expression fold-change was incorporated instead of using the absolute log2FC values. In addition, the associated p values were converted to directional p values (pdir, Equation 2, Method Details), such that the lowest ranks were assigned to genes exhibiting a significant decrease in expression across many cancers, and the high-est to those with a significant increase in expression.
Definition of tissue-specific genes
Gene tissue specificity data was retrieved from the HPA, which has compiled a list of genes for each tissue that are classified as tissue enriched, group enriched, or tissue enhanced, based on their expression in that tissue compared to others (Uhle´n et al., 2015). Given the relatively small number of tissue-enriched genes for many tissues, especially when removing all non-SP genes,
we defined tissue-specific gene sets for each individual tissue as the combination of all its tissue-enriched, group-enriched, and tissue-enhanced genes (Table S4).
Estimation of UPR activation
Activation of the UPR was estimated using a GSA. In this analysis, the full gene sets were used; i.e., they were not filtered to remove non-secretome genes. The ‘‘Hallmark,’’ ‘‘Canonical pathways,’’ and ‘‘GO gene sets’’ libraries from MSigDB were queried for any set containing the phrase ‘‘endoplasmic reticulum stress’’ or ‘‘unfolded protein response,’’ and sets with the term ‘‘negative regulation’’ were excluded. This yielded 11 gene sets related to UPR and/or ER stress, which are shown in Figures S4 and S5.
Glycosylation and disulfide bond redox
Expression changes related to glycosylation and disulfide bond oxidation/reduction processes were evaluated by conducting a GSA, using the ‘‘GO bioprocess: glycosylation’’ and ‘‘GO molecular function: protein disulfide oxidoreductase activity’’ gene sets from MSigDB, respectively. To add resolution to the analysis of glycosylation activity, two subsets of the glycosylation gene set, ‘‘protein N-linked glycosylation’’ and ‘‘protein O-linked glycosylation,’’ were also evaluated for coordinated changes in gene expression. These gene sets were used in their complete form, and were not filtered (e.g., by removing non-secretome genes).
Secretory burden (SB) score
The SB score was calculated for each gene based on its associated protein length (number of amino acids), number of disulfide sites, and number of glycosylation sites, as described in Equation 1. Data for each of these terms was retrieved from UniProt, where the number of glycosylation sites was the sum of all N-, C-, O-, and S-linked glycosylation sites.
QUANTIFICATION AND STATISTICAL ANALYSIS
Differential expression (DE) analysis
All differential expression analyses reported in the study were conducted using the edgeR package in R (Robinson et al., 2010), with the raw gene count (integer) values as input. For the DE analysis comparing primary tumor expression to that of paired normal tissues, the patient ID number was included as an additional field in the design matrix. When comparing primary tumor gene counts from TCGA to those of healthy tissues from GTEx, only the sample type was considered (tumor or normal). Counts were normalized using the EdgeR calcNormFactors function, which scales library sizes using the trimmed mean of M-values (TMM) between each pair of samples (Robinson and Oshlack, 2010). For each DE analysis, low-count genes were removed beforehand; i.e., only genes with at least 10 counts in at least half of the samples were retained. Furthermore, DE analyses were only performed if there were at least 3 samples in each of the 2 conditions to be compared.