We used an unsupervised approach to explore genotypically defined clusters in cHL, by adapting Latent Dirichlet Allocation (LDA), a natural language processing (NLP) technique robust to missing lexical observations. Specifically, we integrated SCNAs with non-silent somatic mutation calls (including SNVs and indels) as weighted features to define dominant genetic subtypes using LDA. When relying on maximal cophenetic correlations, two dominant cHL clusters were identified.
Cluster H1 tumors comprised ~68% of cases and were dominated by somatic mutations in genes canonically involved in NFκB, JAK/STAT, and PI3K signaling pathways. Conversely, Cluster H2 tumors, which comprised ~32% of cases, were primarily characterized by a variety of SCNA events as well as mutations in TP53 and epigenetic modifiers such as KMT2D. Of interest, none of the newly discovered genes from WES appeared to be differentially mutated between clusters.
Comparing the two clusters, we observed that H1 tumors had a significantly higher somatic SNV mutational burden (P<0.001), whereas H2 tumors had a significantly larger fraction of their genome affected by SCNAs (P<0.0001). When considering clinical associations, cHL patients with H2 tumors demonstrated a bimodal age distribution with an early peak in the 20s and a second peak at >60 years. In contrast, H1 tumors predominantly occurred in younger patients (P=0.02), with less pronounced bimodality. Patients with an H2 genotype had a modest but significant male predominance (P=0.007) and were enriched for EBV-positive tumors (P<0.0001) and the mixed cellularity subtype (P=0.01). Patients with the H2 subtype also had higher ctDNA levels (P<0.001) and inferior clinical outcomes (P<0.01). Importantly, the negative prognostic implication of H2 tumors persisted when adjusting for high ctDNA levels (hazard ratio 2.0 [95% confidence interval 1.1-3.6], P<0.05).
SNVs and copy number alterations were first converted into an integer-valued matrix, with the following definitions. Non-synonymous mutations: one mutation: 2; two or more mutations: 3. Allele fraction (AF)-corrected copy number (CN) state, gains: (2.75<=CN state<4.5): 1, (4.5<=CN state<6.25): 2, (CN state>=6.25): 3; losses: (1.2<CN state<=1.6): 1, (0.8<CN state<=1.2): 2, (CN state<0.8): 3. Features were excluded if observed in fewer than 2.5% of plasma cases.
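The encoding rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, a gene with no mutation and a copy-neutral segment are assumed to map to 0, the single-mutation class is read as exactly one mutation, and the lowest gain bin is read as 2.75 <= CN < 4.5.

```python
def encode_mutation(n_mutations):
    """Weight for the count of non-synonymous mutations in one gene."""
    if n_mutations >= 2:
        return 3
    # Assumption: exactly one mutation maps to 2, no mutation to 0.
    return 2 if n_mutations == 1 else 0


def encode_gain(cn):
    """Weight for an AF-corrected copy number gain."""
    if cn >= 6.25:
        return 3
    if cn >= 4.5:
        return 2
    # Assumption: the lowest gain bin is 2.75 <= CN < 4.5.
    return 1 if cn >= 2.75 else 0


def encode_loss(cn):
    """Weight for an AF-corrected copy number loss."""
    if cn < 0.8:
        return 3
    if 0.8 < cn <= 1.2:
        return 2
    if 1.2 < cn <= 1.6:
        return 1
    return 0  # includes CN == 0.8, which the stated bins leave unassigned


def drop_rare_features(matrix, min_fraction=0.025):
    """Keep columns that are non-zero in at least min_fraction of rows."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    keep = [
        col for col in range(n_cols)
        if sum(1 for row in matrix if row[col] > 0) / n_rows >= min_fraction
    ]
    return [[row[col] for col in keep] for row in matrix], keep
```

Each row of the resulting matrix is one patient genotype and each column one feature, so the integer weights play the role of word counts in the LDA model.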
The matrix is integer-valued, so linear models such as non-negative matrix factorization are not suitable. We therefore used Latent Dirichlet Allocation (LDA), a popular generative model in natural language processing. Several formulations have been described; here we use the notation of Blei and colleagues1. In this model, each variant (e.g., SOCS1 mutation) is encoded as a “word” (N words: wn), the entire patient genotype is encoded as a “document” (w), and the cohort of patients is encoded as a “corpus” (D). The total number of words (i.e., the sum of all features in the genotype) is Poisson-distributed, and LDA is used to extract the “topics”, i.e., clusters. To fully characterize the LDA model, three matrices Θ, Φ, and Φ' are inferred, with the following definitions:
θi,j = Pr(cluster = j│genotype = ith genotype),
ϕl,j = Pr(genetic variant = gl│cluster = j), and
ϕ'j,l = Pr(cluster = j│genetic variant = gl).
A given genotype is then assigned to the cluster c* = arg max[θgenotype,1, θgenotype,2, …, θgenotype,k]. The model was fitted with a Gibbs sampler, using 250 ‘burn-in’ iterations, 2,000 iterations, and an optimized α (prior of clusters over genomes). The implementation from the R package textmineR was used.
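To make the role of the three matrices concrete, the following Python sketch shows the arg max cluster assignment from Θ and one way to obtain Φ' from Φ with Bayes' rule. It does not reproduce the Gibbs sampling fit or the textmineR implementation; the function names and the `cluster_prior` argument (the marginal probability of each cluster) are assumptions made for illustration.

```python
def assign_clusters(theta):
    """theta[i][j] = Pr(cluster j | genotype i).

    Returns, for each genotype, the index of its most probable cluster.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


def phi_prime_from_phi(phi, cluster_prior):
    """phi[l][j] = Pr(variant l | cluster j); cluster_prior[j] = Pr(cluster j).

    Returns phi_prime[j][l] = Pr(cluster j | variant l) via Bayes' rule.
    """
    n_variants = len(phi)
    n_clusters = len(cluster_prior)
    phi_prime = [[0.0] * n_variants for _ in range(n_clusters)]
    for l in range(n_variants):
        joint = [phi[l][j] * cluster_prior[j] for j in range(n_clusters)]
        total = sum(joint)
        for j in range(n_clusters):
            # A variant never generated by any cluster keeps probability 0.
            phi_prime[j][l] = joint[j] / total if total > 0 else 0.0
    return phi_prime
```

Read this way, Φ ranks the variants that characterize a cluster, whereas Φ' indicates how specific a given variant is to one cluster over the others.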
To find the optimal number of clusters, a custom metric based on cophenetic correlation coefficients was implemented, where both sample and feature robustness were assessed. Varying the number of clusters from k=2 to k=8, the cophenetic coefficients achieved their maximum for both sample and feature curves at k=2.
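The custom metric itself is not specified here. As a rough illustration only, the sketch below computes the standard cophenetic correlation coefficient used in consensus clustering: cluster labels from repeated runs are summarized in a consensus matrix, the items are clustered hierarchically on 1 - consensus with average linkage, and the resulting cophenetic distances are correlated with the original distances. All names are hypothetical, and the choice of average linkage and Pearson correlation is an assumption.

```python
from itertools import combinations
from math import sqrt


def consensus_matrix(label_runs):
    """Fraction of runs in which each pair of items shares a cluster label."""
    n = len(label_runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in label_runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(label_runs) for c in row] for row in counts]


def cophenetic_correlation(cons):
    """Correlate 1 - consensus with average-linkage cophenetic distances."""
    n = len(cons)
    dist = {frozenset(p): 1.0 - cons[p[0]][p[1]]
            for p in combinations(range(n), 2)}

    def avg(a, b):
        # Mean pairwise distance between two clusters (average linkage).
        total = sum(dist[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [(i,) for i in range(n)]
    coph = {}
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: avg(*ab))
        height = avg(a, b)
        for i in a:
            for j in b:
                coph[frozenset((i, j))] = height  # height of first merge
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]

    pairs = list(dist)
    x = [dist[p] for p in pairs]
    y = [coph[p] for p in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when all distances are identical
    return cov / sqrt(var_x * var_y)
```

Under this scheme, the coefficient would be computed once per candidate k (with items being samples for the sample curve and features for the feature curve), and the k with the highest value retained.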
1 Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993-1022 (2003).
Please send questions, issues, and/or licensing requests to: chlymph@gmail.com