Genetic Patterns in Oncogenesis and Identifying Gaps in Treatment

Catarina Bettencourt

Dataset 1: Chromosome data

The first dataset is a table from Piovesan et al.'s article in BMC Research Notes [3]. No downloadable file was available, so the .csv file was created manually. This dataset has 24 rows, one for each human chromosome, and the following 6 columns:


Table 1. Human Chromosome Lengths and Weights

  | chromosome | length_bp | length_cm | length_error_cm | weight_pg | weight_fg | gc_percent
0 | 1          | 248956422 | 8.14      | 0.08            | 0.25      | 254.57    | 41.72
1 | 2          | 242193529 | 7.92      | 0.08            | 0.25      | 247.65    | 40.23
2 | 3          | 198295559 | 6.48      | 0.06            | 0.20      | 202.76    | 39.67
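
Because the file was typed by hand, a quick load-and-check step can catch transcription slips. The following is a minimal sketch, not the author's code: it assumes the manual file is comma-separated with the headers shown in Table 1, and the names CSV_TEXT and load_chromosomes are hypothetical. Only the three rows shown above are included.

```python
import csv
import io

# Hand-typed CSV text (assumed layout); the three rows are those shown in Table 1.
CSV_TEXT = """chromosome,length_bp,length_cm,length_error_cm,weight_pg,weight_fg,gc_percent
1,248956422,8.14,0.08,0.25,254.57,41.72
2,242193529,7.92,0.08,0.25,247.65,40.23
3,198295559,6.48,0.06,0.20,202.76,39.67
"""

FLOAT_COLUMNS = ("length_cm", "length_error_cm", "weight_pg", "weight_fg", "gc_percent")


def load_chromosomes(text):
    """Parse the CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # The chromosome label stays a string so non-numeric labels would also fit.
        row = {"chromosome": record["chromosome"]}
        row["length_bp"] = int(record["length_bp"])
        for key in FLOAT_COLUMNS:
            row[key] = float(record[key])
        rows.append(row)
    return rows


chromosomes = load_chromosomes(CSV_TEXT)
```

With the full file, a check such as `len(chromosomes) == 24` would confirm that no chromosome was skipped during manual entry.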


Dataset 2: Cancerous mutations and associated responses to therapeutics

This dataset was downloaded from the Cancer Genome Interpreter, a site that annotates the biological and clinical relevance of tumor alterations [4]. It contains 1170 rows and 20 columns. The relevant columns are the gene, drug status, drug family, drug, association, and primary tumor type full name. A drug that targets multiple mutations appears in multiple rows, one for each mutated gene. No cleaning was necessary for this dataset.

Example rows, including the relevant attributes, are shown below:


Table 2. Clinical Relevance of Tumor Mutations

  | Biomarker | Gene | Alteration type | Alteration | Targeting | Drug status | Drug family | Drug | Association | Evidence level | Assay type | Source | Curator | Curation date | Primary Tumor type | Metastatic Tumor Type | TCGI included | Comments | Drug full name | Primary Tumor type full name
2 | ERBB2 amplification | ERBB2 | CNA | ERBB2:amp | NaN | Approved | ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy | Pertuzumab;Trastuzumab;Docetaxel | Responsive | FDA guidelines | NaN | PMID:23801166 | SDemajo;RShadrina | 13.05.2022 | BRCA | NaN | NaN | Drug combination | Pertuzumab (ERBB2 inhibitor)+ Trastuzumab (ERBB2 inhibitor) + Docetaxel (Chemotherapy) | Breast adenocarcinoma
3 | ERBB2 overexpression | ERBB2 | EXPR | ERBB2:over | NaN | Approved | ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy | Pertuzumab;Trastuzumab;Docetaxel | Responsive | FDA guidelines | NaN | PMID:23801166 | SDemajo;RShadrina | 30.06.2022 | BRCA | NaN | NaN | Drug combination | Pertuzumab (ERBB2 inhibitor) + Trastuzumab (ERBB2 inhibitor) + Docetaxel (Chemotherapy) | Breast adenocarcinoma
4 | ERBB2 amplification | ERBB2 | CNA | ERBB2:amp | NaN | Approved | ERBB2 inhibitor;Hormone therapy | Lapatinib;Letrozole | Responsive | FDA guidelines | NaN | FDA;PMID:19786658 | SDemajo;RShadrina | 13.05.2022 | BRCA | NaN | NaN | Drug combination | Lapatinib (ERBB2 inhibitor) + Letrozole (Hormone therapy) | Breast adenocarcinoma
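
The rows above show two patterns worth handling in code: the same drug combination repeats across rows with different biomarkers, and the Drug and Drug family columns hold ";"-separated lists. The sketch below illustrates both on the values from Table 2. It assumes the two lists are positionally aligned, which the "Drug full name" column suggests but the source does not state; the names split_combination and biomarkers_by_drug are hypothetical.

```python
from collections import defaultdict

# A few fields copied from the rows of Table 2.
rows = [
    {"Biomarker": "ERBB2 amplification", "Gene": "ERBB2",
     "Drug family": "ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy",
     "Drug": "Pertuzumab;Trastuzumab;Docetaxel"},
    {"Biomarker": "ERBB2 overexpression", "Gene": "ERBB2",
     "Drug family": "ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy",
     "Drug": "Pertuzumab;Trastuzumab;Docetaxel"},
    {"Biomarker": "ERBB2 amplification", "Gene": "ERBB2",
     "Drug family": "ERBB2 inhibitor;Hormone therapy",
     "Drug": "Lapatinib;Letrozole"},
]


def split_combination(drug_family, drug):
    """Pair each drug in a ';'-separated combination with its family.

    Assumes both lists are in the same order (an assumption, not a
    documented guarantee of the dataset).
    """
    families = drug_family.split(";")
    drugs = drug.split(";")
    if len(families) != len(drugs):
        raise ValueError("Drug and Drug family lists differ in length")
    return list(zip(drugs, families))


# Collect every biomarker listed for each drug combination.
biomarkers_by_drug = defaultdict(list)
for row in rows:
    biomarkers_by_drug[row["Drug"]].append(row["Biomarker"])

pairs = split_combination(rows[2]["Drug family"], rows[2]["Drug"])
```

Grouping on the unsplit Drug string keeps each combination therapy together; splitting first would instead count each component drug separately.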


Dataset 3: Validated cancer-associated mutations

The next dataset is a catalog of validated mutations that are associated with a predisposition to cancer or have been shown to drive tumor growth. This dataset was also downloaded from the Cancer Genome Interpreter [5]. It has 5601 rows, one for each mutation, and 9 columns. The relevant columns in this dataset are the following:


Table 3. Original HGVS Cancer-associated mutations

  | gene | gdna | protein | transcript | info | context | cancer_acronym | source | reference
0 | ABCB4 | chr7:g.87053221C>T | . | ENST00000265723 | CSQN=SpliceDonorSNV;C2=SpliceDonorOfExon17_At_chr7:87053221;aliases=ENSP00000265723;source=Ensembl | germline | HC-PR | ClinVar | Clinvar:NM_018849.2(ABCB4):c.2211+1G>A AND Progressive familial intrahepatic cholestasis 3
1 | ABL1 | chr9:g.133738306G>A | p.E236K | ENST00000318560 | CSQN=Missense;codon_pos=133738306-133738307-133738308;ref_codon_seq=GAG;aliases=ENSP00000323315;source=Ensembl | somatic | CML | DoCM | PMID:21562040
2 | ABL1 | chr9:g.133738309A>G | p.M237V | ENST00000318560 | CSQN=Missense;reference_codon=ATG;candidate_codons=GTA,GTC,GTG,GTT;candidate_mnv_variants=chr9:g.133738309_133738311delATGinsGTA,chr9:g.133738309_133738311delATGinsGTC,chr9:g.133738309_133738311delATGinsGTT;aliases=ENSP00000323315;source=Ensembl | somatic | CML | Biomarker | PMID:21562040

The gdna column stores each mutation in HGVS notation. Take the first row's value, chr7:g.87053221C>T, as an example. The first section, “chr7”, indicates which chromosome the mutation is located on. The “g” denotes that the positions that follow refer to a linear genomic reference sequence. The number that follows is the location of the mutation in base pairs. Lastly, “C>T” indicates a point mutation from cytosine to thymine.

This is the simplest form of HGVS notation in the dataset; other types of mutation require different formats. For example, a deletion-insertion mutation (chr2:g.29445449_29445451delinsGCT) provides two locations that mark the start and end of the deletion, followed by the base sequence that is inserted into the genome at that location. Another example is an entry that lists several alternative changes, changes at several locations, or both. The value chr2:g.29443695G>T__chr2:g.29443695G>C__chr2:g.29443697A>G does both: at the same location, the guanine can change to either thymine or cytosine, and a further change occurs two base pairs downstream. In every case, the individual changes are separated by two underscores, “__”.
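
The notation rules described above can be turned into a small parser. What follows is a minimal sketch, not the HGVS reader used in this project: it handles only the two forms discussed here (single-base substitutions and delins ranges), and anything else is flagged as "unparsed". The output keys mirror the extra columns of Table 4, but the function name parse_hgvs, the "deletion_insertion" label, and the use of None for a missing end position are assumptions made for illustration.

```python
import re

# Single-base substitution, e.g. "chr7:g.87053221C>T"
POINT = re.compile(
    r"^chr(?P<chrom>[0-9XY]+):(?P<loc>g\.(?P<start>\d+)(?P<change>[ACGT]>[ACGT]))$"
)
# Deletion-insertion over a range, e.g. "chr2:g.29445449_29445451delinsGCT"
DELINS = re.compile(
    r"^chr(?P<chrom>[0-9XY]+):"
    r"(?P<loc>g\.(?P<start>\d+)_(?P<end>\d+)(?P<change>delins[ACGT]+))$"
)


def parse_hgvs(gdna):
    """Split a gdna value on '__' and return one dict per individual change."""
    rows = []
    for part in gdna.split("__"):
        match, kind = POINT.match(part), "point_mutation"
        if match is None:
            match, kind = DELINS.match(part), "deletion_insertion"
        if match is None:
            # Other HGVS forms are outside the scope of this sketch.
            rows.append({"full_mutation": part, "simple_mutation": "unparsed"})
            continue
        fields = match.groupdict()
        rows.append({
            "full_mutation": part,
            "chromosome": fields["chrom"],
            "location": fields["loc"],
            "mutation": fields["change"],
            "simple_mutation": kind,
            "location_start": int(fields["start"]),
            # Point mutations have no end position.
            "location_end": int(fields["end"]) if fields.get("end") else None,
        })
    return rows


changes = parse_hgvs(
    "chr2:g.29443695G>T__chr2:g.29443695G>C__chr2:g.29443697A>G"
)
```

Returning a list even for a single change keeps the caller's logic uniform: each element becomes one row, which matches the one-row-per-mutation layout of the processed dataset.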

Since each of these notations requires extensive cleaning, an HGVS reader was created to convert the HGVS notation into separate columns, cleaned so that they can be processed easily in Python. Below is the final cleaned dataset that was used for all further analysis. If a gene had multiple associated mutations, a separate row was created for each mutation.

Table 4. Processed HGVS Cancer-associated mutations

  | gene | gdna | protein | transcript | info | context | cancer_acronym | source | reference | full_mutation | chromosome | location | mutation | simple_mutation | location_start | location_end
0 | ABCB4 | chr7:g.87053221C>T | . | ENST00000265723 | CSQN=SpliceDonorSNV;C2=SpliceDonorOfExon17_At_chr7:87053221;aliases=ENSP00000265723;source=Ensembl | germline | Hepatic carcinoma predisposition | ClinVar | Clinvar:NM_018849.2(ABCB4):c.2211+1G>A AND Progressive familial intrahepatic cholestasis 3 | chr7:g.87053221C>T | 7 | g.87053221C>T | C>T | point_mutation | 87053221 | NaN
1 | ABL1 | chr9:g.133750347A>G | p.Y393C | ENST00000318560 | CSQN=Missense;reference_codon=TAC;candidate_codons=TGT,TGC;candidate_mnv_variants=chr9:g.133750347_133750348delACinsGT;aliases=ENSP00000323315;source=Ensembl | somatic | Chronic myeloid leukemia | Biomarker | PMID:21562040 | chr9:g.133750347A>G | 9 | g.133750347A>G | A>G | point_mutation | 133750347 | NaN
3 | ABL1 | chr9:g.133750330G>T | p.L387F | ENST00000318560 | CSQN=Missense;reference_codon=TTG;candidate_codons=TTT,TTC;candidate_snv_variants=chr9:g.133750330G>C;aliases=ENSP00000323315;source=Ensembl | somatic | Chronic myeloid leukemia | Biomarker | PMID:21562040 | chr9:g.133750330G>T | 9 | g.133750330G>T | G>T | point_mutation | 133750330 | NaN


Dataset 4: Cancer acronyms

The final dataset comes from the additional .csv file included in the zip file for dataset 3. It contains two columns, one for the cancer acronym and one for the full cancer name. The full names match the values in the “Primary Tumor type full name” column of dataset 2. This dataset has 113 rows, each corresponding to a different type of cancer.

Table 5. Cancer Acronyms Dictionary

  | Acronym | Cancer
0 | AA      | Adrenal adenoma
1 | AC      | Adrenal cortex
2 | ACA     | Adrenal cortex adenoma
3 | AG      | Anaplastic oligodendroglioma
4 | ALL     | Acute lymphoblastic leukemia
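
Since the full names line up with dataset 2, this table can act as a lookup that turns the acronyms in dataset 3 into names that join cleanly with the drug data. The sketch below shows the idea using only the five rows listed above; the name expand_acronym and the choice to leave unknown acronyms unchanged are assumptions, not the author's implementation.

```python
# Lookup built from the first five rows of Table 5.
ACRONYMS = {
    "AA": "Adrenal adenoma",
    "AC": "Adrenal cortex",
    "ACA": "Adrenal cortex adenoma",
    "AG": "Anaplastic oligodendroglioma",
    "ALL": "Acute lymphoblastic leukemia",
}


def expand_acronym(acronym, lookup=ACRONYMS):
    """Return the full cancer name, or the acronym itself if it is not listed."""
    return lookup.get(acronym, acronym)


full_names = [expand_acronym(code) for code in ("ALL", "AA", "ZZZ")]
```

Falling back to the original acronym keeps unmatched rows visible, so gaps in the dictionary can be spotted later instead of silently becoming missing values.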

Contact Me

Preferred email: catarina.bettencourt.a@gmail.com | School email: bettencourt.c@northeastern.edu