Dataset 1: Chromosome data

The first dataset is a table from Piovesan et al., published in BMC Research Notes [3]. The table was not available for download, so the .csv file was created manually. This dataset has 24 rows, one for each of the human chromosomes, and the following 6 columns:

Chromosome number: includes chromosomes 1-22, as well as the X and Y chromosomes
Length (bp): the length of the chromosome in base-pairs
Length (cm): the length of the chromosome in centimeters
Weight (pg): the weight of the chromosome in picograms
Weight (fg): the weight of the chromosome in femtograms
GC%: the percent of the chromosome that is composed of guanine and cytosine

Table 1. Human Chromosome Lengths and Weights

	chromosome	length_bp	length_cm	length_error_cm	weight_pg	weight_fg	gc_percent
0	1	248956422	8.14	0.08	0.25	254.57	41.72
1	2	242193529	7.92	0.08	0.25	247.65	40.23
2	3	198295559	6.48	0.06	0.20	202.76	39.67
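Because the file was typed in by hand, a quick sanity check on the two weight columns can catch transcription errors: the values should agree once units are converted (1 pg = 1000 fg). The sketch below is only an illustration, not the project's actual loading code. It parses the three rows shown in Table 1 from an inline string with Python's standard `csv` module; the names `TABLE_1`, `load_chromosomes`, and `weights_consistent` are assumptions made for this example.

```python
import csv
import io

# The three rows shown in Table 1, inlined so the example is self-contained.
# In practice this text would come from the manually created .csv file.
TABLE_1 = """chromosome,length_bp,length_cm,length_error_cm,weight_pg,weight_fg,gc_percent
1,248956422,8.14,0.08,0.25,254.57,41.72
2,242193529,7.92,0.08,0.25,247.65,40.23
3,198295559,6.48,0.06,0.20,202.76,39.67
"""

FLOAT_COLUMNS = ("length_cm", "length_error_cm", "weight_pg", "weight_fg", "gc_percent")


def load_chromosomes(text):
    """Parse CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        # The chromosome label stays a string because X and Y are not numbers.
        row = {"chromosome": raw["chromosome"], "length_bp": int(raw["length_bp"])}
        for key in FLOAT_COLUMNS:
            row[key] = float(raw[key])
        rows.append(row)
    return rows


def weights_consistent(row):
    """True if weight_fg, converted to pg and rounded to 2 decimals, matches weight_pg."""
    return abs(round(row["weight_fg"] / 1000, 2) - row["weight_pg"]) < 1e-9


chromosomes = load_chromosomes(TABLE_1)
all_ok = all(weights_consistent(row) for row in chromosomes)
```

For the three rows above, `all_ok` is `True`; a mistyped weight in either column would make the check fail for that row.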

Dataset 2: Cancerous mutations and associated responses to therapeutics

This dataset was downloaded from the Cancer Genome Interpreter, the site described in "Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations" [4]. It contains 1170 rows and 20 columns, of which the relevant ones are the gene, drug status, drug family, drug, association, and primary tumor type full name. If a drug targets multiple mutations, it has multiple rows, one for each mutated gene. No cleaning was necessary for this dataset.

An example row with the relevant attributes would provide the following information:

Drug: the name of the cancer drug
Drug family: the type of drug
Drug status: whether the drug has been approved for clinical use
Primary tumor type full name: the type of cancer that this drug treats
Gene: the gene that this drug is associated with
Association: whether the drug works or does not work against that gene

Table 2. Clinical Relevance of Tumor Mutations

	Biomarker	Gene	Alteration type	Alteration	Targeting	Drug status	Drug family	Drug	Association	Evidence level	Assay type	Source	Curator	Curation date	Primary Tumor type	Metastatic Tumor Type	TCGI included	Comments	Drug full name	Primary Tumor type full name
2	ERBB2 amplification	ERBB2	CNA	ERBB2:amp	NaN	Approved	ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy	Pertuzumab;Trastuzumab;Docetaxel	Responsive	FDA guidelines	NaN	PMID:23801166	SDemajo;RShadrina	13.05.2022	BRCA	NaN	NaN	Drug combination	Pertuzumab (ERBB2 inhibitor)+ Trastuzumab (ERBB2 inhibitor) + Docetaxel (Chemotherapy)	Breast adenocarcinoma
3	ERBB2 overexpression	ERBB2	EXPR	ERBB2:over	NaN	Approved	ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy	Pertuzumab;Trastuzumab;Docetaxel	Responsive	FDA guidelines	NaN	PMID:23801166	SDemajo;RShadrina	30.06.2022	BRCA	NaN	NaN	Drug combination	Pertuzumab (ERBB2 inhibitor) + Trastuzumab (ERBB2 inhibitor) + Docetaxel (Chemotherapy)	Breast adenocarcinoma
4	ERBB2 amplification	ERBB2	CNA	ERBB2:amp	NaN	Approved	ERBB2 inhibitor;Hormone therapy	Lapatinib;Letrozole	Responsive	FDA guidelines	NaN	FDA;PMID:19786658	SDemajo;RShadrina	13.05.2022	BRCA	NaN	NaN	Drug combination	Lapatinib (ERBB2 inhibitor) + Letrozole (Hormone therapy)	Breast adenocarcinoma
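The rows in Table 2 describe drug combinations, so the "Drug" and "Drug family" cells hold several values separated by semicolons. A minimal sketch of how such a row might be unpacked is shown below. It assumes that the two lists are aligned by position, which the "Drug full name" column suggests for these rows but which is not verified here for the whole dataset. The function name `split_combination` is hypothetical.

```python
def split_combination(drug, drug_family):
    """Pair each drug in a ';'-separated combination with its drug family.

    Assumes the two cells list their entries in the same order.
    """
    drugs = [name.strip() for name in drug.split(";")]
    families = [name.strip() for name in drug_family.split(";")]
    if len(drugs) != len(families):
        # Refuse to guess when the two cells cannot be aligned one-to-one.
        raise ValueError("drug and drug family lists differ in length")
    return list(zip(drugs, families))


# Values taken from the first row of Table 2.
pairs = split_combination(
    "Pertuzumab;Trastuzumab;Docetaxel",
    "ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy",
)
```

Here `pairs` holds three (drug, family) tuples, one per drug in the combination. Raising an error on mismatched lengths is a design choice for this sketch: it surfaces rows that break the alignment assumption instead of silently pairing the wrong values.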

Dataset 3: Validated cancer-associated mutations

The next dataset is a catalog of validated mutations that are associated with a predisposition to cancer or have been shown to drive tumor growth. This dataset was also downloaded from the Cancer Genome Interpreter [5]. There are 5601 rows, one for each mutation, and 9 columns. The relevant columns in this dataset are the following:

Gene: the name of the gene
gDNA: the mutation written in Human Genome Variation Society (HGVS) nomenclature, a string that combines letters and integers (e.g. chr7:g.87053221C>T) [6]
Context: whether the mutation is somatic or germline, which is relevant because germline mutations are hereditary
Cancer acronym: the type of associated cancer, written as an acronym

Table 3. Original HGVS Cancer-associated mutations

	gene	gdna	protein	transcript	info	context	cancer_acronym	source	reference
0	ABCB4	chr7:g.87053221C>T	.	ENST00000265723	CSQN=SpliceDonorSNV;C2=SpliceDonorOfExon17_At_chr7:87053221;aliases=ENSP00000265723;source=Ensembl	germline	HC-PR	ClinVar	Clinvar:NM_018849.2(ABCB4):c.2211+1G>A AND Progressive familial intrahepatic cholestasis 3
1	ABL1	chr9:g.133738306G>A	p.E236K	ENST00000318560	CSQN=Missense;codon_pos=133738306-133738307-133738308;ref_codon_seq=GAG;aliases=ENSP00000323315;source=Ensembl	somatic	CML	DoCM	PMID:21562040
2	ABL1	chr9:g.133738309A>G	p.M237V	ENST00000318560	CSQN=Missense;reference_codon=ATG;candidate_codons=GTA,GTC,GTG,GTT;candidate_mnv_variants=chr9:g.133738309_133738311delATGinsGTA,chr9:g.133738309_133738311delATGinsGTC,chr9:g.133738309_133738311delATGinsGTT;aliases=ENSP00000323315;source=Ensembl	somatic	CML	Biomarker	PMID:21562040

In the example chr7:g.87053221C>T, the first section, “chr7”, indicates the chromosome on which the mutation is located. The “g” denotes that the values that follow refer to a linear genomic reference sequence. The number that follows is the position of the mutation in base pairs. Lastly, “C>T” indicates a point mutation from cytosine to thymine. This is the simplest type of HGVS notation in the dataset; other types of mutations require different formats. For example, a deletion-insertion mutation (chr2:g.29445449_29445451delinsGCT) provides two positions that mark the start and end of the deletion, followed by the base sequence that is inserted into the genome at that location. Other values describe a mutation that can have two different results, or one that occurs at multiple positions. The value chr2:g.29443695G>T__chr2:g.29443695G>C__chr2:g.29443697A>G includes both: at position 29443695 the guanine can change to either thymine or cytosine, and a second change occurs two base pairs downstream, at position 29443697. In every case, separate changes are delimited by two underscores, “__”.

Since each of these notations requires extensive cleaning, an HGVS reader was created to convert the HGVS notation into separate columns that are straightforward to work with in Python. Below is the final cleaned dataset that was used for all further analysis. If a gene had multiple associated mutations, a separate row was created for each mutation.

Table 4. Processed HGVS Cancer-associated mutations

	gene	gdna	protein	transcript	info	context	cancer_acronym	source	reference	full_mutation	chromosome	location	mutation	simple_mutation	location_start	location_end
0	ABCB4	chr7:g.87053221C>T	.	ENST00000265723	CSQN=SpliceDonorSNV;C2=SpliceDonorOfExon17_At_chr7:87053221;aliases=ENSP00000265723;source=Ensembl	germline	Hepatic carcinoma predisposition	ClinVar	Clinvar:NM_018849.2(ABCB4):c.2211+1G>A AND Progressive familial intrahepatic cholestasis 3	chr7:g.87053221C>T	7	g.87053221C>T	C>T	point_mutation	87053221	NaN
1	ABL1	chr9:g.133750347A>G	p.Y393C	ENST00000318560	CSQN=Missense;reference_codon=TAC;candidate_codons=TGT,TGC;candidate_mnv_variants=chr9:g.133750347_133750348delACinsGT;aliases=ENSP00000323315;source=Ensembl	somatic	Chronic myeloid leukemia	Biomarker	PMID:21562040	chr9:g.133750347A>G	9	g.133750347A>G	A>G	point_mutation	133750347	NaN
3	ABL1	chr9:g.133750330G>T	p.L387F	ENST00000318560	CSQN=Missense;reference_codon=TTG;candidate_codons=TTT,TTC;candidate_snv_variants=chr9:g.133750330G>C;aliases=ENSP00000323315;source=Ensembl	somatic	Chronic myeloid leukemia	Biomarker	PMID:21562040	chr9:g.133750330G>T	9	g.133750330G>T	G>T	point_mutation	133750330	NaN
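The conversion from a raw gDNA string to the derived columns in Table 4 could be sketched as follows. This is not the HGVS reader used in the project, only a minimal illustration that handles the two formats described above (point mutations and deletion-insertions) and the “__” separator. The function names, the regular expressions, the “delins” and “unparsed” labels, and the use of `None` in place of NaN are assumptions made for this example.

```python
import re

# Point mutation, e.g. chr7:g.87053221C>T
POINT = re.compile(
    r"^chr(?P<chrom>[0-9]+|X|Y):g\.(?P<start>[0-9]+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)
# Deletion-insertion over a range, e.g. chr2:g.29445449_29445451delinsGCT
DELINS = re.compile(
    r"^chr(?P<chrom>[0-9]+|X|Y):g\.(?P<start>[0-9]+)_(?P<end>[0-9]+)delins(?P<seq>[ACGT]+)$"
)


def _record(full, chrom, mutation, kind, start, end):
    """Build one output row using the derived column names from Table 4."""
    return {
        "full_mutation": full,
        "chromosome": chrom,
        # Everything after "chr<N>:"; None if the string has no colon.
        "location": full.partition(":")[2] or None,
        "mutation": mutation,
        "simple_mutation": kind,
        "location_start": start,
        "location_end": end,
    }


def parse_gdna(gdna):
    """Split a gDNA value on '__' and parse each change into its own record."""
    records = []
    for full in gdna.split("__"):
        point = POINT.match(full)
        delins = DELINS.match(full)
        if point:
            change = f"{point['ref']}>{point['alt']}"
            records.append(_record(full, point["chrom"], change,
                                   "point_mutation", int(point["start"]), None))
        elif delins:
            records.append(_record(full, delins["chrom"], f"delins{delins['seq']}",
                                   "delins", int(delins["start"]), int(delins["end"])))
        else:
            # Formats this sketch does not cover are kept, but flagged.
            records.append(_record(full, None, None, "unparsed", None, None))
    return records


rows = parse_gdna("chr2:g.29443695G>T__chr2:g.29443695G>C__chr2:g.29443697A>G")
```

The three-part value yields three records, one per change, which mirrors the one-row-per-mutation layout of Table 4. Splitting on the double underscore is safe for ranges such as 29445449_29445451 because those use a single underscore.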

Dataset 4: Cancer acronyms

The final dataset comes from the additional .csv file in the zip file downloaded for dataset 3. It contains 113 rows, each corresponding to a different type of cancer, and two columns: one for the cancer acronym and one for the full cancer name. These full names match the values in the “Primary Tumor type full name” column of dataset 2.

Table 5. Cancer Acronyms Dictionary

	Acronym	Cancer
0	AA	Adrenal adenoma
1	AC	Adrenal cortex
2	ACA	Adrenal cortex adenoma
3	AG	Anaplastic oligodendroglioma
4	ALL	Acute lymphoblastic leukemia
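Since the full names match those used in dataset 2, this table can act as a lookup that expands the acronyms in dataset 3. The sketch below shows one way this might be done with a plain dictionary built from the five rows shown in Table 5. The names `ACRONYMS` and `expand_acronym` are hypothetical, and leaving unknown acronyms unchanged is an assumption of this example, not a documented behavior of the project.

```python
# The five rows shown in Table 5; the full file has 113 entries.
ACRONYMS = {
    "AA": "Adrenal adenoma",
    "AC": "Adrenal cortex",
    "ACA": "Adrenal cortex adenoma",
    "AG": "Anaplastic oligodendroglioma",
    "ALL": "Acute lymphoblastic leukemia",
}


def expand_acronym(acronym, lookup=ACRONYMS):
    """Return the full cancer name, or the acronym itself if it is not listed."""
    return lookup.get(acronym, acronym)


# "CML" is not among the five rows above, so it is returned unchanged.
expanded = [expand_acronym(code) for code in ["AA", "ALL", "CML"]]
```

Returning the acronym unchanged when it is missing keeps rows from being dropped during the lookup and makes unmatched codes easy to spot afterwards.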

Contact Me

Preferred email: catarina.bettencourt.a@gmail.com | School email: bettencourt.c@northeastern.edu
LinkedIn | Blue Sky

Genetic Patterns in Oncogenesis and Identifying Gaps in Treatment

Catarina Bettencourt
