Dataset 1: Chromosome data
The first dataset is a table from Piovesan et al.'s article in BMC Research Notes [3]. The table was not available for download, so the .csv file was created manually. This dataset has 24 rows, one for each of the human chromosomes, and the following 6 columns:
- Chromosome number: includes chromosomes 1-22, as well as the X and Y chromosomes
- Length (bp): the length of the chromosome in base-pairs
- Length (cm): the length of the chromosome in centimeters
- Weight (pg): the weight of the chromosome in picograms
- Weight (fg): the weight of the chromosome in femtograms
- GC%: the percent of the chromosome that is composed of guanine and cytosine
Table 1. Human Chromosome Lengths and Weights
| | chromosome | length_bp | length_cm | length_error_cm | weight_pg | weight_fg | gc_percent |
---|---|---|---|---|---|---|---|
0 | 1 | 248956422 | 8.14 | 0.08 | 0.25 | 254.57 | 41.72 |
1 | 2 | 242193529 | 7.92 | 0.08 | 0.25 | 247.65 | 40.23 |
2 | 3 | 198295559 | 6.48 | 0.06 | 0.20 | 202.76 | 39.67 |
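Because the file was typed by hand, a quick consistency check can help catch transcription errors. The sketch below is only an illustration: the comma-separated layout and the helper name `load_chromosomes` are assumptions, and the three rows are the ones shown in Table 1. It parses the text with Python's standard `csv` module and checks that the picogram and femtogram columns agree (1 pg = 1000 fg) within the rounding of the picogram column.

```python
import csv
import io

# First three rows of Table 1, laid out as the hand-made .csv might look.
# The actual file layout is not shown in the text, so this is an assumption.
CSV_TEXT = """chromosome,length_bp,length_cm,length_error_cm,weight_pg,weight_fg,gc_percent
1,248956422,8.14,0.08,0.25,254.57,41.72
2,242193529,7.92,0.08,0.25,247.65,40.23
3,198295559,6.48,0.06,0.20,202.76,39.67
"""

FLOAT_COLUMNS = ("length_cm", "length_error_cm", "weight_pg", "weight_fg", "gc_percent")


def load_chromosomes(text):
    """Parse the CSV text into a list of dicts with numeric fields converted."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        # The chromosome label stays a string so that "X" and "Y" also fit.
        row = {"chromosome": raw["chromosome"], "length_bp": int(raw["length_bp"])}
        for key in FLOAT_COLUMNS:
            row[key] = float(raw[key])
        rows.append(row)
    return rows


chromosomes = load_chromosomes(CSV_TEXT)

for row in chromosomes:
    # weight_pg is rounded to two decimals, so allow up to 5 fg of rounding error.
    assert abs(row["weight_pg"] * 1000 - row["weight_fg"]) <= 5
```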
Dataset 2: Cancerous mutations and associated responses to therapeutics
This dataset was downloaded from the Cancer Genome Interpreter, described in the article “Cancer Genome Interpreter annotates the biological and clinical relevance of tumor alterations” [4]. It contains 1170 rows and 20 columns, of which the relevant ones are gene, drug status, drug family, drug, association, and primary tumor type full name. If a drug targets multiple mutations, it has multiple rows, one for each mutated gene. No cleaning was necessary for this dataset.
An example row with the relevant attributes would provide the following information:
- Drug: the name of the cancer drug
- Drug family: the type of drug
- Drug status: whether the drug has been approved for clinical use
- Primary tumor type full name: the type of cancer that this drug treats
- Gene: the gene that this drug is associated with
- Association: whether the drug works or does not work against that gene
Table 2. Clinical Relevance of Tumor Mutations
| | Biomarker | Gene | Alteration type | Alteration | Targeting | Drug status | Drug family | Drug | Association | Evidence level | Assay type | Source | Curator | Curation date | Primary Tumor type | Metastatic Tumor Type | TCGI included | Comments | Drug full name | Primary Tumor type full name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | ERBB2 amplification | ERBB2 | CNA | ERBB2:amp | NaN | Approved | ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy | Pertuzumab;Trastuzumab;Docetaxel | Responsive | FDA guidelines | NaN | PMID:23801166 | SDemajo;RShadrina | 13.05.2022 | BRCA | NaN | NaN | Drug combination | Pertuzumab (ERBB2 inhibitor)+ Trastuzumab (ERBB2 inhibitor) + Docetaxel (Chemotherapy) | Breast adenocarcinoma |
3 | ERBB2 overexpression | ERBB2 | EXPR | ERBB2:over | NaN | Approved | ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy | Pertuzumab;Trastuzumab;Docetaxel | Responsive | FDA guidelines | NaN | PMID:23801166 | SDemajo;RShadrina | 30.06.2022 | BRCA | NaN | NaN | Drug combination | Pertuzumab (ERBB2 inhibitor) + Trastuzumab (ERBB2 inhibitor) + Docetaxel (Chemotherapy) | Breast adenocarcinoma |
4 | ERBB2 amplification | ERBB2 | CNA | ERBB2:amp | NaN | Approved | ERBB2 inhibitor;Hormone therapy | Lapatinib;Letrozole | Responsive | FDA guidelines | NaN | FDA;PMID:19786658 | SDemajo;RShadrina | 13.05.2022 | BRCA | NaN | NaN | Drug combination | Lapatinib (ERBB2 inhibitor) + Letrozole (Hormone therapy) | Breast adenocarcinoma |
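In the rows shown in Table 2, drug combinations are stored as semicolon-separated lists in both the Drug and Drug family columns. A minimal sketch of pairing each drug with its family is given below. It assumes the two lists always have the same length and order, which holds for the rows shown but is not confirmed for the full dataset. The helper name `split_combination` is hypothetical.

```python
# Relevant fields from the first row shown in Table 2.
row = {
    "Gene": "ERBB2",
    "Drug status": "Approved",
    "Drug family": "ERBB2 inhibitor;ERBB2 inhibitor;Chemotherapy",
    "Drug": "Pertuzumab;Trastuzumab;Docetaxel",
    "Association": "Responsive",
    "Primary Tumor type full name": "Breast adenocarcinoma",
}


def split_combination(row):
    """Pair each drug in a semicolon-separated combination with its family.

    Assumes both fields list their items in the same order (an assumption
    based on the rows shown in Table 2).
    """
    drugs = row["Drug"].split(";")
    families = row["Drug family"].split(";")
    if len(drugs) != len(families):
        raise ValueError("drug and drug family counts differ")
    return list(zip(drugs, families))


pairs = split_combination(row)
for drug, family in pairs:
    print(f"{row['Gene']}: {drug} ({family}), {row['Association']}")
```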
Dataset 3: Validated cancer-associated mutations
The next dataset is a catalog of validated mutations that are associated with a predisposition to cancer or have been proven to drive tumor growth. This dataset was also downloaded from the Cancer Genome Interpreter [5]. There are 5601 rows, one for each mutation, and 9 columns. The relevant columns in this dataset are the following:
- Gene: the name of the gene
- gDNA: the mutation written in Human Genome Variation Society (HGVS) nomenclature, which combines letters and integers (e.g. chr7:g.87053221C>T) [6]
- Context: states whether a mutation is somatic or germline, which is relevant since germline mutations are hereditary
- Cancer acronym: the type of associated cancer, written as an acronym
Table 3. Original HGVS Cancer-associated mutations
| | gene | gdna | protein | transcript | info | context | cancer_acronym | source | reference |
---|---|---|---|---|---|---|---|---|---|
0 | ABCB4 | chr7:g.87053221C>T | . | ENST00000265723 | CSQN=SpliceDonorSNV;C2=SpliceDonorOfExon17_At_chr7:87053221;aliases=ENSP00000265723;source=Ensembl | germline | HC-PR | ClinVar | Clinvar:NM_018849.2(ABCB4):c.2211+1G>A AND Progressive familial intrahepatic cholestasis 3 |
1 | ABL1 | chr9:g.133738306G>A | p.E236K | ENST00000318560 | CSQN=Missense;codon_pos=133738306-133738307-133738308;ref_codon_seq=GAG;aliases=ENSP00000323315;source=Ensembl | somatic | CML | DoCM | PMID:21562040 |
2 | ABL1 | chr9:g.133738309A>G | p.M237V | ENST00000318560 | CSQN=Missense;reference_codon=ATG;candidate_codons=GTA,GTC,GTG,GTT;candidate_mnv_variants=chr9:g.133738309_133738311delATGinsGTA,chr9:g.133738309_133738311delATGinsGTC,chr9:g.133738309_133738311delATGinsGTT;aliases=ENSP00000323315;source=Ensembl | somatic | CML | Biomarker | PMID:21562040 |
In the gDNA value chr7:g.87053221C>T, the first section, “chr7”, indicates the chromosome on which the mutation is located. The “g” denotes that the values that follow are coordinates on a linear genomic reference sequence. The number that follows is the position of the mutation in base pairs. Lastly, “C>T” indicates a point mutation from cytosine to thymine.
This is the simplest type of HGVS notation in the dataset; other types of mutations require different formats. For example, a deletion-insertion mutation (chr2:g.29445449_29445451delinsGCT) provides two positions that mark the start and end of the deletion, followed by the base sequence that is inserted at that location. Another example is a mutation that could have two different results, or one that occurs in multiple locations. The value chr2:g.29443695G>T__chr2:g.29443695G>C__chr2:g.29443697A>G includes both: at the same position, the guanine can be changed to either thymine or cytosine, and a further change occurs two base pairs downstream. In every case, different changes are separated by two underscores, “__”.
Since each of these notations requires extensive cleaning, an HGVS reader was created to convert the HGVS notation into separate, cleaned columns that are easy to work with in Python. Below is the final cleaned dataset that was used for all further analysis. If a gene had multiple associated mutations, a separate row was created for each mutation.
Table 4. Processed HGVS Cancer-associated mutations
| | gene | gdna | protein | transcript | info | context | cancer_acronym | source | reference | full_mutation | chromosome | location | mutation | simple_mutation | location_start | location_end |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ABCB4 | chr7:g.87053221C>T | . | ENST00000265723 | CSQN=SpliceDonorSNV;C2=SpliceDonorOfExon17_At_chr7:87053221;aliases=ENSP00000265723;source=Ensembl | germline | Hepatic carcinoma predisposition | ClinVar | Clinvar:NM_018849.2(ABCB4):c.2211+1G>A AND Progressive familial intrahepatic cholestasis 3 | chr7:g.87053221C>T | 7 | g.87053221C>T | C>T | point_mutation | 87053221 | NaN |
1 | ABL1 | chr9:g.133750347A>G | p.Y393C | ENST00000318560 | CSQN=Missense;reference_codon=TAC;candidate_codons=TGT,TGC;candidate_mnv_variants=chr9:g.133750347_133750348delACinsGT;aliases=ENSP00000323315;source=Ensembl | somatic | Chronic myeloid leukemia | Biomarker | PMID:21562040 | chr9:g.133750347A>G | 9 | g.133750347A>G | A>G | point_mutation | 133750347 | NaN |
3 | ABL1 | chr9:g.133750330G>T | p.L387F | ENST00000318560 | CSQN=Missense;reference_codon=TTG;candidate_codons=TTT,TTC;candidate_snv_variants=chr9:g.133750330G>C;aliases=ENSP00000323315;source=Ensembl | somatic | Chronic myeloid leukemia | Biomarker | PMID:21562040 | chr9:g.133750330G>T | 9 | g.133750330G>T | G>T | point_mutation | 133750330 | NaN |
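The HGVS reader itself is not shown here, so the following is only a minimal sketch of how the columns in Table 4 could be derived with regular expressions. It handles just the two notations described above (point mutations and deletion-insertions) and raises an error for anything else. The function name `parse_gdna` and the label `deletion_insertion` are assumptions, while `point_mutation` is the label that appears in Table 4.

```python
import re

# Point mutation, e.g. chr7:g.87053221C>T
POINT = re.compile(
    r"^chr(?P<chrom>[0-9XY]+):g\.(?P<start>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)
# Deletion-insertion, e.g. chr2:g.29445449_29445451delinsGCT
DELINS = re.compile(
    r"^chr(?P<chrom>[0-9XY]+):g\.(?P<start>\d+)_(?P<end>\d+)delins(?P<ins>[ACGT]+)$"
)


def parse_gdna(gdna):
    """Return one dict per change in a gDNA value; changes are separated by '__'."""
    records = []
    for change in gdna.split("__"):
        point = POINT.match(change)
        delins = DELINS.match(change)
        if point:
            records.append({
                "full_mutation": change,
                "chromosome": point["chrom"],  # kept as text so X and Y fit
                "location": change.split(":", 1)[1],
                "mutation": f"{point['ref']}>{point['alt']}",
                "simple_mutation": "point_mutation",
                "location_start": int(point["start"]),
                "location_end": None,  # a point mutation has a single position
            })
        elif delins:
            records.append({
                "full_mutation": change,
                "chromosome": delins["chrom"],
                "location": change.split(":", 1)[1],
                "mutation": f"delins{delins['ins']}",
                "simple_mutation": "deletion_insertion",  # hypothetical label
                "location_start": int(delins["start"]),
                "location_end": int(delins["end"]),
            })
        else:
            # Other HGVS forms are out of scope for this sketch.
            raise ValueError(f"unsupported HGVS notation: {change}")
    return records


for record in parse_gdna("chr2:g.29443695G>T__chr2:g.29443695G>C__chr2:g.29443697A>G"):
    print(record["chromosome"], record["location_start"], record["mutation"])
```

Returning one record per change mirrors the one-row-per-mutation layout of Table 4.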
Dataset 4: Cancer acronyms
The final dataset comes from the additional .csv in the zip file from dataset 3. It contains two columns, one for the cancer acronym and one for the full cancer name. These full names match the “Primary Tumor type full name” column values in dataset 2. There are 113 rows in this dataset, each corresponding to a different type of cancer.
Table 5. Cancer Acronyms Dictionary
| | Acronym | Cancer |
---|---|---|
0 | AA | Adrenal adenoma |
1 | AC | Adrenal cortex |
2 | ACA | Adrenal cortex adenoma |
3 | AG | Anaplastic oligodendroglioma |
4 | ALL | Acute lymphoblastic leukemia |
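This acronym table works naturally as a lookup dictionary, for example to replace the acronyms in dataset 3 with full cancer names, as seen in the cancer_acronym column of Table 4. The sketch below is an assumption about how that lookup could be done, not the method actually used. The comma-separated layout and the helper names `load_acronyms` and `expand` are hypothetical, and only three acronym/name pairs that appear in Tables 3 through 5 are included.

```python
import csv
import io

# Three acronym/name pairs that appear in Tables 3 through 5.
# The real file has 113 rows; the comma-separated layout is an assumption.
ACRONYM_CSV = """Acronym,Cancer
ALL,Acute lymphoblastic leukemia
CML,Chronic myeloid leukemia
HC-PR,Hepatic carcinoma predisposition
"""


def load_acronyms(text):
    """Build an acronym -> full cancer name dictionary from CSV text."""
    return {r["Acronym"]: r["Cancer"] for r in csv.DictReader(io.StringIO(text))}


acronyms = load_acronyms(ACRONYM_CSV)


def expand(acronym):
    """Return the full cancer name, or the acronym unchanged if it is unknown."""
    return acronyms.get(acronym, acronym)


print(expand("CML"))
```

Falling back to the original acronym keeps unknown values visible rather than silently dropping them.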
Contact Me
Preferred email: catarina.bettencourt.a@gmail.com | School email: bettencourt.c@northeastern.edu