All data used in this project comes from the following library: Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong D, Beghini F, Malik F, Ramos M, Dowd J, Huttenhower C, Morgan M, Segata N, Waldron L (2017). “Accessible, curated metagenomic data through ExperimentHub.” Nat. Methods, 14(11), 1023–1024. ISSN 1548-7091, 1548-7105, doi:10.1038/nmeth.4468.
The R script and sample data and results used in this file can be found at https://github.com/bettencourt-c/-microbiome-pilot-analysis.
suppressPackageStartupMessages({
library(tidyverse)
library(phyloseq)
library(vegan)
library(ggplot2)
library(cowplot)
library(knitr)
library(kableExtra)
})
FOLDER_PATH <- "20260301_PD_stool_no-abx_sex-All_BMIrange-0-Inf_minreads-0"
config <- readRDS(file.path(FOLDER_PATH, "data", "config.rds"))
meta_combined <- readRDS(file.path(FOLDER_PATH, "data", "meta_combined.rds"))
ps_rel_a <- readRDS(file.path(FOLDER_PATH, "data", "relative_abundance_ps.rds"))
ps_path_a <- readRDS(file.path(FOLDER_PATH, "data", "pathway_abundance_ps.rds"))
ps_path_c <- readRDS(file.path(FOLDER_PATH, "data", "pathway_coverage_ps.rds"))
rel_div <- file.path(FOLDER_PATH, "plots", "relative_abundance_combined-diversity.png")
rel_top5 <- file.path(FOLDER_PATH, "plots", "relative_abundance_genus_top-5-Genus.png")
rel_permanova <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_permanova.csv"))
rel_betadisp <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_betadisper.csv"))
rel_wilcox <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_genus_wilcox_genus.csv"))
rel_direction <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_genus_direction_table.csv"))
path_a_permanova <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_permanova.csv"))
path_a_betadisp <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_betadisper.csv"))
path_a_wilcox <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_wilcox_feature.csv"))
path_a_direction <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_direction_table.csv"))
path_a_div <- file.path(FOLDER_PATH, "plots", "pathway_abundance_combined-diversity.png")
path_a_top5 <- file.path(FOLDER_PATH, "plots", "pathway_abundance_top-5-Feature.png")
path_c_permanova <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_permanova.csv"))
path_c_betadisp <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_betadisper.csv"))
path_c_wilcox <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_wilcox_feature.csv"))
path_c_direction <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_direction_table.csv"))
path_c_div <- file.path(FOLDER_PATH, "plots", "pathway_coverage_beta-diversity.png")
path_c_top5 <- file.path(FOLDER_PATH, "plots", "pathway_coverage_top-5-Feature.png")
The following parameters were selected. Using the available age range of the PD cohort, age-matched controls were selected. The size of the PD cohort was limited, and reduces the power of these results. I decided to continue with this limited cohort since the goal of this analysis was to test improve my R coding skills within the context of microbiome analysis. To balance the sample size between the limited PD cohort and extensive HC cohort, 23 HC samples were randomly selected. Participants taking antibiotics were excluded. Although not used in this analysis given the already limited sample, the R script is built to filter the cohort further based on gender, minimum sequencing read length, and BMI range. Additional parameters can be added to the script with minimal changes. Users can also select the type of sample data they wish to analyze. Currently, users cannot analyze gene families because my computer is not capable of handling the memory and disk requirements. I plan to continue troubleshooting ways to reduce computing power, as I am personally very interested in seeing how the gene families differ between cohort microbiomes.
as.data.frame(t(sapply(config, function(x) paste(x, collapse = ", ")))) %>%
pivot_longer(everything(), names_to = "Parameter", values_to = "Value") %>%
knitr::kable(caption = "Analysis Parameters", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Parameter | Value |
|---|---|
| DIAGNOSIS | PD |
| SAMPLE_LOCATION | stool |
| data_to_pull | relative_abundance, pathway_abundance, pathway_coverage |
| acceptable_abx | no, NA |
| acceptable_gender | female, male, NA |
| MIN_READS | 0 |
| MIN_BMI | 0 |
| MAX_BMI | Inf |
Summary statistics for both cohorts are available below.
meta_combined %>%
group_by(disease) %>%
summarize(
N = n(),
Age = paste0(round(mean(age, na.rm = TRUE), 1), " ± ", round(sd(age, na.rm = TRUE), 1)),
BMI = paste0(round(mean(BMI, na.rm = TRUE), 1), " ± ", round(sd(BMI, na.rm = TRUE), 1)),
Male = paste0(sum(gender == "male", na.rm = TRUE), " (", round(mean(gender == "male", na.rm = TRUE) * 100), "%)"),
.groups = "drop"
) %>%
knitr::kable(caption = "Cohort Demographics", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| disease | N | Age | BMI | Male |
|---|---|---|---|---|
| healthy | 23 | 58.7 ± 7.9 | 25.9 ± 3.8 | 12 (52%) |
| PD | 23 | 66.7 ± 8.7 | 24.7 ± 2.6 | 23 (100%) |
Figure A displays that alpha diversity did not differ significantly between PD and HC groups, as determined by Wilcox rank test. This was selected over a t-test since microbiome data typically does not pass normality test. Figure B presents the between-group differences of the PD and HC cohorts.
knitr::include_graphics(rel_div)
Gut microbiome composition differed significantly between PD and healthy controls (PERMANOVA: R² = 0.055, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.541), confirming the result reflects true compositional differences rather than differences in within-group variability.
rel_permanova %>%
knitr::kable(caption = "Relative Abundance PERMANOVA Results", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Term | Df | SumOfSqs | R2 | F | Pr..F. |
|---|---|---|---|---|---|
| Model | 1 | 0.6931896 | 0.0552811 | 2.5747 | 0.001 |
| Residual | 44 | 11.8461719 | 0.9447189 | NA | NA |
| Total | 45 | 12.5393615 | 1.0000000 | NA | NA |
rel_betadisp %>%
knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Relative Abundance", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Term | Df | Sum.Sq | Mean.Sq | F | N.Perm | Pr..F. |
|---|---|---|---|---|---|---|
| Groups | 1 | 0.0014981 | 0.0014981 | 0.4066816 | 999 | 0.541 |
| Residuals | 44 | 0.1620856 | 0.0036838 | NA | NA | NA |
The top five genera from each cohort is evaluated. When a genus is present in the top five for both cohorts, it is only listed once.
knitr::include_graphics(rel_top5)
The remaining table uses the results from the Wilcoxon rank-sum test with FDR Correction, but adds directionality to the differences between cohorts.
rel_direction %>%
knitr::kable(caption = "Directional Change of Relative Abundance (Genus)", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Feature | p_value | p_adjusted | healthy | PD | direction |
|---|---|---|---|---|---|
| Akkermansia | 0.0006219 | 0.0021766 | 1.220584 | 7.387229 | Increased in PD |
| Alistipes | 0.0005601 | 0.0021766 | 3.163461 | 8.700576 | Increased in PD |
| Bacteroides | 0.0866029 | 0.1515551 | 30.709332 | 19.214478 | Decreased in PD |
| Eubacterium | 0.8604876 | 0.8604876 | 5.778391 | 6.256920 | Increased in PD |
| Faecalibacterium | 0.6925143 | 0.8079334 | 6.043400 | 7.125576 | Increased in PD |
| Prevotella | 0.1615553 | 0.2261774 | 5.315137 | 6.479969 | Increased in PD |
| Roseburia | 0.0223249 | 0.0520913 | 4.993906 | 2.746188 | Decreased in PD |
Pathway abundance represents the predicted metabolic activity of the microbiome, quantified as reads per kilobase (RPK). Significant differences in pathway abundance between groups suggest altered functional capacity, independent of which taxa are present. As with relative abundance, PERMANOVA was used to assess between-group compositional differences, with betadisper confirming result validity.
Gut microbiome functional capacity differed significantly between PD and healthy controls (PERMANOVA: R² = 0.646, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.129), confirming the result reflects true compositional differences rather than differences in within-group variability.
knitr::include_graphics(path_a_div)
path_a_permanova %>%
knitr::kable(caption = "Pathway Abundance PERMANOVA Results", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Term | Df | SumOfSqs | R2 | F | Pr..F. |
|---|---|---|---|---|---|
| Model | 1 | 3.697569 | 0.6456247 | 80.16214 | 0.001 |
| Residual | 44 | 2.029550 | 0.3543753 | NA | NA |
| Total | 45 | 5.727119 | 1.0000000 | NA | NA |
path_a_betadisp %>%
knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Pathway Abundance", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Term | Df | Sum.Sq | Mean.Sq | F | N.Perm | Pr..F. |
|---|---|---|---|---|---|---|
| Groups | 1 | 0.0315933 | 0.0315933 | 2.435929 | 999 | 0.129 |
| Residuals | 44 | 0.5706674 | 0.0129697 | NA | NA | NA |
The top five pathways per cohort are shown below. Unknown and unintegrated pathways were not filtered prior to this analysis and may dominate the top features. Future iterations of this pipeline will include a filtering step to focus on annotated pathways.
knitr::include_graphics(path_a_top5)
path_a_direction %>%
knitr::kable(caption = "Directional Change of Pathway Abundance", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Feature | p_value | p_adjusted | healthy | PD | direction |
|---|---|---|---|---|---|
| UNINTEGRATED | 0.0000000 | 0.0000000 | 0.6842501 | 0.1675525 | Decreased in PD |
| UNINTEGRATED|g__Bacteroides.s__Bacteroides_vulgatus | 0.0120390 | 0.0120390 | 0.0511932 | 0.0053143 | Decreased in PD |
| UNINTEGRATED|g__Faecalibacterium.s__Faecalibacterium_prausnitzii | 0.0006586 | 0.0008232 | 0.0335795 | 0.0115321 | Decreased in PD |
| UNINTEGRATED|unclassified | 0.0000044 | 0.0000073 | 0.1983234 | 0.0613956 | Decreased in PD |
| UNMAPPED | 0.0000000 | 0.0000000 | 0.2708799 | 0.8203939 | Increased in PD |
Pathway coverage represents the proportion of a pathway’s component reactions detected in a sample, ranging from 0 to 1. Unlike pathway abundance which reflects expression level, coverage indicates whether a pathway is structurally present at all. Alpha diversity is not applicable to coverage data given its bounded nature, so only beta diversity is assessed here.
Pathway coverage composition differed between groups (PERMANOVA: R² = 0.086, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.747), confirming the result reflects true compositional differences.
knitr::include_graphics(path_c_div)
path_c_permanova %>%
knitr::kable(caption = "Pathway Coverage PERMANOVA Results", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Term | Df | SumOfSqs | R2 | F | Pr..F. |
|---|---|---|---|---|---|
| Model | 1 | 0.9698315 | 0.0861643 | 4.1487 | 0.001 |
| Residual | 44 | 10.2857734 | 0.9138357 | NA | NA |
| Total | 45 | 11.2556049 | 1.0000000 | NA | NA |
path_c_betadisp %>%
knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Pathway Coverage", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Term | Df | Sum.Sq | Mean.Sq | F | N.Perm | Pr..F. |
|---|---|---|---|---|---|---|
| Groups | 1 | 0.0002899 | 0.0002899 | 0.0974597 | 999 | 0.747 |
| Residuals | 44 | 0.1308623 | 0.0029741 | NA | NA | NA |
The top five pathways per cohort are shown below. Unknown and unintegrated pathways were not filtered prior to this analysis and may dominate the top features. Future iterations of this pipeline will include a filtering step to focus on annotated pathways.
knitr::include_graphics(path_c_top5)
path_c_direction %>%
knitr::kable(caption = "Directional Change of Pathway Coverage", booktabs = TRUE) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "left",
font_size = 14
) %>%
row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
column_spec(1, bold = TRUE)
| Feature | p_value | p_adjusted | healthy | PD | direction |
|---|---|---|---|---|---|
| DTDPRHAMSYN-PWY: dTDP-L-rhamnose biosynthesis I | 0.0204860 | 0.0307289 | 1 | 0.9999996 | Decreased in PD |
| PWY-6386: UDP-N-acetylmuramoyl-pentapeptide biosynthesis II (lysine-containing) | 0.0101373 | 0.0304118 | 1 | 0.9999799 | Decreased in PD |
| PWY-7219: adenosine ribonucleotides de novo biosynthesis | 0.1618858 | 0.1618858 | 1 | 1.0000000 | Decreased in PD |
| UNINTEGRATED | NA | NA | 1 | 1.0000000 | Decreased in PD |
| UNINTEGRATED|unclassified | NA | NA | 1 | 1.0000000 | Decreased in PD |
| UNMAPPED | NA | NA | 1 | 1.0000000 | Decreased in PD |
Although it is promising to see statistically significant results in this analysis, it should be noted that without a review from a researcher with experience in host-microbiome analytics, the results drawn from this pilot project should be taken with a grain of salt. My current lab does not work with any form of microbiology, so the basis for each statistical analysis is based off of my understanding of the current literature.
There are multiple limitations in this study, first being the sample size. Given the complexity and volatility of microbial community composition, there are many additional factors that must be accounted for, such as sequencing read length, diet, activity level, probiotic use, and severity of PD. There are also variations in sequencing techniques and sample collection that are not accounted for in this analysis. Pre-processing steps of the data source are also a source of limitation. The Shannon diversity estimates presented in this project should be interpreted with caution, as the curatedMetagenomicData relative abundance values have been pre-processed and normalized, removing singletons from the data. While the between-group comparison remains valid, absolute diversity values may be underestimated relative to studies using raw count data.
Future analysis would include a multiple linear regression including but not limited to factors such as age, gender, and UPDRS score. Unknown and unintegrated values would also be filtered out for secondary analysis of known pathways. Additionally, I would be interested in evaluating whether a composite bacterial health score can be determined following analysis of the entire healthy control cohort. This composite score would ideally include the bacterial genera, gene families, and pathway information.