Data Summary

All data used in this project comes from the following library: Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong D, Beghini F, Malik F, Ramos M, Dowd J, Huttenhower C, Morgan M, Segata N, Waldron L (2017). “Accessible, curated metagenomic data through ExperimentHub.” Nat. Methods, 14(11), 1023–1024. ISSN 1548-7091, 1548-7105, doi:10.1038/nmeth.4468.

The R script and sample data and results used in this file can be found at https://github.com/bettencourt-c/-microbiome-pilot-analysis.

suppressPackageStartupMessages({
  library(tidyverse)
  library(phyloseq)
  library(vegan)
  library(ggplot2)
  library(cowplot)
  library(knitr)
  library(kableExtra)
})

FOLDER_PATH <- "20260301_PD_stool_no-abx_sex-All_BMIrange-0-Inf_minreads-0"
config <- readRDS(file.path(FOLDER_PATH, "data", "config.rds"))
meta_combined <- readRDS(file.path(FOLDER_PATH, "data", "meta_combined.rds"))
ps_rel_a  <- readRDS(file.path(FOLDER_PATH, "data", "relative_abundance_ps.rds"))
ps_path_a <- readRDS(file.path(FOLDER_PATH, "data", "pathway_abundance_ps.rds"))
ps_path_c <- readRDS(file.path(FOLDER_PATH, "data", "pathway_coverage_ps.rds"))
rel_div  <- file.path(FOLDER_PATH, "plots", "relative_abundance_combined-diversity.png")
rel_top5       <- file.path(FOLDER_PATH, "plots", "relative_abundance_genus_top-5-Genus.png")
rel_permanova  <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_permanova.csv"))
rel_betadisp   <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_betadisper.csv"))
rel_wilcox     <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_genus_wilcox_genus.csv"))
rel_direction  <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_genus_direction_table.csv"))
path_a_permanova <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_permanova.csv"))
path_a_betadisp  <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_betadisper.csv"))
path_a_wilcox    <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_wilcox_feature.csv"))
path_a_direction <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_direction_table.csv"))
path_a_div       <- file.path(FOLDER_PATH, "plots", "pathway_abundance_combined-diversity.png")
path_a_top5      <- file.path(FOLDER_PATH, "plots", "pathway_abundance_top-5-Feature.png")
path_c_permanova <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_permanova.csv"))
path_c_betadisp  <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_betadisper.csv"))
path_c_wilcox    <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_wilcox_feature.csv"))
path_c_direction <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_direction_table.csv"))
path_c_div       <- file.path(FOLDER_PATH, "plots", "pathway_coverage_beta-diversity.png")
path_c_top5      <- file.path(FOLDER_PATH, "plots", "pathway_coverage_top-5-Feature.png")

The following parameters were selected. Using the available age range of the PD cohort, age-matched controls were selected. The size of the PD cohort was limited, and reduces the power of these results. I decided to continue with this limited cohort since the goal of this analysis was to test improve my R coding skills within the context of microbiome analysis. To balance the sample size between the limited PD cohort and extensive HC cohort, 23 HC samples were randomly selected. Participants taking antibiotics were excluded. Although not used in this analysis given the already limited sample, the R script is built to filter the cohort further based on gender, minimum sequencing read length, and BMI range. Additional parameters can be added to the script with minimal changes. Users can also select the type of sample data they wish to analyze. Currently, users cannot analyze gene families because my computer is not capable of handling the memory and disk requirements. I plan to continue troubleshooting ways to reduce computing power, as I am personally very interested in seeing how the gene families differ between cohort microbiomes.

as.data.frame(t(sapply(config, function(x) paste(x, collapse = ", ")))) %>%
  pivot_longer(everything(), names_to = "Parameter", values_to = "Value") %>%
  knitr::kable(caption = "Analysis Parameters", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Analysis Parameters
Parameter Value
DIAGNOSIS PD
SAMPLE_LOCATION stool
data_to_pull relative_abundance, pathway_abundance, pathway_coverage
acceptable_abx no, NA
acceptable_gender female, male, NA
MIN_READS 0
MIN_BMI 0
MAX_BMI Inf

Summary statistics for both cohorts are available below.

meta_combined %>%
  group_by(disease) %>%
  summarize(
    N        = n(),
    Age      = paste0(round(mean(age, na.rm = TRUE), 1), " ± ", round(sd(age, na.rm = TRUE), 1)),
    BMI      = paste0(round(mean(BMI, na.rm = TRUE), 1), " ± ", round(sd(BMI, na.rm = TRUE), 1)),
    Male     = paste0(sum(gender == "male", na.rm = TRUE), " (", round(mean(gender == "male", na.rm = TRUE) * 100), "%)"),
    .groups  = "drop"
  ) %>%
  knitr::kable(caption = "Cohort Demographics", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Cohort Demographics
disease N Age BMI Male
healthy 23 58.7 ± 7.9 25.9 ± 3.8 12 (52%)
PD 23 66.7 ± 8.7 24.7 ± 2.6 23 (100%)

Relative Abundance Overview

Figure A displays that alpha diversity did not differ significantly between PD and HC groups, as determined by Wilcox rank test. This was selected over a t-test since microbiome data typically does not pass normality test. Figure B presents the between-group differences of the PD and HC cohorts.

knitr::include_graphics(rel_div)

Gut microbiome composition differed significantly between PD and healthy controls (PERMANOVA: R² = 0.055, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.541), confirming the result reflects true compositional differences rather than differences in within-group variability.

rel_permanova %>%
  knitr::kable(caption = "Relative Abundance PERMANOVA Results", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Relative Abundance PERMANOVA Results
Term Df SumOfSqs R2 F Pr..F.
Model 1 0.6931896 0.0552811 2.5747 0.001
Residual 44 11.8461719 0.9447189 NA NA
Total 45 12.5393615 1.0000000 NA NA
rel_betadisp %>%
  knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Relative Abundance", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Multivariate Homogeneity of Groups Dispersions in Relative Abundance
Term Df Sum.Sq Mean.Sq F N.Perm Pr..F.
Groups 1 0.0014981 0.0014981 0.4066816 999 0.541
Residuals 44 0.1620856 0.0036838 NA NA NA

The top five genera from each cohort is evaluated. When a genus is present in the top five for both cohorts, it is only listed once.

knitr::include_graphics(rel_top5)

The remaining table uses the results from the Wilcoxon rank-sum test with FDR Correction, but adds directionality to the differences between cohorts.

rel_direction %>%
  knitr::kable(caption = "Directional Change of Relative Abundance (Genus)", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Directional Change of Relative Abundance (Genus)
Feature p_value p_adjusted healthy PD direction
Akkermansia 0.0006219 0.0021766 1.220584 7.387229 Increased in PD
Alistipes 0.0005601 0.0021766 3.163461 8.700576 Increased in PD
Bacteroides 0.0866029 0.1515551 30.709332 19.214478 Decreased in PD
Eubacterium 0.8604876 0.8604876 5.778391 6.256920 Increased in PD
Faecalibacterium 0.6925143 0.8079334 6.043400 7.125576 Increased in PD
Prevotella 0.1615553 0.2261774 5.315137 6.479969 Increased in PD
Roseburia 0.0223249 0.0520913 4.993906 2.746188 Decreased in PD

Pathway Abundance Overview

Pathway abundance represents the predicted metabolic activity of the microbiome, quantified as reads per kilobase (RPK). Significant differences in pathway abundance between groups suggest altered functional capacity, independent of which taxa are present. As with relative abundance, PERMANOVA was used to assess between-group compositional differences, with betadisper confirming result validity.

Gut microbiome functional capacity differed significantly between PD and healthy controls (PERMANOVA: R² = 0.646, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.129), confirming the result reflects true compositional differences rather than differences in within-group variability.

knitr::include_graphics(path_a_div)

path_a_permanova %>%
  knitr::kable(caption = "Pathway Abundance PERMANOVA Results", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Pathway Abundance PERMANOVA Results
Term Df SumOfSqs R2 F Pr..F.
Model 1 3.697569 0.6456247 80.16214 0.001
Residual 44 2.029550 0.3543753 NA NA
Total 45 5.727119 1.0000000 NA NA
path_a_betadisp %>%
  knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Pathway Abundance", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Multivariate Homogeneity of Groups Dispersions in Pathway Abundance
Term Df Sum.Sq Mean.Sq F N.Perm Pr..F.
Groups 1 0.0315933 0.0315933 2.435929 999 0.129
Residuals 44 0.5706674 0.0129697 NA NA NA

The top five pathways per cohort are shown below. Unknown and unintegrated pathways were not filtered prior to this analysis and may dominate the top features. Future iterations of this pipeline will include a filtering step to focus on annotated pathways.

knitr::include_graphics(path_a_top5)

path_a_direction %>%
  knitr::kable(caption = "Directional Change of Pathway Abundance", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Directional Change of Pathway Abundance
Feature p_value p_adjusted healthy PD direction
UNINTEGRATED 0.0000000 0.0000000 0.6842501 0.1675525 Decreased in PD
UNINTEGRATED&#124;g__Bacteroides.s__Bacteroides_vulgatus 0.0120390 0.0120390 0.0511932 0.0053143 Decreased in PD
UNINTEGRATED&#124;g__Faecalibacterium.s__Faecalibacterium_prausnitzii 0.0006586 0.0008232 0.0335795 0.0115321 Decreased in PD
UNINTEGRATED&#124;unclassified 0.0000044 0.0000073 0.1983234 0.0613956 Decreased in PD
UNMAPPED 0.0000000 0.0000000 0.2708799 0.8203939 Increased in PD

Pathway Coverage Overview

Pathway coverage represents the proportion of a pathway’s component reactions detected in a sample, ranging from 0 to 1. Unlike pathway abundance which reflects expression level, coverage indicates whether a pathway is structurally present at all. Alpha diversity is not applicable to coverage data given its bounded nature, so only beta diversity is assessed here.

Pathway coverage composition differed between groups (PERMANOVA: R² = 0.086, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.747), confirming the result reflects true compositional differences.

knitr::include_graphics(path_c_div)

path_c_permanova %>%
  knitr::kable(caption = "Pathway Coverage PERMANOVA Results", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Pathway Coverage PERMANOVA Results
Term Df SumOfSqs R2 F Pr..F.
Model 1 0.9698315 0.0861643 4.1487 0.001
Residual 44 10.2857734 0.9138357 NA NA
Total 45 11.2556049 1.0000000 NA NA
path_c_betadisp %>%
  knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Pathway Coverage", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Multivariate Homogeneity of Groups Dispersions in Pathway Coverage
Term Df Sum.Sq Mean.Sq F N.Perm Pr..F.
Groups 1 0.0002899 0.0002899 0.0974597 999 0.747
Residuals 44 0.1308623 0.0029741 NA NA NA

The top five pathways per cohort are shown below. Unknown and unintegrated pathways were not filtered prior to this analysis and may dominate the top features. Future iterations of this pipeline will include a filtering step to focus on annotated pathways.

knitr::include_graphics(path_c_top5)

path_c_direction %>%
  knitr::kable(caption = "Directional Change of Pathway Coverage", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)
Directional Change of Pathway Coverage
Feature p_value p_adjusted healthy PD direction
DTDPRHAMSYN-PWY: dTDP-L-rhamnose biosynthesis I 0.0204860 0.0307289 1 0.9999996 Decreased in PD
PWY-6386: UDP-N-acetylmuramoyl-pentapeptide biosynthesis II (lysine-containing) 0.0101373 0.0304118 1 0.9999799 Decreased in PD
PWY-7219: adenosine ribonucleotides de novo biosynthesis 0.1618858 0.1618858 1 1.0000000 Decreased in PD
UNINTEGRATED NA NA 1 1.0000000 Decreased in PD
UNINTEGRATED&#124;unclassified NA NA 1 1.0000000 Decreased in PD
UNMAPPED NA NA 1 1.0000000 Decreased in PD

Final Thoughts

Although it is promising to see statistically significant results in this analysis, it should be noted that without a review from a researcher with experience in host-microbiome analytics, the results drawn from this pilot project should be taken with a grain of salt. My current lab does not work with any form of microbiology, so the basis for each statistical analysis is based off of my understanding of the current literature.

There are multiple limitations in this study, first being the sample size. Given the complexity and volatility of microbial community composition, there are many additional factors that must be accounted for, such as sequencing read length, diet, activity level, probiotic use, and severity of PD. There are also variations in sequencing techniques and sample collection that are not accounted for in this analysis. Pre-processing steps of the data source are also a source of limitation. The Shannon diversity estimates presented in this project should be interpreted with caution, as the curatedMetagenomicData relative abundance values have been pre-processed and normalized, removing singletons from the data. While the between-group comparison remains valid, absolute diversity values may be underestimated relative to studies using raw count data.

Future analysis would include a multiple linear regression including but not limited to factors such as age, gender, and UPDRS score. Unknown and unintegrated values would also be filtered out for secondary analysis of known pathways. Additionally, I would be interested in evaluating whether a composite bacterial health score can be determined following analysis of the entire healthy control cohort. This composite score would ideally include the bacterial genera, gene families, and pathway information.