Data Summary

All data used in this project comes from the following library: Pasolli E, Schiffer L, Manghi P, Renson A, Obenchain V, Truong D, Beghini F, Malik F, Ramos M, Dowd J, Huttenhower C, Morgan M, Segata N, Waldron L (2017). “Accessible, curated metagenomic data through ExperimentHub.” Nat. Methods, 14(11), 1023–1024. ISSN 1548-7091, 1548-7105, doi:10.1038/nmeth.4468.

The R script and sample data and results used in this file can be found at https://github.com/bettencourt-c/-microbiome-pilot-analysis.

suppressPackageStartupMessages({
  library(tidyverse)
  library(phyloseq)
  library(vegan)
  library(ggplot2)
  library(cowplot)
  library(knitr)
  library(kableExtra)
})

FOLDER_PATH <- "20260301_PD_stool_no-abx_sex-All_BMIrange-0-Inf_minreads-0"
config <- readRDS(file.path(FOLDER_PATH, "data", "config.rds"))
meta_combined <- readRDS(file.path(FOLDER_PATH, "data", "meta_combined.rds"))

ps_rel_a  <- readRDS(file.path(FOLDER_PATH, "data", "relative_abundance_ps.rds"))
ps_path_a <- readRDS(file.path(FOLDER_PATH, "data", "pathway_abundance_ps.rds"))
ps_path_c <- readRDS(file.path(FOLDER_PATH, "data", "pathway_coverage_ps.rds"))

rel_div  <- file.path(FOLDER_PATH, "plots", "relative_abundance_combined-diversity.png")
rel_top5       <- file.path(FOLDER_PATH, "plots", "relative_abundance_genus_top-5-Genus.png")
rel_permanova  <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_permanova.csv"))
rel_betadisp   <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_betadisper.csv"))
rel_wilcox     <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_genus_wilcox_genus.csv"))
rel_direction  <- read.csv(file.path(FOLDER_PATH, "results", "relative_abundance_genus_direction_table.csv"))

path_a_permanova <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_permanova.csv"))
path_a_betadisp  <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_betadisper.csv"))
path_a_wilcox    <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_wilcox_feature.csv"))
path_a_direction <- read.csv(file.path(FOLDER_PATH, "results", "pathway_abundance_direction_table.csv"))
path_a_div       <- file.path(FOLDER_PATH, "plots", "pathway_abundance_combined-diversity.png")
path_a_top5      <- file.path(FOLDER_PATH, "plots", "pathway_abundance_top-5-Feature.png")

path_c_permanova <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_permanova.csv"))
path_c_betadisp  <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_betadisper.csv"))
path_c_wilcox    <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_wilcox_feature.csv"))
path_c_direction <- read.csv(file.path(FOLDER_PATH, "results", "pathway_coverage_direction_table.csv"))
path_c_div       <- file.path(FOLDER_PATH, "plots", "pathway_coverage_beta-diversity.png")
path_c_top5      <- file.path(FOLDER_PATH, "plots", "pathway_coverage_top-5-Feature.png")

The following parameters were selected. Using the available age range of the PD cohort, age-matched controls were selected. The size of the PD cohort was limited, and reduces the power of these results. I decided to continue with this limited cohort since the goal of this analysis was to test improve my R coding skills within the context of microbiome analysis. To balance the sample size between the limited PD cohort and extensive HC cohort, 23 HC samples were randomly selected. Participants taking antibiotics were excluded. Although not used in this analysis given the already limited sample, the R script is built to filter the cohort further based on gender, minimum sequencing read length, and BMI range. Additional parameters can be added to the script with minimal changes. Users can also select the type of sample data they wish to analyze. Currently, users cannot analyze gene families because my computer is not capable of handling the memory and disk requirements. I plan to continue troubleshooting ways to reduce computing power, as I am personally very interested in seeing how the gene families differ between cohort microbiomes.

as.data.frame(t(sapply(config, function(x) paste(x, collapse = ", ")))) %>%
  pivot_longer(everything(), names_to = "Parameter", values_to = "Value") %>%
  knitr::kable(caption = "Analysis Parameters", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Analysis Parameters
Parameter	Value
DIAGNOSIS	PD
SAMPLE_LOCATION	stool
data_to_pull	relative_abundance, pathway_abundance, pathway_coverage
acceptable_abx	no, NA
acceptable_gender	female, male, NA
MIN_READS	0
MIN_BMI	0
MAX_BMI	Inf

Summary statistics for both cohorts are available below.

meta_combined %>%
  group_by(disease) %>%
  summarize(
    N        = n(),
    Age      = paste0(round(mean(age, na.rm = TRUE), 1), " ± ", round(sd(age, na.rm = TRUE), 1)),
    BMI      = paste0(round(mean(BMI, na.rm = TRUE), 1), " ± ", round(sd(BMI, na.rm = TRUE), 1)),
    Male     = paste0(sum(gender == "male", na.rm = TRUE), " (", round(mean(gender == "male", na.rm = TRUE) * 100), "%)"),
    .groups  = "drop"
  ) %>%
  knitr::kable(caption = "Cohort Demographics", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Cohort Demographics
disease	N	Age	BMI	Male
healthy	23	58.7 ± 7.9	25.9 ± 3.8	12 (52%)
PD	23	66.7 ± 8.7	24.7 ± 2.6	23 (100%)

Relative Abundance Overview

Figure A displays that alpha diversity did not differ significantly between PD and HC groups, as determined by Wilcox rank test. This was selected over a t-test since microbiome data typically does not pass normality test. Figure B presents the between-group differences of the PD and HC cohorts.

knitr::include_graphics(rel_div)

Gut microbiome composition differed significantly between PD and healthy controls (PERMANOVA: R² = 0.055, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.541), confirming the result reflects true compositional differences rather than differences in within-group variability.

rel_permanova %>%
  knitr::kable(caption = "Relative Abundance PERMANOVA Results", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Relative Abundance PERMANOVA Results
Term	Df	SumOfSqs	R2	F	Pr..F.
Model	1	0.6931896	0.0552811	2.5747	0.001
Residual	44	11.8461719	0.9447189	NA	NA
Total	45	12.5393615	1.0000000	NA	NA

rel_betadisp %>%
  knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Relative Abundance", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Multivariate Homogeneity of Groups Dispersions in Relative Abundance
Term	Df	Sum.Sq	Mean.Sq	F	N.Perm	Pr..F.
Groups	1	0.0014981	0.0014981	0.4066816	999	0.541
Residuals	44	0.1620856	0.0036838	NA	NA	NA

The top five genera from each cohort is evaluated. When a genus is present in the top five for both cohorts, it is only listed once.

knitr::include_graphics(rel_top5)

The remaining table uses the results from the Wilcoxon rank-sum test with FDR Correction, but adds directionality to the differences between cohorts.

rel_direction %>%
  knitr::kable(caption = "Directional Change of Relative Abundance (Genus)", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Directional Change of Relative Abundance (Genus)
Feature	p_value	p_adjusted	healthy	PD	direction
Akkermansia	0.0006219	0.0021766	1.220584	7.387229	Increased in PD
Alistipes	0.0005601	0.0021766	3.163461	8.700576	Increased in PD
Bacteroides	0.0866029	0.1515551	30.709332	19.214478	Decreased in PD
Eubacterium	0.8604876	0.8604876	5.778391	6.256920	Increased in PD
Faecalibacterium	0.6925143	0.8079334	6.043400	7.125576	Increased in PD
Prevotella	0.1615553	0.2261774	5.315137	6.479969	Increased in PD
Roseburia	0.0223249	0.0520913	4.993906	2.746188	Decreased in PD

Pathway Abundance Overview

Pathway abundance represents the predicted metabolic activity of the microbiome, quantified as reads per kilobase (RPK). Significant differences in pathway abundance between groups suggest altered functional capacity, independent of which taxa are present. As with relative abundance, PERMANOVA was used to assess between-group compositional differences, with betadisper confirming result validity.

Gut microbiome functional capacity differed significantly between PD and healthy controls (PERMANOVA: R² = 0.646, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.129), confirming the result reflects true compositional differences rather than differences in within-group variability.

knitr::include_graphics(path_a_div)

path_a_permanova %>%
  knitr::kable(caption = "Pathway Abundance PERMANOVA Results", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Pathway Abundance PERMANOVA Results
Term	Df	SumOfSqs	R2	F	Pr..F.
Model	1	3.697569	0.6456247	80.16214	0.001
Residual	44	2.029550	0.3543753	NA	NA
Total	45	5.727119	1.0000000	NA	NA

path_a_betadisp %>%
  knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Pathway Abundance", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Multivariate Homogeneity of Groups Dispersions in Pathway Abundance
Term	Df	Sum.Sq	Mean.Sq	F	N.Perm	Pr..F.
Groups	1	0.0315933	0.0315933	2.435929	999	0.129
Residuals	44	0.5706674	0.0129697	NA	NA	NA

The top five pathways per cohort are shown below. Unknown and unintegrated pathways were not filtered prior to this analysis and may dominate the top features. Future iterations of this pipeline will include a filtering step to focus on annotated pathways.

knitr::include_graphics(path_a_top5)

path_a_direction %>%
  knitr::kable(caption = "Directional Change of Pathway Abundance", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Directional Change of Pathway Abundance
Feature	p_value	p_adjusted	healthy	PD	direction
UNINTEGRATED	0.0000000	0.0000000	0.6842501	0.1675525	Decreased in PD
UNINTEGRATED\|g__Bacteroides.s__Bacteroides_vulgatus	0.0120390	0.0120390	0.0511932	0.0053143	Decreased in PD
UNINTEGRATED\|g__Faecalibacterium.s__Faecalibacterium_prausnitzii	0.0006586	0.0008232	0.0335795	0.0115321	Decreased in PD
UNINTEGRATED\|unclassified	0.0000044	0.0000073	0.1983234	0.0613956	Decreased in PD
UNMAPPED	0.0000000	0.0000000	0.2708799	0.8203939	Increased in PD

Pathway Coverage Overview

Pathway coverage represents the proportion of a pathway’s component reactions detected in a sample, ranging from 0 to 1. Unlike pathway abundance which reflects expression level, coverage indicates whether a pathway is structurally present at all. Alpha diversity is not applicable to coverage data given its bounded nature, so only beta diversity is assessed here.

Pathway coverage composition differed between groups (PERMANOVA: R² = 0.086, p = 0.001), with no significant difference in group dispersion (betadisper: p = 0.747), confirming the result reflects true compositional differences.

knitr::include_graphics(path_c_div)

path_c_permanova %>%
  knitr::kable(caption = "Pathway Coverage PERMANOVA Results", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Pathway Coverage PERMANOVA Results
Term	Df	SumOfSqs	R2	F	Pr..F.
Model	1	0.9698315	0.0861643	4.1487	0.001
Residual	44	10.2857734	0.9138357	NA	NA
Total	45	11.2556049	1.0000000	NA	NA

path_c_betadisp %>%
  knitr::kable(caption = "Multivariate Homogeneity of Groups Dispersions in Pathway Coverage", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Multivariate Homogeneity of Groups Dispersions in Pathway Coverage
Term	Df	Sum.Sq	Mean.Sq	F	N.Perm	Pr..F.
Groups	1	0.0002899	0.0002899	0.0974597	999	0.747
Residuals	44	0.1308623	0.0029741	NA	NA	NA

The top five pathways per cohort are shown below. Unknown and unintegrated pathways were not filtered prior to this analysis and may dominate the top features. Future iterations of this pipeline will include a filtering step to focus on annotated pathways.

knitr::include_graphics(path_c_top5)

path_c_direction %>%
  knitr::kable(caption = "Directional Change of Pathway Coverage", booktabs = TRUE) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE,
    position          = "left",
    font_size         = 14
  ) %>%
  row_spec(0, bold = TRUE, color = "white", background = "#4E6478") %>%
  column_spec(1, bold = TRUE)

Directional Change of Pathway Coverage
Feature	p_value	p_adjusted	healthy	PD	direction
DTDPRHAMSYN-PWY: dTDP-L-rhamnose biosynthesis I	0.0204860	0.0307289	1	0.9999996	Decreased in PD
PWY-6386: UDP-N-acetylmuramoyl-pentapeptide biosynthesis II (lysine-containing)	0.0101373	0.0304118	1	0.9999799	Decreased in PD
PWY-7219: adenosine ribonucleotides de novo biosynthesis	0.1618858	0.1618858	1	1.0000000	Decreased in PD
UNINTEGRATED	NA	NA	1	1.0000000	Decreased in PD
UNINTEGRATED\|unclassified	NA	NA	1	1.0000000	Decreased in PD
UNMAPPED	NA	NA	1	1.0000000	Decreased in PD

Final Thoughts

Although it is promising to see statistically significant results in this analysis, it should be noted that without a review from a researcher with experience in host-microbiome analytics, the results drawn from this pilot project should be taken with a grain of salt. My current lab does not work with any form of microbiology, so the basis for each statistical analysis is based off of my understanding of the current literature.

There are multiple limitations in this study, first being the sample size. Given the complexity and volatility of microbial community composition, there are many additional factors that must be accounted for, such as sequencing read length, diet, activity level, probiotic use, and severity of PD. There are also variations in sequencing techniques and sample collection that are not accounted for in this analysis. Pre-processing steps of the data source are also a source of limitation. The Shannon diversity estimates presented in this project should be interpreted with caution, as the curatedMetagenomicData relative abundance values have been pre-processed and normalized, removing singletons from the data. While the between-group comparison remains valid, absolute diversity values may be underestimated relative to studies using raw count data.

Future analysis would include a multiple linear regression including but not limited to factors such as age, gender, and UPDRS score. Unknown and unintegrated values would also be filtered out for secondary analysis of known pathways. Additionally, I would be interested in evaluating whether a composite bacterial health score can be determined following analysis of the entire healthy control cohort. This composite score would ideally include the bacterial genera, gene families, and pathway information.

Gut Microbiome in Parkinson’s Disease (PD) vs Healthy Controls (HC)

Pilot analysis, not intended for research or clinical use

Catarina A Bettencourt

March 2026

Data Summary

Relative Abundance Overview

Pathway Abundance Overview

Pathway Coverage Overview

Final Thoughts