Data analysis

Data preprocessing

Flow cytometry. Event counts for every gate were exported, and frequencies relative to the relevant parent populations were calculated in R. Absolute cell counts were back-calculated using the counts/mL of blood for major lineages derived from the whole blood count panel (panel 6), where the equivalent of 25 μL of whole blood was analysed per sample. Median fluorescence intensities (MFI) were calculated using FlowJo for relevant markers on specific populations. For panels 1-4 and 6-7, a gate had to contain at least 30 events for its subpopulations to be investigated or its MFI to be measured.
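The arithmetic behind these steps can be sketched as follows. This is a Python illustration, not the authors' R code. The function names are hypothetical, and it assumes that a subset's absolute count is its frequency within a major lineage multiplied by that lineage's panel 6 count per mL.

```python
MIN_EVENTS = 30       # minimum events per gate (from the text)
BLOOD_VOLUME_UL = 25  # whole-blood equivalent analysed per sample in panel 6


def frequency_of_parent(child_events, parent_events):
    """Fraction of parent-gate events in the child gate.

    Returns None when the parent gate holds fewer than MIN_EVENTS events,
    i.e. when its subpopulations should not be investigated.
    """
    if parent_events < MIN_EVENTS:
        return None
    return child_events / parent_events


def lineage_count_per_ml(lineage_events):
    """Scale panel 6 events counted in 25 uL of blood up to 1 mL (1000 uL)."""
    return lineage_events * 1000 / BLOOD_VOLUME_UL


def absolute_count_per_ml(freq_of_lineage, lineage_events):
    """Assumed back-calculation: subset frequency x lineage count per mL."""
    return freq_of_lineage * lineage_count_per_ml(lineage_events)


# Hypothetical numbers: 120 of 600 lineage events fall in the subset gate,
# and panel 6 recorded 5000 events for that lineage.
freq = frequency_of_parent(120, 600)
subset_per_ml = absolute_count_per_ml(freq, 5000)
```

Returning None for sparse gates keeps them out of downstream summaries instead of reporting a frequency based on too few events.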

Serology. ELISA: data were normalised using min/max normalisation to allow comparison of samples across batches. Values above 0.15 were considered positive.
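A minimal Python sketch of min/max normalisation with the 0.15 cut-off is given below. It is not the authors' R code. The text does not say which readings define the minimum and maximum, so the sketch assumes they are taken within each batch; the readings are hypothetical.

```python
POSITIVE_CUTOFF = 0.15  # positivity threshold on the normalised scale (from the text)


def min_max_normalise(values):
    """Rescale one batch of readings to [0, 1].

    Assumption: the minimum and maximum are taken over the batch itself.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalise a batch with constant readings")
    return [(v - lo) / (hi - lo) for v in values]


batch = [0.05, 0.20, 0.65, 1.25]  # hypothetical raw readings from one batch
normalised = min_max_normalise(batch)
positive = [v > POSITIVE_CUTOFF for v in normalised]
```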

LIPS: LIPS data represent the average of three replicate experiments. Results are given as fold changes (FC = LU of sample / average LU of healthy control samples). A fold change greater than 4 was considered positive.
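The fold-change rule can be illustrated with a short Python sketch (not the authors' R code). It assumes that the three replicate LU values are averaged before the ratio is taken, which the text does not specify; all values are hypothetical.

```python
from statistics import mean

FC_CUTOFF = 4  # a fold change greater than 4 is considered positive (from the text)


def fold_change(sample_replicates_lu, healthy_control_lu):
    """FC = mean LU of the sample replicates / mean LU of healthy controls.

    Assumption: replicates are averaged first, then divided.
    """
    return mean(sample_replicates_lu) / mean(healthy_control_lu)


def is_positive(fc):
    return fc > FC_CUTOFF


# Hypothetical LU readings: three replicates of one sample, three healthy controls.
fc = fold_change([900, 1000, 1100], [180, 200, 220])
```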

Statistical methods

Cell subset counts (per mL of blood), LIPS assay values and cytokine concentrations were analysed after log10 transformation; all other parameters (cell subset frequencies, serology parameters) were analysed without any additional data transformation.

First, we identified parameters that differed between seropositive and seronegative controls by t-test (p < 0.05). For these parameters, only seronegative controls were taken into account in downstream testing.

We also identified a set of parameters with much higher intra-individual variation in sick individuals than in controls, by comparing the distributions of within-individual variation (SD) in healthy and sick individuals with a Wilcoxon rank test (p < 0.01 for SDs estimated from either two or three samples per individual). We interpreted these parameters as changing over the course of the disease. For these parameters, we reasoned that samples from a sick individual are so variable that they should be treated as independent measures, and we did not use weights in any downstream models.
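As an illustration of the first half of this step, the Python sketch below (not the authors' R code) computes the within-individual SD for each individual and groups the SDs by the number of samples they were estimated from. The record layout and function name are hypothetical. The resulting healthy and sick SD distributions would then be compared with the rank test, which is not shown here.

```python
from collections import defaultdict
from statistics import stdev


def within_individual_sd(records):
    """records: iterable of (individual_id, value) pairs for one parameter.

    Returns {n_samples: [SD of each individual with n_samples samples]}.
    Individuals with a single sample are skipped (no SD can be estimated).
    """
    by_individual = defaultdict(list)
    for individual, value in records:
        by_individual[individual].append(value)

    sds = defaultdict(list)
    for values in by_individual.values():
        if len(values) >= 2:
            sds[len(values)].append(stdev(values))
    return dict(sds)


# Hypothetical measurements: "a" has two samples, "b" three, "c" only one.
records = [("a", 1.0), ("a", 3.0), ("b", 2.0), ("b", 2.0), ("b", 2.0), ("c", 5.0)]
sd_by_n = within_individual_sd(records)
```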

Influence of age and sex was tested by comparing nested linear mixed models on healthy control data:

parameter ~ 1 + (1|patient) + weights

parameter ~ 1 + age + (1|patient) OR parameter ~ 1 + sex + (1|patient) + weights

parameter ~ 1 + age + sex + (1|patient) + weights

Where not enough samples were available to estimate patient effects, a linear model was used. For parameters with a significant sex or age influence, the estimated age/sex effects were subtracted from the raw parameter values, and the residuals were used for downstream statistical testing.
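The correction step can be sketched in Python for the simplest case: a plain linear model with age as the only covariate, fitted on healthy controls. This is an illustration under that assumption, not the authors' R code; the mixed-model case needs a dedicated package and is not shown. Function names and data are hypothetical.

```python
from statistics import mean


def fit_slope(x, y):
    """Ordinary least-squares slope of y on x (single covariate)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


def remove_age_effect(values, ages, slope):
    """Subtract the estimated age effect (slope x age) from raw values."""
    return [v - slope * a for v, a in zip(values, ages)]


# Hypothetical healthy-control data used to estimate the age effect.
control_ages = [20, 40, 60]
control_values = [1.0, 2.0, 3.0]
age_slope = fit_slope(control_ages, control_values)

# The same estimate is then removed from other samples before testing.
corrected = remove_age_effect([2.5, 4.0], [30, 60], age_slope)
```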

Main tests:

To test for differences between CP, controls, LRTI and severity groups, raw values (or residuals, where sex/age was significant) were compared by fitting linear mixed models:

parameter ~ control_CPstatus + (1|patient) + weights

parameter ~ control_CPstatus_severity + (1|patient) + weights

where severity was defined as Low for WHO 1-2, Moderate for WHO 3-4, Severe for WHO 5-8 and Healthy for controls. Where not enough samples were available to estimate patient effects, a linear model was used instead. Appropriate contrasts (moderate vs. healthy, severe vs. healthy, severe vs. moderate, CP vs. controls, LRTI vs. controls, LRTI vs. severity classes) were extracted, and effect sizes were estimated by dividing the difference between the estimated population means by the standard deviation of controls, unless stated otherwise (for age- and sex-corrected values, this was done on the age- and sex-corrected parameters).
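The severity classes and the effect-size definition can be sketched in Python as follows (not the authors' R code). As a simplifying assumption, the sketch uses raw sample means in place of the model-estimated means; function names and data are hypothetical.

```python
from statistics import mean, stdev


def severity_class(who_score, is_control=False):
    """Map a WHO score to the severity classes defined in the text."""
    if is_control:
        return "Healthy"
    if 1 <= who_score <= 2:
        return "Low"
    if 3 <= who_score <= 4:
        return "Moderate"
    if 5 <= who_score <= 8:
        return "Severe"
    raise ValueError(f"WHO score out of range: {who_score}")


def effect_size(group_values, control_values):
    """(mean of group - mean of controls) / SD of controls.

    Assumption: sample means stand in for the model-estimated means.
    """
    return (mean(group_values) - mean(control_values)) / stdev(control_values)


# Hypothetical values of one parameter in controls and in severe cases.
controls = [1.0, 2.0, 3.0]
severe = [4.0, 5.0, 6.0]
severe_vs_healthy = effect_size(severe, controls)
```

Scaling by the SD of controls expresses each contrast in units of normal between-individual variation, which makes parameters on different scales comparable.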

We include detailed results, with tests run across all combinations of the aforementioned factors (all controls/seronegative only; with correction for age/sex where appropriate/without any correction), so that the reader can easily judge the influence of our chosen analysis path on the statistics.

Hundreds of hypotheses were tested in parallel (flow cytometry parameters, serology and cytokine levels). These hypotheses are heavily interdependent, technically and in some cases biologically, e.g. the same subset of cells measured over different panels; complementary subsets of cells; related subsets of cells identified by different markers, etc. Therefore, we have provided the raw p-values for all comparisons except where stated otherwise. A high, conservative estimate of the number of independent hypotheses tested is 315, should one wish to use a Bonferroni correction.
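For readers who wish to apply the correction to the raw p-values, a minimal Python sketch follows. The family-wise level of 0.05 is an assumption and is not stated in the text; the function name is hypothetical.

```python
N_INDEPENDENT_TESTS = 315  # conservative estimate from the text
ALPHA = 0.05               # assumed family-wise significance level


def bonferroni_significant(p_value, n_tests=N_INDEPENDENT_TESTS, alpha=ALPHA):
    """True if a raw p-value passes the Bonferroni threshold alpha / n_tests."""
    return p_value < alpha / n_tests


# With these settings the per-test threshold is 0.05 / 315, roughly 1.6e-4.
threshold = ALPHA / N_INDEPENDENT_TESTS
```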


R code to replicate our analyses with the data:

For privacy reasons, we do not provide the exact age of participants. Parts of the code using this information are included, but you won’t be able to run them without it. We provide intermediate results files instead, so that the whole pipeline can be reproduced.