Visualizations of the optimized LR model for each disease phenotype
Features (balances) of the LR models are represented by underscore-linked NCBI taxonomic IDs (e.g. ‘384638__1735_371601’). Two underscores separate the numerator (on the left) from the denominator (on the right) of each balance, while species linked by a single underscore sit on the same side; in the example, the two species joined by one underscore are both located in the 3-part balance's denominator. The complete taxonomic lineage of each NCBI taxonomic ID can be checked in the NCBI Taxonomy database.
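These feature names can be split programmatically. The sketch below (Python) is a minimal illustration: the parsing mirrors the naming convention described above, while the ilr_balance helper assumes the standard isometric log-ratio balance formula, which may differ from the exact computation used to build these models.

```python
import math

def parse_balance_name(name):
    """Split a feature name such as '384638__1735_371601' into numerator
    and denominator NCBI taxonomic IDs ('__' separates the two sides,
    '_' joins species on the same side)."""
    numerator, denominator = name.split("__")
    return numerator.split("_"), denominator.split("_")

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def ilr_balance(numerator_abundances, denominator_abundances):
    """Standard ILR balance between two groups of strictly positive
    relative abundances (assumed formula, for illustration only)."""
    r, s = len(numerator_abundances), len(denominator_abundances)
    coefficient = math.sqrt(r * s / (r + s))
    return coefficient * math.log(
        geometric_mean(numerator_abundances) / geometric_mean(denominator_abundances)
    )

num_ids, den_ids = parse_balance_name("384638__1735_371601")
print(num_ids, den_ids)  # ['384638'] ['1735', '371601']
```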
Balance-based datasets and LR models
Heading abbreviations for each column of the table:
SbpMatrix: the SBP matrix [csv] and a plot of the species partitions and the sample-wise distribution of all balances [svg].
BalanceMatrix: the balance matrix [feature] and the sample-class mapping file [label].
Files4ML: the balance matrix in multiple formats [.arff, .libsvm, and .csv] compatible with various mainstream machine learning software (Weka, LIBSVM, or MATLAB).
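For illustration, the files listed above could be loaded and used to refit a logistic-regression model along the following lines (a hedged sketch: the file names 'balance_matrix.csv' and 'labels.csv' are placeholders for the actual downloaded files, and scikit-learn is used here rather than the software named above).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder file names; substitute the actual feature and label files.
features = pd.read_csv("balance_matrix.csv", index_col=0)   # samples x balances
labels = pd.read_csv("labels.csv", index_col=0).iloc[:, 0]  # sample -> class

labels = labels.loc[features.index]  # align samples between the two files
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, features.values, labels.values, cv=5).mean())

# The .libsvm copy of the balance matrix can also be read directly, e.g.:
# from sklearn.datasets import load_svmlight_file
# X, y = load_svmlight_file("balance_matrix.libsvm")
```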
Performance of the models
The performance was evaluated with multiple metrics calculated with the PyCM Python package (see the sketch after the list of metric definitions below).
Heading abbreviations for each column of the table:
ACC: Accuracy, the proportion of correct predictions among all predictions made.
AUC: The area under the curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
F1: F1 score (also F-score or F-measure), a measure of a test's accuracy. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
GI: Gini index, a chance-standardized variant of the AUC, taking values between 0 (no difference between the score distributions of the two classes) and 1 (complete separation between the two distributions). The Gini coefficient is a widely used metric in imbalanced-data learning.
IBA: Index of balanced accuracy, which combines an unbiased index of the overall accuracy with a measure of how dominant the class with the highest individual accuracy rate is.
J: Jaccard index, defined as the size of the intersection divided by the size of the union of the sample sets, is a statistic used for comparing the similarity and diversity of sample sets.
MCC: Matthews correlation coefficient, introduced by biochemist Brian W. Matthews in 1975, is used as a measure of the quality of binary (two-class) classifications. The MCC is, in essence, a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and −1 indicates total disagreement between prediction and observation.
PPV: Positive predictive value (also called precision), the proportion of positive predictions that correspond to the actual presence of the condition.
TNR: Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified.
TPR: Sensitivity (also called the true positive rate, the recall, or probability of detection in some fields) measures the proportion of positives that are correctly identified.
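For reference, every metric listed above is available as a per-class statistic of a PyCM ConfusionMatrix object. The sketch below uses toy prediction vectors purely to show the calls; the study's own predictions are not reproduced here.

```python
from pycm import ConfusionMatrix

# Toy vectors for illustration only (1 = case, 0 = control).
actual  = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predict = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predict)

# class_stat holds per-class statistics keyed by metric name, then by class.
for metric in ("ACC", "AUC", "F1", "GI", "IBA", "J", "MCC", "PPV", "TNR", "TPR"):
    print(metric, cm.class_stat[metric])
```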