Visualizations of the optimized LR model for each disease phenotype
Features (balances) of the LR models are represented by underscore-linked NCBI taxonomic IDs (e.g. ‘384638__1735_371601’). Two underscores separate the numerator (on the left) from the denominator (on the right) of each balance, while species linked by a single underscore sit on the same side; in the example, the two species joined by one underscore are both located in the 3-part balance's denominator. The complete taxonomic lineage of each NCBI taxonomic ID can be checked in the NCBI Taxonomy database.
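These feature names can be split programmatically. The sketch below (Python) is a minimal illustration: the parsing mirrors the naming convention described above, while the ilr_balance helper assumes the standard isometric log-ratio balance formula, which may differ from the exact computation used to build these models.

```python
import math

def parse_balance_name(name):
    """Split a feature name such as '384638__1735_371601' into numerator
    and denominator NCBI taxonomic IDs ('__' separates the two sides,
    '_' joins species on the same side)."""
    numerator, denominator = name.split("__")
    return numerator.split("_"), denominator.split("_")

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def ilr_balance(numerator_abundances, denominator_abundances):
    """Standard ILR balance between two groups of strictly positive
    relative abundances (assumed formula, for illustration only)."""
    r, s = len(numerator_abundances), len(denominator_abundances)
    coefficient = math.sqrt(r * s / (r + s))
    return coefficient * math.log(
        geometric_mean(numerator_abundances) / geometric_mean(denominator_abundances)
    )

num_ids, den_ids = parse_balance_name("384638__1735_371601")
print(num_ids, den_ids)  # ['384638'] ['1735', '371601']
```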
Balance-based datasets and LR models
Heading abbreviations for each column of the table:
SbpMatrix: the SBP matrix [csv] and a plot of the species partitions and the sample-wise distribution of all balances [svg].
BalanceMatrix: the balance matrix [feature] and the sample-class mapping file [label].
Files4ML: the balance matrix in multiple formats [.arff, .libsvm, and .csv] compatible with various mainstream machine learning software (Weka, LIBSVM, or MATLAB).
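For illustration, the files listed above could be loaded and used to refit a logistic-regression model along the following lines (a hedged sketch: the file names 'balance_matrix.csv' and 'labels.csv' are placeholders for the actual downloaded files, and scikit-learn is used here rather than the software named above).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder file names; substitute the actual feature and label files.
features = pd.read_csv("balance_matrix.csv", index_col=0)   # samples x balances
labels = pd.read_csv("labels.csv", index_col=0).iloc[:, 0]  # sample -> class

labels = labels.loc[features.index]  # align samples between the two files
clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, features.values, labels.values, cv=5).mean())

# The .libsvm copy of the balance matrix can also be read directly, e.g.:
# from sklearn.datasets import load_svmlight_file
# X, y = load_svmlight_file("balance_matrix.libsvm")
```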
Performance of the models
The performance was evaluated with multiple metrics calculated with the PyCM Python package (see the sketch after the list of metric definitions below).
Heading abbreviations for each column of the table:
ACC: Accuracy, the proportion of correct predictions among all predictions made.
AUC: The area under the curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
F1: F1 score (also F-score or F-measure), a measure of a test's accuracy. The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
GI: Gini index, a chance-standardized variant of the AUC, taking values between 0 (no difference between the score distributions of the two classes) and 1 (complete separation between the two distributions). The Gini coefficient is a widely used metric in imbalanced-data learning.
IBA: Index of balanced accuracy, which combines an unbiased index of the overall accuracy with a measure of how dominant the class with the highest individual accuracy rate is.
J: Jaccard index, defined as the size of the intersection divided by the size of the union of the sample sets, is a statistic used for comparing the similarity and diversity of sample sets.
MCC: Matthews correlation coefficient, introduced by biochemist Brian W. Matthews in 1975, is used as a measure of the quality of binary (two-class) classifications. The MCC is, in essence, a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 is no better than random prediction, and −1 indicates total disagreement between prediction and observation.
PPV: Positive predictive value (also called precision), the proportion of positive predictions that correspond to the actual presence of the condition.
TNR: Specificity (also called the true negative rate) measures the proportion of negatives that are correctly identified.
TPR: Sensitivity (also called the true positive rate, the recall, or probability of detection in some fields) measures the proportion of positives that are correctly identified.
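For reference, every metric listed above is available as a per-class statistic of a PyCM ConfusionMatrix object. The sketch below uses toy prediction vectors purely to show the calls; the study's own predictions are not reproduced here.

```python
from pycm import ConfusionMatrix

# Toy vectors for illustration only (1 = case, 0 = control).
actual  = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
predict = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]

cm = ConfusionMatrix(actual_vector=actual, predict_vector=predict)

# class_stat holds per-class statistics keyed by metric name, then by class.
for metric in ("ACC", "AUC", "F1", "GI", "IBA", "J", "MCC", "PPV", "TNR", "TPR"):
    print(metric, cm.class_stat[metric])
```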