Following my recent correlation analysis of Cramer’s V and FST, I wanted to compare FST against the weights calculated by the Logistic Regression Ensembles I recently discussed. Logistic Regression Ensembles are implemented in the dev branch of Asaph, my population methods exploration toolkit.

I again used the Burkina Faso An. gambiae and An. coluzzii samples from the Anopheles gambiae 1000 Genomes project for the comparison. I ran the Logistic Regression Ensembles in four configurations: with either the counts or the categories feature encoding, each with and without bagging. Plots of the weights from the Logistic Regression Ensembles vs FST scores are below:

FST vs LR Ensembles (counts) w/ Bagging

FST vs LR Ensembles (counts) w/o Bagging

FST vs LR Ensembles (categories) w/ Bagging

FST vs LR Ensembles (categories) w/o Bagging
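To make the with/without-bagging distinction concrete, here is a minimal sketch in plain Python. It is not Asaph's implementation: the gradient-descent fitter, the choice to average absolute weights across models, the toy data, and the function names `fit_logistic` and `ensemble_weights` are all assumptions made for illustration.

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=200, l2=0.01):
    """Fit a logistic regression by batch gradient descent.

    Returns the feature weights; the unpenalized intercept is discarded.
    """
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # L2 penalty applies to the feature weights only
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def ensemble_weights(X, y, n_models=10, bagging=True, seed=0):
    """Average the absolute feature weights over an ensemble of models.

    With bagging, each model is trained on a bootstrap resample of the
    samples (drawn with replacement); without it, on the full data set.
    Averaging absolute weights is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n = len(X)
    totals = [0.0] * len(X[0])
    for _ in range(n_models):
        if bagging:
            idx = [rng.randrange(n) for _ in range(n)]
        else:
            idx = list(range(n))
        w = fit_logistic([X[i] for i in idx], [y[i] for i in idx])
        for j, wj in enumerate(w):
            totals[j] += abs(wj)
    return [t / n_models for t in totals]


# Toy data (hypothetical): feature 0 separates the two groups, feature 1 does not.
X = [[2, 0], [2, 1], [2, 2], [2, 1], [0, 0], [0, 1], [0, 2], [0, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

with_bagging = ensemble_weights(X, y, bagging=True)
without_bagging = ensemble_weights(X, y, bagging=False)
```

Because the fitter in this sketch is deterministic, the models trained without bagging are identical and the ensemble reduces to a single model. A real implementation presumably gets its model-to-model variation from elsewhere, for example stochastic training.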

I ran a linear regression of the LR ensemble weights against the FST scores for each configuration and calculated the r² values:

Feature Encoding    Bagging?    r²
Counts              Yes         0.889
Counts              No          0.812
Categories          Yes         0.850
Categories          No          0.643
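The regression step above can be sketched in plain Python, assuming the reported values are the r² of a simple linear regression between ensemble weights and FST scores. With a single predictor, r² equals the squared Pearson correlation, which is what this function computes. The function name `r_squared` and the toy numbers are illustrative assumptions, not values from this analysis.

```python
def r_squared(x, y):
    """Return r^2 for a simple linear regression of y on x.

    For one predictor this is the squared Pearson correlation:
    r^2 = S_xy^2 / (S_xx * S_yy).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r^2 is undefined when either variable is constant")
    return (s_xy * s_xy) / (s_xx * s_yy)


# Hypothetical per-SNP values, for illustration only.
fst_scores = [0.0, 1.0, 2.0, 3.0]
lr_weights = [1.0, 3.0, 5.0, 7.0]  # exactly linear in fst_scores

perfect_fit = r_squared(fst_scores, lr_weights)
weak_fit = r_squared(fst_scores, [1.0, 0.0, 1.0, 0.0])
```

Because r² is symmetric in its two arguments, it does not matter which of the two quantities is treated as the predictor.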

Conclusion

When bagging is used, Logistic Regression Ensemble weights appear to correlate well with FST scores under both feature encodings (0.889 for counts, 0.850 for categories). Without bagging, the counts encoding still agrees reasonably well (0.812). The only real outlier is the categories encoding without bagging (0.643).

It’s important to mention that the correlation analysis only tells us how well these methods agree with FST. It does not necessarily tell us which method is better or worse. Substantial work remains to validate the results of the Logistic Regression Ensembles.