In line with my recent correlation analysis between Cramer’s V and FST, I wanted to compare the weights calculated by the Logistic Regression Ensembles I recently discussed with FST scores. Logistic Regression Ensembles are implemented in the dev branch of my population methods exploration toolkit Asaph.
I again used the Burkina Faso An. gambiae and An. coluzzii samples from the Anopheles gambiae 1000 genomes project for the comparison. I ran the Logistic Regression Ensembles with both the counts and categorical feature encodings and with and without bagging. Plots of the weights from the Logistic Regression Ensembles vs FST scores are below:
I performed linear regression on the FST scores and LR ensemble weights to calculate r² values:
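As a rough illustration of this kind of regression step, here is a minimal sketch that fits a least-squares line to two equal-length lists and reports the coefficient of determination. The function name `linear_regression_r2` and the synthetic `fst_scores` and `lr_weights` values are hypothetical stand-ins; this is not the Asaph code and not the data from the analysis.

```python
import random


def linear_regression_r2(x, y):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, r2), where r2 is the squared Pearson
    correlation between x and y.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain at least two distinct values")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r2


# Synthetic stand-ins (assumption): one FST score and one ensemble
# weight per SNP. Real values would come from the analysis pipeline.
rng = random.Random(0)
fst_scores = [rng.random() for _ in range(200)]
lr_weights = [0.8 * f + rng.gauss(0, 0.05) for f in fst_scores]

slope, intercept, r2 = linear_regression_r2(fst_scores, lr_weights)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r^2 = {r2:.3f}")
```

Running the same function once per configuration (counts or categorical encoding, with or without bagging) would give one value per plot, which makes the configurations directly comparable.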
When bagging is used, Logistic Regression Ensemble weights seem to correlate well with FST scores. The only real outlier is the configuration that uses the categorical feature encoding without bagging.
It’s important to mention that the correlation analysis only tells us how well these methods agree with FST. It does not necessarily tell us which method is better or worse. Substantial work remains to validate the results of the Logistic Regression Ensembles.