For my last comparison, I’ll look at the correlation between the variable importance measures (VIMs) computed by Random Forests and FST scores. Previously, I analyzed the correlations between FST scores and two other measures: associations computed using Cramer’s V and weights computed from Logistic Regression Ensembles.

As in those earlier analyses, I wanted to see how closely a model-derived score for each SNP tracks its FST score.

I again used the Burkina Faso An. gambiae and An. coluzzii samples from the Anopheles gambiae 1000 genomes project for the comparison. I trained the Random Forests with both the counts and categorical feature encodings using the dev branch of my population genetics methods exploration toolkit, Asaph. Plots of the variable importance measures from the Random Forests vs FST scores are below:

Fst vs Random Forests (counts)

Fst vs Random Forests (categories)
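
For readers unfamiliar with the two feature encodings, here is a minimal sketch of one plausible interpretation. It assumes biallelic SNPs with genotypes coded as the number of alternate-allele copies (0, 1, or 2). "Counts" is taken to mean using that number directly, and "categories" to mean a one-hot vector per genotype. The function names are hypothetical, and Asaph's actual implementation may differ.

```python
# Assumption: each genotype is the number of alternate-allele copies (0, 1, or 2).
GENOTYPES = (0, 1, 2)


def encode_counts(genotypes):
    """Counts encoding (assumed): one numeric feature per SNP."""
    return [[float(g) for g in sample] for sample in genotypes]


def encode_categories(genotypes):
    """Categories encoding (assumed): three one-hot features per SNP."""
    encoded = []
    for sample in genotypes:
        row = []
        for g in sample:
            # Exactly one of the three features is 1.0 for each SNP.
            row.extend(1.0 if g == category else 0.0 for category in GENOTYPES)
        encoded.append(row)
    return encoded


# Made-up toy data: two samples genotyped at three SNPs.
samples = [
    [0, 2, 1],
    [1, 1, 0],
]
counts = encode_counts(samples)
categories = encode_categories(samples)
print(counts)
print(categories)
```

Under these assumptions, the categories encoding triples the number of features, so a per-SNP importance would have to be aggregated (for example, summed) across the three columns belonging to each SNP.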

With the counts feature-encoding scheme, linear regression between the Random Forests variable importance measures and FST scores had . With the categories feature-encoding scheme, linear regression gave .
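
The regression step can be sketched in plain Python. This is an ordinary least-squares fit of importance on FST that also reports the coefficient of determination. The function name and the toy numbers are made up for illustration; they are not the values from this analysis.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept, plus r^2."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, r^2 is the squared Pearson correlation.
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


# Made-up per-SNP scores, for illustration only.
toy_fst = [0.02, 0.10, 0.35, 0.60, 0.80]
toy_vim = [0.001, 0.004, 0.010, 0.022, 0.025]
slope, intercept, r2 = fit_line(toy_fst, toy_vim)
print(f"slope={slope:.4f} intercept={intercept:.4f} r^2={r2:.3f}")
```

Because r^2 depends only on the strength of the linear relationship, it is unaffected by the different scales of the importance and FST scores.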


Variable importance measures computed for each SNP using Random Forests correlate reasonably well with the FST scores, regardless of the encoding scheme used. (Random Forests are particularly robust to the choice of encoding scheme.) It would be interesting to analyze variants where Random Forests and FST give substantially different scores.
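
One possible way to find such variants is to rank the SNPs under each score and look for the largest rank gaps. The sketch below assumes that approach; the helper names, SNP identifiers, and scores are all hypothetical.

```python
def ranks(scores):
    """Rank scores from highest (rank 0) to lowest; ties keep index order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


def most_discordant(snp_ids, vim, fst, top_k=2):
    """Return the top_k SNPs with the largest gap between their two ranks."""
    vim_ranks = ranks(vim)
    fst_ranks = ranks(fst)
    gaps = [(abs(v - f), snp) for snp, v, f in zip(snp_ids, vim_ranks, fst_ranks)]
    # Stable sort: SNPs with equal gaps stay in their original order.
    gaps.sort(key=lambda pair: -pair[0])
    return [(snp, gap) for gap, snp in gaps[:top_k]]


# Made-up identifiers and scores, for illustration only.
snp_ids = ["snp_a", "snp_b", "snp_c", "snp_d"]
toy_vim = [0.9, 0.1, 0.5, 0.2]
toy_fst = [0.8, 0.7, 0.4, 0.1]
discordant = most_discordant(snp_ids, toy_vim, toy_fst)
print(discordant)
```

Comparing ranks rather than raw values sidesteps the different scales of the two scores, at the cost of ignoring how far apart the scores are in absolute terms.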