# Random Forests vs F_{ST} for Insect Population Genetics

For my last comparison, I’ll look at the correlation between the variable importance measures (VIM) computed by Random Forests vs the scores calculated via F_{ST}. Previously, I analyzed the correlations between F_{ST} scores and associations computed using Cramer’s V and weights computed from Logistic Regression Ensembles.

In line with those analyses, I wanted to compare the variable importance measures computed by Random Forests with F_{ST}.

I again used the Burkina Faso *An. gambiae* and *An. coluzzii* samples from the Anopheles gambiae 1000 genomes project for the comparison. I trained the Random Forests with both the counts and categorical feature encodings using the `dev` branch of my population genetics methods exploration toolkit, Asaph. Plots of the variable importance measures from the Random Forests vs F_{ST} scores are below:

With the counts feature-encoding scheme, linear regression between the Random Forests variable importance measures and F_{ST} scores had . With the categories feature-encoding scheme, linear regression gave .
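For anyone who wants to try this kind of comparison, here is a minimal sketch of the workflow on synthetic data. It is not Asaph's code and does not reproduce the results above. The genotypes are simulated, the two encoding functions are my guess at what "counts" and "categories" encodings might look like, the F_{ST} estimate is a simple heterozygosity-based formula for two equally sized populations, and scikit-learn and SciPy are assumed stand-ins for the actual implementation. All function names are hypothetical.

```python
import numpy as np
from scipy.stats import linregress
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_per_pop, n_snps = 50, 200

# Simulated alternate-allele frequencies for two populations (not real data).
# The first 20 SNPs are shifted so that they differ between populations.
p1 = rng.uniform(0.1, 0.9, n_snps)
p2 = p1.copy()
p2[:20] = np.clip(p1[:20] + rng.choice([-0.5, 0.5], 20), 0.05, 0.95)

# Diploid genotypes stored as the number of alternate alleles (0, 1, or 2).
genotypes = np.vstack([
    rng.binomial(2, p1, size=(n_per_pop, n_snps)),
    rng.binomial(2, p2, size=(n_per_pop, n_snps)),
])
labels = np.repeat([0, 1], n_per_pop)


def counts_encoding(g):
    """Assumed 'counts' encoding: two columns per SNP (ref count, alt count)."""
    return np.stack([2 - g, g], axis=2).reshape(g.shape[0], -1)


def categories_encoding(g):
    """Assumed 'categories' encoding: one-hot genotype, three columns per SNP."""
    return (g[:, :, None] == np.arange(3)).astype(int).reshape(g.shape[0], -1)


def snp_importances(features, labels, n_snps):
    """Train a Random Forest and sum feature importances within each SNP."""
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(features, labels)
    return rf.feature_importances_.reshape(n_snps, -1).sum(axis=1)


def fst(g, labels):
    """Simple per-SNP F_ST = (H_T - H_S) / H_T for two equally sized populations."""
    p_a = g[labels == 0].mean(axis=0) / 2
    p_b = g[labels == 1].mean(axis=0) / 2
    p_t = (p_a + p_b) / 2
    h_s = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
    h_t = 2 * p_t * (1 - p_t)
    safe_h_t = np.where(h_t > 0, h_t, 1.0)  # avoid dividing by zero
    return np.where(h_t > 0, (h_t - h_s) / safe_h_t, 0.0)


fst_scores = fst(genotypes, labels)
r_squared = {}
for name, encode in [("counts", counts_encoding),
                     ("categories", categories_encoding)]:
    vim = snp_importances(encode(genotypes), labels, n_snps)
    # Squared Pearson correlation from a linear regression of VIM on F_ST.
    r_squared[name] = linregress(fst_scores, vim).rvalue ** 2
    print(f"{name}: r^2 = {r_squared[name]:.3f}")
```

Summing the importances of all columns that belong to one SNP gives a single score per SNP, which is what makes the two encodings comparable against the same vector of F_{ST} scores.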

## Conclusion

Variable importance measures computed for each SNP using Random Forests correlate reasonably well with the F_{ST} scores, regardless of the encoding scheme used. (Random Forests are particularly robust to the choice of encoding scheme.) It would be interesting to analyze variants where Random Forests and F_{ST} give substantially different scores.