# Variable Selection with Logistic Regression Ensembles

*(02/16/2017) Thanks to feedback on the bioinformatics reddit, it’s been brought to my attention that most GWAS studies employ Logistic Regression for single-SNP association tests using software such as SNPTEST. This is different from the approach of incorporating all of the SNPs into a single Logistic Regression model as described below. Marchini, et al. and Balding have written some excellent reviews of statistical practices in GWAS that discuss single-SNP association tests and other approaches. I’ve changed the title to reflect that the effects of variance in the LR weights on variable selection are still valid.*

Logistic regression (LR) models are commonly used to identify SNPs that are correlated with phenotypic differences between populations. LR is particularly popular for genome-wide association studies (GWAS) of human diseases^{1,2,3,4,5,6,7,8,9,10}.

When applied to SNPs, this approach assigns samples to classes according to their phenotypes and encodes their variants as a feature matrix. An LR model is then trained, and the magnitudes of the model's weights are used to rank the variants, with the top-ranked variants selected for further exploration.

Genomes often have on the order of millions of variants. With such large data sizes, LR models often need to be trained with approximate, stochastic methods such as Stochastic Gradient Descent (SGD). These methods introduce randomness into the weights and consequently into the rankings. We decided to evaluate the consistency of the rankings.

## Comparison of Rankings from Two Logistic Regression Models

To demonstrate this effect, we trained a pair of LR models on variants from 149 *An. gambiae* and *An. coluzzii* mosquitoes in the *Anopheles* 1000 genomes dataset. We encoded each variant as two features, each storing the number of occurrences of one allele. We used the magnitudes (absolute values) of the weights from the models to rank the variants, then compared the membership of the top-ranked SNPs from each model using the Jaccard similarity. Only about 81% of the top 0.01% (466) of the ranked SNPs agreed between the two models. The following table contains the similarity at different thresholds:

Threshold (%) | Number of SNPs | Jaccard Similarity |
---|---|---|
0.01% | 466 | 80.7% |
0.1% | 4,662 | 83.8% |
1% | 46,620 | 79.2% |
10% | 466,204 | 76.6% |

This instability could have significant impacts on the reproducibility and correctness of these GWAS studies.
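The agreement measure used above can be sketched as follows. The Jaccard similarity of two top-*k* sets is the size of their intersection divided by the size of their union; the rankings here are random permutations standing in for weight-magnitude rankings from two independently trained models:

```python
# Sketch: Jaccard similarity between the top-k variants selected by two
# rankings. Random permutations stand in for real LR weight rankings.
import numpy as np

def top_k_jaccard(ranking_a, ranking_b, k):
    """Jaccard similarity of the top-k entries of two rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

rng = np.random.default_rng(0)
n_variants = 466_204
ranking_a = rng.permutation(n_variants)
ranking_b = rng.permutation(n_variants)

for pct in (0.0001, 0.001, 0.01, 0.1):
    k = int(n_variants * pct)
    print(f"top {pct:.2%}: Jaccard = {top_k_jaccard(ranking_a, ranking_b, k):.3f}")
```

For unrelated random rankings the similarity is near zero, so the 76–84% agreement in the table reflects models that are largely, but not fully, consistent.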

## Logistic Regression Ensembles

Leo Breiman realized that certain machine learning models (decision trees, linear regression, others) are unstable^{11} and proposed bagging^{12} as a solution. Breiman later used bagging in his Random Forests^{13} algorithm, where it became well known. Breiman's focus was on classifier accuracy, however, and not necessarily on calculating variable importance scores or using weights for ranking.

We employ an ensemble of Logistic Regression models to stabilize the feature weights and achieve consistent rankings. We trained pairs of ensembles of Logistic Regression models, normalized the weight vector from each model, and computed the average magnitude of the weights for each feature. We then used the averaged magnitudes to rank the SNPs. We repeated our analysis of the Jaccard similarities of the top 0.01%, 0.1%, 1%, and 10% of the ranked SNPs for ensembles with different numbers of models.

With ensembles of 250 models, we were able to achieve an agreement of 99% of the top 0.01% of SNPs.

## Conclusion

Logistic Regression models trained with stochastic methods such as Stochastic Gradient Descent (SGD) do not necessarily produce the same weights from run to run. This does not generally affect classification accuracy, especially in cases with a large number of correlated variables. However, the variations in the weights do affect analyses such as ranking and variable selection. Researchers should be cautious when using Logistic Regression weights for ranking.

We demonstrated that an ensemble approach can be used to stabilize the weights and consequently the resulting variable rankings. Further validation work will be needed to determine if Logistic Regression ensembles are a suitable solution, but our results are promising.

*The analyses presented here used the development branch of the software package Asaph.*