RJ NowlingPersonal website and blog for RJ Nowling. Data science engineer with a Ph.D. in Computer Science & Engineering with experience in computational physics, bioinformatics, machine learning, and distributed systems.
http://rnowling.github.io/
Thu, 14 Jun 2018 04:04:35 +0000Jekyll v3.7.3Arthropod Genomics Symposium 2018<p>Last weekend, I attended the 11th Annual <a href="http://arthropod.igb.illinois.edu/welcome">Arthropod Genomics Symposium (AGS)</a>; this was my third time attending AGS. AGS is a smaller conference (~150 attendees) but attracts a healthy mixture of new and established researchers in arthropod genomics. I’ve found everyone to be incredibly nice, open, and supportive – echoing a statement by <a href="http://rmwaterhouse.org/">Rob Waterhouse</a>, it feels like a gathering of old friends.</p>
<p>An excellent set of talks formed the core of the conference program. A few highlights that stuck with me:</p>
<ul>
<li>
<p>“Clonal genome evolution and rapid invasive spread of the marbled crayfish” (Julian Gutenkunst): Remarkably, <a href="https://en.wikipedia.org/wiki/Marbled_crayfish">marbled crayfish</a>, an invasive species discovered in the 1990s, have three sets of 92 chromosomes and reproduce asexually. In other words, offspring are effectively clones of their mothers. Population genetics analysis has identified as few as four structural variations between populations (compare this with the millions of SNPs observed between mosquito populations).</p>
</li>
<li>
<p>“Insights into the genome organization and non-coding genes of <em>Diaphorina citri</em>, the vector of citrus greening disease” (<a href="https://btiscience.org/explore-bti/directory/surya-saha/">Surya Saha</a>): As a native Floridian, I’ve heard about citrus greening and its impact on Florida citrus production; it was exciting to see two parts of my world intersect. In addition to presenting interesting insights into the biology of the <em>Diaphorina</em> vector, Surya presented on an <a href="https://arxiv.org/abs/1805.03602">large-scale effort</a> with his colleague <a href="https://btiscience.org/explore-bti/directory/prashant-hosmani/">Prashant Hosmani</a> to engage undergraduates in manual gene annotation. Their consortium spans multiple institutions (including a community college and researchers from the U.S. Department of Agriculture) across multiple states.</p>
</li>
<li>
<p><a href="https://www.hgsc.bcm.edu/people/richards-s">Stephen “Fringy” Richards</a>: Fringy gave an update on the <a href="http://i5k.github.io/">i5k</a> project, which aims to sequence 5,000 arthropod genomes. With more than 200 genomes sequenced, Fringy announced the completion of the pilot project. He gave an overview of what has been learned, both in terms of sequencing pipelines and from comparative analysis of the genomes sequenced thus far. Fringy finished his talk by presenting the audience with a challenge: how do we fund the sequencing of more genomes?</p>
</li>
<li>
<p>“Reproductive Worker Honey Bees: A Glimpse of Ancestral Sociality” (<a href="https://berylmjones.weebly.com/">Beryl Jones</a>): Beryl, a graduate student in the <a href="https://www.life.illinois.edu/robinson/">Robinson</a> group at UIUC, described novel observations of cooperative social behavior in worker bees after the loss of the queen. To study this behavior, Beryl used an <a href="http://www.pnas.org/content/115/7/1433.short">automated behavior tracking</a> system which combines barcoding of individual bees, video recordings, and deep learning to classify social interactions appearing in individual frames. The observational study was followed up with an RNA-Seq gene expression analysis.</p>
</li>
</ul>
<p>Beyond the talks, AGS provided plenty of opportunities for one-on-one and small-group discussions. Since I’ll be starting a tenure-track-equivalent position in the Fall, I appreciated hearing about the experiences of new PIs and the advice given to them by more experienced PIs. My <a href="/publications/AGS_2018.pdf">poster</a> on unsupervised population genetics led to exciting discussions and feedback as well as potential leads on new data sets. Other researchers also gave me great feedback and ideas on how to engage undergraduates and potential problems to work on.</p>
<p>Overall, AGS was a wonderful experience. I can’t wait for AGS 2019, to be held at Kansas State University.</p>
Sun, 10 Jun 2018 12:13:19 +0000
http://rnowling.github.io/bioinformatics/2018/06/10/ags-2018.html
http://rnowling.github.io/bioinformatics/2018/06/10/ags-2018.htmlinsectsconferencesbioinformaticsExploring CNNs for Classifying Chemosensory Receptors<p>In my <a href="/bioinformatics/2018/05/20/chemosensory-lstm.html">previous post</a>, I explored using a <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Long Short-Term Memory</a> deep learning architecture to classify protein sequences as either olfactory or gustatory receptors. Check out that blog post for background on insect chemosensory receptors and the data set I’m using. With my LSTM model, I achieved an accuracy of 96.4% after 50 epochs of training. I hypothesized that <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural networks (CNNs)</a> might also be a useful model, especially if multiple layers of CNNs were used to correspond to the different levels of organization (primary, secondary, and tertiary) within protein structures.</p>
<p>Wang, et al.’s paper on <a href="https://www.nature.com/articles/srep18962">predicting protein secondary structure using deep convolutional neural fields</a> inspired me to give CNNs a try and proved to be a useful guide. I decided to start with a single layer 1D convolutional network since that’s the easiest to get started with and used the <a href="https://github.com/keras-team/keras/blob/ce4947cbaf380589a63def4cc6eb3e460c41254f/examples/imdb_cnn.py">Keras IMDB CNN example</a> as a reference. My final deep learning model consisted of 3 layers:</p>
<ul>
<li>1D convolutional layer with 16 filters and a kernel size of <script type="math/tex">11\times20</script> using the ReLU activation function</li>
<li>1D global max pooling layer</li>
<li>Single unit output layer with a sigmoid activation function</li>
</ul>
<p>After 50 epochs, the CNN model achieved an accuracy of 99.6% – a noticeable improvement over the 96.4% achieved by the LSTM model.</p>
<p>In their work, Wang, et al. used a window size of 11 residues, since, as they note, <script type="math/tex">\alpha</script>-helices have an average length of 11 residues. I experimented with using kernel sizes of 7, 9, and 13 as well, but found that a kernel size of 11 gave me the best performance.</p>
<p>So, how does this work? My current understanding is as follows: We treat each amino acid in a protein sequence as a categorical variable, so each protein sequence of <script type="math/tex">N</script> residues is encoded as an <script type="math/tex">N \times 20</script> matrix. For each amino acid (row), the 1D convolutional layer convolves an <script type="math/tex">11\times20</script> kernel with entries from a sliding window (5 positions before and after) to calculate a new value for each position. We have 16 kernels in our setup, so the output for a single sequence is <script type="math/tex">(N, 16)</script>. The global max pooling layer then finds the maximum value across all of the positions for each kernel, producing a vector of length 16. This resulting vector is passed into the output layer.</p>
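My understanding of the shapes involved can be checked with a small NumPy sketch. The filter count and kernel size are the ones from the model above; the sequence length and random weights are invented for illustration, and I zero-pad the sequence so the output length matches the <script type="math/tex">(N, 16)</script> described:

```python
import numpy as np

rng = np.random.default_rng(42)

N = 50            # residues in the example sequence (invented)
n_filters = 16
kernel_size = 11  # 5 positions before and after, plus the center

# One-hot encoded protein sequence: an N x 20 matrix
seq = np.zeros((N, 20))
seq[np.arange(N), rng.integers(0, 20, size=N)] = 1.0

# 16 kernels, each 11 x 20 (random weights, purely for illustration)
kernels = rng.normal(size=(n_filters, kernel_size, 20))

# Zero-pad so every position gets a full window and the output length is N
pad = kernel_size // 2
padded = np.vstack([np.zeros((pad, 20)), seq, np.zeros((pad, 20))])

conv = np.zeros((N, n_filters))
for i in range(N):
    window = padded[i:i + kernel_size, :]
    for k in range(n_filters):
        # "convolve" = elementwise product and sum, followed by ReLU
        conv[i, k] = max(np.sum(window * kernels[k]), 0.0)

# Global max pooling: maximum over positions for each kernel
pooled = conv.max(axis=0)

print(conv.shape)    # (50, 16)
print(pooled.shape)  # (16,)
```

In Keras, the same shapes fall out of stacking a `Conv1D` layer with a `GlobalMaxPooling1D` layer; the loops here just make the bookkeeping explicit.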
<p>The CNN model is not only more accurate but noticeably faster on my Nvidia GTX 1050 Ti. Training and prediction complete in about 30 seconds, versus 30 minutes for the LSTM.</p>
<p>The model does make me a bit suspicious, however. The model is not very complex and it’s probably disingenuous to even refer to this as “deep learning”. The use of a single 1D convolutional layer suggests to me that the model might be finding motifs found in one type of receptor, but not the other. I could validate this by pulling out the layer weights.</p>
<p>These promising results with CNNs suggest that deep learning models might be useful for insect chemosensory receptors. To make the problem more challenging, I’d like to build a model to differentiate between ORs, GRs, and GPCRs. Insect ORs and GRs were initially thought to be GPCRs, but they turned out to use a different signaling mechanism than true GPCRs (which signal through the release of a G protein).</p>
Mon, 21 May 2018 12:13:19 +0000
http://rnowling.github.io/bioinformatics/2018/05/21/chemosensory-cnn.html
http://rnowling.github.io/bioinformatics/2018/05/21/chemosensory-cnn.htmldeep learningmachine learningstatisticsbioinformaticsbioinformaticsExploring LSTMs for Classifying Chemosensory Receptors<p>In my recent work on differentiation in <em>Anopheles gambiae</em> populations, chemosensory receptors have proven to be one of the most prominent gene families. These receptors are closely tied to differences in food, mating, and habitat preferences, so this makes sense. In fact, significant losses and gains in chemosensory receptor genes are seen across insect species. This makes these receptors a fascinating and useful object of study for better understanding molecular evolution of insects.</p>
<p>Once my current research project is completed, I’m considering spending some time focusing on chemosensory receptors next. Insect chemosensory receptors are highly divergent and tend to be difficult to identify in genomes. Researchers often have to resort to a laborious manual annotation process for each genome – a major impediment. I believe that gene-family-specific prediction tools could improve the detection and annotation of these receptors.</p>
<p><a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Long Short-Term Memory</a> deep learning techniques are an exciting alternative to established techniques such as <a href="https://en.wikipedia.org/wiki/Hidden_Markov_model">Hidden Markov Models (HMMs)</a>. I decided to begin exploring LSTMs with a simpler problem: classifying a protein sequence as either an olfactory (OR) or gustatory (GR) receptor.</p>
<p>I used a data set of 930 ORs and 844 GRs from multiple species of <em>Drosophila</em> and three mosquitoes (<em>Anopheles gambiae</em>, <em>Culex</em>, and <em>Aedes aegypti</em>), with 70% used for the training set and 30% used for the test set. I vectorized the sequences with each amino acid represented by a one-hot encoded vector. Based on the examples in <a href="https://keras.io/">Keras</a>, I used a relatively simple LSTM model with a single LSTM layer of 64 units, a dropout rate of 0.2, and a recurrent dropout rate of 0.2 followed by a dense layer with a sigmoidal activation function.</p>
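The vectorization step can be sketched concretely. The amino-acid ordering below is arbitrary (any consistent ordering works), and the short example sequence is invented:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(seq):
    """Encode a protein sequence as a (len(seq), 20) one-hot matrix."""
    matrix = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        matrix[i, AA_INDEX[aa]] = 1.0
    return matrix

encoded = one_hot_encode("MKTAYIA")  # a made-up 7-residue sequence
print(encoded.shape)                 # (7, 20)
print(encoded.sum(axis=1))           # each row contains exactly one 1
```

In practice, sequences of different lengths also need to be padded or truncated to a common length before batching them for Keras.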
<p>After 50 epochs, I achieved an accuracy of 96.4% – pretty good for a simple, out-of-the-box model.</p>
<p>I see two main advantages of the LSTM model compared with a profile HMM. First, the LSTM doesn’t require aligning the sequences. At a basic level, skipping the multiple-sequence alignment (MSA) saves a step. More importantly, a poor-quality MSA (which can easily happen with poorly-conserved gene families like insect chemosensory receptors) could lead to poorly-performing models. Skipping the alignment could lead to higher-quality models. The most impressive part (at least to me), however, is that Keras is a generic machine learning framework that is performing reasonably well on a problem that normally requires the use of domain-specific techniques and software.</p>
<p>The main disadvantage of the LSTM is simple – how do I improve it? I’m relatively new to deep learning, so I haven’t yet developed the proper intuition for how to design more sophisticated models. My experience with classical machine learning suggests that hyper-parameter tuning only gets you so far and only gives you a one-time improvement.</p>
<p>I’m guessing that building more sophisticated models, either by adding more layers to the model or even combining multiple models feeding into the same dense layer, might be more powerful. In image processing applications, it is generally thought that deep learning layers learn different levels of features. Why wouldn’t the same be true of protein-sequence classification problems? We generally talk about proteins as having levels (primary, secondary, and tertiary) of structures. Would using multiple layers of LSTMs or even <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural networks (CNNs)</a> to learn protein structure lead to improved classification accuracy? Or, given the small-ish data set, are we better off with a hybrid model that combines a LSTM with hand-engineered features?</p>
<p>I see this problem leading to plenty of future work with which to keep myself busy.</p>
Sun, 20 May 2018 12:13:19 +0000
http://rnowling.github.io/bioinformatics/2018/05/20/chemosensory-lstm.html
http://rnowling.github.io/bioinformatics/2018/05/20/chemosensory-lstm.htmldeep learningmachine learningstatisticsbioinformaticsbioinformaticsTesting for Regions of Enriched Differentiation along the Chromosome using the Binomial Test<p>Within my work on insect vector population genetics, we often want to infer regions of the chromosomes that are undergoing differentiation. One way in which we do this is to look for windows with a larger-than-expected number of statistically-significant SNPs.</p>
<p>To set up the test, we first need to perform association tests on each individual SNP using something like the <a href="/machine/learning/2017/10/07/likelihood-ratio-test.html">likelihood-ratio test</a> or <a href="https://en.wikipedia.org/wiki/Fixation_index"><script type="math/tex">F_{ST}</script></a> to identify SNPs that are strongly correlated with the population structure or phenotype of interest. We then divide the chromosome into non-overlapping windows and count the number of SNPs in each window. Lastly, we perform a statistical test on each window, with a null hypothesis that the SNPs are uniformly distributed across the windows.</p>
<p><a href="http://science.sciencemag.org/content/330/6003/514">Neafsey, et al.</a> performed this analysis using the popular <script type="math/tex">\chi^2</script> test. I prefer using the one-tailed <a href="https://en.wikipedia.org/wiki/Binomial_test">binomial test</a>, however, as it’s more sensitive. Conveniently, the binomial test is available in <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom_test.html">Scipy</a>.</p>
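For intuition, here is a minimal pure-Python sketch of the one-tailed test (in practice I just call Scipy's binom_test). Here <code>n</code> is the total number of statistically-significant SNPs, <code>k</code> the count in one window, and the null success probability is 1 over the number of windows; the counts below are invented:

```python
from math import comb

def binomial_test_upper(k, n, p):
    """One-tailed binomial test: P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Suppose 1,000 significant SNPs fall into 100 equal-sized windows.
# Under the uniform null, each window expects 10 of them.
# A window containing 25 significant SNPs is strongly enriched:
p_value = binomial_test_upper(25, 1000, 1.0 / 100)
print(p_value < 0.05 / 100)  # True: significant even after a Bonferroni correction
```

Since we test every window, correcting for multiple comparisons (as in the Bonferroni threshold above) matters.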
<p>My script for performing this analysis is available below:</p>
<script src="https://gist.github.com/rnowling/bfd94f606144731233c897d977121146.js"></script>
Sun, 28 Jan 2018 12:13:19 +0000
http://rnowling.github.io/bioinformatics/2018/01/28/binomial-test.html
http://rnowling.github.io/bioinformatics/2018/01/28/binomial-test.htmlmathmachine learningstatisticsbioinformaticsbioinformaticsTesting Feature Significance with the Likelihood Ratio Test<p><a href="https://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a> (LR) is a popular technique for binary classification within the machine learning and statistics communities. From the machine learning perspective, it has a number of desirable properties. Training and prediction are incredibly fast. When using stochastic gradient descent and its cousins, LR supports online learning, enabling models to change as the data changes and training on datasets larger than the available memory on the machine. And finally, LR naturally accommodates sparse data.</p>
<p>Because of its roots in the statistics community, Logistic Regression is amenable to analyses other machine learning techniques are not. The <a href="https://en.wikipedia.org/wiki/Likelihood-ratio_test">Likelihood-Ratio Test</a> can be used to determine if the addition of features to an LR model results in a statistically-significant improvement in the fit of the model.<sup><a href="#hosmer">1</a></sup></p>
<p>I originally learned about the Likelihood-Ratio Test when reading about ways that variants are found in genome-wide association studies (GWAS). The statistician <a href="https://en.wikipedia.org/wiki/David_Balding">David J. Balding</a> has significantly impacted the field and its methods. His <a href="http://www.montefiore.ulg.ac.be/~kvansteen/GBIO0009-1/ac20112012/Class4/Balding2006.pdf">tutorial on statistical methods for population association studies</a> is a great place to start for anyone interested in the subject.</p>
<p>As Prof. Balding points out, many GWA studies use the Likelihood-Ratio Test to perform single-SNP association tests. Basically, an LR model is built for each SNP and compared to a null model that only uses the class probabilities. SNPs with small p-values are then selected for further study.</p>
<h2 id="likelihood-ratio-test">Likelihood-Ratio Test</h2>
<p>The question we are trying to answer with the Likelihood-Ratio Test is:</p>
<blockquote>
<p>Does the model that includes the variable(s) in question tell us more about the outcome (or response) variable than a model that does not include the variable(s)?</p>
</blockquote>
<p>Using the Likelihood-Ratio Test, we compute a p-value indicating the significance of the additional features. Using that p-value, we can accept or reject the null hypothesis.</p>
<p>Let <script type="math/tex">\theta^0</script> and <script type="math/tex">x^0</script> and <script type="math/tex">\theta^1</script> and <script type="math/tex">x^1</script> be the weights and feature matrices used in the null and alternative models, respectively. Note that we need <script type="math/tex">\theta^0 \subset \theta^1</script> and <script type="math/tex">x^0 \subset x^1</script>, meaning that the models are “nested.” Let <script type="math/tex">y</script> be the vector of class labels, <script type="math/tex">N</script> denote the number of samples, and <script type="math/tex">df</script> be the number of additional weights / features in <script type="math/tex">\theta^1</script>.</p>
<p>The Logistic Regression model is given by:</p>
<script type="math/tex; mode=display">\pi_\theta(x_i) = \frac{e^{\theta \cdot x_i}}{1+e^{\theta \cdot x_i}}</script>
<p>Note that the intercept is considered part of <script type="math/tex">\theta</script>. We append a column of 1s to <script type="math/tex">x</script> to model the intercept. (In the implementation below, since you control the feature matrices and model, you can model it as you need.)</p>
<p>The likelihood for the Logistic Regression model is given by:</p>
<script type="math/tex; mode=display">L(\theta | x) = \prod_{i=1}^N \pi_\theta(x_i)^{y_i} (1 - \pi_\theta(x_i))^{1 - y_i} \\
\log L(\theta | x) = \sum_{i=1}^N y_i \log \pi_\theta(x_i) + (1 - y_i) \log (1 - \pi_\theta(x_i))</script>
<p>The Likelihood-Ratio Test is then given by:</p>
<script type="math/tex; mode=display">G = 2 (\log L(\theta^1 | x^1) - \log L(\theta^0 | x^0))</script>
<p>Finally, we compute the p-value for the null model using the <script type="math/tex">\chi^2(df)</script> CDF:</p>
<script type="math/tex; mode=display">p = P[\chi^2(df) > G]</script>
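For the common <script type="math/tex">df = 1</script> case (a single extra feature), the whole computation fits in a few lines of pure Python, since <script type="math/tex">P[\chi^2(1) > G] = \mathrm{erfc}(\sqrt{G/2})</script>. The data and hand-picked weights below are invented for illustration; in practice both models would be fit to the data, as in the implementation in the next section:

```python
from math import erfc, exp, log, sqrt

def log_likelihood(theta, X, y):
    """Unnormalized Logistic Regression log-likelihood, per the formula above."""
    ll = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(t * f for t, f in zip(theta, x_i))
        pi = 1.0 / (1.0 + exp(-z))
        ll += y_i * log(pi) + (1 - y_i) * log(1 - pi)
    return ll

def likelihood_ratio_p_value(ll_null, ll_alt):
    """For df = 1: G = 2 (ll_alt - ll_null), p = P[chi^2(1) > G]."""
    G = 2.0 * (ll_alt - ll_null)
    return erfc(sqrt(G / 2.0))

# Toy data: 4 samples; columns of X are [intercept, feature], and the
# feature tracks the label perfectly.
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
y = [0, 0, 1, 1]

ll_null = log_likelihood([0.0], [[1.0]] * 4, y)  # intercept-only null model
ll_alt = log_likelihood([-3.0, 6.0], X, y)       # model with the extra feature

p_value = likelihood_ratio_p_value(ll_null, ll_alt)
print(round(p_value, 3))  # ~0.023: the extra feature is significant at 0.05
```

For general <script type="math/tex">df</script>, substitute the <script type="math/tex">\chi^2(df)</script> survival function (`scipy.stats.chi2.sf`).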
<h2 id="python-implementation-and-example">Python Implementation and Example</h2>
<p>Using <a href="http://scikit-learn.org/stable/">scikit-learn</a> and <a href="https://www.scipy.org/">scipy</a>, implementing the Likelihood-Ratio Test is pretty straightforward (as long as you remember to use the <strong>unnormalized</strong> log losses and negate them):</p>
<script src="https://gist.github.com/rnowling/ec9c9038e492d55ffae2ae257aa4acd9.js?file=likelihood_ratio_test.py"></script>
<p>The <code class="highlighter-rouge">likelihood_ratio_test</code> function takes four parameters:</p>
<ol>
<li>Feature matrix for the alternative model</li>
<li>Labels for the samples</li>
<li>A LR model to use for the test</li>
<li>(Optional) Feature matrix for the null model. If this is not given, then the class probabilities are calculated from the sample labels and used.</li>
</ol>
<p>and returns a p-value indicating the statistical significance of the new features.</p>
<p>To illustrate its use, I generated some fake data with 20 binary features. The binary features range in their probability of matching the class labels from 0.5 (uncorrelated) to 1.0 (completely correlated). Half of the features have inverted values (<code class="highlighter-rouge">1 - label</code>). I generated 100 fake data sets with 100 samples each. I then ran the Likelihood-Ratio Test for each feature individually and created a box plot of the p-values:</p>
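The fake-feature generation can be sketched as follows (the helper name and seed here are my own, for illustration): a feature "matches" the label with probability <code>match_prob</code>, and inverted features flip the value.

```python
import random

random.seed(13)

def make_feature(labels, match_prob, inverted=False):
    """Binary feature that matches each label with probability match_prob;
    inverted features take the value 1 - label instead of label."""
    feature = []
    for label in labels:
        value = label if random.random() < match_prob else 1 - label
        feature.append(1 - value if inverted else value)
    return feature

labels = [random.randint(0, 1) for _ in range(100)]

perfect = make_feature(labels, 1.0)                  # completely correlated
inverted = make_feature(labels, 1.0, inverted=True)  # completely anti-correlated
coin_flip = make_feature(labels, 0.5)                # uncorrelated

print(perfect == labels)                    # True
print(inverted == [1 - l for l in labels])  # True
```

Running the Likelihood-Ratio Test on each generated feature against the labels then produces the p-value distributions shown in the box plot.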
<p><img src="/images/likelihood_ratio_test_p_values_boxplot.png" alt="" /></p>
<p>As expected, the statistical significance varies according to the probability that the feature matches the label. And, also as expected, we see no difference between features that match the label and features that are inverted.</p>
<p>(I <a href="https://gist.github.com/rnowling/ec9c9038e492d55ffae2ae257aa4acd9">posted my code</a> under the Apache License v2 so you can re-create my results and use the test in your own work.)</p>
<p><a name="hosmer"></a>Note: the derivation given here comes from <em>Applied Logistic Regression</em> (3<sup>rd</sup> Ed.) by Hosmer, Lemeshow, and Sturdivant.</p>
Sat, 07 Oct 2017 12:13:19 +0000
http://rnowling.github.io/machine/learning/2017/10/07/likelihood-ratio-test.html
http://rnowling.github.io/machine/learning/2017/10/07/likelihood-ratio-test.htmlmathmachine learningstatisticsmachinelearningTalk on Productionizing ML Models<p>Last night, I gave a talk titled “Real-World Lessons in Machine Learning Applied to Spam Classification” at the <a href="https://www.meetup.com/MKE-Big-Data/">MKE Big Data</a> meetup. In my talk, I used spam classification as a use case for communicating some lessons learned from my experiences building production machine learning-powered services. In particular, I wanted to get the point across that modeling and algorithm choices are not independent from the requirements of the production system – we need to design our models and choose our algorithms while keeping in mind how those choices will impact the resulting production system.</p>
<p>You can grab my slides <a href="/static/rnowling_mke_big_data_2017.pdf">here</a>. My slides and source code I used to generate my plots are also available on the <a href="https://github.com/MKE-Big-Data/MKE-BD-Talks">MKE BD Talks</a> GitHub repo.</p>
<p>A few attendees had asked for some additional resources related to the topics. Martin Zinkevich of Google recently published an excellent guide based on their experiences titled <a href="http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf">Rules of Machine Learning: Best Practices for ML Engineering</a>, which I highly recommend. <a href="http://hunch.net/~vw/">Vowpal Wabbit</a> is a powerful toolkit for online machine learning that incorporates some of the latest algorithms and techniques.</p>
Wed, 03 May 2017 00:02:19 +0000
http://rnowling.github.io/machine/learning/2017/05/03/production-ml-systems.html
http://rnowling.github.io/machine/learning/2017/05/03/production-ml-systems.htmlengineeringmachinelearningRandom Forests vs F<sub>ST</sub> for Insect Population Genetics<p>For my last comparison, I’ll look at the correlation between the variable importance measures (VIM) computed by Random Forests vs the scores calculated via F<sub>ST</sub>. Previously, I analyzed the correlations between F<sub>ST</sub> scores and associations computed using <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">Cramer’s V</a> and weights computed from <a href="http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.html">Logistic Regression Ensembles</a>.</p>
<p>In line with my <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">recent correlation analysis</a> between Cramer’s V and F<sub>ST</sub>, I wanted to compare the weights calculated by the <a href="http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html">Logistic Regression Ensembles</a> I recently discussed with F<sub>ST</sub>.</p>
<p>I again used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> for the comparison. I trained the Random Forests with both the counts and categorical feature encodings using the <code class="highlighter-rouge">dev</code> branch of my population genetics methods exploration toolkit, <a href="https://github.com/rnowling/asaph/">Asaph</a>. Plots of the variable importance measures from the Random Forests vs F<sub>ST</sub> scores are below:</p>
<p><img src="/images/random-forests-vs-fst/bfm_vs_bfs_fst_vs_rf_counts.png" alt="Fst vs Random Forests (counts)" /></p>
<p><img src="/images/random-forests-vs-fst/bfm_vs_bfs_fst_vs_rf_categories.png" alt="Fst vs Random Forests (categories)" /></p>
<p>With the counts feature-encoding scheme, linear regression between the Random Forests variable importance measures and F<sub>ST</sub> scores had <script type="math/tex">r^2=0.665</script>. With the categories feature-encoding scheme, linear regression gave <script type="math/tex">r^2=0.656</script>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Variable importance measures computed for each SNP using Random Forests correlate reasonably well with the F<sub>ST</sub> scores, regardless of the encoding scheme used. (Random Forests are particularly robust to the choice of encoding scheme.) It would be interesting to analyze variants where Random Forests and F<sub>ST</sub> give substantially different scores.</p>
Wed, 05 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/05/random-forests-vs-fst.html
http://rnowling.github.io/bioinformatics/2017/04/05/random-forests-vs-fst.htmlstatisticsbioinformaticsLogistic Regression Ensembles vs F<sub>ST</sub> for Insect Population Genetics<p>In line with my <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">recent correlation analysis</a> between Cramer’s V and F<sub>ST</sub>, I wanted to compare the weights calculated by the <a href="http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html">Logistic Regression Ensembles</a> I recently discussed with F<sub>ST</sub>. Logistic Regression Ensembles are implemented in the <code class="highlighter-rouge">dev</code> branch of my population methods exploration toolkit <a href="https://github.com/rnowling/asaph/">Asaph</a>.</p>
<p>I again used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> for the comparison. I ran the Logistic Regression Ensembles with both the counts and categorical feature encodings and with and without bagging. Plots of the weights from the Logistic Regression Ensembles vs F<sub>ST</sub> scores are below:</p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_counts_bagging.png" alt="Fst vs LR Ensembles (counts) w/ Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_counts_no_bagging.png" alt="Fst vs LR Ensembles (counts) w/o Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_categories_bagging.png" alt="Fst vs LR Ensembles (categories) w/ Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_categories_no_bagging.png" alt="Fst vs LR Ensembles (categories) w/o Bagging" /></p>
<p>I used linear regression performed on the F<sub>ST</sub> scores and LR ensemble weights to calculate <script type="math/tex">r^2</script> values:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Feature Encoding</th>
<th style="text-align: center">Bagging?</th>
<th style="text-align: center"><script type="math/tex">r^2</script></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Counts</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">0.889</td>
</tr>
<tr>
<td style="text-align: center">Counts</td>
<td style="text-align: center">No</td>
<td style="text-align: center">0.812</td>
</tr>
<tr>
<td style="text-align: center">Categories</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">0.850</td>
</tr>
<tr>
<td style="text-align: center">Categories</td>
<td style="text-align: center">No</td>
<td style="text-align: center">0.643</td>
</tr>
</tbody>
</table>
<h2 id="conclusion">Conclusion</h2>
<p>When bagging is used, Logistic Regression Ensemble weights seem to correlate well with F<sub>ST</sub> scores. The only real outlier is when Logistic Regression Ensembles are used with the categories feature encoding and no bagging.</p>
<p>It’s important to mention that the correlation analysis only tells us how well these methods agree with F<sub>ST</sub>. They do not necessarily tell us which method is better or worse. Substantial work remains to validate the results of the Logistic Regression Ensembles.</p>
Wed, 05 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.html
http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.htmlstatisticsbioinformaticsCramer's V vs F<sub>ST</sub> for Insect Population Genetics<p><a href="https://en.wikipedia.org/wiki/Fixation_index">Fixation index</a>, or F<sub>ST</sub>, is a univariate statistic calculated as the ratio of variance within populations to the variance between populations. Within insect population genetics, F<sub>ST</sub> is used to score, and then rank, the correlation between variants and the population structure.</p>
<p>The focus of my Ph.D. dissertation was to investigate variable importance measures as calculated via Random Forests as an alternative to F<sub>ST</sub>. I’ve also begun looking at <a href="http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html">Logistic Regression Ensembles</a>.</p>
<p>In addition to these two machine learning approaches, I wanted to investigate a statistical method, Cramer’s V. <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r's_V">Cramer’s V</a> measures the association (correlation of unsigned variables) of nominal (categorical) variables. I went ahead and implemented Cramer’s V in the <code class="highlighter-rouge">dev</code> branch of my population methods exploration toolkit <a href="https://github.com/rnowling/asaph/">Asaph</a>.</p>
<p>I used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> to compare Cramer’s V and F<sub>ST</sub>. I calculated the F<sub>ST</sub> scores for each SNP using <a href="https://vcftools.github.io/">vcftools</a>. I calculated Cramer’s V using Asaph on data imported using both the counts and categories feature encoding schemes. I then plotted F<sub>ST</sub> vs Cramer’s V (counts) and F<sub>ST</sub> vs Cramer’s V (categories) to get a sense of the correlation between the two metrics.</p>
<p><img src="/images/cramers-v-vs-fst/bfm_vs_bfs_fst_vs_cramers_v_counts.png" alt="Fst vs Cramer's V (counts)" /></p>
<p><img src="/images/cramers-v-vs-fst/bfm_vs_bfs_fst_vs_cramers_v_categories.png" alt="Fst vs Cramer's V (categories)" /></p>
<p>The above figures give the scatter plots of F<sub>ST</sub> vs Cramer’s V with the counts and categories feature encodings, respectively. Cramer’s V calculated on the count-encoded features has an <script type="math/tex">r^2</script> value of 0.865 vs F<sub>ST</sub>, while Cramer’s V calculated on the category-encoded features has an <script type="math/tex">r^2</script> value of 0.818 vs F<sub>ST</sub>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Along with Random Forests and Logistic Regression Ensembles, Cramer’s V is another alternative to F<sub>ST</sub> for finding variants that best describe the genetic basis of differences between two populations. Cramer’s V correlates well with F<sub>ST</sub>, but a simple correlation analysis doesn’t tell us which metric is more appropriate for a given situation. Substantial work remains to validate the four methods and compare them.</p>
Tue, 04 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html
Classifying Graphs with Shortest Paths<p>Graphs can be an easy and intuitive way of representing interactions between agents or state transitions in sociological, biological, and dynamical systems. The <a href="http://cse.nd.edu">Computer Science & Engineering department</a> happens to have a number of researchers in the <a href="http://icensa.nd.edu">Interdisciplinary Center for Network Science and Applications (iCeNSA)</a> working on <a href="https://en.wikipedia.org/wiki/Complex_network">complex networks</a>. Between some of my own research modeling protein-folding dynamics as Markov State Models and having a desk in the iCeNSA office space, I was exposed to some of this research.</p>
<p>One of the most natural applications of network science is analyzing clickstream data. In particular, we can represent users’ browsing sessions as graphs. In the simplest case, we can use vertices to represent the pages that users have visited and directed edges to represent that a user has navigated from one page to another. A more sophisticated model might use edge weights to record the number of times the user navigated from one page to another in a single session. In fact, if we normalize the outgoing edge weights of each vertex, we can derive a <a href="https://en.wikipedia.org/wiki/Markov_model">Markov model</a> of the dynamics of the browsing session.</p>
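<p>As a minimal sketch of that last step, a first-order Markov model can be derived from clickstream edges by normalizing each page’s outgoing counts; the page names and counts below are hypothetical:</p>

```python
from collections import defaultdict

def transition_model(edges):
    """Build a first-order Markov model from (src, dst, count) clickstream edges
    by normalizing each page's outgoing edge weights into probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst, n in edges:
        counts[src][dst] += n
    return {src: {dst: n / sum(out.values()) for dst, n in out.items()}
            for src, out in counts.items()}

# Hypothetical single-session edges: (from page, to page, navigation count)
session = [("home", "search", 2), ("search", "product", 1), ("home", "cart", 1)]
model = transition_model(session)
print(model["home"])  # {'search': 0.666..., 'cart': 0.333...}
```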
<p>My goal in modeling users’ browsing sessions as graphs is to segment users by their browsing behaviors. For example, I may want to train a machine learning model to discriminate between users who are likely to make a purchase (convert) and those who are just window shopping, using graphs generated from their browsing sessions. I don’t know the content of the web sites, and web sites can be structured differently, so I won’t be able to match vertices between separate graphs easily. As such, I want to engineer features based purely on the topology of the graphs – features that are invariant to permutations of the vertices and to the numbers of edges and vertices in the graphs.</p>
<p>There are several ways to approach classifying graphs with machine learning models. One approach is simply to engineer a collection of features from different statistics computed from the graphs. <a href="http://onlinelibrary.wiley.com/doi/10.1002/sam.11153/full">Li, et al.</a> describe a number of metrics including the <a href="https://en.wikipedia.org/wiki/Clustering_coefficient">average clustering coefficient</a> and the average path length (<a href="https://en.wikipedia.org/wiki/Closeness_centrality">closeness centrality</a>). However, be aware that some of their features (such as the numbers of edges and vertices) probably won’t be useful if you are comparing graphs of different sizes.</p>
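<p>For instance, a few size-invariant statistics of this kind can be computed directly with networkx (the specific feature set here is an illustrative assumption, not Li, et al.’s exact list):</p>

```python
import networkx as nx

def topological_features(g):
    """A few vertex-permutation- and size-invariant graph statistics."""
    return [
        nx.average_clustering(g),            # average clustering coefficient
        nx.average_shortest_path_length(g),  # assumes g is connected
        nx.density(g),
    ]

# Example on a dense random graph (connected with overwhelming probability)
g = nx.erdos_renyi_graph(n=50, p=0.2, seed=42)
print(topological_features(g))
```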
<p>A second approach would be to use <a href="https://en.wikipedia.org/wiki/Graph_kernel">graph kernels</a>, functions for computing a similarity score between two graphs. A number of machine learning methods (called <a href="https://en.wikipedia.org/wiki/Kernel_method">kernel methods</a>) such as Support Vector Machines and Principal Component Analysis can be adapted to use inner products computed between pairs of data points using a kernel instead of feature vectors. Kernel methods are advantageous because they can be extended to data types that are difficult to represent with traditional feature vectors. Often-cited graph kernels include <a href="http://ieeexplore.ieee.org/abstract/document/1565664/">Shortest-Paths</a>, <a href="http://www.jmlr.org/proceedings/papers/v5/shervashidze09a/shervashidze09a.pdf">Graphlet</a>, and <a href="https://en.wikipedia.org/wiki/Graph_kernel">Random Walk</a> kernels. At <a href="http://nips.cc">NIPS 2016</a>, I saw a very nice presentation by <a href="https://www.cs.uchicago.edu/directory/risi-kondor">Risi Kondor</a> on the <a href="http://papers.nips.cc/paper/6135-learning-bound-for-parameter-transfer-learning.pdf">Multiscale Laplacian Graph Kernel</a>, which both allows incorporating features computed on vertices and edges and adapts well to multi-scale problems like protein structures.</p>
<p>For this blog post, I’m going to focus on evaluating the ability of machine learning models to discriminate between undirected graphs generated by the <a href="https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_model">Erdős–Rényi</a> and <a href="https://en.wikipedia.org/wiki/Stochastic_block_model">planted partition</a> random graph models. I’m using the experimental framework from a paper on the <a href="https://arxiv.org/pdf/1510.06492">generalized Shortest-Path graph kernel</a>. Instead of using graph kernels, I’m first going to focus on features engineered from the distribution of the lengths of the shortest paths between all pairs of vertices. Here are examples of two such graphs:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/graphs.png" alt="graphs" /></p>
<h2 id="generating-the-graphs">Generating the Graphs</h2>
<p>I’m choosing to generate 100 graphs of each type. Each graph has 100 vertices. For the ER model, I’m using an edge probability of 0.2. Following the direction of the generalized shortest-path graph kernel paper, I set the parameters for the planted partition model to generate the same expected number of edges as the ER model, with a multiplier for <script type="math/tex">p_1</script> of 1.6.</p>
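<p>A sketch of this setup with networkx, under my assumed reading that the within-group probability is 1.6 × 0.2 and the between-group probability is solved for to match the ER model’s expected edge count:</p>

```python
from math import comb

import networkx as nx

N, P = 100, 0.2
P_IN = 1.6 * P  # assumed interpretation of the paper's p_1 multiplier
# Solve for p_out so the expected edge count matches the ER model:
#   2 * C(50, 2) * P_IN + 50 * 50 * p_out = C(100, 2) * P
p_out = (comb(100, 2) * P - 2 * comb(50, 2) * P_IN) / (50 * 50)

er_graphs = [nx.erdos_renyi_graph(N, P, seed=i) for i in range(100)]
pp_graphs = [nx.planted_partition_graph(2, 50, P_IN, p_out, seed=i)
             for i in range(100)]
print(p_out)  # ≈ 0.0824
```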
<h2 id="analysis-of-shortest-path-distributions">Analysis of Shortest-Path Distributions</h2>
<p>The distribution of the average of the all-pairs shortest-path lengths for each graph is plotted below:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/avg_sp_hist.png" alt="average all-pairs shortest-path lengths" /></p>
<p>Note that the two distributions overlap substantially. Thus, simply using the average all-pairs shortest-path length for each graph won’t be able to effectively discriminate between graphs from the two classes.</p>
<p>I then decided to try generating a normalized histogram of the all-pairs shortest-path lengths for each graph. When I compared the distributions of the Euclidean distances of graphs generated by the same model to pairs from different models, I observed a separation in the distributions:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/sp_distr_dist_hist.png" alt="shortest-path length distribution distances" /></p>
<p>The difference in the distribution of distances for graphs from different models versus those generated by the same model was promising. I saw a similar result with cosine similarity.</p>
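<p>The histogram and distance computations can be sketched as follows; the planted partition parameters here are my assumed values rather than the original scripts’ exact ones:</p>

```python
import networkx as nx
import numpy as np

def sp_histogram(g, n_bins):
    """L2-normalized histogram (bin size 1) of all-pairs shortest-path lengths."""
    hist = np.zeros(n_bins)
    for src, lengths in nx.all_pairs_shortest_path_length(g):
        for dst, d in lengths.items():
            if src < dst:          # count each unordered pair once
                hist[d - 1] += 1   # path lengths start at 1
    return hist / np.linalg.norm(hist)

# n_bins=5 comfortably covers these graphs' diameters (3 in the post's runs)
g1 = nx.erdos_renyi_graph(100, 0.2, seed=0)
g2 = nx.planted_partition_graph(2, 50, 0.32, 0.0824, seed=0)
h1, h2 = sp_histogram(g1, 5), sp_histogram(g2, 5)
print(np.linalg.norm(h1 - h2))  # Euclidean distance between the two graphs
```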
<h2 id="generating-features">Generating Features</h2>
<p>As I mentioned earlier, instead of using a kernel method, I focused on generating features that could be used with standard machine learning models. I considered four modeling approaches:</p>
<ol>
<li>One feature vector for each graph. The features were the <script type="math/tex">L_2</script> normalized histogram of all-pairs shortest-path lengths. I used bin-sizes of 1 with enough bins to include the longest path found. (In my case, all of the graphs had the same maximum length of 3.) The ER graphs were labeled as 0, while the PP graphs were labeled as 1.</li>
<li>Represent each pair of graphs as a feature vector. I computed the difference in the normalized histograms of the all-pairs shortest-paths lengths. Since the differences aren’t symmetric, I computed the differences in both directions and added two feature vectors for each pair. Feature vectors for pairs of graphs from the same model were labeled 0, while feature vectors for pairs of graphs from different models were labeled 1.</li>
<li>Same as #2, except that I took the absolute value of the differences and only added one feature vector per pair.</li>
<li>Like approach #3, I used pairs of graphs. I used a single feature – the Euclidean distance calculated between the normalized histograms from each pair.</li>
</ol>
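<p>As a sketch of how feature types 1 and 3 might be constructed from per-graph histograms (the function names and toy histograms are mine, not from the original scripts):</p>

```python
import numpy as np

def build_type1(er_hists, pp_hists):
    """Feature type 1: one normalized histogram per graph; ER graphs
    are labeled 0 and planted partition graphs 1."""
    X = np.vstack(er_hists + pp_hists)
    y = np.array([0] * len(er_hists) + [1] * len(pp_hists))
    return X, y

def build_type3(hists, labels):
    """Feature type 3: |h_i - h_j| for each unordered pair of graphs;
    same-model pairs are labeled 0, different-model pairs 1."""
    X, y = [], []
    for i in range(len(hists)):
        for j in range(i + 1, len(hists)):
            X.append(np.abs(hists[i] - hists[j]))
            y.append(int(labels[i] != labels[j]))
    return np.array(X), np.array(y)

# Toy histograms standing in for real shortest-path histograms
hists = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
labels = [0, 1, 0]
X, y = build_type3(hists, labels)
print(X.shape, y.tolist())  # (3, 2) [1, 0, 1]
```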
<h2 id="experiments">Experiments</h2>
<p>I used <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier">Logistic Regression with Stochastic Gradient Descent</a> and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forests</a> for classification. I used 1,000 epochs for LR, and 100 trees for RF. I performed 10-fold stratified cross-validation and used accuracy and area under the ROC curve as metrics. Accuracy utilizes binary predictions, while the ROC AUC utilizes the predicted probabilities.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Features Type</th>
<th style="text-align: center">Classifier</th>
<th style="text-align: center">ROC AUC (std)</th>
<th style="text-align: center">Accuracy (%, std)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">LR</td>
<td style="text-align: center"><strong>1.0 (0.0)</strong></td>
<td style="text-align: center"><strong>98.0 (3.3)</strong></td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>1.0 (0.0)</strong></td>
<td style="text-align: center"><strong>100.0 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.490 (0.006)</td>
<td style="text-align: center">66.7 (0.0)</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>0.999 (0.0)</strong></td>
<td style="text-align: center"><strong>99.3 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.910 (0.007)</td>
<td style="text-align: center">71.1 (0.4)</td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>0.999 (0.0)</strong></td>
<td style="text-align: center"><strong>99.2 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.914 (0.005)</td>
<td style="text-align: center">66.7 (0.0)</td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">RF</td>
<td style="text-align: center">0.921 (0.002)</td>
<td style="text-align: center">84.4 (0.4)</td>
</tr>
</tbody>
</table>
<p>LR performed well with feature type 1, achieving an average ROC AUC of 1.0 and an average accuracy of 98.0%, but performed substantially worse on the other feature types. RFs performed well with feature types 1-3, achieving a minimum average ROC AUC of 0.999 and a minimum average accuracy of 99.2%.</p>
<p>We see these results reflected in ROC curves generated for each classifier from one of its 10 folds:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/lr_roc.png" alt="LR ROC" /></p>
<p>The LR classifier has an abysmal ROC curve for feature type 2. The ROC curves for feature types 3 and 4 are decent. The ROC curve for feature type 1 appears to be perfect. (Note that the ROC curves are based on sorting by predicted probabilities, while the accuracies used the binary labels.)</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/rf_roc.png" alt="RF ROC" /></p>
<p>The RF model performs nearly perfectly on feature type 1, as indicated by the barely visible curve in the upper left of the plot. The ROC curves for feature types 2 and 3 are nearly perfect. The ROC curve for feature type 4 is acceptable, largely an indication of how robust RF classifiers are.</p>
<p>In terms of computational complexity, feature type 1 only requires creating a feature vector for each graph. Feature types 2-4 require computing a feature vector or distance for each pair of graphs.</p>
<p>Overall, it looks like feature type 1, where the feature vectors are the normalized histograms (bin size 1) of the all-pairs shortest-path lengths for each graph, can be used to very accurately discriminate between graphs generated by the two models, regardless of the classifier used.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, I looked at training machine learning models to discriminate between graphs generated by two different random graph models. I gave some background on feature engineering and kernel methods for graphs. I looked at four ways of representing the problem and how the corresponding features would be generated. I evaluated the four types of features using Logistic Regression and Random Forest classifiers. I observed that both LR and RF classifiers performed well when normalized histograms of the all-pairs shortest-paths lengths were used as features.</p>
<p>Going forward, I’d like to evaluate additional types of features such as the average clustering coefficient and the distributions of graphlets found in the graphs. Using the feature-vector-per-pair approach (feature type 3), I could incorporate graph kernels as features. However, I would probably need to find a different way to represent the absolute differences in the histograms so that it’s feasible to use Logistic Regression. One way to do so might be to discretize the differences for each bin in the histogram into a set of bins.</p>
<p>Once I have a reasonable set of useful features, I’d like to explore the effectiveness of this approach on small graphs since most user browsing sessions have far fewer than 100 nodes (e.g., 10).</p>
<p><em>The scripts used in the analyses are available in my <a href="https://github.com/rnowling/graph-experiments">graph-experiments</a> repo on GitHub under the Apache Public License v2.</em></p>
Sat, 04 Mar 2017 00:01:19 +0000
http://rnowling.github.io/machine/learning/2017/03/04/classifying-graphs-with-shortest-paths.html