RJ Nowling
Personal website and blog for RJ Nowling. Data science engineer with a Ph.D. in Computer Science & Engineering with experience in computational physics, bioinformatics, machine learning, and distributed systems.
http://rnowling.github.io/
Sun, 28 Jan 2018 23:06:25 +0000
Jekyll v3.6.2

Testing for Regions of Enriched Differentiation along the Chromosome using the Binomial Test

<p>Within my work on insect vector population genetics, we often want to infer regions of the chromosomes that are undergoing differentiation. One way in which we do this is to look for windows with a greater-than-expected number of statistically significant SNPs.</p>
<p>To set up the test, we first need to perform association tests on each individual SNP using something like the <a href="/machine/learning/2017/10/07/likelihood-ratio-test.html">likelihood-ratio test</a> or <a href="https://en.wikipedia.org/wiki/Fixation_index"><script type="math/tex">F_{ST}</script></a> to identify SNPs that are strongly correlated with the population structure or phenotype of interest. We then divide the chromosome into non-overlapping windows and count the number of SNPs in each window. Lastly, we perform a statistical test on each window, with a null hypothesis that the SNPs are uniformly distributed across the windows.</p>
<p><a href="http://science.sciencemag.org/content/330/6003/514">Neafsey, et al.</a> performed this analysis using the popular <script type="math/tex">\chi^2</script> test. I prefer using the one-tailed <a href="https://en.wikipedia.org/wiki/Binomial_test">binomial test</a>, however, as it’s more sensitive. Conveniently, the binomial test is available in <a href="https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.binom_test.html">Scipy</a>.</p>
<p>My script for performing this analysis is available below:</p>
<script src="https://gist.github.com/rnowling/bfd94f606144731233c897d977121146.js"></script>
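In case the embedded gist doesn’t render, the core of the per-window test can be sketched as follows. The function and variable names here are illustrative, not the actual script’s; the null hypothesis is that significant SNPs occur in each window at the genome-wide rate.

```python
from collections import defaultdict
from scipy.stats import binomtest  # scipy.stats.binom_test in older SciPy

def window_enrichment(positions, significant, window_size):
    """One-tailed binomial test per window: is the count of significant
    SNPs larger than expected if they were uniformly distributed?"""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for pos, sig in zip(positions, significant):
        w = pos // window_size
        totals[w] += 1
        hits[w] += int(sig)
    # null: each SNP is significant at the genome-wide rate
    p_null = sum(hits.values()) / sum(totals.values())
    return {w: binomtest(hits[w], totals[w], p_null,
                         alternative="greater").pvalue
            for w in totals}
```

Windows whose p-values survive a multiple-testing correction are candidates for enriched differentiation.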
Sun, 28 Jan 2018 12:13:19 +0000
http://rnowling.github.io/bioinformatics/2018/01/28/binomial-test.html
Tags: math, machine learning, statistics, bioinformatics

Testing Feature Significance with the Likelihood Ratio Test

<p><a href="https://en.wikipedia.org/wiki/Logistic_regression">Logistic Regression</a> (LR) is a popular technique for binary classification within the machine learning and statistics communities. From the machine learning perspective, it has a number of desirable properties. Training and prediction are incredibly fast. When using stochastic gradient descent and its cousins, LR supports online learning, enabling models to change as the data changes and training on datasets larger than the available memory on the machine. And finally, LR naturally accommodates sparse data.</p>
<p>Because of its roots in the statistics community, Logistic Regression is amenable to analyses other machine learning techniques are not. The <a href="https://en.wikipedia.org/wiki/Likelihood-ratio_test">Likelihood-Ratio Test</a> can be used to determine if the addition of features to an LR model results in a statistically-significant improvement in the fit of the model.<sup><a href="#hosmer">1</a></sup></p>
<p>I originally learned about the Likelihood-Ratio Test when reading about ways that variants are found in genome-wide association studies (GWAS). The statistician <a href="https://en.wikipedia.org/wiki/David_Balding">David J. Balding</a> has significantly impacted the field and its methods. His <a href="http://www.montefiore.ulg.ac.be/~kvansteen/GBIO0009-1/ac20112012/Class4/Balding2006.pdf">tutorial on statistical methods for population association studies</a> is a great place to start for anyone interested in the subject.</p>
<p>As Prof. Balding points out, many GWA studies use the Likelihood-Ratio Test to perform single-SNP association tests. Basically, a LR model is built for each SNP and compared to a null model that only uses the class probabilities. SNPs with small p-values are then selected for further study.</p>
<h2 id="likelihood-ratio-test">Likelihood-Ratio Test</h2>
<p>The question we are trying to answer with the Likelihood-Ratio Test is:</p>
<blockquote>
<p>Does the model that includes the variable(s) in question tell us more about the outcome (or response) variable than a model that does not include the variable(s)?</p>
</blockquote>
<p>Using the Likelihood-Ratio Test, we compute a p-value indicating the significance of the additional features. Using that p-value, we can reject, or fail to reject, the null hypothesis.</p>
<p>Let <script type="math/tex">\theta^0</script> and <script type="math/tex">x^0</script> and <script type="math/tex">\theta^1</script> and <script type="math/tex">x^1</script> be the weights and feature matrices used in the null and alternative models, respectively. Note that we need <script type="math/tex">\theta^0 \subset \theta^1</script> and <script type="math/tex">x^0 \subset x^1</script>, meaning that the models are “nested.” Let <script type="math/tex">y</script> be the vector of class labels, <script type="math/tex">N</script> denote the number of samples, and <script type="math/tex">df</script> be number of additional weights / features in <script type="math/tex">\theta^1</script>.</p>
<p>The Logistic Regression model is given by:</p>
<script type="math/tex; mode=display">\pi_\theta(x_i) = \frac{e^{\theta \cdot x_i}}{1+e^{\theta \cdot x_i}}</script>
<p>Note that the intercept is considered part of <script type="math/tex">\theta</script>. We append a column of 1s to <script type="math/tex">x</script> to model the intercept. (In the implementation below, since you control the feature matrices and model, you can model it as you need.)</p>
<p>The likelihood for the Logistic Regression model is given by:</p>
<script type="math/tex; mode=display">L(\theta | x) = \prod_{i=1}^N \pi_\theta(x_i)^{y_i} (1 - \pi_\theta(x_i))^{1 - y_i} \\
\log L(\theta | x) = \sum_{i=1}^N y_i \log \pi_\theta(x_i) + (1 - y_i) \log (1 - \pi_\theta(x_i))</script>
<p>The Likelihood-Ratio Test is then given by:</p>
<script type="math/tex; mode=display">G = 2 (\log L(\theta^1 | x^1) - \log L(\theta^0 | x^0))</script>
<p>Finally, we compute the p-value using the survival function (one minus the CDF) of the <script type="math/tex">\chi^2(df)</script> distribution:</p>
<script type="math/tex; mode=display">p = P[\chi^2(df) > G]</script>
<h2 id="python-implementation-and-example">Python Implementation and Example</h2>
<p>Using <a href="http://scikit-learn.org/stable/">scikit-learn</a> and <a href="https://www.scipy.org/">scipy</a>, implementing the Likelihood-Ratio Test is pretty straightforward (as long as you remember to use the <strong>unnormalized</strong> log losses and negate them):</p>
<script src="https://gist.github.com/rnowling/ec9c9038e492d55ffae2ae257aa4acd9.js?file=likelihood_ratio_test.py"></script>
<p>The <code class="highlighter-rouge">likelihood_ratio_test</code> function takes four parameters:</p>
<ol>
<li>Feature matrix for the alternative model</li>
<li>Labels for the samples</li>
<li>An LR model to use for the test</li>
<li>(Optional) Feature matrix for the null model. If this is not given, then the class probabilities are calculated from the sample labels and used.</li>
</ol>
<p>and returns a p-value indicating the statistical significance of the new features.</p>
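A minimal sketch of such a function, assuming scikit-learn’s <code class="highlighter-rouge">LogisticRegression</code> and scipy (the authoritative version is in the gist above; this just illustrates the unnormalized, negated log-loss trick):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.metrics import log_loss

def likelihood_ratio_test(X_alt, y, model, X_null=None):
    """p-value for the features in X_alt relative to X_null
    (or to an intercept-only null model if X_null is None)."""
    if X_null is None:
        df = X_alt.shape[1]
        # intercept-only null: log-likelihood from the class frequencies
        p1 = np.mean(y)
        null_ll = np.sum(y * np.log(p1) + (1 - y) * np.log(1.0 - p1))
    else:
        df = X_alt.shape[1] - X_null.shape[1]
        model.fit(X_null, y)
        # log_loss is the negated, normalized log-likelihood, so negate
        # it and pass normalize=False to recover log L
        null_ll = -log_loss(y, model.predict_proba(X_null), normalize=False)
    model.fit(X_alt, y)
    alt_ll = -log_loss(y, model.predict_proba(X_alt), normalize=False)
    G = 2.0 * (alt_ll - null_ll)
    return chi2.sf(G, df)  # P[chi^2(df) > G]
```

A feature strongly correlated with the labels should yield a tiny p-value, while an independent feature should not.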
<p>To illustrate its use, I generated some fake data with 20 binary features. The binary features range in their probability of matching the class labels from 0.5 (uncorrelated) to 1.0 (completely correlated). Half of the features have inverted values (<code class="highlighter-rouge">1 - label</code>). I generated 100 fake data sets with 100 samples each. I then ran the Likelihood-Ratio Test for each feature individually and created a box plot of the p-values:</p>
<p><img src="/images/likelihood_ratio_test_p_values_boxplot.png" alt="" /></p>
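The fake data can be generated along these lines (a sketch with hypothetical names; the actual generation code is in the linked gist):

```python
import numpy as np

def generate_features(labels, match_probs, rng):
    """Each column matches the labels with its given probability;
    every other column is then inverted (1 - value)."""
    n = len(labels)
    cols = []
    for j, p in enumerate(match_probs):
        matches = rng.random(n) < p
        col = np.where(matches, labels, 1 - labels)
        if j % 2 == 1:  # invert half of the features
            col = 1 - col
        cols.append(col)
    return np.column_stack(cols)
```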
<p>As expected, the statistical significance varies with the probability that the feature matches the label. And, also as expected, we see no difference between features that match the label and features whose values are inverted.</p>
<p>(I <a href="https://gist.github.com/rnowling/ec9c9038e492d55ffae2ae257aa4acd9">posted my code</a> under the Apache License v2 so you can re-create my results and use the test in your own work.)</p>
<p><a name="hosmer"></a>Note: the derivation given here comes from <em>Applied Logistic Regression</em> (3<sup>rd</sup> Ed.) by Hosmer, Lemeshow, and Sturdivant.</p>
Sat, 07 Oct 2017 12:13:19 +0000
http://rnowling.github.io/machine/learning/2017/10/07/likelihood-ratio-test.html
Tags: math, machine learning, statistics

Talk on Productionizing ML Models

<p>Last night, I gave a talk titled “Real-World Lessons in Machine Learning Applied to Spam Classification” at the <a href="https://www.meetup.com/MKE-Big-Data/">MKE Big Data</a> meetup. In my talk, I used spam classification as a use case for communicating some lessons learned from my experiences building production machine learning-powered services. In particular, I wanted to get the point across that modeling and algorithm choices are not independent of the requirements of the production system – we need to design our models and choose our algorithms while keeping in mind how those choices will impact the resulting production system.</p>
<p>You can grab my slides <a href="/static/rnowling_mke_big_data_2017.pdf">here</a>. My slides and source code I used to generate my plots are also available on the <a href="https://github.com/MKE-Big-Data/MKE-BD-Talks">MKE BD Talks</a> GitHub repo.</p>
<p>A few attendees had asked for some additional resources related to the topics. Martin Zinkevich of Google recently published an excellent guide based on their experiences titled <a href="http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf">Rules of Machine Learning: Best Practices for ML Engineering</a>, which I highly recommend. <a href="http://hunch.net/~vw/">Vowpal Wabbit</a> is a powerful toolkit for online machine learning that incorporates some of the latest algorithms and techniques.</p>
Wed, 03 May 2017 00:02:19 +0000
http://rnowling.github.io/machine/learning/2017/05/03/production-ml-systems.html
Tags: engineering, machine learning

Random Forests vs F<sub>ST</sub> for Insect Population Genetics

<p>For my last comparison, I’ll look at the correlation between the variable importance measures (VIM) computed by Random Forests vs the scores calculated via F<sub>ST</sub>. Previously, I analyzed the correlations between F<sub>ST</sub> scores and associations computed using <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">Cramer’s V</a> and weights computed from <a href="http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.html">Logistic Regression Ensembles</a>.</p>
<p>As in those <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">recent correlation analyses</a>, the goal here is to see how well the variable importance measures computed by Random Forests agree with F<sub>ST</sub>.</p>
<p>I again used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> for the comparison. I trained the Random Forests with both the counts and categorical feature encodings using the <code class="highlighter-rouge">dev</code> branch of my population genetics methods exploration toolkit, <a href="https://github.com/rnowling/asaph/">Asaph</a>. Plots of the variable importance measures from the Random Forests vs F<sub>ST</sub> scores are below:</p>
<p><img src="/images/random-forests-vs-fst/bfm_vs_bfs_fst_vs_rf_counts.png" alt="Fst vs Random Forests (counts)" /></p>
<p><img src="/images/random-forests-vs-fst/bfm_vs_bfs_fst_vs_rf_categories.png" alt="Fst vs Random Forests (categories)" /></p>
<p>With the counts feature-encoding scheme, linear regression between the Random Forests variable importance measures and F<sub>ST</sub> scores had <script type="math/tex">r^2=0.665</script>. With the categories feature-encoding scheme, linear regression gave <script type="math/tex">r^2=0.656</script>.</p>
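The <script type="math/tex">r^2</script> values come from a simple linear regression; with scipy this is a one-liner (the arrays below are toy stand-ins for the per-SNP F<sub>ST</sub> scores and variable importance measures):

```python
import numpy as np
from scipy.stats import linregress

# stand-ins for per-SNP F_ST scores and RF variable importance measures
fst_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
importances = np.array([0.12, 0.19, 0.33, 0.38, 0.52])

fit = linregress(fst_scores, importances)
r_squared = fit.rvalue ** 2
```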
<h2 id="conclusion">Conclusion</h2>
<p>Variable importance measures computed for each SNP using Random Forests correlate reasonably well with the F<sub>ST</sub> scores, regardless of the encoding scheme used. (Random Forests are particularly robust to the choice of encoding scheme.) It would be interesting to analyze variants where Random Forests and F<sub>ST</sub> give substantially different scores.</p>
Wed, 05 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/05/random-forests-vs-fst.html
Tags: statistics, bioinformatics

Logistic Regression Ensembles vs F<sub>ST</sub> for Insect Population Genetics

<p>In line with my <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">recent correlation analysis</a> between Cramer’s V and F<sub>ST</sub>, I wanted to compare the weights calculated by the <a href="http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html">Logistic Regression Ensembles</a> I recently discussed with F<sub>ST</sub>. Logistic Regression Ensembles are implemented in the <code class="highlighter-rouge">dev</code> branch of my population genetics methods exploration toolkit, <a href="https://github.com/rnowling/asaph/">Asaph</a>.</p>
<p>I again used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> for the comparison. I ran the Logistic Regression Ensembles with both the counts and categorical feature encodings and with and without bagging. Plots of the weights from the Logistic Regression Ensembles vs F<sub>ST</sub> scores are below:</p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_counts_bagging.png" alt="Fst vs LR Ensembles (counts) w/ Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_counts_no_bagging.png" alt="Fst vs LR Ensembles (counts) w/o Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_categories_bagging.png" alt="Fst vs LR Ensembles (categories) w/ Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_categories_no_bagging.png" alt="Fst vs LR Ensembles (categories) w/o Bagging" /></p>
<p>I performed linear regression on the F<sub>ST</sub> scores and LR ensemble weights to calculate <script type="math/tex">r^2</script> values:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Feature Encoding</th>
<th style="text-align: center">Bagging?</th>
<th style="text-align: center"><script type="math/tex">r^2</script></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Counts</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">0.889</td>
</tr>
<tr>
<td style="text-align: center">Counts</td>
<td style="text-align: center">No</td>
<td style="text-align: center">0.812</td>
</tr>
<tr>
<td style="text-align: center">Categories</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">0.850</td>
</tr>
<tr>
<td style="text-align: center">Categories</td>
<td style="text-align: center">No</td>
<td style="text-align: center">0.643</td>
</tr>
</tbody>
</table>
<h2 id="conclusion">Conclusion</h2>
<p>When bagging is used, the Logistic Regression Ensemble weights correlate well with F<sub>ST</sub> scores. The only real outlier is when Logistic Regression Ensembles are used with the categories feature encoding and no bagging.</p>
<p>It’s important to mention that the correlation analysis only tells us how well these methods agree with F<sub>ST</sub>. They do not necessarily tell us which method is better or worse. Substantial work remains to validate the results of the Logistic Regression Ensembles.</p>
Wed, 05 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.html
Tags: statistics, bioinformatics

Cramer's V vs F<sub>ST</sub> for Insect Population Genetics

<p><a href="https://en.wikipedia.org/wiki/Fixation_index">Fixation index</a>, or F<sub>ST</sub>, is a univariate statistic calculated as the ratio of the variance between populations to the total variance. Within insect population genetics, F<sub>ST</sub> is used to score, and then rank, the correlation between variants and the population structure.</p>
<p>The focus of my Ph.D. dissertation was to investigate variable importance measures as calculated via Random Forests as an alternative to F<sub>ST</sub>. I’ve also begun looking at <a href="http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html">Logistic Regression Ensembles</a>.</p>
<p>In addition to these two machine learning approaches, I wanted to investigate a statistical method, Cramer’s V. <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r's_V">Cramer’s V</a> measures the association (an analogue of correlation for unordered variables) between nominal (categorical) variables. I went ahead and implemented Cramer’s V in the <code class="highlighter-rouge">dev</code> branch of my population genetics methods exploration toolkit, <a href="https://github.com/rnowling/asaph/">Asaph</a>.</p>
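Cramer’s V is a thin wrapper around the <script type="math/tex">\chi^2</script> statistic of the contingency table: <script type="math/tex">V = \sqrt{\chi^2 / (n \, (\min(r, c) - 1))}</script>. A sketch (not Asaph’s actual implementation):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramer's V between two categorical variables, e.g., per-sample
    genotypes and population labels."""
    xi = np.unique(x, return_inverse=True)[1]
    yi = np.unique(y, return_inverse=True)[1]
    # build the contingency table of co-occurrence counts
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    for i, j in zip(xi, yi):
        table[i, j] += 1.0
    chi2_stat = chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2_stat / (n * k))
```

Perfect association yields 1, independence yields 0.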
<p>I used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> to compare Cramer’s V and F<sub>ST</sub>. I calculated the F<sub>ST</sub> scores for each SNP using <a href="https://vcftools.github.io/">vcftools</a>. I calculated Cramer’s V using Asaph on data imported using both the counts and categories feature encoding schemes. I then plotted F<sub>ST</sub> vs Cramer’s V (counts) and F<sub>ST</sub> vs Cramer’s V (categories) to get a sense of the correlation between the two metrics.</p>
<p><img src="/images/cramers-v-vs-fst/bfm_vs_bfs_fst_vs_cramers_v_counts.png" alt="Fst vs Cramer's V (counts)" /></p>
<p><img src="/images/cramers-v-vs-fst/bfm_vs_bfs_fst_vs_cramers_v_categories.png" alt="Fst vs Cramer's V (categories)" /></p>
<p>The above figures give the scatter plots of F<sub>ST</sub> vs Cramer’s V with the counts and categories feature encodings, respectively. Cramer’s V calculated on the count-encoded features has an <script type="math/tex">r^2</script> value of 0.865 vs F<sub>ST</sub>, while Cramer’s V calculated on the category-encoded features has an <script type="math/tex">r^2</script> value of 0.818 vs F<sub>ST</sub>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Along with Random Forests and Logistic Regression Ensembles, Cramer’s V is another alternative to F<sub>ST</sub> for finding variants that best describe the genetic basis of differences between two populations. Cramer’s V correlates well with F<sub>ST</sub>, but a simple correlation analysis doesn’t tell us which metric is more appropriate for a given situation. Substantial work remains to validate the four methods and compare them.</p>
Tue, 04 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html
Tags: statistics, bioinformatics

Classifying Graphs with Shortest Paths

<p>Graphs can be an easy and intuitive way of representing interactions between agents or state transitions in sociological, biological, and dynamical systems. The <a href="http://cse.nd.edu">Computer Science & Engineering department</a> happens to have a number of researchers in the <a href="http://icensa.nd.edu">Interdisciplinary Center for Network Science and Applications (iCeNSA)</a> working on <a href="https://en.wikipedia.org/wiki/Complex_network">complex networks</a>. Between some of my own research modeling protein-folding dynamics as Markov State Models and having a desk in the iCeNSA office space, I was exposed to some of this research.</p>
<p>One of the most natural applications of network science is analyzing clickstream data. In particular, we can represent users’ browsing sessions as graphs. In the simplest case, we can use vertices to represent the pages that users have visited and directed edges to represent that a user has navigated from one page to another. A more sophisticated model might use edge weights to record the number of times the user navigated from one page to another in a single session. In fact, if we normalize the outgoing edge weights of each vertex, we can derive a <a href="https://en.wikipedia.org/wiki/Markov_model">Markov model</a> of the dynamics of the browsing session.</p>
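Normalizing the outgoing edge weights into a row-stochastic transition matrix looks like this (the click counts here are made up for illustration):

```python
import numpy as np

# rows/cols = pages; entry [i, j] = number of times a user clicked
# from page i to page j in a session (hypothetical counts)
counts = np.array([[0, 3, 1],
                   [2, 0, 2],
                   [1, 1, 0]], dtype=float)

# normalize each row so outgoing probabilities sum to 1: this is the
# row-stochastic transition matrix of the session's Markov model
transitions = counts / counts.sum(axis=1, keepdims=True)
```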
<p>My motivation and goal of modeling users’ browsing sessions as graphs is to be able to segment users by their browsing behaviors. For example, I may want to train a machine learning model to discriminate between users who are likely to make a purchase (convert) versus those who are just window shopping using graphs generated from their browsing sessions. I don’t know the content of the web sites, and web sites can be structured differently. Thus, I won’t be able to match vertices between separate graphs easily. As such, I want to engineer features based purely on topological features of the graphs that are invariant to permutations of vertices and the number of edges and vertices in graphs.</p>
<p>There are several ways to approach classifying graphs with machine learning models. One approach is simply to engineer a bunch of features from different statistics computed from the graphs. <a href="http://onlinelibrary.wiley.com/doi/10.1002/sam.11153/full">Li, et al.</a> describe a number of metrics including the <a href="https://en.wikipedia.org/wiki/Clustering_coefficient">average clustering coefficient</a> and the average path length (<a href="https://en.wikipedia.org/wiki/Closeness_centrality">closeness centrality</a>). However, be aware that some of their features (such as the numbers of edges and vertices) probably won’t be useful if you are comparing graphs of different sizes.</p>
<p>A second approach would be to use <a href="https://en.wikipedia.org/wiki/Graph_kernel">graph kernels</a>, functions for computing a similarity score between two graphs. A number of machine learning methods (called <a href="https://en.wikipedia.org/wiki/Kernel_method">kernel methods</a>) such as Support Vector Machines and Principal Component Analysis can be adapted to use inner products computed between pairs of data points using a kernel instead of feature vectors. Kernel methods are advantageous since they extend these techniques to data types that are difficult to represent with traditional feature vectors. Often-cited graph kernels include <a href="http://ieeexplore.ieee.org/abstract/document/1565664/">Shortest-Paths</a>, <a href="http://www.jmlr.org/proceedings/papers/v5/shervashidze09a/shervashidze09a.pdf">Graphlet</a>, and <a href="https://en.wikipedia.org/wiki/Graph_kernel">Random Walk</a> kernels. At <a href="http://nips.cc">NIPS 2016</a>, I saw a very nice presentation by <a href="https://www.cs.uchicago.edu/directory/risi-kondor">Risi Kondor</a> on the <a href="http://papers.nips.cc/paper/6135-learning-bound-for-parameter-transfer-learning.pdf">Multiscale Laplacian Graph Kernel</a>, which both allows incorporating features computed on vertices and edges and adapts well to multi-scale problems like protein structures.</p>
<p>For this blog post, I’m going to focus on evaluating the ability of machine learning models to discriminate between undirected graphs generated by the <a href="https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_model">Erdős–Rényi</a> and <a href="https://en.wikipedia.org/wiki/Stochastic_block_model">planted partitioned</a> random graph models. I’m using the experimental framework from a paper on the <a href="https://arxiv.org/pdf/1510.06492">generalized Shortest-Path graph kernel</a>. Instead of using graph kernels, I’m first going to focus on features engineered from the distribution of lengths of the shortest-paths between all pairs of vertices. Here are examples of two such graphs:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/graphs.png" alt="graphs" /></p>
<h2 id="generating-the-graphs">Generating the Graphs</h2>
<p>I’m choosing to generate 100 graphs of each type. Each graph has 100 vertices. For the ER model, I’m using an edge probability of 0.2. Following the direction of the generalized shortest-path graph kernels paper, I set the parameters for the planted partitioned model to generate the same number of edges as the ER model, with a multiplier for <script type="math/tex">p_1</script> of 1.6.</p>
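With networkx, the two graph families can be generated along these lines. The derivation of the between-block probability below is my reading of “same expected number of edges as the ER model” and may differ from the paper’s exact setup:

```python
import networkx as nx

n, p = 100, 0.2                  # ER: 100 vertices, edge probability 0.2
n_blocks, block_size = 2, 50     # planted partition: two blocks of 50
p_in = 1.6 * p                   # multiplier of 1.6 for p_1

# choose p_out so the expected edge counts of the two models match
within_pairs = n_blocks * block_size * (block_size - 1) / 2
between_pairs = block_size * block_size
total_pairs = n * (n - 1) / 2
p_out = (total_pairs * p - within_pairs * p_in) / between_pairs

er = nx.erdos_renyi_graph(n, p, seed=1)
pp = nx.planted_partition_graph(n_blocks, block_size, p_in, p_out, seed=1)
```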
<h2 id="analysis-of-shortest-path-distributions">Analysis of Shortest-Path Distributions</h2>
<p>The distribution of the average of the all-pairs shortest-path lengths for each graph is plotted below:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/avg_sp_hist.png" alt="average all-pairs shortest-path lengths" /></p>
<p>Note that the two distributions overlap substantially. Thus, simply using the average all-pairs shortest-path length for each graph won’t be able to effectively discriminate between graphs from the two classes.</p>
<p>I then decided to try generating a normalized histogram of the all-pairs shortest-path lengths for each graph. When I compared the distributions of the Euclidean distances of graphs generated by the same model to pairs from different models, I observed a separation in the distributions:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/sp_distr_dist_hist.png" alt="shortest-path length distribution distances" /></p>
<p>The difference in the distribution of distances for graphs from different models versus those generated by the same model was promising. I saw a similar result with cosine similarity.</p>
<h2 id="generating-features">Generating Features</h2>
<p>As I mentioned earlier, instead of using a kernel method, I focused on generating features that could be used with standard machine learning models. I focused on four modeling approaches:</p>
<ol>
<li>One feature vector for each graph. The features were the <script type="math/tex">L_2</script> normalized histogram of all-pairs shortest-path lengths. I used bin-sizes of 1 with enough bins to include the longest path found. (In my case, all of the graphs had the same maximum length of 3.) The ER graphs were labeled as 0, while the PP graphs were labeled as 1.</li>
<li>Represent each pair of graphs as a feature vector. I computed the difference in the normalized histograms of the all-pairs shortest-paths lengths. Since the differences aren’t symmetric, I computed the differences each way and added two feature vectors for each pair. Feature vectors containing graphs from the same model were labeled 0, while feature vectors for graphs from different models were labeled 1.</li>
<li>Same as #2, except that I took the absolute value of the differences and only added one feature vector per pair.</li>
<li>Like approach #3, I used pairs of graphs. I used a single feature – the Euclidean distance calculated between the normalized histograms from each pair.</li>
</ol>
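Feature type 1 can be sketched with networkx and numpy (a hypothetical helper; bin size 1, as described above):

```python
import numpy as np
import networkx as nx

def sp_histogram(graph, max_len):
    """L2-normalized histogram (bin size 1) of the all-pairs
    shortest-path lengths of a graph."""
    counts = np.zeros(max_len)
    for _, lengths in nx.all_pairs_shortest_path_length(graph):
        for dist in lengths.values():
            if dist > 0:             # skip the zero-length self-paths
                counts[dist - 1] += 1
    norm = np.linalg.norm(counts)
    return counts / norm if norm > 0 else counts
```

The pairwise feature types (2-4) can then be built from differences or distances between these histograms.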
<h2 id="experiments">Experiments</h2>
<p>I used <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier">Logistic Regression with Stochastic Gradient Descent</a> and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forests</a> for classification. I used 1,000 epochs for LR, and 100 trees for RF. I performed 10-fold stratified cross-validation and used accuracy and area under the ROC curve as metrics. Accuracy utilizes binary predictions, while the ROC AUC utilizes the predicted probabilities.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Features Type</th>
<th style="text-align: center">Classifier</th>
<th style="text-align: center">ROC AUC (std)</th>
<th style="text-align: center">Accuracy (%, std)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">LR</td>
<td style="text-align: center"><strong>1.0 (0.0)</strong></td>
<td style="text-align: center"><strong>98.0 (3.3)</strong></td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>1.0 (0.0)</strong></td>
<td style="text-align: center"><strong>100.0 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.490 (0.006)</td>
<td style="text-align: center">66.7 (0.0)</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>0.999 (0.0)</strong></td>
<td style="text-align: center"><strong>99.3 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.910 (0.007)</td>
<td style="text-align: center">71.1 (0.4)</td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>0.999 (0.0)</strong></td>
<td style="text-align: center"><strong>99.2 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.914 (0.005)</td>
<td style="text-align: center">66.7 (0.0)</td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">RF</td>
<td style="text-align: center">0.921 (0.002)</td>
<td style="text-align: center">84.4 (0.4)</td>
</tr>
</tbody>
</table>
<p>LR performed well with feature type 1 with an average ROC AUC of 1.0 and average accuracy of 98.0%, but performed poorly on all other feature types. RFs performed well with feature types 1-3, achieving a minimum average ROC AUC of 0.999 and average accuracy of 99.2%.</p>
<p>We see these results reflected in ROC curves generated for each classifier from one of its 10 folds:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/lr_roc.png" alt="LR ROC" /></p>
<p>The LR classifier has an abysmal ROC curve for feature type 2. The ROC curves for features types 3 and 4 are decent. The ROC curve for feature type 1 appears to be perfect. (Note that the ROC curves are based on sorting by predicted probabilities, while the accuracies used the binary labels.)</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/rf_roc.png" alt="RF ROC" /></p>
<p>The RF model performs nearly perfectly on feature type 1, as indicated by the barely visible curve in the upper left of the plot. The ROC curves for feature types 2 and 3 are nearly perfect. The ROC curve for feature type 4 is acceptable, largely an indication of how robust RF classifiers are.</p>
<p>In terms of computational complexity, feature type 1 only requires creating a feature vector for each graph. Feature types 2-4 require computing a feature vector or distance for each pair of graphs.</p>
<p>Overall, it looks like feature type 1, where the feature vectors are the normalized histograms (bin size 1) of the all-pairs shortest-path lengths for each graph can be used to very accurately discriminate between graphs generated by the two models, regardless of the classifier used.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, I looked at training machine learning models to discriminate between graphs generated by two different random graph models. I gave some background on feature engineering and kernel methods for graphs. I looked at four ways of representing the problem and how the corresponding features would be generated. I evaluated the four types of features using Logistic Regression and Random Forest classifiers. I observed that both LR and RF classifiers performed well when normalized histograms of the all-pairs shortest-paths lengths were used as features.</p>
<p>Going forward, I’d like to evaluate additional types of features such as the average clustering coefficient and the distributions of graphlets found in the graphs. Using the feature-vector-per-pair approach (feature type 3), I could incorporate graph kernels as features. However, I would probably need to find a different way to represent the absolute differences in the histograms so that it’s feasible to use Logistic Regression. One way to do so might be to discretize the differences for each bin in the histogram into a set of bins.</p>
<p>Once I have a reasonable set of useful features, I’d like to explore the effectiveness of this approach on small graphs since most user browsing sessions have far fewer than 100 nodes (e.g., 10).</p>
<p><em>The scripts used in the analyses are available in my <a href="https://github.com/rnowling/graph-experiments">graph-experiments</a> repo on GitHub under the Apache Public License v2.</em></p>
Sat, 04 Mar 2017 00:01:19 +0000
http://rnowling.github.io/machine/learning/2017/03/04/classifying-graphs-with-shortest-paths.html
Variable Selection with Logistic Regression Ensembles
<p><em>(02/16/2017) Thanks to feedback on the <a href="https://www.reddit.com/r/bioinformatics/comments/5u8f7i/be_cautious_when_using_logistic_regression_for/">bioinformatics reddit</a>, it’s been brought to my attention that most GWAS studies employ Logistic Regression for single-SNP association tests using software such as <a href="https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html">SNPTEST</a>. This is different from the approach of incorporating all of the SNPs into a single Logistic Regression model as described below. <a href="http://www.nature.com/nrg/journal/v11/n7/abs/nrg2796.html">Marchini, et al.</a> and <a href="http://www.nature.com/nrg/journal/v7/n10/full/nrg1916.html">Balding</a> have written some excellent reviews of statistical practices in GWAS that discuss single-SNP association tests and other approaches. I’ve changed the title to reflect that the effects of variance in the LR weights on variable selection are still valid.</em></p>
<p>Logistic regression models are commonly used to identify SNPs which are correlated with differences between phenotypes associated with population structures. Logistic Regression is particularly popular for genome-wide association studies (GWAS) of human diseases<sup><a href="#liu">1</a>,<a href="#stahl">2</a>,<a href="#hunter">3</a>,<a href="#shi">4</a>,<a href="#han">5</a>,<a href="#turnbull">6</a>,<a href="#chasman">7</a>,<a href="#kumar">8</a>,<a href="#cha">9</a>,<a href="#hu">10</a></sup>.</p>
<p>When applied to SNPs, samples are assigned to classes in accordance with their phenotypes, and their variants are encoded as a feature matrix. An LR model is then trained. The magnitudes of the weights from the LR model are used to rank the variants, with the top-ranked variants selected for further exploration.</p>
<p>Genomes often have on the order of millions of variants. With such large data sizes, LR models often need to be trained with approximate, stochastic methods such as Stochastic Gradient Descent (SGD). These methods introduce randomness into the weights and, consequently, the rankings. We decided to evaluate the consistency of the rankings.</p>
<h2 id="comparison-of-rankings-from-two-logistic-regression-models">Comparison of Rankings from Two Logistic Regression Models</h2>
<p>To demonstrate this effect, we trained a pair of LR models on variants from 149 <em>An. gambiae</em> and <em>An. coluzzii</em> mosquitoes in the <a href="https://www.malariagen.net/projects/ag1000g"><em>Anopheles</em> 1000 genomes</a> dataset. We encoded each variant as two features, each storing the number of occurrences of one allele. We used the magnitudes (absolute values) of the weights from the models to rank the variants. We then compared the membership of the top 0.01% (466) of the SNPs ranked by each model using the Jaccard similarity and found that the two models agreed on only 81% of them. The following table contains the similarity for different thresholds:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Threshold (%)</th>
<th style="text-align: center">Number of SNPs</th>
<th style="text-align: center">Jaccard Similarity</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">0.01%</td>
<td style="text-align: center">466</td>
<td style="text-align: center">80.7%</td>
</tr>
<tr>
<td style="text-align: center">0.1%</td>
<td style="text-align: center">4,662</td>
<td style="text-align: center">83.8%</td>
</tr>
<tr>
<td style="text-align: center">1%</td>
<td style="text-align: center">46,620</td>
<td style="text-align: center">79.2%</td>
</tr>
<tr>
<td style="text-align: center">10%</td>
<td style="text-align: center">466,204</td>
<td style="text-align: center">76.6%</td>
</tr>
</tbody>
</table>
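<p>The similarities above can be computed from two weight vectors with a short sketch like this (a generic illustration, not the Asaph code; the function name is mine):</p>

```python
import numpy as np

def top_k_jaccard(weights_a, weights_b, k):
    # Rank features by weight magnitude and take the top k from each model
    top_a = set(np.argsort(-np.abs(weights_a))[:k])
    top_b = set(np.argsort(-np.abs(weights_b))[:k])
    # Jaccard similarity: |intersection| / |union|
    return len(top_a & top_b) / len(top_a | top_b)

w1 = np.array([5.0, -0.1, 3.0, 0.2])
w2 = np.array([-4.5, 0.2, 0.1, 3.0])
similarity = top_k_jaccard(w1, w2, k=2)  # shares only feature 0
```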
<p>This instability could have significant impacts on the reproducibility and correctness of these GWAS studies.</p>
<h2 id="logistic-regression-ensembles">Logistic Regression Ensembles</h2>
<p>Leo Breiman realized that certain machine learning models (decision trees, linear regression, and others) are unstable<sup><a href="#breiman">11</a></sup> and proposed bagging<sup><a href="#bagging">12</a></sup> as a solution. Breiman later used bagging in his Random Forests<sup><a href="#random-forests">13</a></sup> algorithm, where it became well-known. Breiman’s focus was on classifier accuracy, however, and not necessarily on calculating variable importance scores or using weights for ranking.</p>
<p>We employ an ensemble approach to Logistic Regression models to stabilize the feature weights and achieve consistent rankings. We trained pairs of ensembles of Logistic Regression models. We normalized the weight vector from each model and then computed the average magnitude of the weights for each feature. We then used the averaged magnitudes to rank the SNPs. We repeated our analysis of the Jaccard similarities of the top 0.01%, 0.1%, 1%, and 10% of the ranked SNPs for ensembles with different numbers of models.</p>
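<p>A minimal sketch of this averaging scheme with scikit-learn (for illustration, the per-model randomness here comes from bootstrap resampling rather than the SGD training Asaph uses, and the function name is mine):</p>

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def ensemble_lr_importances(X, y, n_models, seed=0):
    # Average the normalized weight magnitudes over an ensemble of LR models
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    total = np.zeros(n_features)
    for _ in range(n_models):
        idx = rng.choice(n_samples, size=n_samples, replace=True)  # bootstrap sample
        model = LogisticRegression(max_iter=500).fit(X[idx], y[idx])
        w = model.coef_.ravel()
        total += np.abs(w) / np.linalg.norm(w)  # normalize, then take magnitudes
    return total / n_models

X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)
scores = ensemble_lr_importances(X, y, n_models=10)
ranking = np.argsort(-scores)  # highest averaged magnitude first
```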
<p><img src="/images/stable_rankings_lr_ensembles/snp_ranking_overlaps_sgd-l2.png" alt="Jaccard Similarities for Logistic Regression Ensembles" /></p>
<p>With ensembles of 250 models, we were able to achieve an agreement of 99% of the top 0.01% of SNPs.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Logistic Regression models trained with stochastic methods such as Stochastic Gradient Descent (SGD) do not necessarily produce the same weights from run to run. This does not generally affect classification accuracy, especially in cases with a large number of correlated variables. However, the variations in the weights do affect analyses such as ranking and variable selection. Researchers should be cautious when using Logistic Regression weights for ranking.</p>
<p>We demonstrated that an ensemble approach can be used to stabilize the weights and consequently the resulting variable rankings. Further validation work will be needed to determine if Logistic Regression ensembles are a suitable solution, but our results are promising.</p>
<p><em>The analyses presented here used the development branch of the software package <a href="https://github.com/rnowling/asaph">Asaph</a>.</em></p>
<h2 id="references">References</h2>
<p><a name="liu">1</a>: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150510/">Liu, et al.</a></p>
<p><a name="stahl">2</a>: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243840/">Stahl, et al.</a></p>
<p><a name="hunter">3</a>: <a href="http://www.nature.com/ng/journal/v39/n7/full/ng2075.html">Hunter, et al.</a></p>
<p><a name="shi">4</a>: <a href="http://www.nature.com/ng/journal/v43/n12/abs/ng.978.html">Shi, et al.</a></p>
<p><a name="han">5</a>: <a href="http://www.nature.com/ng/journal/v41/n11/abs/ng.472.html">Han, et al.</a></p>
<p><a name="turnbull">6</a>: <a href="http://www.nature.com/ng/journal/v42/n6/abs/ng.586.html">Turnbull, et al.</a></p>
<p><a name="chasman">7</a>: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3125402/">Chasman, et al.</a></p>
<p><a name="kumar">8</a>: <a href="http://www.nature.com/ng/journal/v43/n5/abs/ng.809.html">Kumar, et al.</a></p>
<p><a name="cha">9</a>: <a href="http://hmg.oxfordjournals.org/content/19/23/4735.short">Cha, et al.</a></p>
<p><a name="hu">10</a>: <a href="http://www.nature.com/ng/journal/v43/n8/abs/ng.875.html">Hu, et al.</a></p>
<p><a name="breiman">11</a>: <a href="http://projecteuclid.org/euclid.aos/1032181158">Leo Breiman (1994)</a></p>
<p><a name="bagging">12</a>: <a href="http://www.machine-learning.martinsewell.com/ensembles/bagging/Breiman1996.pdf">Leo Breiman (1996)</a></p>
<p><a name="random-forests">13</a>: <a href="http://link.springer.com/article/10.1023/A:1010933404324">Leo Breiman (2001)</a></p>
Tue, 14 Feb 2017 00:01:19 +0000
http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html
Testing CLI Apps with Bats
<p>I’ve been looking for a good way to test <a href="https://github.com/rnowling/asaph">Asaph</a>, the small machine-learning application I wrote for my Ph.D. thesis.</p>
<p>Most testing solutions I found didn’t quite fit what I wanted. The built-in Python <a href="https://docs.python.org/2/library/unittest.html"><code class="highlighter-rouge">unittest</code></a> framework is my usual go-to. It’s flexible, powerful, and easy-to-use. Asaph is heavily data-dependent with relatively complex internal data structures and its workflow involves lots of file I/O. Consequently, I found it cumbersome to write unit tests.</p>
<p>Most command-line testing solutions seem to be focused on testing the interfaces. Most examples I found focused on using <code class="highlighter-rouge">unittest</code> to test argument parsing with libraries such as <a href="https://docs.python.org/2.7/library/argparse.html"><code class="highlighter-rouge">argparse</code></a>. Other options include testing interactive CLI apps such as those that prompt the user or use something like <a href="https://en.wikipedia.org/wiki/Curses_%28programming_library%29"><code class="highlighter-rouge">curses</code></a>. Asaph isn’t really interactive, though.</p>
<p>Asaph’s commands form a workflow. The user first calls Asaph to convert data to its internal format. The user then uses Asaph to train a Logistic Regression model or Random Forests models with different numbers of trees. The user can then call Asaph to check convergence of the SNP rankings, deciding whether to train models with more trees or not. Lastly, the user can output the SNP rankings from one of the models. In each step, new files (initial data, models, plots, rankings) are added to the work directory or the contents of the work directory are queried.</p>
<p>What I really wanted was to test Asaph and its workflow holistically. I want to call Asaph and check that it executes successfully and produces the expected output on disk. Sometimes it may be enough to merely check that the output exists, while in other cases, I want to query the output to make sure what it contains is reasonable. By running Asaph’s workflow on test data, we can check the most common codepaths and ensure no syntax or type errors have been introduced.</p>
<p>In my search, I came across <a href="https://github.com/sstephenson/bats">Bash Automated Testing System</a> or bats. Bats allows you to write tests in Bash. Bash scripts map well onto my use case: Bash commands are external programs in the traditional Unix philosophy. You then define assertions through standard Bash comparisons. Additionally, bash supports simple setup and teardown functions and loading helper functions.</p>
<p>Bats is best demonstrated through the tests I created for the Asaph import script:</p>
<script src="https://gist.github.com/rnowling/74224fed33ac99137d373297d6694c34.js"></script>
<p>In the example, I use the following features:</p>
<ol>
<li>Defining an embedded helper function (<code class="highlighter-rouge">count_snps</code>)</li>
<li>Setup and teardown functions</li>
<li>Defining tests with the annotation <code class="highlighter-rouge">@test</code></li>
<li>Running commands and checking the return codes</li>
<li>Checking for the existence of output files and directories</li>
<li>Checking the contents of output files</li>
</ol>
<p>My experience is not unique to Asaph. Many of the scientific applications I’ve come across in my research are built around complex datasets and workflows of deeply-connected steps. It can be easier to use holistic tests for these applications. I haven’t quite come across anything like Bats before, but I think it can be a useful tool for computational scientists.</p>
Sat, 04 Feb 2017 00:01:19 +0000
http://rnowling.github.io/software/engineering/2017/02/04/testing-cli-apps-with-bats.html
Running OpenMM in Docker on Debian
<p><a href="http://openmm.org/">OpenMM</a> is an open-source and high-quality molecular dynamics library for GPUs and CPUs. I used OpenMM extensively in my first few years of graduate school. With a new project potentially on my plate, I decided to go back to OpenMM and refamiliarize myself with it. The first task was to get OpenMM running on my Debian system, preferably with OpenCL support for the built-in Intel GPU.</p>
<p>I decided to build OpenMM inside a Docker container, mostly so I can track dependencies more easily.</p>
<h2 id="installing-docker">Installing Docker</h2>
<p>Since I’m using Debian Stretch (testing), some of the packages are in flux. In particular, the <code class="highlighter-rouge">docker.io</code> package had been <a href="https://tracker.debian.org/news/804615">pulled from testing</a>.</p>
<p>So, I had to install Docker from the upstream repository using the <a href="https://docs.docker.com/engine/installation/linux/debian/">official instructions</a>. Unfortunately, the command for adding the repository failed:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo add-apt-repository \
"deb https://apt.dockerproject.org/repo/ \
debian-$(lsb_release -cs) \
main testing"
Traceback (most recent call last):
File "/usr/bin/add-apt-repository", line 95, in <module>
sp = SoftwareProperties(options=options)
File "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", line 109, in __init__
self.reload_sourceslist()
File "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", line 599, in reload_sourceslist
self.distro.get_sources(self.sourceslist)
File "/usr/lib/python3/dist-packages/aptsources/distro.py", line 89, in get_sources
(self.id, self.codename))
aptsources.distro.NoDistroTemplateException: Error: could not find a distribution template for Debian/stretch
</code></pre></div></div>
<p>Thankfully, a simple workaround was available – just manually append the repo line to <code class="highlighter-rouge">/etc/apt/sources.list</code>:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo su -c "echo deb https://apt.dockerproject.org/repo/ debian-stretch main testing >> /etc/apt/sources.list"
</code></pre></div></div>
<p>From there, I was able to follow the rest of the instructions to set up Docker for non-root users.</p>
<h2 id="building-openmm">Building OpenMM</h2>
<p>Once I had Docker up and running, I created a <code class="highlighter-rouge">Dockerfile</code> to build an image containing the OpenMM dependencies and source code.</p>
<p>You can clone the repository and build the image like so:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/rnowling/openmm-docker.git
$ cd openmm-docker
$ docker build -t rnowling/openmm-docker .
</code></pre></div></div>
<p>The build takes quite a while since OpenMM is a large project. Once built, you can run the container to get a bash shell:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ docker run -i -t --device /dev/dri:/dev/dri rnowling/openmm-docker
root@874b1411bf01:/openmm-7.0.1#
</code></pre></div></div>
<p>Note the <code class="highlighter-rouge">--device /dev/dri:/dev/dri</code> flag. This exposes the GPU to the container, allowing usage of the GPU with OpenCL. Note that the OpenCL mixed and double precision tests will fail. If you run one of them manually, you’ll see why:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@874b1411bf01:/openmm-7.0.1# ./TestOpenCLHarmonicBondForce mixed
exception: This device does not support double precision
</code></pre></div></div>
<p>I used the open-source <a href="https://freedesktop.org/wiki/Software/Beignet/">beignet</a> OpenCL driver for Intel GPUs. Unfortunately, beignet only supports single-precision floating point operations at the moment. In the future, I may look into using the proprietary Intel OpenCL drivers. Generally speaking, many of the force calculations can be done in single precision, but double precision is <a href="https://github.com/pandegroup/openmm/issues/1616#issuecomment-252302777">needed in some of the integration calculations</a> to minimize errors.</p>
<p>Finally, we can test the image with the included Python wrapper:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>root@874b1411bf01:/openmm-7.0.1# python -m simtk.testInstallation
There are 3 Platforms available:
1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 OpenCL - Successfully computed forces
Median difference in forces between platforms:
Reference vs. CPU: 1.99996e-06
Reference vs. OpenCL: 0.055165
CPU vs. OpenCL: 0.0551638
</code></pre></div></div>
<p>Voilà! OpenMM running in a Docker container.</p>
Sat, 28 Jan 2017 00:01:19 +0000
http://rnowling.github.io/molecular/dynamics/2017/01/28/openmm-docker-debian.html