RJ NowlingPersonal website and blog for RJ Nowling. Data science engineer with a Ph.D. in Computer Science & Engineering with experience in computational physics, bioinformatics, machine learning, and distributed systems.
http://rnowling.github.io/
Wed, 03 May 2017 16:22:54 +0000Wed, 03 May 2017 16:22:54 +0000Jekyll v3.4.3Talk on Productionizing ML Models<p>Last night, I gave a talk titled “Real-World Lessons in Machine Learning Applied to Spam Classification” at the <a href="https://www.meetup.com/MKE-Big-Data/">MKE Big Data</a> meetup. In my talk, I used spam classification as a use case for communicating some lessons learned from my experiences building production machine learning-powered services. In particular, I wanted to get the point across that modeling and algorithm choices are not independent of the requirements of the production system: we need to design our models and choose our algorithms while keeping in mind how those choices will impact the resulting production system.</p>
<p>You can grab my slides <a href="/static/rnowling_mke_big_data_2017.pdf">here</a>. My slides and source code I used to generate my plots are also available on the <a href="https://github.com/MKE-Big-Data/MKE-BD-Talks">MKE BD Talks</a> GitHub repo.</p>
<p>A few attendees had asked for some additional resources related to the topics. Martin Zinkevich of Google recently published an excellent guide based on their experiences titled <a href="http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf">Rules of Machine Learning: Best Practices for ML Engineering</a>, which I highly recommend. <a href="http://hunch.net/~vw/">Vowpal Wabbit</a> is a powerful toolkit for online machine learning that incorporates some of the latest algorithms and techniques.</p>
Wed, 03 May 2017 00:02:19 +0000
http://rnowling.github.io/machine/learning/2017/05/03/production-ml-systems.html
http://rnowling.github.io/machine/learning/2017/05/03/production-ml-systems.htmlengineeringmachinelearningRandom Forests vs F<sub>ST</sub> for Insect Population Genetics<p>For my last comparison, I’ll look at the correlation between the variable importance measures (VIM) computed by Random Forests vs the scores calculated via F<sub>ST</sub>. Previously, I analyzed the correlations between F<sub>ST</sub> scores and associations computed using <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">Cramer’s V</a> and weights computed from <a href="http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.html">Logistic Regression Ensembles</a>.</p>
<p>I again used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> for the comparison. I trained the Random Forests with both the counts and categorical feature encodings using the <code class="highlighter-rouge">dev</code> branch of my population genetics methods exploration toolkit, <a href="https://github.com/rnowling/asaph/">Asaph</a>. Plots of the variable importance measures from the Random Forests vs F<sub>ST</sub> scores are below:</p>
<p><img src="/images/random-forests-vs-fst/bfm_vs_bfs_fst_vs_rf_counts.png" alt="Fst vs Random Forests (counts)" /></p>
<p><img src="/images/random-forests-vs-fst/bfm_vs_bfs_fst_vs_rf_categories.png" alt="Fst vs Random Forests (categories)" /></p>
<p>With the counts feature-encoding scheme, linear regression between the Random Forests variable importance measures and F<sub>ST</sub> scores had <script type="math/tex">r^2=0.665</script>. With the categories feature-encoding scheme, linear regression gave <script type="math/tex">r^2=0.656</script>.</p>
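For reference, the <em>r</em><sup>2</sup> values above come from a simple linear fit between the two sets of per-SNP scores. A minimal sketch of that computation using <code class="highlighter-rouge">scipy.stats.linregress</code>; the arrays here are small hypothetical stand-ins, not the actual Asaph and vcftools outputs:

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical per-SNP scores: one F_ST value and one RF variable
# importance measure per SNP (stand-ins for the real outputs).
fst_scores = np.array([0.01, 0.05, 0.40, 0.75, 0.90, 0.12])
rf_importances = np.array([0.001, 0.004, 0.030, 0.055, 0.070, 0.010])

fit = linregress(fst_scores, rf_importances)
r_squared = fit.rvalue ** 2  # r^2 of the linear fit between the two scores
print(f"r^2 = {r_squared:.3f}")
```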
<h2 id="conclusion">Conclusion</h2>
<p>Variable importance measures computed for each SNP using Random Forests correlate reasonably well with the F<sub>ST</sub> scores, regardless of the encoding scheme used. (Random Forests are particularly robust to the choice of encoding scheme.) It would be interesting to analyze variants where Random Forests and F<sub>ST</sub> give substantially different scores.</p>
Wed, 05 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/05/random-forests-vs-fst.html
http://rnowling.github.io/bioinformatics/2017/04/05/random-forests-vs-fst.htmlstatisticsbioinformaticsLogistic Regression Ensembles vs F<sub>ST</sub> for Insect Population Genetics<p>In line with my <a href="http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html">recent correlation analysis</a> between Cramer’s V and F<sub>ST</sub>, I wanted to compare the weights calculated by the <a href="http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html">Logistic Regression Ensembles</a> I recently discussed with F<sub>ST</sub>. Logistic Regression Ensembles are implemented in the <code class="highlighter-rouge">dev</code> branch of my population genetics methods exploration toolkit <a href="https://github.com/rnowling/asaph/">Asaph</a>.</p>
<p>I again used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> for the comparison. I ran the Logistic Regression Ensembles with both the counts and categorical feature encodings and with and without bagging. Plots of the weights from the Logistic Regression Ensembles vs F<sub>ST</sub> scores are below:</p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_counts_bagging.png" alt="Fst vs LR Ensembles (counts) w/ Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_counts_no_bagging.png" alt="Fst vs LR Ensembles (counts) w/o Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_categories_bagging.png" alt="Fst vs LR Ensembles (categories) w/ Bagging" /></p>
<p><img src="/images/lr-vs-fst/bfm_vs_bfs_fst_vs_lr_categories_no_bagging.png" alt="Fst vs LR Ensembles (categories) w/o Bagging" /></p>
<p>I performed linear regression on the F<sub>ST</sub> scores and LR ensemble weights to calculate <script type="math/tex">r^2</script> values:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Feature Encoding</th>
<th style="text-align: center">Bagging?</th>
<th style="text-align: center"><script type="math/tex">r^2</script></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Counts</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">0.889</td>
</tr>
<tr>
<td style="text-align: center">Counts</td>
<td style="text-align: center">No</td>
<td style="text-align: center">0.812</td>
</tr>
<tr>
<td style="text-align: center">Categories</td>
<td style="text-align: center">Yes</td>
<td style="text-align: center">0.850</td>
</tr>
<tr>
<td style="text-align: center">Categories</td>
<td style="text-align: center">No</td>
<td style="text-align: center">0.643</td>
</tr>
</tbody>
</table>
<h2 id="conclusion">Conclusion</h2>
<p>When bagging is used, Logistic Regression Ensemble weights correlate well with F<sub>ST</sub> scores. The only real outlier is the combination of the categories feature encoding and no bagging.</p>
<p>It’s important to mention that the correlation analysis only tells us how well these methods agree with F<sub>ST</sub>. It does not tell us which method is better or worse. Substantial work remains to validate the results of the Logistic Regression Ensembles.</p>
Wed, 05 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.html
http://rnowling.github.io/bioinformatics/2017/04/05/lr-ensembles-vs-fst.htmlstatisticsbioinformaticsCramer's V vs F<sub>ST</sub> for Insect Population Genetics<p><a href="https://en.wikipedia.org/wiki/Fixation_index">Fixation index</a>, or F<sub>ST</sub>, is a univariate statistic calculated as the ratio of the variance between populations to the total variance. Within insect population genetics, F<sub>ST</sub> is used to score, and then rank, the correlation between variants and the population structure.</p>
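For intuition, here is a minimal sketch of F<sub>ST</sub> for a single biallelic SNP using the classic heterozygosity-based formulation, F<sub>ST</sub> = (H<sub>T</sub> − H<sub>S</sub>) / H<sub>T</sub>. This is a simplification for illustration; the actual analysis below used vcftools, whose estimator applies additional corrections. The function name and inputs are my own:

```python
def fst_biallelic(p1, p2):
    """Wright's F_ST for one biallelic SNP from the allele frequencies p1, p2
    in two equally sized subpopulations: (H_T - H_S) / H_T."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled population
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within-population
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

# A fixed difference (p1=1, p2=0) gives F_ST = 1; identical frequencies give 0.
print(fst_biallelic(1.0, 0.0))  # 1.0
print(fst_biallelic(0.3, 0.3))  # 0.0
```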
<p>The focus of my Ph.D. dissertation was to investigate variable importance measures as calculated via Random Forests as an alternative to F<sub>ST</sub>. I’ve also begun looking at <a href="http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html">Logistic Regression Ensembles</a>.</p>
<p>In addition to these two machine learning approaches, I wanted to investigate a statistical method, Cramer’s V. <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r's_V">Cramer’s V</a> measures the association (a correlation-like score) between nominal (categorical) variables. I went ahead and implemented Cramer’s V in the <code class="highlighter-rouge">dev</code> branch of my population genetics methods exploration toolkit <a href="https://github.com/rnowling/asaph/">Asaph</a>.</p>
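As a sketch of the statistic itself (not Asaph's implementation), Cramer's V can be computed from an r × k contingency table of population vs genotype counts as √(χ² / (n · (min(r, k) − 1))); the toy table below is hypothetical:

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    """Cramer's V from an r x k contingency table:
    sqrt(chi^2 / (n * (min(r, k) - 1)))."""
    table = np.asarray(table, dtype=float)
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.sum()
    r, k = table.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Rows: two populations; columns: genotype categories for one hypothetical SNP.
table = [[40, 10, 0],
         [5, 15, 30]]
print(cramers_v(table))
```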
<p>I used the Burkina Faso <em>An. gambiae</em> and <em>An. coluzzii</em> samples from the <a href="https://www.malariagen.net/projects/ag1000g">Anopheles gambiae 1000 genomes project</a> to compare Cramer’s V and F<sub>ST</sub>. I calculated the F<sub>ST</sub> scores for each SNP using <a href="https://vcftools.github.io/">vcftools</a>. I calculated Cramer’s V using Asaph on data imported using both the counts and categories feature encoding schemes. I then plotted F<sub>ST</sub> vs Cramer’s V (counts) and F<sub>ST</sub> vs Cramer’s V (categories) to get a sense of the correlation between the two metrics.</p>
<p><img src="/images/cramers-v-vs-fst/bfm_vs_bfs_fst_vs_cramers_v_counts.png" alt="Fst vs Cramer's V (counts)" /></p>
<p><img src="/images/cramers-v-vs-fst/bfm_vs_bfs_fst_vs_cramers_v_categories.png" alt="Fst vs Cramer's V (categories)" /></p>
<p>The above figures give the scatter plots of F<sub>ST</sub> vs Cramer’s V with the counts and categories feature encodings, respectively. Cramer’s V calculated on the count-encoded features has an <script type="math/tex">r^2</script> value of 0.865 vs F<sub>ST</sub>, while Cramer’s V calculated on the category-encoded features has an <script type="math/tex">r^2</script> value of 0.818 vs F<sub>ST</sub>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Along with Random Forests and Logistic Regression Ensembles, Cramer’s V is another alternative to F<sub>ST</sub> for finding variants that best describe the genetic basis of differences between two populations. Cramer’s V correlates well with F<sub>ST</sub>, but a simple correlation analysis doesn’t tell us which metric is more appropriate for a given situation. Substantial work remains to validate the four methods and compare them.</p>
Tue, 04 Apr 2017 00:01:19 +0000
http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.html
http://rnowling.github.io/bioinformatics/2017/04/04/cramers-v.htmlstatisticsbioinformaticsClassifying Graphs with Shortest Paths<p>Graphs can be an easy and intuitive way of representing interactions between agents or state transitions in sociological, biological, and dynamical systems. The <a href="http://cse.nd.edu">Computer Science & Engineering department</a> happens to have a number of researchers in the <a href="http://icensa.nd.edu">Interdisciplinary Center for Network Science and Applications (iCeNSA)</a> working on <a href="https://en.wikipedia.org/wiki/Complex_network">complex networks</a>. Between some of my own research modeling protein-folding dynamics as Markov State Models and having a desk in the iCeNSA office space, I was exposed to some of this research.</p>
<p>One of the most natural applications of network science is analyzing clickstream data. In particular, we can represent users’ browsing sessions as graphs. In the simplest case, we can use vertices to represent the pages that users have visited and directed edges to represent that a user has navigated from one page to another. A more sophisticated model might use edge weights to record the number of times the user navigated from one page to another in a single session. In fact, if we normalize the outgoing edge weights of each vertex, we can derive a <a href="https://en.wikipedia.org/wiki/Markov_model">Markov model</a> of the dynamics of the browsing session.</p>
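The normalization step can be sketched in a few lines; the edge representation and page names here are hypothetical:

```python
from collections import defaultdict

def transition_probs(session_edges):
    """Derive a first-order Markov model from one browsing session.
    session_edges maps (src_page, dst_page) -> navigation count; normalizing
    each page's outgoing counts yields transition probabilities."""
    totals = defaultdict(float)
    for (src, _), count in session_edges.items():
        totals[src] += count
    return {(src, dst): count / totals[src]
            for (src, dst), count in session_edges.items()}

# Hypothetical session: home -> product twice, home -> cart once, product -> cart once.
edges = {("home", "product"): 2, ("home", "cart"): 1, ("product", "cart"): 1}
probs = transition_probs(edges)
print(probs[("home", "product")])  # 2/3: two of home's three outgoing transitions
```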
<p>My goal in modeling users’ browsing sessions as graphs is to segment users by their browsing behaviors. For example, I may want to train a machine learning model to discriminate between users who are likely to make a purchase (convert) and those who are just window shopping, using graphs generated from their browsing sessions. I don’t know the content of the web sites, and web sites can be structured differently. Thus, I won’t be able to match vertices between separate graphs easily. As such, I want to engineer features based purely on topological properties of the graphs that are invariant to permutations of vertices and to the numbers of edges and vertices in the graphs.</p>
<p>There are several ways to approach classifying graphs with machine learning models. One approach is simply to engineer a bunch of features from different statistics computed from the graphs. <a href="http://onlinelibrary.wiley.com/doi/10.1002/sam.11153/full">Li, et al.</a> describe a number of metrics including the <a href="https://en.wikipedia.org/wiki/Clustering_coefficient">average clustering coefficient</a> and the average path length (<a href="https://en.wikipedia.org/wiki/Closeness_centrality">closeness centrality</a>). However, be aware that some of their features (such as the numbers of edges and vertices) probably won’t be useful if you are comparing graphs of different sizes.</p>
<p>A second approach would be to use <a href="https://en.wikipedia.org/wiki/Graph_kernel">graph kernels</a>, functions for computing a similarity score between two graphs. A number of machine learning methods (called <a href="https://en.wikipedia.org/wiki/Kernel_method">kernel methods</a>) such as Support Vector Machines and Principal Component Analysis can be adapted to use inner products computed between pairs of data points using a kernel instead of feature vectors. Kernel methods are advantageous since they can be extended to data types that are difficult to represent with traditional feature vectors. Often-cited graph kernels include <a href="http://ieeexplore.ieee.org/abstract/document/1565664/">Shortest-Paths</a>, <a href="http://www.jmlr.org/proceedings/papers/v5/shervashidze09a/shervashidze09a.pdf">Graphlet</a>, and <a href="https://en.wikipedia.org/wiki/Graph_kernel">Random Walk</a> kernels. At <a href="http://nips.cc">NIPS 2016</a>, I saw a very nice presentation by <a href="https://www.cs.uchicago.edu/directory/risi-kondor">Risi Kondor</a> on the <a href="http://papers.nips.cc/paper/6135-learning-bound-for-parameter-transfer-learning.pdf">Multiscale Laplacian Graph Kernel</a>, which both allows incorporating features computed on vertices and edges and adapts well to multi-scale problems like protein structures.</p>
<p>For this blog post, I’m going to focus on evaluating the ability of machine learning models to discriminate between undirected graphs generated by the <a href="https://en.wikipedia.org/wiki/Erd%C5%91s%E2%80%93R%C3%A9nyi_model">Erdős–Rényi</a> and <a href="https://en.wikipedia.org/wiki/Stochastic_block_model">planted partition</a> random graph models. I’m using the experimental framework from a paper on the <a href="https://arxiv.org/pdf/1510.06492">generalized Shortest-Path graph kernel</a>. Instead of using graph kernels, I’m first going to focus on features engineered from the distribution of lengths of the shortest paths between all pairs of vertices. Here are examples of two such graphs:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/graphs.png" alt="graphs" /></p>
<h2 id="generating-the-graphs">Generating the Graphs</h2>
<p>I’m choosing to generate 100 graphs of each type. Each graph has 100 vertices. For the ER model, I’m using an edge probability of 0.2. Following the direction of the generalized shortest-path graph kernels paper, I set the parameters for the planted partition model to generate the same expected number of edges as the ER model, with a multiplier for <script type="math/tex">p_1</script> of 1.6.</p>
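A sketch of the graph generation using networkx; the way I match the expected edge counts (solving for the between-community probability given a within-community probability 1.6 times larger, over two communities of 50 vertices) is my reading of the paper's setup, not its exact procedure:

```python
import networkx as nx

N_VERTICES, EDGE_P, MULTIPLIER = 100, 0.2, 1.6

# Choose planted-partition probabilities so the expected number of edges
# matches the ER model's, with p_in = MULTIPLIER * p_out. Two communities
# of 50 vertices each are assumed here.
within = 2 * (50 * 49 // 2)                  # vertex pairs inside the communities
between = 50 * 50                            # vertex pairs across communities
total = N_VERTICES * (N_VERTICES - 1) // 2   # all vertex pairs
p_out = EDGE_P * total / (MULTIPLIER * within + between)
p_in = MULTIPLIER * p_out

er_graphs = [nx.erdos_renyi_graph(N_VERTICES, EDGE_P, seed=i) for i in range(100)]
pp_graphs = [nx.planted_partition_graph(2, 50, p_in, p_out, seed=i)
             for i in range(100)]
```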
<h2 id="analysis-of-shortest-path-distributions">Analysis of Shortest-Path Distributions</h2>
<p>The distribution of the average of the all-pairs shortest-path lengths for each graph is plotted below:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/avg_sp_hist.png" alt="average all-pairs shortest-path lengths" /></p>
<p>Note that the two distributions overlap substantially. Thus, simply using the average all-pairs shortest-path length for each graph won’t be able to effectively discriminate between graphs from the two classes.</p>
<p>I then decided to try generating a normalized histogram of the all-pairs shortest-path lengths for each graph. When I compared the distributions of the Euclidean distances of graphs generated by the same model to pairs from different models, I observed a separation in the distributions:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/sp_distr_dist_hist.png" alt="shortest-path length distribution distances" /></p>
<p>The difference in the distribution of distances for graphs from different models versus those generated by the same model was promising. I saw a similar result with cosine similarity.</p>
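A sketch of the histogram computation and the distance comparison described above, using networkx; the helper name, bin handling, and seeds are my own choices:

```python
import numpy as np
import networkx as nx

def sp_length_histogram(graph, n_bins):
    """L2-normalized histogram (bin size 1) of all-pairs shortest-path
    lengths; any path longer than n_bins is dropped for simplicity."""
    lengths = [d for _, dists in nx.all_pairs_shortest_path_length(graph)
               for d in dists.values() if d > 0]  # skip zero-length self paths
    hist, _ = np.histogram(lengths, bins=range(1, n_bins + 2))
    return hist / np.linalg.norm(hist)

# Two graphs from the same model should have nearby histograms.
g1 = nx.erdos_renyi_graph(100, 0.2, seed=1)
g2 = nx.erdos_renyi_graph(100, 0.2, seed=2)
dist = np.linalg.norm(sp_length_histogram(g1, 3) - sp_length_histogram(g2, 3))
```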
<h2 id="generating-features">Generating Features</h2>
<p>As I mentioned earlier, instead of using a kernel method, I focused on generating features that could be used with standard machine learning models. I focused on four modeling approaches:</p>
<ol>
<li>One feature vector for each graph. The features were the <script type="math/tex">L_2</script> normalized histogram of all-pairs shortest-path lengths. I used bin-sizes of 1 with enough bins to include the longest path found. (In my case, all of the graphs had the same maximum length of 3.) The ER graphs were labeled as 0, while the PP graphs were labeled as 1.</li>
<li>Represent each pair of graphs as a feature vector. I computed the difference in the normalized histograms of the all-pairs shortest-paths lengths. Since the differences aren’t symmetric, I computed the differences both ways and added two feature vectors for each pair. Feature vectors for pairs of graphs from the same model were labeled 0, while feature vectors for graphs from different models were labeled 1.</li>
<li>Same as #2, except that I took the absolute value of the differences and only added one feature vector per pair.</li>
<li>Like approach #3, I used pairs of graphs. I used a single feature – the Euclidean distance calculated between the normalized histograms from each pair.</li>
</ol>
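The pairwise feature types (#2 and #3 above) can be sketched as follows; the helper name and toy histograms are illustrative:

```python
import numpy as np

def pairwise_features(hists, model_ids):
    """Build feature types 2 and 3 from per-graph normalized histograms.
    Pairs from the same model are labeled 0; pairs from different models, 1."""
    t2_X, t2_y, t3_X, t3_y = [], [], [], []
    for i in range(len(hists)):
        for j in range(i + 1, len(hists)):
            label = 0 if model_ids[i] == model_ids[j] else 1
            diff = hists[i] - hists[j]
            t2_X.extend([diff, -diff])   # type 2: signed differences, both orderings
            t2_y.extend([label, label])
            t3_X.append(np.abs(diff))    # type 3: absolute differences, one per pair
            t3_y.append(label)
    return tuple(np.array(a) for a in (t2_X, t2_y, t3_X, t3_y))

# Toy histograms for three hypothetical graphs (models 0, 1, and 0).
hists = [np.array([0.8, 0.6]), np.array([0.6, 0.8]), np.array([0.79, 0.61])]
t2_X, t2_y, t3_X, t3_y = pairwise_features(hists, [0, 1, 0])
```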
<h2 id="experiments">Experiments</h2>
<p>I used <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier">Logistic Regression with Stochastic Gradient Descent</a> and <a href="http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forests</a> for classification. I used 1,000 epochs for LR and 100 trees for RF. I performed 10-fold stratified cross-validation and used accuracy and area under the ROC curve as metrics. Accuracy utilizes binary predictions, while the ROC AUC utilizes the predicted probabilities.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Features Type</th>
<th style="text-align: center">Classifier</th>
<th style="text-align: center">ROC AUC (std)</th>
<th style="text-align: center">Accuracy (%, std)</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">LR</td>
<td style="text-align: center"><strong>1.0 (0.0)</strong></td>
<td style="text-align: center"><strong>98.0 (3.3)</strong></td>
</tr>
<tr>
<td style="text-align: center">1</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>1.0 (0.0)</strong></td>
<td style="text-align: center"><strong>100.0 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.490 (0.006)</td>
<td style="text-align: center">66.7 (0.0)</td>
</tr>
<tr>
<td style="text-align: center">2</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>0.999 (0.0)</strong></td>
<td style="text-align: center"><strong>99.3 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.910 (0.007)</td>
<td style="text-align: center">71.1 (0.4)</td>
</tr>
<tr>
<td style="text-align: center">3</td>
<td style="text-align: center">RF</td>
<td style="text-align: center"><strong>0.999 (0.0)</strong></td>
<td style="text-align: center"><strong>99.2 (0.0)</strong></td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">LR</td>
<td style="text-align: center">0.914 (0.005)</td>
<td style="text-align: center">66.7 (0.0)</td>
</tr>
<tr>
<td style="text-align: center">4</td>
<td style="text-align: center">RF</td>
<td style="text-align: center">0.921 (0.002)</td>
<td style="text-align: center">84.4 (0.4)</td>
</tr>
</tbody>
</table>
<p>LR performed well with feature type 1, with an average ROC AUC of 1.0 and average accuracy of 98.0%, but performed poorly on all other feature types. RFs performed well with feature types 1-3, achieving a minimum average ROC AUC of 0.999 and average accuracy of 99.2%.</p>
<p>We see these results reflected in ROC curves generated for each classifier from one of its 10 folds:</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/lr_roc.png" alt="LR ROC" /></p>
<p>The LR classifier has an abysmal ROC curve for feature type 2. The ROC curves for feature types 3 and 4 are decent. The ROC curve for feature type 1 appears to be perfect. (Note that the ROC curves are based on sorting by predicted probabilities, while the accuracies used the binary labels.)</p>
<p><img src="/images/classifying-graphs-with-shortest-paths/rf_roc.png" alt="RF ROC" /></p>
<p>The RF model performs nearly perfectly on feature type 1, as indicated by the barely visible curve in the upper left of the plot. The ROC curves for feature types 2 and 3 are nearly perfect. The ROC curve for feature type 4 is acceptable, largely an indication of how robust RF classifiers are.</p>
<p>In terms of computational complexity, feature type 1 only requires creating a feature vector for each graph. Feature types 2-4 require computing a feature vector or distance for each pair of graphs.</p>
<p>Overall, it looks like feature type 1, where the feature vectors are the normalized histograms (bin size 1) of the all-pairs shortest-path lengths for each graph, can be used to very accurately discriminate between graphs generated by the two models, regardless of the classifier used.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, I looked at training machine learning models to discriminate between graphs generated by two different random graph models. I gave some background on feature engineering and kernel methods for graphs. I looked at four ways of representing the problem and how the corresponding features would be generated. I evaluated the four types of features using Logistic Regression and Random Forest classifiers. I observed that both LR and RF classifiers performed well when normalized histograms of the all-pairs shortest-paths lengths were used as features.</p>
<p>Going forward, I’d like to evaluate additional types of features such as the average clustering coefficient and the distributions of graphlets found in the graphs. Using the feature-vector-per-pair approach (feature type 3), I could incorporate graph kernels as features. However, I would probably need to find a different way to represent the absolute differences in the histograms so that it’s feasible to use Logistic Regression. One way to do so might be to discretize the differences for each bin in the histogram into a set of bins.</p>
<p>Once I have a reasonable set of useful features, I’d like to explore the effectiveness of this approach on small graphs since most user browsing sessions have far fewer than 100 nodes (e.g., 10).</p>
<p><em>The scripts used in the analyses are available in my <a href="https://github.com/rnowling/graph-experiments">graph-experiments</a> repo on GitHub under the Apache Public License v2.</em></p>
Sat, 04 Mar 2017 00:01:19 +0000
http://rnowling.github.io/machine/learning/2017/03/04/classifying-graphs-with-shortest-paths.html
http://rnowling.github.io/machine/learning/2017/03/04/classifying-graphs-with-shortest-paths.htmlmathmachinelearningVariable Selection with Logistic Regression Ensembles<p><em>(02/16/2017) Thanks to feedback on the <a href="https://www.reddit.com/r/bioinformatics/comments/5u8f7i/be_cautious_when_using_logistic_regression_for/">bioinformatics reddit</a>, it’s been brought to my attention that most GWAS studies employ Logistic Regression for single-SNP association tests using software such as <a href="https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html">SNPTEST</a>. This is different from the approach of incorporating all of the SNPs into a single Logistic Regression model as described below. <a href="http://www.nature.com/nrg/journal/v11/n7/abs/nrg2796.html">Marchini, et al.</a> and <a href="http://www.nature.com/nrg/journal/v7/n10/full/nrg1916.html">Balding</a> have written some excellent reviews of statistical practices in GWAS that discuss single-SNP association tests and other approaches. I’ve changed the title to reflect that the point about the effects of variance in the LR weights on variable selection is still valid.</em></p>
<p>Logistic regression models are commonly used to identify SNPs which are correlated with differences between phenotypes associated with population structures. Logistic Regression is particularly popular for genome-wide association studies (GWAS) of human diseases<sup><a href="#liu">1</a>,<a href="#stahl">2</a>,<a href="#hunter">3</a>,<a href="#shi">4</a>,<a href="#han">5</a>,<a href="#turnbull">6</a>,<a href="#chasman">7</a>,<a href="#kumar">8</a>,<a href="#cha">9</a>,<a href="#hu">10</a></sup>.</p>
<p>To apply LR to SNPs, samples are assigned to classes in accordance with their phenotypes, and their variants are encoded as a feature matrix. An LR model is then trained. The magnitudes of the weights from the LR model are used to rank the variants, with the top-ranked variants selected for further exploration.</p>
<p>Genomes often have on the order of millions of variants. With such large data sizes, LR models often need to be trained with approximate, stochastic methods such as Stochastic Gradient Descent (SGD). These methods introduce randomness into the weights and consequently the rankings. We decided to evaluate the consistency of the rankings.</p>
<h2 id="comparison-of-rankings-from-two-logistic-regression-models">Comparison of Rankings from Two Logistic Regression Models</h2>
<p>To demonstrate this effect, we trained a pair of LR models on variants from 149 <em>An. gambiae</em> and <em>An. coluzzii</em> mosquitoes in the <a href="https://www.malariagen.net/projects/ag1000g"><em>Anopheles</em> 1000 genomes</a> dataset. We encoded each variant as two features, each storing the number of occurrences of one allele. We used the magnitudes (absolute values) of the weights from the models to rank the variants. We compared the top-ranked SNPs from the two models using the Jaccard similarity and found that the models agreed on only 81% of the top 0.01% (466) of the ranked SNPs. The following table contains the similarity for different thresholds:</p>
<table>
<thead>
<tr>
<th style="text-align: center">Threshold (%)</th>
<th style="text-align: center">Number of SNPs</th>
<th style="text-align: center">Jaccard Similarity</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">0.01%</td>
<td style="text-align: center">466</td>
<td style="text-align: center">80.7%</td>
</tr>
<tr>
<td style="text-align: center">0.1%</td>
<td style="text-align: center">4,662</td>
<td style="text-align: center">83.8%</td>
</tr>
<tr>
<td style="text-align: center">1%</td>
<td style="text-align: center">46,620</td>
<td style="text-align: center">79.2%</td>
</tr>
<tr>
<td style="text-align: center">10%</td>
<td style="text-align: center">466,204</td>
<td style="text-align: center">76.6%</td>
</tr>
</tbody>
</table>
<p>This instability could have significant impacts on the reproducibility and correctness of these GWAS studies.</p>
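The ranking comparison above can be sketched as follows; the weight vectors here are small toy stand-ins for the coefficients of the two trained models:

```python
import numpy as np

def top_k_jaccard(weights_a, weights_b, k):
    """Jaccard similarity of the top-k variants ranked by weight magnitude
    under two independently trained models."""
    top_a = set(np.argsort(-np.abs(weights_a))[:k])
    top_b = set(np.argsort(-np.abs(weights_b))[:k])
    return len(top_a & top_b) / len(top_a | top_b)

# Toy weight vectors from two hypothetical training runs.
w1 = np.array([0.9, -0.8, 0.1, 0.05, 0.7, -0.02])
w2 = np.array([0.85, -0.1, 0.75, 0.04, 0.65, -0.01])
print(top_k_jaccard(w1, w2, 3))  # 0.5: the runs share 2 of 4 distinct top SNPs
```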
<h2 id="logistic-regression-ensembles">Logistic Regression Ensembles</h2>
<p>Leo Breiman realized that certain machine learning models (decision trees, linear regression, others) are unstable<sup><a href="#breiman">11</a></sup> and proposed bagging<sup><a href="#bagging">12</a></sup> as a solution. Breiman later used bagging in his Random Forests<sup><a href="#random-forests">13</a></sup> algorithm, where it became well known. Breiman’s focus was on classifier accuracy, however, and not necessarily on calculating variable importance scores or using weights for ranking.</p>
<p>We employ an ensemble approach to Logistic Regression models to stabilize the feature weights and achieve consistent rankings. We trained pairs of ensembles of Logistic Regression models. We normalized the weight vector from each model and then computed the average magnitude of the weights for each feature. We then used the averaged magnitudes to rank the SNPs. We repeated our analysis of the Jaccard similarities of the top 0.01%, 0.1%, 1%, and 10% of the ranked SNPs for ensembles with different numbers of models.</p>
<p><img src="/images/stable_rankings_lr_ensembles/snp_ranking_overlaps_sgd-l2.png" alt="Jaccard Similarities for Logistic Regression Ensembles" /></p>
<p>With ensembles of 250 models, we were able to achieve an agreement of 99% of the top 0.01% of SNPs.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Logistic Regression models trained with stochastic methods such as Stochastic Gradient Descent (SGD) do not necessarily produce the same weights from run to run. This does not generally affect classification accuracy, especially in cases with a large number of correlated variables. However, the variations in the weights do affect analyses such as ranking and variable selection. Researchers should be cautious when using Logistic Regression weights for ranking.</p>
<p>We demonstrated that an ensemble approach can be used to stabilize the weights and consequently the resulting variable rankings. Further validation work will be needed to determine if Logistic Regression ensembles are a suitable solution, but our results are promising.</p>
<p><em>The analyses presented here used the development branch of the software package <a href="https://github.com/rnowling/asaph">Asaph</a>.</em></p>
<h2 id="references">References</h2>
<p><a name="liu">1</a>: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3150510/">Liu, et al.</a></p>
<p><a name="stahl">2</a>: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243840/">Stahl, et al.</a></p>
<p><a name="hunter">3</a>: <a href="http://www.nature.com/ng/journal/v39/n7/full/ng2075.html">Hunter, et al.</a></p>
<p><a name="shi">4</a>: <a href="http://www.nature.com/ng/journal/v43/n12/abs/ng.978.html">Shi, et al.</a></p>
<p><a name="han">5</a>: <a href="http://www.nature.com/ng/journal/v41/n11/abs/ng.472.html">Han, et al.</a></p>
<p><a name="turnbull">6</a>: <a href="http://www.nature.com/ng/journal/v42/n6/abs/ng.586.html">Turnbull, et al.</a></p>
<p><a name="chasman">7</a>: <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3125402/">Chasman, et al.</a></p>
<p><a name="kumar">8</a>: <a href="http://www.nature.com/ng/journal/v43/n5/abs/ng.809.html">Kumar, et al.</a></p>
<p><a name="cha">9</a>: <a href="http://hmg.oxfordjournals.org/content/19/23/4735.short">Cha, et al.</a></p>
<p><a name="hu">10</a>: <a href="http://www.nature.com/ng/journal/v43/n8/abs/ng.875.html">Hu, et al.</a></p>
<p><a name="breiman">11</a>: <a href="http://projecteuclid.org/euclid.aos/1032181158">Leo Breiman (1994)</a></p>
<p><a name="bagging">12</a>: <a href="http://www.machine-learning.martinsewell.com/ensembles/bagging/Breiman1996.pdf">Leo Breiman (1996)</a></p>
<p><a name="random-forests">13</a>: <a href="http://link.springer.com/article/10.1023/A:1010933404324">Leo Breiman (2001)</a></p>
Tue, 14 Feb 2017 00:01:19 +0000
http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.html
http://rnowling.github.io/machine/learning/2017/02/14/lr-gwas.htmlbioinformaticsscientific computingmachinelearningTesting CLI Apps with Bats<p>I’ve been looking for a good way to test <a href="https://github.com/rnowling/asaph">Asaph</a>, the small machine-learning application I wrote for my Ph.D. thesis.</p>
<p>Most testing solutions I found didn’t quite fit what I wanted. The built-in Python <a href="https://docs.python.org/2/library/unittest.html"><code class="highlighter-rouge">unittest</code></a> framework is my usual go-to. It’s flexible, powerful, and easy to use. However, Asaph is heavily data-dependent, with relatively complex internal data structures, and its workflow involves lots of file I/O. Consequently, I found it cumbersome to write unit tests.</p>
<p>Most command-line testing solutions seem to be focused on testing the interfaces. Most examples I found focused on using <code class="highlighter-rouge">unittest</code> to test argument parsing with libraries such as <a href="https://docs.python.org/2.7/library/argparse.html"><code class="highlighter-rouge">argparse</code></a>. Other options include testing interactive CLI apps such as those that prompt the user or use something like <a href="https://en.wikipedia.org/wiki/Curses_%28programming_library%29"><code class="highlighter-rouge">curses</code></a>. Asaph isn’t really interactive, though.</p>
<p>Asaph’s commands form a workflow. The user first calls Asaph to convert data to its internal format. The user then uses Asaph to train a Logistic Regression model or Random Forests models with different numbers of trees. The user can then call Asaph to check convergence of the SNP rankings, deciding whether to train models with more trees or not. Lastly, the user can output the SNP rankings from one of the models. In each step, new files (initial data, models, plots, rankings) are added to the work directory or the contents of the work directory are queried.</p>
<p>What I really wanted was to test Asaph and its workflow holistically. I want to call Asaph and check that it executes successfully and produces the expected output on disk. Sometimes it may be enough to merely check that the output exists, while in other cases, I want to query the output to make sure what it contains is reasonable. By running Asaph’s workflow on test data, we can check the most common codepaths and ensure no syntax or type errors have been introduced.</p>
<p>In my search, I came across the <a href="https://github.com/sstephenson/bats">Bash Automated Testing System</a>, or Bats. Bats allows you to write tests in Bash. Bash scripts map well onto my use case: in the traditional Unix philosophy, commands are external programs. You then define assertions through standard Bash comparisons. Additionally, Bats supports simple setup and teardown functions and loading helper functions.</p>
<p>Bats is best demonstrated through the tests I created for the Asaph import script:</p>
<script src="https://gist.github.com/rnowling/74224fed33ac99137d373297d6694c34.js"></script>
<p>In the example, I use the following features:</p>
<ol>
<li>Defining an embedded helper function (<code class="highlighter-rouge">count_snps</code>)</li>
<li>Setup and teardown functions</li>
<li>Defining tests with the annotation <code class="highlighter-rouge">@test</code></li>
<li>Running commands and checking the return codes</li>
<li>Checking for the existence of output files and directories</li>
<li>Checking the contents of output files</li>
</ol>
<p>My experience is not unique to Asaph. Many of the scientific applications I’ve come across in my research are built around complex datasets and workflows of deeply-connected steps. It can be easier to use holistic tests for these applications. I haven’t quite come across anything like Bats before, but I think it can be a useful tool for computational scientists.</p>
Sat, 04 Feb 2017 00:01:19 +0000
http://rnowling.github.io/software/engineering/2017/02/04/testing-cli-apps-with-bats.html
http://rnowling.github.io/software/engineering/2017/02/04/testing-cli-apps-with-bats.htmltestingasaphscientific computingsoftwareengineeringRunning OpenMM in Docker on Debian<p><a href="http://openmm.org/">OpenMM</a> is an open-source and high-quality molecular dynamics library for GPUs and CPUs. I used OpenMM extensively in my first few years of graduate school. With a new project potentially on my plate, I decided to go back to OpenMM and refamiliarize myself with it. The first task was to get OpenMM running on my Debian system, preferably with OpenCL support for the built-in Intel GPU.</p>
<p>I decided to build OpenMM inside a Docker container, mostly so I can track dependencies more easily.</p>
<h2 id="installing-docker">Installing Docker</h2>
<p>Since I’m using Debian Stretch (testing), some of the packages are in flux. In particular, the <code class="highlighter-rouge">docker.io</code> package had been <a href="https://tracker.debian.org/news/804615">pulled from testing</a>.</p>
<p>So, I had to install Docker from the upstream repository using the <a href="https://docs.docker.com/engine/installation/linux/debian/">official instructions</a>. Unfortunately, the command for adding the repository failed:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ sudo add-apt-repository \
"deb https://apt.dockerproject.org/repo/ \
debian-$(lsb_release -cs) \
main testing"
Traceback (most recent call last):
File "/usr/bin/add-apt-repository", line 95, in <module>
sp = SoftwareProperties(options=options)
File "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", line 109, in __init__
self.reload_sourceslist()
File "/usr/lib/python3/dist-packages/softwareproperties/SoftwareProperties.py", line 599, in reload_sourceslist
self.distro.get_sources(self.sourceslist)
File "/usr/lib/python3/dist-packages/aptsources/distro.py", line 89, in get_sources
(self.id, self.codename))
aptsources.distro.NoDistroTemplateException: Error: could not find a distribution template for Debian/stretch
</code></pre>
</div>
<p>Thankfully, a simple workaround was available – just manually append the repo line to <code class="highlighter-rouge">/etc/apt/sources.list</code>:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ sudo su -c "echo deb https://apt.dockerproject.org/repo/ debian-stretch main testing >> /etc/apt/sources.list"
</code></pre>
</div>
<p>From there, I was able to follow the rest of the instructions to set up Docker for non-root users.</p>
<h2 id="building-openmm">Building OpenMM</h2>
<p>Once I had Docker up and running, I created a <code class="highlighter-rouge">Dockerfile</code> to build an image containing the OpenMM dependencies and source code.</p>
<p>You can clone the repository and build the image like so:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ git clone https://github.com/rnowling/openmm-docker.git
$ cd openmm-docker
$ docker build -t rnowling/openmm-docker .
</code></pre>
</div>
<p>The build takes quite a while since OpenMM is a large project. Once built, you can run the container to get a bash shell:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>$ docker run -i -t --device /dev/dri:/dev/dri rnowling/openmm-docker
root@874b1411bf01:/openmm-7.0.1#
</code></pre>
</div>
<p>Note the <code class="highlighter-rouge">--device /dev/dri:/dev/dri</code> flag. This exposes the GPU to the container, allowing usage of the GPU with OpenCL. Note that the OpenCL mixed and double precision tests will fail. If you run one of them manually, you’ll see why:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>root@874b1411bf01:/openmm-7.0.1# ./TestOpenCLHarmonicBondForce mixed
exception: This device does not support double precision
</code></pre>
</div>
<p>I used the open-source <a href="https://freedesktop.org/wiki/Software/Beignet/">beignet</a> OpenCL driver for Intel GPUs. Unfortunately, beignet only supports single-precision floating-point operations at the moment. In the future, I may look into using the proprietary Intel OpenCL drivers. Generally speaking, many of the force calculations can be done in single precision, but double precision is <a href="https://github.com/pandegroup/openmm/issues/1616#issuecomment-252302777">needed in some of the integration calculations</a> to minimize errors.</p>
<p>Finally, we can test the image with the included Python wrapper:</p>
<div class="highlighter-rouge"><pre class="highlight"><code>root@874b1411bf01:/openmm-7.0.1# python -m simtk.testInstallation
There are 3 Platforms available:
1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 OpenCL - Successfully computed forces
Median difference in forces between platforms:
Reference vs. CPU: 1.99996e-06
Reference vs. OpenCL: 0.055165
CPU vs. OpenCL: 0.0551638
</code></pre>
</div>
<p>Voilà! OpenMM running in a Docker container.</p>
Sat, 28 Jan 2017 00:01:19 +0000
http://rnowling.github.io/molecular/dynamics/2017/01/28/openmm-docker-debian.html
http://rnowling.github.io/molecular/dynamics/2017/01/28/openmm-docker-debian.htmlmoleculardynamicsSymplectic Integrators Bound Energy Error<p>In my previous blog posts, I analyzed the position and velocity error of the harmonic oscillator simulated with the Leapfrog integrator. I <a href="/math/2016/11/19/leapfrog-global-error.html">proved that the Leapfrog integrator is a second-order method</a>. Using a simulation, I validated that the error of the positions and velocities between the numerically-integrated and analytical models grows linearly with the trajectory length and quadratically with the timestep.</p>
<p>What about the error of the total energy between the numerically-integrated and analytical models? The total energy <script type="math/tex">E(t)</script> is calculated from the positions <script type="math/tex">x(t)</script> and velocities <script type="math/tex">v(t)</script> by</p>
<script type="math/tex; mode=display">E(t) = \frac{1}{2}m v^2(t) + \frac{1}{2} m \omega^2 x^2(t)</script>
<p>As the errors in the positions and velocities grow linearly with time, we can write the numerical positions and velocities as perturbations of the true positions and velocities:</p>
<script type="math/tex; mode=display">\tilde{x}(t) = x(t) + \mathcal{O}(t) \\
\tilde{v}(t) = v(t) + \mathcal{O}(t)</script>
<p>We can then substitute the perturbed positions and velocities into <script type="math/tex">E(t)</script> to get <script type="math/tex">\tilde{E}(t)</script> and solve:</p>
<script type="math/tex; mode=display">\tilde{E}(t) = \frac{1}{2}m \tilde{v}(t)^2 + \frac{1}{2} m \omega^2 \tilde{x}(t)^2 \\
\tilde{E}(t) = \frac{1}{2}m (v(t) + \mathcal{O}(t))^2 + \frac{1}{2} m \omega^2 (x(t) + \mathcal{O}(t))^2 \\
\tilde{E}(t) = \frac{1}{2}m (v(t)^2 + \mathcal{O}(v(t)t) + \mathcal{O}(t^2)) + \frac{1}{2} m \omega^2 (x(t)^2 + \mathcal{O}(x(t)t) + \mathcal{O}(t^2)) \\
\tilde{E}(t) = \frac{1}{2}m v^2(t) + \frac{1}{2} m \omega^2 x^2(t) + \mathcal{O}(v(t)t) + \mathcal{O}(t^2) + \mathcal{O}(x(t)t) + \mathcal{O}(t^2) \\
\tilde{E}(t) = \frac{1}{2}m v^2(t) + \frac{1}{2} m \omega^2 x^2(t) + \mathcal{O}(t^2)</script>
<p>For the harmonic oscillator, the positions and velocities are bounded by constants. Thus, we end up with linear and quadratic error terms depending on <script type="math/tex">t</script>. Since the quadratic error term is the largest term in the large <script type="math/tex">t</script> limit, we can expect the error in the energies to be bounded by quadratic growth with respect to the length of the trajectories.</p>
<p>Let’s do a simulation to validate our result. I simulated the harmonic oscillator using the Leapfrog integrator with a timestep of 0.01 s to generate trajectories ranging from 1 s to 1000 s. I sampled the total energy at each time step and plotted the average and standard deviations of the energies (black) for each trajectory length. I also included the analytical energy (magenta) as a reference.</p>
<p><img src="/images/symplectic_bounded_error/vv_duration_energies.png" alt="Total Energy" /></p>
<p>But wait! The energy doesn’t grow over time! In fact, the error in the energy doesn’t seem to change in time once the trajectories are long enough. What’s going on?</p>
<p>Our analysis above provided an upper bound on the energy error over time – the error in the energy could always be less. Specifically, we didn’t take into account the stricter property of symplecticness.</p>
<p>In my <a href="/math/2016/12/14/leapfrog-symplectic-harmonic-oscillator.html">last blog post</a>, I proved that the Leapfrog integrator is symplectic. <a href="http://www.cds.caltech.edu/~marsden/bib/1988/04-GeMa1988/GeMa1988.pdf">Ge and Marsden</a> proved that if a symplectic integrator exactly conserves the total energy (Hamiltonian) of a system, then it is computing the exact trajectory for that system. They go on to suggest that for symplectic integrators, the error in the energy is a good proxy for evaluating the error in the trajectory.</p>
<p>Specifically, symplectic integrators seem to bound the error in the energy so that it doesn’t grow over time. Physicists would say that the error exhibits no secular drift. This is a useful property when studying physical systems.</p>
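<p>The boundedness is easy to reproduce with a small numpy script. This is a minimal sketch of the experiment (with <script type="math/tex">m = \omega = 1</script> and a shorter trajectory), not the exact code used to produce the plot above.</p>

```python
import numpy as np

def leapfrog_energies(x0=1.0, v0=0.0, m=1.0, omega=1.0, dt=0.01, n_steps=10000):
    """Simulate the harmonic oscillator with the Leapfrog (velocity Verlet)
    integrator and record the total energy at every step."""
    x, v = x0, v0
    a = -omega**2 * x  # acceleration F/m, with F = -m * omega^2 * x
    energies = np.empty(n_steps)
    for i in range(n_steps):
        x = x + v * dt + 0.5 * a * dt**2
        a_new = -omega**2 * x
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        energies[i] = 0.5 * m * v**2 + 0.5 * m * omega**2 * x**2
    return energies

energies = leapfrog_energies()
analytical_energy = 0.5  # E = (1/2) m omega^2 x0^2
max_error = np.abs(energies - analytical_energy).max()
```

<p>The deviation from the analytical energy stays tiny over the whole trajectory and does not grow with its length, consistent with the plot.</p>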
Wed, 11 Jan 2017 12:13:19 +0000
http://rnowling.github.io/math/2017/01/11/symplectic-integrators-bound-energy-error.html
http://rnowling.github.io/math/2017/01/11/symplectic-integrators-bound-energy-error.htmlmathmathLeapfrog is Symplectic for the Harmonic Oscillator<p>Microcanonical molecular dynamics describes the motion of molecules using the <a href="https://en.wikipedia.org/wiki/Hamiltonian_mechanics">Hamiltonian mechanics</a> framework. Hamiltonian dynamics are <a href="https://en.wikipedia.org/wiki/Symplectomorphism">symplectic</a>, meaning that they preserve volume in phase space. The symplectic property relates to properties we learned in first-semester college physics such as conservation of energy.</p>
<p>The <a href="/math/2016/11/07/harmonic-oscillator.html">harmonic oscillator</a> is a simple symplectic model, useful for study. A plot of the path through phase space of our analytical derivation of the harmonic oscillator demonstrates the symplectic property:</p>
<p><img src="/images/harmonic_oscillator/analytical_phase.png" alt="Harmonic Oscillator Phase Diagram" /></p>
<p>Symplectic integrators are important for bounding errors in the trajectories and resulting statistics such as transition rates<sup><a href="#reviews">1</a></sup>. <a href="http://www.cds.caltech.edu/~marsden/bib/1988/04-GeMa1988/GeMa1988.pdf">Ge and Marsden</a> showed that if an integrator is symplectic, then the integrator can only conserve energy exactly if it computes the exact trajectory except for a reparameterization in time. Since the error in the energy is bounded, the error of statistics calculated from trajectories is bounded as well.</p>
<p>The bound would seem to contradict our results from the analyses of the <a href="/math/2016/11/13/leapfrog-local-error.html">local</a> and <a href="/math/2016/11/19/leapfrog-global-error.html">global truncation</a> errors. These analyses indicate that the errors in energy and other statistics computed from the trajectories would grow without bound. It turns out that symplectic integrators <strong>exactly</strong> simulate <em>shadow Hamiltonians</em>, which are perturbations of the original Hamiltonians. Thus, we can use the energy and other statistics from the shadow Hamiltonian as approximations to the values for the true Hamiltonian.</p>
<p>Additionally, the relationship between the bounds on the errors in the energy and the trajectories implies that the error in energy can be used as a measure of error for the trajectories. For example, unbounded increases or decreases in the energy from a simulation are indicative of an incorrect implementation of a symplectic integrator.</p>
<p>Unfortunately, the mathematical definition of the symplectic property and its relation to properties like the conservation of energy are expressed using advanced areas of math such as <a href="https://en.wikipedia.org/wiki/Differential_geometry">differential geometry</a>. Fortunately, it is much easier to prove that an integrator is symplectic than it is to state the definition of symplecticness.</p>
<h2 id="hamiltonian-dynamics">Hamiltonian Dynamics</h2>
<p>We’re going to start with a detour. Symplectiness is described using the language of Hamiltonians and flows, so we’ll describe some basics and show how Hamiltonians relate to Newton’s equations of motion.</p>
<p>The Hamiltonian is a function that takes the positions <script type="math/tex">q</script> and momenta <script type="math/tex">p</script>. The form of the Hamiltonian commonly used in molecular dynamics is a linear combination of the kinetic and potential energies:</p>
<script type="math/tex; mode=display">H(q, p) = \frac{1}{2}p^T M^{-1} p + U(q)</script>
<p>We can describe the dynamics of the Hamiltonian system using a pair of first-order differential equations:</p>
<script type="math/tex; mode=display">\frac{dq}{dt} = \frac{\partial H}{\partial p} = M^{-1} p \\
\frac{dp}{dt} = -\frac{\partial H}{\partial q} = -\nabla U(q)</script>
<p>The system of two first-order ODEs can be rewritten as a single second-order ODE:</p>
<script type="math/tex; mode=display">\frac{d^2 q}{dt^2} = \frac{d}{dt} \frac{dq}{dt} \\
= \frac{d}{dt} M^{-1} p \\
= M^{-1} \frac{dp}{dt} \\
= M^{-1} (-\nabla U(q)) \\
= - M^{-1} \nabla U(q)</script>
<p>By re-arranging the mass term, we get the form of Newton’s equations of motions we expect:</p>
<script type="math/tex; mode=display">M \frac{d^2 q}{dt^2} = - \nabla U(q)</script>
<p>Thus, the Hamiltonian system is an equivalent description to Newton’s equations of motion.</p>
<h2 id="symplectic-maps--dynamical-systems">Symplectic Maps / Dynamical Systems</h2>
<p>We can define a map <script type="math/tex">\phi</script> that updates the state of the system over a length of time <script type="math/tex">\Delta t</script>:</p>
<script type="math/tex; mode=display">(q_{i+1}, p_{i+1}) = \phi (q_i, p_i)</script>
<p>Let <script type="math/tex">\phi'</script> be the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian matrix</a> of <script type="math/tex">\phi</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\phi' = \begin{pmatrix}
\frac{\partial \phi}{\partial q} & \frac{\partial \phi}{\partial p}
\end{pmatrix} %]]></script>
<p>The map <script type="math/tex">\phi</script> is symplectic if its Jacobian <script type="math/tex">\phi'</script> satisfies:</p>
<script type="math/tex; mode=display">\phi'^T J \phi' = J</script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
J = \begin{pmatrix}
0 & 1 \\
-1 & 0
\end{pmatrix} %]]></script>
<h2 id="harmonic-oscillator-is-symplectic">Harmonic Oscillator is Symplectic</h2>
<p>We now have the basic tools for describing Hamiltonian systems and proving symplecticness. Let’s apply these tools. As a Hamiltonian system, the map for the harmonic oscillator is symplectic. To demonstrate the proof of symplecticness for a map, we will start by validating that the analytical map for the harmonic oscillator is symplectic. The Hamiltonian is defined as follows:</p>
<script type="math/tex; mode=display">H(q, p) = \frac{1}{2}p^T M^{-1} p + \frac{1}{2} M \omega^2 q^2</script>
<p>The corresponding map for the system (with <script type="math/tex">M = \omega = 1</script> for simplicity) is given by</p>
<script type="math/tex; mode=display">\phi(q, p) = \begin{pmatrix}
\cos(\Delta t) q + \sin(\Delta t)p \\
-\sin(\Delta t) q + \cos(\Delta t)p
\end{pmatrix}</script>
<p>with Jacobian</p>
<script type="math/tex; mode=display">% <![CDATA[
\phi'(q, p) = \begin{pmatrix}
\frac{\partial \phi}{\partial q} & \frac{\partial \phi}{\partial p}
\end{pmatrix} \\
= \begin{pmatrix}
\cos (\Delta t) & \sin (\Delta t) \\
-\sin (\Delta t) & \cos (\Delta t)
\end{pmatrix} %]]></script>
<p>With a bit of arithmetic, we see that the map satisfies the criteria for symplecticness:</p>
<script type="math/tex; mode=display">% <![CDATA[
\phi'^T J \phi' =
\begin{pmatrix}
\cos (\Delta t) & -\sin (\Delta t) \\
\sin (\Delta t) & \cos (\Delta t)
\end{pmatrix}
\begin{pmatrix}
0 & 1 \\
-1 & 0
\end{pmatrix}
\begin{pmatrix}
\cos (\Delta t) & \sin (\Delta t) \\
-\sin (\Delta t) & \cos (\Delta t)
\end{pmatrix} \\
= \begin{pmatrix}
\sin (\Delta t) & \cos (\Delta t)\\
-\cos (\Delta t) & \sin (\Delta t)
\end{pmatrix}
\begin{pmatrix}
\cos (\Delta t) & \sin (\Delta t) \\
-\sin (\Delta t) & \cos (\Delta t)
\end{pmatrix} \\
= \begin{pmatrix}
\cos (\Delta t)\sin (\Delta t) - \cos (\Delta t)\sin (\Delta t) & \sin^2 (\Delta t) + \cos^2 (\Delta t) \\
-[\cos^2 (\Delta t) + \sin^2 (\Delta t)] & - \sin (\Delta t)\cos (\Delta t) + \cos (\Delta t)\sin (\Delta t)
\end{pmatrix} \\
= \begin{pmatrix}
0 & 1 \\
-1 & 0
\end{pmatrix} \\
= J %]]></script>
<p>Thus, the analytical harmonic oscillator map satisfies the conditions for symplecticness as expected.</p>
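<p>We can spot-check this arithmetic numerically. The snippet below is a quick sanity check with numpy (again taking <script type="math/tex">M = \omega = 1</script>), not part of the derivation:</p>

```python
import numpy as np

J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def analytical_jacobian(dt):
    """Jacobian of the analytical harmonic oscillator map: a rotation."""
    return np.array([[np.cos(dt), np.sin(dt)],
                     [-np.sin(dt), np.cos(dt)]])

# the symplectic condition phi'^T J phi' = J holds for any timestep
for dt in (0.001, 0.1, 1.0, 10.0):
    P = analytical_jacobian(dt)
    assert np.allclose(P.T @ J @ P, J)
```
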
<h2 id="leapfrog-for-harmonic-oscillator">Leapfrog for Harmonic Oscillator</h2>
<p>Next, we will prove that the Leapfrog method is symplectic for the harmonic oscillator system.</p>
<p>In our <a href="/math/2016/11/11/deriving-leapfrog.html">previous blog post</a>, we derived the Leapfrog integrator. We reproduce it here, with the substitutions <script type="math/tex">x = q</script> and <script type="math/tex">v = M^{-1} p</script> to be consistent with the notation used in this blog post. We’ll re-arrange the integrator into two equations, one for <script type="math/tex">q(t + \Delta t)</script> and another for <script type="math/tex">p(t + \Delta t)</script>, so that we can form the flow equation <script type="math/tex">\phi(q, p)</script>. We will then use the Jacobian <script type="math/tex">\phi'</script> to prove that the Leapfrog integrator is symplectic.</p>
<script type="math/tex; mode=display">F(t) = -\nabla U(q(t)) \\
q(t + \Delta t) = q(t) + M^{-1}p(t)\Delta t + \frac{1}{2}M^{-1}F(t)\Delta t^2 \\
F(t + \Delta t) = -\nabla U(q(t + \Delta t)) \\
p(t + \Delta t) = p(t) + \frac{1}{2} [F(t) + F(t + \Delta t)] \Delta t \\</script>
<p>We substitute the potential</p>
<script type="math/tex; mode=display">U(q) = \frac{1}{2} M \omega^2 q^2</script>
<p>into <script type="math/tex">F(t)</script> and <script type="math/tex">F(t + \Delta t)</script> of the integrator equations:</p>
<script type="math/tex; mode=display">F(t) = -M \omega^2 q(t) \\
q(t + \Delta t) = q(t) + M^{-1}p(t)\Delta t + \frac{1}{2}M^{-1} F(t)\Delta t^2 \\
F(t + \Delta t) = -M \omega^2 (q(t + \Delta t)) \\
p(t + \Delta t) = p(t) + \frac{1}{2} [F(t) + F(t + \Delta t)] \Delta t \\</script>
<p>We substitute <script type="math/tex">F(t)</script> and <script type="math/tex">F(t + \Delta t)</script> into <script type="math/tex">q(t + \Delta t)</script> and <script type="math/tex">p(t + \Delta t)</script>, reducing our system to two equations:</p>
<script type="math/tex; mode=display">q(t + \Delta t) = q(t) + M^{-1}p(t)\Delta t - \frac{1}{2}\omega^2 q(t) \Delta t^2 \\
p(t + \Delta t) = p(t) + \frac{1}{2} [-M \omega^2 q(t) - M \omega^2 q(t + \Delta t)] \Delta t</script>
<p>We substitute <script type="math/tex">q(t + \Delta t)</script> into <script type="math/tex">p(t + \Delta t)</script> to get <script type="math/tex">q(t + \Delta t)</script> and <script type="math/tex">p(t + \Delta t)</script> in terms of <script type="math/tex">q(t)</script> and <script type="math/tex">p(t)</script> only:</p>
<script type="math/tex; mode=display">q(t + \Delta t) = q(t) + M^{-1}p(t)\Delta t - \frac{1}{2}\omega^2 q(t) \Delta t^2 \\
p(t + \Delta t) = p(t) + \frac{1}{2} [-M \omega^2 q(t) - M \omega^2 [q(t) + M^{-1}p(t)\Delta t - \frac{1}{2}\omega^2 q(t) \Delta t^2]] \Delta t</script>
<p>Thus, we can form the flow <script type="math/tex">\tilde{\phi}(q, p)</script> as</p>
<script type="math/tex; mode=display">\tilde{\phi}(q, p) = \begin{pmatrix}
(1 - \frac{1}{2} \omega^2 \Delta t^2) q(t) + M^{-1}\Delta t \, p(t) \\
(- M \omega^2 \Delta t + \frac{1}{4} M \omega^4 \Delta t^3) q(t) + (1 - \frac{1}{2} \omega^2 \Delta t^2) p(t)
\end{pmatrix}</script>
<p>with Jacobian</p>
<script type="math/tex; mode=display">% <![CDATA[
\tilde{\phi}' =
\begin{pmatrix}
a & b \\
c & d
\end{pmatrix} %]]></script>
<p>where</p>
<script type="math/tex; mode=display">a = d = 1 - \frac{1}{2} \omega^2 \Delta t^2 \\
b = M^{-1} \Delta t \\
c = - M \omega^2 \Delta t + \frac{1}{4} M \omega^4 \Delta t^3</script>
<p>Note that we denote the map as <script type="math/tex">\tilde{\phi}(q, p)</script> since it is an <strong>approximation</strong> to the true flow <script type="math/tex">\phi(q, p)</script>. We then substitute the Jacobian <script type="math/tex">\tilde{\phi}'(q, p)</script> into the equation for the conditions of symplecticness and solve:</p>
<script type="math/tex; mode=display">% <![CDATA[
\tilde{\phi}'^T J \tilde{\phi}' =
\begin{pmatrix}
a & c \\
b & d
\end{pmatrix}
\begin{pmatrix}
0 & 1 \\
-1 & 0
\end{pmatrix}
\begin{pmatrix}
a & b \\
c & d
\end{pmatrix} \\
= \begin{pmatrix}
-c & a \\
-d & b
\end{pmatrix}
\begin{pmatrix}
a & b \\
c & d
\end{pmatrix} \\
= \begin{pmatrix}
-ca + ac & -cb + ad \\
-da + bc & -db + bd
\end{pmatrix} \\
= \begin{pmatrix}
0 & -cb + ad \\
-da + bc & 0
\end{pmatrix} \\
= \begin{pmatrix}
0 & 1 \\
-1 & 0
\end{pmatrix} \\
= J %]]></script>
<p>where</p>
<script type="math/tex; mode=display">-da + bc = -(1 - \frac{1}{2} \omega^2 \Delta t^2)(1 - \frac{1}{2} \omega^2 \Delta t^2) + M^{-1}\Delta t( - M \omega^2 \Delta t + \frac{1}{4} M \omega^4 \Delta t^3) \\
= -1 + \frac{1}{2} \omega^2 \Delta t^2 + \frac{1}{2} \omega^2 \Delta t^2 - \frac{1}{4} \omega^4 \Delta t^4 - \omega^2 \Delta t^2 + \frac{1}{4} \omega^4 \Delta t^4 \\
= -1</script>
<p>and</p>
<script type="math/tex; mode=display">-cb + ad = -( - M \omega^2 \Delta t + \frac{1}{4} M \omega^4 \Delta t^3) M^{-1}\Delta t + (1 - \frac{1}{2} \omega^2 \Delta t^2)(1 - \frac{1}{2} \omega^2 \Delta t^2) \\
= \omega^2 \Delta t^2 - \frac{1}{4} \omega^4 \Delta t^4 + 1 - \frac{1}{2} \omega^2 \Delta t^2 - \frac{1}{2} \omega^2 \Delta t^2 + \frac{1}{4} \omega^4 \Delta t^4 \\
= 1</script>
<p>Thus, the Leapfrog method is symplectic for the harmonic oscillator system.</p>
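<p>As with the analytical map, we can spot-check the Leapfrog result numerically. The snippet below builds the Jacobian from the <script type="math/tex">a</script>, <script type="math/tex">b</script>, <script type="math/tex">c</script>, and <script type="math/tex">d</script> entries derived above and verifies the symplectic condition; it is a quick sanity check rather than a proof:</p>

```python
import numpy as np

J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def leapfrog_jacobian(dt, m=1.0, omega=1.0):
    """Jacobian of the Leapfrog map for the harmonic oscillator."""
    a = d = 1.0 - 0.5 * omega**2 * dt**2
    b = dt / m
    c = -m * omega**2 * dt + 0.25 * m * omega**4 * dt**3
    return np.array([[a, b],
                     [c, d]])

# the symplectic condition holds for any timestep, mass, and frequency
for dt in (0.001, 0.01, 0.5, 2.0):
    P = leapfrog_jacobian(dt, m=2.0, omega=3.0)
    assert np.allclose(P.T @ J @ P, J)
```
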
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, we covered the basics of symplectic Hamiltonians and maps. We described concepts and notation from Hamiltonian dynamics and showed how they relate to the second-order differential equations used for Newton’s equations of motion. We then discussed the conditions for proving a map is symplectic.</p>
<p>We then applied the framework to two problems. First, we demonstrated the approach by proving that the harmonic oscillator is symplectic. Next, we showed that the Leapfrog integrator, which is an <strong>approximate</strong> map, is symplectic for the harmonic oscillator.</p>
<p>We covered a lot of material, but we have even more to cover in the future. First and foremost, we want to prove that the Leapfrog method is symplectic for all Hamiltonians, or at least those of the form we use in molecular dynamics. We also want to dive into differential geometry and better understand the definition for symplecticness and the relationship with the conditions we expressed above.</p>
<p>We also want to better understand the implications of symplecticness for simulation. In particular, we mentioned that symplecticness guarantees a bound on the error in energies and other statistics computed from the resulting trajectories. We want to better understand this relationship and examine the proofs around the relationships.</p>
<p><a name="reviews">1</a>: I’ve used papers by <a href="http://scitation.aip.org/content/aapt/journal/ajp/73/10/10.1119/1.2034523">Donnelly and Rogers</a>, <a href="http://link.springer.com/chapter/10.1007/978-1-4612-4066-2_10">Leimkuhler, Reich, and Skeel</a>, <a href="https://doi.org/10.1017/S0962492900002282">Sanz-Serna</a>, <a href="http://bionum.cs.purdue.edu/Skee98b.pdf">Skeel</a>, and <a href="http://link.springer.com/chapter/10.1007/978-94-011-2030-2_3">Yoshida</a> as guides for this blog post.</p>
Wed, 14 Dec 2016 12:13:19 +0000
http://rnowling.github.io/math/2016/12/14/leapfrog-symplectic-harmonic-oscillator.html
http://rnowling.github.io/math/2016/12/14/leapfrog-symplectic-harmonic-oscillator.htmlmathmath