Machine learning systems are incredibly data-hungry. Data sets for production systems can easily exceed what can be processed on a single machine. Designing and implementing production systems that incorporate machine learning requires knowledge of data storage systems, scalable data processing software, machine learning, MLOps, microservices, cloud computing, and algorithms and design patterns for batch and streaming analytics. I developed a foundation of knowledge spanning low-level system implementation through the deployment and operation of systems with stringent reliability and performance service level agreements (SLAs). That foundation comes from my experience with high-performance computing (HPC) during my Ph.D., full-time positions as a software and data science engineer from June 2014 to August 2018, and recent consulting work as a developer advocate. I’m developing curricula and realistic implementations of example systems for related courses in MSOE’s new graduate programs in machine learning. These course materials and example system implementations are available under open-source licenses through the MSOE Data-Intensive Systems Education (DISE) project.

The MSOE DISE GitHub organization contains repositories of class materials and example software.

Publications

Course Materials

ML Production Systems

Students will design, implement, deploy, and operate a machine learning-powered service, including components for data processing, model training, model serving, model evaluation, and monitoring. Technologies and design patterns for streaming and batch data processing, as well as storage systems, will be introduced. This course builds on and integrates previous coursework in offline machine learning and microservices.

GitHub Repo: https://github.com/msoe-dise-project/ml-prod-sys-course

Distributed Storage Systems

In some applications, data storage and processing needs have vastly exceeded what can be accomplished using a single computer. A number of databases and file systems that use distributed computing techniques to provide enhanced scalability and reliability have become available and been widely adopted. This course will cover software architectures, algorithms, and practical implications of approaches for scaling storage systems to large data sizes and high read/write throughputs, providing elasticity in the face of changing loads, and maintaining reliability in the face of failures. Relevant papers will be reviewed alongside case studies of industry and open-source implementations. Students will complete a term-long project to implement a functional distributed storage system.

GitHub Repo: https://github.com/msoe-dise-project/distributed-database-internals-course