Rails, Meet Data Science

Organizations today have more data than ever. Predictive modeling is a powerful way to use this data to solve problems and create better experiences for customers. For instance, you can do a better job keeping items in stock by predicting demand, or lower costs by predicting fraud. If you use Ruby on Rails, it can be tough to know how to incorporate predictive models into your app.

We’ll go over four patterns you can use for prediction with Rails. We used all four successfully during my time at Instacart. They can work when you have no data scientists (as was the case when I started) as well as when you have a strong data science team.

Patterns

With predictive modeling, you first train a model and then use it to predict. The patterns can be grouped by the language used for each task:

Pattern  Train             Predict
1        3rd Party         3rd Party
2        Ruby              Ruby
3        Another Language  Ruby
4        Another Language  Another Language

Two popular languages for data science are Python and R.

You can decide which pattern to use for each model you build. We’ll walk through the approaches and discuss the pros and cons of each.

Pattern 1: Use a 3rd Party

Before building a model in-house, it’s good to see what already exists. There are a number of external services you can use for specific problems. Here are a few:

Pros

Cons

Pattern 2: Train and Predict in Ruby

Ruby has a number of libraries for building simple models. Simple models can perform very well since a large part of model building is feature engineering. This is a great option if there are no data scientists in your company or on your team. A developer can own the model end-to-end, which is great for speed and iteration.

Here are a few libraries for building models in Ruby:

Once a model is trained, you’ll need to store it. You can use methods provided by the library, or marshal if none exist. You can store the models as files or in the database.
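As a rough illustration of the marshal route, here’s a minimal sketch. `SimpleLinearModel`, the toy data, and the file name are all made up for this example; a real model would come from one of the libraries, which may provide its own save and load methods.

```ruby
require "tmpdir"

# Hypothetical minimal model: simple linear regression fit by least squares.
class SimpleLinearModel
  attr_reader :slope, :intercept

  def fit(xs, ys)
    n = xs.size.to_f
    mean_x = xs.sum / n
    mean_y = ys.sum / n
    covariance = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
    variance = xs.sum { |x| (x - mean_x)**2 }
    @slope = covariance / variance
    @intercept = mean_y - @slope * mean_x
    self
  end

  def predict(x)
    @intercept + @slope * x
  end
end

# Train on toy data that follows y = 2x + 1
model = SimpleLinearModel.new.fit([1, 2, 3, 4], [3, 5, 7, 9])

# Store the model as a file (the path is an assumption for this sketch)
path = File.join(Dir.tmpdir, "demand_model.bin")
File.binwrite(path, Marshal.dump(model))

# Later, for example in the Rails app: load it and predict.
# The SimpleLinearModel class must be defined before loading.
loaded = Marshal.load(File.binread(path))
puts loaded.predict(5)
```

The same bytes could go in a binary database column instead of a file. Only call `Marshal.load` on data you trust, since loading untrusted marshaled data is unsafe.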

Be sure to commit the code used to train models so you can update them with newer data in the future. The Rails console is a decent place to create them, or use a Jupyter notebook running IRuby for better visualizations (see setup instructions for Rails).

Pros

Cons

Pattern 3: Train in Another Language, Predict in Ruby

Ruby is getting better for data science thanks to SciRuby. However, languages like R and Python currently have much better tools. Also, many people who have experience building models don’t know Ruby.

Luckily, you can build models in another language and predict in Ruby. This way, you can use more advanced tools for visualization, validation, and tuning without adding complexity to your production stack. If you don’t have data scientists, you can use this pattern to contract with one.

Here are models that can currently predict in Ruby:

For this to work, models need to be stored in a shared format that both languages understand. PMML and PFA are two interchange formats. PFA is newer but has less adoption than PMML. Andrey Melentyev has a great post on the topic.
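To give a feel for what an interchange format looks like, here’s a minimal sketch that reads a toy PMML linear regression with REXML and predicts from it. The document, field names, and coefficients are invented for illustration, and the parsing makes simplifying assumptions: no XML namespace (real exports declare one) and every predictor has an exponent of 1. In practice you’d likely use a library that handles the format for you.

```ruby
require "rexml/document"

# Toy PMML document (hypothetical fields and coefficients).
# Real exports include an xmlns attribute and more metadata.
PMML_XML = <<~XML
  <PMML version="4.3">
    <DataDictionary numberOfFields="3">
      <DataField name="price" optype="continuous" dataType="double"/>
      <DataField name="promo" optype="continuous" dataType="double"/>
      <DataField name="demand" optype="continuous" dataType="double"/>
    </DataDictionary>
    <RegressionModel functionName="regression">
      <MiningSchema>
        <MiningField name="price"/>
        <MiningField name="promo"/>
        <MiningField name="demand" usageType="target"/>
      </MiningSchema>
      <RegressionTable intercept="10.0">
        <NumericPredictor name="price" coefficient="-2.0"/>
        <NumericPredictor name="promo" coefficient="5.0"/>
      </RegressionTable>
    </RegressionModel>
  </PMML>
XML

# Pull the intercept and coefficients out of the regression table
def load_regression(xml)
  doc = REXML::Document.new(xml)
  table = REXML::XPath.first(doc, "//RegressionTable")
  coefficients = {}
  table.elements.each("NumericPredictor") do |node|
    coefficients[node.attributes["name"]] = node.attributes["coefficient"].to_f
  end
  { intercept: table.attributes["intercept"].to_f, coefficients: coefficients }
end

# intercept + sum of coefficient * feature value
def predict(model, features)
  model[:coefficients].sum(model[:intercept]) do |name, coefficient|
    coefficient * features.fetch(name)
  end
end

model = load_regression(PMML_XML)
puts predict(model, { "price" => 3.0, "promo" => 1.0 })
```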

Once again, it’s important that models are reproducible. This allows you to update them with newer data in the future. Be sure to follow software engineering best practices like:

Here are some tools you can use:

Function            Python        R
Package management  Pipenv        Jetpack
Database access     SQLAlchemy    dbx
PMML export         sklearn2pmml  pmml

One place to be careful is implementing the features in Ruby. They must be computed exactly as they were during training. To ensure this, verify it programmatically: create a CSV file with ids and predictions from the original model and confirm the Ruby predictions match. Here’s some example code.
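A check like this could look roughly like the following. This is a minimal sketch rather than the author’s example: the CSV contents, the records, the coefficients, and the helper names are all hypothetical. A small tolerance is used because floating point results can differ slightly between languages.

```ruby
require "csv"

# Hypothetical export from the training environment: one row per record,
# with the prediction the original model produced.
REFERENCE_CSV = <<~DATA
  id,prediction
  1,11.0
  2,14.5
  3,7.0
DATA

# Hypothetical raw records, as the Rails app would see them
RECORDS = {
  1 => { quantity: 5, weekend: false },
  2 => { quantity: 6, weekend: true },
  3 => { quantity: 3, weekend: false }
}.freeze

# Hypothetical Ruby port of the features and model (made-up coefficients)
def ruby_prediction(record)
  weekend = record[:weekend] ? 1.0 : 0.0
  1.0 + 2.0 * record[:quantity] + 1.5 * weekend
end

# Returns the rows where the Ruby prediction (from the block) differs
# from the reference prediction by more than the tolerance
def prediction_mismatches(csv_text, tolerance: 1e-6)
  CSV.parse(csv_text, headers: true).filter_map do |row|
    id = Integer(row["id"])
    expected = Float(row["prediction"])
    actual = yield(id)
    next if (expected - actual).abs <= tolerance

    { id: id, expected: expected, actual: actual }
  end
end

mismatches = prediction_mismatches(REFERENCE_CSV) do |id|
  ruby_prediction(RECORDS.fetch(id))
end
puts mismatches.empty? ? "All predictions match" : "Mismatches: #{mismatches.inspect}"
```

Running a check like this in your test suite means a change to the Ruby feature code that drifts from training fails loudly.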

Pros

Cons

Pattern 4: Train and Predict in Another Language

The last option we’ll cover is doing both training and prediction outside Ruby. This is great if you have a team of data scientists who specialize in another language. This pattern allows data scientists to own models end-to-end.

It also gives you access to models that are not available in Ruby. For instance, there are forecasting libraries like Prophet and deep learning libraries like TensorFlow.

The implementation depends on how predictions are generated. Two common ways are batch and real-time.


Batch Predictions

Batch predictions are generated asynchronously and typically run at a regular interval, anywhere from every minute to once a week. An example is a daily job that updates demand forecasts for the following weeks. Predictions can be stored and later used by the Rails app as needed.

Don’t be afraid to read and write directly to the database. While microservice design patterns caution against using the database as an API, we didn’t have many issues with it. When updating records, it’s also a good idea to write audits to see how predictions change over time.
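The audit idea can be sketched in plain Ruby. Here `ForecastStore` is a hypothetical in-memory stand-in for two database tables (current forecasts and their audits); in a real app these would be tables written by the batch job and read by Rails.

```ruby
# In-memory stand-in for two hypothetical tables: forecasts and forecast_audits
class ForecastStore
  Audit = Struct.new(:item_id, :previous, :current, :recorded_at)

  attr_reader :forecasts, :audits

  def initialize
    @forecasts = {}
    @audits = []
  end

  # Update the current prediction and record an audit row when it changes
  def write(item_id, prediction, now: Time.now)
    previous = @forecasts[item_id]
    return if previous == prediction

    @audits << Audit.new(item_id, previous, prediction, now)
    @forecasts[item_id] = prediction
  end
end

store = ForecastStore.new

# First batch run
store.write(1, 120)
store.write(2, 45)

# Next run: item 1 changed, item 2 did not
store.write(1, 135)
store.write(2, 45)

store.audits.each do |audit|
  puts "item #{audit.item_id}: #{audit.previous.inspect} -> #{audit.current}"
end
```

Keeping the audit rows separate from the current values lets the Rails app read a simple table while you retain the history for debugging.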

Jobs can be scheduled with cron, or ideally a distributed scheduler like Mani for high availability. If you need to let the Rails app know a job has completed, you can do this through your messaging system. HTTP works great if you don’t have one.


Real-Time Predictions

Real-time predictions are generated synchronously and are triggered by calls from the Rails app. An example is recommending items to a user at checkout based on what’s in their cart.

HTTP is a common choice for retrieving predictions, but you can use a messaging system or even pipes. Great tools for HTTP are Django and Flask for Python and Plumber for R.
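On the Rails side, a call to such a service could be sketched with Net::HTTP from the standard library. The URL, the `features` and `prediction` JSON keys, and the method names here are assumptions for illustration; the actual contract is whatever your prediction service defines.

```ruby
require "json"
require "net/http"
require "uri"

# Build a JSON POST request (the "features" key is an assumed contract)
def build_prediction_request(uri, features)
  request = Net::HTTP::Post.new(uri, "Content-Type" => "application/json")
  request.body = JSON.generate(features: features)
  request
end

# Extract the prediction (the "prediction" key is an assumed contract)
def parse_prediction(body)
  JSON.parse(body).fetch("prediction")
end

# Since the call is synchronous, use short timeouts and return a fallback
# if the service is slow, down, or returns something unexpected
def fetch_prediction(url, features, fallback:, timeout: 0.5)
  uri = URI(url)
  options = {
    use_ssl: uri.scheme == "https",
    open_timeout: timeout,
    read_timeout: timeout
  }
  response = Net::HTTP.start(uri.host, uri.port, **options) do |http|
    http.request(build_prediction_request(uri, features))
  end
  response.is_a?(Net::HTTPSuccess) ? parse_prediction(response.body) : fallback
rescue Net::OpenTimeout, Net::ReadTimeout, SystemCallError, SocketError,
       JSON::ParserError, KeyError
  fallback
end

# Example usage (hypothetical service URL):
# fetch_prediction("http://localhost:5000/recommendations",
#                  { cart_item_ids: [1, 2, 3] }, fallback: [])
```

The fallback matters because the user is waiting on the response: showing default recommendations is usually better than a failed checkout page.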


As with the other patterns, follow best engineering practices. In addition to ones previously mentioned:

Don’t be afraid to use Rails to manage the database schema. It’s easy enough for data scientists to learn to create and run migrations. Otherwise, you need to support another system for schema changes.

To store models, you most likely won’t use an interchange format, since libraries can’t load them. Instead, use serialization specific to the language, like pickle in Python and serialize in R.

If you’re deciding between Python and R, Python has more general-purpose libraries, so it’s easier to run in production.

Pros

Cons

Conclusion

You’ve now seen four great patterns for bringing predictive models to Rails. Each has different trade-offs, so we recommend taking the simplest approach that works for you. No matter which you choose, make sure your models are reproducible.

Happy modeling!

Updates

Published October 29, 2018

Thanks to Jeremy Stanley for reading a draft of this.




All code examples are public domain.
Use them however you’d like (licensed under CC0).