Daru: Pandas for Ruby
NumPy and Pandas are two extremely popular libraries for machine learning in Python. Last post, we looked at Numo, a Ruby library similar to NumPy. As luck would have it, there’s a library similar to Pandas as well. It’s called Daru, and it’s the focus of this post.
2020 Update: Since writing this article, I created a data frame library called Rover that’s designed for data exploration and machine learning. Check it out as well.
Daru is a data analysis library. Its core data structure is the data frame, which is similar to an in-memory database table. Data frames have rows and columns, and each column has a specific data type. Let’s create a data frame with the most populous countries:
df = Daru::DataFrame.new(
  country: ["China", "India", "USA"],
  population: [1433, 1366, 329] # in millions
)
Population data from the United Nations, 2019
Here’s what it looks like:
   country population
 0   China       1433
 1   India       1366
 2     USA        329
You can get specific columns with:
df[:country]
df[:country, :population]
Or specific rows with:
df.first(2)  # first 2 rows
df.last(2)   # last 2 rows
df.row[1]    # 2nd row
df.row[1..2] # 2nd and 3rd row
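To make the column and row lookups above concrete, here is a minimal plain-Ruby sketch of a column-oriented table. It does not use Daru and says nothing about how Daru is implemented; `FRAME`, `column`, and `row` are names made up for this illustration.

```ruby
# Plain-Ruby model of a data frame: one array per column, all the same length.
FRAME = {
  country: ["China", "India", "USA"],
  population: [1433, 1366, 329] # in millions
}.freeze

# Column access is a lookup by name.
def column(name)
  FRAME.fetch(name)
end

# Row access gathers the value at the same position from every column.
def row(index)
  FRAME.transform_values { |values| values[index] }
end

puts column(:country).join(", ") # China, India, USA
puts row(1).values.join(", ")    # India, 1366
```

Storing data by column is what makes picking a column cheap, while a row has to be assembled from every column.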
Filtering, Sorting, and Grouping
Select countries with over 1 billion people.
df.where(df[:population] > 1000)
For equality, use eq or in:
df.where(df[:country].eq("China"))
df.where(df[:country].in(["USA", "India"]))
Negate a condition with !:
df.where(!df[:country].eq("China"))
Combine operators with & (and) and | (or):
df.where(df[:country].eq("USA") | (df[:population] < 1400))
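Conditions like these can be thought of as boolean masks: each comparison produces one true/false value per row, masks are combined element by element, and the filter keeps the rows marked true. The sketch below models that idea in plain Ruby without Daru; the helper names (`mask`, `mask_and`, `mask_or`, `mask_not`, `where`) are hypothetical and chosen for this example only.

```ruby
COUNTRY = ["China", "India", "USA"].freeze
POPULATION = [1433, 1366, 329].freeze # in millions

# Each comparison yields one boolean per row (a mask).
def mask(values, &condition)
  values.map(&condition)
end

# Element-wise "and", "or", and "not" over masks.
def mask_and(a, b)
  a.zip(b).map { |x, y| x && y }
end

def mask_or(a, b)
  a.zip(b).map { |x, y| x || y }
end

def mask_not(a)
  a.map { |x| !x }
end

# Keep the values whose mask entry is true.
def where(values, mask)
  values.select.with_index { |_, i| mask[i] }
end

is_usa = mask(COUNTRY) { |c| c == "USA" }
under_1400 = mask(POPULATION) { |n| n < 1400 }

puts where(COUNTRY, mask_or(is_usa, under_1400)).join(", ") # India, USA
puts where(COUNTRY, mask_not(is_usa)).join(", ")            # China, India
```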
Sort the data frame by a column with:
df.sort([:population])
df.sort([:country], ascending: [false])
You can also group data and perform aggregations.
cities = Daru::DataFrame.new(
  country: ["China", "China", "India"],
  city: ["Shanghai", "Beijing", "Mumbai"]
)
cities.group_by([:country]).count
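If grouping is new to you, it may help to see what a group-and-count computes. This is a plain-Ruby sketch over an array of row hashes, not Daru code; `CITY_ROWS` and `count_by` are names invented for the example, and Daru's own result comes back as a data frame rather than a hash.

```ruby
CITY_ROWS = [
  { country: "China", city: "Shanghai" },
  { country: "China", city: "Beijing" },
  { country: "India", city: "Mumbai" }
].freeze

# Bucket the rows by the value of one column, then count each bucket.
def count_by(rows, key)
  rows.group_by { |row| row[key] }.transform_values(&:size)
end

count_by(CITY_ROWS, :country).each do |country, count|
  puts "#{country}: #{count}"
end
# China: 2
# India: 1
```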
Combining Data Frames
There are a number of ways to combine data frames. You can add rows:
countries = Daru::DataFrame.new(
  country: ["Indonesia", "Pakistan"],
  population: [271, 217] # in millions
)
df.concat(countries)
Or add columns:
locations = Daru::DataFrame.new(
  continent: ["Asia", "Asia", "North America"],
  planet: ["Earth", "Earth", "Earth"]
)
df.merge(locations)
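The difference between adding rows and adding columns is the direction in which the data grows. Here is a rough plain-Ruby model of both, treating a frame as a hash of equal-length column arrays. It is only a sketch of the idea, not Daru's implementation, and `BASE`, `stack_rows`, and `add_columns` are hypothetical names.

```ruby
BASE = {
  country: ["China", "India", "USA"],
  population: [1433, 1366, 329] # in millions
}.freeze

# Adding rows: both frames need the same columns; values are appended.
def stack_rows(top, bottom)
  raise ArgumentError, "columns differ" unless top.keys.sort == bottom.keys.sort

  top.to_h { |name, values| [name, values + bottom.fetch(name)] }
end

# Adding columns: both frames need the same number of rows.
def add_columns(left, right)
  sizes = (left.values + right.values).map(&:size).uniq
  raise ArgumentError, "row counts differ" unless sizes.size == 1

  left.merge(right)
end

more = { country: ["Indonesia", "Pakistan"], population: [271, 217] }
puts stack_rows(BASE, more)[:country].join(", ")
# China, India, USA, Indonesia, Pakistan

extra = { continent: ["Asia", "Asia", "North America"] }
puts add_columns(BASE, extra).keys.join(", ")
# country, population, continent
```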
You can also perform joins like in SQL.
cities = Daru::DataFrame.new(
  country: ["China", "China", "India"],
  city: ["Shanghai", "Beijing", "Mumbai"]
)
df.join(cities, how: :inner, on: [:country])
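For anyone less familiar with SQL, an inner join pairs every row on the left with every row on the right that has the same key, and drops rows with no match. The plain-Ruby sketch below shows that logic on the same data. It is not Daru code, `inner_join` is a made-up helper, and the row order of Daru's result may differ.

```ruby
COUNTRY_ROWS = [
  { country: "China", population: 1433 },
  { country: "India", population: 1366 },
  { country: "USA", population: 329 }
].freeze

CITIES = [
  { country: "China", city: "Shanghai" },
  { country: "China", city: "Beijing" },
  { country: "India", city: "Mumbai" }
].freeze

# Index the right side by key, then pair each left row with its matches.
# Left rows without a match (here, USA) produce no output rows.
def inner_join(left, right, on)
  index = right.group_by { |row| row[on] }
  left.flat_map do |l|
    index.fetch(l[on], []).map { |r| l.merge(r) }
  end
end

inner_join(COUNTRY_ROWS, CITIES, :country).each do |row|
  puts row.values_at(:country, :population, :city).join(", ")
end
# China, 1433, Shanghai
# China, 1433, Beijing
# India, 1366, Mumbai
```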
Reading and Writing Data
Daru makes it easy to load data from a CSV file.
After manipulating the data, you can save it back to a CSV file.
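In Daru itself, I believe the relevant calls are Daru::DataFrame.from_csv and DataFrame#write_csv. To show what such a round trip involves, here is a sketch that sticks to Ruby's standard csv library and a hash of columns rather than Daru. The helper names `write_columns` and `read_columns` are invented for the example, and the file is a temporary one.

```ruby
require "csv"
require "tempfile"

# Write column-oriented data to a CSV file with a header row.
def write_columns(path, columns)
  CSV.open(path, "w") do |csv|
    csv << columns.keys
    columns.values.transpose.each { |row| csv << row }
  end
end

# Read a CSV file back into columns, converting numeric fields.
def read_columns(path)
  table = CSV.read(path, headers: true, converters: :numeric)
  table.headers.to_h { |name| [name.to_sym, table[name]] }
end

Tempfile.create(["countries", ".csv"]) do |file|
  file.close
  data = {
    country: ["China", "India", "USA"],
    population: [1433, 1366, 329] # in millions
  }
  write_columns(file.path, data)

  columns = read_columns(file.path)
  puts columns[:country].join(", ") # China, India, USA
  puts columns[:population].sum     # 3128
end
```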
You can also load data directly from Active Record.
relation = Country.where("population > 100")
Daru::DataFrame.from_activerecord(relation)
Plotting
For plotting, use a Jupyter notebook with IRuby. Create a plot with:
df.plot type: :bar, x: :country, y: :population do |plot, diagram|
  plot.x_label "Country"
  plot.y_label "Population (millions)"
  diagram.color(Nyaplot::Colors.Pastel1)
end
You can also create line charts, scatter plots, box plots, and histograms.
You’ve now seen how to use Daru to:
- create data frames
- filter, sort, and group data
- combine data frames
- create plots
Try out Daru for your next analysis.