In some recent work, I needed to bin columns in a dplyr remote table. The new
dplyr::ntile function was not an option because the database I needed this to work on (MySQL) doesn't support window functions (at least, not until recently), and there was a use case for user-defined cut points rather than quantile-based cuts, so I needed to roll my own solution.
In the database world, a common approach to this is to run a join against a table with only the cut points, filter out rows with the target variable greater than the cut point, then take the minimum value (grouped by the primary key of the table).
Another approach available to me, since I'm not writing raw SQL, is to program a
CASE WHEN statement, generating the bins based off that conditional.
dplyr has the
case_when that works on database back ends, making this trivial to construct an arbitrarily long CASE statement.
Eventually all this logic is going to get wrapped up into a shiny app for users to apply to arbitrary columns with arbitrary cut points. I'm going to need a performant solution.
Because of their approach to non standard evaluation, programming against tidy-verse functions can be a pain. I need to use a function factory to generate the function I'll apply to a remote table to bin it with
dplyr::case_when. This is also a useful approach in benchmarking because by isolating any start-up costs or overhead associated with computing the function, I can get a better idea of how the generated SQL performs on it's own. Here's the code for the case when function factory:
This will return a function of data that will compute the new column for me.
The the join method with
dplyr is much more straightforward. I don't need a function factory at all--The only parameters are my (temporary) table of cut points and my data to filter by. The rest stays the same:
If you're not familiar with
rquery, it solves a similar problem as dplyr but in a different way.
rquery is based on Codd's relational algebra (Codd of Boyce-Codd Normal Form), so those who are accustomed to the verbs of dplyr may be lost initially. But with these less familiar verbs we also get the ability to compute on relational data in a way that is more flexible and better able to be optimized by query optimizers.
Another interesting feature of
rquery is the ability to re-use a chain of operations in other contexts--even in other databases--as long as the table name and columns referenced maintain their same names. I can use this to create a
rquery function factory the same way I did for case when--moving all the setup logic out of the function that gets benchmarked. Here's the
rquery function factory:
Notice it returns a function with no arguments. The only thing that is necessary to run this this chain of operations once we've computed it is the database connection.
I want to benchmark a few things. I want to know how each method fairs when varying both the number of rows in the remote table and the number of cut points. I also want to know whether putting an index on the column that gets cut will speed things up. The function
run_benchmark (see it in this gist) will do all necessary table setup and will call the function factories, finally benchmarking just the operation itself for
case_when, and both
rquery using the join method of binning. I'll be using the fantastic
microbenchmark package to do all the benchmarking.
Varying Table Size
First lets look at how varying table size affects each. With only three cut points,
rquery is the clear leader with
dplyr::case_when not far behind. Using a join with
dplyr pulls up the rear.
Note: there are no units on these plots. I think this one is in milliseconds. It doesn't really matter though--I'm interested in comparisons, not absolute time.
For my use case, I can stop here. I'm enforcing a limit on the maximum number of cuts a user can specify and it'll be close to three. All three methods are reasonably fast, but I'm using
case_when because a the client wants the code in dplyr and it avoids the overhead of having to write a temporary table with cut points in it--something is not a part of the benchmark here. I am still curious what happens when we bump the number of cut points to 10, 100, and even 1000.
Varying Cut Point Number
I assumed that this is where case_when would really start to shine--a join on 100 or 1000 cut points means every row in the source table gets joined to 100, or 1000, rows in the cut table--and I only have so much memory on my home laptop. But things ended up going the opposite way.
rquery with a join is still the fastest solution, but followed closely by dplyr's join. It looks like the performance of
case_when degrades exponentially as more cases are added.
Note: the y-axes above are on different scales.
Indexed Cut Column
Finally, Indexing. I expected this to make a big difference--but looks like it didn't help that much. This is again with 10,000 rows and 100 cut points.
Conclusions, Limitations, etc
rquery with a join is consistently the fastest option here. Based on what I've read of
rquery, this is because the SQL it writes is ripe for query optimization.
dplyr lags behind even CASE WHEN with only a few cut points but stays pretty much in line with
rquery once the number of cuts increase. I wonder if this is overhead in
dplyr code rather than anything on the database end? One way to get to the bottom of this would be to extract the queries that
rquery write and benchmark those.
I was surprised at the poor performance of the CASE expression at first. After reading a little more about how CASE WHEN works though, this makes sense. And lookup tables are a very common database pattern--of course they are going to be heavily optimized!