Digesting Data 2019.02

Cool Stuff

Data Pipelines

The Python package Great Expectations has a fantastic approach to data pipeline testing. It works with in-memory pandas DataFrames and with remote database objects via SQLAlchemy. Incorporating assertion-style testing inline with your data manipulations is now clean and easy.
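To make "assertion-style testing inline" concrete, here is a minimal sketch in plain pandas. It is not the Great Expectations API; the helpers expect_not_null and expect_between and the orders table are hypothetical names made up for illustration. The idea is only that each check sits in the middle of the pipeline and fails loudly when the data breaks an assumption.

```python
import pandas as pd


def expect_not_null(df, column):
    """Hypothetical helper: fail if the column contains missing values."""
    nulls = int(df[column].isna().sum())
    assert nulls == 0, f"{column} has {nulls} null value(s)"
    return df  # return the frame so the check can sit inside a .pipe() chain


def expect_between(df, column, low, high):
    """Hypothetical helper: fail if any value falls outside [low, high]."""
    bad = df[~df[column].between(low, high)]
    assert bad.empty, f"{column} has {len(bad)} value(s) outside [{low}, {high}]"
    return df


# Made-up example data.
orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [19.99, 5.00, 42.50]})

cleaned = (
    orders
    .pipe(expect_not_null, "order_id")         # check before transforming
    .pipe(expect_between, "amount", 0, 1000)   # check a plausible value range
    .assign(amount_cents=lambda d: (d["amount"] * 100).round().astype(int))
)
```

Great Expectations gives you a much richer, declarative version of this pattern, so treat the sketch as a picture of the idea rather than a substitute.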

And speaking of data pipelines, Bonobo is the new Python ETL framework to watch. It caters to smaller and more straightforward projects where Airflow or Luigi are overkill. Bonobo isn’t ready for prime time yet, but stay tuned!

Databases

Lately, I’ve been working on making databases more accessible to the part-time data analyst. For Python users, dataset lowers the barrier to using databases: it makes reading and writing remote database tables feel a lot more like working with CSVs, with the added power of an easy-to-use query interface.

Shiny Deployment

If you deploy Shiny apps in production, check out ShinyProxy. It’s a free and open-source alternative to Shiny Server Pro, with the added bonus of better scalability. Interestingly, ShinyProxy doesn’t limit you to Shiny apps at all: you can deploy Python-based apps written with Flask or Dash, or even use it to mimic the features of RStudio Server Pro without the added cost.

Tidyverse

Finally, if you’re an R tidyverse user, check out wrapr. It makes programming with tidyverse and rlang-based packages much easier.

R or Python?

A new post (and a great discussion in the comments) highlights one of the rarely discussed differences between R and Python: the license. When creating software for a client, there are cases where using R may not be an option because of the nature of the GPL. Know your open source licenses!


New Releases and Developments

Databases

Version 11 of the database server PostgreSQL was released in October 2018. The release brings enhanced query parallelism and support for just-in-time compilation of expressions via LLVM. In short, Postgres 11 is faster!

MySQL 8.0 also debuted last year, finally introducing common table expressions (CTEs) and window functions, in addition to numerous performance enhancements and other awesome features. If you use MySQL, it might be time to upgrade!
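If you haven't used CTEs or window functions before, here is a small sketch of both in one query. To keep it runnable without a MySQL server, it uses Python's built-in sqlite3 module as a stand-in (this assumes your bundled SQLite is version 3.25 or newer, which is when SQLite gained window functions). The sales table and its rows are made up for illustration; the same SQL should look familiar on MySQL 8.0.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL 8.0 server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 100),
        ("east", 2, 150),
        ("west", 1, 80),
        ("west", 2, 120),
        ("west", 3, 60),
    ],
)

query = """
WITH regional AS (              -- CTE: a named, reusable subquery
    SELECT region, month, revenue
    FROM sales
    WHERE revenue >= 80
)
SELECT region,
       month,
       revenue,
       -- Window function: running total per region, ordered by month
       SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total
FROM regional
ORDER BY region, month
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The CTE keeps the filtering step readable, and the window function adds the running total without collapsing rows the way GROUP BY would.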

Programming

Last year, the Julia programming language hit version 1.0. Julia is as easy to write as Python, but with speeds that give C a run for its money. Expect to see Julia taking a larger role in data science in the coming years!

All the way at the bottom of the release notes for pandas version 0.24.0, we find some much-needed enhancements to the DataFrame.to_sql() method. For those doing large database inserts from Python, check it out!
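One of those enhancements is the method argument, which was added to to_sql() in 0.24.0. Here is a minimal sketch, assuming an in-memory SQLite database and a made-up measurements table. Passing method="multi" sends several rows per INSERT statement instead of one, and chunksize controls how many rows go into each batch.

```python
import sqlite3

import pandas as pd

# Made-up data: 1,000 rows to insert.
df = pd.DataFrame({
    "id": range(1000),
    "value": [i * 0.5 for i in range(1000)],
})

# In-memory SQLite keeps the example self-contained; in practice this
# would more likely be a SQLAlchemy engine pointing at a remote database.
conn = sqlite3.connect(":memory:")

df.to_sql(
    "measurements",       # hypothetical table name
    conn,
    index=False,
    if_exists="replace",
    method="multi",       # multi-row INSERT statements (new in 0.24.0)
    chunksize=100,        # rows per statement; keep this modest, since
                          # databases cap the parameters per statement
)

row_count = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
print(row_count)
```

Whether "multi" is actually faster depends on the database and driver, so it is worth timing against the default on your own setup.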


Recent Blog Posts

Classifying cats vs. dogs is cool, but rarely useful. My most recent blog posts include a series about a practical use case for machine learning. Each post takes you a little deeper, as we move from problem conceptualization through model development and testing. Watch for posts on deployment and testing, coming soon!