clipivot
is a command-line tool I wrote in Rust that generates pivot tables from CSV files or any other delimiter-separated files. It works on arbitrarily large files, including ones that are larger than your RAM.
There are a few reasons I decided to build this program.
The first, and most intuitive, is that it works well with other command-line tools, like the csv toolkits xsv
, csvtk
, and csvkit
.
This makes the tool useful for quick data exploration tasks. I've used it to conduct common data sanity checks, like finding the earliest and most recent dates when certain categories appear in a dataset. I've also used it to replicate portions of data journlism pieces and to find potential stories to pursue in datasets. Overall, I find it to faster to use than either pandas
, the tidyverse
, or Excel for initial data exploration tasks.
The second reason is a more extreme case, especially in data journalism, where most datasets don't even reach 1 GB. And that's the thing I alluded to in the intro: clipivot
works on arbitrarily large datasets, including those that exceed RAM. This means that for large databases, you don't have to worry about doing anything differently than you would on smaller datasets. And while the performance isn't perfect, it's fast enough to do just fine in most cases, often beating the performance of pandas
(at least in larger databases).
To show this, I'll compute a pivot table using the DEA ARCOS database that the Washington Post obtained through a lawsuit and cleaned for other journalists to use. I'll specifically be using the raw, filtered data of shipments of hydrocodone and oxycodone pills to chain pharmacies, retail pharmacies and practitioners that the Washington Post has provided to journalists.
(This database is available to download here. Note that the version that I have downloaded consists of shipments between 2006 and 2012; the Washington Post has since obtained data from 2013 and 2014.)
Even missing data from 2013 and 2014, this database is quite large:
$ du wapo_cleaned.tsv -BG
75G wapo_cleaned.tsv
But using clipivot
and a couple of command-line tools, I'm going to count the total number of oxycodone and hydrocodone pills shipped to pharmacies, aggregated by the year and type of pill. Here's what that command looks like:
$ xsv fmt wapo_cleaned.tsv |
> csvtk mutate -f TRANSACTION_DATE -p "[0-9]{2}[0-9]{2}([0-9]{4})" -n year |
> clipivot sum -r year -c DRUG_NAME -v DOSAGE_UNIT -e |
> xsv select 1,HYDROCODONE,OXYCODONE | xsv sort -s 1 | xsv table
HYDROCODONE OXYCODONE
2006 5843738794.201000000000000015 2545387839.4
2007 6523828921.000 2939761146.0
2008 6933966164.125 3315565762.5
2009 7361480361.400 3779531181.5228659999999999999
2010 7657656885.20 4288524397.2393040000000000000
2011 8216725038.98 4558716202.16999999999998
2012 8114173954.5 4549792453.0
To learn more about the tool, see the GitHub page about it.