eBay's TSV Utilities

Command line tools for large tabular data files.



April 2018 Comparative Benchmarks Update

Overview

This is an update to the March 2017 Comparative Benchmarks Study. The 2017 study compared eBay's TSV Utilities to a number of other tools as a way to gauge the performance of the D programming language. This page contains updates to those benchmarks.

The goal of the 2017 study was to evaluate the performance of an application written in a straightforward fashion, without going to unusual lengths to optimize performance. The 2018 versions of eBay's TSV Utilities are still written in this style, but have had performance improvements since then. The most notable is the use of Link Time Optimization (LTO) and Profile Guided Optimization (PGO), compiler technologies available via the LDC compiler.
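For readers unfamiliar with these compiler features, the sketch below shows roughly how LTO and PGO builds are done with LDC. It is illustrative only: the single-file compile, the file names (tsv-filter.d, sample_data.tsv), and the instrumented binary name are placeholders, and the project's actual builds are driven by its own build scripts with possibly different flags.

$ # LTO: compile with link time optimization enabled
$ ldc2 -O3 -release -flto=thin -of=tsv-filter tsv-filter.d

$ # PGO: build an instrumented binary, run it on representative data, then rebuild with the profile
$ ldc2 -O3 -release -fprofile-instr-generate=tsv-filter.profraw -of=tsv-filter-instr tsv-filter.d
$ ./tsv-filter-instr -H --gt 4:0.5 sample_data.tsv > /dev/null
$ ldc-profdata merge -o tsv-filter.profdata tsv-filter.profraw
$ ldc2 -O3 -release -fprofile-instr-use=tsv-filter.profdata -of=tsv-filter tsv-filter.d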

See the main performance benchmarks page for more information on the goals of the study and the approach used. See the 2017 study for details about the individual benchmarks. This page is primarily reporting the updated results. The 2017 study also includes several analyses not included in the 2018 study, for example, a comparison of DMD and LDC compilers.

The 2018 study adds one additional benchmark to the 2017 set. Benchmarks were run on MacOS and Linux. Only the faster tools from the 2017 study were included in the 2018 update, the slower tools were dropped.

Benchmark tests

Seven tasks were used as benchmarks: two forms of row filtering (numeric comparisons and regular expression matching); joining two files on a common key; simple statistical calculations (e.g. the mean of column values); converting CSV files to TSV; and two column selection tests (aka 'cut'), one with medium-width lines and the other with short lines. The second column selection test was new in the 2018 study.

Reasonably large files were used: one of 4.8 GB with 7 million rows, one of 2.7 GB with 14 million rows, and one of 1.7 GB with 86 million lines. Tests against smaller files gave results consistent with the larger file tests.

The MacOS benchmarks were run on a Mac mini with 16 GB RAM, SSD storage, and a 3 GHz Intel i7 (2 cores). This machine is not quite as fast as the MacBook Pro used for the 2017 study, but its performance is not far off.

The Linux benchmarks were run on a commodity cloud machine with 16 Intel (Haswell) CPUs at 2095 MHz and 32 GB RAM. The 2017 benchmarks did not include Linux.

The MacOS benchmarks were run 25 times and the Linux benchmarks 35 times. The 10th percentile numbers are reported here. Median and quartile numbers were consistent and could have been used as well.
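As an illustration of this reporting approach (not the actual harness used in the study), a benchmark command can be timed repeatedly and the 10th percentile of the elapsed times extracted with standard tools. The run_times.txt file name and the awk index selection are assumptions made for this sketch:

$ # Time 25 runs of one benchmark command, capturing the 'real' time of each run
$ for i in $(seq 1 25); do /usr/bin/time -p tsv-filter -H --gt 4:0.000025 --gt 16:0.3 hepmass_all_train.tsv > /dev/null 2>> run_times.txt; done
$ # Sort the elapsed times and take the value 10% of the way in (the 3rd fastest of 25 runs)
$ grep '^real' run_times.txt | awk '{print $2}' | sort -n | awk '{t[NR]=$1} END {print t[int(0.10 * NR) + 1]}'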

TSV Utilities performance improvements since the March 2017 study

eBay's TSV Utilities became materially faster between the March 2017 and April 2018 studies. The table below shows these changes. The prebuilt binary for release v1.1.11 was used as a proxy for the March 2017 versions; v1.1.11 was released after the March 2017 study but is largely unchanged from the versions benchmarked then.

Times are in seconds. The Delta column is the percent reduction in run time from the March 2017 version (v1.1.11) to the April 2018 version (v1.1.19); for example, 25.12 seconds to 10.18 seconds is a 59% reduction.

Benchmark                  MacOS       MacOS       MacOS   Linux       Linux       Linux
                           March 2017  April 2018  Delta   March 2017  April 2018  Delta
                           (v1.1.11)   (v1.1.19)           (v1.1.11)   (v1.1.19)
Numeric row filter               4.25        3.35    21%         6.32        5.48    13%
Regex row filter                10.11        8.28    18%         9.72        8.80     9%
Column selection                 4.26        2.93    31%         5.52        4.79    13%
Column selection (narrow)       25.12       10.18    59%        15.41        8.26    46%
Join two files                  23.94       21.17    12%        28.77       26.68     7%
Summary statistics              16.97        9.82    42%        22.68       15.78    30%
CSV-to-TSV                      30.70       10.91    64%        34.65       20.30    41%

There were three main sources of performance improvements between the 2017 and 2018 studies; the most notable was the use of Link Time Optimization (LTO) and Profile Guided Optimization (PGO) described in the overview above.

Benchmark results

Updates to the benchmarks from the 2017 study are provided below. The top four tools in each benchmark are shown.

A note about comparisons between individual tools

The intent of the study is to gain an overall picture of the performance of typical D programs. Comparisons across a range of tests and tools help with this. However, any single test or comparison between individual tools is less meaningful. Each tool has its own design and functionality goals, and these goals affect performance, often in ways that reflect deliberate design trade-offs on the part of the developer.

A few tool-specific considerations apply as well.

Top four in each benchmark

The tables below show the fastest times for each benchmark, one table each for MacOS and Linux. Times are in seconds. A description of the individual tests can be found in the March 2017 study. eBay's TSV Utilities entries are tsv-filter, tsv-select, tsv-join, tsv-summarize, and csv2tsv. Reference information for all tools is in the Test details section.

MacOS: Top-4 in each benchmark

Benchmark                  File               Tool/Time           Tool/Time          Tool/Time          Tool/Time
Numeric row filter         4.8 GB, 7M lines   tsv-filter 3.35     mawk 15.06         GNU awk 24.25      csvtk 39.10
Regex row filter           2.7 GB, 14M lines  xsv 7.03            tsv-filter 8.28    GNU awk 16.47      mawk 19.40
Column selection           4.8 GB, 7M lines   tsv-select 2.93     xsv 7.67           csvtk 11.00        mawk 12.37
Column selection (narrow)  1.7 GB, 86M lines  xsv 9.22            tsv-select 10.18   GNU cut 10.65      csvtk 23.01
Join two files             4.8 GB, 7M lines   tsv-join 21.78      xsv 60.03          csvtk 82.43
Summary statistics         4.8 GB, 7M lines   tsv-summarize 9.82  xsv 35.32          csvtk 45.59        GNU Datamash 71.60
CSV-to-TSV                 2.7 GB, 14M lines  csv2tsv 10.91       xsv 14.38          csvtk 32.49

Linux: Top-4 in each benchmark

Benchmark                  File               Tool/Time            Tool/Time          Tool/Time           Tool/Time
Numeric row filter         4.8 GB, 7M lines   tsv-filter 5.48      mawk 11.31         GNU awk 42.80       csvtk 53.36
Regex row filter           2.7 GB, 14M lines  xsv 7.97             tsv-filter 8.80    mawk 17.74          GNU awk 29.02
Column selection           4.8 GB, 7M lines   tsv-select 4.79      mawk 9.51          xsv 9.74            GNU cut 14.46
Column selection (narrow)  1.7 GB, 86M lines  GNU cut 5.60         tsv-select 8.26    xsv 13.60           mawk 23.88
Join two files             4.8 GB, 7M lines   tsv-join 26.68       xsv 68.02          csvtk 98.51
Summary statistics         4.8 GB, 7M lines   tsv-summarize 15.78  xsv 44.38          GNU Datamash 48.51  csvtk 59.71
CSV-to-TSV                 2.7 GB, 14M lines  csv2tsv 20.30        xsv 26.82          csvtk 44.82

Test details

Tests were run on April 14, 2018. The latest released version of each tool was used. Details needed to reproduce the tests are given below. The March 2017 study has a more detailed description of the individual tests; however, everything needed to reproduce them can be found here.

Machines

MacOS: Mac mini, 3 GHz Intel i7 (2 cores), 16 GB RAM, SSD storage.
Linux: commodity cloud machine, 16 Intel (Haswell) CPUs at 2095 MHz, 32 GB RAM.

Tools

GNU awk, GNU cut, and mawk are available on most platforms via standard package managers. The xsv and csvtk binaries were downloaded from the prebuilt releases on their respective GitHub releases pages. eBay's TSV Utilities were likewise installed from the prebuilt binaries available on GitHub.
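As an example of this kind of setup (package names below are for Debian/Ubuntu; MacOS equivalents are available via Homebrew, e.g. brew install gawk):

$ sudo apt-get install gawk mawk coreutils datamash
$ # xsv, csvtk, and eBay's TSV Utilities: download the prebuilt binary archive from each
$ # project's GitHub releases page, unpack it, and place the executable on the PATH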

Data files

The benchmark data files were hepmass_all_train.tsv (4.8 GB, 7 million rows), TREE_GRM_ESTN_14mil.tsv and its CSV counterpart (2.7 GB, 14 million rows), and googlebooks-eng-all-1gram-20120701-a.tsv (1.7 GB, 86 million lines).

To create the left and right data files used in the join test, a line number was added as a join key, the columns were split into left and right files sharing that key, and each file was shuffled using a fixed random seed:

$ number-lines -s line hepmass_all_train.tsv > hepmass_numbered.tsv
$ tsv-select -f 1-16 hepmass_numbered.tsv | tsv-sample -H -v 111 > hepmass_left.shuf.tsv
$ tsv-select -f 1,17-30 hepmass_numbered.tsv | tsv-sample -H -v 333 > hepmass_right.shuf.tsv
$ rm hepmass_numbered.tsv

Command lines

Command lines used in the tests are given below. The command lines are the same for Linux and MacOS. Both awk and gawk refer to GNU awk.

Numeric filter command lines:

$ [awk|mawk|gawk] -F $'\t' -v OFS='\t' '{ if ($4 > 0.000025 && $16 > 0.3) print $0 }' hepmass_all_train.tsv > /dev/null
$ csvtk filter2 -t -l -f '$4 > 0.000025 && $16 > 0.3' hepmass_all_train.tsv > /dev/null
$ tsv-filter -H --gt 4:0.000025 --gt 16:0.3 hepmass_all_train.tsv > /dev/null

Regular expression filter command lines:

$ [awk|gawk|mawk] -F $'\t' -v OFS='\t' '$10 ~ /[RD].*ION[0-2]/' TREE_GRM_ESTN_14mil.tsv > /dev/null
$ csvtk grep -t -l -f 10 -r -p '[RD].*ION[0-2]' TREE_GRM_ESTN_14mil.tsv > /dev/null
$ xsv search -s COMPONENT '[RD].*ION[0-2]' TREE_GRM_ESTN_14mil.tsv > /dev/null
$ tsv-filter -H --regex 10:'[RD].*ION[0-2]' TREE_GRM_ESTN_14mil.tsv > /dev/null

Column selection command lines:

$ [awk|gawk|mawk] -F $'\t' -v OFS='\t' '{ print $1,$8,$19 }' hepmass_all_train.tsv > /dev/null
$ csvtk cut -t -l -f 1,8,19 hepmass_all_train.tsv > /dev/null
$ cut -f 1,8,19 hepmass_all_train.tsv > /dev/null
$ tsv-select -f 1,8,19 hepmass_all_train.tsv > /dev/null
$ xsv select 1,8,19 hepmass_all_train.tsv > /dev/null

Column selection (narrow) command lines:

$ [awk|gawk|mawk] -F $'\t' -v OFS='\t' '{ print $1,$3 }' googlebooks-eng-all-1gram-20120701-a.tsv > /dev/null
$ csvtk cut -t -l -f 1,3 googlebooks-eng-all-1gram-20120701-a.tsv > /dev/null
$ cut -f 1,3 googlebooks-eng-all-1gram-20120701-a.tsv > /dev/null
$ tsv-select -f 1,3 googlebooks-eng-all-1gram-20120701-a.tsv > /dev/null
$ xsv select 1,3 googlebooks-eng-all-1gram-20120701-a.tsv > /dev/null

Join files command lines:

$ csvtk join -t -l -f 1 hepmass_left.shuf.tsv hepmass_right.shuf.tsv > /dev/null
$ tsv-join -H -f hepmass_right.shuf.tsv -k 1 hepmass_left.shuf.tsv -a 2,3,4,5,6,7,8,9,10,11,12,13,14,15 > /dev/null
$ xsv join 1 -d $'\t' hepmass_left.shuf.tsv 1 hepmass_right.shuf.tsv > /dev/null

Summary statistics command lines:

$ csvtk stat2 -t -l -f 3,5,20 hepmass_all_train.tsv > /dev/null
$ cat hepmass_all_train.tsv | datamash -H count 3 sum 3,5,20 min 3,5,20 max 3,5,20 mean 3,5,20 sstdev 3,5,20 > /dev/null
$ tsv-summarize -H --count --sum 3,5,20 --min 3,5,20 --max 3,5,20 --mean 3,5,20 --stdev 3,5,20 hepmass_all_train.tsv > /dev/null
$ xsv stats -s 3,5,20 hepmass_all_train.tsv > /dev/null

CSV-to-TSV command lines:

$ csvtk csv2tab TREE_GRM_ESTN_14mil.csv > /dev/null
$ csv2tsv TREE_GRM_ESTN_14mil.csv > /dev/null
$ xsv fmt -t '\t' TREE_GRM_ESTN_14mil.csv > /dev/null

Note: For xsv, the more correct way to operate on unescaped TSV files is to first pipe the data through xsv input --no-quoting. For example:

$ xsv input --no-quoting hepmass_all_train.tsv | xsv select 1,8,19 > /dev/null

However, adding an additional command invocation runs counter to the goals of the benchmark exercise, so it was not done in these tests.