eBay's TSV Utilities

Command line tools for large tabular data files.


Project maintained by eBay Hosted on GitHub Pages — Theme by mattgraham

Visit the Tools Reference main page
Visit the TSV Utilities main page

tsv-summarize reference

Synopsis: tsv-summarize [options] file [fileā€¦]

tsv-summarize generates summary statistics on fields of a TSV file. A variety of statistics are supported. Calculations can run against the entire data stream or grouped by key. Consider the file data.tsv:

Make    Color   Time
ford    blue    131
chevy   green   124
ford    red     128
bmw     black   118
bmw     black   126
ford    blue    122

The min and average 'time' values for the 'make' field is generated by the command:

$ tsv-summarize --header --group-by Make --min Time --mean Time data.tsv

This produces:

Make   Time_min Time_mean
ford   122      127
chevy  124      124
bmw    118      122

Using --group-by Make,Color will group by both 'Make' and 'Color'. Omitting the --group-by entirely summarizes fields for the full file.

The examples above specify fields by name. Fields can also be specified by field number, this works for files with and without header lines. For example:

$ tsv-summarize --header --group-by 1 --min 3 --mean 3 data.tsv

See Field syntax for more information about specifying fields.

tsv-summarize tries to generate useful headers, but custom headers can be specified. Example:

$ tsv-summarize --header --group-by 1 --min 3:Fastest --mean 3:Average data.tsv
Make  Fastest  Average
ford  122      127
chevy 124      124
bmw   118      122

Most operators take custom headers in a manner shown above, following the syntax:

--<operator-name> FIELD[:header]

Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Examples:

--median 1,5-8
--median Count,Kilograms
--median '*_seconds'

The quantile operator requires one or more probabilities after the fields:

--quantile Count:0.25          # Quantile 1 of the 'Count' field
--quantile 2-4:0.25,0.5,0.75   # Q1, Median, Q3 of fields 2, 3, 4

Summarization operators available are:

   count       range        mad            values
   retain      sum          var            unique-values
   first       mean         stddev         unique-count
   last        median       mode           missing-count
   min         quantile     mode-count     not-missing-count
   max

Calculated numeric values are printed to 12 significant digits by default. This can be changed using the --p|float-precision option. If six or less it sets the number of significant digits after the decimal point. If greater than six it sets the total number of significant digits.

Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled.

Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found.

Missing values are not treated specially by default, this can be changed using the --x|exclude-missing or --r|replace-missing option. The former turns off processing for missing values, the latter uses a replacement value.

Options:

Operators:

Tip: Bash completion is very helpful when using commands like tsv-summarize that have many options. See Enable bash-completion for details.