eBay's TSV Utilities

Command line tools for large tabular data files.

Project maintained by eBay Hosted on GitHub Pages — Theme by mattgraham

Visit the Tools Reference main page
Visit the TSV Utilities main page

tsv-summarize reference

Synopsis: tsv-summarize [options] file [file…]

tsv-summarize generates summary statistics on fields of a TSV file. A variety of statistics are supported. Calculations can run against the entire data stream or grouped by key. Consider the file data.tsv:

Make    Color   Time
ford    blue    131
chevy   green   124
ford    red     128
bmw     black   118
bmw     black   126
ford    blue    122

The min and average 'time' values for the 'make' field is generated by the command:

$ tsv-summarize --header --group-by Make --min Time --mean Time data.tsv

This produces:

Make   Time_min Time_mean
ford   122      127
chevy  124      124
bmw    118      122

Using --group-by Make,Color will group by both 'Make' and 'Color'. Omitting the --group-by entirely summarizes fields for the full file.

The examples above specify fields by name. Fields can also be specified by field number, this works for files with and without header lines. For example:

$ tsv-summarize --header --group-by 1 --min 3 --mean 3 data.tsv

See Field syntax for more information about specifying fields.

tsv-summarize tries to generate useful headers, but custom headers can be specified. Example:

$ tsv-summarize --header --group-by 1 --min 3:Fastest --mean 3:Average data.tsv
Make  Fastest  Average
ford  122      127
chevy 124      124
bmw   118      122

Most operators take custom headers in a manner shown above, following the syntax:

--<operator-name> FIELD[:header]

Operators can be specified multiple times. They can also take multiple fields (though not when a custom header is specified). Examples:

--median 1,5-8
--median Count,Kilograms
--median '*_seconds'

The quantile operator requires one or more probabilities after the fields:

--quantile Count:0.25          # Quantile 1 of the 'Count' field
--quantile 2-4:0.25,0.5,0.75   # Q1, Median, Q3 of fields 2, 3, 4

Summarization operators available are:

   count       range        mad            values
   retain      sum          var            unique-values
   first       mean         stddev         unique-count
   last        median       mode           missing-count
   min         quantile     mode-count     not-missing-count
   max

Calculated numeric values are printed to 12 significant digits by default. This can be changed using the --p|float-precision option. If six or less it sets the number of significant digits after the decimal point. If greater than six it sets the total number of significant digits.

Calculations hold onto the minimum data needed while reading data. A few operations like median keep all data values in memory. These operations will start to encounter performance issues as available memory becomes scarce. The size that can be handled effectively is machine dependent, but often quite large files can be handled.

Operations requiring numeric entries will signal an error and terminate processing if a non-numeric entry is found.

Missing values are not treated specially by default, this can be changed using the --x|exclude-missing or --r|replace-missing option. The former turns off processing for missing values, the latter uses a replacement value.

Options:

--h|help - Print help.
--help-verbose - Print detailed help.
--help-fields - Print help on specifying fields.
--V|version - Print version information and exit.
--g|group-by <field-list> - Fields to use as key.
--H|header - Treat the first line of each file as a header.
--w|write-header - Write an output header even if there is no input header.
--d|delimiter CHR - Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)
--v|values-delimiter CHR - Values delimiter. Default: vertical bar (|). (Single byte UTF-8 characters only.)
--p|float-precision NUM - 'Precision' to use printing floating point numbers. Affects the number of digits printed and exponent use. Default: 12
--x|exclude-missing - Exclude missing (empty) fields from calculations.
--r|replace-missing STR - Replace missing (empty) fields with STR in calculations.

Operators:

--count - Count occurrences of each unique key (--g|group-by), or the total number of records if no key field is specified.
--count-header STR - Count occurrences of each unique key, like --count, but use STR as the header.
--retain <field-list> - Retain one copy of the field. The field header is unchanged.
--first <field-list>[:STR] - First value seen.
--last <field-list>[:STR]- Last value seen.
--min <field-list>[:STR] - Min value. (Numeric fields only.)
--max <field-list>[:STR] - Max value. (Numeric fields only.)
--range <field-list>[:STR] - Difference between min and max values. (Numeric fields only.)
--sum <field-list>[:STR] - Sum of the values. (Numeric fields only.)
--mean <field-list>[:STR] - Mean (average). (Numeric fields only.)
--median <field-list>[:STR] - Median value. (Numeric fields only. Reads all values into memory.)
--quantile <field-list>:p[,p...][:STR] - Quantiles. One or more fields, then one or more 0.0-1.0 probabilities. (Numeric fields only. Reads all values into memory.)
--mad <field-list>[:STR] - Median absolute deviation from the median. Raw value, not scaled. (Numeric fields only. Reads all values into memory.)
--var <field-list>[:STR] - Variance. (Sample variance, numeric fields only).
--stdev <field-list>[:STR] - Standard deviation. (Sample st.dev, numeric fields only).
--mode <field-list>[:STR] - Mode. The most frequent value. (Reads all unique values into memory.)
--mode-count <field-list>[:STR] - Count of the most frequent value. (Reads all unique values into memory.)
--unique-count <field-list>[:STR] - Number of unique values. (Reads all unique values into memory).
--missing-count <field-list>[:STR] - Number of missing (empty) fields. Not affected by the --x|exclude-missing or --r|replace-missing options.
--not-missing-count <field-list>[:STR] - Number of filled (non-empty) fields. Not affected by --r|replace-missing.
--values <field-list>[:STR] - All the values, separated by --v|values-delimiter. (Reads all values into memory.)
--unique-values <field-list>[:STR] - All the unique values, separated by --v|values-delimiter. (Reads all unique values into memory.)

Tip: Bash completion is very helpful when using commands like tsv-summarize that have many options. See Enable bash-completion for details.