eBay's TSV Utilities

Command line tools for large tabular data files.

Project maintained by eBay Hosted on GitHub Pages — Theme by mattgraham

Visit the Tools Reference main page
Visit the TSV Utilities main page

tsv-sample reference

Synopsis: tsv-sample [options] [fileā€¦]

tsv-sample subsamples input lines or randomizes their order. Several techniques are available: shuffling, simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling. These are provided via several different modes operation:

Sample size: The --n|num option controls the sample size for all sampling methods. In the case of simple and weighted random sampling it also limits the amount of memory required.

Performance and memory use: tsv-sample is designed for large data sets. Algorithms make one pass over the data, using reservoir sampling and hashing when possible to limit the memory required. Bernoulli sampling and distinct sampling make immediate decisions on each line, with no memory accumulation. They can operate on arbitrary length data streams. Sampling with replacement reads all lines into memory and is limited by available memory. Shuffling also reads all lines into memory and is similarly limited. Simple and weighted random sampling use reservoir sampling algorithms and only need to hold the sample size (--n|num) in memory. See Shuffling large files for ways to use disk when available memory is not sufficient.

Controlling randomization: Each run produces a different randomization. Using --s|static-seed changes this so multiple runs produce the same randomization. This works by using the same random seed each run. The random seed can be specified using --v|seed-value. This takes a non-zero, 32-bit positive integer. A zero value is a no-op and ignored.

Weighted sampling: Weighted line order randomization is done using an algorithm for weighted reservoir sampling described by Pavlos Efraimidis and Paul Spirakis. Weights should be positive values representing the relative weight of the entry in the collection. Counts and similar can be used as weights, it is not necessary to normalize to a [0,1] interval. Negative values are not meaningful and given the value zero. Input order is not retained, instead lines are output ordered by the randomized weight that was assigned. This means that a smaller valid sample can be produced by taking the first N lines of output. For more information see:

Distinct sampling: Distinct sampling selects a subset based on a key in data. Consider a query log with records consisting of <user, query, clicked-url> triples. Distinct sampling selects all records matching a subset of values from one of the fields. For example, all events for ten percent of the users. This is important for certain types of analysis. Distinct sampling works by converting the specified probability (--p|prob) into a set of buckets and mapping every key into one of the buckets. One bucket is used to select records in the sample. Buckets are equal size and therefore may be a bit larger than the inclusion probability. Since every key is assigned a bucket, this method can also be used to fully divide a set of records into distinct groups. (See Printing random values below.) The term "distinct sampling" originates from algorithms estimating the number of distinct elements in extremely large data sets.

Printing random values: Most of these algorithms work by generating a random value for each line. (See also "Compatibility mode" below.) The nature of these values depends on the sampling algorithm. They are used for both line selection and output ordering. The --print-random option can be used to print these values. The random value is prepended to the line separated by the --d|delimiter char (TAB by default). The --gen-random-inorder option takes this one step further, generating random values for all input lines without changing the input order. The types of values currently used are specific to the sampling algorithm:

The specifics behind these random values are subject to change in future releases.

Compatibility mode: As described above, many of the sampling algorithms assign a random value to each line. This is useful when printing random values. It has another occasionally useful property: repeated runs with the same static seed but different selection parameters are more compatible with each other, as each line gets assigned the same random value on every run. This property comes at a cost: in some cases there are faster algorithms that don't assign random values to each line. By default, tsv-sample will use the fastest algorithm available. The --compatibility-mode option changes this, switching to algorithms that assign a random value per line. Printing random values also engages compatibility mode. Compatibility mode is beneficial primarily when using Bernoulli sampling or random sampling: