Unix for Poets
A lecture by Kenneth Church introducing unix text wrangling.
Assigned in cm3060 Topic 03: Language Modelling, but also relevant to cm3060 Topic 02: Basic Processing
Introduces the following tools through use: `grep`, `sort`, `uniq -c` (count duplicates), `tr` (translate characters), `wc` (word count), `sed` (stream editor), `awk`, `cut`, `paste`, `comm`, and `join`.
(In this note SI means standard input, SO standard output.)
Selected Tool Summary
grep
print lines that match patterns
USAGE:
basic: `grep [OPTION...] PATTERNS [FILE...]`
explicit that `-e` precedes patterns: `grep [OPTION...] -e PATTERNS [FILE...]`
read patterns from file: `grep [OPTION...] -f PATTERN-FILE [FILE...]`
FULL DESCRIPTION:
`grep` searches for PATTERNS in each FILE. PATTERNS is one or more patterns separated by newline characters, and grep prints each line that matches a pattern. Typically PATTERNS should be quoted when grep is used in a shell command.
A FILE of “-” stands for standard input. If no FILE is given, recursive searches examine the working directory, and nonrecursive searches read standard input.
OPTIONS: `-f FILE` obtain patterns from FILE, one per line; `-i` ignore case; `-v` invert match, select non-matching lines; `-w` select only matches that form whole words; `-x` select only matches that form whole lines; `-r` recursively read all files under each directory.
OUTPUT OPTIONS: `-c` suppress normal output and print a count of matching lines instead; `--color[=WHEN]` surround matched strings with escape sequences for colouring in terminal output (set WHEN to `never` to disable); `-l` print the names of files that matched; `-n` prefix each line of output with its line number in the input file.
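A minimal sketch of these options in action (the sample file and its contents are invented for illustration):

```shell
# Build a small sample file (hypothetical content)
printf 'The cat sat\non the mat\nDogs bark\n' > sample.txt

# -c with -i: count lines containing "the", ignoring case
grep -c -i 'the' sample.txt

# -v with -n: print non-matching lines, prefixed with their line numbers
grep -v -n 'the' sample.txt
```

Note that `-v` here is case-sensitive, so the line starting with "The" survives the filter.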
paste
merges lines of files
USAGE: paste [OPTION]... [FILE]...
Writes lines consisting of sequentially corresponding lines from each FILE, separated by TABs or by delimiters given as a list with `-d`.
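A small sketch (file names and contents invented):

```shell
printf 'a\nb\nc\n' > col1.txt
printf '1\n2\n3\n' > col2.txt

# Default delimiter is a TAB
paste col1.txt col2.txt

# -d substitutes delimiters from the given list
paste -d ',' col1.txt col2.txt
```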
sed
stream editor for filtering and transforming text
USAGE: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
A stream editor that makes one pass over the input(s), performing basic text transformations. Used especially for filtering text in a pipeline.
OPTIONS: `-f` add the contents of a script file to the commands to be executed; `-i` edit files in place; `-s` consider files as separate, not as one long stream.
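The most common sed use in a pipeline is the substitute command, `s/PATTERN/REPLACEMENT/` (this example is a sketch, not from the lecture):

```shell
# Without the trailing g, only the first match on each line is replaced
echo 'colour and colour' | sed 's/colour/color/'

# With g, every match on the line is replaced
echo 'colour and colour' | sed 's/colour/color/g'
```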
sort
Sort lines of text files
USAGE: sort [OPTION]... [FILE]...
; or sort [OPTION]... --files0-from=F
Writes the sorted concatenation of all FILE(s) to SO. With no FILE, reads SI.
OPTIONS: `-b` ignore leading blanks; `-d` dictionary order; `-f` ignore case; `-i` consider only printable chars; `-n` numeric order; `-M` month sort; `-r` reverse the result of comparisons.
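A quick illustration of why `-n` matters (input invented):

```shell
# Lexicographic order puts 10 before 2
printf '10\n2\n33\n' | sort

# -n sorts numerically; adding -r reverses
printf '10\n2\n33\n' | sort -n
printf '10\n2\n33\n' | sort -nr
```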
tr
Translate, squeeze and/or delete characters from SI, writing to SO
USAGE: tr [OPTION]... SET1 [SET2]
SETS: Specified as strings of characters. Most represent themselves. Usual interpretations of ‘\n’ ‘\t’ ‘\\’ ‘a-z’ etc.
CHARACTER CLASSES: `[:alnum:]` all alphanumeric characters; `[:alpha:]` all letters; `[:digit:]` all digits; `[:lower:]` all lowercase letters; `[:punct:]` all punctuation; `[:space:]` all horizontal or vertical whitespace.
OPTIONS: `-c` use the complement of SET1; `-d` delete characters in SET1, don't translate; `-s` replace a sequence of a repeated character with a single occurrence; `-t` first truncate SET1 to the length of SET2.
EXAMPLES:
Tokenization: Replace everything that is not an alpha with a line feed (creates a set of tokens on new lines): tr -sc 'A-Za-z' '\012' < myText.txt
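The character classes combine naturally with `-d`; for instance (a made-up one-liner, not from the lecture):

```shell
# Strip punctuation, then fold to lowercase
echo 'Hello, World!' | tr -d '[:punct:]' | tr 'A-Z' 'a-z'
```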
uniq
Report or omit repeated lines
USAGE: uniq [OPTION]... [INPUT [OUTPUT]]
Filters adjacent matching lines from INPUT (or SI), writing to OUTPUT (or SO). By default matching lines are merged to the first occurrence.
Only detects repeated lines if they are adjacent, so sort first!
OPTIONS: `-c` prefix output lines with counts; `-d` only print duplicate lines, one for each group; `-i` ignore case; `-u` only print unique lines; `-s N` avoid comparing the first N characters.
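Because uniq only merges adjacent duplicates, it is almost always preceded by sort (sample input invented):

```shell
# Without sort, the repeated "a" lines are not adjacent and would not merge;
# sorted first, uniq -c yields one counted line per distinct value
printf 'b\na\nb\na\na\n' | sort | uniq -c
```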
Examples of Pipelines
Word frequency
Output list of words with frequency counts:
tr -sc 'A-Za-z' '\012' < myFile.txt | sort | uniq -c
Or to merge the case counts:
tr 'a-z' 'A-Z' < myFile.txt | tr -sc 'A-Z' '\012' | sort | uniq -c
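Piping the counts through a numeric reverse sort gives a most-frequent-first list (the extra `sort -nr | head` stage is a common extension; `myFile.txt` is a placeholder):

```shell
# Create a placeholder input file for demonstration
printf 'the cat and the dog and the bird\n' > myFile.txt

# Tokenize, count, then rank by descending frequency
tr -sc 'A-Za-z' '\012' < myFile.txt | sort | uniq -c | sort -nr | head
```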
Bigrams
tokenize by word, print \(word_i\) and \(word_{i+1}\) on the same line, count them:
tr -sc 'A-Za-z' '\012' < myFile.txt > myFile.words
tail -n +2 myFile.words > myFile.nextwords
paste myFile.words myFile.nextwords | sort | uniq -c > myFile.bigrams
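The same offset-and-paste trick extends to trigrams by adding a second shifted copy (a sketch under the same placeholder filenames):

```shell
# Placeholder input for demonstration
printf 'a b c d\n' > myFile.txt

tr -sc 'A-Za-z' '\012' < myFile.txt > myFile.words
tail -n +2 myFile.words > myFile.nextwords    # words shifted by one
tail -n +3 myFile.words > myFile.nextwords2   # words shifted by two
paste myFile.words myFile.nextwords myFile.nextwords2 | sort | uniq -c > myFile.trigrams
```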
grep examples
| example | explanation |
|---|---|
| `grep gh` | find lines containing "gh" |
| `grep '^con'` | find lines beginning with "con" |
| `grep 'ing$'` | find lines ending with "ing" |
| `grep -v gh` | delete lines containing "gh" |
| `grep -v '^con'` | delete lines beginning with "con" |
| `grep -v 'ing$'` | delete lines ending with "ing" |
| `grep '[A-Z]'` | lines with an uppercase char |
| `grep '^[A-Z]'` | lines starting with an uppercase char |
| `grep '[A-Z]$'` | lines ending with an uppercase char |
| `grep '^[A-Z]*$'` | lines with all uppercase chars |
| `grep '[aeiouAEIOU]'` | lines with a vowel |
| `grep '^[aeiouAEIOU]'` | lines starting with a vowel |
| `grep '[aeiouAEIOU]$'` | lines ending with a vowel |
| `grep -i '[aeiou]'` | ditto |
| `grep -i '^[aeiou]'` | ditto |
| `grep -i '[aeiou]$'` | ditto |
| `grep -i '^[^aeiou]'` | lines starting with a non-vowel |
| `grep -i '[^aeiou]$'` | lines ending with a non-vowel |
| `grep -i '[aeiou].*[aeiou]'` | lines with two or more vowels |
| `grep -i '^[^aeiou]*[aeiou][^aeiou]*$'` | lines with exactly one vowel |
The remaining examples go deep on AWK, which may be covered in a later note.