Unix for Poets
A lecture by Kenneth Church introducing unix text wrangling.
Assigned in cm3060 Topic 03: Language Modelling, but also relevant to cm3060 Topic 02: Basic Processing
Introduces the following tools through use: `grep`, `sort`, `uniq -c` (count duplicates), `tr` (translate characters), `wc` (word count), `sed` (stream editor), `awk`, `cut`, `paste`, `comm`, and `join`.
(In this note SI means standard input, SO standard output.)
Selected Tool Summary
grep
print lines that match patterns
USAGE:
basic: `grep [OPTION...] PATTERNS [FILE...]`
explicit that `-e` precedes patterns: `grep [OPTION...] -e PATTERNS [FILE...]`
read patterns from file: `grep [OPTION...] -f PATTERN-FILE [FILE...]`
FULL DESCRIPTION:
`grep` searches for PATTERNS in each FILE. PATTERNS is one or more patterns separated by newline characters, and grep prints each line that matches a pattern. Typically PATTERNS should be quoted when grep is used in a shell command.
A FILE of “-” stands for standard input. If no FILE is given, recursive searches examine the working directory, and nonrecursive searches read standard input.
OPTIONS: `-f FILE` obtain patterns from FILE, one per line; `-i` ignore case; `-v` invert match, select non-matching lines; `-w` select only matches that form whole words; `-x` select only matches that form whole lines; `-r` recursively read all files under each directory.
OUTPUT OPTIONS: `-c` suppress normal output and print a count of matching lines instead; `--color[=WHEN]` surround matched strings with escape sequences for colouring in terminal output (set WHEN to `never` to disable); `-l` print the names of files that matched; `-n` prefix each line of output with its line number in the input file.
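A minimal sketch of these options in action (the sample file and its contents are invented for illustration):

```shell
# Build a small sample file (hypothetical content)
printf 'The cat sat\non the mat\nDogs bark\n' > sample.txt

# -c with -i: count lines containing "the", ignoring case
grep -c -i 'the' sample.txt

# -v with -n: print non-matching lines, prefixed with their line numbers
grep -v -n 'the' sample.txt
```

Note that `-v` here is case-sensitive, so the line starting with "The" survives the filter.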
paste
merges lines of files
USAGE: paste [OPTION]... [FILE]...
Writes lines consisting of sequentially corresponding lines from each FILE, separated by TABs or by delimiters given as a list with `-d`.
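A small sketch (file names and contents invented):

```shell
printf 'a\nb\nc\n' > col1.txt
printf '1\n2\n3\n' > col2.txt

# Default delimiter is a TAB
paste col1.txt col2.txt

# -d substitutes delimiters from the given list
paste -d ',' col1.txt col2.txt
```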
sed
stream editor for filtering and transforming text
USAGE: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
A stream editor that makes one pass over the input(s), performing basic text transformations. Used especially for filtering text in a pipeline.
OPTIONS: `-f` add the contents of a script file to the commands to be executed; `-i` edit files in place; `-s` consider files as separate, not as one long stream.
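The most common sed use in a pipeline is the substitute command, `s/PATTERN/REPLACEMENT/` (this example is a sketch, not from the lecture):

```shell
# Without the trailing g, only the first match on each line is replaced
echo 'colour and colour' | sed 's/colour/color/'

# With g, every match on the line is replaced
echo 'colour and colour' | sed 's/colour/color/g'
```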
sort
Sort lines of text files
USAGE: sort [OPTION]... [FILE]...
; or sort [OPTION]... --files0-from=F
Writes the sorted concatenation of all FILE(s) to SO. With no FILE, reads SI.
OPTIONS: `-b` ignore leading blanks; `-d` dictionary order; `-f` ignore case; `-i` consider only printable chars; `-n` numeric order; `-M` month sort; `-r` reverse the result of comparisons.
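A quick illustration of why `-n` matters (input invented):

```shell
# Lexicographic order puts 10 before 2
printf '10\n2\n33\n' | sort

# -n sorts numerically; adding -r reverses
printf '10\n2\n33\n' | sort -n
printf '10\n2\n33\n' | sort -nr
```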
tr
Translate, squeeze and/or delete characters from SI, writing to SO
USAGE: tr [OPTION]... SET1 [SET2]
SETS: Specified as strings of characters. Most represent themselves. Usual interpretations of ‘\n’ ‘\t’ ‘\\’ ‘a-z’ etc.
CHARACTER CLASSES: `[:alnum:]` all alphanumeric characters; `[:alpha:]` all letters; `[:digit:]` all digits; `[:lower:]` all lowercase letters; `[:punct:]` all punctuation; `[:space:]` all horizontal or vertical whitespace.
OPTIONS: `-c` use the complement of SET1; `-d` delete characters in SET1, don't translate; `-s` replace a sequence of a repeated character with a single occurrence; `-t` first truncate SET1 to the length of SET2.
EXAMPLES:
Tokenization: Replace everything that is not an alpha with a line feed (creates a set of tokens on new lines): tr -sc 'A-Za-z' '\012' < myText.txt
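The character classes combine naturally with `-d`; for instance (a made-up one-liner, not from the lecture):

```shell
# Strip punctuation, then fold to lowercase
echo 'Hello, World!' | tr -d '[:punct:]' | tr 'A-Z' 'a-z'
```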
uniq
Report or omit repeated lines
USAGE: uniq [OPTION]... [INPUT [OUTPUT]]
Filters adjacent matching lines from INPUT (or SI), writing to OUTPUT (or SO). By default matching lines are merged to the first occurrence.
Only detects repeated lines if they are adjacent, so sort first!
OPTIONS: `-c` prefix output lines with counts; `-d` only print duplicate lines, one for each group; `-i` ignore case; `-u` only print unique lines; `-s N` avoid comparing the first N characters.
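Because uniq only merges adjacent duplicates, it is almost always preceded by sort (sample input invented):

```shell
# Without sort, the repeated "a" lines are not adjacent and would not merge;
# sorted first, uniq -c yields one counted line per distinct value
printf 'b\na\nb\na\na\n' | sort | uniq -c
```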
Examples of Pipelines
Word frequency
Output list of words with frequency counts:
tr -sc 'A-Za-z' '\012' < myFile.txt | sort | uniq -c
Or to merge the case counts:
tr 'a-z' 'A-Z' < myFile.txt | tr -sc 'A-Z' '\012' | sort | uniq -c
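Piping the counts through a numeric reverse sort gives a most-frequent-first list (the extra `sort -nr | head` stage is a common extension; `myFile.txt` is a placeholder):

```shell
# Create a placeholder input file for demonstration
printf 'the cat and the dog and the bird\n' > myFile.txt

# Tokenize, count, then rank by descending frequency
tr -sc 'A-Za-z' '\012' < myFile.txt | sort | uniq -c | sort -nr | head
```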
Bigrams
tokenize by word, print \(word_i\) and \(word_{i+1}\) on the same line, count them:
tr -sc 'A-Za-z' '\012' < myFile.txt > myFile.words
tail -n +2 myFile.words > myFile.nextwords
paste myFile.words myFile.nextwords | sort | uniq -c > myFile.bigrams
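The same offset-and-paste trick extends to trigrams by adding a second shifted copy (a sketch under the same placeholder filenames):

```shell
# Placeholder input for demonstration
printf 'a b c d\n' > myFile.txt

tr -sc 'A-Za-z' '\012' < myFile.txt > myFile.words
tail -n +2 myFile.words > myFile.nextwords    # words shifted by one
tail -n +3 myFile.words > myFile.nextwords2   # words shifted by two
paste myFile.words myFile.nextwords myFile.nextwords2 | sort | uniq -c > myFile.trigrams
```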
grep examples
| example | explanation |
|---|---|
| `grep gh` | find lines containing "gh" |
| `grep '^con'` | find lines beginning with "con" |
| `grep 'ing$'` | find lines ending with "ing" |
| `grep -v gh` | delete lines containing "gh" |
| `grep -v '^con'` | delete lines beginning with "con" |
| `grep -v 'ing$'` | delete lines ending with "ing" |
| `grep '[A-Z]'` | lines with an uppercase char |
| `grep '^[A-Z]'` | lines starting with an uppercase char |
| `grep '[A-Z]$'` | lines ending with an uppercase char |
| `grep '^[A-Z]*$'` | lines with all uppercase chars |
| `grep '[aeiouAEIOU]'` | lines with a vowel |
| `grep '^[aeiouAEIOU]'` | lines starting with a vowel |
| `grep '[aeiouAEIOU]$'` | lines ending with a vowel |
| `grep -i '[aeiou]'` | ditto |
| `grep -i '^[aeiou]'` | ditto |
| `grep -i '[aeiou]$'` | ditto |
| `grep -i '^[^aeiou]'` | lines starting with a non-vowel |
| `grep -i '[^aeiou]$'` | lines ending with a non-vowel |
| `grep -i '[aeiou].*[aeiou]'` | lines with two or more vowels |
| `grep -i '^[^aeiou]*[aeiou][^aeiou]*$'` | lines with exactly one vowel |
The remaining examples go deep on AWK, which may be covered in a later note.