Codd: Normalized Data Base Structure: A Brief Tutorial (1971)
Metadata
Title: Normalized Data Base Structure: A Brief Tutorial
Authors: Codd, E. F.
Publication Year: 1971
Journal: ACM SIGFIDET
Abstract
Casual and other users of large formatted data bases need a simple tabular (relational) view of the data rather than a network or tree-structured view. This paper illustrates the removal of repeating groups, hierarchic and plex structures, and cross-referencing structures. Finally, the simplification of data base relations by normalization is discussed.
Key Points
The paper starts by rehearsing the arguments for tables as a logical structure for data and specifies the properties of a table:
1. Column-homogeneous: in any selected column the items are of the same kind, but items in different columns need not be of the same kind.
2. Scalar values: each item is a simple number or character string, not a set or repeating group.
3. All rows are distinct (no duplicate rows).
4. The ordering of rows is immaterial.
5. The columns of a table have distinct names, and the ordering of columns is immaterial.
As a result of properties 3 and 4, the table is a relation. With n columns, the table is a relation of degree n.
When a formatted data base is viewed as a collection of time-varying relations of assorted degrees, we shall speak of this view as the relational model. (p. 4)
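These properties can be sketched in a few lines of Python (an illustration of mine, not from the paper, using an invented EMPLOYEE relation): a relation is essentially a set of tuples, so distinctness of rows and immateriality of row order fall out for free.

```python
# Hypothetical EMPLOYEE relation of degree 3: (emp_no, name, dept).
# Modeling it as a set of tuples makes the properties above concrete:
employee = {
    (1, "Smith", "Sales"),
    (2, "Jones", "Research"),
    (2, "Jones", "Research"),  # duplicate row -- the set silently discards it
}

assert len(employee) == 2             # property 3: rows are distinct
assert employee == {(2, "Jones", "Research"), (1, "Smith", "Sales")}  # property 4: order immaterial

degree = len(next(iter(employee)))    # degree n = number of columns
assert degree == 3
```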
The paper walks through how to refashion tree, plex, or cross-referenced structures into tabular form.
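As a rough sketch of that refashioning (my own toy example, not the paper's), a two-level tree of departments owning employees can be flattened into two relations, with the child rows carrying the parent's primary key as an ordinary column:

```python
# Tree view: each department owns a repeating group of (name, salary) pairs.
tree = {
    "Sales":    [("Smith", 100), ("Adams", 120)],
    "Research": [("Jones", 200)],
}

# Tabular view: one relation per node type; the employee rows reference
# their parent via the department name, removing the hierarchic linkage.
dept = {(d,) for d in tree}
emp = {(name, salary, d) for d, staff in tree.items() for name, salary in staff}

assert ("Smith", 100, "Sales") in emp
assert len(dept) == 2 and len(emp) == 3
```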
Normalization discussion starts on p. 11
Normalization is a step-by-step reversible process of replacing a given collection of relations by successive collections in which the relations have a progressively simpler and more regular structure. The simplifying process is based on non-statistical criteria. The reversibility guarantees that we can recover the original collection of relations and therefore have not lost any information.
The objectives of normalization are listed: to make any relation in the data base tabulable, to provide a simpler collection of operations for powerful retrieval, to get rid of undesirable operational (insertion, deletion, update) dependencies, to reduce the need for restructuring the data base, to make the model more informative, and to make the collection of relations neutral to query statistics.
First Normal Form
The first step in normalization is to refactor any compound columns so that every item is a scalar value.
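A minimal sketch of this step with invented data: a repeating-group column is replaced by one row per group member, so every item becomes a scalar.

```python
# Unnormalized: the second column is a repeating group, not a scalar.
unnormalized = [
    ("Smith", ["typing", "filing"]),   # (name, jobhistory)
    ("Jones", ["welding"]),
]

# First normal form: flatten the repeating group into extra rows.
first_nf = {(name, job) for name, jobs in unnormalized for job in jobs}

# Every item is now a simple value; the row count grows accordingly.
assert first_nf == {("Smith", "typing"), ("Smith", "filing"), ("Jones", "welding")}
```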
Second Normal Form
In the second step we look for any attributes that depend on only a sub-component of the primary key. If such attributes are present, we get several irregularities:
Insertion anomaly: Let’s say an item x depends only on one attribute \(a_1\), but the primary key also includes \(a_2\). Now we can only insert a new x if we have an \(a_2\).
Deletion anomaly: If only one kind of an item remains attached to a key, deletion of the key causes deletion of the last remaining information on that item.
Update anomaly: If a change is made to the value of an attribute, the number of copies to be updated depends on the number of times it’s used.
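The decomposition that removes these anomalies can be sketched as follows (an invented SUPPLY relation of mine, not the paper's example): with compound key (supplier, part), the supplier's city depends on supplier alone, so it is projected out into its own relation.

```python
# (supplier, part, qty, supplier_city); key is (supplier, part),
# but supplier_city depends on supplier alone -- a partial dependency.
supply = {
    ("S1", "P1", 100, "London"),
    ("S1", "P2", 200, "London"),   # city repeated -> update anomaly
    ("S2", "P1", 300, "Paris"),
}

# Second normal form: split the partially dependent attribute out.
supplier = {(s, city) for s, _, _, city in supply}
supply_2nf = {(s, p, qty) for s, p, qty, _ in supply}

assert supplier == {("S1", "London"), ("S2", "Paris")}
# Insertion anomaly gone: a supplier's city can be recorded before any shipment.
supplier.add(("S3", "Rome"))
assert len(supply_2nf) == 3
```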
Third Normal Form
Finally, we look for transitive dependencies (a non-key attribute that depends on the key only through another non-key attribute), which give rise to similar anomalies.
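A sketch of this last step, again with invented data: in EMPLOYEE, employee determines department and department determines location, so location depends on the key only transitively. Splitting on department removes the anomalies, and joining the two projections back together recovers the original relation, illustrating the reversibility noted above.

```python
# (emp, dept, dept_location): emp -> dept and dept -> location,
# so location depends on the key transitively.
employee = {
    ("E1", "Sales", "London"),
    ("E2", "Sales", "London"),
    ("E3", "Research", "Paris"),
}

# Third normal form: project the transitive dependency into its own relation.
emp_dept = {(e, d) for e, d, _ in employee}
dept_loc = {(d, loc) for _, d, loc in employee}

assert dept_loc == {("Sales", "London"), ("Research", "Paris")}

# Reversibility: the natural join of the projections recovers the original.
rejoined = {(e, d, loc) for e, d in emp_dept for d2, loc in dept_loc if d == d2}
assert rejoined == employee
```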