Alex's Notes

Codd: A Relational Model of Data for Large Shared Data Banks (1970)

Metadata

  • Title: A relational model of data for large shared data banks

  • Authors: Codd, E. F.

  • Publication Year: 1970

  • Journal: Communications of the ACM

Abstract

Future users of large data banks must be protected from having to know how the data is organized in the machine (the internal representation). A prompting service which supplies such information is not a satisfactory solution. Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. Changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.

Existing noninferential, formatted data systems provide users with tree-structured files or slightly more general network models of the data. In Section 1, inadequacies of these models are discussed. A model based on n-ary relations, a normal form for data base relations, and the concept of a universal data sublanguage are introduced. In Section 2, certain operations on relations (other than logical inference) are discussed and applied to the problems of redundancy and consistency in the user’s model.

Key Points

Codd’s motivation is twofold. The first goal is data independence: application programs and terminal activities should be unaffected by growth in data types and by changes in data representation. The second is data consistency: avoiding the inconsistencies that arise when the same facts are stored redundantly. Earlier work applying relations to data had focused on inferential systems, which are not his main concern here.

At the time, three kinds of data dependence still needed to be removed: ordering dependence, indexing dependence, and access path dependence. The prevailing data models were tree-structured files and network models, so these are his targets.

Ordering dependence - systems at the time did not cleanly separate the order in which data is presented from the order in which it is stored. An application that relied on the stored ordering could break if that ordering changed.

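Indexing dependence - an index is a performance aid, not part of the information itself. Can applications remain functional as indices come and go? Those that refer to specific indices do not.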

Access path dependence - many systems at the time provided tree-structured files or network models, and the user had to exploit a collection of access paths to reach the information of interest. If the structure changed, so did the paths, and the application would break.
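
To make access path dependence concrete, here is a minimal sketch in Python. The part and project names, and every function name, are hypothetical and invented for these notes. The same facts are stored under two tree structures and then as a flat set of tuples; a function written against one tree fails on the other, while the function over tuples names no path at all.

```python
# Hypothetical data: the same facts stored under two tree structures.
# Structure A: projects are subordinate to parts.
parts_tree = {
    "bolt": {"projects": ["alpha", "beta"]},
    "nut": {"projects": ["beta"]},
}

# Structure B: parts are subordinate to projects.
projects_tree = {
    "alpha": {"parts": ["bolt"]},
    "beta": {"parts": ["bolt", "nut"]},
}


def projects_using_part_tree_a(tree, part):
    """Written against structure A: follows the path part -> projects."""
    return sorted(tree[part]["projects"])


# The same facts as a flat relation: a set of (part, project) tuples.
commitments = {("bolt", "alpha"), ("bolt", "beta"), ("nut", "beta")}


def projects_using_part(relation, part):
    """Names no access path; the query depends only on the values."""
    return sorted(proj for (p, proj) in relation if p == part)
```

Calling projects_using_part_tree_a(projects_tree, "bolt") raises KeyError because the path it assumes no longer exists, whereas projects_using_part keeps working however the tuples happen to be stored.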

Relations

The term relation is used here in its accepted mathematical sense. Given sets \(S_1, S_2, \dots , S_n\) (not necessarily distinct), R is a relation on these n sets if it is a set of n-tuples each of which has its first element from \(S_1\), its second element from \(S_2\) , and so on. We shall refer to \(S_j\) as the jth domain of R. As defined above, R is said to have degree n. Relations of degree 1 are often called unary, degree 2 binary, degree 3 ternary, and degree n n-ary.
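
The definition above can be sketched directly in Python, where a relation is a set of tuples. The domains and values below are hypothetical, and is_relation_on is a helper name made up for these notes. The check amounts to saying that R is a subset of the Cartesian product of its domains.

```python
from itertools import product

# Hypothetical domains S_1, S_2, S_3.
S1 = {"bolt", "nut"}      # part names
S2 = {"alpha", "beta"}    # project names
S3 = {1, 2, 3}            # quantities

# A relation on S1, S2, S3: a set of 3-tuples, so its degree is 3.
R = {("bolt", "alpha", 2), ("nut", "beta", 1)}


def is_relation_on(r, domains):
    """True if every tuple in r takes its j-th element from the j-th domain."""
    return all(
        len(t) == len(domains) and all(v in d for v, d in zip(t, domains))
        for t in r
    )


domains = [S1, S2, S3]
degree = len(domains)

# Equivalent view: R is a subset of the Cartesian product S1 x S2 x S3.
cartesian = set(product(S1, S2, S3))
is_subset = R <= cartesian
```

Because R is a set, duplicate tuples cannot occur and the tuples carry no ordering among themselves; only the position of values within a tuple is significant.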

Relationships are the domain-unordered counterparts of relations. Users should not have to remember the column sequence of a large table, so each column is identified by name instead of by position: by its domain name, qualified with a role name where the same domain appears more than once. Column order then no longer matters.
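
A rough sketch of the difference, in Python. Representing a row as a frozenset of (name, value) pairs is my own choice for illustration, and the row helper and column names are hypothetical. The names sub_part and sup_part play the part of role names: both columns draw from the same domain of parts, so a qualifier is needed to tell them apart.

```python
def row(**columns):
    """One element of a relationship: a set of (column name, value) pairs."""
    return frozenset(columns.items())


# Positional tuples: swapping column order gives a different relation.
by_position_1 = {("bolt", "engine", 4)}
by_position_2 = {(4, "engine", "bolt")}
positional_equal = by_position_1 == by_position_2    # False

# Named columns: the order in which columns are written is irrelevant.
by_name_1 = {row(sub_part="bolt", sup_part="engine", quantity=4)}
by_name_2 = {row(quantity=4, sup_part="engine", sub_part="bolt")}
named_equal = by_name_1 == by_name_2                 # True


def value_of(r, name):
    """Look a value up by column name instead of by position."""
    return dict(r)[name]
```

With this representation a user asks for value_of(r, "sup_part") and never needs to know which column came first.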

Normally one domain or combination of domains of a given relation has values that uniquely identify each element of that relation. Such a domain (or combination) is called a primary key. A primary key is nonredundant if it is either a simple domain or a combination such that none of the participating simple domains can be excluded while retaining unique identification.

There may be more than one nonredundant primary key, in which case one is chosen arbitrarily and called the primary key of the relation.
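
A minimal sketch of the nonredundancy test, in Python. The employee data and the function names are hypothetical. One caveat: the code only checks the rows it is given, whereas a key is meant to hold for every state the relation may take, so a result here is evidence for a key and not a proof of one. The rows are assumed to be distinct, as they would be in a relation.

```python
from itertools import combinations


def is_key(rows, columns):
    """True if the values in `columns` uniquely identify each row."""
    seen = {tuple(r[c] for c in columns) for r in rows}
    return len(seen) == len(rows)


def nonredundant_keys(rows, all_columns):
    """Column combinations that identify rows and have no removable column."""
    keys = []
    # Try small combinations first so that minimal keys are found first.
    for size in range(1, len(all_columns) + 1):
        for combo in combinations(all_columns, size):
            # A superset of a key already found is a redundant key: skip it.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if is_key(rows, combo):
                keys.append(combo)
    return keys


# Hypothetical relation with two single-column keys.
employees = [
    {"emp_no": 1, "badge": "A7", "dept": "tools"},
    {"emp_no": 2, "badge": "B3", "dept": "tools"},
    {"emp_no": 3, "badge": "C9", "dept": "paint"},
]

candidates = nonredundant_keys(employees, ("emp_no", "badge", "dept"))
# candidates == [("emp_no",), ("badge",)]
primary_key = candidates[0]   # an arbitrary choice among the candidates
```

Here ("emp_no", "dept") also identifies every row, but dept can be dropped without losing uniqueness, so that combination is redundant and is not reported.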