CM3010 Topic 06: Distributed Databases and Alternative Models
Main Info
Title: Distributed Databases and Alternative Models
Teachers: David Lewis
Semester Taken: April 2022
Parent Module: cm3010: Databases and Advanced Data Techniques
Description
Distributed SQL, Map/Reduce, Document DBs.
Key Reading
Brewer, E. 'CAP Twelve Years Later: How the "Rules" Have Changed', Computer 45(2), 2012.
Dean, J. and Ghemawat, S. 'MapReduce: A Flexible Data Processing Tool', Communications of the ACM 53(1), 2010.
Li et al. 'Distributed Data Management Using MapReduce', ACM Computing Surveys, 2013.
Lecture Summaries
6.0 Intro
The relational model arose in reaction to complex, sophisticated, bespoke data structures that were hard to use and to police. The relational model was generic, simple, and easy to communicate and reason about. Tables are easy for humans to reason about, and easy to manipulate computationally.
Structures like complex hierarchies and deep recursion are hard to express in tables. These criticisms never took much hold, though: at an industrial level, relational is king.
What has changed is the rise of the web, big data, and the need for distributed data.
The user should see no difference between a distributed and non-distributed system.
Why distribute? To parallelise operations, avoid a single point of failure, and divide a large dataset. Sometimes the data is already partitioned (different territories hold different data) and we want to unite it without building a single database.
This topic covers some of the distribution models for relational DBs, and the limitations of the relational model that motivate alternatives.
6.1 Distributing Relational Databases
The theory of distributing databases dates back before the web, and the original Codd requirements already address distribution. The interesting part of the literature is the tension between reliability guarantees and distribution.
Codd's criteria for the user seeing no difference between distributed and non-distributed systems:
Local Autonomy
Sites should operate independently. One site cannot interfere with another. No site should rely on another for its operation.
No centralisation
No single site controls transactions. No single site failure breaks the system.
Continuous operation
The system is available most of the time. The system is reliable - it fails rarely and recovers quickly.
Location Independence
User should not know where any data might be located.
Data can be moved without changing functionality.
Partition Independence
User need not know how the data is partitioned.
Data may be repartitioned without changing functionality.
Optimisation can still benefit from partition structure.
Replication Independence
Distributed databases often need duplicate copies of data.
The user need not be aware of data replication.
Distributed Queries
Execute queries close to the data. Moving data is more expensive than processing it, so filter first, then move the results.
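The principle can be sketched in Python (the sites, rows, and function names are invented for illustration): each site evaluates the WHERE-style predicate locally, so only matching rows ever cross the network.

```python
# Toy sketch of "filter first, then move": each site applies the
# predicate locally and ships only the matching rows to the coordinator.

site_a = [{"id": 1, "country": "UK", "sales": 100},
          {"id": 2, "country": "FR", "sales": 250}]
site_b = [{"id": 3, "country": "UK", "sales": 300},
          {"id": 4, "country": "DE", "sales": 50}]

def local_filter(rows, predicate):
    """Runs at the site holding the data; only the result travels."""
    return [r for r in rows if predicate(r)]

# The coordinator collects the already-filtered fragments.
uk_sales = local_filter(site_a, lambda r: r["country"] == "UK") + \
           local_filter(site_b, lambda r: r["country"] == "UK")
print(len(uk_sales))  # 2 rows moved instead of 4
```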
X independence
'Distribution makes no difference to the user' implies:
hardware independence
OS independence
network independence
DBMS independence
We should be able to distribute over different DBMS implementations (this one is aspirational).
Core concepts
Partitioning
How will you divide your data?
Also called fragmentation.
Vertical partitioning - take a table and break it into sets of columns (e.g. normalisation is a form of vertical partitioning that reduces duplication).
You might put lesser-used columns in a smaller cluster, or separate read-heavy columns from write-heavy ones.
Horizontal partitioning - very common. Take a million-row table and divide it into ten 100,000-row tables. You can then predict table size.
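As a sketch, hash-based horizontal partitioning (one common scheme; real systems vary, and these names are illustrative) routes each row to a shard by hashing its key, so a lookup by key touches exactly one shard:

```python
# Hash-based horizontal partitioning: rows are spread across shards by
# hashing the key, keeping shards balanced and key lookups local.

NUM_SHARDS = 4
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(key):
    # Deterministic routing: the same key always lands on the same shard.
    return hash(key) % NUM_SHARDS

def insert(row):
    shards[shard_for(row["id"])].append(row)

def find(key):
    # Only the owning shard is searched, not the whole dataset.
    return [r for r in shards[shard_for(key)] if r["id"] == key]

for i in range(1000):
    insert({"id": i, "value": i * i})

print(find(42))  # [{'id': 42, 'value': 1764}]
```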
Catalogue management - information about the data (constraints etc) has to be distributed.
Recovery control is complex in a distributed system. Transactions can be hardened using the 'two-phase commit' protocol, which locks every node involved in an operation and only commits when all nodes confirm. This requires coordination.
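A minimal single-process sketch of two-phase commit (the class and method names are invented for illustration): the coordinator asks every node to prepare (phase 1) and commits only if all vote yes; a single 'no' aborts everywhere (phase 2).

```python
# Two-phase commit sketch: prepare everywhere, then commit or roll back
# everywhere, so no node commits while another aborts.

class Node:
    def __init__(self, name, will_succeed=True):
        self.name, self.will_succeed = name, will_succeed
        self.state = "idle"

    def prepare(self):      # phase 1: lock resources and vote
        self.state = "prepared" if self.will_succeed else "aborted"
        return self.will_succeed

    def commit(self):       # phase 2a: make the change durable
        self.state = "committed"

    def rollback(self):     # phase 2b: undo and release locks
        self.state = "aborted"

def two_phase_commit(nodes):
    if all(n.prepare() for n in nodes):
        for n in nodes:
            n.commit()
        return True
    for n in nodes:
        n.rollback()
    return False

nodes = [Node("a"), Node("b"), Node("c", will_succeed=False)]
print(two_phase_commit(nodes))  # False: one node failed to prepare
```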
Brewer’s conjecture - we have three goals for a distributed database, but they are in tension:
Consistency - weaker than ACID consistency: all parts of the database should converge on a consistent state.
Availability - every request should result in a response eventually.
Partition tolerance - if a network flaw breaks the network into separate subnets, the database should run and recover.
The conjecture is that you can't satisfy all three at once; you must prioritise. That prioritisation might lead you to leave the relational world.
6.2 Key Value Stores
Relational databases are difficult to coordinate in a distributed fashion - joins and constraints are difficult to distribute.
You can simplify your data structures. With simpler structures comes simpler management.
Key-value stores take this to the extreme: they represent data as one column for the key and one for the value. There are no joins and no foreign keys, just keys and values.
There's no explicit connection between entries. We retrieve either by key, or by exhaustive scan.
This means parallel processing and partitioning are easy; partitioning is always horizontal.
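A key-value store reduced to its essence can be sketched with a dictionary (the function names and sample data are invented for illustration): lookup by key is direct, while any other query must scan every entry.

```python
# Minimal key-value store: no joins, no types, no foreign keys.

store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store.get(key)          # direct lookup by key

def scan(predicate):
    # No secondary structure: any other query walks the whole store.
    return {k: v for k, v in store.items() if predicate(k, v)}

put("movie:1", {"title": "Ant-Man", "year": 2015})
put("movie:2", {"title": "Iron Man", "year": 2008})
print(get("movie:1")["year"])               # 2015
print(scan(lambda k, v: v["year"] < 2010))  # only movie:2
```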
Processing should happen near the data.
One algorithm for key-value processing is MapReduce. Many queries can be broken into two phases: first gather and filter information from the base data, then process and summarise the result.
In SQL terms, a SELECT query has a 'map' phase - the FROM and WHERE clauses - that maps over the source data, and a 'reduce' phase - SELECT, GROUP BY - that reduces that data to a summary.
The map phase is carried out with direct access to the data - it operates over all the data, or an index. The output is a new key-value set (or a new table in SQL world).
The reduce phase doesn't have to happen close to the data - it is carried out by reducer workers, which summarise data based on key. Work can be assigned to workers by key, in parallel. Output is collated and sent to the user.
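The classic word-count example, sketched here in a single process (in a real deployment the map and reduce calls run on separate workers): map emits key-value pairs, the shuffle groups them by key, and each reducer sums one key's values.

```python
# Single-process MapReduce word count: map -> shuffle -> reduce.

from collections import defaultdict

def map_phase(doc_id, text):
    # Emit one (word, 1) pair per occurrence.
    for word in text.split():
        yield word, 1

def reduce_phase(word, counts):
    # Summarise all values seen for one key.
    return word, sum(counts)

docs = {1: "the map phase emits pairs",
        2: "the reduce phase sums the pairs"}

# Shuffle: group intermediate pairs by key.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for key, value in map_phase(doc_id, text):
        grouped[key].append(value)

# Each key could be reduced by a different worker, in parallel.
result = dict(reduce_phase(k, v) for k, v in grouped.items())
print(result["the"])    # 3
print(result["pairs"])  # 2
```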
Why use it? It's simple and it scales. It's designed for distributed data, and many other systems follow a similar approach. It's easy to distribute the processing: you move the data you need, not the data you have (good for privacy). It recovers well if a worker fails, though there is still a coordinating node.
Google stopped using MapReduce internally, but it's still core to many distributed systems.
6.3 Document Databases
Key-value stores and relational models sit at opposite ends of a spectrum: the relational database enforces constraints on types and relationships; the KV store is just a key and a value, with no relational or type constraints.
By giving up the constraints, we can distribute the data and the processing. Is there a middle ground? One model that offers one is the document store.
When thinking about document stores we’re thinking about discrete structures, separate from each other but more complex than a key-value map.
Document structures are typically less strictly controlled than tables. We can nest structures and repeat elements of a data structure - we're not prescribing the structure. They can also be order-sensitive.
Documents seldom interlink, which keeps processing parallelisable. Linking matters less for retrieval; if we do use links for retrieval, we'll build an index.
In terms of formats, documents might use markup languages or bespoke formats. JSON is now common, and we focus on it here.
JSON is a simple expression of standard data structures found in many languages - it’s a serialization format for data persistence and transfer that’s easy to map to typical data structures in languages.
The syntax is very close to JavaScript, with some differences - e.g. keys must be quoted. There are no named structures or cross-references.
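A quick round-trip with Python's standard json module shows how directly JSON maps onto native data structures (the sample document is invented):

```python
# JSON serialises nested dicts/lists/strings/numbers; no references,
# and all keys become quoted strings in the serialised form.

import json

movie = {"title": "Ant-Man", "year": 2015,
         "cast": [{"name": "Paul Rudd"}, {"name": "Evangeline Lilly"}]}

text = json.dumps(movie)        # serialise for persistence or transfer
restored = json.loads(text)     # parse back into plain Python objects

print(restored == movie)            # True: the round-trip is lossless
print(restored["cast"][0]["name"])  # Paul Rudd
```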
MongoDB uses a document-oriented model. It's distributed by sharding (horizontal partitioning), and it's open-ish source.
A collection is the equivalent of a table, we can insert like db.collection.insert([array])
retrieve like db.collection.find({year:2015})
We can use regular expressions in queries too, e.g. to match titles containing 'Man': db.collection.find({title: /Man/})
We can search using numerical constraints: db.collection.find({year: {$lt: 2010}})
We can update by passing two parameters: the first is the pattern matching the records to update, the second is the update itself:
db.collection.update({title: "Ant-Man"}, {$set: {year: 2015}})
By default, update changes a single document, since that's the most common scenario (the find can stop after the first match). Use updateMany to update many records at once.
In a document DB you're more likely to store denormalised data.
Mongo is quite SQL-like in some ways - it can create indices, etc. - but its data modelling is tree-like, which reduces the need for joins.
It has suffered from security and reliability issues in the past.