CM3010 Topic 05: Advanced RDBMS Topics
Main Info
Title: Advanced RDBMS Topics
Teachers: David Lewis
Semester Taken: April 2022
Parent Module: cm3010: Databases and Advanced Data Techniques
Description
In this topic – now that we’ve covered the basics of effective, reliable database use – we turn to questions of speed and efficiency. We will look at analysing query performance and methods of optimisation. You will also work on your project.
This is a short topic as week 10 is given over to the midterm.
Lecture Summaries
DBMSs work in a declarative rather than imperative style: we tell the DBMS the result we want, and it's up to the DB how it gets there.
This makes optimisation different from working with imperative languages. We need to understand what the DBMS needs to know, and help it build an efficient query.
We’ll change structures, or create new structures, to make queries faster.
5.1 Speeding Things Up
5.101 Query Efficiency
Relational DBs are fast. A lot of the time you never need to worry about performance or optimization. If you’ve normalized your DB, that will take care of a lot.
But that’s not always the case. If you have a large dataset, complex queries, or heavy user load, you can find performance issues.
What’s expensive in DBs?
Searching (checking values on every entry)
Sorting (at best O(n log n))
Copying (especially with joins)
So these are the things we look for when analysing performance issues.
The first example is about using sorted tables. By default, rows are not ordered, so search is linear. What if the table were sorted by the field we want to search on? Then we could use binary search.
This is called clustered indexing. It uses no extra space, but it is a physical ordering of the data, so you can only choose one field to sort on.
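In MySQL's InnoDB engine, the primary key serves as the clustered index: rows are stored physically in primary-key order. A minimal sketch (the table and columns here are hypothetical):

```sql
-- InnoDB stores rows physically in PRIMARY KEY order, so lookups and
-- range scans on (EventDate, EventId) can use the clustered B-tree
-- rather than a linear scan.
CREATE TABLE Events (
  EventDate DATE NOT NULL,
  EventId   INT  NOT NULL,
  Detail    VARCHAR(255),
  PRIMARY KEY (EventDate, EventId)
) ENGINE = InnoDB;
```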
Alternatively, we can stop worrying about the order of the data in the table itself and use a separate structure. We'll usually use either a B-tree or a hash table.
There are also specialised indexes, e.g. for spatial data.
These indexes can be much smaller than the source tables, so they can often be kept in memory for faster access.
They often depend on knowledge of the physical storage, so file reads can be optimised.
B-Trees
Balanced trees, or B-trees, give O(log n) search in both the typical and worst case.
They support range queries (eg find values between…) very easily.
They also support some approximate searching.
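A secondary B-tree index makes such range queries cheap. A sketch reusing the MovieLocations table from later in these notes (the index name is made up):

```sql
-- Build a B-tree index on Year, then range-scan it.
CREATE INDEX idx_movielocations_year ON MovieLocations (Year);

-- The optimiser can walk the index for the range 1970-1979
-- instead of checking every row.
SELECT Title, Location
FROM MovieLocations
WHERE Year BETWEEN 1970 AND 1979;
```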
Hash Tables
O(1) retrieval in the typical case.
O(n) in the (unusual) worst case.
The hash algorithm itself may be expensive.
Hashes don't support approximate matching; hash functions are designed to spread values out broadly.
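MySQL's MEMORY storage engine supports hash indexes directly. A sketch with a hypothetical lookup table:

```sql
-- A hash index gives O(1) equality lookup on SessionId,
-- but cannot answer range or prefix queries.
CREATE TABLE SessionCache (
  SessionId CHAR(32) NOT NULL,
  UserId    INT      NOT NULL,
  INDEX USING HASH (SessionId)
) ENGINE = MEMORY;
```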
Large Data problems and examining queries
The order of operations becomes important with large data sets. If you cross join large tables, creating copies of all that data before filtering, it gets expensive.
If we instead match an inner join first, we drastically reduce the number of operations. The system's optimiser will try to find the fastest order of operations.
This optimization is part of a query strategy. Advanced query strategies include building fresh indexes on the fly.
How do we as users choose when and how to optimize?
We can put the keyword EXPLAIN before a query, which then reports the operations and their sequence, so we can see what will happen.
You will see information like:
- select_type – this may be simple, or involve sub-selects or unions. Sub-selects can be expensive, so be careful when using them and see if there are more efficient alternatives. It's best to get fluent in SQL before using them.
- possible_keys – this reports whether any keys might help the query.
- rows – how many rows will it look at?
- filtered – can it reduce the data before doing any copying? Expressed as a percentage.
- Extra – any other info. It might say, for example, Using where or Using hash join. The latter builds a hash table as it copies the data, and can then use that in later steps.
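For example (the WHERE value here is made up; the table is the MovieLocations table used later in these notes):

```sql
-- Reports select_type, possible_keys, rows, filtered, Extra, etc.
-- without actually running the query.
EXPLAIN
SELECT Title, Year
FROM MovieLocations
WHERE Location = 'Golden Gate Bridge';
```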
You can also try EXPLAIN ANALYZE <query>, which will both explain the query and time it.
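A sketch (EXPLAIN ANALYZE is available in MySQL from version 8.0.18, and in Postgres):

```sql
-- Unlike plain EXPLAIN, this executes the query and reports
-- actual timings and row counts for each step of the plan.
EXPLAIN ANALYZE
SELECT Title
FROM MovieLocations
WHERE Year > 2000;
```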
Walks through an example of playing with indexes.
5.103 Denormalisation
Up to now we’ve treated normalisation as the gold standard, but sometimes this is in conflict with speed.
Normalisation can reduce disk reads: we only read data when we need it, and never repeatedly read repeated information.
It can also reduce the need for integrity checks.
It can reduce storage use by removing duplication.
But it increases the use of joins. Joins can be expensive (searching, index consulting).
One technique is to denormalise our db, merging tables to reduce query complexity. Instead of doing join selects we maintain a cache of a joined select.
Note this is not done up front, ie you still have to normalise your database first. Denormalisation is done later.
It reduces joins. This is sometimes faster; you'll only be able to tell whether it's a useful step based on your specific needs.
Here’s how we might create a denormalised table:
CREATE TABLE MovieActors
  (PRIMARY KEY (Title, Year, Location))
AS
SELECT Title, Year, Location, Name, DOB, Gender
FROM MovieLocations
LEFT JOIN Actors ON Actor1 = Name;
This is risky for dynamic data, since it creates duplicated data that can fall out of sync with the source tables, so only do it with care.
An alternative is creating a view, an alternative means of caching, like this:
CREATE VIEW SFMoviesAndActors
AS
SELECT Title, Year, Location, Name, DOB, Gender
FROM MovieLocations
LEFT JOIN Actors ON Actor1 = Name;
This is a saved select; we can then treat the view as a table, eg SELECT * FROM SFMoviesAndActors.
These views are dynamic on their own: they are syntactic sugar, translating the neat little query into the big underlying query.
However, they can be combined with caching: a SNAPSHOT in the standard, also called embodied or materialised views. These can be saved, and then updated when data changes.
Mutating a view will mutate the underlying source tables. You can grant users privileges on views. The views regenerate when data changes.
Materialised views aren't implemented natively in MySQL (you can build your own), but they are built into Postgres.
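In Postgres, the cached variant looks like this (a sketch, reusing the view query from above; the materialised view's name is made up):

```sql
-- A materialised view stores the result of the join on disk.
CREATE MATERIALIZED VIEW SFMoviesAndActorsCache AS
SELECT Title, Year, Location, Name, DOB, Gender
FROM MovieLocations
LEFT JOIN Actors ON Actor1 = Name;

-- The snapshot does not update itself; refresh it when the data changes.
REFRESH MATERIALIZED VIEW SFMoviesAndActorsCache;
```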
Summary
Don’t optimize until you find a speed problem, or can see one coming. Do safe optimizations first, before reaching for denormalisation.