CM3010 Topic 09: Multimedia and Retrieval
Main Info
Title: Multimedia and Retrieval
Teachers: David Lewis
Semester Taken: April 2022
Parent Module: cm3010: Databases and Advanced Data Techniques
Description
Information Retrieval.
Assigned Reading
Vannevar Bush - As We May Think
Li-Chun Wang - The Shazam Audio Retrieval Algorithm (2003)
Schedl et al. - Neglected User in Music IR
Typke et al. - Using Transportation Distance for Measuring Melodic Similarity
Janssen et al. - Finding Occurrences of Melodic Segments in Folk Songs
Marques - Visual IR: The State of the Art (2016)
Lecture Summaries
9.0 Introduction
When we’ve talked about data we’ve been talking about facts - informational nuggets.
But much of the information people work with is less tangible: whole documents - a piece of music, a novel, a video. These are all data, but how do we decompose them into individual facts?
When we want to answer a user need about information of that kind, we'll need a different method from the one we use for 'fact-like' information.
We’ll look at that here, but see cm3060 Natural Language Processing for more.
Starts with an example from IR in a music system. The user has a vague information need - I’d like some slow, quiet music. A piece of music is returned that starts off satisfying the criteria, but then speeds up and gets louder.
What does the query mean - what makes a piece of music slow or quiet, and what are the tolerances? This is the issue of vagueness and boundary decisions that are not always determinate.
How we aggregate small parts of a document to describe it is a key issue in IR. This is the issue of ‘summary’.
Here are some other key issues:
Information may not be explicit in the signal.
It may not be in the signal at all (I’d like Viennese music from 1803 please).
Answers may be subjective.
Quotes Vannevar Bush at length about the differences between structured information retrieval and the human mind associating items and ideas. This is mentioned in the reading above.
The core ideas in IR are that:
The User has an information need
The need is expressed as a query
The query is executed over data by an IR system
Results are returned, usually as a ranked list
Returned results are relevant if they match the information needs
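The core ideas above can be sketched in miniature. The word-overlap scoring here is purely illustrative (not a method from the lecture) - real systems use far better relevance models:

```python
# Minimal sketch of the IR pipeline: a query is scored against each
# document and results come back as a ranked list. Scoring here is
# naive shared-word counting, purely for illustration.
def score(query, document):
    q_words = set(query.lower().split())
    d_words = set(document.lower().split())
    return len(q_words & d_words)  # crude relevance: count of shared words

def search(query, documents):
    # Rank all documents by score, best match first
    return sorted(documents, key=lambda d: score(query, d), reverse=True)

docs = ["slow quiet piano music", "fast loud rock", "quiet ambient drones"]
results = search("slow quiet music", docs)
```

Whether the top-ranked document is actually *relevant* to the user's need is a separate question from whether it matched the query - that distinction runs through the whole topic.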
9.1 Features for Complex Data
Much of what we do in IR tasks is based not on searching documents themselves but using features.
If we have an audio signal - say stereo at 44,100 samples per second - we're not going to get much information by looking at an individual sample.
Similar with NLP, taking characters individually won’t tell us much. Same with looking at a single pixel in an image.
So we construct higher level structures above the raw data. These are called features.
We might count zero-crossings. We might look at whole words, or other tokens in natural language. We might look at colour regions in an image.
We might then need to step back again - pitch estimation in music, combinations of words or stemming/lemmatization in NLP. Shapes in images.
A feature is a measurable property derived from the signal. It can include metadata though. Could be boolean, numerical, categorical. Can be simple or a vector. Vector features include colour profile, letter frequency, energy per pitch class in music retrieval.
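As a concrete sketch of the two kinds of feature just mentioned - a scalar feature (zero-crossing count) and a vector feature (letter frequency):

```python
def zero_crossings(signal):
    # Scalar feature: count sign changes between consecutive samples.
    # Cheap proxy for noisiness/pitch content in an audio signal.
    return sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))

def letter_frequency(text):
    # 26-dimensional vector feature: relative frequency of each letter a-z
    letters = [c for c in text.lower() if c.isalpha()]
    return [letters.count(chr(ord("a") + i)) / len(letters) for i in range(26)]

zc = zero_crossings([0.5, -0.2, -0.1, 0.3, 0.7, -0.4])
vec = letter_frequency("abracadabra")
```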
So for a ‘slow and quiet’ query we’d need tempo distribution, which needs tempo estimation, which needs beat tracking, which needs energy in time bins.
We'd also need loudness variation, which needs overall energy (in time bins), or tracking the dynamic range of energy - perception depends on relative loudness.
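A minimal sketch of the lowest-level feature in that chain, energy per time bin, plus a crude dynamic-range measure (the bin size and the max/min ratio are illustrative choices, not the lecture's):

```python
def energy_per_bin(samples, bin_size):
    # Sum of squared amplitudes within each fixed-size window of samples.
    # This is the low-level feature that beat tracking and loudness
    # estimation both build on.
    bins = []
    for start in range(0, len(samples), bin_size):
        window = samples[start:start + bin_size]
        bins.append(sum(x * x for x in window))
    return bins

def dynamic_range(energies):
    # Crude loudness-variation feature: ratio of loudest to quietest bin
    nonzero = [e for e in energies if e > 0]
    return max(nonzero) / min(nonzero)

energies = energy_per_bin([0.1, 0.1, 0.9, 0.9, 0.2, 0.2], bin_size=2)
```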
Features help us get from low level signal to higher level concepts. They reduce large complex data to smaller, simpler data.
They embody our expectations of salience. They can be re-weighted based on task and user.
There is still a semantic gap between the features and the user’s conceptual space. This is quite often where ML comes in. There won’t be a direct mapping between a query and a feature, bridging that gap is the biggest challenge.
9.2 Feature Space
To combine and use features we place them in a feature space.
Shows a simple 2D space and describes querying that space through looking at the distance to points. We might want to do similarity analysis.
A feature must be measurable, but it also must be mappable to a space, so mappable to a number.
In some cases this is straightforward - frequency is continuous, for example. But what about musical notes: repeatable categorical bins that map logarithmically to frequencies? See other modules for categorical variables and continuous feature spaces.
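A sketch of that logarithmic mapping, assuming the standard MIDI convention (note 69 = A4 = 440 Hz, 12 semitones per octave):

```python
import math

def midi_to_freq(note):
    # Equal temperament: each semitone multiplies frequency by 2^(1/12);
    # MIDI note 69 is A4 = 440 Hz by convention.
    return 440.0 * 2 ** ((note - 69) / 12)

def freq_to_midi(freq):
    # Inverse mapping: log-scale frequency back to the nearest note number,
    # i.e. continuous frequency quantised into categorical pitch bins.
    return round(69 + 12 * math.log2(freq / 440.0))
```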
9.203 Distances and Similarity
We’re going to use the space to evaluate similarity and distance. How do we judge similarity?
We might use Euclidean distance, which can expand to arbitrary number of dimensions.
We could use Manhattan distance, add all the distances.
Why choose one over the other? The difference is that Euclidean distance squares the difference in each dimension, so a large distance in one feature will overwhelm the others. Manhattan distance penalizes it only linearly.
If you’re looking for cases of similarity on some features even if others are quite different, Manhattan might be better.
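The two distances side by side - the points here are illustrative, picked so both candidates are the same Manhattan distance from the query point:

```python
import math

def euclidean(p, q):
    # Square each per-dimension difference, sum, take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Simply add up the absolute per-dimension differences
    return sum(abs(a - b) for a, b in zip(p, q))

# b and c are equally far from a under Manhattan distance, but b
# concentrates the difference in one feature, so Euclidean distance
# pushes it much further away.
a, b, c = (0, 0, 0), (10, 1, 1), (4, 4, 4)
```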
Talks through some requirements for spaces and rules for metric spaces:
\(d(x,y) = 0\) iff \(x=y\): Identity of indiscernibles
\(d(x,y) = d(y,x)\): Symmetry (doesn’t always apply walking up a hill - physical distance is the same, but effort/time is not symmetrical). A music playlist might work well in one direction, but not the other.
\(d(x,z) \leq d(x,y) + d(y,z)\): Triangle Inequality (not universally reliable - in psychological spaces the triangle inequality may not apply). Some IR measures are not strictly metrics because they don’t obey this.
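A quick way to see these axioms concretely is to spot-check them over a finite set of points (a sketch, not an exhaustive proof):

```python
def is_metric_on(points, d):
    # Spot-check the three metric axioms on every triple from a finite
    # sample of points. Passing is necessary, not sufficient, for d to
    # be a metric on the whole space.
    for x in points:
        for y in points:
            if (d(x, y) == 0) != (x == y):
                return False  # identity of indiscernibles fails
            if d(x, y) != d(y, x):
                return False  # symmetry fails
            for z in points:
                if d(x, z) > d(x, y) + d(y, z) + 1e-9:
                    return False  # triangle inequality fails
    return True
```

Manhattan distance passes; squared Euclidean distance fails the triangle inequality (e.g. on the line, \(d(0,2)=4 > d(0,1)+d(1,2)=2\)), which is why it is a similarity measure but not a metric.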
9.205 Speed and Indexing
IR is focused on users - they bring the need and formulate the query.
One need users invariably have is speed. To achieve this we precompute features as indices. Many searches are parallelisable.
Spatial indexes of metric spaces can be very fast for retrieval.
R-trees provide O(log n) retrieval.
Spatial indexing is built into many RDBMS packages.
Other techniques include dimensionality reduction, which increases speed and can reduce irrelevant results.
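R-trees themselves are involved to implement, but the core idea of a spatial index - bucket points so a query only inspects nearby candidates rather than every point - can be sketched with a simple grid hash (a toy stand-in, not an R-tree):

```python
from collections import defaultdict

class GridIndex:
    # Toy spatial index: hash 2-D feature points into fixed-size cells so
    # a neighbourhood query inspects only nearby cells, not every point.
    # Real systems use R-trees, which give O(log n) retrieval.
    def __init__(self, cell=1.0):
        self.cell = cell
        self.cells = defaultdict(list)

    def _key(self, x, y):
        return (int(x // self.cell), int(y // self.cell))

    def insert(self, x, y, item):
        self.cells[self._key(x, y)].append((x, y, item))

    def query(self, x, y):
        # Look only in the query's cell and its 8 neighbours
        cx, cy = self._key(x, y)
        hits = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                hits.extend(self.cells.get((cx + dx, cy + dy), []))
        return hits

idx = GridIndex(cell=1.0)
idx.insert(0.5, 0.5, "a")
idx.insert(0.6, 0.7, "b")
idx.insert(5.0, 5.0, "c")
near = [item for _, _, item in idx.query(0.4, 0.4)]
```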
Some search must happen live, so much effort goes into better algorithms.
9.3 Approximate Matching
How do we know if a result is good? We could use user ratings. But that might not be possible or give us good data.
Can we tell what a good result looks like?
One concept is precision: precise searches return mostly good results, with few irrelevant entries.
The flip side is recall: searches with high recall return most of the relevant results, with few missing from the result set.
These cover different aspects of user satisfaction - did I get all I need, and did I have to wade through irrelevant results to get it?
Looking down a ranked list, as you include more results precision will tend to decline, while recall improves and then plateaus.
What do we take from this kind of information?
Precision is the proportion of positive results that are true positives - relates to Type I errors.
Recall is the proportion of relevant documents that are retrieved as positive results - relates to Type II errors.
We might favour one of these over the other. To combine them we use the F-measure. It’s the harmonic mean of the two:
\(F_1 = \frac{2 \cdot P \cdot R}{P + R}\)
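Computed directly (a minimal sketch; the document sets used are made-up examples):

```python
def precision_recall_f1(retrieved, relevant):
    # Precision: fraction of retrieved results that are relevant.
    # Recall: fraction of relevant documents that were retrieved.
    # F1: harmonic mean of the two.
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # true positives
    precision = tp / len(retrieved)
    recall = tp / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# 4 results returned, 3 of them relevant; 6 relevant documents exist
p, r, f = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
```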
Some users just want a good match, or a small number. Show me a web page about cheese. I don’t want them all, I just want one. Show me a film to watch tonight that’s like Ant-Man. Play me something that sounds like Radiohead.
Recall is irrelevant here, precision is vital, especially for the first few matches.
On the other hand, if I’m a researcher and I want to look at a particular topic, I want all the matches. It matters for my research that it’s complete, so far as possible.
Here, recall is much more important than precision. Precision is still helpful though, if 80% of my results are irrelevant it’s still annoying, but recall is vital.
There are nuanced cases, like looking for reviews of a commercial product, or list shows I might want to watch - I want a selection then I choose. Literature reviews can often be covered by a good selection, not universality. F-measure might be useful here as both precision and recall are useful.
Questions to ask: How many records will the user consult? How many records does the user need? Are precision and recall equally important?
How can you measure recall for a large, unknown corpus? There are sampling methods, but note that these are not symmetric for the two metrics: precision is easier to measure than recall.