The OpenUDDI server poses an interesting challenge when loaded with large amounts of data. The problem is actually pretty generic:
Given a number of objects, where each object has a set of (key, value) tuples associated, find the objects that have specific tuples associated. Preferably in SQL, where the tables could look like this:
Objects:

| Id |
|----|
| 1  |
| 2  |
| 3  |

Values:

| ObjId | Key | Value |
|-------|-----|-------|
| 1     | a   | val1  |
| 1     | b   | val2  |
| 2     | a   | val1  |
| 3     | c   | val1  |
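For concreteness, the example tables can be recreated in a few lines. This is just an illustration in SQLite (the table and index names are made up for the sketch; `values` has to be quoted because it is a reserved word):

```python
import sqlite3

# Hypothetical, minimal recreation of the two example tables.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE objects (id INTEGER PRIMARY KEY);
    CREATE TABLE "values" (objid INTEGER REFERENCES objects(id),
                           key TEXT, value TEXT);
    CREATE INDEX kv_idx ON "values" (key, value);
''')
conn.executemany("INSERT INTO objects VALUES (?)", [(1,), (2,), (3,)])
conn.executemany('INSERT INTO "values" VALUES (?, ?, ?)',
                 [(1, "a", "val1"), (1, "b", "val2"),
                  (2, "a", "val1"), (3, "c", "val1")])
conn.commit()
```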
This is just a simple example. In the OpenUDDI server, we’re working with about 100,000 entities in the Objects table (Business Entities) and 1.8 million key/value pairs, distributed pretty evenly across the 100k objects. One of the interesting features is that some of the pairs appear in almost all objects while other pairs are close to unique.
When searching for the pairs (a, val1) and (b, val2), the SQL query looks something like this:

```sql
SELECT o.*
FROM objects o, values v1, values v2
WHERE o.id = v1.objid AND o.id = v2.objid
  AND (v1.key = 'a' AND v1.value = 'val1')
  AND (v2.key = 'b' AND v2.value = 'val2')
```
In other words, a dynamic SQL statement based on the number of pairs in the query. Most databases cannot process this type of query effectively for two reasons:
- Statements cannot be reused, and complexity goes up as the number of pairs increases
- Index utilization is hard because some pairs occur many times while others are almost unique. Only a few database engines always use the unique pairs as the primary filter key, regardless of their order in the query
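To make the "dynamic statement" concrete, here is roughly how such a query gets built, one self-join per pair. This is just a sketch in Python against SQLite, using the table and column names from the example above:

```python
import sqlite3

def build_query(pairs):
    """Build the n-way self-join query for a list of (key, value) pairs.
    One "values" alias per pair; the statement text changes with len(pairs),
    which is exactly why it cannot be prepared once and reused."""
    joins = ", ".join(f'"values" v{i}' for i in range(len(pairs)))
    conds = " AND ".join(
        f"o.id = v{i}.objid AND v{i}.key = ? AND v{i}.value = ?"
        for i in range(len(pairs)))
    params = [x for kv in pairs for x in kv]
    return f"SELECT o.* FROM objects o, {joins} WHERE {conds}", params

# Tiny demo database mirroring the example tables.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE objects (id INTEGER PRIMARY KEY);
    CREATE TABLE "values" (objid INTEGER, key TEXT, value TEXT);
''')
conn.executemany("INSERT INTO objects VALUES (?)", [(1,), (2,), (3,)])
conn.executemany('INSERT INTO "values" VALUES (?, ?, ?)',
                 [(1, "a", "val1"), (1, "b", "val2"),
                  (2, "a", "val1"), (3, "c", "val1")])

sql, params = build_query([("a", "val1"), ("b", "val2")])
print(conn.execute(sql, params).fetchall())  # → [(1,)]; only object 1 has both pairs
```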
The question then is how to make this run fast. With OpenUDDI, this type of query can run in just under 1 second on PostgreSQL 8.3. All access is index based, but it’s still pretty slow, given that we would like to handle many requests per second.
It would be nice, of course, to be able to say that we’ve solved the problem, but unfortunately that is not the case. For now, we’ve settled on a heuristic optimization: we count all pairs and store the counts in memory. When a new query comes in, we find the pair with the lowest count and use only that pair in the SQL statement. This speeds up the select statement considerably; the only catch is that we now have to post-process the result in memory in order to filter out any entities that do not match the rest of the pairs.
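In code, the heuristic looks roughly like this. This is a Python/SQLite sketch of the idea, not the actual OpenUDDI implementation; the data structures are assumptions:

```python
import sqlite3
from collections import Counter

# Demo database mirroring the example tables.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE objects (id INTEGER PRIMARY KEY);
    CREATE TABLE "values" (objid INTEGER, key TEXT, value TEXT);
''')
conn.executemany("INSERT INTO objects VALUES (?)", [(1,), (2,), (3,)])
conn.executemany('INSERT INTO "values" VALUES (?, ?, ?)',
                 [(1, "a", "val1"), (1, "b", "val2"),
                  (2, "a", "val1"), (3, "c", "val1")])

# Step 1: count every pair once, up front, and keep the counts in memory.
pair_counts = Counter(
    (k, v) for k, v in conn.execute('SELECT key, value FROM "values"'))

def search(pairs):
    # Step 2: hit SQL with only the rarest (most selective) pair.
    rarest = min(pairs, key=lambda p: pair_counts.get(p, 0))
    candidates = [row[0] for row in conn.execute(
        'SELECT objid FROM "values" WHERE key = ? AND value = ?', rarest)]
    # Step 3: post-filter the candidates in memory against the other pairs.
    rest = [p for p in pairs if p != rarest]
    result = []
    for objid in candidates:
        have = {(k, v) for k, v in conn.execute(
            'SELECT key, value FROM "values" WHERE objid = ?', (objid,))}
        if all(p in have for p in rest):
            result.append(objid)
    return result

print(search([("a", "val1"), ("b", "val2")]))  # → [1]
```

Here (b, val2) occurs once while (a, val1) occurs twice, so the select runs against (b, val2) and only one candidate has to be checked in memory.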
The remaining pairs could probably be pushed into SQL too by using a subselect, but for now this solution works.
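One way that push-down could look: keep the rarest pair in the main query and express each remaining pair as an EXISTS subselect. Again a sketch of what it might look like, not what OpenUDDI actually does:

```python
import sqlite3

# Demo database mirroring the example tables.
conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE objects (id INTEGER PRIMARY KEY);
    CREATE TABLE "values" (objid INTEGER, key TEXT, value TEXT);
''')
conn.executemany("INSERT INTO objects VALUES (?)", [(1,), (2,), (3,)])
conn.executemany('INSERT INTO "values" VALUES (?, ?, ?)',
                 [(1, "a", "val1"), (1, "b", "val2"),
                  (2, "a", "val1"), (3, "c", "val1")])

# The rarest pair drives the scan; every other pair becomes an EXISTS
# check, so the database never materializes a full n-way join.
sql = '''
    SELECT o.* FROM objects o, "values" v1
    WHERE o.id = v1.objid AND v1.key = ? AND v1.value = ?
      AND EXISTS (SELECT 1 FROM "values" v2
                  WHERE v2.objid = o.id AND v2.key = ? AND v2.value = ?)
'''
print(conn.execute(sql, ("b", "val2", "a", "val1")).fetchall())  # → [(1,)]
```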
My only problem with this solution is that it seems as if there ought to be a better way. However, I can’t think of any, and neither can anybody I’ve talked to about it.