The project I’m working on now uses Oracle for data storage. Not only that, it’s good old 9.2. Try installing that on your brand new Ubuntu. Oh well, it’s doable, but recently we’ve begun discussing whether we could run on MySQL or PostgreSQL instead. (I’m voting for PostgreSQL, but that’s mostly because I used MySQL back when foreign keys and other advanced features weren’t exactly implemented, and I’ve since come to like PostgreSQL. At the very least, hitting Ctrl-C in psql doesn’t exit the prompt the way it does in mysql.)
The way it works now is that if something goes wrong, we can either call Oracle or our strategic Oracle partner, and they’ll fix it. This is just about everything I’m against: paying large amounts of money just so we can blame somebody else when things screw up.
There are basically two issues with switching to a more lightweight database, and they’re probably more or less the same on many projects. First of all, performance: Oracle probably has a real advantage there, especially in handling large and complex queries. Secondly, maintenance, especially backup and failover.
The only one of these points which I think is valid is backup – it doesn’t really help if it takes a week to back up or restore data. That said, we’re left with two things: performance and reliability. The Oracle way (I’m using the term as broadly as possible; replace Oracle with MSSQL or DB2 if you like) is to add more memory, more disk, clustering, high availability. Expensive, but you get to live in your little world where you can just write code against one large database.
The other way, which I prefer (and which is probably more in the Web 2.0 spirit), is to distribute. Distribute both data and processing to a number of autonomous nodes which can operate independently of each other. This is nothing new, and has been done many times, but it’s not something that’s normally considered when building good old business applications.
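To make that a bit more concrete, here’s a minimal sketch (in Java) of the core idea: every data key is routed deterministically to one of a fixed set of nodes, so each node only has to worry about its own slice of the data. The Node interface is a made-up placeholder, and a real setup would of course have to deal with nodes coming and going, replication, and so on.

```java
import java.util.List;

/** A hypothetical autonomous node that can answer lookups for its slice of data. */
interface Node {
    Object lookup(String key);
}

/** Routes each key to one node based on the key's hash. */
class PartitionedStore {
    private final List<Node> nodes;

    PartitionedStore(List<Node> nodes) {
        this.nodes = nodes;
    }

    // The same key always lands on the same node.
    Node nodeFor(String key) {
        return nodes.get(Math.abs(key.hashCode() % nodes.size()));
    }

    Object lookup(String key) {
        return nodeFor(key).lookup(key);
    }
}
```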
The result of distributing is essentially that you have to think about how you’re accessing your data. Instead of just delegating all of the work to the database and hoping for the best, you’re actually forced to analyze data relationships to discover separate components. If this succeeds, the choice of database should no longer be about whether it can optimize a query over 50 tables with subselects, type conversions, views, functions, and other stuff, but whether it is efficient at looking up simple data relations. My guess is that all the popular databases can do this, so then you’re free to choose the cheapest, the one which is easiest to work with, or whatever suits your environment.
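For illustration, this is the kind of “simple data relation” lookup I mean – a single-table select by primary key, which any engine should handle just fine. The table and column names are invented for the example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** A lookup any database engine handles well: one table, one key. */
class OrderDao {
    private final Connection connection;

    OrderDao(Connection connection) {
        this.connection = connection;
    }

    String findStatus(long orderId) throws SQLException {
        String sql = "SELECT status FROM orders WHERE order_id = ?";
        try (PreparedStatement stmt = connection.prepareStatement(sql)) {
            stmt.setLong(1, orderId);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString("status") : null;
            }
        }
    }
}
```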
Returning to the project I’m working on, we have some reservations about some of the queries we’re executing. They’re in Oracle syntax right now, but can probably be converted to standard SQL in a finite amount of time. That doesn’t make them any smaller, but we have a good amount of pretty static data (addresses, classifications, and so on) which is retrieved together with the more dynamic data. If the static data is removed from the SQL, we end up with some much simpler queries, which shouldn’t be a problem for any database engine. The problem that remains is how to retrieve the static data efficiently. At the moment, I’m leaning towards a solution where we implement a service which can take a list of data keys and return the corresponding data. Depending on the amount of data, the service can then be implemented as an in-memory map, a memcached cache, or maybe even something like Hadoop. No matter what, basing the model on the principle of isolating static data from the dynamic, and only querying the dynamic data, seems like the way to go as a first step – and as a nice side effect, the dependency on the database’s ability to perform doesn’t matter that much anymore.
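A sketch of what that service could look like follows. The key and value types are just placeholders – real static data would be richer objects – and the in-memory map is the simplest possible backing store; a memcached-backed version would just be another implementation of the same interface.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hand it a list of keys, get the matching static data back. */
interface StaticDataService {
    Map<String, String> fetch(List<String> keys);
}

/** Simplest implementation: all static data preloaded into memory. */
class InMemoryStaticData implements StaticDataService {
    private final Map<String, String> data;

    InMemoryStaticData(Map<String, String> preloaded) {
        this.data = new HashMap<>(preloaded);
    }

    @Override
    public Map<String, String> fetch(List<String> keys) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String key : keys) {
            String value = data.get(key);
            if (value != null) {
                result.put(key, value); // unknown keys are simply skipped
            }
        }
        return result;
    }
}
```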
This probably just sounds like drunken ramblings to those who have actually implemented distributed business systems, but bear with me – it’s a first for me, and I need to get things out of my head before the space runs out.