Thursday, December 11, 2014

VoltDB first steps - schema

The first, and most critical, step for working with VoltDB is figuring out what table(s) will be partitioned, and how. So, the schema...

The case I'm looking at is for a hypothetical ticket booking agency, for very large events (rock bands playing large stadiums, etc.). There's three basic tables:
  1. seat_info
  2. booking_info
  3. booking_seats
seat_info is a list of all of the seats, and whatever metadata is required.
booking_info is a list of the bookings, so far, and whatever metadata is required.
booking_seats relates seats to bookings, and events.

booking_seats is the table that's interesting; it will be the largest, and hottest, by far. The schema elements for booking_seats is below:

create table booking_seats (
  booking_id varchar(10) not null,
  seat_number smallint not null,
  seat_row int not null,
  event_id int not null,
  constraint booking_seats_unique_hash unique (seat_number, seat_row, event_id)
);

partition table booking_seats on column seat_row;

create index event_booking_hash_idx on booking_seats (event_id);


booking_id is a foreign key for booking_info.
seat_number and seat_row form a composite key into the seat_info table.
event_id will be a foreign key relationship "in the future".

seat_row is used for partitioning the booking_seats table.

The VoltDB user manual has an example of a flight booking schema that's been partitioned by flight_id, analogous to event_id in my example.  The use case here is different, and so the partitioning scheme needs to be different.

  • event_id would cause all of the traffic for the ticket booking system would be directed to a single partition when tickets for that event were put on sale - which sort of defeats the purpose of partitioning the data in the first place.
  • booking_id could be used for partitioning, but as it's not a part of the unique constraint, the constraint would be expensive to enforce.
  • seat_number isn't a good choice for partitioning because buying multiple tickets is normal, and each transaction should be kept within a single partition, if possible.
  • seat_row is my choice for partitioning, because most bookings tend to be in the same row, and because it's a part of the unique constraint that prevents seats being double-booked.

In theory booking_id, or even event_id, may make a sound choice for partitioning. For this example I really would like to have aggressive partitioning of the data to see what breaks stress VoltDB's capabilities.

This is a "not bad" schema; but the query for finding available seats is a bit ugly. I added the index on event_id, to optimise the query for available seats for a given event, but testing is really essential to see if it helps or hinders performance overall.

Sunday, December 07, 2014

Explorations of VoltDB

I've started on a little project to see how VoltDB lines up against its promises.

The Unique Selling Point (sort of) of VoltDB is that you can horiztonally scale an OLTP database, in a shared-nothing environment. That's quite a trick.

Horizontally-scalable OLAP systems are well-known, even passe; but getting an OLTP system that scales elegantly has been hard/impossible for a reasonable cost. Could you scale out an Oracle RAC? Sure, if you have a few million bucks for licences and hardware. Can you scale out MySQL? Sure, if you have an army of programmers willing and able to implement sharding in their applications.

The idea is that VoltDB takes the headache of managing sharding and scalable storage, from the children developers, and hands it to the crazy people DBAs.

VoltDB has a base price of Free, which I like, so I decided to give it a go. The idea is to come up with a little benchmark problem, to stress its ACIDity. My example case is for a ticketing agency, such as those types who manage major events like rock bands that will sell-out a 100,000 person stadium. At this point, I'm just mocking up a little data, and a few queries. I hope I'll get around to building a little erlvolt app to go with it.

Getting up and running was a breeze. The documentation and build is excellent in this respect, I can't fault it. I took the hardest possible route and compiled the code from scratch (on Ubuntu) without reading the documentation (which I've since read) - and it went brilliantly well, with an idiot-resistant build that coaxed me towards the correct answer when I got things wrong (generally missing dependencies which are "ant" and "g++").

For sane people, you can just register on their website, download a binary, and install it. There's an (oh-so-cool) docker image available, but I got bored of playing with that in less time than it took to download, so install it, it's easy.

I've found that by being badly behaved, I can crash the VoltDB server (and/or my browser) on my ancient little laptop. This worries me (only) a little. I have been pretty rough on the DB, throwing all kinds of broken rubbish at it, but a DB really does need to be bullet-proof. In terms of running queries, procedures, etc. voltdb just fine, and does what it says on the tin.

If anyone from VoltDB reads this, please address bug ENG-2526 - it forces developers to build misleading interfaces, and that's a really bad idea.

VoltDB looks promising, so far, and has all kinds of interesting features. I'd like to see VoltDB extend with more data types, and functions, and maybe referential integrity enforcement... and a pot of gold, and a pony...