Thursday, December 11, 2014

VoltDB first steps - schema

The first, and most critical, step for working with VoltDB is figuring out what table(s) will be partitioned, and how. So, the schema...

The case I'm looking at is for a hypothetical ticket booking agency, for very large events (rock bands playing large stadiums, etc.). There's three basic tables:
  1. seat_info
  2. booking_info
  3. booking_seats
seat_info is a list of all of the seats, and whatever metadata is required.
booking_info is a list of the bookings, so far, and whatever metadata is required.
booking_seats relates seats to bookings, and events.

booking_seats is the table that's interesting; it will be the largest, and hottest, by far. The schema elements for booking_seats is below:

create table booking_seats (
  booking_id varchar(10) not null,
  seat_number smallint not null,
  seat_row int not null,
  event_id int not null,
  constraint booking_seats_unique_hash unique (seat_number, seat_row, event_id)

partition table booking_seats on column seat_row;

create index event_booking_hash_idx on booking_seats (event_id);

booking_id is a foreign key for booking_info.
seat_number and seat_row form a composite key into the seat_info table.
event_id will be a foreign key relationship "in the future".

seat_row is used for partitioning the booking_seats table.

The VoltDB user manual has an example of a flight booking schema that's been partitioned by flight_id, analogous to event_id in my example.  The use case here is different, and so the partitioning scheme needs to be different.

  • event_id would cause all of the traffic for the ticket booking system would be directed to a single partition when tickets for that event were put on sale - which sort of defeats the purpose of partitioning the data in the first place.
  • booking_id could be used for partitioning, but as it's not a part of the unique constraint, the constraint would be expensive to enforce.
  • seat_number isn't a good choice for partitioning because buying multiple tickets is normal, and each transaction should be kept within a single partition, if possible.
  • seat_row is my choice for partitioning, because most bookings tend to be in the same row, and because it's a part of the unique constraint that prevents seats being double-booked.

In theory booking_id, or even event_id, may make a sound choice for partitioning. For this example I really would like to have aggressive partitioning of the data to see what breaks stress VoltDB's capabilities.

This is a "not bad" schema; but the query for finding available seats is a bit ugly. I added the index on event_id, to optimise the query for available seats for a given event, but testing is really essential to see if it helps or hinders performance overall.

No comments: