Lately I’ve been researching some into the holy grail of keyed data storage – best combination of performance, scalability, efficiency and availability. There are many, many options available ranging from the Berkeley DB to BigTable implementations like Hypertable.
Last weekend I spent some time looking into using BDB in a BigTable fashion for managing schema-free tables. However my tests revealed many problems with a solution like that. For instance, BDB is really slow when writing random keys into databases of >100k row size. In the beginning of this week I had a chat with Jon Åslund regarding this idea and he introduced me to Tokyo Cabinet – a modern, battle-tested and extremely high-performance DBM.
Despite the somewhat uncool name, Tokyo Cabinet is a silent beast developed by Mikio Hirabayashi and used in the high-load environment of Japanese Facebook-equivalent Mixi. TC (short for Tokyo Cabinet) is written in C99 C, sporting a clean and modern API.
Mikio states TC improves on other DBMs in the following areas:
As you might know Python lies close to my heart, so despite the multitude of language bindings available for TC I will only talk about the Python bindings pytc by Tasuku Suenaga.
At the time writing this pytc is only available from PyPI and the main repository, thus you need to build this trivial module yourself.
$ svn co http://svn.coderepos.org/share/lang/python/pytc/trunk/ pytc $ cd pytc $ sudo python setup.py install
Unfortunately there is no documentation of pytc – you need to look through pytc.c manually, which might give you pleasure or pain, depending on if you have a life or not :P. Anyhow, Tasuku have created two examples giving you most clues needed. Good thing is the Python API is almost a clean port of the C API, meaning you can refer to the C library reference and apply some common sense.
Here’s a very simple program using a hash database, setting three pairs and asserting one of them are retrieved in a proper manner:
import pytc db = pytc.HDB('casket.hdb', pytc.HDBOWRITER | pytc.HDBOCREAT) db.put('potato', 'potatis') db.put('carrot', 'morot') db.put('banana', 'banan') assert db.get('carrot') == 'morot'
As the data is mapped into shared memory and the library is thread-safe (locking per row or per database), multiple processes and/or threads of control can operate on the database at the same time. By default whole-database locking is performed, in which case there can be only one writer at any given moment, blocking any other readers or writers. Unless a writer is doing its thing, there can be multiple concurrent readers without any locking.
The performance of the library is very good – almost as fast as working directly with the C API – randomly reading 2 000 000 records took less than 15 seconds (~140 000 reads/sec) using a single thread of control. The database read from was a hash database with 10M records with a variable key length of 10-15 characters.
Update: I’m now working on the Python bindings under the name tc – currently available for Python 2.4, 2.5, 2.6 and 3.0 in MacPorts and PyPI. Source resides in the ‘Hub.
While Tokyo Cabinet is best thought of as an API to the native database routines, you’ll be happy to know that you can, in fact, run Tokyo Cabinet in stand alone mode with the help of Tokyo Tyrant - a high concurrency network interface to the underlying database. Backup, replication, and failover are all part of the package.
I ran a few benchmark tests of Tyrant, which proved that this solution indeed provide extremely high performance. This test was run on a MacPro with 8x2.8 GHz Intel Xeon processors with almost no other load, reserving 4 threads for the server.
Server started like this:
# Server with 4 threads, using a hash table with 10M buckets $ ttserver -thnum 4 "*#bnum=1000000"
Test 1 – 1 000 000 records, 4 threads: (Result: About 22 000 operations per second for any of reading, writing or deleting records)
Using the mget we can fetch several records in one call, drastically increasing performance:
The above numbers was aquired by running tcrmttest read against the local Tyrant server with 1M records:
$ tcrmttest read -tnum 4 -mul X localhost
Mikio provides a pretty good benchmark document in which he has compared many DBMs with Tokyo Cabinet:
Write performance (Writing 1 000 000 records)
Read performance (Fetch all 1 000 000 records)
A very interesting project which state imo. are in the transition between “new and experimental” and “good old reliable software”. I’ve tried to contact Tasuku Suenaga (without any success) about getting his Python bindings into MacPorts. Tokyo Cabinet and Tyrant are perfect complements to my Smisk web services framework – maybe I’ll continue working on Tasuku Suenaga Python bindings and bring TC to Debian and MacPorts as well as Smisk.
Ilya Grigorik has a pretty good introduction to Tokyo Cabinet covering parts of TC not mentioned in this article (for instance the array and table engine).