NoSQL Databases – Part 1 – Landscape

[toc]

At Directi, we are taking a hard look at the way our applications need to store and retrieve data, and whether we really need to use a traditional RDBMS for all scenarios. This does not mean that we will eschew relational systems altogether. What it means is that we will use the best tool for the job – we will use non-relational options wherever needed and not throw everything at a relational database with a mindless one-size-fits-all approach.
This post covers the current landscape of the NoSQL space. In a subsequent post, I intend to cover in more detail the various problem areas addressed by NoSQL systems and the specific algorithms used.
Caveat: though I have tried quite hard to understand each database, most of the systems discussed are quite comprehensive, and summarizing their capabilities in a few bullets cannot do them justice. It is also possible that my interpretation may be wrong in some cases, since I haven’t actually used these systems yet.
Summary
The need to look at non-SQL systems arises out of scalability issues with relational databases. These issues stem from the fact that relational databases were not designed to be distributed (which is key to write scalability), and could therefore afford to provide abstractions like ACID transactions and a rich, high-level query model.
NoSQL databases address the scalability issue in various ways – by being distributed, by providing a simpler data / query model, by relaxing consistency requirements, and so on. I find it useful to bucket them as follows:

Distributed vs. Not-distributed: Distributed databases take the responsibility of data partitioning (for scalability) and replication (for availability) and do not leave that to the client.

Distributed:

  • Amazon Dynamo
  • Amazon S3
  • Scalaris
  • Voldemort
  • CouchDb (thru Lounge)
  • Riak
  • MongoDb (in alpha)
  • BigTable
  • Cassandra
  • HyperTable
  • HBase

Not Distributed (responsibility on client):

  • Redis
  • Tokyo Tyrant
  • MemcacheDb
  • Amazon SimpleDb

The model of distribution that seems most compelling here is the one used by Dynamo. Indeed, it is copied by Voldemort, Riak and Cassandra – three very different kinds of stores. It is compelling because it gives an application simple knobs to tune its expectations of durability, read performance, consistency, write performance, etc., which makes it very general purpose. The other reason this model is good is that it allows heterogeneous hardware to be used efficiently (though Cassandra departs from this).

Data Model richness: The other key distinction is in terms of richness of data model:

Key-Value store:

  • Amazon Dynamo
  • Amazon S3
  • Redis
  • Scalaris
  • Voldemort

Document store:

  • Amazon SimpleDb
  • Apache CouchDb
  • MongoDb
  • Riak

Column-Store:

  • Cassandra
  • Google BigTable
  • HBase
  • Hypertable

On one end of the spectrum are the simple key-value stores: Dynamo, S3, Redis, Scalaris, Voldemort, etc. At the other end are the column-stores, which provide a very rich model: BigTable, Cassandra, Hypertable and HBase fall into this bucket. This richness comes at a price – the model is not simple, and you need to think about data modeling from the ground up. Somewhere in between are the document stores – a sort of schema-free cousin of relational databases: CouchDb, Riak, MongoDb, etc. They are simple to understand and richer than plain key-value stores.

Disk vs. Memory: A third useful dimension is whether the database is memory-driven or disk-driven. This matters because a disk-driven database needs an explicit cache, while a memory-driven one is not durable:

Memory:

  • Scalaris
  • Redis

Configurable:

  • BigTable
  • Cassandra
  • HBase
  • HyperTable

Disk:

  • CouchDb
  • MongoDb
  • Riak
  • Voldemort
On one end of the spectrum is Scalaris, which is entirely memory-driven, and Redis, which is primarily memory-oriented (you can do background snapshots). Cassandra, BigTable, Hypertable and HBase allow configuring how large the Memtable can get, which provides a lot of control. The document stores – CouchDb, MongoDb and Riak – all seem to use on-disk B+ trees, and Voldemort uses BDB and MySQL. Having a pluggable storage engine that can be in-memory does not make a database configurable – in that scenario you are entirely memory-driven, with no durability at all!

So which one wins? Well, I am biased towards the database solving the scalability issue by taking over the responsibility of partitioning instead of leaving that problem with me. So theoretically, Cassandra seems to combine the best of both worlds: the sophisticated distribution model of Dynamo with the richness of a column-store. Also, it is in heavy production use despite still being in beta. However, I believe the learning curve here would be higher for modeling data, and it may make sense to opt for the simplicity of a Voldemort for most cases. Ultimately it would depend on your app requirements and development team.
1. Why Non-Relational
Some of this may not resonate with you if you are an enterprise developer, since enterprise apps don’t have to deal with the gigantic scale that (some) consumer web applications deal with. However, given the rate at which data is growing and the number of users entering IT systems, these issues will only become more common – for smaller consumer apps as well as for enterprise apps. In fact, even today, irrespective of the scale at which your app operates, if you want to take advantage of a cloud platform like Google App Engine, Microsoft Azure or Amazon Web Services, you may need to think about some of the issues below, because at the infrastructure level these platforms do have to handle high scale, and they may impose constraints on the application / data model to help them scale.
1.1 Relational databases are hard to scale
1.1.1 Replication - scaling by duplication

  • Master-Slave:
    • Each write results in N writes, where N is the number of slaves. This leads to diminishing returns beyond a point, thus imposing a limit on scale
    • While reads get faster (since you can now read from N nodes), writes are still bottlenecked on one node
    • Critical reads still need to go to the master, since the write may not have propagated to all nodes. This logic needs to be built into the app.
    • High data volumes pose a problem since you need to duplicate the data N times. This also limits how much you can scale with this approach.
  • Multi-Master

1.1.2 Partitioning (sharding) – scaling by division:

  • Scales reads as well as writes
  • Not transparent to the application. The application needs to be partition aware.
  • The value of an RDBMS is in relations. Once partitioned, these relations get broken – you cannot do a join across shards – this now needs to be done in the app layer.
  • In general, manual sharding in relational databases is not simple.
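To make the partition-awareness burden concrete, here is a minimal sketch – the shard names and hashing scheme are made up purely for illustration – of what the app layer ends up owning once you shard manually:

```python
import hashlib

# Hypothetical shard identifiers -- illustrative only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(user_id: str) -> str:
    """The application, not the database, must route every query."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

def users_by_shard(user_ids):
    """A cross-shard 'join' now lives in the app layer: group keys by
    shard, query each shard separately, combine results in app code."""
    routing = {}
    for uid in user_ids:
        routing.setdefault(shard_for(uid), []).append(uid)
    return routing
```

Every query path in the app must now go thru this routing logic, and anything resembling a cross-shard join must be assembled by hand.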

1.2 Don’t need some features
1.2.1 UPDATEs and DELETEs

  • Typically not used since that leads to loss of information
    • May need the record for auditing, or for re-activation
    • Typically, the info is never really “deleted” from a domain perspective anyway
      • A user “leaves” a community – his posts would not be removed
      • Employees “leave” companies – their employment record is maintained
      • The canonical ACID transactions example: debits from a bank account – this is not a DELETE, but an INSERT
    • Nor is info just “updated”
      • Salary gets “incremented” – the previous record is maintained
      • Your bank account balance is not updated – there are “debits” and “credits”
  • So one can typically model an UPDATE / DELETE as an INSERT and version the record.
    • When data gets too large, archive inactive parts of data
  • Two problems that arise when you go for an INSERT-only system:
    • The database cannot help you with cascades thru triggers - this needs to be done explicitly in the app layer
      • The cascades are actually far more complex than propagating a DELETE / UPDATE – this is a domain requirement:
        • When an employee leaves, you need to update the payroll system so that full and final compensation can be carried out
        • Everytime a bank account gets debited, checks need to be made on standing instructions, minimum account balance, etc.
    • Queries need to filter out the inactive records
      • Can lead to dirty looking code – addressed using views
      • There would be some perf penalty that can be addressed by archival
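The INSERT-only approach described above can be sketched as an append-only log of versioned records, where reads filter out superseded and “deleted” versions (the key names and record shapes here are made up for illustration):

```python
from itertools import count

_version = count(1)
events = []  # append-only log: committed records are never overwritten

def put(key, data, deleted=False):
    """Model an UPDATE or DELETE as an INSERT of a new version."""
    events.append({"key": key, "version": next(_version),
                   "deleted": deleted, "data": data})

def get(key):
    """Reads must filter out superseded / inactive versions."""
    versions = [e for e in events if e["key"] == key]
    if not versions:
        return None
    latest = max(versions, key=lambda e: e["version"])
    return None if latest["deleted"] else latest["data"]

put("acct:1", {"balance": 100})
put("acct:1", {"balance": 70})                 # a "debit" is a new INSERT
put("user:9", {"name": "bob"})
put("user:9", {"name": "bob"}, deleted=True)   # a "delete" is also an INSERT
```

Note that the full history stays in the log, available for auditing or re-activation; archival would move old versions out when the data grows too large.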

1.2.2 JOINs

  • Why avoid
    • Joins are expensive when data volumes are high since the database server has to perform complex set operations over large volumes of data
    • Do not work across partitions
    • Techniques like Materialized / Indexed Views not supported by all databases
  • How to avoid? De-normalize!
    • Purpose of normalization
      • Make it easier to have consistent data by keeping just one copy
      • Reduce the amount of storage
    • With De-normalization
      • Burden of consistency shifts from the database to the application layer
      • Easier if you only do INSERTs and no UPDATEs / DELETEs
      • Would lead to data bloat – can be significant for large volumes, but storage is cheap and you can archive inactive data
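A small illustrative example of the trade-off (entity names are made up): the normalized form keeps one copy of the customer and “joins” at read time; the denormalized form copies customer fields into every order, so reads need no join but the app must keep the copies consistent.

```python
# Normalized: one copy of the customer; orders reference it by key.
customers = {"c1": {"name": "Asha", "city": "Mumbai"}}
orders = [{"id": "o1", "customer_id": "c1", "total": 250}]

# Denormalized: customer fields copied into each order -- faster reads,
# duplicated data, and the consistency burden moves to the app layer.
orders_denorm = [{"id": "o1", "total": 250,
                  "customer_name": "Asha", "customer_city": "Mumbai"}]

def order_view_normalized(order_id):
    order = next(o for o in orders if o["id"] == order_id)
    cust = customers[order["customer_id"]]   # the "join"
    return {**order, "customer_name": cust["name"]}

def order_view_denorm(order_id):
    # Single lookup; everything needed is already in the record.
    return next(o for o in orders_denorm if o["id"] == order_id)
```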

1.2.3 ACID Transactions

  • Atomic – do not need atomicity on modification of more than one record. Single key atomicity is enough
  • Consistency – CAP theorem: you can get any two of Consistency, Availability and Partition tolerance – not all three. (Also see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1915)
    • Most systems need partition tolerance and availability ahead of consistency.
      • Customer wants to place an order – you will accept the order, not return the money saying the system is unavailable – availability is important
      • Inventory would be checked asynchronously
      • Order details would be checked asynchronously
      • … would be done asynchronously
      • All this while data would be in an inconsistent state
      • This is ok – businesses are like that. They do not operate on a single version of truth. Reconciliation happens all the time.
    • Therefore our data model need not be strongly / strictly consistent. We can do with Eventual Consistency.
    • In most scenarios we need Read-Your-Writes Consistency and Monotonic Reads Consistency (as defined by Vogels in the paper above)
    • Strong consistency relies upon conflict resolution at write time to keep read complexity simpler. This does not scale.
  • Isolation – do not need isolation beyond Read-Committed. Easy with single key atomicity (above)
  • Durability – need durability till the time that RAM becomes cheap enough that one can afford many peer replicated nodes holding data in memory so that data is available even with node failures.

1.2.4 Fixed Schema

  • In an RDBMS you have to define the schema before you can start using data (somewhat like declaring types in statically typed languages)
    • Define each entity (table), its attributes (columns) and relations between entities
    • Define usage patterns (indexes)
  • Modifying schemas is essential
    • Intense competition and rapid growth necessitate adding new features / tweaking existing features rapidly
    • Changes usually require modifying the data model, thus precipitating schema changes
  • Modifying the schema is hard
    • Adding / Deleting / Modifying a column may lock the rows (imagine doing this for several million rows)
    • Adding / Removing an index may lock up the table

1.3 Don’t get some features

  • Hard to model hierarchical data
  • Hard to model graphs
  • Don’t rely primarily on main memory
    • Preferable to avoid going to disk as far as possible and serve out of main memory to get faster response time
    • Most relational systems are not memory oriented, but disk-oriented. Even with large main memory, relational databases end up going to disk for most queries – they are not aggressive about serving data from main memory and avoiding going to disk.
    • Vendors are trying to address this by acquiring / building in-memory database technology, but this is far from mainstream

2. Desired Characteristics
The environment expected is that the system would be spread over 100s to 1000s of commodity machines (hence called nodes) with different capacities. In a system like this, failure is expected, tolerated and recovered from, with no loss of data, and without affecting the overall availability and scalability of the system at large. The following characteristics are explicitly required to address these requirements:
2.1 High Scalability

  • Ability to add nodes incrementally to support more users and data
  • Achieved via partitioning
  • Increasing number of nodes should not result in a diminishing return beyond a threshold

2.2 High Availability

  • No single point of failure
  • Data should be available even as some nodes go down
  • Achieved via replication since data is duplicated
  • Also achieved via partitioning since at least some data continues to be available

2.3 High Performance

  • Operations should return fast
  • Achieved via being main-memory oriented instead of being disk-oriented, using non-blocking writes, lower complexity algorithms, etc.

2.4 Atomicity

  • Individual writes need to be atomic – should not expose in-between state to a read operation
  • Batching of multiple writes into a single atomic unit not required

2.5 Consistency

  • Do not need strong consistency
  • Ok to have eventual consistency (read-your-writes consistency)

2.6 Durability

  • Data should be persisted to disk and not just kept in volatile memory

2.7 Deployment Flexibility

  • Addition / removal of nodes should distribute data and load automatically without requiring manual intervention
  • Should not impose constraints like requiring a distributed file system or a shared storage
  • Should not require specialized hardware
  • Should be able to work with heterogeneous hardware – identical hardware should not be required to use the system optimally (otherwise you need to upgrade all nodes to get max efficiency which is not practical in a system running on 1000s of nodes)
  • Application should be able to control the degree of consistency, durability and read / write performance it requires without being aware of the deployment model.

2.8 Modeling flexibility
Should be able to model various types of data in a simple way:

  • Key-Value pairs
  • Hierarchical data
  • Graphs

2.9 Query Flexibility

  • Multi-Gets (retrieve a bunch of values, for the set of keys provided, in one call)
  • Range queries (retrieve data based on the specified range of the key)
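These two query shapes can be captured in a toy in-memory store – an illustrative sketch, not any particular product – where keys are kept sorted so range queries stay cheap (order-preserving stores like BigTable or Scalaris achieve the same property at scale):

```python
import bisect

class OrderedKVStore:
    """Toy in-memory store supporting multi-get and range queries."""
    def __init__(self):
        self._keys = []   # kept sorted so range queries are cheap
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def multi_get(self, keys):
        """Fetch a batch of values for the given keys in one call."""
        return {k: self._data[k] for k in keys if k in self._data}

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[i:j]]
```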

3. Database Managers
A Database Manager is a software library that is loaded in-process to provide a high performance database. Its focus is typically restricted to the job of maintaining on-disk (or in-memory) data structures. These libraries are often used by distributed databases as a storage backend for managing the CRUD operations.
3.1 Berkeley DB

3.2 Tokyo Cabinet

  • http://1978th.net/tokyocabinet/
  • Written in C. Language Bindings: Java, Ruby, Perl, Lua
  • Data Model: Hash, B+ Tree, Fixed length and Table
  • Concurrency:
    • Re-entrant API functions.
    • Writers block
    • Locking granularity is per record for hash and fixed length databases, Per file for others
  • Transactions:
    • Atomic operations
    • Two isolation levels: serializable and read uncommitted
    • Durability thru write ahead logging and shadow paging

4. Key Value Stores
Key-Value stores provide the simplest possible data model. However, this comes at a cost: range queries are not straightforward (unless the database provides explicit support), and modeling applications on top of a key-value store can in general get complicated.
4.1 Amazon Dynamo

  • http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html – terrific paper, must read!
  • Distributed key-value store. Values are opaque to system. Targets objects typically < 1 MB
  • Internal to Amazon, not available for direct consumption externally
  • Partitioning
    • Uses a variant of consistent hashing that addresses
      • Non-uniform data and load distribution
      • Heterogeneity in capacity / performance of each node
    • The hash ring is divided into Q partitions – each being serviced by a virtual node. There are P physical servers, with each physical server typically responsible for Q/P virtual nodes; the exact number is decided based on the capacity of each physical node. The number of physical nodes is much smaller than the number of virtual nodes.
    • When a physical node starts for the first time, it chooses its set of tokens (virtual nodes) and maps nodes to their tokens. This mapping is stored on disk and initially contains only the local node and token set.
    • Mappings stored at different nodes are communicated to peers using a Gossip based protocol.
    • Eventually consistent view of mappings – each node connects to a randomly chosen peer every second and efficiently reconciles mappings.
  • Replication – data is duplicated on multiple hosts.
    • Each key is associated with a “Preference List” – a list of nodes to which it is propagated. This consists of the N nodes where replicas are stored, plus some more nodes for failure handling (see below)
    • Also, since number of physical nodes < virtual nodes, the preference list is built to ensure that the list contains only distinct physical nodes (by skipping positions in the hash ring). Since each node is aware of the token ranges handled by its peers (because of mappings getting propagated), this allows each node to forward a key’s Read / Write operation to the correct set of nodes directly
    • There are three configurable values:
      • N = number of nodes that store replicas of data
      • W = Min no. of nodes that must acknowledge the receipt of a Write before a Write completes.
      • R = Min no. of nodes that are contacted when a data item is to be read
      • If R + W > N, the write and read sets overlap, giving a quorum.
    • Both Read and Write requests are handled by nodes called coordinators. A coordinator node is one of the top nodes in the Preference List for the key. The selection of a coordinator node is done by either using a partition aware client, or by using a load balancer.
      • Writes: The Write (Put) request is sent to the coordinator. The coordinator stores the value locally and then sends it to N highest ranking nodes in the Preference List that are reachable (healthy). If at least W-1 nodes respond with success, the Write is considered successful.
      • Reads: A Read (Get) request is received by the coordinator. The coordinator asks for all existing versions of that key from the top N ranking nodes in the Preference List that are reachable. If at least R nodes respond, the Read is considered a success.
      • So performance would be governed by the slowest node in a R/W operation.
    • This model of configurable (N, R, W) values gives applications control on the desired performance, availability, durability and consistency
      • Increasing W would ensure that data is replicated many times increasing durability. However, since W nodes would also be required to have a successful write, availability would reduce as W increases. Performance would also reduce since the chance that one of the nodes is a low thruput node increases. Also increasing W would potentially increase number of divergent versions (see below) creating reconciliation overhead.
      • Increasing R would ensure that consistency is high, but again availability would be reduced and performance may also reduce.
      • Reducing R would improve read performance. So for a read-intensive, low updates application, (N, R, W) can be set to (3, 1, 3)
      • Typical values of (N, R, W) for Amazon Apps are (3, 2, 2)
  • Consistency – Since updates propagate to replicas asynchronously, the model is eventually consistent
    • Uses object versioning (each update results in a new immutable object) using Vector Clocks.
    • Consistency protocol: As mentioned above, a successful read requires that at least R nodes respond. Each response includes all versions of the object. If the coordinator node ends up with multiple versions of data, it does the following:
      • Returns all versions it deems to be causally un-related
      • Divergent versions are reconciled
      • Reconciled version superseding the current version is written back.
  • Handling Temporary Failures:
    • As mentioned above, the Read and Write propagation considers the first N healthy nodes, not the first N nodes in the Preference List.
    • However, when a node that is not in the top N nodes (but is in the top N healthy nodes) receives a Write request, it is given extra metadata indicating which node the write was actually meant for. This is called a Hinted Replica.
    • Each node periodically scans its hinted replica set and tries to connect to the original nodes for which the replica was meant.
    • If it is able to connect, the node tries to deliver the replica to the original node and if delivered, may delete the replica
    • This process is called Hinted Handoff
  • Handling Permanent Failures:
    • Permanent failures can arise, for example, if a hinted replica becomes unavailable before it is delivered
    • To reduce the chances of permanent failures, nodes sync replicas thru Merkle trees
    • Each node maintains a Merkel tree for the key range hosted by it
    • To sync, two nodes exchange the root of the tree common to the key ranges hosted by them, and in case there are any divergences, the apt sync action is taken.
  • Ring membership:
    • Addition / removal of node from ring is relatively expensive because of the need to rebalance partition assignment
    • Therefore addition of a node to a ring and its departure are explicit and not based on non-availability since non-availability can be because of transient conditions (say network)
    • Membership propagation and reconciliation happens during the same communication that is used for token mapping propagation and reconciliation (see partitioning above)
  • Failure Detection:
    • Failure detection is needed to avoid trying to reach unhealthy (unreachable) nodes.
    • When a node A fails to reach another node B, it propagates that information thru Gossip. Checks periodically to see if B comes back. Info on arrival of node also propagates using Gossip.
  • Persistence:
    • Pluggable model – Berkeley DB Transactional Data Store, MySQL, in-memory buffer with persistent backing store.
    • Apps choose the kind of storage they need.
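The (N, R, W) mechanics described above can be sketched as follows. This is a toy model for illustration only – real Dynamo adds vector clocks, hinted handoff, Merkle-tree sync, etc., and the md5-based ring here is just a stand-in for its consistent hashing:

```python
import hashlib

class QuorumStore:
    """Toy sketch of Dynamo-style (N, R, W) quorum replication."""
    def __init__(self, node_names, n=3, r=2, w=2):
        assert r + w > n, "R + W > N gives overlapping read/write sets"
        self.nodes = {name: {} for name in node_names}  # per-node storage
        # Place nodes on a ring, ordered by the hash of their name.
        self.ring = sorted(node_names,
                           key=lambda s: hashlib.md5(s.encode()).hexdigest())
        self.n, self.r, self.w = n, r, w

    def preference_list(self, key):
        """The N nodes responsible for this key, walking the ring."""
        h = hashlib.md5(key.encode()).hexdigest()
        i = next((i for i, nd in enumerate(self.ring)
                  if hashlib.md5(nd.encode()).hexdigest() >= h), 0)
        return [self.ring[(i + k) % len(self.ring)] for k in range(self.n)]

    def put(self, key, value, version):
        acks = 0
        for node in self.preference_list(key):
            self.nodes[node][key] = (version, value)   # replicate
            acks += 1
            if acks >= self.w:
                return True    # write succeeds after W acknowledgements
        return acks >= self.w

    def get(self, key):
        replies = [self.nodes[nd][key]
                   for nd in self.preference_list(key)[:self.r]
                   if key in self.nodes[nd]]
        return max(replies)[1] if replies else None  # newest version wins
```

Raising W in this sketch makes writes touch more replicas (more durable, slower); raising R makes reads consult more replicas (more consistent, slower) – the same knobs the text describes.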

4.2 Amazon S3

4.3 Project Voldemort

4.4 Redis

4.5 Scalaris

  • http://code.google.com/p/scalaris/
  • Distributed key-value store written in Erlang
  • No persistence, entirely memory driven. Can use a backend like Tokyo Cabinet for handling database size > RAM + Swap.
  • Partitioning – Uses a modified version of Chord called Chord# that
    • Preserves key order (allowing range queries)
    • Does routing in node space
    • Delivers proven O(Log N) performance (proof in the paper linked above)
  • Replication – Essential for being fault tolerant since there is no persistence. Uses a peer to peer replication approach called Symmetric Replication that reads and writes to a majority of nodes participating in replication.
  • Consistency: Strong consistency – uses a modified version of Paxos that provides atomic transactions: non-blocking writes that execute in parallel with strong consistency.
  • Language support: Erlang, Java, JSON RPC
  • License: Apache 2.0
  • Production Use: None mentioned

4.6 Others

5. Document Stores
Document stores are a step beyond key-value stores in that the value associated with a key is a full-blown record (document) which is not opaque to the database, but exposes a structure that allows the database to perform some operations on it. For example, the document can be a JSON serialized object.
Note that this approach is different from a relational database where the schema is defined upfront in the form of a table. In a document store, each document can have a different schema. Despite this, it is possible to describe relationships between records, just the way you do in relational databases:

  • One to Many:
    • Embed key of foreign entity in the document. Then the foreign entity can be retrieved.
    • Embed foreign entity itself in the document. Does not scale, but is faster
  • Many to Many
    • Embed keys of one type of entities in the other type of entity
    • Maintain another entity that embeds the keys of the related entities.

As should be obvious, document stores are conceptually very similar to relational databases, except that the schema definition is not upfront.
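The relationship-modeling options above can be sketched with made-up entities (plain dictionaries standing in for JSON documents):

```python
# One-to-many by reference: the post embeds only the author's key.
users = {"u1": {"name": "Meera"}}
post_by_ref = {"_id": "p1", "author_id": "u1", "title": "Hello"}

# One-to-many by embedding: faster to read, but the author data is
# duplicated in every post and can drift out of sync -- does not scale.
post_embedded = {"_id": "p1", "title": "Hello",
                 "author": {"name": "Meera"}}

# Many-to-many: each side embeds a list of the other side's keys.
students = {"s1": {"name": "Ravi", "course_ids": ["c1", "c2"]}}
courses = {"c1": {"title": "Algebra", "student_ids": ["s1"]}}

def author_of(post):
    """Resolve the author whichever way the post was modeled."""
    return post.get("author") or users[post["author_id"]]
```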
5.1 Amazon SimpleDB

  • http://aws.amazon.com/simpledb/
  • Written in Erlang
  • Data is modeled in terms of
    • Domain (a container for holding related data together)
    • Item (an entity that goes into a domain)
    • Attribute and its Value (the properties of an item)
  • So basically each item is a dictionary to which you can add / remove attributes (keys) at any time. Related items go into a Domain.
  • When an item is added, it is indexed based on its attributes (and the attributes of other items in the domain).
  • You can issue SELECT statements much the way you would issue to a relational system, with the concept of a table being replaced by a Domain.
  • Eventually consistent
  • Partitioning: The domain-item-attribute model imposes limits on size of each domain, number of attributes / domain and number of domains, so in order to scale, you need to partition the data manually.

5.2 Apache CouchDB

  • http://couchdb.apache.org/
  • Written in Erlang
  • Data and Storage Model
    • Records serialized in the form of JSON strings (documents)
    • Each document has a unique DocId
    • Uses B+ Trees: Both for main data as well as for Indexes
    • Append only operations (Every update generates a new Sequence Id for the DocId). Committed data is never over-written
      • Every write operation also results in index updates which are also append only, so a new node is added to the B+ Tree which recursively leads to Log(n) updates to the B+ tree.
      • During Compaction older versions are removed
    • Append only model means that Reads are done using Multi-Version Concurrency Control. A client can hold on to the older root of the B+ tree and get a consistent view while the data is being modified continuously.
    • Writes are serialized, except for blobs which can be concurrently written.
    • An update to a document is atomically committed with data and index updates getting synchronously flushed to disk, and the database header is updated.
    • When resuming from a crash, no recovery is required (except for recovering the database header). Partial updates are simply truncated.
  • Views
    • Required to “add structure back to unstructured and semi-structured data.”
    • Can have multiple views for the same data
    • Defined using a JavaScript function that maps keys to values in a special type of document called a Design Document.
    • View functions are run when you query a view: the system takes the source code and runs it against every document in the database. There are basically two kinds of functions:
      • Map function: takes a single parameter – the document – a single document in the system, does some operation on it (can’t modify the document) and returns a result (mapping an input to a result).
      • Reduce function: takes a list of key-values and returns a single result after performing some aggregate operation on it (reducing multiple inputs to a single result)
    • View results are stored in a B+ tree and unless the document changes or the view definition changes, the results are not re-computed and fetched straight from the B+ tree.
  • Replication
    • Master-slave with multi-master support
    • Only latest revision replicated
  • Conflict Handling in Multi-master
    • CouchDb detects conflicts and uses a deterministic algorithm for picking a “winner”. Deterministic means that all replicas end up winning the same version without talking to each other.
    • This winner may or may not be the one your app expected. This will have to be fixed by the app itself.
  • Partitioning and Load Balancing - Provided thru CouchDb Lounge
    • Uses consistent hashing to partition the data
    • Dumbproxy used for all requests except View requests. Implemented as an Nginx module
    • Smartproxy used for View requests only
      • Twisted python based daemon
      • Sends view requests to all shards
      • Merges responses before sending them back
      • Merges happen in constant memory space
  • License: Apache 2.0
  • Production Use: BBC, PylonsHQ, GPirate, and several others
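CouchDb view functions are written in JavaScript inside a design document; the sketch below emulates the same map / reduce shape in Python purely for illustration (the function names and sample documents are made up):

```python
docs = [
    {"_id": "1", "type": "post", "author": "ana", "words": 120},
    {"_id": "2", "type": "post", "author": "bo",  "words": 80},
    {"_id": "3", "type": "post", "author": "ana", "words": 40},
]

def map_fn(doc):
    """Like a CouchDb map function: emit (key, value) pairs per document.
    It may read the document but never modify it."""
    if doc["type"] == "post":
        yield (doc["author"], doc["words"])

def reduce_fn(values):
    """Like a CouchDb reduce function: aggregate the mapped values."""
    return sum(values)

def run_view(docs, map_fn, reduce_fn=None):
    """Run the map function against every document, then optionally
    group by key and reduce -- the shape of a view query."""
    rows = [kv for d in docs for kv in map_fn(d)]
    if reduce_fn is None:
        return sorted(rows)
    grouped = {}
    for k, v in rows:
        grouped.setdefault(k, []).append(v)
    return {k: reduce_fn(vs) for k, vs in grouped.items()}
```

In CouchDb itself the rows produced this way are stored in a B+ tree and only re-computed for changed documents, which is what makes view queries cheap after the first run.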

5.3 Riak

  • http://riak.basho.com/
  • Written in Erlang. Uses the same architecture and algorithms as Amazon Dynamo.
  • Client libraries
  • Data Model
    • Key-Value. Value is a document.
  • Riak nodes go on an ordered consistent hash ring
    • A 160-bit binary hash of each key maps to a position on the ring.
    • Ring divided into partitions. Each partition is designated an owner called v-node. The v-node is said to claim a partition.
    • Nodes try to run an equal number of v-nodes, so each node is responsible for roughly (number of partitions) / (number of nodes) v-nodes. So if 2 nodes define a 16-partition cluster, each node would run 8 v-nodes.
  • Writes (Puts)
    • Any node may be chosen as coordinator for the Write request
    • Coordinator node figures out the v-node responsible for the partition where the key needs to go.
    • Coordinator sends Write request to that v-node as well as next N-1 partitions in the ring, where N is a configurable parameter that defines how many copies of the value need to be created.
    • Write request may also specify that at least W (<= N) of the v-nodes respond with success, and that DW (<= W) of those W nodes respond with success only after durably storing the value.
  • Reads (Gets)
    • Any node may be chosen as a coordinator for the Read request
    • Coordinator node figures out the v-node responsible for the partition for the key
    • Sends request to v-node, as well as next N-1 nodes
    • Request can also specify R (<= N) nodes that must reply before a response is returned.
  • Ring state propagation
    • Consists of things like arrival and departure of nodes on ring
    • Done using Gossip protocol
  • Eventual Consistency
    • When an object is added, it is tagged with a vector clock specifying its initial version
    • For each update, the clock is updated in such a way that the system can compare two versions of an object and determine how the two objects are related (direct descendant, common parents, unrelated)
    • To compare object versions, a Merkle tree is used
  • Failure handling is done using the same hinted replicas and hinted hand-offs approach as described in the Dynamo paper.
  • Storage model – pluggable:
  • Queries
    • Map-reduce model
    • Map operations run local to the data on the hash ring – can be run in parallel
    • Reduce operations take inputs from the preceding phase and reduce (typically perform an aggregate operation) the input. This cannot be parallelized
  • License: Apache 2.0
  • Production use: No names mentioned, except that the FAQ says that “Riak has robustly served the needs of multiple revenue-generating applications for nearly two years now.”
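The vector clock comparison described above can be sketched as follows – an illustrative implementation, not Riak's actual code. Each clock is a map from node name to a counter; comparing two clocks tells the system whether one version descends from the other or whether they are causally unrelated and must be reconciled by the app:

```python
def vc_merge(a, b):
    """Merge two vector clocks, taking the max counter per node."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'descends', 'precedes', 'equal' or 'conflict'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if b_le_a:
        return "descends"      # a is a direct descendant of b
    if a_le_b:
        return "precedes"      # a is an ancestor of b
    return "conflict"          # causally unrelated: app must reconcile

v1 = {"node_a": 1}
v2 = vc_merge(v1, {"node_a": 2})    # later update seen by node_a
v3 = vc_merge(v1, {"node_b": 1})    # concurrent update at node_b
```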

5.4 MongoDB

  • http://www.mongodb.org/
  • Written in C++
  • Language bindings (drivers): C, C++, Java, JavaScript, Perl, PHP, Python, Ruby. Also community provided support for C#, F#, Erlang, Groovy, etc.
  • Data Model
    • Collections – similar to Domains in Amazon SimpleDb – a bucket to hold together similar documents
    • Key-Value, with value being binary serialized JSON (BSON)
      • 4 MB limit on a BSON object
      • Large object support via GridFS
    • B-Trees used for indexes. You specify the fields on which to index using a function called db.things.ensureIndex() that takes the field as a parameter.
  • Storage: Uses memory mapped files. So caching is controlled by the OS VMM
  • Writes
    • In-place updates
    • Provides partial updates – you can update part of a value without sending the full document
    • Single document Atomic updates. Atomic batch updates possible thru db.eval() but may not be supported with partitioned data

  • Queries
    • Provides a JSON-style syntax to pick values from inside documents
    • Support for conditional operators, regular expressions, etc.
    • Supports forward-only cursors
    • Query optimizer is not cost-based, instead multiple query plans are tried and the one that works best is picked.
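To make the JSON-style query syntax concrete, here is a sketch of how such a query might be matched against documents. The operator names mirror MongoDB's (`$gt`, `$lt`), but the matcher itself is my own illustration, not Mongo's implementation.

```python
# Illustrative matcher for a MongoDB-style JSON query (assumption:
# only exact matches and the $gt/$lt conditional operators are shown).

def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):                 # conditional operators
            value = doc.get(field)
            if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                return False
            if "$lt" in cond and not (value is not None and value < cond["$lt"]):
                return False
        elif doc.get(field) != cond:               # exact-value match
            return False
    return True

docs = [{"x": 3}, {"x": 7}, {"x": 12}]
hits = [d for d in docs if matches(d, {"x": {"$gt": 5, "$lt": 10}})]
# hits == [{"x": 7}]
```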
  • Batch Operations
  • Replication:
  • Partitioning:
    • Partitioning is in alpha stage.
    • Data sharded by Collection with order preservation
    • A shard consists of one or more servers replicating using an enhanced version of replica pairs
    • Shards deal with data in units called Chunks.
      • A chunk is a contiguous range of data from a collection with a max limit of 50 mb. After that a chunk splits into two chunks
      • Data migrates from one shard to another in chunks (e.g. when a shard ends up with excess data relative to the others)
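The chunk-splitting behaviour above can be sketched as follows. The 50 mb threshold comes from the description; everything else (the chunk representation, splitting at the median key) is my own simplification.

```python
# Illustrative chunk split: a chunk is a contiguous, ordered key range;
# once it exceeds the max size it splits into two chunks at the median
# key. (Sketch only – not MongoDB's actual balancer logic.)

MAX_CHUNK_BYTES = 50 * 1024 * 1024       # 50 mb, per the description

def maybe_split(chunk):
    """chunk = (ordered_keys, size_bytes); returns one or two chunks."""
    keys, size = chunk
    if size <= MAX_CHUNK_BYTES:
        return [chunk]
    mid = len(keys) // 2                 # keys stay in order, so the range
    return [(keys[:mid], size // 2),     # split preserves contiguity
            (keys[mid:], size - size // 2)]

big = (["a", "b", "c", "d"], 60 * 1024 * 1024)
parts = maybe_split(big)                 # splits into two 30 mb chunks
```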
  • License: AGPL 3.0
  • In Production use at: Sourceforge, Github, EA, and several others

6. Column-Family Stores
6.1 BigTable

  • http://labs.google.com/papers/bigtable.html
  • Internally used at Google. Exposed to general public thru Google App Engine
  • Building Blocks
    • Google File System – a distributed file system. Provides raw storage
      • Files broken into chunks (typically 64 mb)
      • Each chunk is replicated across 3 machines for safety.
        • A heuristic here is that one of the replicas tries to be on the same rack as the original and the other two replicas somewhere else
      • Chunk Servers – responsible for storing chunks
      • Data transfer happens directly between clients and chunk servers
      • Master – responsible for metadata in the file system
    • The storage file format is called SSTable (sorted strings table)
      • Persistent, ordered, immutable map of keys to values with both keys and values being arbitrary byte strings
      • Operations to look up the value specified with a key and to iterate over key/value pairs in a given key range
      • Optionally, an SSTable can be completely mapped into memory
      • Immutability of SSTables:
        • Means no synchronization required during file access.
        • Makes removing permanently deleted data a garbage collection exercise (happens in a major compaction – see below)
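The SSTable abstraction above can be sketched in a few lines: an immutable, sorted map supporting point lookup and range iteration. The class and method names are mine; Google's on-disk format is of course far more involved.

```python
# Sketch of the SSTable idea: immutable, sorted map of byte-string
# keys to values, with point lookup and key-range iteration.

import bisect

class SSTable:
    def __init__(self, items):
        # Sorted once at construction and never mutated afterwards,
        # so concurrent readers need no synchronization.
        self._items = sorted(items)
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def scan(self, start, end):
        """Iterate key/value pairs with start <= key < end."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._items[i]
            i += 1

t = SSTable([(b"banana", b"2"), (b"apple", b"1"), (b"cherry", b"3")])
```

Immutability is what makes the later compaction story clean: deletes are just tombstone entries until a major compaction rewrites the files.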
    • Scheduler
      • Schedules jobs on machines
      • Watches for failures and re-schedules if required
    • Chubby Service – Distributed lock / file / name manager
      • Coarse grained lock – can store small amount of data in lock
      • 5 replicas, 1 master. Need majority vote to be active. Uses Paxos to keep replicas in sync
      • Provides a namespace that consists of directories and small files. Each file / directory can be used as a lock, and reads and writes to a file are atomic.
      • Each chubby client maintains a session with a chubby service. There is a lease time which needs to be renewed.
      • Clients can register callbacks on Chubby files and directories for notification of changes / session expiration
    • Optional Services used by BigTable applications (but not used by BigTable itself)
      • Map-Reduce: Batch processing – clients often use this to read / write BigTable data
      • Sawzall: to support execution of client provided scripts in the server space for transformation, filtering and summarization
  • Data Model
    • Multi-dimensional map (map of maps)
    • Data organized into tables. A table is an entity that holds related data
      • Each record is identified by a row key
      • Each row has columns. Columns here should be thought of as attributes of a row. Not all rows would have the same columns
      • Intersection of a row and a column creates a cell
      • A cell has versioned data based on timestamp. So a single cell may hold multiple versions.
      • It is convenient to organize attributes. Thus columns are grouped into column families. A column family can be thought of as a namespace for a set of related attributes.
    • So the dimensions of the multi-dimensional map are:
      • Row key: Uniquely identifies a datum (arbitrary byte string)
      • Column-family: A grouping of attributes – a namespace for a set of related attributes.
      • Column: A single attribute.
      • Timestamp: Denoting the time when the value was changed.
        • Different versions are stored in decreasing order of timestamp, so most recent versions can be read first
        • Client can specify that only last n-revisions of cell should be kept, or only values written in the last n-days should be kept
    • Transactional access possible to any column within a single row
      • Can perform atomic read-modify-write sequences on data stored under a single row key
      • Transactions across row keys not supported
      • Batch operations across row keys are possible but are not atomic
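The versioned-cell behaviour described above (versions kept in decreasing timestamp order, with a keep-last-n garbage-collection policy) can be sketched like this. The `Cell` class and its names are my own illustration.

```python
# Illustrative versioned cell (not BigTable's code): versions are kept
# newest-first by timestamp, and only the last n revisions survive.

class Cell:
    def __init__(self, max_versions=3):
        self.versions = []                          # (timestamp, value)
        self.max_versions = max_versions

    def write(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: -tv[0])   # decreasing timestamp
        del self.versions[self.max_versions:]       # GC older revisions

    def read(self):
        # Most recent version is first, so reads are cheap.
        return self.versions[0][1] if self.versions else None

c = Cell(max_versions=2)
c.write(100, "v1")
c.write(200, "v2")
c.write(300, "v3")          # "v1" is garbage collected
```

A keep-last-n-days policy would work the same way, just with a timestamp cutoff instead of a count cutoff.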
  • Tablets
    • Data is sorted lexicographically by Row key
    • Row range for a table is dynamically partitioned.
    • A row range is called a Tablet. So you can describe a tablet in terms of a start row key and an end row key
    • One tablet = 100 to 200 mb
      • Tablets can be split based on tablet size or load
    • A tablet server may end up serving anywhere from 10 to 1000 tablets (typically hundreds).
      • While each tablet consists of contiguous data, a tablet server would pick up tablets spread across the entire data set and not data that is related. This helps alleviate hot spots since load gets spread evenly
      • The high ratio of tablets to tablet servers gives very fast recovery since if a machine that was serving 100 tablets fails, 100 machines can each pick up 1 tablet from the failed machine making recovery highly parallelized
      • Also gives fine-grained load balancing – you can migrate tablets away from an overloaded machine.
    • Servers are further grouped into cells
      • A cell is a collection of servers. Most cells have 10-20 servers
      • Some cells have as many as thousands of servers managing several hundred TB of data
  • A typical cluster consists of
    • Commodity Intel based PC hardware with each node running
      • Google’s version of Linux
      • GFS Chunk server – for raw reads and writes
      • Scheduler slave process – runs jobs
      • Tablet server – serves BigTable read / write requests
      • Some nodes may also run a BigTable master which is responsible for meta-data requests (create a new table) and load monitoring / balancing
    • GFS Master
    • Cluster scheduling master – talks to the scheduler slave process on each node to startup jobs
    • Lock service – used for electing BigTable masters and for bootstrapping the tablet lookup (below)
  • Locating Tablets
    • To figure out which server is serving which tablet (row key range), this mapping info is kept in special metadata tables (which in turn can split into tablets)
    • Three level hierarchical lookup with bootstrap provided by Chubby
      • Chubby file points to a root metadata tablet – this tablet is never split
      • This tablet points to other metadata tablets
      • Each 2nd level metadata tablet points to app data tablets
      • Each metadata row is an encoding of the tablet’s table id and end row key, and takes approx 1 kb
      • With a 128 mb limit on metadata tablets, this scheme can address up to 2^34 tablets (or 2^61 bytes in 128 mb tablets)
    • The lookup is aggressively pre-fetched and cached by clients so most queries end up going straight to servers
      • Empty cache: There are 3 network round trips including one read from Chubby
      • Stale cache: could take up to 6 round trips since stale entries are discovered via misses
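The addressing arithmetic above checks out: with ~1 kb metadata rows and 128 mb metadata tablets, each metadata tablet indexes 2^17 tablets, and two metadata levels multiply out to 2^34 addressable tablets.

```python
# Verifying the metadata addressing arithmetic from the text.

row_size = 2**10                          # approx 1 kb per metadata row
tablet_limit = 128 * 2**20                # 128 mb metadata tablet

rows_per_tablet = tablet_limit // row_size         # 2^17 rows per metadata tablet

# Two metadata levels: the root tablet points at 2^17 metadata tablets,
# each of which points at 2^17 user tablets.
addressable_tablets = rows_per_tablet ** 2         # 2^34 tablets

# At 128 mb per user tablet, that is 2^34 * 2^27 = 2^61 bytes.
addressable_bytes = addressable_tablets * tablet_limit
```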
  • Tablet Assignment
    • Each tablet assigned to one server at a time. Master keeps track
    • When a tablet gets unassigned, the Master asks a Tablet server with available capacity to take over the tablet
    • Tablet assignment uses Chubby
      • On starting, a tablet server creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory called Servers
      • Master monitors this directory to discover tablet servers
      • If a tablet server loses its exclusive lock it stops serving the tablets
      • Tablet server may try and re-acquire the lock if the file still exists, however if the file is gone, the tablet server kills itself
    • Master asks each tablet server periodically for the status of its lock
      • If tablet server reports a loss, or if master fails to connect to tablet server, the master tries to acquire an exclusive lock on the tablet server’s file
      • If master is able to acquire lock, it means Chubby is alive. So master deletes the file so that the tablet server can never serve again, and the master moves all tablets assigned to the tablet server to the set of unassigned tablets
      • If master is unable to reach Chubby, it kills itself
  • Writes
    • When a write request goes to a tablet server, the server checks it for well-formedness and authorization (Chubby maintains a list of permitted writers)
    • Valid mutations are written to a commit log (grouped for small mutations)
    • After the write has been committed, its contents are inserted into a sorted in-memory buffer called memtable. From here, updates keep going to a sequence of SSTables as they get older
  • Reads
    • Similar checks as for a write request – well-formedness and authorization
    • Read op executed on a merged view of SSTable sequence and memtable
  • Tablet Recovery
    • May be required if a Tablet server dies
    • Tablet server reads its metadata from the MetaData table – list of SSTables and set of redo points which point to commit logs
    • Server reads indices of SSTables into memory and reconstructs memtable by applying all updates that have been committed since redo points
  • Compactions
    • Minor Compactions:
      • As writes occur, memtable grows till a threshold. Is frozen. New memtable created. Frozen memtable converted to SSTable and written to GFS.
      • Shrinks memory usage of tablet server
      • Reduces amount of data that needs to be read from Commit log during recovery
    • SSTable Compactions
      • As minor compactions occur, new SSTables get created. This can lead to a Read operation slowing down as it may need to merge data from multiple SSTables
      • So periodically a merging compaction runs which reads a few SSTables and the Memtable and writes out a new SSTable
      • This does not remove special deletion entries (used to suppress deleted data)
    • Major Compactions
      • From time to time BigTable runs a compaction that rewrites all SSTables into exactly one SSTable. This is called a Major compaction
      • During this compaction deleted data / deletion entries are removed.
      • Required for reclaiming resources
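The write path and minor compaction described above can be sketched as follows. This is an illustration of the structure only (commit log, memtable, frozen SSTables, merged reads), not BigTable's code; the class and threshold are my own.

```python
# Sketch of the BigTable-style write path: commit to the log first,
# then update the in-memory memtable; when the memtable crosses a
# threshold, freeze it and flush it as an immutable sorted table.

class TabletStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []             # for durability / recovery
        self.memtable = {}               # mutable in-memory buffer
        self.sstables = []               # frozen, immutable flushes
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))      # commit log first
        self.memtable[key] = value                # then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self._minor_compaction()

    def _minor_compaction(self):
        # Freeze the memtable into a sorted, immutable table and start fresh.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Merged view: memtable first, then SSTables newest-to-oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

s = TabletStore()
s.write("a", 1)
s.write("b", 2)        # crosses the threshold: triggers a minor compaction
s.write("a", 9)        # the newer value lives in the fresh memtable
```

A merging compaction would correspond to collapsing several entries of `sstables` into one; a major compaction collapses all of them (and drops tombstones).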
  • Refinements
    • Locality Groups: Are used to group together parts of data that have similar usage characteristics.
      • Client can group multiple column families into a locality group
      • Separate SSTable generated for each locality group
      • Locality group can be marked as being in-memory (lazy loaded). BigTable itself uses it for the Location column-family in metadata table.
    • Compression: Clients can specify whether SSTables for a locality group are compressed, and which compression algorithm to use
    • Caching: Tablet servers use two levels of caching:
      • Scan cache – higher level. Caches key-value pairs returned by SSTable interface to Tablet server code. Useful for apps that tend to read the same data repeatedly
      • Block cache – lower level. Caches SSTable blocks read from GFS. Useful for apps that read data close to data they recently read

6.2 Cassandra

  • http://incubator.apache.org/cassandra/
  • Combines the distributed architecture of Dynamo with BigTable’s column-family data model
  • Written in Java. Thrift based interface. High-level libraries available on Ruby, Perl, Python, Scala and Java
  • Data Model:
    • Multi-dimensional map indexed by a key like BigTable.
      • Each application creates its own key-space (table in BigTable)
      • Key – a name for a record. Can be an arbitrarily long string. Indexed by Cassandra
      • Column – an attribute of a record. Smallest datum you can get hold of. This is timestamped.
      • Column-family: Grouping of columns. Similar to a relational table.
      • Super-Columns: A list of columns
      • A column-family can either contain columns or super-columns, but not both. A column family containing super-columns is called a super-column-family
      • The way to get to a piece of data is: KeySpace.ColumnFamily.Key.[SuperColumn].Column
    • Sorting
      • Data is sorted at write time (unlike relational)
      • Columns are sorted within their row by column-name. The sorting provider is pluggable
  • Partitioning: Similar to Dynamo
    • Consistent hashing using an order preserving hash function.
    • However uses the Chord approach to load balancing rather than Dynamo’s v-node approach
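The consistent-hashing placement above can be sketched like this. Note the hedges: for simplicity this uses MD5, which is not order-preserving (Cassandra's order-preserving partitioner is what makes range scans possible), and the `Ring` class and its names are mine.

```python
# Minimal consistent-hashing ring in the spirit of the description
# (illustrative; Cassandra's real partitioner differs, and this hash
# is NOT order-preserving). Each node owns the arc of the ring between
# its predecessor's token and its own.

import bisect
import hashlib

def ring_position(key: str, ring_size=2**32) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % ring_size

class Ring:
    def __init__(self, nodes):
        self.tokens = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        pos = ring_position(key)
        points = [t for t, _ in self.tokens]
        # First node at or after the key's position, wrapping around.
        i = bisect.bisect_right(points, pos) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-row-key")    # deterministic placement
```

Adding or removing a node only moves the keys on one arc of the ring, which is the property that makes membership changes cheap.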
  • Replication
    • Same concept of coordinator nodes and preference lists as Dynamo
    • Various replication policies: Datacenter aware, Rack-aware, Rack-unaware
    • Rack-unaware approach is same as Dynamo (replicate to N-1 nodes)
    • Rack aware and Datacenter aware use a more intricate algorithm which involves a leader (chosen using Zookeeper) that maintains the N-1 replicas invariant. This is apparently not there in Apache Cassandra.
  • Membership: Membership based on Scuttlebutt – anti-entropy Gossip based mechanism. Same concept of explicit membership changes as Dynamo
  • Failure detection: Uses a modified version of Φ Accrual Failure Detector which gives a probabilistic value for each monitored node. This apparently is fairly accurate and more representative of network conditions.
  • Failure Handling: Same concept of hinted replicas and hinted hand offs as in Dynamo
  • Persistence:
    • Relies on local file system (unlike BigTable that uses GFS)
    • Storage structure and approach is similar to BigTable: SSTables, Memtables, Commit logs, Compaction, Bloom filters, etc.
  • Writes
    • Similar to the BigTable approach of writing to commit log for durability and recoverability, followed by update to Memtable
    • Dedicated disk on each machine for commit log makes the writes sequential and thus maximizes thruput
    • When memtable crosses a threshold (out of space / too many keys / client provided time duration), it gets written to two files:
      • Data file – SSTable
      • Index File – key/offset pair
    • No seeks – always sequential, thus quite fast
    • Atomic within a column family
  • Reads
    • Similar approach as Dynamo for figuring out which nodes will serve, read-repair, etc.
    • At a storage level, approach is similar to BigTable: read from memtable + SSTable sequence
  • License: Apache 2
  • In Production at: Facebook, Digg, Rackspace
