NoSQL Databases – Part 1 – Landscape

[toc]

At Directi, we are taking a hard look at the way our applications need to store and retrieve data, and whether we really need to use a traditional RDBMS for all scenarios. This does not mean that we will eschew relational systems altogether. What it means is that we will use the best tool for the job – we will use non-relational options wherever needed and not throw everything at a relational database with a mindless one-size-fits-all approach.
This post covers the current landscape of the NoSQL space. In a subsequent post, I intend to cover in more detail the various problem areas addressed by NoSQL systems and the specific algorithms used.
Caveat: though I have tried quite hard to understand each database, most of the systems discussed are quite comprehensive, and summarizing their capabilities in a few bullets cannot do them justice. It is also possible that my interpretation may be wrong in some cases, since I haven’t actually used these systems yet.
Summary
The need to look at non-SQL systems arises out of scalability issues with relational databases. These issues stem from the fact that relational databases were not designed to be distributed (which is key to write scalability), and could therefore afford to provide abstractions like ACID transactions and a rich, high-level query model.
NoSQL databases address the scalability issue in various ways – by being distributed, by providing a simpler data / query model, by relaxing consistency requirements, and so on. I find it useful to bucket them as follows:

Distributed vs. Not-distributed: Distributed databases take the responsibility of data partitioning (for scalability) and replication (for availability) and do not leave that to the client.

Distributed:

  • Amazon Dynamo
  • Amazon S3
  • Scalaris
  • Voldemort
  • CouchDb (thru Lounge)
  • Riak
  • MongoDb (in alpha)
  • BigTable
  • Cassandra
  • HyperTable
  • HBase

Not Distributed (responsibility on client):

  • Redis
  • Tokyo Tyrant
  • MemcacheDb
  • Amazon SimpleDb

The model of distribution that seems most compelling here is the one used by Dynamo. Indeed, it is copied by Voldemort, Riak and Cassandra – three very different kinds of stores. It is compelling because it gives an application simple knobs to tune its expectations of durability, read performance, consistency, write performance, etc., which makes it very general purpose. The other reason this model is good is that it allows heterogeneous hardware to be used efficiently (though Cassandra departs from this).

Data Model richness: The other key distinction is in terms of richness of data model:

Key-Value store:

  • Amazon Dynamo
  • Amazon S3
  • Redis
  • Scalaris
  • Voldemort

Document store:

  • Amazon SimpleDb
  • Apache CouchDb
  • MongoDb
  • Riak

Column-Store:

  • Cassandra
  • Google BigTable
  • HBase
  • Hypertable

On one end of the spectrum are the simple key-value stores: Dynamo, S3, Redis, Scalaris, Voldemort, etc. At the other end are the column-stores, which provide a very rich model: BigTable, Cassandra, Hypertable and HBase fall into this bucket. This richness comes at a price – the model is not simple, and you need to think about data modeling from the ground up. Somewhere in between are the document stores – a sort of schema-free cousin of relational databases: CouchDb, Riak, MongoDb, etc. They are simple to understand and richer than plain key-value stores.

Disk vs. Memory: A third useful dimension is whether the database is memory-driven or disk-driven. This matters because a disk-driven database needs an explicit cache, while a memory-driven one is not durable:

Memory:

  • Scalaris
  • Redis

Configurable:

  • BigTable
  • Cassandra
  • HBase
  • HyperTable

Disk:

  • CouchDb
  • MongoDb
  • Riak
  • Voldemort
On one end of the spectrum is Scalaris, which is entirely memory-driven, and Redis, which is primarily memory-oriented (you can do background snapshots). Cassandra, BigTable, Hypertable and HBase allow configuring how large the Memtable can get, which provides a lot of control. The document stores – CouchDb, MongoDb and Riak – all seem to use on-disk B+ trees, and Voldemort uses BDB and MySQL. Having a pluggable storage engine that can be in-memory does not make a database configurable – in that scenario you are entirely memory-driven, with no durability at all!

So which one wins? Well, I am biased towards the database solving the scalability issue by taking over the responsibility of partitioning instead of leaving that problem with me. So theoretically, Cassandra seems to combine the best of both worlds: the sophisticated distribution model of Dynamo with the richness of a column-store. Also, it is in heavy production use despite still being in beta. However, I believe the learning curve here would be higher for modeling data, and it may make sense to opt for the simplicity of a Voldemort for most cases. Ultimately it would depend on your app requirements and development team.
1. Why Non-Relational
Some of this may not resonate with you if you are an enterprise developer, since enterprise apps don’t have to deal with the gigantic scale that (some) consumer web applications deal with. However, given the rate at which data is growing and the number of users entering IT systems, these issues will only become more common – for smaller consumer apps as well as for enterprise apps. In fact, even today, irrespective of the scale at which your app operates, if you want to take advantage of a cloud platform like Google App Engine, Microsoft Azure or Amazon Web Services, you may need to think about some of the issues below, because at the infrastructure level these platforms do have to handle high scale, and they may impose constraints on the application / data model to help them scale.
1.1 Relational databases are hard to scale
1.1.1 Replication - scaling by duplication

  • Master-Slave:
    • Each write results in N writes, where N is the number of slaves. This leads to diminishing returns beyond a point, thus imposing a limit on scale
    • While reads get faster (since you can now read from N nodes), writes are still bottlenecked on one node
    • Critical reads still need to go to the master, since the write may not have propagated to all nodes. This logic needs to be built into the app.
    • High data volumes pose a problem since you need to duplicate the data N times. This also limits how much you can scale with this approach.
  • Multi-Master

1.1.2 Partitioning (sharding) – scaling by division:

  • Scales reads as well as writes
  • Not transparent to the application. The application needs to be partition aware.
  • The value of an RDBMS is in relations. Once partitioned, these relations get broken – you cannot do a join across shards – this now needs to be done in the app layer.
  • In general, manual sharding in relational databases is not simple.
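To make the partition-awareness burden concrete, here is a minimal sketch – the shard names and hashing scheme are made up purely for illustration – of what the app layer ends up owning once you shard manually:

```python
import hashlib

# Hypothetical shard identifiers -- illustrative only.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2"]

def shard_for(user_id: str) -> str:
    """The application, not the database, must route every query."""
    digest = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

def users_by_shard(user_ids):
    """A cross-shard 'join' now lives in the app layer: group keys by
    shard, query each shard separately, combine results in app code."""
    routing = {}
    for uid in user_ids:
        routing.setdefault(shard_for(uid), []).append(uid)
    return routing
```

Every query path in the app must now go thru this routing logic, and anything resembling a cross-shard join must be assembled by hand.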

1.2 Don’t need some features
1.2.1 UPDATEs and DELETEs

  • Typically not used since that leads to loss of information
    • May need the record for auditing, or for re-activation
    • Typically, the info is never really “deleted” from a domain perspective anyway
      • A user “leaves” a community – his posts would not be removed
      • Employees “leave” companies – their employment record is maintained
      • The canonical ACID transactions example: debits from a bank account – this is not a DELETE, but an INSERT
    • Nor is info just “updated”
      • Salary gets “incremented” – the previous record is maintained
      • Your bank account balance is not updated – there are “debits” and “credits”
  • So one can typically model an UPDATE / DELETE as an INSERT and version the record.
    • When data gets too large, archive inactive parts of data
  • Two problems that arise when you go for an INSERT-only system:
    • The database cannot help you with cascades thru triggers - this needs to be done explicitly in the app layer
      • The cascades are actually far more complex than propagating a DELETE / UPDATE – this is a domain requirement:
        • When an employee leaves, you need to update the payroll system so that full and final compensation can be carried out
        • Everytime a bank account gets debited, checks need to be made on standing instructions, minimum account balance, etc.
    • Queries need to filter out the inactive records
      • Can lead to dirty looking code – addressed using views
      • There would be some perf penalty that can be addressed by archival
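The INSERT-only approach described above can be sketched as an append-only log of versioned records, where reads filter out superseded and “deleted” versions (the key names and record shapes here are made up for illustration):

```python
from itertools import count

_version = count(1)
events = []  # append-only log: committed records are never overwritten

def put(key, data, deleted=False):
    """Model an UPDATE or DELETE as an INSERT of a new version."""
    events.append({"key": key, "version": next(_version),
                   "deleted": deleted, "data": data})

def get(key):
    """Reads must filter out superseded / inactive versions."""
    versions = [e for e in events if e["key"] == key]
    if not versions:
        return None
    latest = max(versions, key=lambda e: e["version"])
    return None if latest["deleted"] else latest["data"]

put("acct:1", {"balance": 100})
put("acct:1", {"balance": 70})                 # a "debit" is a new INSERT
put("user:9", {"name": "bob"})
put("user:9", {"name": "bob"}, deleted=True)   # a "delete" is also an INSERT
```

Note that the full history stays in the log, available for auditing or re-activation; archival would move old versions out when the data grows too large.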

1.2.2 JOINs

  • Why avoid
    • Joins are expensive when data volumes are high since the database server has to perform complex set operations over large volumes of data
    • Do not work across partitions
    • Techniques like Materialized / Indexed Views not supported by all databases
  • How to avoid? De-normalize!
    • Purpose of normalization
      • Make it easier to have consistent data by keeping just one copy
      • Reduce the amount of storage
    • With De-normalization
      • Burden of consistency shifts from the database to the application layer
      • Easier if you only do INSERTs and no UPDATEs / DELETEs
      • Would lead to data bloat – can be significant for large volumes, but storage is cheap and you can archive inactive data
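A small illustrative example of the trade-off (entity names are made up): the normalized form keeps one copy of the customer and “joins” at read time; the denormalized form copies customer fields into every order, so reads need no join but the app must keep the copies consistent.

```python
# Normalized: one copy of the customer; orders reference it by key.
customers = {"c1": {"name": "Asha", "city": "Mumbai"}}
orders = [{"id": "o1", "customer_id": "c1", "total": 250}]

# Denormalized: customer fields copied into each order -- faster reads,
# duplicated data, and the consistency burden moves to the app layer.
orders_denorm = [{"id": "o1", "total": 250,
                  "customer_name": "Asha", "customer_city": "Mumbai"}]

def order_view_normalized(order_id):
    order = next(o for o in orders if o["id"] == order_id)
    cust = customers[order["customer_id"]]   # the "join"
    return {**order, "customer_name": cust["name"]}

def order_view_denorm(order_id):
    # Single lookup; everything needed is already in the record.
    return next(o for o in orders_denorm if o["id"] == order_id)
```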

1.2.3 ACID Transactions

  • Atomic – do not need atomicity on modification of more than one record. Single key atomicity is enough
  • Consistency – CAP theorem: you can get any two of Consistency, Availability and Partition tolerance – not all three. (Also see http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.41.1915)
    • Most systems need partition tolerance and availability ahead of consistency.
      • Customer wants to place an order – you will accept the order, not return the money saying the system is unavailable – availability is important
      • Inventory would be checked asynchronously
      • Order details would be checked asynchronously
      • … would be done asynchronously
      • All this while data would be in an inconsistent state
      • This is ok – businesses are like that. They do not operate on a single version of truth. Reconciliation happens all the time.
    • Therefore our data model need not be strongly / strictly consistent. We can do with Eventual Consistency.
    • In most scenarios we need Read-Your-Writes Consistency and Monotonic Reads Consistency (as defined by Vogels in the paper above)
    • Strong consistency relies upon conflict resolution at write time to keep read complexity simpler. This does not scale.
  • Isolation – do not need isolation beyond Read-Committed. Easy with single key atomicity (above)
  • Durability – need durability till the time that RAM becomes cheap enough that one can afford many peer replicated nodes holding data in memory so that data is available even with node failures.

1.2.4 Fixed Schema

  • In an RDBMS you have to define the schema before you can start using data (somewhat like declaring types in statically typed languages)
    • Define each entity (table), its attributes (columns) and relations between entities
    • Define usage patterns (indexes)
  • Modifying schemas is essential
    • Intense competition and rapid growth necessitate adding new features / tweaking existing features rapidly
    • Changes usually require modifying the data model, thus precipitating schema changes
  • Modifying the schema is hard
    • Adding / Deleting / Modifying a column may lock the rows (imagine doing this for several million rows)
    • Adding / Removing an index may lock up the table

1.3 Don’t get some features

  • Hard to model hierarchical data
  • Hard to model graphs
  • Don’t rely primarily on main memory
    • Preferable to avoid going to disk as far as possible and serve out of main memory to get faster response time
    • Most relational systems are not memory oriented, but disk-oriented. Even with large main memory, relational databases end up going to disk for most queries – they are not aggressive about serving data from main memory and avoiding going to disk.
    • Vendors are trying to address this by acquiring / building in-memory database technology, but this is far from mainstream

2. Desired Characteristics
The environment expected is that the system would be spread over 100s to 1000s of commodity machines (hence called nodes) with different capacities. In a system like this, failure is expected, tolerated and recovered from, with no loss of data, and without affecting the overall availability and scalability of the system at large. The following characteristics are explicitly required to address these requirements:
2.1 High Scalability

  • Ability to add nodes incrementally to support more users and data
  • Achieved via partitioning
  • Increasing number of nodes should not result in a diminishing return beyond a threshold

2.2 High Availability

  • No single point of failure
  • Data should be available even as some nodes go down
  • Achieved via replication since data is duplicated
  • Also achieved via partitioning since at least some data continues to be available

2.3 High Performance

  • Operations should return fast
  • Achieved via being main-memory oriented instead of being disk-oriented, using non-blocking writes, lower complexity algorithms, etc.

2.4 Atomicity

  • Individual writes need to be atomic – should not expose in-between state to a read operation
  • Batching of multiple writes into a single atomic unit not required

2.5 Consistency

  • Do not need strong consistency
  • Ok to have eventual consistency (read-your-writes consistency)

2.6 Durability

  • Data should be persisted to disk and not just kept in volatile memory

2.7 Deployment Flexibility

  • Addition / removal of nodes should distribute data and load automatically without requiring manual intervention
  • Should not impose constraints like requiring a distributed file system or a shared storage
  • Should not require specialized hardware
  • Should be able to work with heterogeneous hardware – identical hardware should not be required to use the system optimally (otherwise you need to upgrade all nodes to get max efficiency which is not practical in a system running on 1000s of nodes)
  • Application should be able to control the degree of consistency, durability and read / write performance it requires without being aware of the deployment model.

2.8 Modeling flexibility
Should be able to model various types of data in a simple way:

  • Key-Value pairs
  • Hierarchical data
  • Graphs

2.9 Query Flexibility

  • Multi-Gets (retrieve a bunch of values, for the set of keys provided, in one call)
  • Range queries (retrieve data based on the specified range of the key)
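These two query shapes can be captured in a toy in-memory store – an illustrative sketch, not any particular product – where keys are kept sorted so range queries stay cheap (order-preserving stores like BigTable or Scalaris achieve the same property at scale):

```python
import bisect

class OrderedKVStore:
    """Toy in-memory store supporting multi-get and range queries."""
    def __init__(self):
        self._keys = []   # kept sorted so range queries are cheap
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def multi_get(self, keys):
        """Fetch a batch of values for the given keys in one call."""
        return {k: self._data[k] for k in keys if k in self._data}

    def range(self, lo, hi):
        """All (key, value) pairs with lo <= key < hi, in key order."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[i:j]]
```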

3. Database Managers
A Database Manager is a software library that is loaded in-process to provide a high performance database. Its focus is typically restricted to the job of maintaining on-disk (or in-memory) data structures. These libraries are often used by distributed databases as a storage backend for managing the CRUD operations.
3.1 Berkeley DB

3.2 Tokyo Cabinet

  • http://1978th.net/tokyocabinet/
  • Written in C. Language Bindings: Java, Ruby, Perl, Lua
  • Data Model: Hash, B+ Tree, Fixed length and Table
  • Concurrency:
    • Re-entrant API functions.
    • Writers block
    • Locking granularity is per record for hash and fixed length databases, Per file for others
  • Transactions:
    • Atomic operations
    • Two isolation levels: serializable and read uncommitted
    • Durability thru write ahead logging and shadow paging

4. Key Value Stores
Key-Value stores provide the simplest possible data model. However, this comes at a cost: range queries are not straightforward (unless the database provides explicit support), and modeling applications on top of a key-value store can in general get complicated.
4.1 Amazon Dynamo

  • http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html – terrific paper, must read!
  • Distributed key-value store. Values are opaque to system. Targets objects typically < 1 MB
  • Internal to Amazon, not available for direct consumption externally
  • Partitioning
    • Uses a variant of consistent hashing that addresses
      • Non-uniform data and load distribution
      • Heterogeneity in capacity / performance of each node
    • The hash ring is divided into Q partitions – each being serviced by a virtual node. There are P physical servers, with each physical server typically responsible for Q/P virtual nodes; the exact number is decided based on the capacity of each physical node. The number of physical nodes is much smaller than the number of virtual nodes.
    • When a physical node starts for the first time, it chooses its set of tokens (virtual nodes) and maps nodes to their tokens. This mapping is stored on disk and initially contains only the local node and token set.
    • Mappings stored at different nodes are communicated to peers using a Gossip based protocol.
    • Eventually consistent view of mappings – each node connects to a randomly chosen peer every second and efficiently reconciles mappings.
  • Replication – data is duplicated on multiple hosts.
    • Each key is associated with a “Preference List” – a list of nodes to which it is propagated. This consists of the N nodes where replicas are stored, plus some more nodes for failure handling (see below)
    • Also, since number of physical nodes < virtual nodes, the preference list is built to ensure that the list contains only distinct physical nodes (by skipping positions in the hash ring). Since each node is aware of the token ranges handled by its peers (because of mappings getting propagated), this allows each node to forward a key’s Read / Write operation to the correct set of nodes directly
    • There are three configurable values:
      • N = number of nodes that store replicas of data
      • W = Min no. of nodes that must acknowledge the receipt of a Write before a Write completes.
      • R = Min no. of nodes that are contacted when a data item is to be read
      • If R + W > N, the write and read sets overlap, giving a quorum.
    • Both Read and Write requests are handled by nodes called coordinators. A coordinator node is one of the top nodes in the Preference List for the key. The selection of a coordinator node is done by either using a partition aware client, or by using a load balancer.
      • Writes: The Write (Put) request is sent to the coordinator. The coordinator stores the value locally and then sends it to N highest ranking nodes in the Preference List that are reachable (healthy). If at least W-1 nodes respond with success, the Write is considered successful.
      • Reads: A Read (Get) request is received by the coordinator. The coordinator asks for all existing versions of that key from the top N ranking nodes in the Preference List that are reachable. If at least R nodes respond, the Read is considered a success.
      • So performance would be governed by the slowest node in a R/W operation.
    • This model of configurable (N, R, W) values gives applications control on the desired performance, availability, durability and consistency
      • Increasing W would ensure that data is replicated many times increasing durability. However, since W nodes would also be required to have a successful write, availability would reduce as W increases. Performance would also reduce since the chance that one of the nodes is a low thruput node increases. Also increasing W would potentially increase number of divergent versions (see below) creating reconciliation overhead.
      • Increasing R would ensure that consistency is high, but again availability would be reduced and performance may also reduce.
      • Reducing R would improve read performance. So for a read-intensive, low updates application, (N, R, W) can be set to (3, 1, 3)
      • Typical values of (N, R, W) for Amazon Apps are (3, 2, 2)
  • Consistency – Since updates propagate to replicas asynchronously, the model is eventually consistent
    • Uses object versioning (each update results in a new immutable object) using Vector Clocks.
    • Consistency protocol: As mentioned above, a successful read requires that at least R nodes respond. Each response includes all versions of the object. If the coordinator node ends up with multiple versions of data, it does the following:
      • Returns all versions it deems to be causally un-related
      • Divergent versions are reconciled
      • Reconciled version superseding the current version is written back.
  • Handling Temporary Failures:
    • As mentioned above, the Read and Write propagation considers the first N healthy nodes, not the first N nodes in the Preference List.
    • However, when a node that is not in the top N nodes (but is in the top N healthy nodes) receives a Write request, it is given extra metadata indicating which node the write was actually meant for. This is called a Hinted Replica.
    • Each node periodically scans its hinted replica set and tries to connect to the original nodes for which the replica was meant.
    • If it is able to connect, the node tries to deliver the replica to the original node and if delivered, may delete the replica
    • This process is called Hinted Handoff
  • Handling Permanent Failures:
    • Permanent failures can arise, for example, if a hinted replica becomes unavailable before it is delivered
    • To reduce the chances of permanent failures, nodes sync replicas thru Merkle trees
    • Each node maintains a Merkel tree for the key range hosted by it
    • To sync, two nodes exchange the root of the tree common to the key ranges hosted by them, and in case there are any divergences, the apt sync action is taken.
  • Ring membership:
    • Addition / removal of node from ring is relatively expensive because of the need to rebalance partition assignment
    • Therefore addition of a node to a ring and its departure are explicit and not based on non-availability since non-availability can be because of transient conditions (say network)
    • Membership propagation and reconciliation happens during the same communication that is used for token mapping propagation and reconciliation (see partitioning above)
  • Failure Detection:
    • Failure detection is needed to avoid trying to reach unhealthy (unreachable) nodes.
    • When a node A fails to reach another node B, it propagates that information thru Gossip. Checks periodically to see if B comes back. Info on arrival of node also propagates using Gossip.
  • Persistence:
    • Pluggable model – Berkeley DB Transactional Data Store, MySQL, in-memory buffer with persistent backing store.
    • Apps choose the kind of storage they need.
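The (N, R, W) mechanics described above can be sketched as follows. This is a toy model for illustration only – real Dynamo adds vector clocks, hinted handoff, Merkle-tree sync, etc., and the md5-based ring here is just a stand-in for its consistent hashing:

```python
import hashlib

class QuorumStore:
    """Toy sketch of Dynamo-style (N, R, W) quorum replication."""
    def __init__(self, node_names, n=3, r=2, w=2):
        assert r + w > n, "R + W > N gives overlapping read/write sets"
        self.nodes = {name: {} for name in node_names}  # per-node storage
        # Place nodes on a ring, ordered by the hash of their name.
        self.ring = sorted(node_names,
                           key=lambda s: hashlib.md5(s.encode()).hexdigest())
        self.n, self.r, self.w = n, r, w

    def preference_list(self, key):
        """The N nodes responsible for this key, walking the ring."""
        h = hashlib.md5(key.encode()).hexdigest()
        i = next((i for i, nd in enumerate(self.ring)
                  if hashlib.md5(nd.encode()).hexdigest() >= h), 0)
        return [self.ring[(i + k) % len(self.ring)] for k in range(self.n)]

    def put(self, key, value, version):
        acks = 0
        for node in self.preference_list(key):
            self.nodes[node][key] = (version, value)   # replicate
            acks += 1
            if acks >= self.w:
                return True    # write succeeds after W acknowledgements
        return acks >= self.w

    def get(self, key):
        replies = [self.nodes[nd][key]
                   for nd in self.preference_list(key)[:self.r]
                   if key in self.nodes[nd]]
        return max(replies)[1] if replies else None  # newest version wins
```

Raising W in this sketch makes writes touch more replicas (more durable, slower); raising R makes reads consult more replicas (more consistent, slower) – the same knobs the text describes.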

4.2 Amazon S3

4.3 Project Voldemort

4.4 Redis

4.5 Scalaris

  • http://code.google.com/p/scalaris/
  • Distributed key-value store written in Erlang
  • No persistence, entirely memory driven. Can use a backend like Tokyo Cabinet for handling database size > RAM + Swap.
  • Partitioning – Uses a modified version of Chord called Chord# that
    • Preserves key order (allowing range queries)
    • Does routing in node space
    • Delivers proven O(Log N) performance (proof in the paper linked above)
  • Replication – Essential for being fault tolerant since there is no persistence. Uses a peer to peer replication approach called Symmetric Replication that reads and writes to a majority of nodes participating in replication.
  • Consistency: Strong consistency – uses a modified version of Paxos that provides atomic transactions: non-blocking writes that execute in parallel with strong consistency.
  • Language support: Erlang, Java, JSON RPC
  • License: Apache 2.0
  • Production Use: None mentioned

4.6 Others

5. Document Stores
Document stores are a step beyond key-value stores in that the value associated with a key is a full-blown record (document) which is not opaque to the database, but exposes a structure that allows the database to perform some operations on it. For example, the document can be a JSON serialized object.
Note that this approach is different from a relational database where the schema is defined upfront in the form of a table. In a document store, each document can have a different schema. Despite this, it is possible to describe relationships between records, just the way you do in relational databases:

  • One to Many:
    • Embed key of foreign entity in the document. Then the foreign entity can be retrieved.
    • Embed foreign entity itself in the document. Does not scale, but is faster
  • Many to Many
    • Embed keys of one type of entities in the other type of entity
    • Maintain another entity that embeds the keys of the related entities.

As should be obvious, document stores are conceptually very similar to relational databases, except that the schema definition is not upfront.
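The relationship-modeling options above can be sketched with made-up entities (plain dictionaries standing in for JSON documents):

```python
# One-to-many by reference: the post embeds only the author's key.
users = {"u1": {"name": "Meera"}}
post_by_ref = {"_id": "p1", "author_id": "u1", "title": "Hello"}

# One-to-many by embedding: faster to read, but the author data is
# duplicated in every post and can drift out of sync -- does not scale.
post_embedded = {"_id": "p1", "title": "Hello",
                 "author": {"name": "Meera"}}

# Many-to-many: each side embeds a list of the other side's keys.
students = {"s1": {"name": "Ravi", "course_ids": ["c1", "c2"]}}
courses = {"c1": {"title": "Algebra", "student_ids": ["s1"]}}

def author_of(post):
    """Resolve the author whichever way the post was modeled."""
    return post.get("author") or users[post["author_id"]]
```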
5.1 Amazon SimpleDB

  • http://aws.amazon.com/simpledb/
  • Written in Erlang
  • Data is modeled in terms of
    • Domain (a container for holding related data together)
    • Item (an entity that goes into a domain)
    • Attribute and its Value (the properties of an item)
  • So basically each item is a dictionary to which you can add / remove attributes (keys) at any time. Related items go into a Domain.
  • When an item is added, it is indexed based on its attributes (and the attributes of other items in the domain).
  • You can issue SELECT statements much the way you would issue to a relational system, with the concept of a table being replaced by a Domain.
  • Eventually consistent
  • Partitioning: The domain-item-attribute model imposes limits on size of each domain, number of attributes / domain and number of domains, so in order to scale, you need to partition the data manually.

5.2 Apache CouchDB

  • http://couchdb.apache.org/
  • Written in Erlang
  • Data and Storage Model
    • Records serialized in the form of JSON strings (documents)
    • Each document has a unique DocId
    • Uses B+ Trees: Both for main data as well as for Indexes
    • Append only operations (Every update generates a new Sequence Id for the DocId). Committed data is never over-written
      • Every write operation also results in index updates which are also append only, so a new node is added to the B+ Tree which recursively leads to Log(n) updates to the B+ tree.
      • During Compaction older versions are removed
    • Append only model means that Reads are done using Multi-Version Concurrency Control. A client can hold on to the older root of the B+ tree and get a consistent view while the data is being modified continuously.
    • Writes are serialized, except for blobs which can be concurrently written.
    • An update to a document is atomically committed with data and index updates getting synchronously flushed to disk, and the database header is updated.
    • When resuming from a crash, no recovery is required (except for recovering the database header). Partial updates are simply truncated.
  • Views
    • Required to “add structure back to unstructured and semi-structured data.”
    • Can have multiple views for the same data
    • Defined using a JavaScript function that maps keys to values in a special type of document called a Design Document.
    • View functions are run when you query a view: the system takes the source code and runs it against every document in the database. There are basically two kinds of functions:
      • Map function: takes a single parameter – the document – a single document in the system, does some operation on it (can’t modify the document) and returns a result (mapping an input to a result).
      • Reduce function: takes a list of key-values and returns a single result after performing some aggregate operation on it (reducing multiple inputs to a single result)
    • View results are stored in a B+ tree and unless the document changes or the view definition changes, the results are not re-computed and fetched straight from the B+ tree.
  • Replication
    • Master-slave with multi-master support
    • Only latest revision replicated
  • Conflict Handling in Multi-master
    • CouchDb detects conflicts and uses a deterministic algorithm for picking a “winner”. Deterministic means that all replicas end up winning the same version without talking to each other.
    • This winner may or may not be the one your app expected. This will have to be fixed by the app itself.
  • Partitioning and Load Balancing - Provided thru CouchDb Lounge
    • Uses consistent hashing to partition the data
    • Dumbproxy used for all requests except View requests. Implemented as an Nginx module
    • Smartproxy used for View requests only
      • Twisted python based daemon
      • Sends view requests to all shards
      • Merges responses before sending them back
      • Merges happen in constant memory space
  • License: Apache 2.0
  • Production Use: BBC, PylonsHQ, GPirate, and several others
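CouchDb view functions are written in JavaScript inside a design document; the sketch below emulates the same map / reduce shape in Python purely for illustration (the function names and sample documents are made up):

```python
docs = [
    {"_id": "1", "type": "post", "author": "ana", "words": 120},
    {"_id": "2", "type": "post", "author": "bo",  "words": 80},
    {"_id": "3", "type": "post", "author": "ana", "words": 40},
]

def map_fn(doc):
    """Like a CouchDb map function: emit (key, value) pairs per document.
    It may read the document but never modify it."""
    if doc["type"] == "post":
        yield (doc["author"], doc["words"])

def reduce_fn(values):
    """Like a CouchDb reduce function: aggregate the mapped values."""
    return sum(values)

def run_view(docs, map_fn, reduce_fn=None):
    """Run the map function against every document, then optionally
    group by key and reduce -- the shape of a view query."""
    rows = [kv for d in docs for kv in map_fn(d)]
    if reduce_fn is None:
        return sorted(rows)
    grouped = {}
    for k, v in rows:
        grouped.setdefault(k, []).append(v)
    return {k: reduce_fn(vs) for k, vs in grouped.items()}
```

In CouchDb itself the rows produced this way are stored in a B+ tree and only re-computed for changed documents, which is what makes view queries cheap after the first run.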

5.3 Riak

  • http://riak.basho.com/
  • Written in Erlang. Uses the same architecture and algorithms as Amazon Dynamo.
  • Client libraries
  • Data Model
    • Key-Value. Value is a document.
  • Riak nodes go on an ordered consistent hash ring
    • A 160-bit binary hash of each key maps to a position on the ring.
    • Ring divided into partitions. Each partition is designated an owner called v-node. The v-node is said to claim a partition.
    • Nodes try to run an equal number of v-nodes, so each node is responsible for roughly (number of partitions) / (number of nodes) v-nodes. So if 2 nodes define a 16-partition cluster, each node would run 8 v-nodes.
  • Writes (Puts)
    • Any node may be chosen as coordinator for the Write request
    • Coordinator node figures out the v-node responsible for the partition where the key needs to go.
    • Coordinator sends Write request to that v-node as well as next N-1 partitions in the ring, where N is a configurable parameter that defines how many copies of the value need to be created.
    • Write request may also specify that at least W (<= N) of the v-nodes respond with success, and that DW (<= W) of those W nodes respond with success only after durably storing the value.
  • Reads (Gets)
    • Any node may be chosen as a coordinator for the Read request
    • Coordinator node figures out the v-node responsible for the partition for the key
    • Sends request to v-node, as well as next N-1 nodes
    • Request can also specify R (<= N) nodes that must reply before a response is returned.
  • Ring state propagation
    • Consists of things like arrival and departure of nodes on ring
    • Done using Gossip protocol
  • Eventual Consistency
    • When an object is added, it is tagged with a vector clock specifying its initial version
    • For each update, the clock is updated in such a way that the system can compare two versions of an object and determine how the two objects are related (direct descendant, common parents, unrelated)
    • To compare object versions, a Merkle tree is used
  • Failure handling is done using the same hinted replicas and hinted hand-offs approach as described in the Dynamo paper.
  • Storage model – pluggable:
  • Queries
    • Map-reduce model
    • Map operations run local to the data on the hash ring – can be run in parallel
    • Reduce operations take inputs from the preceding phase and reduce (typically perform an aggregate operation) the input. This cannot be parallelized
  • License: Apache 2.0
  • Production use: No names mentioned, except that the FAQ says that “Riak has robustly served the needs of multiple revenue-generating applications for nearly two years now.”
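The vector clock comparison described above can be sketched as follows – an illustrative implementation, not Riak's actual code. Each clock is a map from node name to a counter; comparing two clocks tells the system whether one version descends from the other or whether they are causally unrelated and must be reconciled by the app:

```python
def vc_merge(a, b):
    """Merge two vector clocks, taking the max counter per node."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'descends', 'precedes', 'equal' or 'conflict'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if b_le_a:
        return "descends"      # a is a direct descendant of b
    if a_le_b:
        return "precedes"      # a is an ancestor of b
    return "conflict"          # causally unrelated: app must reconcile

v1 = {"node_a": 1}
v2 = vc_merge(v1, {"node_a": 2})    # later update seen by node_a
v3 = vc_merge(v1, {"node_b": 1})    # concurrent update at node_b
```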

5.4 MongoDB

  • http://www.mongodb.org/
  • Written in C++
  • Language bindings (drivers): C, C++, Java, JavaScript, Perl, PHP, Python, Ruby. Also community provided support for C#, F#, Erlang, Groovy, etc.
  • Data Model
    • Collections – similar to Domains in Amazon SimpleDb – a bucket to hold together similar documents
    • Key-Value, with value being binary serialized JSON (BSON)
      • 4 MB limit on a BSON object
      • Large object support via GridFS
    • B-Trees used for indexes. You specify the fields on which to index using a function called db.things.ensureIndex() that takes the field as a parameter.
  • Storage: Uses memory mapped files. So caching is controlled by the OS VMM
  • Writes
    • In-place updates
    • Provides partial updates – you can update part of a value without sending the full document
    • Single document Atomic updates. Atomic batch updates possible thru db.eval() but may not be supported with partitioned data

  • Queries
    • Provides a JSON-style syntax to pick values from inside documents
    • Support for conditional operators, regular expressions, etc.
    • Supports forward-only cursors
    • Query optimizer is not cost-based, instead multiple query plans are tried and the one that works best is picked.
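To make the JSON-style query syntax concrete, here is a sketch of how such a query might be matched against documents. The operator names mirror MongoDB's (`$gt`, `$lt`), but the matcher itself is my own illustration, not Mongo's implementation.

```python
# Illustrative matcher for a MongoDB-style JSON query (assumption:
# only exact matches and the $gt/$lt conditional operators are shown).

def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):                 # conditional operators
            value = doc.get(field)
            if "$gt" in cond and not (value is not None and value > cond["$gt"]):
                return False
            if "$lt" in cond and not (value is not None and value < cond["$lt"]):
                return False
        elif doc.get(field) != cond:               # exact-value match
            return False
    return True

docs = [{"x": 3}, {"x": 7}, {"x": 12}]
hits = [d for d in docs if matches(d, {"x": {"$gt": 5, "$lt": 10}})]
# hits == [{"x": 7}]
```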
  • Batch Operations
  • Replication:
  • Partitioning:
    • Partitioning is in alpha stage.
    • Data sharded by Collection with order preservation
    • A shard consists of one or more servers replicating using an enhanced version of replica pairs
    • Shards deal with data in units called Chunks.
      • A chunk is a contiguous range of data from a collection with a max limit of 50 mb. After that a chunk splits into two chunks
      • Data migrates from one shard to another in chunks (e.g. when a shard ends up with excess data relative to the others)
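The chunk-splitting behaviour above can be sketched as follows. The 50 mb threshold comes from the description; everything else (the chunk representation, splitting at the median key) is my own simplification.

```python
# Illustrative chunk split: a chunk is a contiguous, ordered key range;
# once it exceeds the max size it splits into two chunks at the median
# key. (Sketch only – not MongoDB's actual balancer logic.)

MAX_CHUNK_BYTES = 50 * 1024 * 1024       # 50 mb, per the description

def maybe_split(chunk):
    """chunk = (ordered_keys, size_bytes); returns one or two chunks."""
    keys, size = chunk
    if size <= MAX_CHUNK_BYTES:
        return [chunk]
    mid = len(keys) // 2                 # keys stay in order, so the range
    return [(keys[:mid], size // 2),     # split preserves contiguity
            (keys[mid:], size - size // 2)]

big = (["a", "b", "c", "d"], 60 * 1024 * 1024)
parts = maybe_split(big)                 # splits into two 30 mb chunks
```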
  • License: AGPL 3.0
  • In Production use at: Sourceforge, Github, EA, and several others

6. Column-Family Stores
6.1 BigTable

  • http://labs.google.com/papers/bigtable.html
  • Internally used at Google. Exposed to general public thru Google App Engine
  • Building Blocks
    • Google File System – a distributed file system. Provides raw storage
      • Files broken into chunks (typically 64 mb)
      • Each chunk is replicated across 3 machines for safety.
        • A heuristic here is that one of the replicas tries to be on the same rack as the original and the other two replicas somewhere else
      • Chunk Servers – responsible for storing chunks
      • Data transfer happens directly between clients and chunk servers
      • Master – responsible for metadata in the file system
    • The storage file format is called SSTable (sorted strings table)
      • Persistent, ordered, immutable map of keys to values with both keys and values being arbitrary byte strings
      • Operations to look up the value specified with a key and to iterate over key/value pairs in a given key range
      • Optionally, an SSTable can be completely mapped into memory
      • Immutability of SSTables:
        • Means no synchronization required during file access.
        • Makes removing permanently deleted data a garbage collection exercise (happens in a major compaction – see below)
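The SSTable abstraction above can be sketched in a few lines: an immutable, sorted map supporting point lookup and range iteration. The class and method names are mine; Google's on-disk format is of course far more involved.

```python
# Sketch of the SSTable idea: immutable, sorted map of byte-string
# keys to values, with point lookup and key-range iteration.

import bisect

class SSTable:
    def __init__(self, items):
        # Sorted once at construction and never mutated afterwards,
        # so concurrent readers need no synchronization.
        self._items = sorted(items)
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def scan(self, start, end):
        """Iterate key/value pairs with start <= key < end."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            yield self._items[i]
            i += 1

t = SSTable([(b"banana", b"2"), (b"apple", b"1"), (b"cherry", b"3")])
```

Immutability is what makes the later compaction story clean: deletes are just tombstone entries until a major compaction rewrites the files.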
    • Scheduler
      • Schedules jobs on machines
      • Watches for failures and re-schedules if required
    • Chubby Service – Distributed lock / file / name manager
      • Coarse grained lock – can store small amount of data in lock
      • 5 replicas, 1 master. Need majority vote to be active. Uses Paxos to keep replicas in sync
      • Provides a namespace that consists of directories and small files. Each file / directory can be used as a lock, and reads and writes to a file are atomic.
      • Each chubby client maintains a session with a chubby service. There is a lease time which needs to be renewed.
      • Clients can register callbacks on Chubby files and directories for notification of changes / session expiration
    • Optional Services used by BigTable applications (but not used by BigTable itself)
      • Map-Reduce: Batch processing – clients often use this to read / write BigTable data
      • Sawzall: to support execution of client provided scripts in the server space for transformation, filtering and summarization
  • Data Model
    • Multi-dimensional map (map of maps)
    • Data organized into tables. A table is an entity that holds related data
      • Each record is identified by a row key
      • Each row has columns. Columns here should be thought of as attributes of a row. Not all rows would have the same columns
      • Intersection of a row and a column creates a cell
      • A cell has versioned data based on timestamp. So a single cell may hold multiple versions.
      • It is convenient to organize attributes. Thus columns are grouped into column families. A column family can be thought of as a namespace for a set of related attributes.
    • So the dimensions of the multi-dimensional map are:
      • Row key: Uniquely identifies a datum (arbitrary byte string)
      • Column-family: A grouping of attributes – a namespace for a set of related attributes.
      • Column: A single attribute.
      • Timestamp: Denoting the time when the value was changed.
        • Different versions are stored in decreasing order of timestamp, so most recent versions can be read first
        • Client can specify that only last n-revisions of cell should be kept, or only values written in the last n-days should be kept
    • Transactional access possible to any column within a single row
      • Can perform atomic read-modify-write sequences on data stored under a single row key
      • Transactions across row keys not supported
      • Batch operations across row keys are possible but are not atomic
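The versioned-cell behaviour described above (versions kept in decreasing timestamp order, with a keep-last-n garbage-collection policy) can be sketched like this. The `Cell` class and its names are my own illustration.

```python
# Illustrative versioned cell (not BigTable's code): versions are kept
# newest-first by timestamp, and only the last n revisions survive.

class Cell:
    def __init__(self, max_versions=3):
        self.versions = []                          # (timestamp, value)
        self.max_versions = max_versions

    def write(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(key=lambda tv: -tv[0])   # decreasing timestamp
        del self.versions[self.max_versions:]       # GC older revisions

    def read(self):
        # Most recent version is first, so reads are cheap.
        return self.versions[0][1] if self.versions else None

c = Cell(max_versions=2)
c.write(100, "v1")
c.write(200, "v2")
c.write(300, "v3")          # "v1" is garbage collected
```

A keep-last-n-days policy would work the same way, just with a timestamp cutoff instead of a count cutoff.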
  • Tablets
    • Data is sorted lexicographically by Row key
    • Row range for a table is dynamically partitioned.
    • A row range is called a Tablet. So you can describe a tablet in terms of a start row key and an end row key
    • One tablet = 100 to 200 mb
      • Tablets can be split based on tablet size or load
    • A tablet server may end up serving anywhere from 10 to 1000 tablets (typically hundreds).
      • While each tablet consists of contiguous data, a tablet server would pick up tablets spread across the entire data set and not data that is related. This helps alleviate hot spots since load gets spread evenly
      • The high ratio of tablets to tablet servers gives very fast recovery since if a machine that was serving 100 tablets fails, 100 machines can each pick up 1 tablet from the failed machine making recovery highly parallelized
      • Also gives fine-grained load balancing – you can migrate tablets away from an overloaded machine.
    • Servers are further grouped into cells
      • A cell is a collection of servers. Most cells have 10-20 servers
      • Some cells have as many as thousands of servers managing several hundred TB of data
  • A typical cluster consists of
    • Commodity Intel based PC hardware with each node running
      • Google’s version of Linux
      • GFS Chunk server – for raw reads and writes
      • Scheduler slave process – runs jobs
      • Tablet server – serves BigTable read / write requests
      • Some nodes may also run a BigTable master which is responsible for meta-data requests (create a new table) and load monitoring / balancing
    • GFS Master
    • Cluster scheduling master – talks to the scheduler slave process on each node to startup jobs
    • Lock service – used for electing BigTable masters and for bootstrapping the tablet lookup (below)
  • Locating Tablets
    • To figure out which server is serving which tablet (row key range), this mapping info is kept in special metadata tables (which in turn can split into tablets)
    • Three level hierarchical lookup with bootstrap provided by Chubby
      • Chubby file points to a root metadata tablet – this tablet is never split
      • This tablet points to other metadata tablets
      • Each 2nd level metadata tablet points to app data tablets
      • Each metadata row is an encoding of the tablet’s table id and end row key, and takes approx 1 kb
      • With a 128 mb limit on metadata tablets, this scheme can address up to 2^34 tablets (or 2^61 bytes in 128 mb tablets)
    • The lookup is aggressively pre-fetched and cached by clients so most queries end up going straight to servers
      • Empty cache: There are 3 network round trips including one read from Chubby
      • Stale cache: could take up to 6 round trips since stale entries are discovered via misses
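The addressing arithmetic above checks out: with ~1 kb metadata rows and 128 mb metadata tablets, each metadata tablet indexes 2^17 tablets, and two metadata levels multiply out to 2^34 addressable tablets.

```python
# Verifying the metadata addressing arithmetic from the text.

row_size = 2**10                          # approx 1 kb per metadata row
tablet_limit = 128 * 2**20                # 128 mb metadata tablet

rows_per_tablet = tablet_limit // row_size         # 2^17 rows per metadata tablet

# Two metadata levels: the root tablet points at 2^17 metadata tablets,
# each of which points at 2^17 user tablets.
addressable_tablets = rows_per_tablet ** 2         # 2^34 tablets

# At 128 mb per user tablet, that is 2^34 * 2^27 = 2^61 bytes.
addressable_bytes = addressable_tablets * tablet_limit
```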
  • Tablet Assignment
    • Each tablet assigned to one server at a time. Master keeps track
    • When a tablet gets unassigned, the Master asks a Tablet server with available capacity to take over the tablet
    • Tablet assignment uses Chubby
      • On starting, a tablet server creates and acquires an exclusive lock on a uniquely named file in a specific Chubby directory called Servers
      • Master monitors this directory to discover tablet servers
      • If a tablet server loses its exclusive lock it stops serving the tablets
      • Tablet server may try and re-acquire the lock if the file still exists, however if the file is gone, the tablet server kills itself
    • Master asks each tablet server periodically for the status of its lock
      • If tablet server reports a loss, or if master fails to connect to tablet server, the master tries to acquire an exclusive lock on the tablet server’s file
      • If master is able to acquire lock, it means Chubby is alive. So master deletes the file so that the tablet server can never serve again, and the master moves all tablets assigned to the tablet server to the set of unassigned tablets
      • If master is unable to reach Chubby, it kills itself
  • Writes
    • When a write request goes to a tablet server, the server checks it for well-formedness and authorization (Chubby maintains a list of permitted writers)
    • Valid mutations are written to a commit log (grouped for small mutations)
    • After the write has been committed, its contents are inserted into a sorted in-memory buffer called memtable. From here, updates keep going to a sequence of SSTables as they get older
  • Reads
    • Similar checks as for a write request – well-formedness and authorization
    • Read op executed on a merged view of SSTable sequence and memtable
  • Tablet Recovery
    • May be required if a Tablet server dies
    • Tablet server reads its metadata from the MetaData table – list of SSTables and set of redo points which point to commit logs
    • Server reads indices of SSTables into memory and reconstructs memtable by applying all updates that have been committed since redo points
  • Compactions
    • Minor Compactions:
      • As writes occur, memtable grows till a threshold. Is frozen. New memtable created. Frozen memtable converted to SSTable and written to GFS.
      • Shrinks memory usage of tablet server
      • Reduces amount of data that needs to be read from Commit log during recovery
    • SSTable Compactions
      • As minor compactions occur, new SSTables get created. This can lead to a Read operation slowing down as it may need to merge data from multiple SSTables
      • So periodically a merging compaction runs which reads a few SSTables and the Memtable and writes out a new SSTable
      • This does not remove special deletion entries (used to suppress deleted data)
    • Major Compactions
      • From time to time BigTable runs a compaction that rewrites all SSTables into exactly one SSTable. This is called a Major compaction
      • During this compaction deleted data / deletion entries are removed.
      • Required for reclaiming resources
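The write path and minor compaction described above can be sketched as follows. This is an illustration of the structure only (commit log, memtable, frozen SSTables, merged reads), not BigTable's code; the class and threshold are my own.

```python
# Sketch of the BigTable-style write path: commit to the log first,
# then update the in-memory memtable; when the memtable crosses a
# threshold, freeze it and flush it as an immutable sorted table.

class TabletStore:
    def __init__(self, memtable_limit=2):
        self.commit_log = []             # for durability / recovery
        self.memtable = {}               # mutable in-memory buffer
        self.sstables = []               # frozen, immutable flushes
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))      # commit log first
        self.memtable[key] = value                # then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self._minor_compaction()

    def _minor_compaction(self):
        # Freeze the memtable into a sorted, immutable table and start fresh.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def read(self, key):
        # Merged view: memtable first, then SSTables newest-to-oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

s = TabletStore()
s.write("a", 1)
s.write("b", 2)        # crosses the threshold: triggers a minor compaction
s.write("a", 9)        # the newer value lives in the fresh memtable
```

A merging compaction would correspond to collapsing several entries of `sstables` into one; a major compaction collapses all of them (and drops tombstones).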
  • Refinements
    • Locality Groups: Are used to group together parts of data that have similar usage characteristics.
      • Client can group multiple column families into a locality group
      • Separate SSTable generated for each locality group
      • Locality group can be marked as being in-memory (lazy loaded). BigTable itself uses it for the Location column-family in metadata table.
    • Compression: Clients can specify whether SSTables for a locality group are compressed, and which compression algorithm to use
    • Caching: Tablet servers use two levels of caching:
      • Scan cache – higher level. Caches key-value pairs returned by SSTable interface to Tablet server code. Useful for apps that tend to read the same data repeatedly
      • Block cache – lower level. Caches SSTable blocks read from GFS. Useful for apps that read data close to data they recently read

6.2 Cassandra

  • http://incubator.apache.org/cassandra/
  • Combines the distributed architecture of Dynamo with BigTable’s column-family data model
  • Written in Java. Thrift based interface. High-level libraries available on Ruby, Perl, Python, Scala and Java
  • Data Model:
    • Multi-dimensional map indexed by a key like BigTable.
      • Each application creates its own key-space (table in BigTable)
      • Key – a name for a record. Can be an arbitrarily long string. Indexed by Cassandra
      • Column – an attribute of a record. Smallest datum you can get hold of. This is timestamped.
      • Column-family: Grouping of columns. Similar to a relational table.
      • Super-Columns: A list of columns
      • A column-family can either contain columns or super-columns, but not both. A column family containing super-columns is called a super-column-family
      • The way to get to a piece of data is: KeySpace.ColumnFamily.Key.[SuperColumn].Column
    • Sorting
      • Data is sorted at write time (unlike relational)
      • Columns are sorted within their row by column-name. The sorting provider is pluggable
  • Partitioning: Similar to Dynamo
    • Consistent hashing using an order preserving hash function.
    • However uses the Chord approach to load balancing rather than Dynamo’s v-node approach
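The consistent-hashing placement above can be sketched like this. Note the hedges: for simplicity this uses MD5, which is not order-preserving (Cassandra's order-preserving partitioner is what makes range scans possible), and the `Ring` class and its names are mine.

```python
# Minimal consistent-hashing ring in the spirit of the description
# (illustrative; Cassandra's real partitioner differs, and this hash
# is NOT order-preserving). Each node owns the arc of the ring between
# its predecessor's token and its own.

import bisect
import hashlib

def ring_position(key: str, ring_size=2**32) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % ring_size

class Ring:
    def __init__(self, nodes):
        self.tokens = sorted((ring_position(n), n) for n in nodes)

    def node_for(self, key):
        pos = ring_position(key)
        points = [t for t, _ in self.tokens]
        # First node at or after the key's position, wrapping around.
        i = bisect.bisect_right(points, pos) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("some-row-key")    # deterministic placement
```

Adding or removing a node only moves the keys on one arc of the ring, which is the property that makes membership changes cheap.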
  • Replication
    • Same concept of coordinator nodes and preference lists as Dynamo
    • Various replication policies: Datacenter aware, Rack-aware, Rack-unaware
    • Rack-unaware approach is same as Dynamo (replicate to N-1 nodes)
    • Rack aware and Datacenter aware use a more intricate algorithm which involves a leader (chosen using Zookeeper) that maintains the N-1 replicas invariant. This is apparently not there in Apache Cassandra.
  • Membership: Membership based on Scuttlebutt – anti-entropy Gossip based mechanism. Same concept of explicit membership changes as Dynamo
  • Failure detection: Uses a modified version of Φ Accrual Failure Detector which gives a probabilistic value for each monitored node. This apparently is fairly accurate and more representative of network conditions.
  • Failure Handling: Same concept of hinted replicas and hinted hand offs as in Dynamo
  • Persistence:
    • Relies on local file system (unlike BigTable that uses GFS)
    • Storage structure and approach is similar to BigTable: SSTables, Memtables, Commit logs, Compaction, Bloom filters, etc.
  • Writes
    • Similar to the BigTable approach of writing to commit log for durability and recoverability, followed by update to Memtable
    • Dedicated disk on each machine for commit log makes the writes sequential and thus maximizes thruput
    • When memtable crosses a threshold (out of space / too many keys / client provided time duration), it gets written to two files:
      • Data file – SSTable
      • Index File – key/offset pair
    • No seeks – always sequential, thus quite fast
    • Atomic within a column family
  • Reads
    • Similar approach as Dynamo for figuring out which nodes will serve, read-repair, etc.
    • At a storage level, approach is similar to BigTable: read from memtable + SSTable sequence
  • License: Apache 2
  • In Production at: Facebook, Digg, Rackspace
