Everything you need to know about data modelling for document databases

Back in the 90s and early 2000s, applications looked like the following:

[Figure: early days of applications]

Today, building applications is different from what it was back in 2000. Building modern applications involves quite a few different technologies.

[Figure: today's applications]

Back in the day, you could estimate and predict the size of your audience. Today we have virality and all kinds of marketing schemes, which make audience size much harder to predict. You never know: your next app could be the big hit.

We have spent a long time in the relational world; it takes time, effort and practice to bend your mind and understand how to model data the NoSQL way.

In relational data stores, data is organized into tables of rows and columns. Data types are usually simple ones like string, int, float and datetime. It is hard to store complex data types like arrays, hash tables and rich objects; this gap between the object model and the relational model is called the impedance mismatch.

Key/value and document stores, on the other hand, handle simple data types too, and they are also comfortable storing complex data types, usually in JSON format. JSON is essentially a text format for representing data structures. Also, the schema in key/value and document stores is implicit and unenforced.
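As a rough sketch (plain Python, with a dict literal standing in for a JSON document; no particular database assumed), here is the kind of aggregate that is awkward to flatten into rows but trivial to store as a document:

```python
import json

# A single aggregate with nested structures: awkward to map onto flat
# relational rows, but a natural fit for a JSON document.
user = {
    "email": "ann@example.com",
    "name": "Ann",
    "addresses": [                       # array of embedded objects
        {"type": "home", "city": "Istanbul"},
        {"type": "work", "city": "London"},
    ],
    "preferences": {"newsletter": True}, # nested hash table
}

document = json.dumps(user)              # the value stored under a key
print(json.loads(document)["addresses"][0]["city"])  # -> Istanbul
```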

For many years now, we have been taught to normalize our data model to reduce redundancy and improve efficiency, as follows.

[Figure: an aggregate split across normalized tables]

Relational data stores are like a garage: you take your car apart and store the related pieces together on the shelves.

In a document database, you persist the vehicle as it is. “Come as you are.”

[Figure: the same aggregate represented as a JSON document]

The same data can be represented in JSON document format, as above. The aggregate is much more compact and simpler to represent in JSON; it is a denormalized form of the data, which also makes it more convenient to distribute across multiple servers or clusters. The main problem for relational data stores here is the JOIN, which is an expensive operation. Here is the domain model for this particular example:

[Figure: the domain model for this example]

While ORMs come to the rescue for relational databases, performance usually becomes a problem. One could argue that serialization/deserialization of JSON documents can also become a problem; however, serialization frameworks are a lot faster these days.

Schema-free databases make it easy to develop software rapidly; however, you need to understand and embrace the principles behind data modelling for document databases.

Questions to keep in mind while modelling:
• How is the data going to be persisted?
• How are you going to retrieve or query the data?
• Is your application write-bound or read-bound?

Key Patterns:
We start by defining unique keys that we call predictable keys (keys we can construct and query easily). The following keys have corresponding values which are JSON documents.

- Users:
user::jane@example.com -> user document as value
user::john@example.com -> user document as value

- Products:
product::12345 [ID of the product] -> product document as value

Then we have unpredictable keys as follows:

- Sessions:
session::a5ab2863-db93-430e-8da3-feeb1998521f -> session document as value

Unpredictable keys are queried with Map-Reduce functions in most key/value and NoSQL databases.
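Here is a minimal sketch of these key patterns in Python. A plain dict stands in for the data store, and set_doc/get_doc are made-up helpers; a real client SDK would expose similar get/set calls:

```python
import json
import uuid

store = {}  # a plain dict standing in for the key/value store

def set_doc(key, doc):
    store[key] = json.dumps(doc)

def get_doc(key):
    return json.loads(store[key])

# Predictable keys are built from values we already know at query time.
set_doc("user::jane@example.com", {"name": "Jane"})
set_doc("product::12345", {"title": "Keyboard", "price": 49.90})

# Unpredictable keys contain a generated ID nobody can guess up front.
session_key = "session::" + str(uuid.uuid4())
set_doc(session_key, {"user": "jane@example.com", "active": True})

print(get_doc("user::jane@example.com"))  # direct, constant-time lookup
```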

Counter-Id Pattern:

Almost all key/value and document stores provide very fast atomic counters. These counters are safe for concurrent increment and decrement operations. The idea behind the counter-id pattern is as follows:

[Figure: the counter-id pattern]

Additionally, we can read the current value of the counter at any time, so we can predictably query users or iterate over them. This is a simple and clever technique. Since we only ever increment, we know the size of the data set, and we can even run multi-get operations over it or do paging.

This is similar to an identity column in an RDBMS. Each time we want to store a document, we increment the counter, take the new counter value, build a predictable key from it, and store the key/value pair.
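A minimal sketch of the pattern, again with a plain dict standing in for the store; note that the incr helper here is only a stand-in for the atomic, server-side increment a real store provides:

```python
store = {}

def incr(counter_key):
    # Stand-in for the atomic, server-side increment of a real store;
    # a plain dict like this is only safe single-threaded.
    store[counter_key] = store.get(counter_key, 0) + 1
    return store[counter_key]

def add_user(user_doc):
    new_id = incr("counter::user")          # 1, 2, 3, ...
    store["user::%d" % new_id] = user_doc   # predictable key from the counter
    return new_id

add_user({"name": "Ann"})
add_user({"name": "Bob"})

# The counter tells us how many users exist, so we can multi-get or page:
total = store["counter::user"]
print([store["user::%d" % i] for i in range(1, total + 1)])
```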

Lookup Pattern:

The lookup pattern is usually a two-step process:

[Figure: the lookup pattern]

First, we store the data under an unpredictable key (such as a GUID/UUID); then we create references to that key under predictable keys (email, username, etc.).

In this example, we store the user data under a GUID key, then we store references to that key, such as the email and username. To retrieve the user data in a predictable way, we use the email or username to get the GUID key, then do another GET with the key captured from the first query in order to fetch the data.

This makes it easy to retrieve documents without map-reduce operations, which is quite useful for large datasets. Most data stores implement map-reduce views with eventual consistency on top of on-disk B-trees with O(log n) complexity. The lookup pattern, in contrast, provides immediate consistency with O(1), constant lookup time. Hence the lookup pattern is very fast, scales very well, and gives you read-your-own-writes semantics along with consistent performance.
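A sketch of the two-step process in Python (dict-backed store, illustrative keys):

```python
import uuid

store = {}

# Step 1: store the document under an unpredictable key (a UUID).
user_key = "user::" + str(uuid.uuid4())
store[user_key] = {"name": "Jane", "email": "jane@example.com"}

# Step 2: store predictable reference keys whose value is the real key.
store["email::jane@example.com"] = user_key
store["username::jane"] = user_key

# Retrieval is two constant-time GETs: the reference, then the document.
real_key = store["email::jane@example.com"]
print(store[real_key])
```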

You can use these patterns together. For example, you can combine the counter-id pattern with the lookup pattern:

[Figure: counter-id pattern combined with the lookup pattern]

First we get an ID from our counter and store the data under the key “user::54”. Then we add lookup entries such as (“twitter::firat” → “user::54”), one for each key we will later need to find this particular document. Again, reading is a two-step process: first we do a GET on “twitter::firat”, which returns “user::54”, then we do another GET on “user::54” and we have the user document.

Yet again, these are pretty fast operations since they run in constant lookup time.
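The combined flow, sketched the same way (the counter value 54 and the twitter handle are just the example values from above):

```python
store = {"counter::user": 53}  # pretend 53 users already exist

# Counter-id: take the next ID and store the document under it.
store["counter::user"] += 1                      # atomic in a real store
user_key = "user::%d" % store["counter::user"]   # -> "user::54"
store[user_key] = {"name": "Firat", "twitter": "firat"}

# Lookup: a predictable reference key pointing at the counter-based key.
store["twitter::firat"] = user_key

# Two-step read: GET the reference, then GET the document itself.
print(store[store["twitter::firat"]])
```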

Pros of combining the counter-id and lookup patterns:
• Extremely fast binary operations.
• Linear, predictable performance.
• Several ways to find a single document.
• Always consistent: you read your own writes.

Cons of combining the counter-id and lookup patterns:
• Increases the number of documents in storage, which is usually not a problem.

Below you can find the differences between an RDBMS and a document store for persisting and querying data.

[Figure: RDBMS vs. key patterns for persisting and querying data]

Related data patterns:
To embed or to reference? That is the question.
Embedding is easy: you persist the document as it is.

Pros of embedding:
• Documents are easy to retrieve: a single read operation brings back the whole document.
• A single write or update operation takes care of your insert or update.

Cons of embedding:
• In most data stores there is a limit on document size, such as 16 MB (MongoDB) or 20 MB (Couchbase).

Most key/value and document stores are designed and built to store small documents; storing large documents is usually a bad idea.

Retrieving a large document can be a burden on the network, even with gzip enabled. And when multiple clients update the same document, you may run into versioning problems and will have to implement optimistic or pessimistic locking.

At the same time, embedding can be performant due to locality of reference.

When is it good to embed:
• There are “contains” relationships between entities.
• There are one-to-few relationships between entities.
• The embedded data changes infrequently.
• The embedded data won’t grow without bound.
• The embedded data is integral to the data in the document.

Usually, denormalized or embedded data models provide better read performance.

Below is an example of an embedded document:

[Figure: an embedded JSON document]

Below is an example of a document with references:

[Figure: a document with references]

Referenced line items are separate documents, such as the following:

[Figure: a line-item document]

Each line item can be accessed with a key taken from the LineItems array above.
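To make the difference concrete, here is a sketch of the same order modelled both ways, with Python dicts standing in for JSON documents and made-up keys:

```python
# Embedded: the whole aggregate lives in one document.
order_embedded = {
    "customer": "Ann",
    "line_items": [
        {"product": "product::12345", "qty": 2},
        {"product": "product::67890", "qty": 1},
    ],
}

# Referenced: line items are separate documents; the order stores their keys.
store = {
    "order::1001": {"customer": "Ann",
                    "line_items": ["lineitem::1001-1", "lineitem::1001-2"]},
    "lineitem::1001-1": {"product": "product::12345", "qty": 2},
    "lineitem::1001-2": {"product": "product::67890", "qty": 1},
}

# Loading the referenced form costs one extra GET per line item.
order = store["order::1001"]
print([store[key] for key in order["line_items"]])
```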

You can use both referencing and embedding as below:

[Figure: mixed embedding and referencing]

You can also use referencing heavily, as follows:

[Figure: heavy referencing]

Sometimes we need to embed and sometimes we need to reference. Deciding which is the critical decision for efficient modelling, and there are some signals to guide the choice.

Relationships…

[Figure: relationship types]

A 1:1 (one-to-one) relationship is usually better for embedding. A 1:few relationship is also acceptable to embed; for example, a blog post document with its tags or categories. So we need to pay attention to relationships and dependencies. Embedding leads to better performance because it saves round trips (RTTs) to the database.

A 1:many (one-to-many) relationship is good for referencing; for example, a blog post with its comments. A blog post can have hundreds of comments of different sizes, and there is no bound on the referenced data. Referencing is more efficient in this case, and embedding can also expose you to race conditions when many clients write at once.

An M:N (many-to-many) relationship is good for referencing as well.

Data volatility also matters when choosing between embedding and referencing. How often does the data change? If the data doesn't change often, it is better to embed; if there are a lot of operations on particular documents, it is usually better to reference. In short: low volatility favors embedding, high volatility favors referencing, since with high volatility you can run into race conditions.

If a document is used by many clients, it is better to reference than to embed. Referencing provides better write performance because you have smaller documents to write; however, it takes a performance hit on reads due to the multiple round trips.

Normalized (referenced) data may save some space, though storage is cheap today; it doesn't align with your classes and object model, but typically provides faster writes. Denormalized (embedded) data aligns with your classes and object model and requires only a single read and round trip, typically giving faster reads since you get the whole document at once; however, it requires updates in multiple places.

Clearly there are trade-offs between referencing and embedding. It is up to you to decide which to use based on your data model and access patterns.

Data is data, regardless of its shape; whether it lives in a table or a document store doesn't change that. Document stores are still structured: there is a schema and an object model, even if implicit.

It might take a while to get used to data modelling with key/value and document stores; you may need to bend your mind and let go of relational modelling concepts. Once you do, data modelling with document stores is fairly straightforward. Designing good keys is a skill in itself, and you can build quite powerful patterns once you master it.

These access patterns can be used in many key/value and NoSQL stores, for example Redis, Memcached, Couchbase, CouchDB, MongoDB, Riak, Dynamo and cloud document-store solutions.

Moreover, the hints above on referencing vs. embedding come in handy with these data stores as well.

Boundaries in NoSQL

Defining and developing a model is usually the first thing we do when building software. In this process we define classes and interfaces, along with their relationships and so on.

Aggregates and boundaries are patterns from domain-driven design. A collection of related objects can be viewed as a single unit called an aggregate, and the aggregate maintains its own integrity. Moreover, there are boundaries between classes and relationships.

We need to persist aggregates to the database. We usually use ORMs (object-relational mappers) for this purpose when working with relational databases.

[Figure: an aggregate split across relational tables]

Above, we have an aggregate for an online store: a shopping cart. Using an ORM and a relational data store, the aggregate is persisted to the related tables. Customer 1001, Ann, has many line items and payment details. The aggregate can be populated back into the object model in a similar way. ORMs provide great benefits when working with aggregates by simplifying the programmer's work.

The same pattern applies to NoSQL databases. Regardless of the persistence technology, if we use aggregates in our software, we persist them to the data store, and we can query the aggregate by line items, payment details, and so on.

How do aggregates and boundaries work with NoSQL stores?

Key/value stores, document stores, columnar stores and graph stores all have different means of storing aggregates. Key/value and document stores are similar in that both are stored to and queried by key, but documents can additionally be queried by their properties, which key/value stores do not support.

Columnar databases use a keyspace and column families: instead of rows, we have families of columns. Graph databases let us easily store and query relationships between objects.

Most NoSQL databases provide an easy way to persist in-memory aggregates to the data store.

While aggregate-oriented databases make great transactional systems, we can't say the same for analytics. Generating analytics and reports from such aggregates is difficult and inefficient because of how the data is laid out.

Even though NoSQL databases claim to be a great fit for big-data analytics, I wouldn't agree. One has to write map-reduce jobs against these databases, and there are serious limitations on them.

Couchbase Client SDK Internals

The previous post was an introduction to Couchbase. This post summarizes how the client SDKs work internally.

When developing applications that communicate with Couchbase, developers use client SDKs.

To communicate with Couchbase, clients create connections, send data over them, tear them down, and so forth.

Two things we need to pay attention to are thread safety and connection management. Whenever we work with systems that use TCP/IP, we should be careful with connections, because creating and tearing them down are expensive operations. Thread safety matters too: if the client SDK is not thread safe and is used carelessly in a multi-threaded environment or server, it will cause problems.

Most programming languages provide means for concurrency, threading and parallel computing, but if the client SDK doesn't support such functionality, you will have problems. That is why it is essential to know whether the client SDK you will be using is thread safe. Reusing a single object/connection performs much better than repeatedly creating and cleaning up references. That said, the Java and .NET clients are thread safe: if you create a single instance of the client, you can reuse it. However, if the connection fails, you will need to re-initialize it.

When it comes to connection management, a minimal number of connections is optimal. You should aim for one client/connection per application.
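A sketch of that idea in Python: a lazily created, shared client instance guarded by a lock. The Client class here is hypothetical, standing in for whichever SDK you use:

```python
import threading

class Client:
    """Hypothetical client class, standing in for a real SDK client."""
    def __init__(self, hosts):
        self.hosts = hosts  # a real client would open its connections here

_client = None
_lock = threading.Lock()

def get_client():
    # Create the expensive client once and share it across all threads,
    # instead of opening and tearing down connections per request.
    global _client
    if _client is None:
        with _lock:
            if _client is None:
                _client = Client(["node1.example.com", "node2.example.com"])
    return _client
```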

Key-based access uses the memcached protocol; views, on the other hand, are served over a different port.

When the client is initialized, we supply some basic information, such as where it can fetch the cluster map. The client uses HTTP to get the cluster's status and configuration data, and receives back the list of servers (all servers and their statuses) and the vBucket map (the list of all vBuckets in the cluster, each entry also giving the position of the replicas).

When the client performs any operation, the document key is hashed (consistent hashing). A Couchbase cluster has 1,024 vBuckets, and the hash of the key determines which vBucket the document belongs to. Once the vBucket is located, the client simply sends the data to the corresponding server.
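A sketch of the mapping in Python; the CRC32-mod-1024 scheme shown here is illustrative, and the exact hash bits a given client uses may differ:

```python
import zlib

NUM_VBUCKETS = 1024

def vbucket_for(key):
    # Hash the key and map it onto one of the 1,024 vBuckets.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS

# A toy cluster map: which node owns which vBucket (4 nodes here).
vbucket_map = {vb: "node%d" % (vb % 4) for vb in range(NUM_VBUCKETS)}

vb = vbucket_for("user::54")
print(vb, vbucket_map[vb])  # the client sends the operation to this node
```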

Using long polling, the client continuously communicates with a node to retrieve the cluster map, so it becomes aware of any changes to the map.

For client samples, you can check out the Couchbase GitHub page.

Couchbase introduction

We are all aware of relational database management systems, which are based on the relational model developed by Codd (normalization, foreign keys, joins, etc.). An RDBMS provides features like transactions, locks, ACID properties and joins, and stores data in tables as rows. However, RDBMSs have some limitations in terms of scalability and schema changes.

[Figure: sample database EER diagram]

Scalability has been a hot topic for modern web applications since the internet bubble; hence the terms “web scale” and “internet scale”. Today many developers are addicted to the word “scalability”: in their eyes, every architecture or design should be web scale. This doesn't apply to all projects, yet developers want to scale everything.

[Figure: horizontal vs. vertical scaling]

Scalability comes in two flavors: scaling out (horizontal) and scaling up (vertical). Scaling up usually means buying more expensive hardware with more CPU, RAM and storage. I have seen data stores with 8 TB of RAM, 128-core CPUs and lots of storage, but there are limits to what a single machine can do. Scaling out is simply adding more servers/nodes to a cluster. The challenges there are how to distribute the data across the cluster evenly or logically, how to find it again, how to handle transactions, and so on.

Even though articles and books describe schema changes in an RDBMS as difficult, today they are easier and can be done online without taking the whole database down. Many NoSQL stores, by contrast, are designed and built as schemaless. There is still an implicit schema, though: to access the data, the application needs to know the data model, so the schema lives in the application.

NoSQL data stores are no silver bullet, and projects are often built on them because of the hype (“hype-driven development”). It is essential to consider the product being developed and then pick the correct data store for it.

In this article, a cluster is a collection of instances configured as one logical cluster.

The major benefits of NoSQL stores are usually given as: being schemaless (which enables rapid application development), easy scaling, and high availability.

Moreover, you need to understand the CAP theorem if you are considering NoSQL stores.

Consistency: Generally speaking, in a distributed system the data across the cluster should be consistent; every component of the system should see the same data.

Availability: The system should be able to serve clients at all times, even if some parts of it have failed. In short, clients should be able to read, write and update all the time; every request made to the system should receive a response.

Partition tolerance: In a distributed system, failures in replication or in communication between nodes shouldn't stop the system from satisfying user requests. The system should continue functioning even if some parts fail or traffic is lost.

The CAP theorem states that in a distributed environment you can have only two of these three properties, never all of them. Depending on your product and requirements, you need to pick two.

[Figure: CAP theorem Venn diagram]

Couchbase goes for AP (availability and partition tolerance) and provides eventual consistency.


Couchbase Server is a persistent, distributed, document-based data store, part of the NoSQL movement. Even though we say Couchbase is document-oriented, the line between document-oriented and key/value stores is a little blurry, since objects are accessed by key. Couchbase also provides views, through which developers can access objects by other properties as well.

I wrote about Membase back in 2011; there was also CouchDB. The companies behind these two products merged to create Couchbase, which has the following set of features.

Scalability: Couchbase scales very well and in a very easy operational way. Data is distributed across the nodes of the cluster, so lookups, disk and I/O operations are shared between nodes. Distributing the data and load is achieved with vBuckets (logical partitions, i.e. shards), and clients use consistent hashing to locate the data. Moreover, Couchbase scales linearly without compromising performance: performance doesn't degrade as new nodes are added to the cluster.

Schemaless: You don't need to define an explicit schema, and the data model can be changed easily. In the relational model we need a rigid schema, yet the schema changes many times over the life of a project. This used to be a challenge for relational stores; today, migrations can change schemas online without taking the data store down, though care is needed to avoid data loss.

JSON document store: Data is stored in JSON format, a compact key/value data format used for many purposes.

Clustering with replication: All nodes within a cluster are identical. Couchbase provides easy, built-in clustering support as well as data replication with automatic failover. Clustering is certainly one of the features that enables high availability.

High availability: It is easy to add and remove nodes without compromising availability. There is no single point of failure in a Couchbase cluster, since clients are aware of the entire cluster topology, including where every document is located.

Cache: Couchbase has a built-in caching layer. Documents are kept in RAM, and when the cache is full, documents are ejected.

Homogeneous: Within a Couchbase cluster, all nodes are identical and equal; there is no master node and no single point of failure. Data and load are distributed uniformly across the cluster.

One of the most distinguishing features of Couchbase is that it is very fast, mostly thanks to the solid caching layer inherited from memcached.

A Couchbase instance consists of a data manager and a cluster manager.

[Figure: data manager and cluster manager]

The cluster manager is responsible for the nodes within the cluster, distribution of the data, failover handling, monitoring, statistics and logging. It generates the cluster map that clients use to find out where data is located, and it also provides a management console.

The data manager deals with the storage engine, cache and querying. Clients use the cluster map to find out where the data lives, then work with the data manager to retrieve, store or update it.

A bucket is a container that stores data, similar to a database in an RDBMS. There is no notion of tables in Couchbase; while an RDBMS stores data as rows, Couchbase stores data as JSON documents in buckets. There are two types of buckets in Couchbase.

In a bucket we have ejection, replication, warmup and rebalancing.

Memcached: Data is stored in memory only; there is no persistence. If the node fails, reboots or shuts down, you lose all the data, since it lives in volatile memory. The maximum document size is 1 MB, and if the bucket runs out of memory, eviction occurs and the oldest data is discarded. Replication is not possible in a memcached bucket. Data is stored as blobs.

Couchbase: Data is stored in memory and persisted to disk, and replication across the cluster delivers high availability. Once a client sends data, it is stored in memory and put on both the disk-write queue and the replication queue. Clients can read the document from RAM while persistence is happening; this is called eventual persistence. Once the data is persisted to disk, it can survive a reboot or shutdown. The maximum document size for a Couchbase bucket is 20 MB.

A document has a unique ID used to access it later, and its value can be anything. A document also carries some internal properties: Rev is an internal revision ID used for versioning, Expiration can be set to expire documents, and CAS is used for handling concurrency.

One of the fundamentals developers need to pay attention to is key patterns. When developing an application with Couchbase, you use key patterns to find documents later on.

When storing a document, you need a key to store it under and to access it later, so Couchbase behaves like a key/value store. However, you can also access documents without keys, for example when querying by other properties of the documents. Couchbase key patterns help with better management and development.

Virtual buckets come into play when distributing data across the cluster. Every document has a unique key, just like an RDBMS primary key. Each bucket has 1,024 logical partitions called virtual buckets (vBuckets). Using consistent hashing, the cluster knows which node to send data to and which node to retrieve it from.

Couchbase ensures that the most frequently used documents stay in memory, which improves performance. When memory cannot hold new incoming documents, old documents are ejected; the ejection threshold is configurable.

After a reboot or a backup import, Couchbase loads data from disk into memory in order to serve requests. This process is called warmup.

Couchbase can keep up to three replicas of the data across the cluster, which enables high availability: the same document lives in up to four different places. During a failure, the cluster map is rebuilt and requests are served from the replicas. Rebalancing takes place when nodes are added to or removed from the cluster, and once replication is complete, the cluster map is rebuilt.

It is usually better practice to store everything in a single document. Remember there are no joins in Couchbase, and you don't get the normalization you have in an RDBMS. Storing all the data in a single document means some redundancy, with the same data repeated across documents, but it also has benefits: storing and retrieving documents becomes faster, and no joins are needed since everything is in one place. If you split your data across multiple documents, you get normalization and use less disk space, but there may be a performance penalty since the documents will be spread across the cluster, and you will need multiple reads to fully load the object model. Knowing the pros and cons, you can decide which model to go with.

Atomicity is handled at the document level in Couchbase: a document is always stored as a whole, never partially. In a write operation the client doesn't wait for the document to be persisted to disk or replicated; the response is returned immediately. With the client SDK you can choose to wait until the document is persisted and replicated, at the cost of the client blocking until those operations complete. When reading a document that is not in memory, Couchbase fetches it from disk, stores it in memory, and returns it to the client.

The update operation deserves attention. CAS (compare-and-swap, or check-and-set) gives the client a CAS value with each document; when the client updates the document, it sends the CAS value along. If the CAS value matches the one on the server, the update succeeds; if it differs, the server rejects the update and the client gets an exception. In an RDBMS, a client maintains consistency by acquiring a lock on a record so another client cannot update it, and a bulk update can mean table-level locking, which hurts application performance. In Couchbase, document-level locking is possible, and both optimistic and pessimistic concurrency control are available.
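A sketch of CAS-style optimistic concurrency against a dict-backed store; CasMismatch and the replace signature are invented for illustration, not taken from any real SDK:

```python
class CasMismatch(Exception):
    """The stored CAS no longer matches the one the client read."""

store = {}  # key -> (cas, document)

def get(key):
    return store[key]  # returns (cas, document)

def replace(key, doc, cas):
    current_cas, _ = store[key]
    if current_cas != cas:
        raise CasMismatch(key)   # someone else updated the doc in between
    store[key] = (current_cas + 1, doc)

store["user::54"] = (1, {"name": "Firat", "visits": 0})

cas, doc = get("user::54")
doc["visits"] += 1
replace("user::54", doc, cas)    # succeeds: the CAS still matches
print(store["user::54"])
```

On a CAS mismatch, the usual recovery is to re-read the document, reapply the change, and retry.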

When writing a new document, the server acknowledges to the client that it has received the document; to find out whether the document has been persisted to disk, the Observe method can be used.

The Flush method removes all documents from a bucket. Use it with care.


Moreover, there are atomic counters which can be used for counting operations or access patterns.

Couchbase supports both synchronous and asynchronous connections and operations.