Everything you need to know about data modelling for document databases

Back in the 90s and early 2000s, applications looked like the following:

[Image: early days of applications]

Today, building applications is different from what it was back in 2000. Building modern-day applications involves quite a few different technologies.

[Image: today's applications]

Back in the day, you could estimate and predict the size of your audience. Today, virality and different marketing schemes make it much harder to predict audience size. You never know: your next app might be the big hit.

We have spent so much time in the relational world that it takes time, effort and practice to bend your mind and understand how to do things the NoSQL way.

In relational data stores, data is organized into tables of rows and columns. Data types are usually simple ones like string, int, float and datetime. It is hard to store complex data types such as arrays, hash tables and complex objects; this gap is known as the impedance mismatch.

On the other hand, key/value and document stores can store simple data types too, and they are also comfortable storing complex data types, usually in JSON format. JSON is essentially a text format for representing data structures. In key/value and document stores, the schema is implicit and unenforced.

For many years now, we have been taught to normalize our data model to reduce redundancy and improve efficiency, as shown below.

[Image: aggregate split into normalized tables]

Relational data stores are like a garage: you break your car apart and store the related pieces together on the shelves.

In a document database, you persist the vehicle the way it is. "Come as you are."

[Image: the same aggregate represented as a JSON document]
The same data can be represented in JSON document format, as above. Representing the aggregate in JSON is simpler and more compact; it is a denormalized form of the data, and it also makes it more convenient to distribute the data across multiple servers or clusters. The main problem with relational data stores here is the JOIN operation, which is expensive.
Here is the domain model for this particular example:

While ORMs come to the rescue for relational databases, performance usually becomes a problem. One can argue that serialization/deserialization of JSON documents can also become a problem; however, serialization frameworks are a lot faster these days.

Schema-free databases make it easy to rapidly develop software; however, you need to understand and embrace the principles behind data modelling for document databases.

Questions you need to keep in mind during modelling are:
• How is data going to be persisted?
• How are you going to retrieve or query data?
• Is your application write or read bound?

Key Patterns:
We start by defining unique keys, which we call predictable keys (keys we can construct and query easily). The following keys have corresponding values which are JSON documents.

-Users:
user::[email protected] -> User data as value
user::[email protected] -> User data as value

-Products:
product::12345 [ID of the product] -> Product data as value

Then we have unpredictable keys as follows:

-Sessions
session::a5ab2863-db93-430e-8da3-feeb1998521f  -> Session document data

Unpredictable keys are queried with Map-Reduce functions in most key/value and NoSQL databases.
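To make the key patterns concrete, here is a minimal sketch in Python. A plain dict stands in for the key/value store, and the build_user_key/build_session_key helpers and the example email are made up for illustration, not part of any particular client library.

import json
import uuid

store = {}  # in-memory stand-in for a key/value or document store

def build_user_key(email):
    # predictable key: we can construct it from data we already know
    return "user::" + email

def build_session_key():
    # unpredictable key: a random UUID, not derivable from the data itself
    return "session::" + str(uuid.uuid4())

# persist documents as JSON values under their keys
store[build_user_key("jane@example.com")] = json.dumps({"name": "Jane", "email": "jane@example.com"})
store["product::12345"] = json.dumps({"id": 12345, "title": "Widget"})
store[build_session_key()] = json.dumps({"user": "jane@example.com", "ttl": 3600})

# predictable keys can be fetched directly by rebuilding the key...
user = json.loads(store["user::jane@example.com"])
# ...while unpredictable keys need a secondary mechanism (a map-reduce view or a lookup key)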

Counter-Id Pattern:

Almost all key/value and document stores provide very fast atomic counters. These counters are safe for concurrent increment and decrement operations. The idea behind the counter-id pattern is as follows:

[Image: counter-id pattern]

Additionally, we can read the current value of the counter at any time, and hence we can predictably query users or iterate over them. This is a pretty easy and clever technique in my opinion. Since we only ever increment, I know the size of the data set, so I can even run multi-get operations on it or do some paging.

This is similar to an identity column in an RDBMS. Each time I want to store a document, I increment the counter, get the counter value, create a predictable key from it and store the key/value pair.
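A minimal sketch of the counter-id pattern along the same lines. The in-memory counter below is only a stand-in; in a real store you would use its atomic increment operation (for example Redis INCR or a Couchbase counter) so concurrent clients never see the same value.

import json

store = {}
store["counter::user"] = 0  # an atomic counter in a real data store

def next_user_id():
    # stand-in for an atomic increment; real stores do this server-side
    store["counter::user"] += 1
    return store["counter::user"]

def save_user(user):
    user_id = next_user_id()          # 1. increment the counter
    key = "user::" + str(user_id)     # 2. build a predictable key
    store[key] = json.dumps(user)     # 3. store the document
    return key

save_user({"name": "Jane"})   # -> "user::1"
save_user({"name": "John"})   # -> "user::2"

# since we know the counter value, we can iterate or page over users
total = store["counter::user"]
users = [json.loads(store["user::" + str(i)]) for i in range(1, total + 1)]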

Lookup Pattern:

The lookup pattern is usually a two-step process, as follows:

[Image: lookup pattern]

Initially, we store the data with an unpredictable key (such as a GUID/UUID), then we create references to that key with predictable keys such as email or username.

In this example, we store the user data under a GUID-based key, then we store references to that key under the email and username. To retrieve the user data in a predictable way, we use the email or username to get the GUID-based key, then do another GET operation with the key captured from the first query to get the actual data.

This makes it easy to retrieve documents without map-reduce operations, which is quite useful for large datasets. Most data stores provide map-reduce views with eventual consistency, backed by B-trees on disk with O(log n) complexity. The lookup pattern, on the other hand, provides immediate consistency with O(1), constant lookup time. Hence the lookup pattern is very fast, scales very well and gives you read-your-own-writes semantics along with consistent performance.
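A sketch of the lookup pattern under the same assumptions (in-memory dict as the store, hypothetical helper and key names):

import json
import uuid

store = {}

def create_user(email, username, data):
    primary_key = "user::" + str(uuid.uuid4())   # unpredictable key
    store[primary_key] = json.dumps(data)
    # predictable lookup keys whose values point back to the primary key
    store["email::" + email] = primary_key
    store["username::" + username] = primary_key
    return primary_key

def get_user_by_email(email):
    primary_key = store["email::" + email]       # step 1: O(1) lookup
    return json.loads(store[primary_key])        # step 2: O(1) fetch

create_user("jane@example.com", "jane", {"name": "Jane"})
print(get_user_by_email("jane@example.com"))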

You can use these patterns together. For example, you can combine the counter-id pattern with the lookup pattern as follows:

[Image: counter-id pattern combined with lookup pattern]

Initially we get an ID from our counter and store the data under the key "user::54". Then we add additional lookup keys, such as ("twitter::firat" -> "user::54"), for whatever other keys we will need to reach this particular document. Again, we need a two-step process to get the data: first we do a GET on "twitter::firat", which gives us "user::54", then we do another GET on "user::54" and we have the user document.

Yet again, these are pretty fast operations since they run in constant lookup time.
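Putting the two patterns together might look roughly like this; the counter starting value and document fields are made up to match the example above:

import json

store = {"counter::user": 53}  # pretend 53 users already exist

def save_user(data, lookup_keys):
    store["counter::user"] += 1                      # counter-id pattern
    primary_key = "user::" + str(store["counter::user"])
    store[primary_key] = json.dumps(data)
    for lookup_key in lookup_keys:                   # lookup pattern
        store[lookup_key] = primary_key
    return primary_key

save_user({"twitter": "firat"}, ["twitter::firat"])  # stored as "user::54"

# two-step read: lookup key -> primary key -> document
primary_key = store["twitter::firat"]                # "user::54"
user = json.loads(store[primary_key])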

Pros of combining the counter-id and lookup patterns:
• These are extremely fast binary operations.
• Linear, predictable performance.
• Enables several ways to find a single document.
• Always consistent.

Cons of combining the counter-id and lookup patterns:
• Increases the number of documents in the store, which is usually not a problem.

Below you can find the differences between RDBMS and Document Stores for persisting and querying the data.

[Image: RDBMS vs. key patterns for persisting and querying data]
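As a rough comparison of the two approaches (table, column and key names are made up), the same read looks like this:

store = {"user::jane@example.com": '{"name": "Jane"}'}  # as built by the key pattern sketches

# RDBMS: query by a secondary attribute, usually through an index
sql = "SELECT * FROM users WHERE email = 'jane@example.com'"

# key/value store: rebuild the predictable key and GET the document directly
user_json = store["user::jane@example.com"]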

Related data patterns:
To embed or to reference? That is the question.
Embedding is easy. You can persist the document the way it is.

Pros of embedding:
It is easy to retrieve the document: a single read operation brings back the whole document, and a single write or update operation takes care of your insert or update.

Cons of embedding:
In most data stores there is a limit on document size, such as 16 MB (MongoDB) or 20 MB (Couchbase).

Most key/value and document stores are designed to store relatively small documents. Storing large documents is usually a bad idea.

Retrieving a large document can be a burden on the network, even if you have gzip enabled. When multiple clients update the same document, you may run into versioning problems that you need to handle, which means implementing optimistic or pessimistic locking.
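To illustrate the optimistic locking point, here is a hedged sketch of a compare-and-swap style update using a version number on an in-memory document, rather than any particular database's CAS API:

import json

store = {}
store["user::54"] = {"version": 1, "doc": json.dumps({"name": "Jane", "comments": []})}

class ConflictError(Exception):
    pass

def cas_update(key, expected_version, new_doc):
    # optimistic locking: the write only succeeds if nobody else updated
    # the document since we read it (the version still matches)
    current = store[key]
    if current["version"] != expected_version:
        raise ConflictError("document was modified by another client; retry")
    store[key] = {"version": expected_version + 1, "doc": json.dumps(new_doc)}

entry = store["user::54"]
doc = json.loads(entry["doc"])
doc["comments"].append("hello")
cas_update("user::54", entry["version"], doc)  # succeeds and bumps the version to 2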

At the same time, embedding can be performant due to locality of reference.

When is it good to embed:
• There is a "contains" relationship between entities.
• There are one-to-few relationships between entities.
• The embedded data changes infrequently.
• The embedded data won't grow without bound.
• The embedded data is integral to the data in the document.

Usually, denormalized or embedded data models provide better read performance.

Below is an example of an embedded document:
[Image: embedded document in JSON format]
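As a rough illustration (the field names and values are hypothetical), an embedded order document might look like this, written as a Python dict so it can be serialized straight to JSON:

import json

order = {
    "orderId": 2001,
    "customer": {"name": "Jane", "email": "jane@example.com"},
    "shippingAddress": {"street": "1 Main St", "city": "Istanbul"},
    # line items embedded directly inside the order document
    "lineItems": [
        {"product": "product::12345", "quantity": 2, "price": 9.99},
        {"product": "product::67890", "quantity": 1, "price": 24.50},
    ],
}

# one write persists the whole aggregate, one read brings it all back
store = {}
store["order::2001"] = json.dumps(order)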

Below is an example of a document with references:

[Image: document with references]
The referenced line items are separate documents, such as the following:

[Image: line item document]

Each line item can be accessed with the key captured from the LineItems array above.
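A hedged sketch of the referenced variant: the order holds only the line item keys, and each line item lives in its own small document (keys and fields are illustrative):

import json

store = {}

# each line item is its own small document
store["lineitem::9001"] = json.dumps({"product": "product::12345", "quantity": 2})
store["lineitem::9002"] = json.dumps({"product": "product::67890", "quantity": 1})

# the order document only references the line items by key
store["order::2001"] = json.dumps({
    "orderId": 2001,
    "customer": "user::54",
    "LineItems": ["lineitem::9001", "lineitem::9002"],
})

# reading the full order now takes one GET for the order plus one GET per line item
order = json.loads(store["order::2001"])
line_items = [json.loads(store[key]) for key in order["LineItems"]]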

You can use both referencing and embedding as below:

[Image: mixing embedding and referencing]

You can also use referencing heavily, as follows:

[Image: heavy referencing]

Sometimes we need to embed, sometimes we need to reference. When do we embed and when do we reference? This is the critical decision for efficient modelling. There are some signals that help you choose between embedding and referencing.

Relationships…

[Image: relationship types]

A 1:1 (one-to-one) relationship is usually better suited to embedding. If there is a 1:few relationship, it is also acceptable to embed, for example a blog post document with its tags or categories. So we need to pay attention to relationships and dependencies. Embedding leads to better performance because it saves round trips (RTTs) to the database.

A 1:many (one-to-many) relationship is better suited to referencing, for example a blog post with its comments. A blog post can have hundreds of comments of varying sizes, and there is no bound on the referenced data. Referencing is more efficient than embedding in this case; with embedding you can also run into race conditions.

An M:N (many-to-many) relationship is well suited to referencing as well.
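For the blog example, a rough sketch of how this guidance plays out (hypothetical keys and fields): tags are 1:few and change rarely, so they are embedded; comments are 1:many and unbounded, so they are referenced.

import json

store = {}

# 1:few, rarely changing -> embed the tags in the post document
# 1:many, unbounded      -> keep comments as separate, referenced documents
store["post::42"] = json.dumps({
    "title": "Data modelling for document databases",
    "tags": ["nosql", "modelling"],
    "comments": ["comment::1", "comment::2"],
})
store["comment::1"] = json.dumps({"author": "user::54", "text": "Nice post!"})
store["comment::2"] = json.dumps({"author": "user::55", "text": "Thanks for sharing."})

post = json.loads(store["post::42"])
comments = [json.loads(store[key]) for key in post["comments"]]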

Data volatility is also important when choosing between embedding and referencing. How often does the data change? If the data does not change often, it is better to embed than to reference. However, if there are a lot of operations on particular documents, it is usually better to reference. In short: if volatility is low, embedding is fine; if volatility is high, referencing is better, because with high volatility you can run into race conditions.

If a document is used by many clients, it is better to reference than to embed. Referencing provides better write performance because you have smaller documents to write. However, it causes a performance hit for read operations due to multiple round trips.

Normalized (referenced) data may save some space, but storage is cheap today. It does not align well with your classes/object model, and referencing typically provides faster writes. Denormalized (embedded) data aligns with your classes/objects and requires only a single read and round trip, so it typically provides faster reads since you get the whole document at once; however, because data is duplicated, it requires updates in multiple places.

Clearly there are trade-offs between referencing and embedding. It is up to you to decide which one to use based on your data model and access patterns.

Data is data, regardless of its shape. It can be stored in a table or in a document store; it doesn't matter. Document stores are still structured: there is a schema and an object model, even if they are implicit and unenforced.

It might take a while to get used to data modelling with key/value and document stores; you might need to bend your mind and let go of relational data modelling concepts. Once you get used to it, data modelling with document stores is pretty trivial. Designing good keys is a skill in itself, but you can create quite amazing patterns once you get the hang of it.

These access patterns can be used in many key/value and NoSQL stores, for example: Redis, Memcached, Couchbase, CouchDB, MongoDB, Riak, Dynamo and cloud document store solutions.

Moreover, the hints above on referencing versus embedding come in handy with these data stores as well.