The following slides are from the tech talk I gave at Yahoo!.
NoSQL: Theory, Implementation, an Introduction.
Christof Strauch, from Stuttgart Media University, has written an incredible 120+ page paper titled NoSQL Databases as an introduction and overview to NoSQL databases. The paper was written between 2010-06 and 2011-02, so it may be a bit out of date, but if you are looking to take in the NoSQL world in one big gulp, this is your chance. I asked Christof to give us a short taste of what he was trying to accomplish in his paper:
The paper aims at giving a systematic and thorough introduction and overview of the NoSQL field by assembling information dispersed among blogs, wikis and scientific papers. It firstly discusses reasons, rationales and motives for the development and usage of nonrelational database systems. These can be summarized by the need for high scalability, the processing of large amounts of data, the ability to distribute data among many (often commodity) servers, consequently a distribution-aware design of DBMSs.
The paper then introduces fundamental concepts, techniques and patterns that are commonly used by NoSQL databases to address consistency, partitioning, storage layout, querying, and distributed data processing. Important concepts like eventual consistency and ACID vs. BASE transaction characteristics are discussed along with a number of notable techniques such as multi-version storage, vector clocks, state vs. operational transfer models, consistent hashing, MapReduce, and row-based vs. columnar vs. log-structured merge tree persistence.
As a first class of NoSQL databases, key-value stores are examined by looking at the proprietary, fully distributed, eventually consistent Amazon Dynamo store as well as popular open-source key-value stores like Project Voldemort, Tokyo Cabinet/Tyrant and Redis. Next, document stores are examined by reviewing CouchDB and MongoDB as the two major representatives of this class of NoSQL databases. Lastly, the paper takes a look at column stores by discussing Google’s Bigtable, Hypertable and HBase, as well as Apache Cassandra, which integrates the full distribution and eventual consistency of Amazon’s Dynamo with the data model of Google’s Bigtable.
Over the last couple of years the NoSQL movement and its products have grown considerably. In this blog you will find some useful information on the classification and ecosystem of NoSQL systems.
Dynamic Scaling
I wrote about memcached in my previous post, and I have been researching Membase, which is a NoSQL implementation of memcached, that is, memcached persisted to disk. Membase is a key-value database management system. It's very easy to install and configure, and you can get it running within 5 minutes. It is backwards compatible with memcached. Membase provides simple, fast, easy-to-scale key-value data operations with low latency and high throughput; you will very often hear about these two properties of Membase. Membase uses asynchronous writes wherever possible. Membase is designed to run on many configurations, from a single node to clusters of servers.
Membase was mainly developed by the folks who developed Memcached.
In my experience, I was not able to mix a Windows 7 instance of Membase with a CentOS instance.
Membase adds disk persistence (with hierarchical storage management), data replication, live cluster reconfiguration, rebalancing and multi-tenancy with data partitioning on top of Memcached's caching functionality. In my opinion Membase has become a different product from memcached. Even though you can set up a memcached instance with the same install package, the underlying algorithms should be different.
Membase implements CP (Consistency and Partition Tolerance) in terms of the CAP theorem, which means availability is what you give up.
One of the interesting features of Membase is rebalancing. Say you have installed an instance of Membase and want to add more servers: simply install another instance of Membase on another server, then join that server to the Membase cluster via the web admin tool. Once you hit the rebalance button, everything is done for you behind the scenes. Warning: the server that you are joining to the cluster will lose all of its data. Membase does replication via asynchronous writes.
Below is a benchmark of Membase on VMware instances. The instances are identical. The benchmark was run with the C# Enyim client library, and the VMware instances run CentOS 5.5 with 2 GB of RAM each.
| Configuration | Items | Writes/second | Reads/second | Updates/second |
| Single node | 1.000.000 | 6250 | 8196 | 6024 |
| Two nodes, rebalancing enabled | 1.000.000 | 3412 | 8130 | 4219 |
| Three nodes, rebalancing enabled | 1.000.000 | 3984 | 6993 | 3937 |
As you can see, performance degrades somewhat as the number of replicas increases, which is natural.
Membase guarantees data consistency. Once you insert an item, it's replicated across all the servers.
Membase is simple (key value store), fast (low, predictable latency) and elastic (effortlessly grow or shrink a cluster).
Membase supports Memcached ASCII and Binary protocols (uses existent Memcached libraries and clients). You can use Memcached clients to work with Membase as well.
Membase has a nice, very user-friendly management interface where you can manage your instances easily.
Membase is elastic in nature: you can scale out very easily, adding more servers within seconds and rebalancing them. You can also configure failover and auto-failover rather easily. The failover mechanism is very fast.
In a clustered Membase setup there is a master node: the set of servers elects a leader, and that's your master node. Replication and write operations are coordinated through this master node.
As I wrote in my Memcached article, Memcached uses consistent hashing to determine which server to write data to and read it from. That consistent hashing algorithm was nothing more than a modulo operator. Membase uses a similar algorithm, but this time with vBuckets, i.e. virtual buckets. This is similar to a dictionary implementation. With this solution the servers are still not aware of each other, which is a desirable property. vBuckets add another layer in front of the set of servers: the set of all possible keys is divided into an array of buckets, and each bucket is mapped to a server. It is simply a mapping. You can read more about Membase vBuckets in the NorthScale articles.
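The two-step mapping described above (key to vBucket, vBucket to server) can be sketched in a few lines of JavaScript. This is only an illustration: the hash function, the number of vBuckets and the server names are assumptions made for the example, not what Membase actually uses.

```javascript
// Illustrative only: Membase uses its own hash function and vBucket count.
const NUM_VBUCKETS = 8; // hypothetical, kept small for readability

// Simple deterministic string hash (djb2 style); an assumption for this sketch.
function hashKey(key) {
  let h = 5381;
  for (let i = 0; i < key.length; i++) {
    h = ((h * 33) ^ key.charCodeAt(i)) >>> 0; // keep an unsigned 32-bit value
  }
  return h;
}

// Step 1: a key always hashes to the same vBucket, whatever the cluster looks like.
function vbucketFor(key) {
  return hashKey(key) % NUM_VBUCKETS;
}

// The vBucket map: index = vBucket id, value = server that owns it.
// Round-robin assignment is a simplification chosen for this example.
function buildVBucketMap(servers) {
  const map = [];
  for (let id = 0; id < NUM_VBUCKETS; id++) {
    map.push(servers[id % servers.length]);
  }
  return map;
}

// Step 2: only the map decides which server a vBucket (and so a key) lives on.
function serverFor(key, vbucketMap) {
  return vbucketMap[vbucketFor(key)];
}

const before = buildVBucketMap(["server-a", "server-b"]);
const after = buildVBucketMap(["server-a", "server-b", "server-c"]);
console.log(vbucketFor("user:42"), serverFor("user:42", before), serverFor("user:42", after));
```

The point of the extra layer is that adding a server only changes the map: the key-to-vBucket step never changes, so clients just need the new map.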
As a NoSQL implementation, I have done some prototyping and researching with Membase. Even though I ran into some issues, I was happy with the overall implementation of Membase. You can read more about Membase on the official site. Have fun!
UPDATE: Turns out that the following post is invalid/inaccurate.
Don’t use MongoDB
=================
I’ve kept quiet for awhile for various political reasons, but I now
feel a kind of social responsibility to deter people from banking
their business on MongoDB.
Our team did serious load on MongoDB on a large (10s of millions
of users, high profile company) userbase, expecting, from early good
experiences, that the long-term scalability benefits touted by 10gen
would pan out. We were wrong, and this rant serves to deter you
from believing those benefits and making the same mistake
we did. If one person avoids the trap, it will have been
worth writing. Hopefully, many more do.
Note that, in our experiences with 10gen, they were nearly always
helpful and cordial, and often extremely so. But at the same
time, that cannot be reason alone to suppress information about
the failings of their product.
Why this matters
----------------
Databases must be right, or as-right-as-possible, b/c database
mistakes are so much more severe than almost every other variation
of mistake. Not only does it have the largest impact on uptime,
performance, expense, and value (the inherent value of the data),
but data has *inertia*. Migrating TBs of data on-the-fly is
a massive undertaking compared to changing drcses or fixing the
average logic error in your code. Recovering TBs of data while
down, limited by what spindles can do for you, is a helpless
feeling.
Databases are also complex systems that are effectively black
boxes to the end developer. By adopting a database system,
you place absolute trust in their ability to do the right thing
with your data to keep it consistent and available.
Why is MongoDB popular?
-----------------------
To be fair, it must be acknowledged that MongoDB is popular,
and that there are valid reasons for its popularity.
* It is remarkably easy to get running
* Schema-free models that map to JSON-like structures
have great appeal to developers (they fit our brains),
and a developer is almost always the individual who
makes the platform decisions when a project is in
its infancy
* Maturity and robustness, track record, tested real-world
use cases, etc, are typically more important to sysadmin
types or operations specialists, who often inherit the
platform long after the initial decisions are made
* Its single-system, low concurrency read performance benchmarks
are impressive, and for the inexperienced evaluator, this
is often The Most Important Thing
Now, if you’re writing a toy site, or a prototype, something
where developer productivity trumps all other considerations,
it basically doesn’t matter *what* you use. Use whatever
gets the job done.
But if you’re intending to really run a large scale system
on Mongo, one that a business might depend on, simply put:
Don’t.
Why not?
--------
**1. MongoDB issues writes in unsafe ways *by default* in order to
win benchmarks**
If you don’t issue getLastError(), MongoDB doesn’t wait for any
confirmation from the database that the command was processed.
This introduces at least two classes of problems:
* In a concurrent environment (connection pools, etc), you may
have a subsequent read fail after a write has “finished”;
there is no barrier condition to know at what point the
database will recognize a write commitment
* Any unknown number of save operations can be dropped on the floor
due to queueing in various places, things outstanding in the TCP
buffer, etc, when your connection drops or the db were to be KILL’d or
segfault, hardware crash, you name it
**2. MongoDB can lose data in many startling ways**
Here is a list of ways we personally experienced records go missing:
1. They just disappeared sometimes. Cause unknown.
2. Recovery on corrupt database was not successful,
pre transaction log.
3. Replication between master and slave had *gaps* in the oplogs,
causing slaves to be missing records the master had. Yes,
there is no checksum, and yes, the replication status had the
slaves current
4. Replication just stops sometimes, without error. Monitor
your replication status!
**3. MongoDB requires a global write lock to issue any write**
Under a write-heavy load, this will kill you. If you run a blog,
you maybe don’t care b/c your R:W ratio is so high.
**4. MongoDB’s sharding doesn’t work that well under load**
Adding a shard under heavy load is a nightmare.
Mongo either moves chunks between shards so quickly it DOSes
the production traffic, or refuses to move chunks altogether.
This pretty much makes it a non-starter for high-traffic
sites with heavy write volume.
**5. mongos is unreliable**
The mongod/config server/mongos architecture is actually pretty
reasonable and clever. Unfortunately, mongos is complete
garbage. Under load, it crashed anywhere from every few hours
to every few days. Restart supervision didn’t always help b/c
sometimes it would throw some assertion that would bail out a
critical thread, but the process would stay running. Double
fail.
It got so bad the only usable way we found to run mongos was
to run haproxy in front of dozens of mongos instances, and
to have a job that slowly rotated through them and killed them
to keep fresh/live ones in the pool. No joke.
**6. MongoDB actually once deleted the entire dataset**
MongoDB, 1.6, in replica set configuration, would sometimes
determine the wrong node (often an empty node) was the freshest
copy of the data available. It would then DELETE ALL THE DATA
ON THE REPLICA (which may have been the 700GB of good data)
AND REPLICATE THE EMPTY SET. The database should never never
never do this. Faced with a situation like that, the database
should throw an error and make the admin disambiguate by
wiping/resetting data, or forcing the correct configuration.
NEVER DELETE ALL THE DATA. (This was a bad day.)
They fixed this in 1.8, thank god.
**7. Things were shipped that should have never been shipped**
Things with known, embarrassing bugs that could cause data
problems were in “stable” releases–and often we weren’t told
about these issues until after they bit us, and then only b/c
we had a super duper crazy platinum support contract with 10gen.
The response was to send up a hot patch and that they were
calling an RC internally, and then run that on our data.
**8. Replication was lackluster on busy servers**
Replication would often, again, either DOS the master, or
replicate so slowly that it would take far too long and
the oplog would be exhausted (even with a 50G oplog).
We had a busy, large dataset that we simply could
not replicate b/c of this dynamic. It was a harrowing month
or two of finger crossing before we got it onto a different
database system.
**But, the real problem:**
You might object, my information is out of date; they’ve
fixed these problems or intend to fix them in the next version;
problem X can be mitigated by optional practice Y.
Unfortunately, it doesn’t matter.
The real problem is that so many of these problems existed
in the first place.
Database developers must be held to a higher standard than
your average developer. Namely, your priority list should
typically be something like:
1. Don’t lose data, be very deterministic with data
2. Employ practices to stay available
3. Multi-node scalability
4. Minimize latency at 99% and 95%
5. Raw req/s per resource
10gen’s order seems to be, #5, then everything else in some
order. #1 ain’t in the top 3.
These failings, and the implied priorities of the company,
indicate a basic cultural problem, irrespective of whatever
problems exist in any single release: a lack of the requisite
discipline to design database systems businesses should bet on.
Please take this warning seriously.
With vertical scalability, we throw bigger, more powerful boxes at the problem and pray that it scales.
When a single big box has no more capacity, we need another big box, and then we have to worry about consistency, latency, replication, fault tolerance and so forth. We want to optimize our replication and communication, so we are tempted to disable logging and journals just to improve performance, which is not desirable at all. Sooner or later, we run into many problems managing these boxes.
Then we come to a point where we want to take some load off the database: enter distributed caching. Caching doesn’t fully solve our problems, because we are still dealing with the relational data model, joins, schema changes, normalization and queries.
We come to a point at which we want our application and data stores to be massively scalable, fault tolerant and consistent. While trying to achieve all these properties, we face several problems.
Over the last two decades, RDBMSs became very popular for several reasons, simplicity being the major one. With the development of the powerful SQL (Structured Query Language), relational databases became the center of database systems. With SQL, data manipulation and data definition became easy for everyone, and SQL became an ANSI standard.
Enter transactions. An RDBMS with SQL needed transactions, and in my opinion transactions are a very important property for a data store that has to be aware of state. With transactions, most people think of commit, rollback and ACID. Commit means that if everything goes well with an operation, we can safely store the data; if something fails and we are left with an inconsistent state, we roll back in order to avoid it.
ACID is a rather large topic. We are talking about Atomicity, Consistency, Isolation and Durability. Atomic means all or nothing: we either commit all of the data or none of it. Consistent means the data is in one and only one state; users of the data cannot see multiple states or versions of it. Isolated means that concurrent transactions do not interfere with each other: each one behaves as if it were the only client operating on the data, because uncontrolled simultaneous changes would leave the data in an inconsistent state. Durable means that once a transaction is committed, the data stays in that state until another transaction modifies it.
Having said that about transactions and ACID, it is rather easy to implement transactions in a single application compared to a distributed system. When you want to implement transactions across a distributed system, you have to consider the whole system, which introduces transaction managers, synchronization and so on. Distributed transactions introduce several complexities, and fault tolerance has to be implemented very carefully.
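The commit and rollback behaviour described above can be illustrated with a tiny in-memory store. This is a minimal sketch under simplifying assumptions (one open transaction at a time, no durability, hypothetical names such as `TinyStore` and `transfer`); it is not how a real RDBMS implements transactions.

```javascript
// Minimal in-memory sketch of commit/rollback; not how a real RDBMS is built.
class TinyStore {
  constructor() {
    this.data = new Map(); // committed state
    this.pending = null;   // uncommitted writes of the open transaction
  }
  begin() {
    this.pending = new Map();
  }
  set(key, value) {
    if (!this.pending) throw new Error("no open transaction");
    this.pending.set(key, value);
  }
  get(key) {
    return this.data.get(key); // readers only ever see committed values
  }
  commit() {
    // Atomicity: all pending writes become visible together.
    for (const [k, v] of this.pending) this.data.set(k, v);
    this.pending = null;
  }
  rollback() {
    // Discard the pending writes; the committed state is untouched.
    this.pending = null;
  }
}

// Move money between two accounts: both updates happen, or neither does.
function transfer(store, from, to, amount) {
  store.begin();
  try {
    const fromBalance = store.get(from) - amount;
    if (fromBalance < 0) throw new Error("insufficient funds");
    store.set(from, fromBalance);
    store.set(to, store.get(to) + amount);
    store.commit();
    return true;
  } catch (err) {
    store.rollback();
    return false;
  }
}

const store = new TinyStore();
store.begin();
store.set("alice", 100);
store.set("bob", 0);
store.commit();
console.log(transfer(store, "alice", "bob", 30), store.get("alice"), store.get("bob"));
```

A failed transfer leaves both balances exactly as they were, which is the "all or nothing" property in its simplest form.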
To scale a database, sharding was introduced: dividing up the data into meaningful clusters based on a common id. For example, you can shard by the initial letter of a person's name, which gives you 26 or so servers from A to Z, and distribute your data across those servers accordingly. Sharding is like partitioning your data based on a common key, so selecting that key is very important. There are also other ways of sharding, such as feature or functional sharding, in which you shard your databases based on functionality and features; for example, you can store user data in one database and product data in another. Then there is key-based sharding, as I described above, which can be tweaked to fit what you need; hashing can also be used in this scheme. Another way to shard is to use a lookup table, hash table or dictionary. The disadvantage of this (if there is a single lookup table) is that the table becomes a bottleneck and a single point of failure. However, these days there are several fault-tolerant distributed hash tables.
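The three routing schemes mentioned above (by initial letter, by hash, by lookup table) can each be sketched as a small function that maps a key to a shard number. The function names, the catch-all shard and the sample directory are hypothetical choices made for this illustration.

```javascript
// Key-based sharding by first letter: 'A' -> 0 ... 'Z' -> 25.
function shardByInitial(name) {
  const c = name.trim().toUpperCase().charCodeAt(0);
  if (!(c >= 65 && c <= 90)) return 26; // hypothetical catch-all shard for non A-Z keys
  return c - 65;
}

// Hash-based sharding: usually spreads keys more evenly than initials do.
function shardByHash(key, shardCount) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // simple 32-bit string hash
  }
  return h % shardCount;
}

// Lookup-table sharding: an explicit directory from key to shard.
// A single directory like this is the bottleneck the text warns about.
const directory = new Map([["alice", 0], ["bob", 1]]);
function shardByLookup(key) {
  if (!directory.has(key)) throw new Error("unknown key: " + key);
  return directory.get(key);
}

console.log(shardByInitial("Alice"), shardByHash("alice", 4), shardByLookup("bob"));
```

Sharding by initial is easy to reason about but tends to produce uneven shards (few names start with X), which is one reason hashing is often preferred.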
Sharding introduced shared-nothing architectures: the entities within the shards are not aware of each other and operate independently. Shared-nothing architecture has become very popular in the last few years. With shared nothing there is no dependency between the entities of your system. There is no central control or master node. All of the nodes are the same.
Cassandra is a decentralized, distributed, fault-tolerant, elastically scalable, tunably consistent, column-oriented database. It was designed based on the Amazon Dynamo and Google Bigtable models. Cassandra is decentralized and distributed in a way that is transparent to its clients: it acts as a single entity whether it is distributed or not. There is no central authority and there are no master nodes. This is a great benefit in terms of failures: there is no single point of failure, so if one of your Cassandra servers dies, the whole system keeps performing as if nothing happened. This is often called server symmetry: all the nodes are symmetric. Because Cassandra nodes are identical and there is no central authority, this decentralized model greatly increases availability.
High availability and fault tolerance are must-have aspects of a distributed system: whatever is happening inside the system, it should be seamless to its clients. High availability means satisfying the requests of the clients. Moreover, in daily computing you face all kinds of problems, from a system bug to a hardware failure, and it is very hard to know when a problem will occur within a system. These problems should be considered and handled very carefully. Under any kind of error, the system should still be available and do its job.
Consistency is one of the most important parts of any distributed system. While working with big data and distributed data stores, you will have to deal with consistency. There are flavors of consistency such as strict consistency, causal consistency and eventual consistency. Strict consistency is when you get the most recent data from any node of your distributed system. This is great! However, you have to synchronize your data across all the servers before you can actually satisfy the request, so you pay for it in latency and in synchronization work across your cluster. So there is a trade-off. Strict consistency is mostly used in financial applications and systems; when your data is very important, you most likely want to go with this consistency level. Synchronization across all the nodes of your system is another hard problem you will have to face: strict consistency relies on a global clock/timestamp across your distributed system, and latency makes this really hard. Causal consistency, on the other hand, offers weaker consistency. We are still aiming for consistent data across our distributed system, but this time, instead of timestamps and a global clock, we are interested in events and act upon their occurrence. Eventual consistency dictates that the data will propagate to all the servers at some point; we don’t know when, but it will. This is the weakest type of consistency. Basically, you write some data to a data store and it becomes available on all the nodes after some time.
CAP theorem! According to the CAP theorem, you can only have two of Consistency, Availability and Partition Tolerance. You cannot have all three properties at the same time in a system: you can implement Consistency and Availability, Consistency and Partition Tolerance, or Availability and Partition Tolerance. This theorem has been proven by Nancy Lynch et al. At the very center of this theorem are data replication and high availability; consistency comes in when you replicate your data across the nodes of your system. Consistency is relative to the replication factor. In Cassandra consistency is tunable, which means you can configure the number of nodes your data has to be replicated to. You can increase or decrease your consistency level, and eventually the data will be in a consistent state across your servers. If you configure your data to be consistent across all your nodes, then availability will suffer: because the data has to be written to all the nodes in the system, it won’t be available right away. If you configure your data to be consistent on a small set of nodes, then you will have higher availability. I really like the fact that you can configure the consistency level with Cassandra. This is a great benefit.
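The trade-off behind tunable consistency can be made concrete with the overlap rule commonly used for Dynamo-style stores: with N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap when R + W > N. The helper names below are hypothetical, and the sketch ignores real-world details such as hinted handoff and read repair.

```javascript
// n = replicas per key, w = replicas that must acknowledge a write,
// r = replicas consulted on a read. Names are illustrative only.

// Smallest majority of n replicas.
function quorum(n) {
  return Math.floor(n / 2) + 1;
}

// A read is guaranteed to touch at least one replica holding the
// latest acknowledged write when the read and write sets must overlap.
function readsSeeLatestWrite(n, w, r) {
  return r + w > n;
}

// How many replicas may be down while writes can still be acknowledged.
function writeFailuresTolerated(n, w) {
  return n - w;
}

const n = 3;
console.log(readsSeeLatestWrite(n, quorum(n), quorum(n))); // quorum reads and writes
console.log(readsSeeLatestWrite(n, 1, 1));                 // fastest, weakest setting
console.log(writeFailuresTolerated(n, n));                 // write to all replicas
```

Writing to all replicas gives the strongest guarantee but tolerates no failed node on writes, which is the availability cost described above; lowering W and R raises availability and lets stale reads happen for a while.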
This article (“Starbucks doesn’t use two phase commit“) is a really great introduction to two-phase commit.
MongoDB is known to be a scalable, document-oriented database. MongoDB provides easy scalability, sharding, failover and many more cool features. Moreover, replication and master-slave configuration are relatively easy.
Start mongodb server:
./mongod
The /data/db folder should exist and be writable. Also, port 27017 should be available for mongo to use. Mongo will enable a web interface for status and administrative information.
MongoDB uses the admin database for administrative tasks such as authentication. There is a local database, whose data is stored locally only and is not distributed. Then there is the config database, which stores the configuration information for the database server.
MongoDB uses collections, which correspond to tables in the relational model, and documents, which correspond to rows. Even though no schema is enforced, it is definitely important to keep the documents within a collection consistent with each other. Moreover, storing documents in their own collections, in a hierarchy, is essential as well. For example, even though you can store any kind of document within a single collection, it doesn’t make much sense, because later on, when querying the collection, you will have to sort them out.
In my opinion the document-based model fits the object-oriented model better than the relational model does; not fully, but thinking about a document as an object is much more sensible.
File = {
"FileName": "Users.csv",
"FileSize": 1024
};
This is our File document. Entries within this document can be thought of as key-value pairs, or as variables and their values.
If you are on mongo shell:
# db.files.insert(File);
This command inserts the File document into the files collection. Whenever an entry is added to a collection, MongoDB generates a document id for it: the "_id" field, which holds an object id.
# db.files.find();
The find command is executed on a collection and returns all the documents within that collection. find(predicate) also takes a predicate and returns only the documents that satisfy it.
Ex: db.files.find({"FileName": "MongoDB"});
There is also a findOne() method, which returns a single document.
The update(predicate, data) method takes two arguments: a predicate selecting the documents to update, and the data to update them with.
The remove() method removes all the documents from a collection, while remove(predicate) removes only the documents that satisfy the predicate.
# db.files.remove({"FileName": "Mysql"});
MongoDB has 6 data types, which are as follows: null, Boolean, numeric (Double), string (UTF-8), array (object array) and object. Moreover, MongoDB supports embedded documents, i.e. a document within another document.
MongoDB produces a globally unique object id for each document, which I think is a great feature: you don’t have to worry about synchronizing ids across multiple servers.
The object id is built as a hierarchy: timestamp + machine id + process id + increment. Including all this information in the object id ensures that it is globally unique and that there won’t be any collisions. In my experience, generating unique ids such as GUIDs is computationally expensive. Therefore, MongoDB lets clients generate the object id themselves and inject it into the documents, which takes the burden of this expensive operation off the database.
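To make the "timestamp + machine id + process id + increment" layout concrete, here is a sketch that assembles a 12-byte id as a 24-character hex string (4 bytes of timestamp, 3 of machine id, 2 of process id, 3 of counter). It only mimics the layout described above; `makeObjectId` and its parameters are hypothetical, and a real driver derives the machine and process parts itself.

```javascript
// Layout sketched here: 4-byte timestamp | 3-byte machine id | 2-byte process id | 3-byte counter
let counter = 0;

// Fixed-width, zero-padded hex for a value that must fit in `bytes` bytes.
function hex(value, bytes) {
  return value.toString(16).padStart(bytes * 2, "0").slice(-bytes * 2);
}

// machineId and processId are passed in explicitly to keep the sketch deterministic.
function makeObjectId(machineId, processId, nowMs) {
  const seconds = Math.floor(nowMs / 1000);
  counter = (counter + 1) % 0x1000000; // wraps after 2^24 ids
  return hex(seconds, 4) + hex(machineId, 3) + hex(processId, 2) + hex(counter, 3);
}

// The creation time can be read back from the first 4 bytes.
function timestampOf(objectId) {
  return new Date(parseInt(objectId.slice(0, 8), 16) * 1000);
}

const id = makeObjectId(0xabcdef, 0x1234, Date.now());
console.log(id, timestampOf(id).toISOString());
```

Because no coordination with the server is needed, a client can generate ids like this locally, which is what makes client-side id generation cheap.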
MongoDB supports batch inserts in addition to regular inserts, in which you insert documents one by one. A batch insert reduces the communication between the client and the server, so it is much faster than inserting one document at a time. One drawback of inserting into MongoDB is that there is no data validation, which means you can insert malicious data into your database, and if you want to do validation, operations will be slower.
# db.foo.insert({"hello": "world"});
Removing documents from a collection is very simple:
#db.foo.remove();
This command clears the content of the collection but doesn’t clear the indexes. Weird, right?
The remove(predicate) method takes one or more predicates and removes the items that satisfy the conditions.
Clearing a collection by removing all of its elements is computationally expensive. Dropping the collection, however, is very fast.
Updates are a very interesting topic in databases and concurrency. Here is how MongoDB handles them: the last update wins. There is no notion of transactions or locking; whichever update executes last is the one that stays. Updates are atomic.
Update has a special operator, $set, which is used to add a property to a document. In other words, you can change the schema of the document while updating it with this operator. Likewise, you can remove a key from the document, and so change its schema, with the $unset operator.
The $inc operator increments a key that holds an integer. This is very useful for counters, analytics, etc.
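The effect of the three operators above can be shown without a database by applying them to a plain JavaScript object. The `applyUpdate` function below is a hypothetical emulation written for this post; it only handles top-level fields and is not MongoDB's implementation.

```javascript
// Emulates the effect of $set, $unset and $inc on a plain object.
function applyUpdate(doc, update) {
  const result = { ...doc }; // shallow copy, so the input document is left untouched
  for (const [field, value] of Object.entries(update.$set || {})) {
    result[field] = value; // add the field, or overwrite it if it exists
  }
  for (const field of Object.keys(update.$unset || {})) {
    delete result[field]; // remove the field entirely
  }
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    result[field] = (result[field] || 0) + amount; // a missing field starts from 0
  }
  return result;
}

const file = { FileName: "Users.csv", FileSize: 1024, downloads: 1 };
const updated = applyUpdate(file, {
  $set: { owner: "admin" },
  $unset: { FileSize: 1 },
  $inc: { downloads: 2 },
});
console.log(updated);
```

In the mongo shell the same update document would be passed as the second argument to update(), after the predicate.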
MongoDB doesn't support transactions, which is considered a missing feature, but consider that MyISAM, the storage engine of MySQL, one of the most popular databases, doesn't support transactions either and is still very widely used.
MongoDB uses the BSON data format, which is binary JSON. Even though BSON takes more space than JSON, it has certain benefits, such as performance: processing BSON is more efficient than processing JSON. Another benefit is that every client knows how to convert BSON to any data format for processing purposes.
MongoDB uses dynamic querying and update-in-place. Updates in place are very efficient operations and are done with lazy writes: because accessing and writing to disk is very slow and MongoDB uses memory-mapped files to store data, updating in place can increase MongoDB's performance. The downside is that the data MongoDB is updating might be stale for a very short time. CouchDB, on the other hand, uses MVCC, which keeps multiple versions of data from different users. With MVCC, data is guaranteed to be up to date.
Storing binary data in MongoDB is also easy. MongoDB supports binary data storage up to 4MB, and if you need to store bigger files, MongoDB partitions the file for you and stores it in chunks. Then, when you run range queries, you can get the chunks of the data and merge them or use them as you need.
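The idea of partitioning a large file into chunks and merging them again can be sketched as follows. The chunk shape (`n` for the sequence number, `data` for the bytes) and the function names are assumptions made for this example, not MongoDB's storage format.

```javascript
// Split a byte array into numbered chunks of at most chunkSize bytes.
function splitIntoChunks(bytes, chunkSize) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < bytes.length; offset += chunkSize, n++) {
    chunks.push({ n, data: bytes.slice(offset, offset + chunkSize) });
  }
  return chunks;
}

// Reassemble the original bytes; chunks may arrive in any order.
function joinChunks(chunks) {
  const sorted = [...chunks].sort((a, b) => a.n - b.n);
  const total = sorted.reduce((sum, c) => sum + c.data.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of sorted) {
    out.set(c.data, offset);
    offset += c.data.length;
  }
  return out;
}

const fileBytes = Uint8Array.from({ length: 10 }, (_, i) => i);
const chunks = splitIntoChunks(fileBytes, 4); // chunk lengths: 4, 4, 2
console.log(chunks.length, joinChunks(chunks).length);
```

Keeping a sequence number on each chunk is what allows a range of chunks to be fetched and merged independently of the order in which they come back.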
MongoDB supports manual and auto sharding. Sharding is basically partitioning your data sets across a set of servers. With manual sharding, you are responsible for determining which server the data goes to, as well as for merging the results of your queries. With auto sharding MongoDB takes care of everything for you, and you have the impression of talking to a single server. Even though this hasn't been implemented yet, it promises great flexibility with scaling.
With the following commands you can get some information about your database and environment.
One of the great features of MongoDB is its schemaless database design: you don't need to define a schema as you do with relational database management systems. This provides great strength while you are developing your applications. The first benefit is coping with changing requirements and specifications, the second is coping with the changing nature of your data, and another is being able to hold different data for the same data types.
MongoDB uses collections to store documents. A collection is a similar notion to a table in relational database terminology. Collections are containers used to store documents; they are dynamic and grow as you add more documents to them. There are also capped collections, which are fixed-size collections that hold a limited number of documents.
Data types in MongoDB are the following: String, Integer, Double, Boolean, Min/Max Keys, Arrays, Timestamp, object, ObjectId, null, symbol, date, regular expressions, JavaScript code and binary data.
ObjectId is a special type. An ObjectId is created automatically by MongoDB for every document. It is a 12-byte value, usually shown as a hex string, which is built from a timestamp, machine id, etc. This guarantees that ObjectIds are unique across all the servers. Moreover, in a sharded, clustered environment managing auto-generated primary key ids is very difficult, and auto-generated ObjectIds help with that. This is a great feature in my opinion.
You can use indexes in MongoDB to increase your read and query efficiency. However, keep in mind that adding indexes to a collection increases your read speed but might affect your write speed. You will take a performance hit on your collection if you write a lot of data and seldom read it. This is due to the data structures used in the indexes, i.e. B-trees, which have to be updated on every write.
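The read/write trade-off of an index can be illustrated with a much simpler structure than a B-tree: a sorted array. Lookups use binary search, while every insert has to find the right position and shift entries to keep the order. `SortedIndex` is a hypothetical stand-in written for this post, not the structure MongoDB uses.

```javascript
// A sorted array standing in for a B-tree index on a single field.
class SortedIndex {
  constructor() {
    this.entries = []; // [{ key, docId }], always kept sorted by key
  }

  // Position of the first entry whose key is >= the given key (binary search).
  lowerBound(key) {
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.entries[mid].key < key) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  // Write path: extra work on every insert to keep the entries ordered.
  insert(key, docId) {
    this.entries.splice(this.lowerBound(key), 0, { key, docId });
  }

  // Read path: jump straight to the matching entries instead of scanning everything.
  find(key) {
    const ids = [];
    for (let i = this.lowerBound(key); i < this.entries.length && this.entries[i].key === key; i++) {
      ids.push(this.entries[i].docId);
    }
    return ids;
  }
}

const index = new SortedIndex();
index.insert("bar", 1);
index.insert("baz", 2);
index.insert("bar", 3);
console.log(index.find("bar").length, index.find("missing").length);
```

Every additional index on a collection adds one more structure like this that must be updated on each write, which is where the write penalty comes from.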
With MongoDB, unlike relational databases, you don't have to create databases or collections up front. If a database or collection doesn't exist, MongoDB creates it for you when your code first writes to it.
In order to view current databases use the following:
# show dbs
Then if you want to use a specific database:
# use myDb
Once you are in a database context, you can view the collections as follows:
# show collections
Once you switch to a database, the current database is referred to as db.
While working with MongoDB collections, you will often search for documents. find() is used to find documents, and it also takes predicates to match the documents being searched.
# db.items.find({Foo : "bar"});
This returns the items whose Foo attribute is "bar".
While searching, you can limit or skip documents as well.
# db.items.find().limit(10);
# db.items.find().skip(20);
You can also sort your results:
# db.items.find().sort({Foo : 1});
One of the nice features is that you can combine these function calls into a fluent chain.
# db.items.find().sort({Foo : 1}).limit(10).skip(20);
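To show why such a chain works, here is a small JavaScript sketch of a fluent query object over an in-memory array. The `Query` class is hypothetical and is not how the mongo shell is implemented. It also assumes that the options are applied in the order sort, skip, limit, regardless of the order of the calls, and it only sorts by a single key.

```javascript
// Hypothetical fluent query object over an in-memory array.
class Query {
  constructor(docs) {
    this.docs = docs;
    this.sortSpec = null;
    this.skipCount = 0;
    this.limitCount = Infinity;
  }

  // Each method records an option and returns `this` so calls can be chained.
  sort(spec) { this.sortSpec = spec; return this; }
  skip(n) { this.skipCount = n; return this; }
  limit(n) { this.limitCount = n; return this; }

  // Options are applied in a fixed order (assumed): sort, then skip, then limit.
  toArray() {
    const result = [...this.docs];
    if (this.sortSpec) {
      // Only the first key of the sort specification is used in this sketch.
      const [[key, direction]] = Object.entries(this.sortSpec);
      result.sort((a, b) => (a[key] < b[key] ? -1 : a[key] > b[key] ? 1 : 0) * direction);
    }
    return result.slice(this.skipCount, this.skipCount + this.limitCount);
  }
}

const items = [5, 3, 1, 4, 2].map(n => ({ Foo: n }));
const page = new Query(items).sort({ Foo: 1 }).limit(2).skip(1).toArray();
console.log(page.map(doc => doc.Foo)); // [ 2, 3 ]
```

Because every method returns the same object, nothing is computed until `toArray()` is called, which is what lets you write the calls in whatever order reads best.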
Capped collections in MongoDB come in handy when you want documents to keep their natural insertion order. When you insert documents into a regular collection, they are not guaranteed to stay in insertion order; this might be because documents can be relocated as the collection outgrows its allocated space. A capped collection is a fixed-size collection that works like a circular buffer: once it is full, new documents cause the oldest ones to be purged. You define the size of a capped collection when you create it. Capped collections guarantee insertion order. Updating a document within a capped collection is possible if and only if the document size remains the same, and you cannot delete an individual document from a capped collection. To do either of these, you have to re-create the capped collection and insert the documents again.
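The circular buffer behavior can be sketched in a few lines of JavaScript. `CappedCollection` below is a hypothetical in-memory model, not MongoDB's implementation, and it caps the number of documents rather than the size in bytes to keep the example short.

```javascript
// Hypothetical in-memory model of a capped collection (not MongoDB's implementation).
class CappedCollection {
  constructor(maxDocs) {
    this.maxDocs = maxDocs;
    this.docs = new Array(maxDocs);
    this.next = 0;  // slot the next insert will write to
    this.count = 0; // number of documents currently stored
  }

  insert(doc) {
    // When the buffer is full, this overwrites the oldest document.
    this.docs[this.next] = doc;
    this.next = (this.next + 1) % this.maxDocs;
    this.count = Math.min(this.count + 1, this.maxDocs);
  }

  // Return the documents in insertion order, oldest first.
  find() {
    const start = this.count < this.maxDocs ? 0 : this.next;
    const result = [];
    for (let i = 0; i < this.count; i++) {
      result.push(this.docs[(start + i) % this.maxDocs]);
    }
    return result;
  }
}

const log = new CappedCollection(3);
for (const n of [1, 2, 3, 4, 5]) {
  log.insert({ n });
}
const kept = log.find().map(doc => doc.n);
console.log(kept); // [ 3, 4, 5 ]
```

The model has no delete operation, and growing a document in place has no meaning in a fixed slot, which mirrors the restrictions described above.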
To find out more about your collection, you can invoke the validate() method on it.
As stated, you can use the find() method to query your documents. When you only need a single document, you can use the findOne() method instead.
MongoDB aggregation methods are very easy to use. For aggregation you can use the count, distinct and group functionality.
The count() method returns the number of documents in a collection:
# db.users.count();
or with a predicate:
# db.users.find({Age : 30}).count();
distinct() returns the distinct, unique values of a given key across the documents in a collection.
# db.users.distinct("Age");
You can also pass a predicate to the distinct method as a second argument.
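If it helps to see what these two aggregation methods compute, here is a small in-memory sketch in JavaScript. The functions `count` and `distinct` below are hypothetical stand-ins that take plain predicate functions instead of MongoDB query documents, and the sample users are made up.

```javascript
// Hypothetical in-memory equivalents of count() and distinct().
function count(collection, predicate = () => true) {
  return collection.filter(predicate).length;
}

function distinct(collection, key, predicate = () => true) {
  const values = collection.filter(predicate).map(doc => doc[key]);
  return [...new Set(values)]; // a Set keeps each value once, in first-seen order
}

// Made-up sample data.
const users = [
  { Name: "firat", Age: 30 },
  { Name: "foo", Age: 21 },
  { Name: "bar", Age: 30 },
];

console.log(count(users));                                      // 3
console.log(count(users, user => user.Age === 30));             // 2
console.log(distinct(users, "Age"));                            // [ 30, 21 ]
console.log(distinct(users, "Name", user => user.Age === 30));  // [ 'firat', 'bar' ]
```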
Moreover, MongoDB provides many conditional operators, such as $gt, $gte, $lt and $lte, for filtering your results.
# db.users.find({Age : {$lt : 21}})
There is also a $ne operator to get the documents where a key is not equal to a given value.
# db.users.find({Name : {$ne : "firat"}})
The $in operator matches documents whose value is one of the values in the given array. There is also a $nin operator to find documents whose value is not in the array.
# db.users.find({Name : {$in : ["foo","firat","bar"]}})
This call finds documents whose name is foo or firat or bar.
# db.users.find({Name : {$nin : ["foo","firat","bar"]}})
This call finds documents whose name is not foo or firat or bar.
The counterpart to $in is $all, which matches documents whose array field contains all of the values listed in the statement.
The $size operator matches documents whose array field has the given number of elements.
# db.users.find({Accounts : {$size:2}});
This returns the users with 2 accounts.
Because MongoDB is schemaless, the keys you search on might not be present in every document. You can use the $exists operator to find the documents that have a specified key, as follows:
# db.users.find({DateOfBirth : {$exists :true}})
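To pin down what the operators above mean, here is a simplified in-memory matcher in JavaScript. The `operators` table and the `matches` and `find` functions are hypothetical and cover only the operators discussed in this section; MongoDB's real matching rules have more cases, for example around arrays and nested keys. The sample users are made up.

```javascript
// Hypothetical matcher covering only the operators discussed above.
const operators = {
  $gt: (value, arg) => value > arg,
  $gte: (value, arg) => value >= arg,
  $lt: (value, arg) => value < arg,
  $lte: (value, arg) => value <= arg,
  $ne: (value, arg) => value !== arg,
  $in: (value, arg) => arg.includes(value),
  $nin: (value, arg) => !arg.includes(value),
  $all: (value, arg) => Array.isArray(value) && arg.every(item => value.includes(item)),
  $size: (value, arg) => Array.isArray(value) && value.length === arg,
  $exists: (value, arg) => (value !== undefined) === arg,
};

function matches(doc, query) {
  return Object.entries(query).every(([key, condition]) => {
    const value = doc[key];
    const isOperatorObject =
      condition !== null && typeof condition === "object" && !Array.isArray(condition);
    if (!isOperatorObject) {
      return value === condition; // plain equality, e.g. {Foo : "bar"}
    }
    // Every operator in the condition has to hold for the document to match.
    return Object.entries(condition).every(([op, arg]) => operators[op](value, arg));
  });
}

function find(collection, query) {
  return collection.filter(doc => matches(doc, query));
}

// Made-up sample data.
const users = [
  { Name: "firat", Age: 30, Accounts: ["a", "b"], DateOfBirth: "1980-05-01" },
  { Name: "foo", Age: 20, Accounts: ["a"] },
  { Name: "baz", Age: 40, Accounts: [] },
];

const names = query => find(users, query).map(user => user.Name);
console.log(names({ Age: { $lt: 21 } }));                         // [ 'foo' ]
console.log(names({ Name: { $nin: ["foo", "firat", "bar"] } }));  // [ 'baz' ]
console.log(names({ Accounts: { $size: 2 } }));                   // [ 'firat' ]
console.log(names({ DateOfBirth: { $exists: true } }));           // [ 'firat' ]
```

Several operators can be combined on one key, so `{ Age: { $gte: 21, $lt: 40 } }` selects a range.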
MongoDB querying is lazy. Querying the database returns a cursor, not the results. As you iterate through the cursor, you get the results lazily. This is a nice feature.
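The idea of a lazy cursor can be illustrated with a JavaScript generator. This is only an analogy under my own assumptions: `scan` is a hypothetical function over an in-memory array, and the `documentsRead` counter exists purely to show how much work has been done.

```javascript
// Hypothetical lazy cursor built on a generator; not MongoDB's cursor implementation.
let documentsRead = 0;

function* scan(collection, predicate) {
  for (const doc of collection) {
    documentsRead++; // counts how many documents were actually examined
    if (predicate(doc)) {
      yield doc;
    }
  }
}

const items = [];
for (let i = 0; i < 1000; i++) {
  items.push({ n: i });
}

// Creating the cursor does no work yet.
const cursor = scan(items, doc => doc.n % 2 === 0);
const readBeforeIteration = documentsRead;

// Pull only the first three matches, then stop.
const firstThree = [];
for (const doc of cursor) {
  firstThree.push(doc.n);
  if (firstThree.length === 3) break;
}

console.log(readBeforeIteration); // 0
console.log(firstThree);          // [ 0, 2, 4 ]
console.log(documentsRead);       // 5
```

Only five of the thousand documents are examined, because the work happens while you iterate and stops as soon as you stop asking for more.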
MongoDB supports upserts: an update() call with the upsert option either updates the matching document or inserts a new one if none matches.
Moreover, there is an $inc operator that increments a value; this might come in handy for analytics.
The $set operator sets a value, and the $unset operator removes a particular key. There are also $push and $pop operators, which, just like a stack, let you add elements to or remove elements from an array within a document.
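Here is a simplified JavaScript sketch of what these update operators do to a document, again on plain in-memory objects. `applyUpdate` and `upsert` are hypothetical helpers, the upsert matches on Username only to keep the example small, and the field names Visits and Accounts are made up.

```javascript
// Hypothetical in-memory versions of the update operators described above.
function applyUpdate(doc, update) {
  for (const [key, amount] of Object.entries(update.$inc || {})) {
    doc[key] = (doc[key] || 0) + amount; // a missing key starts from 0
  }
  for (const [key, value] of Object.entries(update.$set || {})) {
    doc[key] = value;
  }
  for (const key of Object.keys(update.$unset || {})) {
    delete doc[key];
  }
  for (const [key, value] of Object.entries(update.$push || {})) {
    if (!Array.isArray(doc[key])) doc[key] = [];
    doc[key].push(value);
  }
  for (const [key, direction] of Object.entries(update.$pop || {})) {
    if (!Array.isArray(doc[key])) continue;
    // 1 removes the last element, -1 removes the first one.
    if (direction === -1) doc[key].shift();
    else doc[key].pop();
  }
  return doc;
}

// Upsert: update the matching document, or insert a new one if nothing matches.
function upsert(collection, username, update) {
  let doc = collection.find(d => d.Username === username);
  if (!doc) {
    doc = { Username: username };
    collection.push(doc);
  }
  return applyUpdate(doc, update);
}

const users = [];
upsert(users, "firat", { $inc: { Visits: 1 }, $push: { Accounts: "checking" } });
upsert(users, "firat", { $inc: { Visits: 1 }, $push: { Accounts: "savings" } });
upsert(users, "firat", { $set: { Age: 30 }, $pop: { Accounts: 1 } });
upsert(users, "firat", { $unset: { Age: 1 } });
console.log(users); // [ { Username: 'firat', Visits: 2, Accounts: [ 'checking' ] } ]
```

The first call inserts the document because no user named firat exists yet, and the later calls update that same document, which is the whole point of an upsert.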
In order to remove documents from a collection:
# db.users.remove({"Username" : "firat"})
NoSQL databases have become very popular lately, and they are being used by everyone from large corporations to small companies. I have done some comparisons between them, as you'll see in the link below.
Types of NoSQL Databases
How do we want to benefit from NoSQL
Challenges with NoSQL Database
RDBMS vs NoSQL
Object relational mapping
N+1 select problem: inefficient. There is a workaround.