Following slides belong to the tech talk I have given at Yahoo!.
NoSQL: Theory, Implementation, an Introduction.
Following slides belong to the tech talk I have given at Yahoo!.
NoSQL: Theory, Implementation, an Introduction.
Christof Strauch, from Stuttgart Media University, has written an incredible 120+ page paper titled NoSQL Databases as an introduction and overview to NoSQL databases . The paper was written between 2010-06 and 2011-02, so it may be a bit out of date, but if you are looking to take in the NoSQL world in one big gulp, this is your chance. I asked Christof to give us a short taste of what he was trying to accomplish in his paper:
The paper aims at giving a systematic and thorough introduction and overview of the NoSQL field by assembling information dispersed among blogs, wikis and scientific papers. It firstly discusses reasons, rationales and motives for the development and usage of nonrelational database systems. These can be summarized by the need for high scalability, the processing of large amounts of data, the ability to distribute data among many (often commodity) servers, consequently a distribution-aware design of DBMSs.
The paper then introduces fundamental concepts, techniques and patterns that are commonly used by NoSQL databases to address consistency, partitioning, storage layout, querying, and distributed data processing. Important concepts like eventual consistency and ACID vs. BASE transaction characteristics are discussed along with a number of notable techniques such as multi-version storage, vector clocks, state vs. operational transfer models, consistent hashing, MapReduce, and row-based vs. columnar vs. log-structured merge tree persistence.
As a first class of NoSQL databases, key-value-stores are examined by looking at the proprietary, fully distributed, eventual consistent Amazon Dynamo store as well as popular opensource key-value-stores like Project Voldemort, Tokyo Cabinet/Tyrant and Redis.In the following, document stores are being observed by reviewing CouchDB and MongoDB as the two major representatives of this class of NoSQL databases. Lastly, the paper takes a look at column-stores by discussing Google’s Bigtable, Hypertable and HBase, as well as Apache Cassandra which integrates the full-distribution and eventual consistency of Amazon’s Dynamo with the data model of Google’s Bigtable.”
For the last couple years there has been a great increase in the NoSQL movement and products, in this blog you will find some useful information on classification and ecosystem of NoSQL systems.
Dynamic Scaling
To read more.
In today’s computing one of the biggest challenge is scalability to deal with high volume traffics, big data and processes.
There are several architectural patterns for scalable distributed system some of which are as follows:
Floyd cycle finding algorithm is used to find a cycle within a collection, graph, or container. A simple example would help understand better. Given a linked list, detect if it contains a cycle or not. If you use two pointers, one of which will traverse the linked list, two nodes at a time and another pointer that will traverse the linked list one node at a time, eventually they will meet at the cycle.
Naive implementation of Stack in C# uses array. In this post, I have implemented the stack with LinkedList. This implementation is not thread safe.
using System;
using System.Collections;
using System.Collections.Generic;
namespace CodeCodeCode
{
public class StackLinkedList : IEnumerable, ICollection
{
private readonly LinkedList _container;
public StackLinkedList()
{
_container = new LinkedList();
}
public StackLinkedList(IEnumerable collection)
{
foreach (T element in collection)
{
_container.AddLast(element);
}
}
public StackLinkedList(Stack _stack)
{
while (_stack.Count > 0)
{
T element = _stack.Pop();
_container.AddLast(element);
}
}
#region ICollection Members
public void Add(T item)
{
throw new NotImplementedException();
}
public void Clear()
{
_container.Clear();
}
public bool Contains(T item)
{
return _container.Contains(item);
}
public void CopyTo(T[] array, int arrayIndex)
{
throw new NotImplementedException();
}
int ICollection.Count
{
get { return _container.Count; }
}
public bool IsReadOnly
{
get { return false; }
}
public bool Remove(T item)
{
return _container.Remove(item);
}
#endregion
#region IEnumerable Members
public IEnumerator GetEnumerator()
{
foreach (T element in _container)
{
yield return element;
_container.Remove(element);
}
}
IEnumerator IEnumerable.GetEnumerator()
{
return GetEnumerator();
}
#endregion
public T Pop()
{
LinkedListNode element = _container.First;
_container.Remove(element);
return element.Value;
}
public void Push(T element)
{
_container.AddFirst(element);
}
public T Peek()
{
return _container.First.Value;
}
public int Count()
{
return _container.Count;
}
}
}
[TestClass]
public class UnitTest1
{
private StackLinkedList _collection;
public UnitTest1()
{
_collection = new StackLinkedList();
}
[TestMethod]
public void PushMethod()
{
_collection.Push(1);
_collection.Push(2);
_collection.Push(3);
Assert.AreEqual(3, _collection.Peek());
Assert.AreEqual(3, _collection.Pop());
Assert.AreEqual(2, _collection.Pop());
Assert.AreEqual(1, _collection.Pop());
Assert.AreEqual(0, _collection.Count());
}
}
RabbitMQ is yet another messaging queue. It is open source, robust, easy, reliable, portable and scalable with high throughput and latency. RabbitMQ is developed with Erlang.
Before going into details of RabbitMQ, let’s cover some basic glossary, definitions and uses. A message is an entity that is transferred between different components of an architecture or infrastructure. Message can have several formats, from text to serialized binary. Messaging are sending and receiving messages between different parts of the system.
A messaging infrastructure provides several benefits as follows:
Interoperability: Your infrastructure can have different components in different technologies and languages. Having a messaging queue in the middle provides interoperability so independent components work seamlessly, unaware of other platforms.
Loose coupling: It is one of the best practices both in programming and in software design to implement lose coupling. A messaging queue helps achieve this easily. You can break you software into components which won’t have any dependencies between and can run independently from each other.
Scalability: Loose coupling brings scalability along. If you design your application that you can loose couple your application components, you can scale it easily.
Portable: You can port your messaging queue to any architecture. ActiveMQ and RabbitMQ provide servers to any platform.
Reliable: Most Messaging Queues provides reliable solutions, so that clients won’t lose any data. Messaging queues can work with many data stores such as databases, file systems or caches. It is essential to persist the data which is delivered to queue to a persistent storage in order to avoid data loss. Moreover, there is several persistence schemes implemented in different messaging queues, and interesting one that ActiveMQ uses is temp files via a message dispatcher, if there is no persistence setup. Yet reliability comes to play when we care about messaging reliability. We would like to make sure that when we send a message, messaging queue receives our message, and likewise, we would like to ensure that when we ask for a message, we get a message and this message is removed from the queue. These operations are done with acknowledgements, similar to TCP.
Support for Protocols: most of the message queues support multiple transport protocols such as TCP, HTTP, and STOMP etc.
Enter Producer-Consumer Problem. Producer-consumer problem is a classical problem in computer science that deals with synchronization between processes. We have a producer which produces and feeds data, and then we have a consumer that consumes data. Producer and consumer share a common buffer to send and receive data respectively. Producer –consumer has a great benefit in parallelization of work, while producers are producing data, consumers can consume at the same time. Moreover, you can have multiple producers as well as multiple consumers. Wrong implementations of this concept can cause several problems such as deadlocks, race conditions. Producer-consumer problem can use a Queue as the buffer storage. We can have a blocking queue for this purpose; also, synchronization should be handled carefully, in order to avoid inconsistent states. Here is the usage of a Blocking queue in a producer – consumer problem, while producers are sending data to a blocking queue, consumers can request messages from the queue and process them for whatever purpose they need to. In order to avoid overflow, queue can block producers from overloading the queue with too much data, this is also called throttling. In order to avoid overflows and overloading of the message queue, producers are blocked until there are more resources available. Likewise for consumers, if the queue is empty, consumers just wait until there is a new message on the queue to be processed.
There are more scenarios regarding using Messaging queues, such as publish/subscribe (Observer pattern) in which your consumers become subscribers to a publisher and they received messages from Queue when there is a new message.
More to come…
I’m going to install RabbitMQ on my CentOS box.
# wget http://www.rabbitmq.com/releases/rabbitmq-server/v2.7.0/rabbitmq-server-2.7.0-1.noarch.rpm
# rpm –install rabbitmq-server-2.7.0-1.noarch.rpm
warning: rabbitmq-server-2.7.0-1.noarch.rpm: Header V4 DSA signature: NOKEY, key ID 056e8e56
error: Failed dependencies:
erlang >= R12B-3 is needed by rabbitmq-server-2.7.0-1.noarch
We need to install erlang, which RabbitMQ was developed with. It requires a version greater than R12B-3
First we need to install some libraries to install erlang.
# sudo yum install gcc glibc-devel make ncurses-devel openssl-devel
Then we download erlang.
# wget http://erlang.org/download/otp_src_R14B03.tar.gz
# tar zxvf otp_src_R14B03.tar.gz
# cd otp_src_R14B03
# ./configure && make && sudo make install
# rpm -Uvh rabbitmq-server-2.7.0-1.noarch.rpm
and this will install rabbitmq on your server.
An alternative would be as follows:
# wget -O /etc/yum.repos.d/epel-erlang.repo http://repos.fedorapeople.org/repos/peter/erlang/epel-erlang.repo
# yum install erlang
# rpm -ivh rabbitmq-server-2.6.1-1.noarch.rpm
Have fun.
I wrote about memcached in my previous post and I was researching on Membase, which is a NoSQL implementation of memcached. That is Memcached persisted to disk. Membase is a key-value pair database management system. It s very easy to install, configure and get it running within 5 minutes. It is backwards compatible with memcached. Membase provides simple, fast, easy to scale key-value data operations with low latency and high throughput. You will very often hear about low latency and high throughput property of Membase. Asynchronous writes are fairly common, wherever possible in Membase. Membase is designed to run on many configurations, from a single node to clusters of servers.
Membase was mainly developed by the folks who developed Memcached.
In one of my experiences, I failed to mix Windows 7 instance of Membase and a Centos Instance.
Membase adds disk persistence (with hierarchical storage management), data replication, live cluster reconfiguration, rebalancing and multi-tenancy with data partitioning on top of Memcached‘s caching functionality. In my opinion Membase has become a different product than memcached. Even though you can setup a memcached instance with the same install package, the underlying algorithms should be different.
Membase implements CP (Consistency and Partial Tolerance) in terms of CAP Theorem, which means you will suffer from Availability.
One of the interesting feauture of Membase is Rebalancing. You install an instance of membase and lets say you want to add more server, simply install another instance of membase on another server, then join that server to the Membase cluster via the web admin tool. Once you hit the rebalance button, everything is being done for you, behind the scenes, Warning: the server that you are joining to the server will lose all the data. Membase does replication via asynchronous writes.
Below is a benchmark of Membase on VMWare instances. Instances are idential. This Membase benchmark was done via C# Enyim client libraries and VMWare instances are CENTOS 5.5 with 2GB ram each.
| Items | Writes/second | Read/Second | Updates/second | |
| Single node | 1.000.000 | 6250 | 8196 | 6024 |
| Two nodes, rebalancing enabled | 1.000.000 | 3412 | 8130 | 4219 |
| Three nodes, rebalancing enabled | 1.000.000 | 3984 | 6993 | 3937 |
As you would realize that the performance is somewhat degrading as the number of replicas increases which is natural.
Membase guarantees data consistency. Once you insert a data, it s replicated across all the servers.
Membase is simple (key value store), fast (low, predictable latency) and elastic (effortlessly grow or shrink a cluster).
Membase supports Memcached ASCII and Binary protocols (uses existent Memcached libraries and clients). You can use Memcached clients to work with Membase as well.
Membase has a nice management interface where you can manage your instances very easily. It s very user friendly interface.
Membase is elastic in nature, you can scale out very easily, you can add more servers within seconds and rebalance them easily. You can also configure fail over, and auto-fail over rather easily. Fail over mechanism is very fast.
In a clustered Membase setup, there is a master node, set of servers elect a leader and that s your master node. Replication and write operations are coordinated through this master node.
As I have written in my Memcached article, Memcahed uses consistent hashing in order the determine the server to send and read the data from. This consistent hashing algorithm was nothing more than a modulo operator. Membase uses a similar algorithm to determine this, however, this time, Membase uses a vbuckets, ie: virtual buckets. This is similar to a dictionary implementation. Within this solution servers are still not aware of each other which is a desirable property. VBuckets are another layer of all possible set of keys as an array in front of the set of servers. It is simply a mapping. You can read more about Membase VBuckets at Northscale articles.
As a NoSQL implementation, I have done some prototyping and reseaching with Membase. Even though I run into some issues, I was happy with overall implementation of Membase. You can read more about Membase on the official site. Have fun!.
Bad Behavior has blocked 97 access attempts in the last 7 days.