NoSQL: Theory, Implementations , an Introduction

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 17, 2012

Following slides belong to the tech talk I have given at Yahoo!.

NoSQL: Theory, Implementation, an Introduction.

 

NoSQL Databases – NoSQL Introduction and Overview

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 16, 2012

Christof Strauch, from Stuttgart Media University, has written an incredible 120+ page paper titled NoSQL Databases as an introduction and overview to NoSQL databases . The paper was written between 2010-06 and 2011-02, so it may be a bit out of date, but if you are looking to take in the NoSQL world in one big gulp, this is your chance. I asked Christof to give us a  short taste of what he was trying to accomplish in his paper:

The paper aims at giving a systematic and thorough introduction and overview of the NoSQL field by assembling information dispersed among blogs, wikis and scientific papers. It firstly discusses reasons, rationales and motives for the development and usage of nonrelational database systems. These can be summarized by the need for high scalability, the processing of large amounts of data, the ability to distribute data among many (often commodity) servers, consequently a distribution-aware design of DBMSs.

The paper then introduces fundamental concepts, techniques and patterns that are commonly used by NoSQL databases to address consistency, partitioning, storage layout, querying, and distributed data processing. Important concepts like eventual consistency and ACID vs. BASE transaction characteristics are discussed along with a number of notable techniques such as multi-version storage, vector clocks, state vs. operational transfer models, consistent hashing, MapReduce, and row-based vs. columnar vs. log-structured merge tree persistence.

As a first class of NoSQL databases, key-value-stores are examined by looking at the proprietary, fully distributed, eventual consistent Amazon Dynamo store as well as popular opensource key-value-stores like Project Voldemort, Tokyo Cabinet/Tyrant and Redis.
In the following, document stores are being observed by reviewing CouchDB and MongoDB as the two major representatives of this class of NoSQL databases. Lastly, the paper takes a look at column-stores by discussing Google’s Bigtable, Hypertable and HBase, as well as Apache Cassandra which integrates the full-distribution and eventual consistency of Amazon’s Dynamo with the data model of Google’s Bigtable.”

NoSQL classification

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 16, 2012

For the last couple years there has been a great increase in the NoSQL movement and products, in this blog you will find some useful information on classification and ecosystem of NoSQL systems.

Related article.

The Common Principles Behind the NOSQL Alternatives

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 16, 2012

The Common Principles Behind the NOSQL Alternatives

Assume that Failure is Inevitable

Partition the Data

Keep Multiple Replicas of the Same Data

Dynamic Scaling

Query Support

To read more.

Architectural Patterns for Scalability

Categories: Uncategorized
Tags: No Tags
Comments: No Comments
Published on: December 20, 2011

In today’s computing one of the biggest challenge is scalability to deal with high volume traffics, big data and processes.

There are several architectural patterns for scalable distributed system some of which are as follows:

  1. Load Balancers: Load balancing is one of the easiest approach to provide some level of scalability. As well as hardware for load balancing, there are also software for it. In order for a load balancer to be implemented efficiently, units should have a shared nothing architecture. Once placed in front of application servers, load balancers distributes the load evenly across all the servers within the system. There are several load balancing algorithms, the most basic one is round robin. Some of other ones are based on some pattern, load balancing based on weighted graph and so on. Load balancing is a beautiful concept.
  2. Distributed Caching: Distributed Hash tables or caches are used in front of data storage for scaling up reads. Instead of querying data stores each time, querying an in memory caching system speed up applications. Some of the implementations are Memcached, Redis etc. Consistent hashing is used for in the clients. There is no notion of replication.
  3. Distributed Message Queues. Blocking Queue implementation (FIFO delivery) implemented as a network service. Distributed Message Queues are based on Producer/Consumer problem. There are several concepts with producer/consumer other known as publisher/subscriber pattern.  It is fairly easy to scale applications using this pattern.
  4. Gossip and Nature-inspired Architectures. Each node randomly pick and exchange information with follow nodes.
  5. Map Reduce/ Data flows. Scalable pattern to describe and execute Jobs.
  6. Tree of responsibility. Break the problem down recursively and assign to a tree, each parent node delegating work to children nodes.
  7. Stream processing. Process data streams, data that is keeps coming.
  8. Scalable Storages. Ranges from Databases, NoSQL storages, Service Registries, to File systems.

Floyd Cycle Finding Algorithms

Categories: Uncategorized
Tags: No Tags
Comments: No Comments
Published on: December 12, 2011

Floyd cycle finding algorithm is used to find a cycle within a collection, graph, or container. A simple example would help understand better. Given a linked list, detect if it contains a cycle or not. If you use two pointers, one of which will traverse the linked list, two nodes at a time and another pointer that will traverse the linked list one node at a time, eventually they will meet at the cycle.

Stack with LinkedList Implementation

Categories: Data Structures
Comments: No Comments
Published on: December 12, 2011

Naive implementation of Stack in C# uses array. In this post, I have implemented the stack with LinkedList. This implementation is not thread safe.

using System;
using System.Collections;
using System.Collections.Generic;

namespace CodeCodeCode
{
    public class StackLinkedList : IEnumerable, ICollection
    {
        private readonly LinkedList _container;

        public StackLinkedList()
        {
            _container = new LinkedList();
        }

        public StackLinkedList(IEnumerable collection)
        {
            foreach (T element in collection)
            {
                _container.AddLast(element);
            }
        }

        public StackLinkedList(Stack _stack)
        {
            while (_stack.Count > 0)
            {
                T element = _stack.Pop();

                _container.AddLast(element);
            }
        }

        #region ICollection Members

        public void Add(T item)
        {
            throw new NotImplementedException();
        }

        public void Clear()
        {
            _container.Clear();
        }

        public bool Contains(T item)
        {
            return _container.Contains(item);
        }

        public void CopyTo(T[] array, int arrayIndex)
        {
            throw new NotImplementedException();
        }

        int ICollection.Count
        {
            get { return _container.Count; }
        }

        public bool IsReadOnly
        {
            get { return false; }
        }

        public bool Remove(T item)
        {
            return _container.Remove(item);
        }

        #endregion

        #region IEnumerable Members

        public IEnumerator GetEnumerator()
        {
            foreach (T element in _container)
            {
                yield return element;
                _container.Remove(element);
            }
        }

        IEnumerator IEnumerable.GetEnumerator()
        {
            return GetEnumerator();
        }

        #endregion

        public T Pop()
        {
            LinkedListNode element = _container.First;

            _container.Remove(element);

            return element.Value;
        }

        public void Push(T element)
        {
            _container.AddFirst(element);
        }

        public T Peek()
        {
            return _container.First.Value;
        }

        public int Count()
        {
            return _container.Count;
        }
    }
}
    [TestClass]
    public class UnitTest1
    {

        private StackLinkedList _collection;

        public UnitTest1()
        {
            _collection = new StackLinkedList();
        }

        [TestMethod]
        public void PushMethod()
        {
             _collection.Push(1);
             _collection.Push(2);
             _collection.Push(3);
            Assert.AreEqual(3, _collection.Peek());
            Assert.AreEqual(3, _collection.Pop());
            Assert.AreEqual(2, _collection.Pop());
            Assert.AreEqual(1, _collection.Pop());
            Assert.AreEqual(0, _collection.Count());
        }

    }

RabbitMQ

Categories: Scalability
Tags: No Tags
Comments: No Comments
Published on: November 24, 2011

RabbitMQ is yet another messaging queue. It is open source, robust, easy, reliable, portable and scalable with high throughput and latency. RabbitMQ is developed with Erlang.

Before going into details of RabbitMQ, let’s cover some basic glossary, definitions and uses. A message is an entity that is transferred between different components of an architecture or infrastructure. Message can have several formats, from text to serialized binary. Messaging are sending and receiving messages between different parts of the system.

A messaging infrastructure provides several benefits as follows:

Interoperability: Your infrastructure can have different components in different technologies and languages. Having a messaging queue in the middle provides interoperability so independent components work seamlessly, unaware of other platforms.

Loose coupling: It is one of the best practices both in programming and in software design to implement lose coupling. A messaging queue helps achieve this easily. You can break you software into components which won’t have any dependencies between and can run independently from each other.

Scalability: Loose coupling brings scalability along. If you design your application that you can loose couple your application components, you can scale it easily.

Portable: You can port your messaging queue to any architecture. ActiveMQ and RabbitMQ provide servers to any platform.

Reliable: Most Messaging Queues provides reliable solutions, so that clients won’t lose any data. Messaging queues can work with many data stores such as databases, file systems or caches. It is essential to persist the data which is delivered to queue to a persistent storage in order to avoid data loss. Moreover, there is several persistence schemes implemented in different messaging queues, and interesting one that ActiveMQ uses is temp files via a message dispatcher, if there is no persistence setup. Yet reliability comes to play when we care about messaging reliability. We would like to make sure that when we send a message, messaging queue receives our message, and likewise, we would like to ensure that when we ask for a message, we get a message and this message is removed from the queue. These operations are done with acknowledgements, similar to TCP.

Support for Protocols: most of the message queues support multiple transport protocols such as TCP, HTTP, and STOMP etc.

Enter Producer-Consumer Problem. Producer-consumer problem is a classical problem in computer science that deals with synchronization between processes. We have a producer which produces and feeds data, and then we have a consumer that consumes data. Producer and consumer share a common buffer to send and receive data respectively. Producer –consumer has a great benefit in parallelization of work, while producers are producing data, consumers can consume at the same time. Moreover, you can have multiple producers as well as multiple consumers. Wrong implementations of this concept can cause several problems such as deadlocks, race conditions. Producer-consumer problem can use a Queue as the buffer storage. We can have a blocking queue for this purpose; also, synchronization should be handled carefully, in order to avoid inconsistent states. Here is the usage of a Blocking queue in a producer – consumer problem, while producers are sending data to a blocking queue, consumers can request messages from the queue and process them for whatever purpose they need to. In order to avoid overflow, queue can block producers from overloading the queue with too much data, this is also called throttling. In order to avoid overflows and overloading of the message queue, producers are blocked until there are more resources available. Likewise for consumers, if the queue is empty, consumers just wait until there is a new message on the queue to be processed.

There are more scenarios regarding using Messaging queues, such as publish/subscribe (Observer pattern) in which your consumers become subscribers to a publisher and they received messages from Queue when there is a new message.

More to come…

RabbitMQ install on Centos

Categories: Scalability
Tags: No Tags
Comments: No Comments
Published on: November 24, 2011

I’m going to install RabbitMQ on my CentOS box.

# wget http://www.rabbitmq.com/releases/rabbitmq-server/v2.7.0/rabbitmq-server-2.7.0-1.noarch.rpm

# rpm –install rabbitmq-server-2.7.0-1.noarch.rpm
warning: rabbitmq-server-2.7.0-1.noarch.rpm: Header V4 DSA signature: NOKEY, key ID 056e8e56
error: Failed dependencies:
erlang >= R12B-3 is needed by rabbitmq-server-2.7.0-1.noarch

We need to install erlang, which RabbitMQ was developed with. It requires a version greater than R12B-3

First we need to install some libraries to install erlang.

# sudo yum install gcc glibc-devel make ncurses-devel openssl-devel

Then we download erlang.

# wget http://erlang.org/download/otp_src_R14B03.tar.gz

# tar zxvf otp_src_R14B03.tar.gz

# cd otp_src_R14B03

# ./configure && make && sudo make install

#  rpm -Uvh rabbitmq-server-2.7.0-1.noarch.rpm

and this will install rabbitmq on your server.

An alternative would be as follows:

# wget -O /etc/yum.repos.d/epel-erlang.repo http://repos.fedorapeople.org/repos/peter/erlang/epel-erlang.repo
# yum install erlang
#  rpm -ivh rabbitmq-server-2.6.1-1.noarch.rpm

Have fun.

Membase

Categories: NoSQL
Tags:
Comments: No Comments
Published on: November 22, 2011

I wrote about memcached in my previous post and I was researching on Membase, which is a NoSQL implementation of memcached. That is Memcached persisted to disk. Membase is a key-value pair database management system. It s very easy to install, configure and get it running within 5 minutes. It is backwards compatible with memcached. Membase provides simple, fast, easy to scale key-value data operations with low latency and high throughput. You will very often hear about low latency and high throughput property of Membase. Asynchronous writes are fairly common, wherever possible in Membase. Membase is designed to run on many configurations, from a single node to clusters of servers.

Membase was mainly developed by the folks who developed Memcached.

In one of my experiences, I failed to mix Windows 7 instance of Membase and a Centos Instance.

Membase adds disk persistence (with hierarchical storage management), data replication, live cluster reconfiguration, rebalancing and multi-tenancy with data partitioning on top of Memcached‘s caching functionality. In my opinion Membase has become a different product than memcached. Even though you can setup a memcached instance with the same install package, the underlying algorithms should be different.

Membase implements CP (Consistency and Partial Tolerance) in terms of CAP Theorem, which means you will suffer from Availability.

One of the interesting feauture of Membase is Rebalancing. You install an instance of membase and lets say you want to add more server, simply install another instance of membase on another server, then join that server to the Membase cluster via the web admin tool. Once you hit the rebalance button, everything is being done for you, behind the scenes, Warning: the server that you are joining to the server will lose all the data. Membase does replication via asynchronous writes.

Below is a benchmark of Membase on VMWare instances. Instances are idential. This Membase benchmark was done via C# Enyim client libraries and VMWare instances are CENTOS 5.5 with 2GB ram each.

Items Writes/second Read/Second Updates/second
Single node 1.000.000 6250 8196 6024
Two nodes, rebalancing enabled 1.000.000 3412 8130 4219
Three nodes, rebalancing enabled 1.000.000 3984 6993 3937

As you would realize that the performance is somewhat degrading as the number of replicas increases which is natural.

Membase guarantees data consistency. Once you insert a data, it s replicated across all the servers.

Membase is simple (key value store), fast (low, predictable latency) and elastic (effortlessly grow or shrink a cluster).

Membase supports Memcached ASCII and Binary protocols (uses existent Memcached libraries and clients). You can use Memcached clients to work with Membase as well.

Membase has a nice management interface where you can manage your instances very easily. It s very user friendly interface.

Membase is elastic in nature, you can scale out very easily, you can add more servers within seconds and rebalance them easily. You can also configure fail over, and auto-fail over rather easily. Fail over mechanism is very fast.

In a clustered Membase setup, there is a master node, set of servers elect a leader and that s your master node. Replication and write operations are coordinated through this master node.

As I have written in my Memcached article, Memcahed uses consistent hashing in order the determine the server to send and read the data from. This consistent hashing algorithm was nothing more than a modulo operator. Membase uses a similar algorithm to determine this, however, this time, Membase uses a vbuckets, ie: virtual buckets. This is similar to a dictionary implementation. Within this solution servers are still not aware of each other which is a desirable property. VBuckets are another layer of all possible set of keys as an array in front of the set of servers. It is simply a mapping. You can read more about Membase VBuckets at Northscale articles.

As a NoSQL implementation, I have done some prototyping and reseaching with Membase. Even though I run into some issues, I was happy with overall implementation of Membase. You can read more about Membase on the official site. Have fun!.

 

page 1 of 25»

Welcome , today is Friday, January 27, 2012

Bad Behavior has blocked 97 access attempts in the last 7 days.