•  
  • Scalability (17)

NoSQL: Theory, Implementations , an Introduction

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 17, 2012

Following slides belong to the tech talk I have given at Yahoo!.

NoSQL: Theory, Implementation, an Introduction.

 

NoSQL Databases – NoSQL Introduction and Overview

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 16, 2012

Christof Strauch, from Stuttgart Media University, has written an incredible 120+ page paper titled NoSQL Databases as an introduction and overview to NoSQL databases . The paper was written between 2010-06 and 2011-02, so it may be a bit out of date, but if you are looking to take in the NoSQL world in one big gulp, this is your chance. I asked Christof to give us a  short taste of what he was trying to accomplish in his paper:

The paper aims at giving a systematic and thorough introduction and overview of the NoSQL field by assembling information dispersed among blogs, wikis and scientific papers. It firstly discusses reasons, rationales and motives for the development and usage of nonrelational database systems. These can be summarized by the need for high scalability, the processing of large amounts of data, the ability to distribute data among many (often commodity) servers, consequently a distribution-aware design of DBMSs.

The paper then introduces fundamental concepts, techniques and patterns that are commonly used by NoSQL databases to address consistency, partitioning, storage layout, querying, and distributed data processing. Important concepts like eventual consistency and ACID vs. BASE transaction characteristics are discussed along with a number of notable techniques such as multi-version storage, vector clocks, state vs. operational transfer models, consistent hashing, MapReduce, and row-based vs. columnar vs. log-structured merge tree persistence.

As a first class of NoSQL databases, key-value-stores are examined by looking at the proprietary, fully distributed, eventual consistent Amazon Dynamo store as well as popular opensource key-value-stores like Project Voldemort, Tokyo Cabinet/Tyrant and Redis.
In the following, document stores are being observed by reviewing CouchDB and MongoDB as the two major representatives of this class of NoSQL databases. Lastly, the paper takes a look at column-stores by discussing Google’s Bigtable, Hypertable and HBase, as well as Apache Cassandra which integrates the full-distribution and eventual consistency of Amazon’s Dynamo with the data model of Google’s Bigtable.”

NoSQL classification

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 16, 2012

For the last couple years there has been a great increase in the NoSQL movement and products, in this blog you will find some useful information on classification and ecosystem of NoSQL systems.

Related article.

The Common Principles Behind the NOSQL Alternatives

Categories: NoSQL, Scalability
Tags: No Tags
Comments: No Comments
Published on: January 16, 2012

The Common Principles Behind the NOSQL Alternatives

Assume that Failure is Inevitable

Partition the Data

Keep Multiple Replicas of the Same Data

Dynamic Scaling

Query Support

To read more.

RabbitMQ

Categories: Scalability
Tags: No Tags
Comments: No Comments
Published on: November 24, 2011

RabbitMQ is yet another messaging queue. It is open source, robust, easy, reliable, portable and scalable with high throughput and latency. RabbitMQ is developed with Erlang.

Before going into details of RabbitMQ, let’s cover some basic glossary, definitions and uses. A message is an entity that is transferred between different components of an architecture or infrastructure. Message can have several formats, from text to serialized binary. Messaging are sending and receiving messages between different parts of the system.

A messaging infrastructure provides several benefits as follows:

Interoperability: Your infrastructure can have different components in different technologies and languages. Having a messaging queue in the middle provides interoperability so independent components work seamlessly, unaware of other platforms.

Loose coupling: It is one of the best practices both in programming and in software design to implement lose coupling. A messaging queue helps achieve this easily. You can break you software into components which won’t have any dependencies between and can run independently from each other.

Scalability: Loose coupling brings scalability along. If you design your application that you can loose couple your application components, you can scale it easily.

Portable: You can port your messaging queue to any architecture. ActiveMQ and RabbitMQ provide servers to any platform.

Reliable: Most Messaging Queues provides reliable solutions, so that clients won’t lose any data. Messaging queues can work with many data stores such as databases, file systems or caches. It is essential to persist the data which is delivered to queue to a persistent storage in order to avoid data loss. Moreover, there is several persistence schemes implemented in different messaging queues, and interesting one that ActiveMQ uses is temp files via a message dispatcher, if there is no persistence setup. Yet reliability comes to play when we care about messaging reliability. We would like to make sure that when we send a message, messaging queue receives our message, and likewise, we would like to ensure that when we ask for a message, we get a message and this message is removed from the queue. These operations are done with acknowledgements, similar to TCP.

Support for Protocols: most of the message queues support multiple transport protocols such as TCP, HTTP, and STOMP etc.

Enter Producer-Consumer Problem. Producer-consumer problem is a classical problem in computer science that deals with synchronization between processes. We have a producer which produces and feeds data, and then we have a consumer that consumes data. Producer and consumer share a common buffer to send and receive data respectively. Producer –consumer has a great benefit in parallelization of work, while producers are producing data, consumers can consume at the same time. Moreover, you can have multiple producers as well as multiple consumers. Wrong implementations of this concept can cause several problems such as deadlocks, race conditions. Producer-consumer problem can use a Queue as the buffer storage. We can have a blocking queue for this purpose; also, synchronization should be handled carefully, in order to avoid inconsistent states. Here is the usage of a Blocking queue in a producer – consumer problem, while producers are sending data to a blocking queue, consumers can request messages from the queue and process them for whatever purpose they need to. In order to avoid overflow, queue can block producers from overloading the queue with too much data, this is also called throttling. In order to avoid overflows and overloading of the message queue, producers are blocked until there are more resources available. Likewise for consumers, if the queue is empty, consumers just wait until there is a new message on the queue to be processed.

There are more scenarios regarding using Messaging queues, such as publish/subscribe (Observer pattern) in which your consumers become subscribers to a publisher and they received messages from Queue when there is a new message.

More to come…

RabbitMQ install on Centos

Categories: Scalability
Tags: No Tags
Comments: No Comments
Published on: November 24, 2011

I’m going to install RabbitMQ on my CentOS box.

# wget http://www.rabbitmq.com/releases/rabbitmq-server/v2.7.0/rabbitmq-server-2.7.0-1.noarch.rpm

# rpm –install rabbitmq-server-2.7.0-1.noarch.rpm
warning: rabbitmq-server-2.7.0-1.noarch.rpm: Header V4 DSA signature: NOKEY, key ID 056e8e56
error: Failed dependencies:
erlang >= R12B-3 is needed by rabbitmq-server-2.7.0-1.noarch

We need to install erlang, which RabbitMQ was developed with. It requires a version greater than R12B-3

First we need to install some libraries to install erlang.

# sudo yum install gcc glibc-devel make ncurses-devel openssl-devel

Then we download erlang.

# wget http://erlang.org/download/otp_src_R14B03.tar.gz

# tar zxvf otp_src_R14B03.tar.gz

# cd otp_src_R14B03

# ./configure && make && sudo make install

#  rpm -Uvh rabbitmq-server-2.7.0-1.noarch.rpm

and this will install rabbitmq on your server.

An alternative would be as follows:

# wget -O /etc/yum.repos.d/epel-erlang.repo http://repos.fedorapeople.org/repos/peter/erlang/epel-erlang.repo
# yum install erlang
#  rpm -ivh rabbitmq-server-2.6.1-1.noarch.rpm

Have fun.

Redis Installation on Centos

Categories: Scalability
Comments: No Comments
Published on: November 11, 2011

Redis is a new Cool key value store, which is very different than your ordinary key value stores. Redis Key be string, has, list, set, sorted set.

Here is how you install Redis on Centos:

$  wget http://redis.googlecode.com/files/redis-2.4.2.tar.gz
$ tar xzf redis-2.4.2.tar.gz
$ cd redis-2.4.2
$ make

At this point you have Redis install into src folder.

$ cd src

$ ./redis-server &

Now you have a Redis instance running. You want to run a test if everything is installed and running correctly as follows in Redis root directory.

$ make test

if you come across with the following error:

You need ‘tclsh8.5′ in order to run the Redis test

that means you need tclsh8.5 :)

So you can download tclsh8.* and

$ configure && make && make install && make clean

then you can run

$make test

for Redis again.

There is a Redis client you can use to interact and learn.You can interact with Redis using the built-in client:

$ src/redis-cli
redis> set foo bar
OK
redis> get foo
“bar”

enjoy the rest and you can read more about Redis in my blog about Redis.

Memcached

Categories: Linux, Programming, Scalability
Tags: No Tags
Comments: 1 Comment
Published on: October 24, 2011

Memcached is Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, page rendering or simply anything you like to store temporarily.

Memcached is simply a Key/Value store. You can see it as a standalone distributed hash table or dictionary. Memcached doesn’t know what your date is like, all it does is to store key value pairs with expiration and using LRU (Least recently used) algorithm to maintain the cached items.

What makes Memcached cool is that it provides scalability. How does it do this? By hashing algorithms that its clients implements. There are two phase of hashing, one of which happens at the client and the other happens at the server. Therefore, eventually all the memcached clients implement some hashing algorithm in order to benefit from the memcached’s distributed nature. I will mention this in a bit.

Having said that memcached is a distributed hash table (key value store), servers are disconnected from each other, and usually they are unaware of each there. There is no communication between memcached servers, any synchronization or broadcasting. This increases the flexibility to be able to scale out the memcached servers. If you are running low on resources on a memcached server, you can add another memcached server, and you can keep adding more cache servers as you need. You should pay attention that, if you don’t add cache servers, your cached item will start to be dropped out of memcached as the cache becomes full and as I mentioned Least recently used algorithm is used to drop the oldest cached items. This is also called Eviction.

In computer science world, one of the most important notations is Algorithmic complexity. Example: searching for an item, sorting a collection etc. Memcached considers this and implements a O(1), constant time, key value store. This means that storing an item to cache and extracting an item from the cache is constant time operation, which is very fast. This is achieved by implementing a good hash code method that doesn’t cause collisions.

On another note, Memcached storage of cached items is not traversable/iterable. You cannot traverse the whole cache.

Memcached is awesome! But not for every architecture.

  • You have objects larger than 1MB.
    • Memcached is not for large media and streaming huge blobs.
  • You have keys larger than 250 chars.
    • Memcached doesn’t support more than 250 chars.
  • If you want persistence or a database. You might consider MemcacheDB which provides persistence for Memcached.
  • You’re running in an insecure environment. Memcached doesn’t have any authentication or authorization system.

As I mentioned Memcached has two-stage hashing. It behaves as a giant hash table, looking up key = value pairs. Give it a key, and set or get some arbitrary data. That is it really. That is all it does.

When doing a memcached lookup, first the client hashes the key against the whole list of servers that you need to introduce to your clients. Once it has chosen a server after the first hashing procedure, the client then sends its request, and the server that was chosen does an internal hash key lookup for the actual item data. This enables the client to know which cache server to query again, when the item that was sent to cache is requested.

You need to understand that memcached is not redundant. There is no notion of replication or communication between cache servers; they are unaware of each other. Their only purpose is to store an item and give back an item. If one of your cache servers fails, you will lose all your data within that cache server. You will have to remove the cache server that failed from the list of the cache servers of your clients, ie: configuration.

Moreover, when one of your cache fail and you want to add another one, or you remove your cache from your clients, that will cause a big problem which is all your data, cached items will be invalid. This is due to double hashing mechanism, which clients, uses to hash the items based on the servers. So all your cache will be invalid, and you will have a spike. In order to avoid this, you will have to start a new node and assign the IP address of the dead node to the newly created node; this will prevent all your data to be invalid. Yet another way to solve this problem would be to use Consistent Hashing , in order to avoid computation of hash values of all the data.

Memcached operations aim to be atomic. All individual commands sent to memcached are atomic.

Memcached mimics organization of data via namespaces and it only stores objects. On the other hand, Microsoft App Fabric Cache provides notion of regions or sections. This allows you to keep the same Type of items in independent caches. Moreover, you can store strong Type instead of objects with AppFabric Cache, which I think these features are great.

Memcached is fast. It utilizes highly efficient, non-blocking networking libraries to ensure that memcached is always fast even under heavy load. In other words, in circumstances where your database might be falling over, memcached won’t be. Which is precisely what memcache was designed to do: to take the load off of your database, which for the majority of popular web applications is the biggest performance bottleneck and risk to scalability.

Memcached is simple and easy to deploy. It does not require a lot of technical knowledge to use or use effectively – it just does what it is supposed to.

Compressing large values is a great way to get more out of your memory and network communication/bandwidth. Compression can save a lot of memory for some values, and also potentially reduce latency as smaller values are quicker to fetch over the network.

Most clients support enabling or disabling compression by threshold of item size, and some on a per-item basis. Smaller items won’t necessarily benefit as much from having their data reduced, and would simply waste CPU.

Main Operations to work with Memcached is as follows:

Storing an item to database, you can pass values for datetime or timespan for expiration of the object being set.

Remove an item from the cache.

Increment and decrement given a keys value.

TryGet or Get is used to get an object from the cache.

Some of the clients allows retrieving multiple elements from the cache.

 

On another note, you can read about Microsoft AppFabric Cache.

Windows Server AppFabric

Categories: Programming, Scalability, Web
Tags: ,
Comments: 1 Comment
Published on: July 19, 2011

Caching and reliable hosting is one the important elements for high available and scalable applications. Microsoft delivers Microsoft server appfabric. Two major components are Caching and Hosting. OK, i dont much know about hosting except the fact that it uses Microsoft Activation service, and I think, IIS is using it internally. Again, I m not very familiar with it. But you can learn more in the following URL:
Microsoft AppFabric
and Microsoft provides a training kit which is really useful :
AppFabric Traning Kit

As far as caching, AppFabric Cache used to be velocity cache. It s more than a distributed key value pair, it has many other features such as tags, where you can tag the key value pairs and then you can get them by tag. Moreover, you can create regions to store the relative items within an organization. For example: You can have a student cache, teacher cache, classes cache, etc. You dont have to stick everything into a single cache.

Just like other cache services, the API provides, Add (Adds an item to Cache), Get(Gets an item from cache), Put(Adds or updates an item), Remove methods(Removes an item from cache). Moreover, you can handle concurrency with Optimistic or Pessimistic Concurrency. Usually, you would like to define these in an interface and wrap the Factory Method of Data Classes. In terms of configuration, there are several configuration parameters for appfabric cache, during development, I usually prefer storing the configuration in the dedicated config file, such as cache.config. Moreover, in your application you would prefer to wrap the datacachefactory class, instantiate it and keep it live as long as the application lives.

You can set expiration time for the items in the cache, default is 10 minutes. You can set it so that it doesn’t expire etc.

There is cache eviction, which occurs when the machine the cache is on, runs out of memory. The cache items becomes evicted and drops. Least recently used algorithm is used for this implementation.

While developing code with cache, it is always good to do defensive programming, because there might be several problems, such as items might expire, cache servers might go down etc. So it is very desirable to consider all these failures.

One important point is that, every item you store in the cache should be serializable. All the primitive type in C#.Net are serializable, if you are creating your own types, make sure you make them serializable, there are couple ways to do that, such as implementing ISerializable interface or by using DataContract and DataMember attributes (Which WCF uses). Once your items are serializable you can store them in the cache.

Usual usage of cache is to be placed in front of database. So each request doesnt go to database but instead, it gets the data from cache which increases the response time of the application. This is just one of the usage of cache. Cache basically takes the load of the database.

The 10 commandments of good source control management

Categories: Scalability
Tags: No Tags
Comments: No Comments
Published on: May 3, 2011

This article explains and discusses many benefits of source control and usages.

Lately, distributed source control applications have gain popularity.

page 1 of 2»

Welcome , today is Tuesday, February 7, 2012

Bad Behavior has blocked 250 access attempts in the last 7 days.