Memcached

Memcached is a free and open-source, high-performance, distributed memory object caching system. It is generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) produced by database calls, API calls, page rendering, or simply anything you would like to store temporarily.

Memcached is simply a key/value store. You can see it as a standalone distributed hash table or dictionary. Memcached doesn’t know what your data looks like; all it does is store key/value pairs with an expiration time, using an LRU (least recently used) algorithm to maintain the cached items.

What makes Memcached cool is that it provides scalability. How does it do this? By hashing algorithms that its clients implement. There are two phases of hashing, one of which happens at the client and the other at the server. Eventually, all memcached clients implement some hashing algorithm in order to benefit from memcached’s distributed nature. I will come back to this in a bit.

Having said that memcached is a distributed hash table (key-value store), the servers are disconnected from each other and usually unaware of each other. There is no communication between memcached servers, and no synchronization or broadcasting. This increases the flexibility to scale out: if you are running low on resources on a memcached server, you can add another one, and you can keep adding cache servers as you need. Pay attention, though: if you don’t add cache servers, your cached items will start to be dropped out of memcached as the cache becomes full, and as I mentioned, the least recently used algorithm is used to drop the oldest cached items. This is also called eviction.

In the computer science world, one of the most important notions is algorithmic complexity, for example when searching for an item or sorting a collection. Memcached takes this into account and implements an O(1), constant-time, key-value store. This means that storing an item in the cache and extracting an item from the cache are constant-time operations, which is very fast. This is achieved by implementing a good hash function that minimizes collisions.

On another note, the set of cached items in memcached is not traversable/iterable. You cannot traverse the whole cache.

Memcached is awesome! But not for every architecture.

  • You have objects larger than 1MB.
    • Memcached is not for large media and streaming huge blobs.
  • You have keys longer than 250 characters.
    • Memcached doesn’t support keys longer than 250 characters.
  • You want persistence or a database. You might consider MemcacheDB, which provides persistence on top of Memcached.
  • You’re running in an insecure environment. Memcached doesn’t have any authentication or authorization system.

As I mentioned, memcached uses two-stage hashing. It behaves as a giant hash table, looking up key = value pairs. Give it a key, and set or get some arbitrary data. That is really all it does.

When doing a memcached lookup, the client first hashes the key against the list of servers that you have configured on the client side. Once it has chosen a server in this first hashing step, the client sends its request, and the chosen server does an internal hash-table lookup for the actual item data. Because the same key always maps to the same server, the client knows which cache server to query again when the cached item is requested.
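As a rough illustration, here is a minimal Python sketch of the first, client-side stage: picking a server by hashing the key against the configured server list. The server names are made up for the example.

import hashlib

servers = ["cache1:11211", "cache2:11211", "cache3:11211"]

def pick_server(key):
    # Stage one: hash the key and map it onto the server list.
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

print(pick_server("user:42:profile"))  # the same key always picks the same server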

You need to understand that memcached is not redundant. There is no notion of replication or communication between cache servers; they are unaware of each other. Their only purpose is to store an item and give back an item. If one of your cache servers fails, you will lose all the data on that server, and you will have to remove the failed server from your clients’ list of cache servers, i.e., the configuration.

Moreover, when one of your cache servers fails and you want to add another one, or you remove a cache server from your clients, a big problem arises: all your cached items become invalid. This is due to the two-stage hashing mechanism that clients use to map items onto servers; changing the server list changes the mapping, so your whole cache becomes invalid and you will see a load spike. To avoid this, you can start a new node and assign the IP address of the dead node to the newly created node, which keeps the mapping intact. Yet another way to solve this problem is to use consistent hashing, which avoids recomputing the placement of almost all the data when the server list changes.
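To make the consistent hashing idea concrete, here is a minimal Python sketch of a hash ring, with made-up server names and no virtual nodes; production clients are more elaborate, but the placement logic is the same: only the keys near an added or removed server move.

import bisect
import hashlib

def h(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers):
        # Each server owns the arc of the ring that ends at its hash point.
        self.points = sorted((h(s), s) for s in servers)

    def pick(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["cache1:11211", "cache2:11211", "cache3:11211"])
print(ring.pick("user:42:profile"))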

Memcached operations aim to be atomic. All individual commands sent to memcached are atomic.

Memcached can only mimic the organization of data via key namespaces, and it stores only opaque objects. Microsoft AppFabric Cache, on the other hand, provides the notion of regions or sections, which allows you to keep the same type of items in independent caches. Moreover, with AppFabric Cache you can store strongly typed items instead of plain objects. I think these features are great.

Memcached is fast. It utilizes highly efficient, non-blocking networking libraries to ensure that memcached stays fast even under heavy load. In other words, in circumstances where your database might be falling over, memcached won’t be. That is precisely what memcached was designed to do: take the load off your database, which for the majority of popular web applications is the biggest performance bottleneck and risk to scalability.

Memcached is simple and easy to deploy. It does not require a lot of technical knowledge to use, or to use effectively; it just does what it is supposed to.

Compressing large values is a great way to get more out of your memory and network communication/bandwidth. Compression can save a lot of memory for some values, and also potentially reduce latency as smaller values are quicker to fetch over the network.

Most clients support enabling or disabling compression by a threshold on item size, and some on a per-item basis. Smaller items won’t necessarily benefit much from having their data reduced, and compressing them would simply waste CPU.
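As an example, here is a minimal sketch using the python-memcached client (assuming it is installed and a server is running on localhost); its min_compress_len parameter is the size threshold above which values are compressed before being sent.

import memcache

mc = memcache.Client(["127.0.0.1:11211"])

# Values larger than min_compress_len bytes are zlib-compressed by the
# client before transmission; smaller values are stored as-is.
mc.set("big_report", "x" * 100000, time=300, min_compress_len=1024)
print(len(mc.get("big_report") or ""))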

The main operations for working with Memcached are as follows:

  • Set/Add: store an item in the cache; you can pass a datetime or timespan value for the expiration of the object being set.
  • Delete: remove an item from the cache.
  • Increment/Decrement: increment or decrement a given key’s value.
  • Get/TryGet: retrieve an object from the cache.
  • Multi-get: some of the clients allow retrieving multiple items from the cache in one call.
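Here is a minimal sketch of these operations using the pymemcache Python client (assuming a server on localhost; the keys and values are made up for illustration):

from pymemcache.client.base import Client

client = Client(("localhost", 11211))

client.set("greeting", "hello", expire=60)    # store with a 60-second TTL
print(client.get("greeting"))                 # b'hello'

client.set("hits", "0")                       # counters must be numeric strings
client.incr("hits", 1)
client.decr("hits", 1)

print(client.get_many(["greeting", "hits"]))  # multi-get in one round trip
client.delete("greeting")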

 

On another note, you can read about Microsoft AppFabric Cache.

LinkedList

LinkedList is a data structure for storing collections of data, with the following properties:

  • Data is stored within units called nodes.
  • Nodes are usually represented as entities independent of the LinkedList class or representation.
  • A node, as a class or struct, can contain various pieces of information.
  • Nodes are connected to each other by pointers.
  • There can be next, previous, and multi-level pointers.
  • A LinkedList consists of nodes.
  • A LinkedList can grow and shrink as you maintain the nodes.
  • A LinkedList doesn’t require contiguous memory space; the nodes that make up the linked list can be scattered in memory.
  • A LinkedList doesn’t waste memory space, unlike arrays.
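A minimal sketch in Python of what a node looks like and how nodes link up (the data values are arbitrary):

class Node:
    def __init__(self, data):
        self.data = data
        self.next = None          # pointer to the next node; None at the tail

# Nodes are independent objects connected by pointers and may live
# anywhere in memory:
a, b, c = Node("A"), Node("B"), Node("C")
a.next = b
b.next = c                        # A -> B -> C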

Main LinkedList operations are (see the sketch after this list):

  • Insert a Node to LinkedList
  • Remove a Node from LinkedList
  • Find a Node in LinkedList
  • Traversing the list (forward traversal, reverse traversal)
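Here is a minimal singly linked list sketch in Python covering these operations, reusing the Node class above:

class LinkedList:
    def __init__(self):
        self.head = None

    def insert_front(self, data):          # O(1) insert at the head
        node = Node(data)
        node.next = self.head
        self.head = node

    def remove(self, data):                # unlink the first matching node
        prev, curr = None, self.head
        while curr and curr.data != data:
            prev, curr = curr, curr.next
        if curr:
            if prev:
                prev.next = curr.next      # bypass the node
            else:
                self.head = curr.next      # removing the head

    def find(self, data):                  # linear scan
        curr = self.head
        while curr and curr.data != data:
            curr = curr.next
        return curr

    def traverse(self):                    # forward traversal
        curr = self.head
        while curr:
            yield curr.data
            curr = curr.next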

Arrays overview:

One memory block is allocated for the entire array to hold the elements, and array elements can be accessed directly, in constant time. Arrays implement a random-access interface to support the contract of random access to the elements.

Arrays are simple to use, and constant-time access makes them fast to work with. There are also disadvantages to working with arrays, as follows:

  • Fixed size: the size of the array is static. You need to specify the size of the array in order to be able to use it.
  • One block allocation: in order to initialize the array, you need one contiguous block of memory. If your array is too big to fit into the available memory, you won’t be able to create it.
  • Inserting elements at random positions in the array is an expensive operation, due to shifting the elements. Random inserts are expensive; random reads are cheap.

There are also dynamic arrays, in which an internal array is maintained by a wrapper class. Allocation, extension, and trimming are done automatically by the wrapping class, which gives you the benefit of not maintaining it yourself.

One of the advantages of a linked list over an array is storage allocation. With a linked list, your storage grows as you insert new elements; you don’t need to allocate space in advance in order to use it. With arrays, on the contrary, you need to know the allocation space/number of elements the array will hold in order to create it. Moreover, if you need more space, you have to create another array and copy the contents of the first array into the new one. This is an expensive operation, because you have to copy all the elements to the newly created array, and if you do this many times, it will cost you performance.

There are also disadvantages to linked lists, such as accessing a random element: reaching, say, the nth element requires traversing the elements until you find it. So there is no direct access to nodes, except the first and possibly the last node, if a tail pointer is provided.

Another disadvantage of linked lists is the locality of nodes. With arrays, all the elements are adjacent to each other, contained within one memory block, which enables CPU caching at run time. With a linked list, on the other hand, the nodes might be scattered all over the memory, so accessing them loses that cache-friendliness and costs extra time. On today’s computers this difference often doesn’t matter much, but it is yet another trade-off.

A lot of programmers out there complain about the difficulty of maintaining the pointers of a linked list. I disagree with that; I find them quite easy to follow and modify.

Generally, in programming languages such as C# or Java, linked list implementations are doubly linked (and in some implementations the doubly linked list is circular). With a doubly linked list, each node has a link to the previous node and the next node. You can also keep head and tail pointers, so that you can access the head and the tail of the linked list in constant time.

Linked lists come up in many interviews because they show how programmers think and make use of pointers. Also, both recursion and iteration can be applied to linked lists very easily, which is another reason why linked lists show up in interviews so often.

Main Linked List Operations in depth:

Insertion:

Inserting a node at the head of a linked list is a constant-time operation: point the new node’s next pointer at the current head and make the new node the head. Inserting at an arbitrary location might take close to linear time. If you are inserting at the end of the list, the operation requires traversing all the nodes and pointing the last node’s next pointer at the node to be inserted. If you want to insert in the middle, you can use two pointers, one of which moves two nodes at a time while the other moves one node at a time; when the fast-moving pointer reaches the end, the second pointer is at the middle of the list. If you are using a doubly linked list, the middle element can also be found with two pointers, one starting from the head and iterating forward and another starting from the tail and iterating backwards; when the two pointers meet, that is your middle node.
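A small Python sketch of both ideas, reusing the Node class above:

def insert_front(head, data):          # O(1): the new node becomes the head
    node = Node(data)
    node.next = head
    return node

def middle(head):                      # fast/slow pointers: fast moves two
    slow = fast = head                 # nodes per step, slow moves one; when
    while fast and fast.next:          # fast hits the end, slow is the middle
        slow = slow.next
        fast = fast.next.next
    return slow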

In reality, programmers usually use linked lists to insert elements at the end or at the beginning of the list. Only in puzzles and interviews are programmers asked to insert into the middle of a linked list, and other such varieties.

Deletion/Removal of a node:

Deleting a node is simply disconnecting it from the linked list container. Deletion differs between singly and doubly linked lists. In a singly linked list, to delete a random element you need access to the node you would like to delete, and you also need access to the previous node in order to disconnect the desired node from the list. In a doubly linked list, deleting a node is rather simpler: once you have access to the node you would like to delete, all you have to do is point the previous node’s next pointer at the current node’s next pointer (and the next node’s previous pointer back at the previous node) and you are all set.

One important point to understand is that in order to delete a node, you need to have access to it, as I just mentioned. To get access to a node, you have to iterate the linked list and find it, and only then can you perform the delete operation. If the node you want to delete is the last node, which you don’t know up front, this operation might cost you linear time.

One tricky question you might be asked is: “in a singly linked list with only a Next pointer, if you have access to the node you want to delete but not to the previous node, how do you do it?” The answer is straightforward: you do have access to the next node, its data, and its next pointer, so you copy the next node’s data into the node to be deleted and point that node’s next pointer at the next node’s next node. You are all set. (Note that this trick does not work when the node to delete is the last one.)
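A sketch of the trick in Python:

def delete_given_node(node):
    # Works only when node is not the tail: copy the successor's data
    # into this node, then unlink the successor.
    node.data = node.next.data
    node.next = node.next.next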

Finding a node:

Searching for and finding a node within a linked list is an important operation, and it is also used as an auxiliary step in other methods, such as deleting a node. In a singly linked list, in order to find a node within the container, you start traversing the list from a starting point, let’s say the beginning, and follow the nodes until you find the one you want. This is a linear operation/scan: it might take linear time to find the node you are searching for.

Traversing/Iterating Linked List:

Traversing or iterating the linked list is simply following the next pointer of the current node until there is no more next pointer, i.e., the current node’s next pointer is null. It is quite simple to traverse a singly linked list forwards. However, if you would like to reverse a singly linked list, you need an algorithm for that. Iterating/traversing a doubly linked list in both directions is quite simple.
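For reference, a minimal iterative reversal of a singly linked list in Python:

def reverse(head):                 # O(n) time, O(1) extra space
    prev = None
    while head:
        nxt = head.next            # remember the rest of the list
        head.next = prev           # flip the pointer backwards
        prev, head = head, nxt
    return prev                    # prev is the new head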

There are variations of linked lists, such as the singly linked list, doubly linked list, circular linked list, etc. One newer, optimized implementation is the XOR linked list. In an XOR linked list, instead of keeping two pointers to the next and previous nodes, a single field stores the XOR of those two pointers. The benefit is space. This is usually called a memory-efficient doubly linked list (XOR linked list). The new definition of the Node class contains a PointerDiff field instead of Next and Previous pointers.

Preliminary:

X ^ X = 0
X ^ 0 = X
X ^ Y = Y ^ X (commutative)
(X ^ Y) ^ Z = X ^ (Y ^ Z) (associative)

The concept is very straightforward. Once again, each node has a single field that allows us to travel in both directions: within the node we store only the XOR of the current node’s next and previous pointers. Look at the following example:

A -> B -> C -> D

You have 4 nodes. B’s PointerDiff would be A ^ C, C’s PointerDiff would be B ^ D, and so on (A’s would be 0 ^ B = B, since it has no previous node).

How does it work?

If you want to move from B to C, B’s PointerDiff is A ^ C. We XOR the PointerDiff with A, and the result is (A ^ C) ^ A = C.
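Python has no raw pointer arithmetic, so here is a sketch that simulates addresses with list indices (index 0 plays the role of the null pointer); the XOR bookkeeping is exactly the scheme described above:

NULL = 0
mem = [None]                       # mem[0] is the null slot

def new_node(data):
    mem.append({"data": data, "ptrdiff": NULL})
    return len(mem) - 1            # the node's "address"

a, b, c, d = (new_node(x) for x in "ABCD")
mem[a]["ptrdiff"] = NULL ^ b       # A: previous is null, next is B
mem[b]["ptrdiff"] = a ^ c
mem[c]["ptrdiff"] = b ^ d
mem[d]["ptrdiff"] = c ^ NULL

def traverse(head):                # the same loop works in either direction
    prev, curr = NULL, head
    while curr != NULL:
        print(mem[curr]["data"])
        prev, curr = curr, mem[curr]["ptrdiff"] ^ prev

traverse(a)                        # prints A B C D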

MongoDB Tutorial

MongoDB is known as a scalable, document-oriented database. MongoDB provides easy scalability, sharding, failover, and many more cool features. Moreover, replication and master/slave configuration are relatively easy.

Start the MongoDB server:

./mongod

The /data/db folder should exist and be writable, and port 27017 should be available for mongod to use. Mongo will also enable a web interface for status and administrative information.

MongoDB uses the admin database for administrative work such as authentication and many other tasks. There is the local database, whose data is stored locally only and is not distributed. Then there is the config database, which stores the configuration information for the database server.

MongoDB uses collections, which correspond to tables in the relational model, and documents, which correspond to rows. Even though there is no notion of a schema and none is enforced, it is definitely important to keep the documents in a collection consistent with each other. Moreover, storing documents in their own collections, in a hierarchy, is essential as well. For example, even though you can store any kind of document within a single collection, it doesn’t make much sense, because later, when querying the collection, you will have to sort them out.

In my opinion, the document-based model fits the object-oriented model better than the relational model does; not fully, but thinking of a document as an object is much more sensible.

File = {

    "FileName" : "Users.csv",
    "FileSize" : 1024

};

This is our File document. Entries within the document can be thought of as key-value pairs, or as a variable and a value.

If you are on mongo shell:

# db.files.insert(File);

This command inserts the File document into the files collection. Whenever a document is inserted into a collection, MongoDB produces a document id for it, stored in the “_id” field, which represents an object id.

# db.files.find();

The find command is executed on a collection and returns all the documents within that collection. The find(predicate) form also takes a predicate and returns the documents satisfying it.

Ex: db.files.find({"FileName" : "MongoDB"});

There is also a findOne() method, which returns a single document.

The update(predicate, data) method takes two arguments: a predicate selecting the documents we want to update, and the data we want to update the document with.

The remove() method is used to remove all the documents from a collection. Moreover, remove(predicate) is used to remove the documents that satisfy the predicate.

# db.files.remove({"FileName" : "Mysql"});

MongoDB has six basic data types, which are as follows: null, boolean, numeric (double), string (UTF-8), array, and object. Moreover, MongoDB supports embedded documents, i.e., a document within another document.

MongoDB produces a globally unique object id for each document, which I think is a great feature: you don’t have to worry about synchronizing ids across multiple servers, etc. The object id has a structure: timestamp + machine id + process id + increment. Including all this information in the object id ensures that it is globally unique and there won’t be any collisions. In my experience, generating unique ids/GUIDs is computationally expensive; therefore, MongoDB provides functionality for the clients to generate the object id and inject the value into the document, which takes the load and burden of this expensive operation off the database.
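For example, in the shell you can create an id and pull the embedded creation time back out (the printed id will of course differ):

# ObjectId()
# ObjectId().getTimestamp()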

MongoDB supports batch inserts besides regular inserts, in which you insert documents one by one. A batch insert reduces the communication between the client and the server, so batch inserts are much faster than inserting one document at a time. One drawback of inserting into MongoDB is that there is no data validation, which means you can insert malformed or malicious data into your database, and if you do validation yourself, operations will be slower.
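In the shell, a batch insert is just an array of documents (the field values here are arbitrary):

# db.foo.insert([{x : 1}, {x : 2}, {x : 3}]);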

# db.foo.insert({"hello" : "world"});

Removing documents from a collection is very simple:

# db.foo.remove();

This command clears the contents of the collection but doesn’t clear the indexes. Weird, right?

The remove(predicate) method takes one or more conditions and removes the items that satisfy them.

Clearing a collection by removing all its elements is computationally expensive. Dropping the collection, however, is very fast.

Updates are a very interesting topic in databases and concurrency. Here is how MongoDB handles them: the last update wins. There is no notion of transactions or locking; whichever update executes last is what remains. Individual updates are atomic.

Update has a special operator, $set, which is used to add or set a property on a document. In other words, you can change the shape of the document while updating with this operator. Likewise, you can remove a key from the document, again changing its shape, with the $unset operator.

The $inc operator increments a key that holds a number. This is very useful for counters, analytics, etc.
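For example, in the shell (the field names here are made up):

# db.users.update({"Username" : "firat"}, {$set : {"City" : "Seattle"}});
# db.users.update({"Username" : "firat"}, {$unset : {"City" : 1}});
# db.users.update({"Username" : "firat"}, {$inc : {"Logins" : 1}});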

MongoDB doesn’t support transactions, which is often considered a missing feature; but consider that MySQL’s MyISAM, one of the most popular database engines, doesn’t support transactions either, and it is used very widely.

MongoDB uses the BSON data format, which is binary JSON. Even though BSON takes more space than JSON, it has certain benefits, such as performance: processing BSON is more efficient than processing JSON. Another benefit is that every client knows how to convert BSON into whatever data format it needs for processing.

MongoDB uses dynamic querying and updates in place. In-place updates are very efficient operations and are done with lazy writes, because accessing and writing to disk is very slow; since MongoDB uses memory-mapped files to store data, updating in place increases MongoDB’s performance. The trade-off is that the data MongoDB is updating might be stale for a very short time. CouchDB, on the other hand, uses MVCC, which keeps multiple versions of the data for different users; with MVCC the data is guaranteed to be up to date.

Storing binary data in MongoDB is also an easy operation. MongoDB supports binary values up to 4MB, and if you need to store bigger files, MongoDB partitions the file for you and stores it in chunks. Then, when you run range queries, you can fetch the chunks of the data and merge them, or use them as you need.

MongoDB supports manual and auto sharding. Sharding is basically partitioning your data set across a set of servers. With manual sharding, you are responsible for determining which servers the data goes to, as well as for merging the results of your queries. With auto sharding, MongoDB takes care of everything for you: you have the impression of talking to a single server. Even though auto sharding hadn’t been fully implemented yet at the time of writing, it promises great flexibility for scaling.

With the following commands you can get some information about your database and environment.
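For example, these standard shell helpers report statistics, server status, and version (the exact output depends on your build):

# db.stats()
# db.serverStatus()
# db.version()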

One of the great features of MongoDB is its schemaless database design: you don’t need to define a schema as you do with relational database management systems. This is a great strength while you are developing your applications. The first benefit is coping with changing requirements and specifications, the second is handling the changing nature of your data, and another is being able to hold different data for documents of the same type.

MongoDB uses collections to store documents. A collection is a notion similar to a table in relational database terminology. Collections are containers used to store documents; they are dynamic and grow as you add more documents. There are also capped collections, which contain a limited number of documents: fixed-size collections.

The data types in MongoDB are the following: string, integer, double, boolean, min/max keys, arrays, timestamp, object, object id, null, symbol, date, regular expression, JavaScript code, and binary data.

ObjectId is a special type, created automatically by MongoDB for every document. An ObjectId is a 12-byte value (displayed as a 24-character hex string) built from a timestamp, machine id, and so on. This guarantees that the object ids across all your servers are unique. Moreover, in a sharded, clustered environment, managing auto-generated primary key ids is very difficult, so auto-generated object ids help with document ids. This is a great feature in my opinion.

You can use indexes in MongoDB to increase your read and query efficiency. Keep in mind, however, that adding indexes to a collection increases your read speed but can affect your write speed. You will take a performance hit on a collection if you write a lot of data and read seldom. This is due to the indexes and the data structures used to maintain them, i.e., B-trees.
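Creating an index in the shell might look like this (ensureIndex was the call in older shells; the field name is illustrative):

# db.users.ensureIndex({Age : 1});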

With MongoDB, unlike relational databases, you don’t have to create databases or collections up front. If a database or collection doesn’t exist, MongoDB creates it for you when your code first uses it.

In order to view current databases use the following:

# show dbs

Then if you want to use a specific database:

# use myDb

Once you are in a database context, you can view the collections as follows:

# show collections

Once you switch to a database, the current database is referred to as db.

While working with MongoDB collections, you will often search for documents. find() is used to find documents; it also takes predicates to match the documents being searched for.

# db.items.find({Foo : "bar"});

will return the items whose Foo attribute is "bar".

While searching, you can limit or skip documents as well.

# db.items.find().limit(10);

# db.items.find().skip(20);

you can also sort your results:

# db.items.find().sort({Foo : 1});

One of the nice features is that you can combine these function calls and chain them fluently.

# db.items.find().sort({Foo : 1}).limit(10).skip(20);

The notion of capped collections in MongoDB comes in handy when you want the natural insertion order to be preserved. With regular collections, your items are not guaranteed to stay in insertion order; for example, documents can be moved when they outgrow their allocated space. Capped collections are fixed-size collections, just like a fixed-size circular buffer: once the capped collection is full, new items cause the oldest items to be purged, i.e., dropped. You can define the size of a capped collection. Capped collections guarantee insertion order. Updating documents within a capped collection is possible if and only if the document size stays the same, and you cannot delete a document from a capped collection; to do either, you have to re-create the capped collection and insert the documents again.
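Creating one in the shell might look like this (the name and byte size are illustrative):

# db.createCollection("log", {capped : true, size : 100000});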

In order to find out about the state of your collection, you can invoke the validate() method on it.

As stated, you can use the find() method to query your documents; the findOne() method can also be used to fetch a single document.

MongoDB aggregation methods are very easy to use. For aggregation you can use the count, distinct, and group functionality.

The count() method is used to find out the number of documents in a collection, as follows:

# db.users.count();

or with a predicate:

# db.users.find({Age : 30}).count();

distinct() is used to return the distinct, unique values of a key within a collection.

# db.users.distinct("Name");

You can also pass predicates to the distinct method.

Moreover, MongoDB provides many predicate and conditional operators, such as $gt, $gte, $lt, and $lte, for filtering your results.

# db.users.find({Age : {$lt : 21}})

There is a $ne operator to get the documents where a field is not equal to a value.
# db.users.find({Name : {$ne : "firat"}})

The $in operator filters for documents whose value is within the given array. There is also a $nin operator to find documents whose value is not in the array.

# db.users.find({Name : {$in : ["foo", "firat", "bar"]}})

This call finds the documents whose name is foo, firat, or bar.

# db.users.find({Name : {$nin : ["foo", "firat", "bar"]}})

This call finds the documents whose name is neither foo, firat, nor bar.

There is a counterpart to $in, which is $all; it gets you the documents that match all the values in the statement.

The $size operator lets you specify the number of elements in an array field of a document and returns the matching documents.

# db.users.find({Accounts : {$size:2}});

returns you users with 2 accounts.

Because MongoDB is schemaless, a key you search for might not be present in every document. You can use the $exists operator to find documents that have the specified key, as follows:

# db.users.find({DateOfBirth : {$exists : true}})

MongoDB querying is lazy. Querying the database returns you a cursor, not the results; as you iterate through the result set, you get the results lazily. This is a nice feature.

MongoDB provides upserts: an upsert either inserts a new document or updates an existing one (in the shell it is an option to update()).

Moreover, there is the $inc operator, mentioned above, that increments a value; this can come in handy for analytics.

The $set operator sets a value, and the $unset operator removes a particular value. There are also $push and $pop operators, which work like a stack, allowing you to add or remove elements from an array within the document.

In order to remove documents from a collection:

# db.users.remove({"Username" : "firat"})