C# Data Structures – Collections overview

Within C# there are several collections (data structures, containers) ready to use for your applications. Moreover, there are several collections (data structures, containers) in computer science that serves lot of different purposes.

Array :   Let’s start with array, which is likely to be the most used data structures within implementations of C# collections, even the data structures like List, Stack, Queue and many others use array as the base the data structure.

Arrays are very efficient data structures. There are single dimensional array as well as multi-dimensional arrays and jagged arrays. Before getting into how to create arrays and their methods, let’s point out the memory structure of an array. When you create an array, you will need contiguous space in memory. Let’s say you have an array of size 1 million, you will need space for 1 million blocks within the memory. This can be good and bad. It can be good because keeping data at the same region will make access to it faster, compared to having elements all over the memory, such as linked list node. Also, it can be bad, if you reallocate your array you will need enough contagious space within AppDomain for reallocation, otherwise it fails. Is this a seldom case? Yes.

Arrays are based on indexes, which (almost all the time) starts with index 0. There are ways to start an array with a different index, which is mentioned in CLR via C# book. Arrays are fixed size once you create them, arrays don’t grow dynamically unlike a list or arraylist. If you have an array of size 5 and you need to insert the 6th element to it, you will have to allocate a new array with more capacity (load factor), copy all the elements from current element to the new array and then you can insert new elements to it.

Let’s talk about the insertion, search and delete operations on arrays. In order to insert an element to an array, you have to know the location, index, in which you need to insert the element to. If there is an element at the location you insert an element, it will place the new item to that location and you will lose whichever element was there. So insertion to an array takes constant time , O(1). However, if the array is full and you need to insert an additional item, as I mentioned it, you will have to copy all the array elements to a new array, which takes a constant time to insert for each element, this operation becomes amortized and overall insert to an array is still constant time, O(1). Searching an element in an array is very easy, you have to iterate through all the elements and until you find the element you are looking for, which is O(n). There are methods to sort an array. Once the array is sorted you can use binary search, which is way faster than linear search. Deleting an element from an array is same as inserting an element, it takes constant time.

Arrays are enumerable collection in C#, in other words you can enumerate through an array using foreach. Moreover, you can use a for loop and access the array elements by index. Arrays are Reference Types. Once you create an array, it is created on the heap even if it holds a reference type or value type. Because arrays are enumerable, you can also call LINQ methods on arrays.

If you like to read more about the syntax of arrays as well as the methods go to:

http://msdn.microsoft.com/en-us/library/9b9dty7d.aspx

ArrayList : ArrayList is a dynamic array of objects. While this data structure used to be very popular due to dynamic feature, it has been almost abandoned and being kept in the framework for backward compatibility.

If you look at the definition of ArrayList Source code, which is as follows:

public class ArrayList : IList, ICollection, IEnumerable, ICloneable
{
// Fields
private const int _defaultCapacity = 4;
private object[] _items;
private int _size;
[NonSerialized]
private object _syncRoot;
private int _version;
private static readonly object[] emptyArray;
// Much more goes on.
}

You will see that the default capacity of an ArrayList is 4 as well as ArrayList contains items of object type. ArrayList implements IEnumerable which indicates that you can enumerate through the elements that ArrayList contains via foreach loop. Another Collection Interface ArrayList implements is IList, which allows accessing ArrayList elements by index, as well as inserting an item at an index location and removing an item from an index location. Accessing an item via index happens with an indexer. Moreover, ArrayList implements ICollection interface which allows consumers to be able to add items, remove items to the collection. Also there ICollection interface has contains method that ArrayList implements.

You can learn more about ArrayList internal structure by using .Net Reflector. I would like to point out that adding an item, and retrieving an item from an ArrayList as well as other operations are not efficient due to casting. in terms of basic inserting and retrieving you have the same complexity as an array, furthermore you don’t have to reallocate, copy the elements to new ones, ArrayList does it for you. So even though it has many benefits, ArrayList is an inefficient data structure.

if you studied casting or boxing, you can cast anything to an object, which is called boxing, so object is a box and it takes anything, you can put whatever you want in an object. You dont have to worry about the type of what you are putting in an object, however while retrieving, you have to know what to cast to (unbox) the object to, otherwise you will get a casting exception. This operation becomes very expensive, if you cast many items. This is the main reason why ArrayList is inefficient.

List<T>: List is a generic collection type, you will have to define the type of the elements you will be using with the list. List uses array internally. Definition of the list is as follows:

public class List<T> : IList<T>, ICollection<T>, IEnumerable<T>, IList, ICollection, IEnumerable
{
// Fields
private const int _defaultCapacity = 4;
private static readonly T[] _emptyArray;
private T[] _items;
private int _size;
[NonSerialized]
private object _syncRoot;
private int _version;

Generic List Collection uses Array internally as well. By implementing IList<T> interface, it ensures to provide implementation to an indexer, Insert, Remove methods. Moreover, List<T> also implements ICollection<T> interface which ensures to provide implementation to several collection methods like add, remove, count etc.

Just like ArrayList, List<T> also provides several high level methods to insert, remove, sort, search and manipulate the list without knowing what s going on internally. This is a great abstraction and eases the development for programmers. List<T> implements IEnumerable so i can be traversed/enumerated with a foreach loop, it also has index support so you can access List<T> elements by index. You can use Linq queries and Standard Query methods on a List<T>.

Difference between ArrayList and List<T> is that you dont need boxing and unboxing with List<T> since you define the generic type during the initialization of the List. So avoiding casting brings tremendous performance gain especially if you have high number of elements in your collection.

keep in mind that List uses array internally, so again, you will need contiguous memory space to keep the elements together.

LinkedList<T> : Linked Lists are composed of LinkedListNodes. These nodes are basically containers which keeps elements within. here is a definition of a node for singly linked list.

class Node<T>
{

public T Value {set;get;}
public Node<T> NextNode {set;get;}

}

This is the most basic definition of a Node class for a LinkedList. Node contains data and it also contains a pointer to the next node. This Singly Linked List Node is forward only, there is no Previous Node Pointer. You can add a PreviousNode<T> to the Node class and maintain the previous node while building your linked list.

For this text, I will write about Doubly Linked List, because .net Linked List implementation is a doubly linked list and certainly doubly linked list makes life way easier, compared to Singly Linked List.

Linked List nodes are stored in the memory unlike arrays. You wont need contiguous memory space to store Linked List Nodes. Nodes are sparse in the memory. This is good and bad. It s good, so that you won’t need contiguous memory space, and you can use all of the memory. It’s bad, accessing them will require to scan the memory, which might be a little slower compared to accessing an item in the memory by an index.

Linked Lists are very popular and used in many applications. Moreover, you can implement other data structures, using linked lists, such as queue, stack and deque.

There is also circular doubly linked list or circular linked lists. The idea behind circular linked lists are they point to a node (usually head node) in the linked list.

Lets briefly describe insertion, find and delete operations in linked list. To insert an item to a linked list, there are various ways to do it, if you want to just insert an element to the head node, it is constant time O(1), if you like to insert an element to the end of the list it takes O(n) time, linear time, because you will have to find the last element and insert to the end of it. These is for a NON circular linked list, for a circular linked with head and tail nodes, in other words sentinel nodes. I assume that you dont have sentinel nodes, and you just need to add to last.

Deleting a node from linked list takes linear time as well. You need to find the node to be deleted and disconnect it from the linked list.

Searching an element in a linked list takes linear time. You need to visit all the nodes in order to find an element. There is no notion of binary search for linked list data structure.

There are several linked list algorithms and uses. You can check out this post on linked lists and algorithms for linked list.

Linked list is a very simple data structure, you should be able to implement a linked list easily. It is a good practice.

SkipList : I will briefly mention about skip list. Skip list is an amazing list type data structure. Skip list also uses Node<T>, nodes to represent data in containers and like linked list it has pointers, unlike linked list, skip list has many pointers to many arbitrary elements within the list. Skip list needs to keep order, sorted order. One of the important purpose of skip list is to increase searching time. Searching in a skip list takes O(logn), logarithmic time, which is faster than searching a linked list. There is no notion of indexes with skip list.

inserting an element, deleting an element from skip list is also complicated operations. Because you will have to maintain all the pointers, and make sure they wont break.

You can read more about skip list on Wiki.

There is no Skip List Implementation within C#, so you will have to implement it yourself.

Stack<T>: Stack is a first in last out based data structure. It is a very common data structure. Stack is very easy to understand and implement. Stack uses array as the internal data structure. You can also implement a stack using linked list. Both are pretty straight forward to implement.If you use an Array for implementing Stack, you will have to maintain the array as it grows and shrinks. Another data structure which can be used for implementing a stack can be circular buffer, and this can be more efficient than both using an array or linked list.

In C# stack implements ICollection and IEnumerable interfaces, which means you can iterate through stack elements due to IEnumerable interface, moreover, you can use methods of ICollection interfaces such as, Count, CopyTo etc. You can iterate the stack from top to bottom and from bottom to top, if you write a Custom Enumerator, which is very straight forward.

Main operations of Stack are Push, Pop and Peek. While adding an item to stack, you use Push method. While removing an item from the stack, you need to call Pop method. Peek returns the element on top of stack, however, it doesn’t remove the item from the Stack.

Implementing Stack in C# is very straightforward.

Running time, of inserting an item to a stack is O(1), you just insert the item to the location where the current pointer is pointing to. Popping, removing an item from stack also takes O(1), because you just remove the first item on top of stack. If you want to remove any element in the stack it might take up to O(n). Searching an item in a stack is linear time, O(n), as you need to iterate through the whole stack.You can search a Stack by using Contains(T item) method or you can peek all the elements and check equality, either way, you need to iterate through all the elements within Stack.

Stack is a very easy and useful data structure and it is very common. With .Net 4.0 C# has ConcurrentStack<T> which can be used in multi threaded applications. If you want a Thread Safe Stack Implementation, I would highly recommend if you use ConcurrentStack<T> instead of you trying to build your own Stack implementation with locks etc.

Queue<T>: Queue is a commonly used data structure, just like Stack and they are very similar, you can implement a Queueu with Stack and vice versa.

Queue is a First in First Out data structure. First item you insert is removed from Queue, once you remove an item. Imagine a two ended stream, items are entering from one end and leaving from another end in order of insertion.

C# implementation of Queue uses Array internally with a single pointer. Becauase Queue uses array internally and naturally, array will grow as you insert (enqueue) elements to Queue, you will have to maintain array ie: grow and shrink it. There is a grow factor variable used internally for array. This factor indicates how much the array will be growing as you enqueue items.

Queue can be implemented using a linked list or circular buffer as well easily.

Main operations of Queue is enqueue, dequeue, and peek. Enqueue is for inserting an item to the queue. Dequeue is for removing an item from the Queue. Peek is for looking for the item in front of the queue, but not removing it actually.

Asymptotic notation of Queue is very similar to Stack, Enqueue to Queue and Dequeue from Qeueue takes O(1), constant time. You just do the operation to the array element that pointer is pointing to. Peek is as well takes constant time. Searching an element in a Queue is takes linear time. You have to iterate through all the elements, at most, to find a item within Queue, this takes O(n) time.

There are several algorithms and usage of Queue data structure. With .Net 4.0, now there is ConcurrentQueue for thread safe queue operations.

Queue implements IEnumerable, which means, you can iterate through all the elements of the Queue, you can also peek all the elements without removing them.

You can also iterate the Queue from both ends if you write a custom enumerator.

Queue is commonly used within applications.

HashSet: HashSet is very cool data structure and it is very efficient and fast. HashSet contains only unique items, in order words, duplicates are not allowed in hash set. Hash set is due to Set Theory, you can use any operation that a Set in math supports. Such as union, intersection, overlapping, difference etc. This brings lot of value to HashSet.

HashSet is based on Hash values of items. Due to this, hashset resembles HashTable or dictionary.Implementing hash code is very important. Once the hash code is calculated for an item/object, it determines duplicates. If there is a duplicate in the set, you wont be able to insert the object to the hash set. Therefore, you GetHashcode method should be implemented carefully. Moreover, GetHashcode method should be very fast, implementing a slow GetHashcode method might cause performance degration, if your application takes lot of time calculating hashcode. If different objects results in same hashcode, this is called Collision. Collision is HashSet basically fails to insert an element the collection, unlike HashTable.

HashSet uses two arrays Internally as well.One to store the hash values of the elements and one to store the Items/Objects namely :
private Slot<T>[] m_slots;

private int[] m_buckets;

Slot is a value type, Struct, that keeps the elements in. Buckets are where the hashcode and corresponding element resides.

As you would imagine, both of these arrays should be maintained when they need to grow and shrink. Both HashSet and HashTable has to handle this natural cause. Once the hash array and bucket array is full, there are two new arrays are created to store Slots and hash codes, then the next prime number that s bigger than the current array size is calculated. This new prime number is set as the size of new Slot and Hash Code array. Here is the interesting part, hash codes are being recalculated, by finding the remainder by the prime number and Items are being inserted to Slots array correspondingly. Even though this seems to be a costly operation, it is very efficient for a implementing a Set.

In order to add an item to HashSet,you need to use Add(T item) method, which call AddIfNotPresent(T item) method, and does all the operations i just mentioned including adding the item to both of the arrays.

HashSet implements ISet<T> which has public methods as follows :add, ExceptWith, IsProperSubsetOf, IsProperSuperSetOf, IsSubsetOf, IsSupersetOf, Overlaps, SetEquals, UnionWith, SymmetricExceptWith.

Moreover, HashSet implements, IEnumerable which indicates that you can iterate through the HashSet elements.

HashSet also implements ICollection<T> which has methods : Add, Remove, Clear, Contains and Count. So HashSet implements all these methods.

There is much more details to HashSet. Above is just a little summary of HashSet in C#.

The HashSet<T> performs very well for add and search operations. It has O(1) for add in most cases. When the internal capacity must be increased, it is an O(n) operation. Increasing capacity is infrequent. Any search operation (Contains, Remove, and similar operations) are O(1). That’s great. Enumerating the elements in a sorted order forces you to copy the items to a different collection (like a List<T>) and sort the resulting list. You could construct a LINQ query to order the elements, however internally that query will likely use some form of temporary storage to create the sorted sequence. That means every sort will be an expensive operation. Sort is typically an O(n ln n) operation, Also, because the HashSet<T> does not have a sort method, you’ll also have increased memory pressure and time cost to copy the elements.

Enter SortedSet.

SortedSet: Ordering and sorting is a very important topic in Computer Science. There are more then dozen of sorting algorithms and still this day, there is constant development and optimization of sorting algorithms.

SortedSet uses Red-Black Trees internally. Red-Black trees are balanced trees to guarantee logarithmic time for inserting and searching for elements. Red black trees are not easy to implement because you need to maintain the height balance of the tree as well as the colors of the nodes. You can read more about Red-Black trees online. All you need to know for this context is that, elements are being stored internally within SortedSet as sorted and inserting an element to SortedSet is fast, logarithmic time and searching for an element, finding an element takes logarithmic time as well.

Enumerating a sorted set will provide ordered , sorted, elements.

SortedSet also provides all the set methods like union, intersection etc. SortedSet implements ISet so all methods I mentioned above are implemented for SortedSet as well. All these methods are really easy to implement, such as reverse, subset, superset and etc.

I think implementation of red black trees are the hard part for SortedSet.

If you like to check out some code examples, take a look at this link

Circular Buffer : TODO

Heap : TODO

HashTable : TODO


SortedSet : TODO

Dictionary : TODO

Sorted Dictionary : TODO

Tree : TODO

Binary Search Tree : TODO

Graph : TODO

 

 

 

Best Practices and Hidden Features of C#

Best Practices and Hidden Features of C#

While working with files, folders and IO, C# provides several helper classes for the job one of which is Path class. While most of the developers still uses string operations to combine the paths, Path class provides a Combine method to get the job done.

Ex:

var  path = Path.Combine(@”c:\csv\”, “foo”); // return a string of the new path.

Path class also provides many other methods to access to many functionality of the path information, such as full path, extension, file name, directory name and so on.

C# introduces ?? operator, null-coalescing, to evaluate expressions for their null value. if the expression on the left is null, expression on the right will be evaluated. ?? operator comes in very handy while working with databases due to the results can include null values. Moreover, you can chain many ?? null-coalescing operators together in the same statement. Using null-coalescing you can as well lazily initialize your objects.

ex:

var res = obj ?? new YourObject();

Casting is an expensive and tedious process. C# introduces is and as operators to simplify casting. When you are doing a cast, there is a possibility of an exception being thrown, to avoid this, you can use is and as operators. is operator is used to check the type of an object, if(myObject is MyObject) like this. This operation returns true or false. There is no exception thrown. Likewise with as operator, if the cast fails, CLR wont throw an exception, instead it will return false. If the cast is successful, you will get the casted object.

C# provides automatic properties which is a good start while coding. Most of the time if you might need encapsulate some logic within the properties you can easily change it through automatic properties. Properties are an important concept to implement encapsulation and properties provides a good starting point for encapsulation.

ex :

public string Name {set;get;}

While working with generics you will encounter some cases to assign values to your generic types, and when you wont know what type you will be expecting and what to assign to your variables, default(T) comes in handy. In the run time, CLR finds out the default value of the generic type, if it s an object, null, if it s value type, whatever the default value of it will be used by default method.

While working with string data, you will need to escape characters like \ or spaces, verbatim string comes in handy.

ex:

string path = @”c:\csv\foo.csv”; // if you dont use @, you will have to escape \.

Nullable types, is a great feature of C#, while working with databases or APIs, sometimes, you get no result, null values, and using value types, you will get exceptions most of the time for having null values, this is the time you will need to use nullable types.

Environment class provides several utilities and most of them are system independent. Some of the methods are:

Environment.FailFast(string); //This is used fail fast, right away, reporting to event log with the exception.
Environment.NewLine; // Provides new line, system independent
There are several other cool methods about the system information provided by Environment class.

Using statement and directive provides several benefits one of them is to make the object used within the using statement to be eligible for garbage collection. When the object is out of scope, object becomes eligible for garbage collection and CLR invokes the finalizer or the dispose method of the object.

Ex:

using(ISession session = new Session()){} // when the session object is out of scope, you wont need to explicitly call the dispose method of the object, CLR finds it and executes it for you for garbage collection.
Another feature of using is you can create shortcuts for long named objects as follows:

using dict = Dictionary<MyLongNamedObject, ANotherLongNamedObjectOrCollection>;

There are several approach for string comparison and below is an efficient way of doing it :

myString.Equals( theirString, StringComparison.OrdinalIgnoreCase ); // return boolean. If you convert both of your strings to upper or lower, you are creating extra objects which will as well needed to be recycled etc. So this is an efficient way of comparing strings in C#

Most of the time developers check if the string is null or empty seperately. C# has a string method to check if the string is empty or null as follows :

string.IsNullOrEmpty(myString);// return boolean.

Yet another problem is while the string is null or white space, OK you can trim the string and check the white spaces, which creates new string, C# provides method to check if the string is null or contain white space(s) via

string.IsNullOrWhiteSpace(myString);