What happened to turkiye.gov.tr

turkiye.gov.tr is our e-government gateway to many useful resources and hope this article finds its destinations.

Recently turkiye.gov.tr launched a new feature for viewing family trees. Great feature and work! There was a rush for many curious minds to find out about their ancestors. I guess turkiye.gov.tr wasn’t expecting too much traffic, so, things just went sideways in the most colossal way.

I am not familiar with the architecture of turkiye.gov.tr or the ancestry service but worked with many large-scale systems.

Below you will find 10 facts, I believe, caused outage for turkiye.gov.tr.

Fact #1: Big systems fail more often than small systems.
Big systems have more external and internal dependencies along with many moving parts. Therefore, big systems should have many defensive mechanisms in place.

Fact #2: Blocked threads are the number one cause of most failures.
Slow applications and hung threads are the most popular reasons of failures. These reasons lead to cascading failures and chain reactions. Blocked/hung threads can happen due to several reasons like deadlocks, starvation and live locks. Hung thread detection policies, timeouts, circuit breakers and bulkheads can prevent these failures.

Fact #3: Integration points are number one killer in any system.
Every integration point eventually fails, however a failure in the integration point shouldn’t take down the whole application. Cascading failures occur when problems in one integration point propagates. Failures in integrated services becomes your problem. It even become more serious if you are not prepared for it. Same as above, defensive programming, timeouts and circuit breakers can prevent serious failures.

Fact #4: Tight coupled systems fail more often than loose coupled systems.
Decoupling middleware is a good practice to enable loose coupling for integration. This principle is applicable and best practice for cloud native applications.

Fact #5: High traffic site’s Resource/Connections pools get drained very quickly.
Resource/Connection pool have limitations. They can run out of resources rapidly and application performance will start degrading.

Fact #6: Unbalanced capacities causes failures and scalability problems for applications.
If the capacities are not aligned, then you have a problem. Capacity and sizing should be planned accordingly.

Fact #7: Never trust a code you have no control over or you didn’t develop which can be a third partly library or a remote system developed by someone else.
The downstream application which is running a blocking code, can take your application down.

Fact #8: Slow applications gets more traffic.
When an application is slow, users hit re-load button or F5 many times to reach to the application which causes more traffic.

Fact #9: Fail fast all the time and retry gradually.
Exponential back off, Circuit breakers and timeouts should be embraced. User friendly error codes or default messages should be presented to the user until the stability is established.

Fact #10: Appreciate your hardware resources and utilize them wisely.
Don’t fall for CPU and Memory is CHEAP. This is not TRUE. Long running CPU cycles can cause contention which slows down your application and eventually it will fail. Paged or unfragmented memory causes slow seek times.

What happened reminds of dotcom bubble back in 2000s circa. Yahoo!, Altavista, Lycos and many internet giants back then faced similar problems and they started developing scalable platforms.


Circuit breakers

In my previous post, bulkheads, I mentioned that “Since the popularity of service oriented architectures and then microservices, people are talking about bulkheads and some other terms like circuit breakers, timeouts etc. However, I see that many lacks the understanding of what these terms really mean or how to implement them. I decided to cover some of them and wanted to start with Bulkheads.”. In this post I will briefly cover up Circuit breakers.

Not many years ago, people used to plug too many electronical appliances into their circuit, each appliance drew certain amount of current and when the current is resisted, it created so much head in the walls of the house and then the house burns down. Later on there were some optimizations but still houses on fire.

Circuit breakers come to rescue.

This idea can also be applied to software applications with integration points. A software with one or more integrations are destined to fail at any given time. This is a given. Circuit breakers can help prevent operations that is already known not to be healthy.

Circuit breakers are a way to degrade functionality when the system is under stress. Changes in the circuit breakers’ state should always be logged and monitored. Circuit breakers are effective at guarding against integration points, cascading failures, slow responses etc.

In a normal closed state, circuit breakers executes operations as usual. Other services can be invoked, internal operations can proceed. However if any operations fail, circuit breakers store this information, that is including any times. Once the number of failures exceeds a certain threshold, circuit breaker trips and opens the circuit. Any call made to the circuit breaker, it fails immediately.

This is very important, because most of the failures occurs  due to blocked threads, race conditions and dead locks. Hence, if a service is not responding or continuously timing out, what is the point of invoking it. All yours threads will be blocked and you will run out of them. Soon, your JVM or runtime will crash.

After a configurable amount of time, circuit breaker can go into half open state, in which calls can pass-through and if all goes well, circuit breaker closes. If things are still failing, circuit breakers opens again.

If circuit breaker is open, you can either let the user know something is not working and check back soon. Or you can have fallback services. The latter is much more better however services unfortunately can not have fallback due to their responsibility and function.

You can have multiple circuit breakers for different purposes such as timeouts, connections refused and other type of failures.

There are several tools that can help you implement circuit breakers in your system. Netflix has an open source project for this purpose called hystrix . You can check out it and see how things work.



Since the popularity of service oriented architectures and then microservices, people are talking about bulkheads and some other terms like circuit breakers, timeouts etc. However, I see that many lacks the understanding of what these terms really mean or how to implement them. I decided to cover some of them and wanted to start with Bulkheads.

This term comes from Ships. In a ship, bulkheads are partitions that can be sealed and closed during an emergency. Something like following:

If one of the compartment starts taking water, it can be sealed once hatches are closed which prevents the water moving from one compartment to another hence sinking the ship.

Same technique can be employed for software and architecture. By partitioning your system you can avoid cascading failures. Bulkheads can be applied to physical and application services in such a way that if one of the hardware or application fails, the system should continue functioning. Critical applications should be partitioned and bulkheads should be implemented.

Imagine you have an application A and application B. Then there is a critical common service called service C. This service is very critical for both apps. In a conventional architecture, the design is as follows:

The problem with this architecture is if Service C goes down for any reason both of the apps will be affected. So Bulkheads pattern recommends the following:

Deploying Service C for both of the Apps provides better stability for the apps. This can be simply independent hardware, application host or thread pool.  You can partition thread pool in an application by deploying to multiple virtual machine.

Today many application servers provide means to separate runtime environments for applications. You can deploy the same application under different context and assign seperate JVM or CLR to go with it.

Also today we have docker and several Virtualization software which makes implementing Bulkheads easily.


Talk is cheap, show me the code

Day by day, I believe in hype driven development. “We have to use docker”, “Oh! Microservices are awesome.”, “Hey hotness, have you tried Redux, it is fucking awesome”.

Soon enough these have become quite irritating.

Hours of discussions, bullshit and meetings, all around the hype without any deliverables.

It is essential to create value in the work place and for the projects. Continuous hype and criticizing the technologies, surely doesn’t help anyone.

“Talk is cheap show me the code” — Linus Torvalds.

One can only survive with the hype for a short period of time.

A gentle introduction to HTTP/2

Let’s start with HTTP/0.9

The original HTTP as defined in 1991 was developed by Tim Barnes Lee who is a web developer. It was a text based request and response protocol. You could do HTTP GET and only for HTML (response type). It was solely used for sharing documents mostly about physics. After each request connection was closed.

Later on, HTTP/0.9 was extended to HTTP/1.0, request and response headers were added, also, you request images, text files, CSS and other.

In 1999, HTTP/1.1 showed up, Persistent connections (keep alive) was introduced, chunked transfer encoding and host header were added. With host headers, it was possible to host multiple sites for an IP. It was a huge SUCCESS!

Problems with HTTP/1.1:

  • It wasn’t designed for todays web pages
    • 100+ HTTP requests, and 2MB+ page size.
  • Requires multiple connections
  • Lack of prioritization
  • Verbose headers
  • Head of line blocking

Lets break it down.

Multiple connections:

When the app requires 100+ HTTP requests, there is a limit to the number of connections a browser can open per host, most browsers support 6 connections simultaneously. This becomes a problem, it takes time to establish and be efficient, there is 3 way handshake every time it needs a connection.

Before HTTP/1.1 each resource required a 3 way hand shake which was very wasteful.

With HTTP/1.1 Connection header was introduced. By default Connection: Keep-alive was introduced. With this header, the problem with three way handshake was eliminated and everything was able to be done via single TCP connection.

TCP is a reliable protocol. Each packet sent we receive a acknowledgement.

This introduced the head of line blocking problem as below. If the data 2 fail, it will block the subsequent requests.

There are other problems like slow start and sliding window. Adjusting the transfer rate by the condition of the network. This is called Flow Control.

Lack of prioritization:

Priority is decided by browser. Not be developer or the application. There is no way to specify the order or responses. Browsers need to decide how to best use the connections and the order of resources. There is no prioritization built on HTTP/1.1

Verbose headers:

There is no header compression. You can use GZIP to compress the content however, the headers such as Cookie, User Agent, Referrer and others are not compressed. HTTP is stateless by default; therefore Cookies, User Agent and other headers are sent to server every time, which is inefficient especially for very high volume sites.

Head of line blocking:

Once you create a connection, and send a request to a resource, that connection is dedicated for that request, until the response comes back, you can’t use that connection. ie: If you need to get 30 resources from host, you can get 6 at a time. Once you request 6 resources, the others must wait for these 6 requests to finish. So, HTTP/1.1 works in a serial way (serial request and response). This is called Head of line blocking.

These are the problems with HTTP/1.1. Then we have other problems like bandwidth and latency.

Bandwidth is measured in bits per second, which is relatively easy to add more to a system. Bandwidth is usually expressed as network bandwidth, data bandwidth or digital bandwidth.

Latency is the time interval between cause and affect in the system. In internet latency is typically the time interval between request and response. Latency is measured in milliseconds, which is based on distance and speed of light, there is not much to do when it comes to latency. You can use CDN to beat latency, however that comes with a price.

For more information about latency and impacts read “it’s the latency, Stupid“, by Stuart Cheshire.

Increasing bandwidth helps improve web performance, page loading times etc. However, there is a limit to it. On the other hand, fixing the latency problems almost helps linearly to performance. You can read this post for more information about bottlenecks. Based on the tests done on this post indicates that improving latency is more efficient than improving bandwidth, when it comes to web page optimization.

If we were to compare internet to a highway bandwidth is the number of lanes and latency is time it takes to travel a specific distance which depends on traffic, speed limit and others.

Goals of HTTP/2

  • Minimize impact of Latency
  • Avoid head of line blocking
  • Use single connection per host
  • Backwards compatible
    • Fall back to HTTP/1.1
    • HTTP/1.1 methods, status, headers still work.

Biggest success of HTTP/2 is reducing latency by introducing full request and response multiplexing.

HTTP/2 Major Features

  • Binary framing layer
    • Not a text based protocol anymore, binary protocols are easier to parse and more robust.
  • Resource prioritization
  • Single TCP connection
    • Fully multiplexed
    • Able to send multiple requests in parallel over a single TCP connection.
  • Header compression
    • It uses HPACK to reduce overhead.
  • Server push

HTTP/2 introduces some more improvements, more details: HTTP/2 RFC7540

Binary framing layer uses doesn’t use any text, currently we can trace and debug, this change will require tools for debugging.

Resource prioritization will allow browsers and developers to prioritize the resources requested. Priorities can be changed at any time based on resources or application. So far so good. However, if there is a problem with a high priority resource, browser can intervene and request the low priority resources.

Most important change in HTTP/2 is Single TCP connection per host which solves lot of problems. HTTP/2 multiplexes request and response frames from various streams. Lots of less resources are used, there is no 3 ways handshake, no TCP Start slow and Head of Line Blocking.


When a user requests a web page, headers are sent for every single request which are Cookies, User Agent and others. It doesn’t make too much sense to send User Agent for every single request. To solve this problem dynamic table was introduced.

When you send a request to a host, the following headers are sent along. On the consequent requests, not the whole values are sent, instead the compression values are transmitted. In the future requests if the compressed values are same, nothing will be sent again. If User-Agent doesn’t change it won’t send anymore.

HTTP headers

Original value Compression value
Method GET 2
Scheme HTTP 6
User-agent Mozilla Firefox/Windows/10/Blah 34
Host Yahoo 20
Cookie Blah 89


Currently server push is experimental today yet, servers try to predict what will be requested next and push to client.

ALNP is needed for HTTP/2.

As part of the spec, HTTP/2 doesn’t require HTTPS however, if you need HTTPS, you need to use TLS 1.2+ and also don’t use some cipher suites.

Can you use HTTP/2?

With HTTP/2 we have single multiplex connection, which means some of the performance optimization techniques sorta becomes obsolete such as CSS sprites, JS bundling, and domain sharding. Since connections are cheaper with HTTP/2 and more efficient, resources can be cached, modified and requested independently, there are of course tradeoffs which you need to decide how to implement.

However, I think, web performance optimizations like fewer HTTP requests, send as little and as infrequently as possible still applies.

While many major Internet giants are using HTTP/2, it is still not adapted as much. I assume, adoption will take a while and maturity of this new exciting protocol will come along.

Here are some demos, showing the difference between HTTP/1.1 vs HTTP/2.

Akamai demo

CDN 77 demo

There is a new reference you can check out:



Blue-Green deployments – no more downtime!

Deploying new functionality or version of a software always have the risks of introducing bugs, downtime and chaos. I usually get shivers during deployments J  Some of the things that can go wrong:

  • Application failures
  • Capacity issues
  • Infra failures
  • Scaling issues

If you are practicing continuous delivery correctly, releasing to production is not always another deployment. You need to measure risks, think of what can go wrong, coordinate and communicate while going to production.

There are low risks techniques to release software which are as follows:

  • Blue green deployment
  • Canary releases
  • Dark launching
  • Production immune system
  • Feature toggles

Blue green deployment is a release management technique to minimize downtime, avoid outages and provides a way for roll back during deployment.

Initially you need to have at least two identical setups.

For a setup of two identical environments, one is called blue while the other is called green. That is why this release technique is called Blue/Green Deployment. Companies like Netflix call this technique Red and Black deployment. I have also heard; it is called A/B deployment. Regardless of the name, the idea behind it is pretty straightforward.

You can always have multiple identical setups in geographically distributed data centers or within the same data center, also on cloud.

A load balancer or a proxy is a must to achieve this process.

While deploying, you need to cut off the traffic from one setup and have the traffic go to other setup(s). The idle environment is called blue and the active setup is called green. You do the deployment to blue environment, once the deployment is done, you do the sanity checks and tests, once the environment is healthy, you can take the traffic onto it. If you see any problems, with the blue setup, you can always roll back to the stable version.

The best part of this practice is that deployment happens seamlessly to your clients.

This technique can eliminate downtime due to application deployment. In addition, blue-green deployment reduces risk: if something unexpected happens with your new release on Green, you can immediately roll back to the last version by switching back to Blue.

If you have multiple data centers:

Initiall two data centers are active, serving to clients.

Take the traffic off from Cluster 1 and let it go to Cluster 2. Do the deployment to Cluster1, check if deployment is successful, run the tests.

Take the traffic off from Cluster 2 and let it go to Cluster 1, your new version is now live. Then do the deployment to cluster2, check if it is successful, test it.

Have the traffic go to both cluster by routing the traffic to both data centers.


Data Store replication: If your application is using any data stores across different region or data center replication becomes crucial. Database and schema migrations need to be implemented. This can be a uni- directional or bi-directional replication depending on the needs.

However, if you are using a single data store that is feeding applications, database schema changes need some attention as well.

If your app uses a relational database, blue-green deployment can lead to discrepancies between your Green and Blue databases during an update. To maximize data integrity, configure a single database for backward and forward compatibility.

Service Contract changes: Yet another challenge is updating service contracts while keeping the applications up and running.

Cache warming up: After deployment warming up caches can take some time.

Session Management: Session management strategy is crucial and can be problematic during deployment. Using a dedicated session storage will help a lot while doing blue/green deployments.

Today, we have docker, kubernetes and cloud of course. All these platforms also supporting and embracing blue green deployment.

Read more at martin fowler



Everything you need to know about data modelling for document databases

Back in 90s and early 20th century applications were like following:


Today, building applications is different than what it was back in 2000. Building modern day applications involves quite a few different technologies.


Back in the days, you could estimate and predict the size of your audience, today we have virality, different marketing schemes which makes it harder to predict the size of audience. You can never know, your next app can be the big hit.

We spent too much time in relational world, it takes time, effort and practice to bend your mind and understand how to do it in NoSQL way.

In relational data stores data is organized in tables which store rows and columns. Data types are usually the simple ones like, string, int, float date time. It is hard to store complex data types like arrays, hash tables and complex objects, which is called impedence mismatch.

On the other hand, key value or document stores can also store simple data types, moreover, they are comfortable storing complex data types usually in JSON format. JSON is basically a special class of string for representing data structures. Also, the schema in key/value and document stores are implicit and unenforced.

For many year now, we have been thought to normalize our data model for redundancy and efficiency as following.


Relational data stores are like a garage, you break your car apart and store related ones in the shelves. Store related things together.

In a document databases, you persist the vehicle the way it is. “Come as you are”

Same data can be represented in JSON document format as above. It is much easier, compact and simpler to represent the aggregate in JSON. It is a denormalized form of the data. Also, It makes it more convenient to distribute this data across multiple servers or clusters. The main problem with relational data stores here is the JOIN operation which is expensive operation.
Here is the domain model for this particular example:

While ORMs come to rescue for relational databases, performance is usually becomes a problem. One can claim that serialization/deserialization for JSON documents can also become a problem however, usually serialization frameworks are a lot faster these days.

Schema free databases makes it easy to rapidly develop software, however, you need to understand and embrace the principles behind data modelling for document databases.

Questions you need to keep in mind during modelling are:
• How is data going to be persisted?
• How are you going to retrieve or query data?
• Is your application write or read bound?

Key Patterns:
We start by defining unique keys which we call predictable keys (that we can query easily). Following keys have corresponding values which are JSON documents.

user::[email protected] -> User data as value
user::[email protected] -> User data as value

product::12345 [ID of the product] -> Product data as value

Then we have unpredictable keys as follows:

session::a5ab2863-db93-430e-8da3-feeb1998521f  -> Session document data

Unpredictable keys are queried with Map-Reduce functions in most key/value and NoSQL databases.

Counter-Id Pattern:

Almost all of the key/value and document stores provides very fast atomic counters. These atomic counters are safe for doing increment and decrement operations. The idea behind counter-id pattern is as follows:


Additionally, we can get the current state or the value of the counter anytime and hence we can predictably query users or iterate over them. This is a pretty easy and clever technique in my opinion. Since we only do increment, I know the size of the data, I can even run multi-get operations on it or do some paging.

This is similar to Identity column in RDBMS systems. Each time I want to store a document, increment the counter, get the counter value, create a predictable key and store the key/value.

Lookup Pattern:

Lookup pattern is usually a two step process as follows:


Initially, we store the data with an unpredictable key (such as GUID, UUID), then we create references to the particular key/data with predictable keys such as (email, username etc.)

In this example: we store a user data with a GUID representation, then we store references to this particular key, such as email, username. While retrieving the user data in a predictable way, we use email and username to get the GUID representation, then we can do another GET operation with key we captured from the first query, in order to get the data.

This makes it easy to retrieve documents without Map-Reduce operations. This pattern is quite useful for large datasets. Most of the data stores provide map-reduce operations with eventual consistency that use B trees on disk with log(n) complexity. However, lookup pattern provides immediate consistency with O(1), constant lookup time. Hence lookup pattern is very fast, scales very well and provides read your own writes along with consistent performance.

You can use these patterns together. For example: you can use counter pattern along with lookup pattern as following:


Initially we get an ID from our counter, we store the data with “user::54” key. Then we can add additional lookup keys such as  (“twitter::first”,”user::54″) and we use other keys that we will need to get this particular document. Again, we need a two step process to get the initial data. First we do a GET operation on (“twitter::firat”), which gives us the result (“user::54”), then we do another GET operation on (“user::54”) and we have the user document.

Yet again, these are pretty fast operations since they run in constant lookup time.

Pros of combining counter-id and lookup pattern:
These are extremely fast binary operations.
Linear performance.
Enables several ways to find single document.
Provides consistency, always consistent.

Cons of combining counter-id and lookup pattern:
Increases number of documents in the storage which is usually not a problem.

Below you can find the differences between RDBMS and Document Stores for persisting and querying the data.


Related data patterns:
To embed or to reference? That is the question.
Embedding is easy. You can persist the document the way it is.

Pros with embedding:
Easy to retrieve document, you can get the document at once. Single read operation brings you back the document. Single write or update operation takes care of your insert or update.

Cons with embedding:
In most data stores there is a certain limit to document size such as 20 MB (Mongodb) or 16 MB(Couchbase).

Most key/value stores or document stores are designed and developed to store small chunk of documents. Storing large documents are usually a bad idea.

Retrieving a large document can be a burden on network, even if you have gzip enabled. When you have multiple clients updating the same document then you might have versioning problems you need to take care of. You will have to implement optimistic or pessimistic locking.

At the same time, embedding can be performant due to locality reference.

When is it good to embed:
• There are contains relationships between entities.
• There are one-to-few relationships between entities.
• There is embedded data that changes infrequently.
• There is embedded data won’t grow without bound.
• There is embedded data that is integral to data in a document.

Usually, denormalized or embedded data models provide better read performance.

Below is an example of embedded document:

Below is an example of a document with references:

References lines items are separate documents such the following:lineitem_id

Each line item can be accessed with the key captured from LineItems array above.

You can use both referencing and embedding as below:


You can also use referencing heavily such as following:


Sometimes we need to embed, sometimes we need to reference. When do we embed and when do we reference? This is the critical decision for efficient modelling. There are some signals for selecting embedding or referencing.



1:1 (one to one) relationship, is usually better for embedding. Also, if there is a 1:Few relationship, it is also acceptable to embed. For example a blog post document with tags or categories. So we need to care about relationships and dependencies. Embedding leads to better performance because of RTT (round trip times) to databases.

1:many (one to many) relationship is good for referencing. For example: blog post with comments document. A blog post can have hundreds of comments with different sizes. There is no bound to referenced data. Instead of embedding, referencing is more efficient in this case. You can also have race conditions if you prefer to use embedding in this case.

M:m (Many to many relationship) is good for referencing as well.

Data volatility is important while choosing embedding vs referencing. How often does the data change? If the data doesn’t change so often, it is better to embed rather than referencing. However, if there are a lot of operations on particular documents, it is usually better to reference. So, if the volatility is low, embedding is OK, if volatility is high referencing is better. With high volatility you can have race conditions.

If a document is used by many clients, it is better to reference rather than embedding. Referencing provides better write performance. Because you have smaller documents to write. However, it causes performance hit for READ operations due to multiple Round trip times.

Normalizing data may save some space however,  storage is cheap today, also  requires single read and RTT. Normalizing data doesn’t align with classes/objects , object model. Referencing typically provides faster write speed. Denormalized aligns with classes/objects, however, requires updates in multiple places. Typically provides faster read speed, since you get the document at once.

Clearly there are trade offs for referencing and embedding. It is up to you to decide which one to use based on your data model and access patterns.

Data is data, regardless of the shape of it. It can be stored in table or a document store, it doesn’t matter. Document stores are structural, there is schema and object model.

It might take a while to get used to the data modelling with key/value and document stores, you might need to bend your mind and get rid of relational data modelling concepts. Once you get used to it, data modelling with document stores are pretty trivial. Understanding how to create keys is usually a skill, but you can create quite amazing patterns once you get used to it.

These access patterns can be used in many key/value and NoSQL stores, for example: Redis, MemCache, Couchbase, Couchdb, Mongodb, Riak, Dynamo and cloud document store solutions.

Moreover, hints above for referencing or embedding comes handy with the data stores above as well.

Bitmap – BitArray – BitSet – Bit Vector

Besides image processing, Bitmaps are handy with real time operations and analytics. Bitmap is a useful data structure and extremely space efficient. Also, BitMaps are underlying data structure for other data structures. You can find Bitmaps being used in many places including linux kernel.

Bitmap, BitSet or BitVector is an array of zeros and ones. A cell, bit in bitmap can be set to either one or zero. Since BitMap uses array, we can work with index, offset. Bit operations such as OR, AND, XOR can be used with ease.


Population count can be easily calculated for Bitmaps, which is the number of bits that are set to 1. There are efficient algorithms for calculating population count, there are even special hardware instructions for this operation.

Bit related operations are of two kinds: Constant time (O(1)) e.g. operations to get/set of a particular bit and operations that are O(N) which basically operate on a group of bits. N in these cases is the length of bits that the operation will need to work on.

Common use case for bitmaps are counting stuff. Here are some examples:

  • Has user 123 been online today? This week? This month?
  • Has user 123 performed action “X”?
  • How many users have been active have this month?
  • How many unique users have performed action “X” this week?
  • How many % of users that were active last week are still active?
  • How many % of users that were active last month are still active this month?
  • And a lot of other things!

For more info checkout wiki Bit Array.

Bloom filters: explanation, use cases and examples

Bloom filter is an exotic data structure; everyone is talking about. It is used to tell whether an element is in the set or not. Bloom filters are memory efficient and fast data structure. It is a probabilistic data structure. It tells that an item is definitely not in the set or may be in the set.

Visualisation of Bloom Filter

Most important operations of a bloom filter are test and add.

Test is used to check whether a given element is in the set or not. If it returns:

  • false then the element is definitely not in the set.
  • true then the element is probably in the set. The false positive rate is a function of the bloom filter’s size and the number and independence of the hash functions used.

Add simply adds an element to the set. Removal is impossible without introducing false negatives, but extensions to the bloom filter are possible that allow removal e.g. counting filters.


Use cases:
The classic example is using bloom filters to reduce expensive disk (or network) lookups for non-existent keys. Bloom filters are being used by NoSQL database engines, especially by key/value stores in order to efficiently query data stores. If an element is not in the bloom filter than we know that we don’t need an expensive lookup operation. If the key is in the bloom filter then we know we can query the data stores for the given key. We can expect some false positives which needed to be handled by the engine.

Bloom filter can be used anywhere where false positives are OK and can be handled but false negatives are not acceptable.

Bloom filters can also be used to reduce I/O operations and thus increase performance.


A bloom filter doesn’t store the elements themselves, this is the important part. You don’t use a bloom filter to test if an element is present, you use it to test whether it’s certainly not present, since it guarantees no false negatives.

How does it work?

The bloom filter essentially consists of a bit vector (BitSet) of length m.

To add an item to the bloom filter, we feed it to k different hash functions and set the bits at the resulting positions. For example, bit vector can be m=50 cells in length and k=3 means, 3 hash functions will be applied to the key. Note that sometimes the hash functions produce overlapping positions, so less than k positions may be set.

To test if an item is in the filter, again we feed it to the k hash functions. This time, we check to see if any of the bits at these positions are not set. If any are not set, it means the item is definitely not in the set. Otherwise, it is probably in the set. More hash function you use, less likely you will get false positives.

Bloom filters are space efficient, since it uses bit vectors only. Given a Bloom filter with m bits and k hashing functions, both insertion and membership testing are O(k). That is, each time you want to add an element to the set or check set membership, you just need to run the element through the k hash functions and add it to the set or check those bits.

Examples: Wiki

  • Akamai‘s web servers use Bloom filters to prevent “one-hit-wonders” from being stored in its disk caches. One-hit-wonders are web objects requested by users just once, something that Akamai found applied to nearly three-quarters of their caching infrastructure. Using a Bloom filter to detect the second request for a web object and caching that object only on its second request prevents one-hit wonders from entering the disk cache, significantly reducing disk workload and increasing disk cache hit rates.[10]
  • Google BigTable, Apache HBase and Apache Cassandra, and Postgresql[11] use Bloom filters to reduce the disk lookups for non-existent rows or columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.[12]
  • The Google Chrome web browser used to use a Bloom filter to identify malicious URLs. Any URL was first checked against a local Bloom filter, and only if the Bloom filter returned a positive result was a full check of the URL performed (and the user warned, if that too returned a positive result).[13][14]
  • The Squid Web Proxy Cache uses Bloom filters for cache digests.[15]
  • Bitcoin uses Bloom filters to speed up wallet synchronization.[16][17]
  • The Venti archival storage system uses Bloom filters to detect previously stored data.[18]
  • The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.[19]
  • The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data sets is significantly larger than the other (often called Bloom join in the database literature).[20]
  • The Exim mail transfer agent (MTA) uses Bloom filters in its rate-limit feature.[21]
  • Medium uses Bloom filters to avoid recommending articles a user has previously read.[22]

4 Great web optimization tools

logo-small@x2Crazyegg.com provides heat maps, mouse moves, scrolls, corn flakes etc. Google Analytics doens’t provide these functionality. With crazyegg.com you can measure your site usability such as where are your users spending their time and interacting with your site, where are they clicking or scrolling to which is quite valuable information.

Alternative to crazyegg.com there is hotjar.com which also provides heat maps, conversion funnels, visitor recordings on different devices.

Optimizely-logoOptimizely.com optimizes web experience via A/B testing. I worked on the similar project at ebay.com and certainly it has great benefits. You can set up tests for small percents (%1 or %2) of your site users to view different view of your site. You can measure and optimize your usability with the results. You can optimize your site, campains and increase conversion.

0pyDMTypBrowserstack.com helps you automate tests of your site for different browsers easily. It provides a device cloud for testing your site. Broserstack can also generate screenshots for different browsers running on different devices.

pingdom-300x79Pingdom tools helps your measure your site speed and with optimizations you can make. This tool shows your site performance with respect to DNS lookups, round trip times, size of site’s media etc.