turkiye.gov.tr is our e-government gateway to many useful resources, and I hope this article reaches the right people.
Recently turkiye.gov.tr launched a new feature for viewing family trees. Great feature and great work! Many curious people rushed in to find out about their ancestors. I guess turkiye.gov.tr wasn’t expecting that much traffic, so things went sideways in the most colossal way.
I am not familiar with the architecture of turkiye.gov.tr or the ancestry service, but I have worked with many large-scale systems.
Below you will find 10 facts that, I believe, contributed to the outage at turkiye.gov.tr.
Fact #1: Big systems fail more often than small systems.
Big systems have more external and internal dependencies along with many moving parts. Therefore, big systems should have many defensive mechanisms in place.
Fact #2: Blocked threads are the number one cause of failures.
Slow applications and hung threads are the most common causes of failure, and they lead to cascading failures and chain reactions. Threads can block or hang for several reasons, such as deadlocks, starvation and livelocks. Hung-thread detection policies, timeouts, circuit breakers and bulkheads can prevent these failures.
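To make the timeout and bulkhead ideas concrete, here is a minimal sketch in Python. It assumes nothing about how turkiye.gov.tr is built; the names call_with_timeout, slow_lookup and fast_lookup are made up for illustration. A small dedicated thread pool plays the role of the bulkhead, and the caller stops waiting after a deadline instead of hanging with the dependency.

```python
import concurrent.futures
import time

# Bulkhead: at most 2 threads can ever be stuck inside this one dependency.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)

def call_with_timeout(fn, timeout_s, fallback):
    """Run fn on the bulkhead pool and stop waiting after timeout_s seconds."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # only has an effect if the task has not started yet
        return fallback

def slow_lookup():
    # Hypothetical stand-in for a dependency that responds too slowly.
    time.sleep(0.5)
    return "real answer"

def fast_lookup():
    # Hypothetical stand-in for a healthy dependency.
    return "real answer"
```

Note that the timeout frees the caller, not the worker thread. A call that is already running keeps its thread until it returns, which is exactly why the pool is kept small and separate from the rest of the application.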
Fact #3: Integration points are the number one killer in any system.
Every integration point eventually fails; however, a failure in one integration point shouldn’t take down the whole application. Cascading failures occur when a problem in one integration point propagates to its callers. Failures in integrated services become your problem, and they become even more serious if you are not prepared for them. As above, defensive programming, timeouts and circuit breakers can prevent serious failures.
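A circuit breaker can be sketched in a few lines of Python. This is a simplified illustration, not a production library: the class name, the thresholds and the injectable clock parameter are all assumptions made for the example. After a number of consecutive failures the breaker opens and rejects calls immediately; once a cool-down has passed it lets one trial call through.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the dependency."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock        # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency disabled, failing fast")
            # Cool-down elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate error they can turn into a default response, and the struggling service gets time to recover instead of more traffic.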
Fact #4: Tightly coupled systems fail more often than loosely coupled systems.
Using decoupling middleware, such as a message queue, is a good way to keep integrations loosely coupled. The same principle is a best practice for cloud-native applications.
Fact #5: On high-traffic sites, resource and connection pools drain very quickly.
Resource and connection pools have limits. Under heavy load they can run out of resources rapidly, and application performance starts to degrade as requests queue up waiting for a free resource.
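The sketch below shows one defensive habit for pools: never wait forever for a resource. It is a toy pool built on Python's standard queue module, with made-up names (BoundedPool, PoolExhaustedError) and plain strings standing in for real connections. If nothing is free within the deadline, the caller gets an error right away instead of joining an ever-growing line of blocked threads.

```python
import queue
from contextlib import contextmanager

class PoolExhaustedError(Exception):
    """Raised when no resource became free within the deadline."""

class BoundedPool:
    def __init__(self, resources):
        self._idle = queue.Queue()
        for resource in resources:
            self._idle.put(resource)

    @contextmanager
    def acquire(self, timeout_s):
        try:
            # Block for at most timeout_s seconds, never indefinitely.
            resource = self._idle.get(timeout=timeout_s)
        except queue.Empty:
            raise PoolExhaustedError(
                f"no resource free within {timeout_s}s"
            ) from None
        try:
            yield resource
        finally:
            self._idle.put(resource)  # always hand it back, even on error
```

A caller would write something like "with pool.acquire(timeout_s=0.2) as conn:" and treat PoolExhaustedError as a signal to shed load or show a default page.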
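Fact #6: Unbalanced capacities cause failures and scalability problems for applications.
If one tier can accept far more load than the tier behind it can serve, the weaker tier gets overwhelmed. Capacity and sizing should be planned end to end, so that every tier can handle what the tier in front of it can send.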
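A quick back-of-the-envelope check for a capacity mismatch uses Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The numbers below are invented for illustration and are not measurements from turkiye.gov.tr.

```python
def required_concurrency(requests_per_s, avg_latency_s):
    """Little's law: average requests in flight = arrival rate x time in system."""
    return requests_per_s * avg_latency_s

# All three values are assumptions chosen for the example.
front_end_rps = 2000       # peak requests per second the front end accepts
backend_latency_s = 0.25   # average time one backend call takes
backend_pool_size = 100    # connections available towards the backend

in_flight = required_concurrency(front_end_rps, backend_latency_s)
shortfall = in_flight - backend_pool_size

print(f"backend calls in flight: {in_flight:.0f}")
print(f"calls left waiting for a connection: {shortfall:.0f}")
```

With these assumed figures the front end keeps about 500 backend calls in flight while only 100 connections exist, so roughly 400 calls are waiting at any moment. That gap is the unbalanced capacity.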
Fact #7: Never trust code you have no control over or didn’t develop yourself, whether it is a third-party library or a remote system developed by someone else.
A downstream application that runs blocking code can take your application down with it.
Fact #8: Slow applications get more traffic.
When an application is slow, users hit the reload button or F5 again and again to get through, which generates even more traffic.
Fact #9: Fail fast all the time and retry gradually.
Exponential backoff, circuit breakers and timeouts should be embraced. User-friendly error messages or default responses should be presented to the user until stability is re-established.
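Retrying gradually can be sketched as exponential backoff with jitter. This is one common variant (a random wait up to a doubling, capped limit), and the function names and default values here are assumptions for the example. The sleep and rng parameters exist only so the behaviour can be tested without real waiting.

```python
import random
import time

def backoff_delays(base_s=0.5, cap_s=30.0, attempts=5):
    """Upper bound for each wait: base * 2^n, never more than cap_s."""
    return [min(cap_s, base_s * 2 ** n) for n in range(attempts)]

def retry(fn, attempts=5, base_s=0.5, cap_s=30.0,
          sleep=time.sleep, rng=random.random):
    """Call fn, waiting a random, growing amount of time between failures."""
    delays = backoff_delays(base_s, cap_s, attempts)
    for i, limit in enumerate(delays):
        try:
            return fn()
        except Exception:
            if i == len(delays) - 1:
                raise  # out of attempts: fail and let the caller show a fallback
            # Jitter: a random wait up to the limit, so clients do not
            # all retry at the same instant.
            sleep(rng() * limit)
```

With the defaults, the wait limits grow as 0.5, 1, 2, 4 and 8 seconds. The randomness matters as much as the doubling: without it, thousands of clients that failed together would come back together.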
Fact #10: Appreciate your hardware resources and utilize them wisely.
Don’t fall for “CPU and memory are cheap”. It is not true. Long-running CPU work causes contention, which slows down your application until it eventually fails. Paged-out or fragmented memory causes slow access times.
What happened reminds me of the dot-com bubble around 2000. Yahoo!, AltaVista, Lycos and many other internet giants of the time faced similar problems, and they responded by developing scalable platforms.