Even the biggest players in the payments industry are not immune to outages. In 2009, for example, PayPal suffered a roughly four-hour service outage, with an estimated loss of $28 million in commerce. Every payment service suffers outages at some point, and it is not uncommon for a service to suffer several in a single year.
The reasons behind outages vary, as does their scale. Many transaction processors have agreed maintenance windows, needed for service upgrades, network modifications, or internal testing. We call these "planned outages", and they are usually scheduled for the least exposed times, when the impact is minimal. It is a completely different story for "unplanned outages", which are the focus of this article.
From a technical point of view, we look for the outage culprit. Was the processing application performing poorly, was it a networking issue, or simply human error? Was the authorization service completely unavailable, or just not responding to a certain subset of transactions? These are important questions, but the most important ones are: how many customers were affected, and how does the outage affect our reputation with our customers at large? Are we still a reliable partner for them, or will they change providers when the opportunity arises?
Fully effective preventive measures are usually very resource-hungry, making them hard for management to justify. So how do you find the right balance between your service's robustness and its cost-effectiveness? There are several good metrics for calculating this risk, but experience shows that the best results come from framing it as the most direct hit to the business: the actual loss of customers.
Let's imagine a simplified payment switch providing an authorization service for a mid-sized chain of retailers. This fictional network of merchants generates about $50 in payments every second, or $3,000 every minute. It is a business rule that a retail service which honors its customers should keep its availability above 99.9% over a whole year, and we consider falling below this a showstopper, resulting in the loss of the customer. This gives us roughly 525 minutes (~9 hours) of allowed outage per year. While that doesn't sound hard to achieve, we still need to subtract the time needed for all planned outages. Taking even a conservative view here, given the downstream implications of PCI-DSS compliance, count on at least 2 hours per planned outage every quarter, and we are left with barely one hour of unplanned outage for the whole year!
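The downtime budget above can be verified with a few lines of arithmetic. The figures below (a 99.9% target, four quarterly two-hour planned windows) are this example's assumptions, not universal constants:

```python
# Yearly downtime budget for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    """Total allowed outage minutes per year at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

total_budget = downtime_budget_minutes(0.999)   # ~525.6 minutes (~9 hours)
planned = 4 * 120                               # four quarterly 2-hour planned windows
unplanned_budget = total_budget - planned       # ~45.6 minutes left; the worked
                                                # example rounds this to one hour

print(round(total_budget, 1), round(unplanned_budget, 1))
```

With a less conservative planned-outage estimate, the remaining unplanned budget grows accordingly; the one-hour figure used for the rest of the example is the rounded, conservative case.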
Given the $3,000-per-minute figure mentioned above, and an average processing fee of 0.2% per transaction (no forex), our project is worth a little over $3M per annum. And our sub-99.9% showstopper is just a one-hour outage away. Let's divide the whole project value by 60 minutes so we can better understand how much each minute matters. $50K per minute of full outage sounds alarming, doesn't it? But how many minutes of outage should we expect when evaluating the system's robustness and stability? There are a few ways to estimate this. The most trivial is to look into the service's past, if available, and take the value from its history. A more sophisticated way is to analyze and certify the service against the most common outage scenarios, such as a network line drop-out, a sudden primary server shutdown and data recovery, or an upstream authorization service disconnection.
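As a sanity check on these per-minute figures, here is a short sketch using the example's numbers ($3,000/minute in volume, a 0.2% fee, a 60-minute unplanned budget):

```python
# Annual project value, and the value "spent" by each minute of the outage budget.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600
volume_per_minute = 3_000          # USD of payments processed per minute
fee_rate = 0.002                   # 0.2% processing fee, no forex

annual_revenue = volume_per_minute * fee_rate * MINUTES_PER_YEAR  # ~$3.15M per annum
value_per_budget_minute = annual_revenue / 60                     # ~$52.5K per minute of the
                                                                  # 60-minute unplanned budget;
                                                                  # the text rounds this to $50K

print(f"${annual_revenue:,.0f} per annum, ${value_per_budget_minute:,.0f} per minute")
```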
An experienced production support team should be able to put together a good representation of such events and simulate them on available pre-production systems. Each of these scenarios gives a good overview of service stability, and analysts should be able to assign a reasonable yearly probability to each event. Getting back to our retail authorization service scenario: let's count 25 minutes per year for networking issues, a cumulative 70 minutes of transactions going unanswered at peak times, and 40 minutes for other hardware-related issues. This gives us a total of 135 minutes of predicted outage. It is obvious that spending the whole project revenue just to cut this value below 60 minutes would not be justifiable, but something has to be done. The following formula has been used with success on several real-world projects:
Take the predicted outage time exceeding the showstopper threshold, in minutes, divide it by the whole predicted outage time, and divide the result by a constant of 2. Multiply the result by 100 to get the percentage of your project value to invest in bringing your system's availability back above 99.9%. The graph illustrates this formula. Applying the calculation to our hypothetical example, we take the 75 minutes exceeding our limit and divide them by 135. After dividing the result by 2 and converting to a percentage, we get 27.8% of our project value, or roughly $833K of investment in our service stability.
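The formula above can be written as a small helper. The function name and the $3M project value are this example's assumptions:

```python
def stability_investment_fraction(predicted_outage_min: float,
                                  allowed_outage_min: float) -> float:
    """Fraction of project value to invest in robustness:
    (predicted - allowed) / predicted / 2."""
    excess = predicted_outage_min - allowed_outage_min  # minutes beyond the showstopper
    return excess / predicted_outage_min / 2

fraction = stability_investment_fraction(135, 60)  # 75 / 135 / 2, about 0.278
investment = fraction * 3_000_000                  # ~$833K on a $3M project

print(f"{fraction:.1%} of project value, ~${investment:,.0f}")
```

Note that the formula only applies when the predicted outage exceeds the allowed budget; if it does not, no remedial investment is indicated.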
Even though this calculated price may seem pretty high, bear in mind that in our case it represents factors such as missing network architecture redundancy, wrong system sizing, and a possibly invalid DR design. These three things are cornerstones of robust payment services, and they will need to be addressed to retain the customer's confidence and loyalty. The value also indicates that the initial system architecture was heavily neglected, and that testing was not carried out with a focus on the service's future sustainability. Finally, it is up to management to decide whether such an investment is justified by keeping the customer for the next several years, or whether the penalties for falling short of the SLA are the better alternative.
The total price of outage mitigation doesn't have to be so high. Understanding the needs and bottlenecks of your payment service may simply lead to cutting down planned outage times, leaving more room to deal with unexpected, unplanned outages. A best practice is to evolve and rehearse DR procedures under simulated production load, or simply to improve system monitoring alarms so that mitigation is triggered sooner. It is always better to discover an issue proactively; by the time the customer calls to report it, it is already too late.
In this article we showed how difficult it is to balance a robust, expensive payment system against the profit needs of the business, and we introduced one way to calculate and justify the investment in the customer's confidence. We also highlighted how important it is to test your service: knowing its limits well before they are reached is business-critical. Our customers trust us and our expertise, and we cannot afford to fail them even once.