Payment switches and the surrounding infrastructure are usually considered critical systems, where any failure involving data loss or downtime can result in lost revenue, lost customers, and lost sleep. Investments in monitoring and disaster recovery solutions act as insurance, allowing your team to minimise the impact of these failures and return to regular operation as quickly as possible.
The first step when designing your disaster recovery solution is to look at each part of the system and determine the level of data loss and outage that is acceptable to the organisation. This means deciding:
- How many seconds (or milliseconds) of data loss is acceptable.
- How many seconds (or milliseconds) of outage is acceptable.
- How much budget is available to spend on solutions.
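One way to make these decisions concrete is to record them per component and compare them against what an incident or a fail-over drill actually produced. The following is a minimal sketch under stated assumptions: the `RecoveryObjective` type, the component names, and all figures are hypothetical placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryObjective:
    """Acceptable loss and outage for one part of the system (hypothetical type)."""
    component: str
    max_data_loss_s: float  # acceptable data loss, in seconds
    max_outage_s: float     # acceptable outage, in seconds


# Placeholder figures for illustration only.
OBJECTIVES = [
    RecoveryObjective("authorisation", max_data_loss_s=0.0, max_outage_s=30.0),
    RecoveryObjective("reporting", max_data_loss_s=3600.0, max_outage_s=14400.0),
]


def breaches(objective, observed_data_loss_s, observed_outage_s):
    """Return the limits exceeded by an incident or a fail-over drill."""
    exceeded = []
    if observed_data_loss_s > objective.max_data_loss_s:
        exceeded.append("data loss")
    if observed_outage_s > objective.max_outage_s:
        exceeded.append("outage")
    return exceeded
```

Calling `breaches(OBJECTIVES[0], 0.0, 45.0)` reports that only the outage limit was exceeded, which shows where further budget would have to go.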
It's difficult to identify every way in which a complex system might fail, but most disaster scenarios fall into the following broad categories, all of which should be accounted for in your disaster planning. Note that these are ordered from most likely to least likely, and from lowest impact to highest impact:
- Software failure.
- Database corruption (software error, security breach, negligent staff).
- Failure of a single data centre (network outage, power loss, water damage).
- Failure of all data centres.
When planning for these disaster scenarios, the following solutions are normally required. This is where you must carefully weigh risks vs. costs, as the best solutions are also the most expensive.
- Pro-active monitoring.
- Application-level replication.
- Regular backups to a secure archive facility.
- Multiple active sites or fail-over (automatic if possible).
Pro-active monitoring is a crucial component in every disaster recovery solution. Your operations team will spend most of their time monitoring the system, ensuring everything is operating smoothly, and diagnosing failure scenarios as early as possible - ideally before they impact your business. Software solutions such as BP-Switch provide built-in monitoring functionality, exposing internal statistics such as memory usage, message queues, and processing throughput. This information is conveniently accessible through the administration interface, or can be pulled into a centralised monitoring tool using standard APIs.
In addition to real-time statistics, operating systems and payment switches produce events resulting in e-mails or SMS alerts being sent out, with identifiers showing the severity of the event along with a description of what happened and any additional context. These alerts can be sent directly to your team, or fed into a centralised monitoring tool for filtering, allowing operations staff to focus on relevant information and access it all in one place.
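The filtering step can be sketched in a few lines. This is an illustration only: the `Severity` levels, the `Alert` fields, and the `relevant` function are assumptions made for the example, and real switches and monitoring tools define their own severity identifiers.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; higher means more urgent."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass
class Alert:
    severity: Severity
    source: str       # e.g. "os" or "switch"
    description: str  # what happened, plus any additional context


def relevant(alerts, minimum=Severity.WARNING):
    """Keep alerts at or above `minimum`, most severe first."""
    kept = [alert for alert in alerts if alert.severity >= minimum]
    return sorted(kept, key=lambda alert: alert.severity, reverse=True)
```

With the default threshold, informational events are dropped and critical events are listed ahead of warnings, so operations staff see the most urgent items first.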
Dedicated monitoring software such as Nagios allows collection of a wide range of data from your infrastructure, providing a central location to check the status of server resources, service availability and network connectivity. These systems are very flexible, supporting monitoring protocols such as SNMP, event aggregation through email, and custom scripted probes and tests. We strongly recommend investigating your options and deploying a centralised monitoring solution. Thankfully this is a well-serviced market, with a number of respected commercial and open-source solutions available.
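As an example of a custom scripted probe, the sketch below follows the Nagios plugin convention of signalling status through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and printing a single status line with optional performance data after a `|`. The queue-depth check itself is hypothetical: how the depth is obtained from your switch is an assumption left outside the example, so the value is simply passed in as an argument.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_queue_depth(depth, warn, crit):
    """Map a queue depth to an (exit_code, status_line) pair."""
    if depth is None:
        return UNKNOWN, "QUEUE UNKNOWN - no reading available"
    perfdata = f"depth={depth};{warn};{crit}"
    if depth >= crit:
        return CRITICAL, f"QUEUE CRITICAL - {depth} messages waiting | {perfdata}"
    if depth >= warn:
        return WARNING, f"QUEUE WARNING - {depth} messages waiting | {perfdata}"
    return OK, f"QUEUE OK - {depth} messages waiting | {perfdata}"


def main(argv):
    """Expects three integers: DEPTH WARN CRIT (depth passed in for illustration)."""
    try:
        depth, warn, crit = (int(value) for value in argv)
    except ValueError:
        # Raised for a non-integer argument or the wrong number of arguments.
        print("QUEUE UNKNOWN - expected three integers: DEPTH WARN CRIT")
        return UNKNOWN
    code, line = check_queue_depth(depth, warn, crit)
    print(line)
    return code
```

To run this as a plugin, the script would finish with `sys.exit(main(sys.argv[1:]))` after an `import sys`, so that the monitoring tool receives the exit code.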
Hardware failure is often the root cause of expensive outages and data loss. Although a good monitoring solution can help to identify potential failures before they occur, disaster planning should account for unexpected hardware failure in production systems. Typically our customers refresh hardware on a three-year cycle to minimise the risk of failure. In theory, replacing hardware gets cheaper over time; in practice, as technology moves forward, it can be difficult to acquire matching components. We recommend looking at long-term support options from your supplier when commissioning a hardware platform.
Clear and concise operational procedures are imperative to ensure your system runs smoothly. Along with day-to-day maintenance of the system, these should include regular fail-overs for disaster recovery to ensure that backup hardware is running well and your restoration procedures work as expected. In our experience, operations departments that don't exercise their disaster plans lack confidence in their procedures. Where that is the case, it's critically important that the procedures are improved and then tested in production.
We typically treat a situation where all data centres are failing as an irreversible catastrophe, such as a widespread natural disaster or political upheaval. In these situations it's likely that your only course of action will be to rebuild the system from backups, and accept the risk of long outages or high levels of data loss. This kind of eventuality should be acknowledged in your disaster planning and backup procedures, and can be useful for framing the discussion about acceptable risk vs. available budget.
Payment switching software typically provides the following features in support of disaster recovery. These are listed in order of lowest-to-highest cost, and highest-to-lowest risk of outage and data loss.
- Database backup (cold standby).
- Database-level replication (warm standby a.k.a. Active::Passive).
- Application-level cross-communication (clustered a.k.a. Active::Active).
- In-step processing.
Talk to your switch provider for more information on the costs of these approaches in terms of additional hardware, communications infrastructure, and software licenses.