
Re-Thinking Resilience from a Technical Standpoint

By James Lee, DTCC Chief Technology Officer | October 29, 2019

(l. to r.) James Lee and Adrian Cockcroft.

How can companies protect against data failures and reduce their security vulnerabilities? The application of resilient principles to technology architecture, applications, infrastructure and data were among the topics I discussed with Adrian Cockcroft, Vice President of Cloud Architecture Strategy, Amazon Web Services, at the recent 2019 DTCC Annual Client Risk Forum, where we also shared our perspectives on how businesses can protect themselves from broad-based disruptions.

Evolution of Threat Factors

Businesses have seen an evolution of threat factors that go beyond physical events to cyber-based disruptions. How do we adapt to this evolution?

In the past, threat protection was limited to scenarios we could envision. But in today’s complex environment, we know that is simply not enough. The evolution of technology has changed the way we think about protecting our businesses and has created opportunities to rethink our traditional architecture, current infrastructure and data backups. Progressive companies, like Amazon and Netflix, have already started to do so.

Supporting Businesses in a Shifting Landscape

These progressive companies are looking at the concepts of redundancy and recoverability in a different way. During my discussion with Adrian, he explained that the difference is a fundamental change: the cost of change has decreased while the pace of change has increased. Combined, those factors have made it easier to roll out new applications. In the past, a new deployment might take several months and involve an expensive and elaborate process. Today, we can split these projects into small incremental units, and if something goes wrong, each step can be examined individually.

By breaking these updates down into small units, the risk of each individual change becomes extremely low. We can now pinpoint issues faster and fix them faster, without shutting down an entire operation.
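The idea of splitting a rollout into small, individually verified units can be sketched roughly as follows. This is a minimal illustration, not a real deployment API; `apply_step` and `health_check` are hypothetical placeholders:

```python
# Minimal sketch: roll out a release as small, independently verified steps.
# Each step is checked before the next begins, so a failure is isolated to
# one small change instead of an entire monolithic deployment.

def apply_step(step):
    """Hypothetical placeholder for deploying one incremental change."""
    print(f"deploying {step}")

def health_check(step):
    """Hypothetical placeholder for verifying the system after a change."""
    return step != "bad-config"  # simulate one failing step

def incremental_rollout(steps):
    """Apply steps one at a time; return the first failing step, if any."""
    for step in steps:
        apply_step(step)
        if not health_check(step):
            print(f"rolling back {step}")
            return step  # the problem is pinpointed to this single unit
    return None

failed = incremental_rollout(["schema-migration", "bad-config", "new-ui"])
print(f"failed step: {failed}")
```

Because the rollout stops at the first failing unit, the blast radius of any one change stays small and the faulty step is identified immediately.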

Rethinking Traditional Models

How can we rethink a product lifecycle in the context of small changes? Adrian pointed out that the automation the cloud provides allows users to view their environment and know exactly what state it is in. Whereas in the past it was difficult to pinpoint what caused a failure, you can now look at the data environment with full knowledge, including what changes were made and when they were made. It allows you to query whether the data is safe and to identify any vulnerabilities.

So, in addition to increasing the rate of change, we are also able to document with confidence any changes that have been made. This level of confidence enables continuous compliance, which can then lead to continuous security resilience - the ability to detect and patch in a continuous manner and limit exposure to security vulnerabilities.
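The notion of a fully documented, queryable environment can be sketched as a simple change log. This is purely illustrative; the record fields are assumptions, not any specific cloud provider's configuration-history API:

```python
# Minimal sketch of a queryable change record: knowing exactly what changed,
# when, and by whom is what makes continuous compliance possible.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChangeRecord:
    """One recorded change to the environment (illustrative fields)."""
    component: str
    change: str
    timestamp: datetime
    author: str

# In practice this would be the cloud provider's configuration history
# or an infrastructure-as-code state store, not an in-memory list.
log = [
    ChangeRecord("firewall", "opened port 443", datetime(2019, 10, 1, 9, 0), "ops"),
    ChangeRecord("database", "rotated credentials", datetime(2019, 10, 2, 14, 30), "sec"),
]

def changes_to(component):
    """Query exactly which changes touched a component, and when."""
    return [r for r in log if r.component == component]

for record in changes_to("firewall"):
    print(record.change, record.timestamp.isoformat())
```

An auditor - or an automated compliance check - can then answer "what changed, and when?" at any time, instead of reconstructing the answer after a failure.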

Continuous Resilience

With these innovations we can perform operations in the cloud that were never done before, at a scale we never imagined. And how are progressive companies leveraging these new capabilities?

To build a very resilient system, you need to have multiple controls. Redundancy is necessary. For example, a typical backup data center will practice a test failover about once a year. This will undoubtedly be a major exercise, where a few applications will be tested at a time, but at no point will the whole system be tested. There is usually not a plan in place to test an entire backup data center in a larger-scale failure.

What we are seeing at leading organizations, such as Amazon and Netflix, is chaos engineering, or continuous resilience, where failures are deliberately introduced to test these firms’ failover systems. The cloud is used to build automation so that failover is tested continuously. Netflix operates out of three regions, and about every two weeks it completely evacuates a region to test failover. It has improved to the point that the evacuation completes in six minutes. While this is a best-case scenario, it shows it can be done.
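The core move in chaos engineering - injecting a fault on purpose so the failover path is exercised routinely rather than once a year - can be sketched as follows. All names here are illustrative, not a real chaos-engineering framework:

```python
# Minimal chaos-engineering sketch: deliberately inject a failure into the
# primary path so the failover path is exercised continuously, rather than
# only during an annual data-center test.

def backup_region():
    return "served from backup region"

def call_with_failover(inject_failure):
    """Serve from the primary region, failing over to the backup on error."""
    try:
        if inject_failure:
            # A chaos experiment raises this fault on purpose.
            raise ConnectionError("injected failure")
        return "served from primary region"
    except ConnectionError:
        return backup_region()

print(call_with_failover(inject_failure=True))   # exercises the failover path
print(call_with_failover(inject_failure=False))  # exercises the normal path
```

Because the injected failure runs in the same code path as a real outage, the failover logic is verified constantly instead of remaining the least-tested part of the system.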

So how can we take these practices a step further and build patterns for continuous resilience in financial services and other industries? We need to introduce failures to test these capabilities. As we move backup data centers to the cloud, we need to structure these systems in a way that effectively and actively tests a firm’s architecture.

It is a fundamental departure from how traditional businesses typically view technology - to design for failure as opposed to designing not to fail. Large-scale failures expose the least well-tested parts of an organization, and those are exactly what need testing. At DTCC, we’re re-thinking resilience from a technical standpoint. As the complexity of the financial services industry continues to grow, continuous resilience will be an ongoing dialogue.