Mission Critical: Designing Highly Resilient Financial Services Applications

By Jeffrey Quinn, DTCC Executive Director, IT Architecture | 5 minute read | November 20, 2023

Across the financial services industry, the technology landscape continues to evolve, with firms accelerating their cloud adoption efforts to capitalize on new business opportunities. DTCC has been on a cloud journey for over a decade as part of its modernization effort and, like most firms, has a hybrid platform strategy composed of both public and private cloud technology. By refining and enhancing public cloud capabilities for the best-fit business cases, DTCC will deliver value for its clients and industry partners.

Recently, DTCC worked with Amazon Web Services (AWS) to create a public cloud prototype to enhance multi-region resiliency. The prototype consisted of two reference applications that simulate a trade matching and settlement process. The applications were implemented using AWS features following a highly resilient design that allowed each application to rotate and recover across regions independently of one another. This fully functional prototype focused on applying DTCC resilience principles to simulated mission critical applications, resulting in architectural guidance to help clients increase the resilience of their public cloud solutions.

You can read more about the prototype in this recent technical paper, jointly published by DTCC and AWS. Read below for a few high-level points from the paper:

Resilience: A Top Priority for the Industry
With many firms on technology modernization journeys, leveraging public cloud services to enable rapid time to market, operational efficiencies and reduced risk is a strong option. As firms consider third-party cloud services, it’s imperative that application resilience is maintained and, where possible, exceeds what firms can achieve when hosting critical business applications in their own data centers.

At DTCC, we firmly believe that resilience must continue as an industry-wide strategic business effort to ensure the continued safety and soundness of the financial industry. Firms’ business resilience strategies – including roles, processes, architectures, and engineering – must be continually re-evaluated as both the technology landscape and the risks firms face keep evolving. Technology advancements, modern application designs and hybrid cloud solutions all contribute to the industry ecosystem. That is why research, development, and planning with industry providers and regulators are key to influencing and enhancing resiliency across the financial markets.

As business systems and processes are modernized, it’s important that resilient designs are application-centric, with the goal of providing granular protection and recovery solutions. This is especially true when designing public cloud applications that no longer rely on the traditional infrastructure-based capabilities found within private data centers.

In this paper, together with AWS, we:

  • Demonstrated a reliable, repeatable multi-region recovery solution for applications running in AWS.
  • Embedded resiliency into applications through the development and consumption of reusable AWS components and capabilities.
  • Created reusable architecture solutions based on software assets that demonstrate resiliency requirements with a focus on recovery time and recovery point objectives.
  • Delivered a fully automated solution that coordinates multiple systems and applications (also called orchestration) to solve for planned events (scheduled rotation) and unplanned events (disaster recovery).
  • Proved a recovery time objective (RTO) of under 30 minutes and a recovery point objective (RPO) of less than five seconds, which meet DTCC’s out-of-region recovery and resumption requirements. The team developed a reconciliation and replay process that was able to recover all data lost within and between applications following a simulated failure event.
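
The reconciliation and replay idea in the last point can be illustrated with a minimal sketch. This is not the code from the DTCC and AWS prototype; it only assumes that the sending application keeps every outbound message in a durable safe store with a monotonically increasing sequence number, and that the receiving application in the recovery region can report which sequence numbers it has already applied. The names `Message`, `Downstream` and `reconcile_and_replay` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    seq: int       # sequence number assigned by the sender (assumed to start at 1)
    payload: str


@dataclass
class Downstream:
    """Hypothetical receiving application as seen in the recovery region."""
    applied: dict[int, str] = field(default_factory=dict)

    def checkpoint(self) -> int:
        # Highest sequence number with no gap below it.
        seq = 0
        while seq + 1 in self.applied:
            seq += 1
        return seq

    def apply(self, msg: Message) -> None:
        # Idempotent: applying the same message twice has no further effect.
        self.applied.setdefault(msg.seq, msg.payload)


def reconcile_and_replay(safe_store: list[Message], downstream: Downstream) -> list[int]:
    """Replay every safe-store message the downstream application is missing.

    Returns the sequence numbers that were replayed, in order.
    """
    start = downstream.checkpoint()
    replayed = []
    for msg in sorted(safe_store, key=lambda m: m.seq):
        if msg.seq <= start or msg.seq in downstream.applied:
            continue  # already applied before the failure
        downstream.apply(msg)
        replayed.append(msg.seq)
    return replayed
```

Because `apply` is idempotent, the replay can be run again after a partial failure without duplicating work, which is one way to keep a recovery step repeatable.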

Lessons Learned
Delivering a public cloud application design that solves for unplanned and planned events requires careful consideration of the RPO and RTO.

  • Assess design considerations: Firms may consider a Hot/Warm Application Model to address RTOs, with a global traffic manager that redirects transactions to the active region. To address RPOs, firms should design for safe stores and checkpoints that allow for data reconciliation and replay.
  • Ensure application production readiness: Public cloud applications should be able to determine their operating state following any event, planned or unplanned.
  • Look at platforms holistically: Firms must consider the entire technology stack and identify platform capabilities that can improve application resilience. This includes services that have built-in and automated recovery features, and services that offer cross-region replication to help mitigate data loss scenarios.
  • Provide critical non-negotiable capabilities:
    • Automation & Orchestration: Develop runbooks that automate and orchestrate all application recovery plans.
    • Testing: Execute failure-mode analysis and resiliency testing prior to production.
    • Monitoring: Build observability into the solution to deliver end-to-end visibility of health across regions.
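
As a rough illustration of the automation and orchestration point, the sketch below runs an ordered list of recovery steps and reports whether the whole run finished inside a target RTO. It is a simplified, hypothetical runner, not the orchestration used in the prototype; the step names in the example and the `run_runbook` and `RunbookResult` names are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunbookResult:
    completed: list                 # names of steps that succeeded, in order
    failed_step: Optional[str]      # first step that reported failure, if any
    elapsed_seconds: float
    within_rto: bool                # True only if every step passed in time


def run_runbook(steps, rto_seconds: float,
                clock: Callable[[], float] = time.monotonic) -> RunbookResult:
    """Run (name, action) steps in order and stop at the first failure.

    Each action is a zero-argument callable returning True on success.
    """
    started = clock()
    completed = []
    failed = None
    for name, action in steps:
        if not action():
            failed = name
            break
        completed.append(name)
    elapsed = clock() - started
    return RunbookResult(completed, failed, elapsed,
                         failed is None and elapsed <= rto_seconds)


if __name__ == "__main__":
    # Hypothetical steps for a hot/warm rotation; each lambda stands in
    # for a real, verifiable action against the platform.
    steps = [
        ("determine_operating_state", lambda: True),
        ("promote_warm_region", lambda: True),
        ("reconcile_and_replay", lambda: True),
        ("redirect_traffic", lambda: True),
    ]
    print(run_runbook(steps, rto_seconds=30 * 60))
```

Injecting the clock keeps the runner testable, so the same runbook can be exercised in regular drills as well as in a real event.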

The Bottom Line
As firms consider the evolution of resiliency within their organizations, they should first ensure that comprehensive, repeatable, automated and verifiable recovery procedures are in place, so they can be confident that applications are indeed resilient. Second, firms must practice, practice, practice. Having fully automated runbooks will enable firms to regularly execute recovery exercises and/or rotation events, building maturity and confidence with decision makers. This means that in a real-life scenario, making the go/no-go decision to push the recovery button is much smoother and can lead to a shorter recovery timeframe.

What’s Next
As a critical infrastructure and service provider for the global capital markets, DTCC takes a resilience-first approach to technology principles and architectural design concepts. The resiliency journey is one that requires collaboration, especially as the industry landscape continues to rapidly evolve and become more complex. This prototype provides a way forward that firms can leverage in their own resiliency planning journey. The open-source code and documentation from the prototype are publicly available for firms to install. Industry technology teams are encouraged to use these artifacts as a baseline when designing their own critical application solutions. Moving forward, DTCC will continue to collaborate with AWS, advancing the conversation around public cloud adoption and creating solutions that deliver value for the industry and protect the capital markets.
