Cloud Native Architecture in Practice (2020-01)

This post was supposed to be published on the Google Cloud blog as a sequel to "5 Principles of Cloud Native Architecture" in January 2020. However, I left Google at that point, and despite a bunch of above-and-beyond efforts by the blog team, it never saw the light of day... I'm publishing it here in case it is still useful, but be aware that 1) it's now very dated, and 2) it does not necessarily represent Google's current thinking.


There’s no question that architecting a cloud-native application looks and feels different from architecting one to run on traditional on-premises infrastructure. In a previous post, we looked at five guiding principles for adapting architectures to the cloud—high-level directives like designing for automation and practicing defense-in-depth—but what underlying technology and practices do you need to support that approach?

At Google Cloud, we offer a wide range of infrastructure components and services that can support cloud-native architecture—a cloud-native stack, as it were. In this post, we’ll revisit our cloud-native architecture principles and look at the practical capabilities (both technical capabilities and skills) that organizations need to develop in order to deliver and operate cloud-native applications. Then we’ll look at how the products offered in Google Cloud support those capabilities, and how this stack can make it easier to build applications in a cloud-native way.

Cloud-native architecture principles: a recap

Developing applications in a cloud-native way differs in a few fundamental ways from developing traditional applications, and the principles that guide it bear repeating. To recap, they are:

  • Design for automation: Design your application and infrastructure stack to be easily automated. Take advantage of how readily cloud infrastructure can be automated to reduce manual work and potential human error.

  • Be smart with state: Cloud-native applications are typically horizontally scaled, and use stateless components where possible to simplify scaling and resilience. Therefore centralize stateful data wherever possible, and use a data store designed for use in a distributed system.

  • Favor managed services: The managed services available in Google Cloud offer huge advantages in terms of resilience, time-to-market and reduced operational overhead. Many organizations are concerned about lock-in, but open standards mitigate these concerns, and even when they don't, the benefits often outweigh the costs.

  • Practice defense-in-depth: Build resilient and secure applications by assuming that every component exists in a low-trust environment. Each component should therefore assume that every other component could be compromised, and take appropriate measures to protect itself.

  • Always be architecting: Avoid thinking about your architecture as a 'project' which will be 'finished' and then maintained. Cloud-native architectures take a 'platform' approach, and are constantly being refined, evolved and integrated in new ways as the business needs change.

The cloud-native stack

A common cloud-native stack, in turn, delivers several capabilities that you wouldn’t typically find in a traditional environment. The diagram below shows the conceptual architecture of a cloud-native organization, with the common capabilities illustrated by the blue boxes.

Of course, every organization is different, but the intention is to show the capabilities typically found in an organization that is using a 'cloud-native' approach to IT. Most organizations will have most of these capabilities available—although not necessarily all of them. In our experience, most organizations will have a stack like this per project/team, often largely isolated from other systems; a few highly integrated organizations have a single stack that covers all the applications deployed in production.

In the sections below, we explore each of these capabilities, and the ways in which this cloud-native stack supports cloud-native architecture.

Storage

Most traditional architectures will centralize data, usually in a Relational Database Management System (RDBMS). But Principle 2 (Be smart with state) implies that cloud-native architectures should centralize state. When we talk about 'state', in turn, we can be referring to several distinct types of data:

  • Business data: The data that represents the business function the system is performing, typically entered by humans at some stage, but sometimes generated by other systems. Examples might be user profiles, shopping baskets, factory sensor readings, account balances, etc.

  • Reference data: The data that represents the context of the business function the system is performing, typically centrally managed by the organization. Examples might be product catalogues, factory sensor locations, account types, etc.

  • Operational data: Data generated by the system itself as it performs the business function, and captured either for future use or for operational reporting purposes. Examples might be transaction logs, records of discarded shopping baskets, logs of rejected orders, etc.

  • Transient system data: Data generated by the system during its own functioning and stored, usually transiently, to allow the system to function. Examples might be which backend is handling a particular request, items awaiting processing in a message queue, the number of copies of a module deployed, etc.

  • System configuration data: Data about the system and how it should function, which is usually managed by the team running the system. For instance, the maximum number of copies of the application to deploy, blacklisted IP addresses, administrator accounts, etc.

In general, traditional architectures centralize only the first of the types of data listed above—the further down the list, the less likely the data is to be stored centrally, and the more likely it is to reside on each individual component itself.

Cloud-native architectures, in contrast, typically centralize as much state as possible, often all the items on the list. To facilitate this, a cloud-native architecture will often use polyglot persistence, matching the resilience, latency, volume and value requirements of each type of data to the right type of storage. For example, business data might be stored in an ACID-compliant relational data store, but operational data might be stored in a column store for easy analytics; transient data might be stored in a key-value store for fast storage and retrieval; and system configuration might be stored in a version management system, as promoted by the GitOps approach.

To support this approach, Google Cloud offers a wide range of data storage options, each of which gives you different capabilities and constraints, but all of which offer a managed way to centralize state. By placing your data in a managed data storage service, you reduce the burden of operating such a store, and you get access to storage systems designed to work in a distributed way (Principle 3 - Favor managed services). As standalone back-end services, these services offer their own authentication and authorization mechanisms, and are designed to handle load (and overloading) intelligently (Principle 4 - Practice defense in depth).
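To make this concrete, here's a minimal sketch of polyglot persistence in Python. It assumes a Firestore database for business data and a Memorystore (Redis) instance for transient system data; the project ID, Redis address and data shapes are purely illustrative:

```python
# A sketch of polyglot persistence: durable business data in Firestore,
# transient system data in a Redis-compatible cache. All names are assumed.
from google.cloud import firestore
import redis

db = firestore.Client(project="my-project")        # assumed project ID
cache = redis.Redis(host="10.0.0.3", port=6379)    # assumed Memorystore address

def save_basket(user_id: str, basket: dict) -> None:
    # Business data: durable, queryable, centrally managed.
    db.collection("baskets").document(user_id).set(basket)

def record_session_backend(session_id: str, backend: str) -> None:
    # Transient system data: fast key-value storage with a short TTL.
    cache.set(f"session:{session_id}", backend, ex=300)
```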

Serverless and containers

Serverless and containers are the two most common execution environments for cloud-native systems. These technologies offer complementary (and overlapping) benefits: serverless environments allow you to focus on the code itself and leave almost all of the management to the platform provider (e.g. Cloud Functions). Containers are a more general execution environment, giving you more flexibility and control, at the expense of having to do more operational management.

Traditional architectures use virtual machines (VMs) as their execution environment. However, cloud-native principles emphasize a more dynamic footprint, with small, preferably immutable, units of deployment. VMs are often too 'heavy' for such an architecture: they are too slow to spin up, too large to store as images, and come in relatively coarse-grained sizes. Containers and serverless environments offer smaller deployment units, making it easier to automate (Principle 1 - Design for automation), and also favor stateless or immutable design, helping to centralize state (Principle 2 - Be smart with state).

Google Cloud offers native hosting of containers through the Google Kubernetes Engine (GKE) service, which uses Kubernetes to abstract away the underlying platform, and make it easier to manage your application while still having a lot of control. The Anthos platform extends this capability to other environments outside of Google Cloud, such as on-premises.

Cloud Functions completely abstracts the application itself, allowing developers to create only a 'function' and have the platform invoke it in response to external events like changes in state or incoming user requests (Principle 3 - Favor managed services). Another option to consider is Cloud Run, which sits somewhere in the middle, offering a lot of the flexibility of GKE, while removing some of the operational burden.
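As a small illustration, an entire HTTP-triggered Cloud Function in Python can be just a function; the runtime passes in a Flask request object and handles everything else. The function and field names here are illustrative:

```python
# A minimal HTTP-triggered Cloud Function: the platform invokes this per
# request and scales instances automatically; there is no server to manage.
def handle_order(request):
    """Entry point; 'request' is a Flask request supplied by the runtime."""
    order = request.get_json(silent=True) or {}
    item = order.get("item", "unknown")
    # A real function would write the order to a centralized data store here.
    return f"Order received for {item}", 200
```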

Microservices

Microservices are a natural way to expose the functionality running in the containers and serverless environments described above. Done well, microservices are loosely-coupled and re-usable, as well as being easy to independently maintain, scale, and diagnose. The loose-coupling and re-usability allow the architecture to evolve and use existing functionality in new ways (Principle 5 - Always be architecting). The loose-coupling and well-defined interface also allow security protections like authentication and graceful degradation to be applied individually to each service (Principle 4 - Practice defense in depth).
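To illustrate the shape of such a service, here's a minimal stateless microservice sketch in Python (the Flask framework, endpoint paths and response fields are illustrative assumptions, with any real state living in a centralized store):

```python
# A stateless microservice sketch: any number of identical copies can be
# run, scaled and replaced independently, because no state lives locally.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v1/prices/<product_id>")
def get_price(product_id):
    # A versioned, well-defined interface keeps callers loosely coupled;
    # a real implementation would look the price up in a managed store.
    return jsonify({"product_id": product_id, "price": 9.99})

@app.route("/healthz")
def health():
    # Health endpoints let the platform detect and replace failing copies.
    return "ok", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```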

Integration and APIs

As the number of microservices in your organization grows, you’ll need to understand and control how they interact, the failure modes that may arise as a result, and ways to recover gracefully when failures do arise. Istio provides an open and container-native way to manage, visualize and control the interactions between these services as a so-called service mesh.

Today, Traffic Director offers a fully-managed control plane, using the same proxy technology as Istio. There’s also Anthos Service Mesh, which extends Traffic Director with Istio compatibility.

Pub/Sub provides an alternative pattern for service integration. It offers a message queuing framework that allows decoupling and scaling of services by making calls asynchronous.
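Publishing to a topic is only a few lines with the Pub/Sub client library; the producer returns immediately, and subscribers consume messages at their own pace. A minimal Python sketch, with the project and topic names assumed:

```python
# Asynchronous integration via Pub/Sub: the publisher does not wait for
# any consumer, decoupling the two sides of the interaction.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "orders")  # assumed names

# publish() is asynchronous; result() blocks until Pub/Sub accepts the message.
future = publisher.publish(topic_path, data=b'{"order_id": "1234"}')
print("Published message", future.result())
```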

Increasingly, interactions with business partners are being done electronically. As your organization develops a rich ecosystem of microservices, you might want to make more of those available to your business partners, or even to end-user developers. The Apigee platform provides an enterprise-grade set of tools to manage the safe and secure sharing of APIs with external partners (Principle 4 - Practice defense in depth). The same approach and toolset can also be used to integrate with existing infrastructure and applications on-premises, or in different cloud providers. Again, Apigee provides a secure intermediate layer with the possibility of format and protocol translation, as well as rate-limiting and authentication.

Infrastructure-as-Code

As described in the previous installment of this series, most traditional architectures are optimized for a fixed, vertically-scaled hardware footprint. But because cloud hardware is fully virtualized, the provisioning and decommissioning of hardware can be done in seconds, via API calls. This naturally lends itself to automation, and the logical extension of this is Infrastructure-as-Code (Principle 1 - Design for automation). In an Infrastructure-as-Code approach, the infrastructure footprint is defined by a set of code, which is then managed like any other code (stored in a version control system, subjected to unit tests, reviewed, and deployed automatically). This gives the organization the ability to recreate the infrastructure reliably and on-demand. This also makes it easier to make reliable and auditable changes to that infrastructure as the needs of the business change (Principle 5 - Always be architecting).

While Infrastructure-as-Code is typically used to define the initial configuration of the system, it's also very common for cloud systems to modify their hardware stack dynamically to respond to changes in conditions, for instance adding more nodes to a cluster to handle an increase in load. In most cases, this is managed by using Infrastructure-as-Code to configure templates and pools, which are then managed dynamically when the system is running. For instance, in Google Cloud, you can use managed instance groups.

Google Cloud supports Infrastructure-as-Code by offering Cloud Deployment Manager, which allows you to manage other resources in Google Cloud using templates written in YAML, Python and Jinja2. Google Cloud also supports open source options like Terraform.
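As a sketch of what this looks like in practice, here is a minimal Deployment Manager template in Python defining an instance template and a managed instance group (all names, the zone and the sizing are illustrative). The GenerateConfig(context) entry point returns the resources to create, and the file lives in version control like any other code:

```python
# A minimal Deployment Manager template: infrastructure defined as code,
# reviewed and deployed like any other code. Names and sizes are assumed.
def GenerateConfig(context):
    resources = [{
        "name": "web-template",
        "type": "compute.v1.instanceTemplate",
        "properties": {
            "properties": {
                "machineType": "n1-standard-1",
                "disks": [{
                    "boot": True,
                    "autoDelete": True,
                    "initializeParams": {
                        "sourceImage": "projects/debian-cloud/global/images/family/debian-9",
                    },
                }],
                "networkInterfaces": [{"network": "global/networks/default"}],
            },
        },
    }, {
        # The managed instance group gives the platform a pool it can
        # resize dynamically at runtime, as described above.
        "name": "web-group",
        "type": "compute.v1.instanceGroupManager",
        "properties": {
            "zone": "us-central1-a",
            "targetSize": 3,
            "baseInstanceName": "web",
            "instanceTemplate": "$(ref.web-template.selfLink)",
        },
    }]
    return {"resources": resources}
```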

Configuration and policy management

Just as Infrastructure-as-Code uses code management processes to define and manage infrastructure, more and more organizations are applying the same practices to the configuration of the system (Principle 1 - Design for automation), especially configuration that’s related to security and policy. Applying these practices to security makes particular sense, since review, repeatability and auditability matter most there. The presence of agreed-upon, checked-in, immutable and machine-readable definitions for your security policy and configuration also makes it easy to perform regular automated checks of your production environment, so you can flag and revert any changes or deviations from that policy. This means that any change made to production outside of the approved process (be that through sloppiness or malice) should be found quickly and automatically reverted to an agreed secure state (Principle 4 - Practice defense in depth).
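The automated checking itself can be very simple. Here is a Python sketch that compares a declared, checked-in policy against the live state; fetch_live_firewall_rules is a hypothetical helper standing in for whatever cloud API you read the deployed state from:

```python
# A sketch of automated policy checking: compare the reviewed policy in the
# repository against production, and flag any drift for reversion.
import json

def load_declared_policy(path="policy/firewall.json"):
    # The agreed-upon, machine-readable policy from version control.
    with open(path) as f:
        return {rule["name"]: rule for rule in json.load(f)}

def check_drift(fetch_live_firewall_rules):
    declared = load_declared_policy()
    live = {rule["name"]: rule for rule in fetch_live_firewall_rules()}
    # Anything in production that is absent from, or different to, the
    # declared policy was changed outside the approved process.
    return [name for name, rule in live.items() if declared.get(name) != rule]
```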

Extending this further, some organizations are now attempting to completely centralize the process of managing their full stack (from infrastructure, to configuration, to custom software) in their code repositories, with the same deployment and quality-assurance processes used for all these artifacts. Pushing a change to the code repository (typically Git) therefore becomes the only way to modify production. This approach is being referred to as 'GitOps'.

Continuous integration and delivery

Continuous integration is the practice of regularly building and integrating code from different developers, and has come to imply the practice of maintaining a pipeline that can accept code from a code repository, automatically build that code, and automatically run a full set of unit and system tests on the resulting artifacts (Principle 1 - Design for automation). The benefits of such a system are well-recognized, but crudely put, it ensures that there is always an up-to-date, tested and reliable version of the application available to deploy.

Continuous delivery is the practice of regularly taking the build artifacts from a repository, automatically performing further tests on them, and automatically deploying them to production, often with a phased roll-out, and sometimes automated rollback. This has obvious benefits for agility, since changes can be rapidly built, tested and deployed (Principle 5 - Always be architecting), but also for reliability and security, since the configuration of the system is well-understood and repeatable (Principle 4 - Practice defense in depth).
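The rollback decision in a phased roll-out often reduces to comparing the canary's health against the stable baseline. A minimal sketch, assuming metrics collection and the actual rollback are handled by your delivery tooling:

```python
# Automated rollback logic for a phased roll-out: promote the new version
# only if its error rate stays within a tolerance of the stable baseline.
def canary_verdict(baseline_error_rate: float,
                   canary_error_rate: float,
                   tolerance: float = 0.01) -> str:
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"  # measurably worse than the current version
    return "promote"       # safe to continue the roll-out

# Example: a 0.2% baseline against a 3% canary error rate means rollback.
print(canary_verdict(0.002, 0.03))
```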

As well as supporting open source options like Jenkins, Google Cloud offers Cloud Build for continuous integration, and managed Spinnaker for continuous delivery.

Monitoring and operations

Like any system, cloud-native systems need to be monitored and maintained. At Google, Site Reliability Engineering drives a culture of continuous automation and improvement. Whether Site Reliability Engineering specifically or another form of DevOps is adopted, cloud-native organizations regard their systems as living and evolving, and look to constantly improve them (Principle 5 - Always be architecting).

Google Cloud has an integrated monitoring and operations toolset, which increasingly supports Site Reliability Engineering principles out-of-the-box if you want to adopt them.
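One SRE idea that translates directly into code is the error budget: an SLO target implies a budget of allowed failures, and monitoring data tells you how much of it you've spent. A minimal sketch, with illustrative numbers:

```python
# Error-budget arithmetic: a 99.9% SLO over 1,000,000 requests allows
# 1,000 failures; 250 failures so far leaves 75% of the budget unspent.
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    budget = (1 - slo_target) * total_requests  # allowed failures
    return (budget - failed_requests) / budget  # negative means SLO breached

print(error_budget_remaining(0.999, 1_000_000, 250))  # 0.75
```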

Analytics and machine learning

A well-architected cloud-native application generates a wealth of data. You can use that data in many ways: to understand what is happening in your environment (Principle 4 - Practice defense in depth), but also to improve your platform for your customers and your organization (Principle 5 - Always be architecting). Google Cloud offers a range of products to help you process and analyze data at scale, and because running data analytics at scale is hard, most of them are offered as fully-managed services (Principle 3 - Favor managed services).
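For example, querying operational data at scale with the BigQuery client library takes only a few lines of Python; the project, dataset and table names here are assumptions for illustration:

```python
# Analytics over operational data with BigQuery: fully managed, so there
# is no cluster to size or operate. Table and project names are assumed.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
query = """
    SELECT status, COUNT(*) AS orders
    FROM `my-project.ops.order_log`
    GROUP BY status
"""
for row in client.query(query).result():
    print(row.status, row.orders)
```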

Increasingly, organizations are exploring machine learning as a way to automate routine tasks (Principle 1 - Design for automation). Google Cloud therefore makes it easy to pipe data into a range of products that allow you to use existing models easily (Principle 3 - Favor managed services), or allow you to build your own solutions from scratch.
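Using an existing, pre-trained model can be as simple as one API call. A minimal sketch using the Cloud Vision API's label detection (the image file is an assumed local example):

```python
# Calling a pre-trained model via the Cloud Vision API, rather than
# building and operating a model from scratch.
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open("factory_part.jpg", "rb") as f:  # assumed local image
    image = vision.Image(content=f.read())

response = client.label_detection(image=image)
for label in response.label_annotations:
    print(label.description, label.score)
```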

Putting the cloud-native stack to work

Cloud is a broad and expanding topic, touching on many diverse technologies, techniques and practices. In our earlier blog post, we explained why it's important to change how we architect solutions to take into account the non-functional capabilities and constraints of the cloud. We also set forth five principles for cloud-native architecture to help steer your thinking. Hopefully this post, and the cloud-native stack it describes, will help you to contextualize these different elements and see how you can incorporate them into your everyday role, so you can help your organization be agile and successful.

For further reading, check out Patterns for scalable and resilient apps, which looks at many of the same topics in more depth.