Towards Modern Development of Cloud Applications

A translated walkthrough of the cloud applications paper

A note up front: this article interleaves two kinds of content:

quotations from the original paper

the author's comments on them

The paper is quoted in full below:

Via an internal survey of various infrastructure teams, we have found that most developers split their applications into multiple binaries for one of the following reasons: (1) It improves performance. Separate binaries can be scaled independently, leading to better resource utilization. (2) It improves fault tolerance. A crash in one microservice doesn't bring down other microservices, limiting the blast radius of bugs. (3) It improves abstraction boundaries. Microservices require clear and explicit APIs, and the chance of code entanglement is severely minimized. (4) It allows for flexible rollouts. Different binaries can be released at different rates, leading to more agile code upgrades.


The above can be read as the benefits of developing in the microservices style.

However, splitting applications into independently deployable microservices is not without its challenges, some of which directly contradict the benefits.

  • C1: It hurts performance. The overhead of serializing data and sending it across the network is increasingly becoming a bottleneck [72]. When developers over-split their applications, these overheads compound [55].
  • C2: It hurts correctness. It is extremely challenging to reason about the interactions between every deployed version of every microservice. In a case study of over 100 catastrophic failures of eight widely used systems, two-thirds of failures were caused by the interactions between multiple versions of a system [78].
  • C3: It is hard to manage. Rather than having a single binary to build, test, and deploy, developers have to manage 𝑛 different binaries, each on their own release schedule. Running end-to-end tests with a local instance of the application becomes an engineering feat.
  • C4: It freezes APIs. Once a microservice establishes an API, it becomes hard to change without breaking the other services that consume the API. Legacy APIs linger around, and new APIs are patched on top.
  • C5: It slows down application development. When making changes that affect multiple microservices, developers cannot implement and deploy the changes atomically. They have to carefully plan how to introduce the change across 𝑛 microservices with their own release schedules.


Having developed several microservice-based systems, I have indeed found version management to be a huge challenge, and it often persists right up until the code actually goes live; reproducing an environment fully consistent with production is difficult and costly. There is also the unfriendliness to testing: newly developed parts pass their module tests, but it is hard to guarantee they will not affect the various versions of code running in production. And it is very unfriendly to newcomers, especially those just getting into a new business domain, who need a great deal of time to understand the call relationships between services.

In our experience, we have found that an overwhelming number of developers view the above challenges as a necessary part of doing business. Many cloud-native companies are in fact developing internal frameworks and processes that aim to ease some of the above challenges, but not fundamentally change or eliminate them altogether. For example, continuous deployment frameworks [12, 22, 37] simplify how individual binaries are built and pushed into production, but they do nothing to solve the versioning issue; if anything, they make it worse, as code is pushed into production at a faster rate. Various programming libraries [13, 27] make it easier to create and discover network endpoints, but do nothing to help ease application management. Network protocols like gRPC [18] and data formats like Protocol Buffers [30] are continually improved, but still take up a major fraction of an application's execution cost.


CI/CD plus internal process conventions can alleviate some of the version-management and deployment problems, but issues such as slow development, the performance cost on services, and the growing difficulty of maintaining the code are not fundamentally eliminated. Fast builds and releases mean a fast development cadence, and code that ends up depending on intermediate versions becomes hard to deploy and maintain.

There are two reasons why these microservice-based solutions fall short of solving challenges C1-C5. The first reason is that they all assume that the developer manually splits their application into multiple binaries. This means that the network layout of the application is predetermined by the application developer. Moreover, once made, the network layout becomes hardened by the addition of networking code into the application (e.g., network endpoints, client/server stubs, network-optimized data structures like [30]). This means that it becomes harder to undo or modify the splits, even when it makes sense to do so. This implicitly contributes to the challenges C1, C2 and C4 mentioned above.


Is this a side effect of DevOps? The original hope was that developers would operate their own services, but a person doing two jobs does not get twice the time to handle problems. And as the number of people doing operations grows, it becomes harder to know the state of a service; we no longer know who had the permission to change the configuration that brought the whole service down. When the code falls into dependency hell, deploying it becomes a complex undertaking.

The second reason is the assumption that application binaries are individually (and in some cases continually) released into production. This makes it more difficult to make changes to the cross-binary protocol. Additionally, it introduces versioning issues and forces the use of more inefficient data formats like [23, 30]. This in turn contributes to the challenges C1-C5 listed above.

I once had the idea of binding a version to each remote-call interface; the result was that old APIs could never be retired, and, lingering online, they could become an entry point for attacks.

In this paper, we propose a different way of writing and deploying distributed applications, one that solves C1-C5. Our programming methodology consists of three core tenets:

(1) Write monolithic applications that are modularized into logically distinct components.

(2) Leverage a runtime to dynamically and automatically assign logical components to physical processes based on execution characteristics.

(3) Deploy applications atomically, preventing different versions of an application from interacting.


I haven't practiced this, so I need to see how the following is implemented. Tenet 2 can be understood as relying on infrastructure such as K8S to decide where services run, rather than relying on how the programmer splits the code to determine where it actually executes. Tenet 3 targets several of the pain points, including service version dependencies and service operations and deployment, but it likewise faces the problem of high internal coupling.

Other solutions (e.g., actor based systems) have also tried to raise the abstraction. However, they fall short of solving one or more of these challenges (Section 7). Though these challenges and our proposal are discussed in the context of serving applications, we believe that our observations and solutions are broadly useful.

An "actor" system usually refers to a model of concurrent computation for building concurrent and distributed systems. In this model the units of computation are called actors; each actor is an independent entity, and actors communicate by passing messages. Each actor has its own state and behavior and can execute concurrently. When an actor receives a message, it can perform some computation, update its own state, and send messages to other actors, triggering further computation. Distributed frameworks such as Apache Flink have used actor-based implementations (e.g., Akka) internally for distributed coordination. My shallow understanding of such systems is as event-triggered streaming systems.

Proposed Solution

Now for the meat.

The two main parts of our proposal are (1) a programming model with abstractions that allow developers to write single binary modular applications focused solely on business logic, and (2) a runtime for building, deploying, and optimizing these applications.


My feeling is that (1) is the crux; the technology for (2) should be settled, so it comes down to how (1) cooperates with (2).

The programming model enables a developer to write a distributed application as a single program, where the code is split into modular units called components (Section 3). This is similar to splitting an application into microservices, except that microservices conflate logical and physical boundaries. Our solution instead decouples the two: components are centered around logical boundaries based on application business logic, and the runtime is centered around physical boundaries based on application performance (e.g., two components should be co-located to improve performance). This decoupling—along with the fact that boundaries can be changed atomically—addresses C4.


By delegating all execution responsibilities to the runtime, our solution is able to provide the same benefits as microservices but with much higher performance and reduced costs (addresses C1). For example, the runtime makes decisions on how to run, place, replicate, and scale components (Section 4). Because applications are deployed atomically, the runtime has a bird's eye view into the application's execution, enabling further optimizations. For example, the runtime can use custom serialization and transport protocols that leverage the fact that all participants execute at the same version.


This binds version and code together: version A's APIs are only ever accessed by version A's code, with no need to worry about other services calling in. I don't yet see how this solves the C1 performance challenge; let me plant a flag here and watch how it's done later.

Writing an application as a single binary and deploying it atomically also makes it easier to reason about its correctness (addresses C2) and makes the application easier to manage (addresses C3). Our proposal provides developers with a programming model that lets them focus on application business logic, delegating deployment complexities to a runtime (addresses C5). Finally, our proposal enables future innovations like automated testing of distributed applications (Section 5).


so perfect

Programming Model

  1. Components
The key abstraction of our proposal is the component. A component is a long-lived, replicated computational agent, similar to an actor [2]. Each component implements an interface, and the only way to interact with a component is by calling methods on its interface. Components may be hosted by different OS processes (perhaps across many machines). Component method invocations turn into remote procedure calls where necessary, but remain local procedure calls if the caller and callee component are in the same process.

Components are illustrated in Figure 1. The example application has three components: 𝐴, 𝐵, and 𝐶. When the application is deployed, the runtime determines how to co-locate and replicate components. In this example, components 𝐴 and 𝐵 are co-located in the same OS process, and method calls between them are executed as regular method calls. Component 𝐶 is not co-located with any other component and is replicated across two machines. Method calls on 𝐶 are executed as RPCs over the network.

Components are generally long-lived, but the runtime may scale up or scale down the number of replicas of a component over time based on load. Similarly, component replicas may fail and get restarted. The runtime may also move component replicas around, e.g., to co-locate two chatty components in the same OS process so that communication between the components is done locally rather than over the network.


Oh, this looks a lot like the serverless idea, though at a different level of abstraction: the runtime decides whether an interface call actually executes remotely or locally. One thought here concerns database transactions: how are they handled well under this model? If a called method becomes a remote call, transactions added the way Spring AOP does will stop working; methods that must share a transaction would have to be written inside a single method call.
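The local-versus-remote dispatch described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation; the names (Hello, localStub, remoteStub) and the function-as-network shortcut are all invented here:

```go
// Toy sketch: a component call stays a plain method call when caller and
// callee share a process, and becomes a (here simulated) remote call otherwise.
package main

import "fmt"

// Hello is a component interface; callers only ever see this interface.
type Hello interface {
	Greet(name string) string
}

// helloImpl is the component implementation.
type helloImpl struct{}

func (helloImpl) Greet(name string) string { return "Hello, " + name + "!" }

// localStub dispatches with an ordinary method call (co-located components).
type localStub struct{ impl Hello }

func (s localStub) Greet(name string) string { return s.impl.Greet(name) }

// remoteStub stands in for generated RPC code: it would marshal the arguments
// and send them over the network; here the "network" is just a function.
type remoteStub struct{ send func(method, arg string) string }

func (s remoteStub) Greet(name string) string { return s.send("Greet", name) }

func main() {
	impl := helloImpl{}

	var local Hello = localStub{impl: impl}
	var remote Hello = remoteStub{send: func(method, arg string) string {
		return impl.Greet(arg) // server side would unmarshal and invoke
	}}

	fmt.Println(local.Greet("World"))  // Hello, World!
	fmt.Println(remote.Greet("World")) // Hello, World!
}
```

The caller only ever holds a Hello value, so whether a call crosses a process boundary is invisible at the call site, which is exactly what lets the runtime move components around.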

  2. API
For the sake of concreteness, we present a component API in Go, though our ideas are language-agnostic. A "Hello, World!" application is given in Figure 2. Component interfaces are represented as Go interfaces, and component implementations are represented as Go structs that implement these interfaces. In Figure 2, the hello struct embeds the Implements[Hello] struct to signal that it is the implementation of the Hello component. Init initializes the application. Get[Hello] returns a client to the component with interface Hello, creating it if necessary. The call to hello.Greet looks like a regular method call. Any serialization and remote procedure calls are abstracted away from the developer.

I'm not familiar with Go, but from this example we can pick out a few concepts: an interface, an implementation of the interface, and an entity that calls it. When the system invokes a method on the entity it calls through the interface, and the interface's implementation may live locally or remotely.
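The Figure 2 pattern can be imitated in a single process to make the moving parts concrete. Everything below is a toy re-creation under my own assumptions: the real framework's Implements[T] and Get[T] do far more (the text says Get may create the component, possibly remotely), and Register is purely my stand-in for what its code generator wires up automatically:

```go
// Toy, single-process re-creation of the Implements[T]/Get[T] pattern.
package main

import (
	"fmt"
	"reflect"
)

// Implements is embedded by a struct to mark it as T's implementation.
type Implements[T any] struct{}

// registry maps interface types to their registered implementations.
var registry = map[reflect.Type]any{}

// Register records impl as the implementation of interface T. (Invented
// helper: the real framework discovers implementations at build time.)
func Register[T any](impl T) {
	registry[reflect.TypeOf((*T)(nil)).Elem()] = impl
}

// Get returns a client for the component with interface T. A real runtime
// might create the component in another process; here it is a local lookup.
func Get[T any]() T {
	return registry[reflect.TypeOf((*T)(nil)).Elem()].(T)
}

// --- application code, mirroring Figure 2 ---

type Hello interface {
	Greet(name string) string
}

type hello struct {
	Implements[Hello]
}

func (hello) Greet(name string) string { return "Hello, " + name + "!" }

func main() {
	Register[Hello](hello{})
	h := Get[Hello]()
	fmt.Println(h.Greet("World")) // looks like a regular method call
}
```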

Runtime

  1. Overview
Underneath the programming model lies a runtime that is responsible for distributing and executing components. The runtime makes all high-level decisions on how to run components. For example, it decides which components to co-locate and replicate. The runtime is also responsible for low-level details like launching components onto physical resources and restarting components when they fail. Finally, the runtime is responsible for performing atomic rollouts, ensuring that components in one version of an application never communicate with components in a different version.

There are many ways to implement a runtime. The goal of this paper is not to prescribe any particular implementation. Still, it is important to recognize that the runtime is not magical. In the rest of this section, we outline the key pieces of the runtime and demystify its inner workings.

The point here is that the runtime decides the deployment of the components, and it does so automatically rather than through human operation. The optimization presumably follows some basic principles, with options such as scheduling policies, circuit-breaking policies, retry policies, and perhaps caching and restart policies. Done this way, the code delivered in each version is closed off from other versions. A very large application will, of course, take longer to compile and start, so the runtime should make sure a service only starts taking traffic once it is ready.

  2. Code Generation
The first responsibility of the runtime is code generation. By inspecting the Implements[T] embeddings in a program's source code, the code generator computes the set of all component interfaces and implementations. It then generates code to marshal and unmarshal arguments to component methods. It also generates code to execute these methods as remote procedure calls. The generated code is compiled along with the developer's code into a single binary.

This passage looks similar to the logic of an RPC call: the interface's implementation needs to be proxied, with the remote-call code woven in.
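As a rough sketch of what such generated code might look like (all names invented, with a deliberately naive wire format): marshal the arguments in an agreed order, hand the bytes to the callee, unmarshal in the same order, and invoke the method:

```go
// Sketch of generator-style marshaling/invocation stubs for one method.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

type Adder interface{ Add(a, b int32) int32 }

type adderImpl struct{}

func (adderImpl) Add(a, b int32) int32 { return a + b }

// adder_Add_encode is what generated marshaling code could look like:
// fields in a fixed order, no tags, because caller and callee are always
// compiled from the same source version.
func adder_Add_encode(a, b int32) []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, a)
	binary.Write(&buf, binary.LittleEndian, b)
	return buf.Bytes()
}

// adder_Add_invoke is the server-side half: decode in the same order, call.
func adder_Add_invoke(impl Adder, req []byte) int32 {
	var a, b int32
	r := bytes.NewReader(req)
	binary.Read(r, binary.LittleEndian, &a)
	binary.Read(r, binary.LittleEndian, &b)
	return impl.Add(a, b)
}

func main() {
	req := adder_Add_encode(2, 3)
	fmt.Println(adder_Add_invoke(adderImpl{}, req)) // 5
}
```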

  3. Application-Runtime Interaction
With our proposal, applications do not include any code specific to the environment in which they are deployed, yet they must ultimately be run and integrated into a specific environment (e.g., across machines in an on-premises cluster or across regions in a public cloud). To support this integration, we introduce an API (partially outlined in Table 1) that isolates application logic from the details of the environment. The caller of the API is a proclet. Every application binary runs a small, environment-agnostic daemon called a proclet that is linked into the binary during compilation. A proclet manages the components in a running binary. It runs them, starts them, stops them, restarts them on failure, etc.

If a call is remote, then when writing the code we would normally have to pin down which service hosts the interface's implementation, which means manually specifying the implementing service's name; and if the service spans public clouds we have to specify even more. Every call appears to go through an interface, yet the implementations differ (local, local cluster, public-cloud cluster, public-cloud non-cluster), while each class of call still shares common mechanics. Hence the need for proclets: a proclet is an environment-agnostic daemon, responsible for component registration, discovery, startup, and restart on failure.

The implementer of the API is the runtime, which is responsible for all control plane operations. The runtime decides how and where proclets should run. For example, a multiprocess runtime may run every proclet in a subprocess; an SSH runtime may run proclets via SSH; and a cloud runtime may run proclets as Kubernetes pods [25, 28].

Concretely, proclets interact with the runtime over a Unix pipe. For example, when a proclet is constructed, it sends a RegisterReplica message over the pipe to mark itself as alive and ready. It periodically issues ComponentsToHost requests to learn which components it should run. If a component calls a method on a different component, the proclet issues a StartComponent request to ensure it is started.


proclet ——> API ——> runtime

| API | Description |
| --- | --- |
| RegisterReplica | Register a proclet as alive and ready. |
| StartComponent | Start a component, potentially in another process. |
| ComponentsToHost | Get components a proclet should host. |

The runtime implements these APIs in a way that makes sense for the deployment environment. We expect most runtime implementations to contain the following two pieces: (1) a set of envelope processes that communicate directly with proclets via UNIX pipes, and (2) a global manager that orchestrates the execution of the proclets (see Figure 3).


An envelope runs as the parent process to a proclet and relays API calls to the manager. The manager launches envelopes and (indirectly) proclets across the set of available resources (e.g., servers, VMs). Throughout the lifetime of the application, the manager interacts with the envelopes to collect health and load information of the running components; to aggregate metrics, logs, and traces exported by the components; and to handle requests to start new components. The manager also issues environment-specific APIs (e.g., Google Cloud [16], AWS [4]) to update traffic assignments and to scale up and down components based on load, health, and performance constraints. Note that the runtime implements the control plane but not the data plane. Proclets communicate directly with one another.


The control plane only coordinates calls and does not carry data; the actual call data flows between the proclets themselves. When you have a cluster you really do want to know what is running on it and how it is doing; it is a natural impulse, on finding one machine overloaded while the others sit idle, to want to redistribute their work.
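The control-plane exchange described above can be sketched with the three Table 1 calls modeled as messages. The message fields and the co-location answer are my own assumptions; the text only names the calls and says they travel over a Unix pipe:

```go
// Sketch of the proclet <-> runtime control-plane handshake, with the Table 1
// calls as JSON messages (the wire format here is invented for illustration).
package main

import (
	"encoding/json"
	"fmt"
)

type Message struct {
	Kind       string   `json:"kind"` // RegisterReplica, ComponentsToHost, StartComponent
	Replica    string   `json:"replica,omitempty"`
	Components []string `json:"components,omitempty"`
}

// runtimeHandle plays the envelope/manager role: it answers each
// control-plane request a proclet sends up the pipe.
func runtimeHandle(req Message) Message {
	switch req.Kind {
	case "RegisterReplica":
		return Message{Kind: "ok"}
	case "ComponentsToHost":
		// The manager decides placement; in this sketch A and B are co-located.
		return Message{Kind: "ok", Components: []string{"A", "B"}}
	case "StartComponent":
		return Message{Kind: "ok"}
	}
	return Message{Kind: "error"}
}

func main() {
	// A proclet marks itself alive, then asks which components to host.
	raw, _ := json.Marshal(Message{Kind: "RegisterReplica", Replica: "proclet-1"})
	var req Message
	json.Unmarshal(raw, &req) // simulate the trip across the pipe
	fmt.Println(runtimeHandle(req).Kind)

	resp := runtimeHandle(Message{Kind: "ComponentsToHost", Replica: "proclet-1"})
	fmt.Println(resp.Components)
}
```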

  4. Atomic Rollouts
Developers inevitably have to release new versions of their application. A widely used approach is to perform rolling updates, where the machines in a deployment are updated from the old version to the new version one by one. During a rolling update, machines running different versions of the code have to communicate with each other, which can lead to failures. [78] shows that the majority of update failures are caused by these cross-version interactions.

To address these complexities, we propose a different approach. The runtime ensures that application versions are rolled out atomically, meaning that all component communication occurs within a single version of the application. The runtime gradually shifts traffic from the old version to the new version, but once a user request is forwarded to a specific version, it is processed entirely within that version. One popular implementation of atomic rollouts is the use of blue/green deployments [9].


Just thinking about it sounds convenient, certainly more convenient than wrestling with a service mesh.
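A minimal sketch of version-pinned traffic shifting, assuming a simple counter-based splitter (the paper only says traffic moves gradually and each request stays within one version):

```go
// Sketch: during a blue/green rollout, a request is assigned to one
// application version up front, and everything it triggers stays there.
package main

import "fmt"

type deployment struct {
	version string
	handle  func(req string) string // processes a request entirely in this version
}

type rollout struct {
	blue, green deployment
	greenEvery  int // send every Nth request to the new (green) version
	counter     int
}

// route makes the version choice once per request, never per component call.
func (r *rollout) route(req string) string {
	r.counter++
	if r.counter%r.greenEvery == 0 {
		return r.green.handle(req)
	}
	return r.blue.handle(req)
}

func main() {
	r := &rollout{
		blue:       deployment{version: "v1", handle: func(req string) string { return "v1:" + req }},
		green:      deployment{version: "v2", handle: func(req string) string { return "v2:" + req }},
		greenEvery: 4, // roughly 25% of traffic to the new version
	}
	for i := 0; i < 4; i++ {
		fmt.Println(r.route("req")) // v1:req, v1:req, v1:req, v2:req
	}
}
```

Shifting more traffic is just lowering greenEvery over time; because no call ever crosses versions, there is no cross-version protocol to keep compatible.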

Enabled Innovations

  1. Transport, Placement, and Scaling
The runtime has a bird's-eye view into application execution, which enables new avenues to optimize performance. For example, our framework can construct a fine-grained call graph between components and use it to identify the critical path, the bottleneck components, the chatty components, etc. Using this information, the runtime can make smarter scaling, placement, and co-location decisions. Moreover, because serialization and transport are abstracted from the developer, the runtime is free to optimize them. For network bottlenecked applications, for example, the runtime may decide to compress messages on the wire. For certain deployments, the transport may leverage technologies like RDMA [32].


An environment with monitoring built in, and tailored to the project at that.

  2. Routing
The performance of some components improves greatly when requests are routed with affinity. For example, consider an in-memory cache component backed by an underlying disk-based storage system. The cache hit rate and overall performance increase when requests for the same key are routed to the same cache replica. Slicer [44] showed that many applications can benefit from this type of affinity based routing and that the routing is most efficient when embedded in the application itself [43]. Our programming framework can be naturally extended to include a routing API. The runtime could also learn which methods benefit the most from routing and route them automatically.


The tangled call relationships get straightened out, and not having to worry about how calls are implemented earns high marks from me.
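Affinity routing as described can be sketched with a stable hash from key to replica. This is a toy; Slicer-style systems use dynamic, load-aware assignments rather than a fixed hash-mod:

```go
// Sketch: requests for the same key always land on the same cache replica,
// so each replica's working set stays hot.
package main

import (
	"fmt"
	"hash/fnv"
)

type replica struct {
	name  string
	cache map[string]string // each replica caches only the keys routed to it
}

// pick chooses a replica by hashing the key; the mapping is stable while the
// replica count is stable (production routers would use consistent hashing
// or slicer-style assignment to survive resizes).
func pick(replicas []*replica, key string) *replica {
	h := fnv.New32a()
	h.Write([]byte(key))
	return replicas[int(h.Sum32())%len(replicas)]
}

func main() {
	replicas := []*replica{
		{name: "r0", cache: map[string]string{}},
		{name: "r1", cache: map[string]string{}},
	}
	// The same key is always routed to the same replica.
	a := pick(replicas, "user:42")
	b := pick(replicas, "user:42")
	fmt.Println(a.name == b.name) // true
}
```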

  3. Automated Testing
One of the touted benefits of microservice architectures is fault-tolerance. The idea is that if one service in an application fails, the functionality of the application degrades but the app as a whole remains available. This is great in theory, but in practice it relies on the developer to ensure that their application is resilient to failures and, more importantly, to test that their failure-handling logic is correct. Testing is particularly challenging due to the overhead in building and running 𝑛 different microservices, systematically failing and restoring them, and checking for correct behavior. As a result, only a fraction of microservice-based systems are tested for this type of fault tolerance. With our proposal, it is trivial to run end-to-end tests. Because applications are written as single binaries in a single programming language, end-to-end tests become simple unit tests. This opens the door to automated fault tolerance testing, akin to chaos testing [47], Jepsen testing [14], and model checking [62].


Calling into and debugging a distributed program locally is painful enough, never mind writing test cases for such a thing.
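A sketch of why this matters for testing: when components are plain in-process values, a "failed" component is just a stub, and the end-to-end fault-tolerance check is an ordinary unit test. The component names here are invented:

```go
// Sketch: fault-injection as a unit test in a single-binary application.
package main

import (
	"errors"
	"fmt"
)

type Recommender interface{ Recommend(user string) ([]string, error) }

type goodRecommender struct{}

func (goodRecommender) Recommend(string) ([]string, error) {
	return []string{"book"}, nil
}

// failingRecommender simulates a crashed component replica.
type failingRecommender struct{}

func (failingRecommender) Recommend(string) ([]string, error) {
	return nil, errors.New("recommender unavailable")
}

// Frontend degrades gracefully: on failure it serves an empty list
// instead of failing the whole page.
type Frontend struct{ rec Recommender }

func (f Frontend) Home(user string) []string {
	recs, err := f.rec.Recommend(user)
	if err != nil {
		return nil // degraded but available
	}
	return recs
}

func main() {
	healthy := Frontend{rec: goodRecommender{}}
	degraded := Frontend{rec: failingRecommender{}}
	fmt.Println(len(healthy.Home("u")), len(degraded.Home("u"))) // 1 0
}
```

Systematically swapping each component for a failing stub and re-running the suite is the single-binary analogue of chaos testing.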

  4. Stateful Rollouts
Our proposal ensures that components in one version of an application never communicate with components in a different version. This makes it easier for developers to reason about correctness. However, if an application updates state in a persistent storage system, like a database, then different versions of an application will indirectly influence each other via the data they read and write. These cross-version interactions are unavoidable—persistent state, by definition, persists across versions—but an open question remains about how to test these interactions and identify bugs early to avoid catastrophic failures during rollout.


Stateful deployment is a challenge: if multiple versions of a service coexist, the design of the state kept in the persistent storage system will be put to the test.

  5. Discussion
Note that innovation in the areas discussed in this section is not fundamentally unique to our proposal. There has been extensive research on transport protocols [63, 64], routing [44, 65], testing [45, 75], resource management [57, 67, 71], troubleshooting [54, 56], etc. However, the unique features of our programming model enable new innovations and make existing innovations much easier to implement. For instance, by leveraging the atomic rollouts in our proposal, we can design highly-efficient serialization protocols that can safely assume that all participants are using the same schema. Additionally, our programming model makes it easy to embed routing logic directly into a user's application, providing a range of benefits [43]. Similarly, our proposal's ability to provide a bird's eye view of the application allows researchers to focus on developing new solutions for tuning applications and reducing deployment costs.


So cool. I really want to use this soon.

Prototype Implementation

The meatiest part of the meat.

Our prototype implementation is written in Go [38] and includes the component APIs described in Figure 2, the code generator described in Section 4.2, and the proclet architecture described in Section 4.3. The implementation uses a custom serialization format and a custom transport protocol built directly on top of TCP. The prototype also comes with a Google Kubernetes Engine (GKE) deployer, which implements multi-region deployments with gradual blue/green rollouts. It uses Horizontal Pod Autoscalers [20] to dynamically adjust the number of container replicas based on load and follows an architecture similar to that in Figure 3. Our implementation is available at github.com/ServiceWeaver.


I follow you.

From here on I will just follow the paper, with no more hot takes.

  1. Evaluation
To evaluate our prototype, we used a popular web application [41] representative of the kinds of microservice applications developers write. The application has eleven microservices and uses gRPC [18] and Kubernetes [25] to deploy on the cloud. The application is written in various programming languages, so for a fair comparison, we ported the application to be written fully in Go. We then ported the application to our prototype, with each microservice rewritten as a component. We used Locust [26], a workload generator, to load-test the application with and without our prototype.

The workload generator sends a steady rate of HTTP requests to the applications. Both application versions were configured to auto-scale the number of container replicas in response to load. We measured the number of CPU cores used by the application versions in a steady state, as well as their end-to-end latencies. Table 2 shows our results.


Most of the performance benefits of our prototype come from its use of a custom serialization format designed for non-versioned data exchange, as well as its use of a streamlined transport protocol built directly on top of TCP. For example, the serialization format used does not require any encoding of field numbers or type information. This is because all encoders and decoders run at the exact same version and agree on the set of fields and the order in which they should be encoded and decoded in advance.

For an apples-to-apples comparison to the baseline, we did not co-locate any components. When we co-locate all eleven components into a single OS process, the number of cores drops to 9 and the median latency drops to 0.38 ms, both an order of magnitude lower than the baseline. This mirrors industry experience [34, 39].

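The serialization point can be made concrete with a toy comparison: a tag-free encoding that both sides agree on in advance, versus a tagged encoding that must carry field numbers for cross-version compatibility. Both formats here are invented for illustration, not the prototype's actual wire format:

```go
// Sketch: same-version serialization can omit field numbers and type info
// because every encoder and decoder agrees on field order in advance.
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

type Point struct{ X, Y int32 }

// encodeTagless writes fields in their declared order, nothing else.
func encodeTagless(p Point) []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.LittleEndian, p.X)
	binary.Write(&buf, binary.LittleEndian, p.Y)
	return buf.Bytes()
}

// encodeTagged mimics a versioned format: each field carries a field number
// so readers at other versions can skip fields they don't know.
func encodeTagged(p Point) []byte {
	var buf bytes.Buffer
	for i, v := range []int32{p.X, p.Y} {
		buf.WriteByte(byte(i + 1)) // field number
		binary.Write(&buf, binary.LittleEndian, v)
	}
	return buf.Bytes()
}

// decodeTagless reads fields back in the same agreed order.
func decodeTagless(b []byte) Point {
	var p Point
	r := bytes.NewReader(b)
	binary.Read(r, binary.LittleEndian, &p.X)
	binary.Read(r, binary.LittleEndian, &p.Y)
	return p
}

func main() {
	p := Point{X: 3, Y: 4}
	fmt.Println(len(encodeTagless(p)), len(encodeTagged(p))) // 8 10
	fmt.Println(decodeTagless(encodeTagless(p)) == p)        // true
}
```

The saving per field is small, but it is paid on every field of every message, and the decoder also skips all tag-dispatch work.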

Actor Systems. The closest solutions to our proposal are Orleans [74] and Akka [3]. These frameworks also use abstractions to decouple the application and runtime. Ray [70] is another actor based framework but is focused on ML applications. None of these systems support atomic rollouts, which is a necessary component to fully address challenges C2-C5. Other popular actor based frameworks such as Erlang [61], E [52], Thorn [48] and C++ Actor Framework [10] put the burden on the developer to deal with system and low level details regarding deployment and execution, hence they fail to decouple the concerns between the application and the runtime and therefore don't fully address C1-C5. Distributed object frameworks like CORBA, DCOM, and Java RMI use a programming model similar to ours but suffered from a number of technical and organizational issues [58] and don't fully address C1-C5 either.


Microservice Based Systems. Kubernetes [25] is widely used for deploying container based applications in the cloud. However, its focus is orthogonal to our proposal and doesn't address any of C1-C5. Docker Compose [15], Acorn [1], Helm [19], Skaffold [35], and Istio [21] abstract away some microservice challenges (e.g., configuration generation). However, challenges related to splitting an application into microservices, versioned rollouts, and testing are still left to the user. Hence, they don't satisfy C1-C5.


Other Systems. There are many other solutions that make it easier for developers to write distributed applications, including dataflow systems [51, 59, 77], ML inference serving systems [8, 17, 42, 50, 73], serverless solutions [11, 24, 36], databases [29, 49], and web applications [66]. More recently, service meshes [46, 69] have raised networking abstractions to factor out common communication functionality. Our proposal embodies these same ideas but in a new domain of general serving systems and distributed applications. In this context, new challenges arise (e.g., atomic rollouts).

其他系统。有许多其他解决方案可以帮助开发人员编写分布式应用程序,包括数据流系统 [51, 59, 77]、机器学习推理服务系统 [8, 17, 42, 50, 73]、无服务器解决方案 [11, 24, 36]、数据库 [29, 49] 和 Web 应用程序 [66]。最近,服务网格 [46, 69] 将网络抽象提升到了一种可以因素化出常见通信功能的层次。我们的提议体现了相同的思想,但在通用服务系统和分布式应用程序的新领域中。在这种背景下,会出现一些新的挑战(例如,原子部署)。

Discussion

  1. Multiple Application Binaries
We argue that applications should be written and built as single binaries, but we acknowledge that this may not always be feasible. For example, the size of an application may exceed the capabilities of a single team, or different application services may require distinct release cycles for organizational reasons. In all such cases, it may be necessary for the application to consist of multiple binaries.

While this paper doesn't address the cases where the use of multiple binaries is required, we believe that our proposal allows developers to write fewer binaries (i.e., by grouping multiple services into single binaries whenever possible), achieve better performance, and postpone hard decisions related to how to partition the application. We are exploring how to accommodate applications written in multiple languages and compiled into separate binaries.

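The grouping idea above can be sketched in Go. This is our own illustration, not the paper's actual API: two logical services (a hypothetical `Catalog` and `Cart`) are written as components with explicit interfaces inside one binary, and the `newCart` wiring stands in for a runtime that could later place them in separate processes without changing the component code.

```go
// A minimal sketch of grouping two logical services as components in
// a single binary. All names here are illustrative, not the paper's API.
package main

import "fmt"

// Catalog and Cart are logical components with explicit interfaces.
type Catalog interface {
	Price(item string) (int, error)
}

type Cart interface {
	Total(items []string) (int, error)
}

type catalog struct{}

func (catalog) Price(item string) (int, error) {
	prices := map[string]int{"pen": 2, "book": 10}
	p, ok := prices[item]
	if !ok {
		return 0, fmt.Errorf("unknown item %q", item)
	}
	return p, nil
}

// cart depends on Catalog only through its interface, so a runtime is
// free to co-locate the two components now and split them later.
type cart struct{ catalog Catalog }

func (c cart) Total(items []string) (int, error) {
	sum := 0
	for _, it := range items {
		p, err := c.catalog.Price(it)
		if err != nil {
			return 0, err
		}
		sum += p
	}
	return sum, nil
}

// newCart stands in for a runtime call that resolves the Catalog
// dependency; a real runtime could return a local object or an RPC stub.
func newCart() Cart { return cart{catalog: catalog{}} }

func main() {
	total, err := newCart().Total([]string{"pen", "book"})
	fmt.Println(total, err)
}
```

Because callers never see which side of the interface boundary a component runs on, the decision of how to partition the application stays deferred, as the section argues.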
  2. Integration with External Services
Applications often need to interact with external services (e.g., a Postgres database [29]). Our programming model allows applications to interact with these services as any application would; not everything has to be a component. However, when an external service is extensively used within and across applications, defining a corresponding component might provide better code reuse.

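The reuse point above can be sketched as follows. The `Store` interface and its in-memory implementation are our own hypothetical example: applications depend on the component interface rather than on a concrete database client, so a Postgres-backed implementation and this in-memory one would be interchangeable across applications and in tests.

```go
// Illustrative sketch: wrapping an external key-value service behind a
// component interface. Names are ours, not from the paper.
package main

import (
	"errors"
	"fmt"
)

// Store is a hypothetical component interface for an external service.
// A Postgres-backed implementation could satisfy it equally well.
type Store interface {
	Put(key, value string) error
	Get(key string) (string, error)
}

var errNotFound = errors.New("key not found")

// memStore is an in-memory implementation, handy for tests and for
// applications that don't need a real database.
type memStore struct{ data map[string]string }

func newMemStore() *memStore { return &memStore{data: map[string]string{}} }

func (s *memStore) Put(key, value string) error {
	s.data[key] = value
	return nil
}

func (s *memStore) Get(key string) (string, error) {
	v, ok := s.data[key]
	if !ok {
		return "", errNotFound
	}
	return v, nil
}

func main() {
	var s Store = newMemStore()
	if err := s.Put("user:1", "ada"); err != nil {
		panic(err)
	}
	v, _ := s.Get("user:1")
	fmt.Println(v)
}
```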
  3. Distributed Systems Challenges
While our programming model allows developers to focus on their business logic and defer much of the complexity of deploying their applications to a runtime, our proposal does not solve fundamental challenges of distributed systems [53, 68, 76]. Application developers still need to be aware that components may fail or experience high latency.

  4. Programming Guidance
There is no official guidance on how to write distributed applications; hence there has been a long and heated debate over whether writing applications as monoliths or as microservices is the better choice, and each approach comes with its own pros and cons. We argue that developers should write their application as a single binary using our proposal and decide later whether they really need to move to a microservices-based architecture. By postponing the decision of how exactly to split the application into different microservices, developers can write fewer and better microservices.

  5. Conclusion
The status quo when writing distributed applications involves splitting applications into independently deployable services. This architecture has a number of benefits but also many shortcomings. In this paper, we propose a different programming paradigm that sidesteps these shortcomings. Our proposal encourages developers to (1) write monolithic applications divided into logical components, (2) defer to a runtime the challenge of physically distributing and executing the modularized monoliths, and (3) deploy applications atomically. These three guiding principles unlock a number of benefits and open the door to a bevy of future innovation. Our prototype implementation reduced application latency by up to 15× and reduced cost by up to 9× compared to the status quo.