ZooKeeper 原理总结

发表于 2019-07-29 更新于 2025-07-07 分类于 ZooKeeper

ZooKeeper 是分布式的、开源的分布式应用程序协调服务，原本是 Hadoop、HBase 的一个重要组件。它为分布式应用提供一致性服务的软件，包括：配置维护、域名服务、分布式同步、组服务等。

为什么使用 ZooKeeper

区别于集中式系统，分布式系统是一个硬件或软件组件分布在不同的网络计算机上，彼此之间仅仅通过消息传递进行通信和协调的系统。
因为分布式特性，多个主机注定会出现很多问题，包括单点、分区、脑裂等。
ZooKeeper 解决的就是如何协调这些主机，就像动物园管理员一样（这也是其名字的由来），作为分布式协调中间件，ZooKeeper 主要用来解决分布式环境当中多个进程之间的同步控制。

分布式应用面对的问题

分布式系统存在着不确定性和不可预测性，我们可能会面临诸多问题，包括：

节点部分失效；
节点短时间内不可用导致被误判为宕机，此时服务器可能正在执行需要长时间阻塞用户线程的任务，比如 GC；
由于网络不可靠而引起丢包或任意延迟；
不可靠的时钟导致节点间不同步；

这些问题可以归类到可用性和一致性两类问题上。

从 CAP 理论中我们可以了解到，网络分区无法避免，设计分布式系统时必须考虑网络的安全性，其次再权衡可用性和一致性，二者相当于鱼和熊掌不可得兼。

可用性

节点故障无法避免：服务器越多，个别节点发生宕机或僵死情况的可能性越高，特别是在一些有上百个服务的大型企业中，故障几乎每天都会发生。
通信故障无法避免：从集中式到分布式，必然引入了网络因素，而由于网络本身的不可靠性，因此就引入了额外的问题。分布式系统各节点之间的网络通信能够正常进行，其延时也会远大于单机操作，在消息的收发过程中，消息丢失和消息延迟变得十分普遍。
当网络发生异常情况时，导致分布式系统中部分节点之间的网络延时不断增大，最终导致组成分布式系统的所有节点中，只有部分节点之间能够正常通信，而另一些节点则不能，这种现象称之为网络分区，当网络分区出现时，分布式系统会出现局部小集群，在极端情况下，这些局部小集群会独立完成原本需要整个分布式系统才能完成的功能，包括对数据的事务处理，这就对分布式一致性提出了非常大的挑战。

由于网络可能会出现各种各样的问题，因此分布式系统的每一次请求与响应，存在特有的三态概念：成功、失败、超时。当网络在异常情况下，可能会出现超时现象，通常由以下两种情况：

由于网络原因，该请求并没有被成功地发送到接收方，而是在发送过程就发生了消息丢失现象。
该请求成功的被接收方接受后，并进行了处理，但是在将响应反馈给发送方时，发生了消息丢失现象。

一致性

为了灾备，主机一般会设有副本，主从、从从之间同步数据会有延迟，先写入的副本和后写入的副本之间必然产生不一致，甚至写入顺序不同也会导致数据副本间的不一致。
P2P 集群中这种问题更为突出，因为主从集群中数据完全以主节点为准，而 P2P 集群中每个节点都可以主动写数据。
* 缺乏全局时钟：典型的分布式系统由一系列在空间上随意分布的多个进程组成，具有明显的分布性，这些进程之间通过交换消息来进行互相通信，因此，在分布式系统中，很难定义两个时间究竟谁先谁后，原因就是因为分布式系统缺乏一个全局的时钟序列控制。
* 并发操作：同一分布式系统中的多个节点，可能会并发地操作一些共享资源，诸如数据库或分布式存储等，就算通过某些手段协调一致了时钟，但是由于网络延迟的存在，并发请求仍然难以决定请求到达、返回的先后顺序。

副本指的是分布式系统对数据和服务提供的一种冗余方式，为了对外提供高可用的服务，我们往往会对数据和服务进行副本处理。
数据副本是指在不同的节点上持久化同一份数据，当某一个节点上存储的数据丢失时，可以从副本上读取到该数据，这是解决分布式系统数据丢失问题最为有效的手段。
服务副本是指多个节点提供同样的服务，每个节点都有能力接受来自外部的请求并进行相应的处理。
这里最产生不一致的指的是数据副本，比如 MySQL 主从集群。
机器故障是无法避免的，我们在设计的时候一般也会考虑故障的情况、从而为数据引入副本，如果所有这些服务器（主、副）全部挂掉且损坏、数据全部丢失，那么谈一致性也就没意义了，不过一般不会出现这样的情况，一些稍大型的公司甚至会考虑双活、两地三中心等机制。
我认为一致性是分布式系统中最有难度、最有魅力的，一致性问题会引出几乎最难解的问题，同样也会催发出各种优美的解决方案，比如 ZooKeeper 中的 ZAB 协议。

一些一致性验证框架通过混沌理论来测试分布式系统的一致性，比如Jepsen。

本地事务和分布式事务

事务是由一系列对系统中数据进行访问与更新的操作所组成的一个程序执行单元，狭义上的食物特指数据库事务。一方面，当多个应用程序并发访问数据库时，食物可以在这些应用程序之间提供一个隔离方法，以防止彼此的操作相互干扰，另一方面，食物为数据库操作序列提供了一个从失败中恢复到正常状态的方法，同时提供了数据库即使在宜昌状态下仍能保持数据一致性的方法。事务具有原子性（Atomicity）、一致性（Consistency）、隔离性（Isolation）、持久性（Durability），简称 ACID。

原子性，指事务必须是一个原子的操作序列单元，事务中包含的各项操作在一次执行过程中，只允许出现以下两种状态之一，全部成功执行，全部不执行。任何一项操作失败都将导致整个事务失败，同时其他已经被执行的操作都将被撤销并回滚，只有所有操作全部成功，整个事务才算是成功完成。
一致性，指事务的执行不能破坏数据库数据的完整性和一致性，一个事务在执行之前和执行之后，数据库都必须处于一致性状态，即事务执行的结果必须是使数据库从一个一致性状态转变到另一个一致性状态，因此当数据库只包含成功事务提交的结果时，就能说数据库处于一致性状态，而如果数据库系统在运行过程中发生故障，有些事务尚未完成就被迫中断，这些未完成的事务对数据库所做的修改有一部分已写入物理数据库，这时数据库就处于一种不正确的状态，或者说是不一致的状态。
隔离性，指在并发环境中，并发的事务是相互隔离的，一个事务的执行不能被其他事务干扰，即不同的事务并发操作相同的数据时，每个事务都有各自完整的数据空间，即一个事务内部的操作及使用的数据对其他并发事务是隔离的，并发执行的各个事务之间不能相互干扰。
持久性，指一个事务一旦提交，他对数据库中对应数据的状态变更就应该是永久的，即一旦某个事务成功结束，那么它对数据库所做的更新就必须被永久的保存下来，即使发生系统崩溃或者宕机故障，只要数据库能够重新启动，那么一定能够将其恢复到事务成功结束时的状态。

分布式事务是指事务的参与者、支持事务的服务器、资源服务器以及事务管理器分别位于分布式系统的不同节点上，通常一个分布式事务中会涉及对多个数据源或业务系统的操作。一个分布式事务可以看做是由多个分布式的操作序列组成，通常可以把这一系列分布式的操作序列称为子事务。由于在分布式事务中，各个子事务的执行是分布式的，因此要实现一种能够保证 ACID 特性的分布式事务处理系统就显得格外复杂。

CAP 和 BASE

CAP理论告诉我们，一个分布式系统不可能同时满足一致性、可用性、分区容错性这三个基本需求，最多只能同时满足其中的两个。

Consistent（一致性），指数据在多个副本之间是否能够保持一致的特性，在一致性的需求下，当一个系统在数据一致的状态下执行更新操作后，应该保证系统的数据仍然处于一致状态。对于一个将数据副本分布在不同分布式节点上的系统来说，如果对第一个结点的数据进行了更新操作并且成功后，却没有使得第二个节点上的数据得到相应的更新，于是在对第二个结点的数据进行读取操作时，获取的仍然是老数据（脏数据），这就是典型的分布式数据不一致的情况，在分布式系统中，如果能够做到针对一个数据项的更新操作执行成功后，所有的用户都可以读取到期最新的值，那么这样的系统就被认为具有强一致性。
Available（可用性），指系统提供的服务必须一直处于可用的状态，对于用户的每一操作请求总是能够在有限的时间内返回结果。
Partition（分区容错性），分布式系统在遇到任何网络分区故障时，仍然需要能够保证对外提供满足一致性和可用性的服务，除非是整个网络环境都发生了故障。

BASE是基本可用（Basically Available）、Soft state（弱状态）、Eventually consistent（最终一致性）三个短语的简写。

基本可用，指分布式系统在出现不可预知故障时，允许损失部分可用性，如响应时间上的损失或功能上的损失。
弱状态，也称为软状态，指允许系统中的二数据存在中间状态，并认为该中间状态的存在不会影响系统的整体可用性，即允许系统在不同节点的数据副本之间进行数据同步的过程存在延时。
最终一致性，指系统中所有的数据副本，在经过一段时回见的同步后，最终能够达到一个一致的状态，因此最终一致性的本质是需要系统保证数据能够达到一致，而不需要实时保证系统数据的强一致性。
因为强一致性实现起来难度比较大，具有强一致性的系统一般简称为 CP（Consistent+Pertition），ZooKeeper 就是 CP 的一个例子，CP 系统的一个显著特征是缺乏可用性，比如主挂掉后集群会进入选举，在此期间服务都是不可用的，因此一般业务服务不会直接依赖 ZooKeeper。
分布式系统谈得最多的就是最终一致性，但仅靠系统本身是不够的，最终一致性包括人工环节，甚至客服的介入。大部分异常都可以通过扫描 + 重试（at least once 语义） + 幂等忽略，真正产生的异常数据比较少，人工可以处理得过来，最严重的情况下，就需要借助日志来排查 Bug 了。

ZooKeeper 的应用场景

在构建微服务应用的时候，ZooKeepe 到并 r 主要是作注册中心用，基于 Spring Cloud 或 Dubbo 框架开发的提供者、消费者都向 ZooKeeper 注册自己的 URL，消费者还能拿订阅提供者的注册 URL，以便在后续程序的执行中去调用提供者。而提供者发生了变动，也会通过 Zookeeper 向订阅的消费者发送通知。

ZooKeeper 原理

数据结构

路径（path）和树

ZooKeeper 通过路径来引用一个节点。路径必须是绝对的，因此他们必须由斜杠字符来开头。除此以外，他们必须是唯一的，也就是说每一个路径只有一个表示，因此这些路径不能改变。在 ZooKeeper 中，路径由 Unicode 字符串组成，并且有一些限制。字符串”/zookeeper”用以保存管理信息，比如关键配额信息。

Zookeeper 提供基于类似于文件系统的目录节点树方式的数据存储，但是 Zookeeper 并不是用来专门存储数据的，它的作用主要是用来维护和监控存储的数据的状态变化。通过监控这些数据状态的变化，从而可以达到基于数据的集群管理。
与 Linux 文件系统不同的是，Linux 文件系统有目录和文件的区别，而 Zookeeper 的数据节点称为ZNode，ZNode 是 Zookeeper 中数据的最小单元，每个 ZNode 都可以保存数据（ZooKeeper was designed to store coordination data: status information, configuration, location information, etc., so the data stored at each node is usually small, in the byte to kilobyte range.），同时还可以挂载子节点，因此构成了一个层次化的命名空间，称为树。
另外，client 在创建 znode 时还可以指定一个 sequential flag，创建的 znode 将会拥有一个递增的序号，该序号会加到 znode 的名字后面。
read_dominant

节点（znode）

zk 中 znode 类型有四种，持久化目录节点、持久化顺序编号目录节点(有顺序，能够在注册机器等许多场景用到)、临时目录节点、临时顺序编号节点。
节点类型在创建后就不能更改。

持久节点（PERSISTENT）：该节点的生命周期不依赖于会话，并且只有在客户端显示执行删除操作的时候，他们才能被删除。
持久顺序节点（PERSISTENT_SEQUENTIAL）：跟持久一样，就是父节点在创建下一级子节点（父节点必须是持久的）的时候，记录每个子节点创建的先后顺序，会给每个子节点名加上一个数字后缀。
临时节点（EPHEMERAL）：该节点的生命周期依赖于创建它们的会话。一旦会话(Session)结束，临时节点将被自动删除，当然可以也可以手动删除。虽然每个临时的 Znode 都会绑定到一个客户端会话，但他们对所有的客户端还是可见的。另外，ZooKeeper 的临时节点不允许拥有子节点。

注意是会话结束而不是连接被断开。
临时顺序节点（EPEMERAL_SEQUENTIAL）：类似持久顺序节点。

一个节点主要包含以下属性：
代码：Stat

stat：此为状态信息, 描述该 Znode 的版本, 权限等信息
data：与该 Znode 关联的数据
children：该 Znode 下的子节点

内存结构

Zookeeper 的数据模型是树结构，在内存数据库中，存储了整棵树的内容，包括所有的节点路径、节点数据、ACL 信息，Zookeeper 会定时将这个数据存储到磁盘上。

DataTree：DataTree 是内存数据存储的核心，是一个树结构，代表了内存中一份完整的数据。DataTree 不包含任何与网络、客户端连接及请求处理相关的业务逻辑，是一个独立的组件。
DataNode：DataNode 是数据存储的最小单元，其内部除了保存了结点的数据内容、ACL 列表、节点状态之外，还记录了父节点的引用和子节点列表两个属性，其也提供了对子节点列表进行操作的接口。
ZKDatabase：Zookeeper 的内存数据库，管理 Zookeeper 的所有会话、DataTree 存储和事务日志。ZKDatabase 会定时向磁盘 dump 快照数据，同时在 Zookeeper 启动时，会通过磁盘的事务日志和快照文件恢复成一个完整的内存数据库。

节点容量限制

管理调度数据，比如分布式应用中的配置文件信息、状态信息、汇集位置等等。这些数据的共同特性就是它们都是很小的数据，通常以 KB 为大小单位。ZooKeeper 的服务器和客户端都被设计为严格检查并限制每个 Znode 的数据大小至多 1M，但常规使用中应该远小于此值。

版本号

ZooKeeper 中的版本表示的是对数据节点的数据内容、子节点列表，或是节点 ACL 信息的修改次数。
代码：PrepRequestProcessor
在 ZooKeeper 中，版本号主要用于更新时进行并发校验。

原子读写

ZooKeeper 中的每个节点存储的数据要被原子性的操作。也就是说读操作将获取与节点相关的所有数据，写操作也将替换掉节点的所有数据。
The data stored at each znode in a namespace is read and written atomically. Reads get all the data bytes associated with a znode and a write replaces all the data. Each node has an Access Control List (ACL) that restricts who can do what.

ACL

Zookeeper 内部存储了分布式系统运行时状态的元数据，这些元数据会直接影响基于 Zookeeper 进行构造的分布式系统的运行状态，如何保障系统中数据的安全，从而避免因误操作而带来的数据随意变更而导致的数据库异常十分重要，Zookeeper 提供了一套完善的 ACL 权限控制机制来保障数据的安全。
我们可以从三个方面来理解 ACL 机制：权限模式 Scheme、授权对象 ID、权限 Permission，通常使用”scheme:id:permission”来标识一个有效的 ACL 信息。
每一个节点都拥有自己的 ACL(访问控制列表)，这个列表规定了用户的权限，即限定了特定用户对目标节点可以执行的操作。

序列

当创建 Znode 的时候，用户可以请求在 ZooKeeper 的路径结尾添加一个递增的计数。这个计数对于此节点的父节点来说是唯一的，它的格式为”%10d”(10 位数字，没有数值的数位用 0 补充，例如”0000000001”)。当计数值大于 232-1 时，计数器将溢出。

通知机制（Watcher）

ZooKeeper 实现这些分布式进程的状态（ZNode 的 Data、Children）共享时，基于性能的考虑采用了类似的异步非阻塞的主动通知模式即 Watch 机制，使得分布式进程之间的“共享状态通信”更加实时高效，其实这也是 ZooKeeper 的主要任务决定的—协调。
ZooKeeper-Watch
Zookeeper 的 Watcher 机制主要包括客户端线程、客户端 WatcherManager、Zookeeper 服务器三部分。

客户端在查询 znode 时可以向 ZooKeeper 服务器的WatchManager注册 watch 事件，表示想要监听该 znode 的节点状态，同时客户端本地会存储该监听器相关的信息在ZKWatchManager中；
当节点状态发生改变时（znode 的增、删、改）将会触发 watch 所对应的事件，当 watch 被触发时，ZooKeeper 将会给相应会话客户端发送且仅发送一条通知，因为 watch 只能被触发一次，这样可以减少网络流量；
代码：org.apache.zookeeper.server.watch.WatchManager#triggerWatch
客户端在本地响应式的回调相关 Watcher 的 Handler；

ZooKeeper 对节点的 watch 监听通知不是永久的，一个 watch 事件是一个一次性的触发器，当被设置了 watch 的数据发生了改变的时间，则服务器将这个改变发送给设置了 watch 的客户端，以便通知它们。
为什么不是永久的？举个例子，如果服务端变动频繁，而监听的客户端很多的情况下，每次变动都要通知到所有的客户端，这太消耗性能了。
一般是客户端执行 getData(“/节点 A”, true)，如果节点 A 发生了变更或删除，客户端会得到它的 watch 事件，但是在之后节点 A 又发生了变更，而客户端又没有设置 watch 事件，就不再给客户端发送了。
在实际应用中，很多情况下，我们的客户端不需要知道服务端的每一次变动，我只要最新的数据即可。

通信模型（Session）

当 client 连接 zookeeper 时，会初始化一个 session，session 有一个超时时间，如在超时时间内，zookeeper 没有从 client 收到任何信息（zookeeper 会发状态检测信息），则认为 client 故障了，此时 zookeeper 会关闭这个 session（session 也可由 client 显式关闭）。

读写数据

read_write

写数据，一个客户端进行写数据请求时，如果是 follower 接收到写请求，就会把请求转发给 Leader，Leader 通过内部的 Zab 协议进行原子广播，直到所有 Zookeeper 节点都成功写了数据后（内存同步以及磁盘更新），这次写请求算是完成，然后 Zookeeper Service 就会给 Client 发回响应。
读数据，因为集群中所有的 Zookeeper 节点都呈现一个同样的命名空间视图（就是结构数据），上面的写请求已经保证了写一次数据必须保证集群所有的 Zookeeper 节点都是同步命名空间的，所以读的时候可以在任意一台 Zookeeper 节点上。

客户端启动流程

实例化ZooKeeper
初始化过程中会创建一个客户端的 Watcher 管理器ClientWatchManager。
构造 ZooKeeper 服务器地址列表管理器HostProvider
构造 ZooKeeper 时传入的服务器地址由HostProvider管理。
创建并初始化客户端网络连接器ClientCnxn
创建ClientCnxn的同时，还会初始化客户端两个核心队列outgoingQueue和pendingQueue作为请求发送队列和服务端响应的等待队列。
初始化 SendThread 和 EventThread
前者管理客户端和服务端之间的所有网络 IO，后者用于进行客户端的事件处理。

服务端启动流程

ZooKeeper 集群原理

ZooKeeper 适合读多写少的场景

ZooKeeper 适合读频率远大于写的场景，下图是 ZooKeeper 官网给出的吞吐量/读请求比率的压测结果
read_dominant
The figure ZooKeeper Throughput as the Read-Write Ratio Varies is a throughput graph of ZooKeeper release 3.2 running on servers with dual 2Ghz Xeon and two SATA 15K RPM drives.
为什么这个比率这么悬殊呢？主要是由于 ZooKeeper 的数据同步机制，它的核心是ZAB协议。

分布式协调和 CAP

作为一个分布式系统，分区容错性是一个必须要考虑的关键点。一个分布式系统一旦丧失了分区容错性，也就表示放弃了扩展性。因为在分布式系统中，网络故障是经常出现的，一旦出现在这种问题就会导致整个系统不可用是绝对不能容忍的。所以，大部分分布式系统都会在保证分区容错性的前提下在一致性和可用性之间做权衡。
ZooKeeper 是个 CP（一致性+分区容错性）的分布式系统，即任何时刻对 ZooKeeper 的访问请求能得到一致的数据结果，同时系统对网络分割具备容错性；但是它不能保证每次服务请求的可用性。也就是在极端环境下，ZooKeeper 可能会丢弃一些请求，消费者程序需要重新请求才能获得结果。
ZooKeeper 是分布式协调服务，它的职责是保证数据在其管辖下的所有服务之间保持同步、一致；所以就不难理解为什么 ZooKeeper 被设计成 CP 而不是 AP 特性的了。而且，作为 ZooKeeper 的核心实现算法 Zab，就是解决了分布式系统下数据如何在多个服务之间保持同步问题的。
CAP

ZooKeeper 集群系统的特性

Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
①Failure and recovery of a follower
②Failure and recovery of a different follower
③Failure of the leader
④Failure and recovery of two followers
⑤Failure of another leader
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.

ZooKeeper 消息系统提供的保证

ZooKeeper 能在服务器之间建立点对点的 FIFO 通道，满足：

Reliable delivery
If a message, m, is delivered by one server, it will be eventually delivered by all servers.
Total order
If a message is delivered before message b by one server, a will be delivered before b by all servers. If a and b are delivered messages, either a will be delivered before b or b will be delivered before a.
Causal order
If a message b is sent after a message a has been delivered by the sender of b, a must be ordered before b. If a sender sends c after sending b, c must be ordered after b.

ZAB 协议依赖的 TCP 协议特性

Ordered delivery
Data is delivered in the same order it is sent and a message m is delivered only after all messages sent before m have been delivered. (The corollary to this is that if message m is lost all messages after m will be lost.)
No message after close
Once a FIFO channel is closed, no messages will be received from it.

Service 网络结构（replicated、Leader-Follower）

Like the distributed processes it coordinates, ZooKeeper itself is intended to be replicated over a sets of hosts called an ensemble.

The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with a transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
Clients connect to a single ZooKeeper server. The client maintains a TCP connection through which it sends requests, gets responses, gets watch events, and sends heart beats. If the TCP connection to the server breaks, the client will connect to a different server.
Zookeeper 的工作集群可以简单分成两类，一个是 Leader，唯一一个，其余的都是 follower，如何确定 Leader 是通过内部选举确定的。

Leader 和各个 follower 是互相通信的，对于 Zookeeper 系统的数据都是保存在内存里面的，同样也会备份一份在磁盘上。
如果 Leader 挂了，Zookeeper 集群会重新选举，在毫秒级别就会重新选举出一个 Leader。
集群中除非有一半以上的 Zookeeper 节点挂了，Zookeeper Service 才不可用。

ordered

ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions. Subsequent operations can use the order to implement higher-level abstractions, such as synchronization primitives.

Total order 原理

As stated above, ZooKeeper guarantees a total order of messages, and it also guarantees a total order of proposals. ZooKeeper exposes the total ordering using a ZooKeeper transaction id (zxid). All proposals will be stamped with a zxid when it is proposed and exactly reflects the total ordering. Proposals are sent to all ZooKeeper servers and committed when a quorum of them acknowledge the proposal. If a proposal contains a message, the message will be delivered when the proposal is committed. Acknowledgement means the server has recorded the proposal to persistent storage. Our quorums have the requirement that any pair of quorum must have at least one server in common. We ensure this by requiring that all quorums have size (n/2+1) where n is the number of servers that make up a ZooKeeper service.
The zxid has two parts: the epoch and a counter. In our implementation the zxid is a 64-bit number. We use the high order 32-bits for the epoch and the low order 32-bits for the counter. Because it has two parts represent the zxid both as a number and as a pair of integers, (epoch, count). The epoch number represents a change in leadership. Each time a new leader comes into power it will have its own epoch number. We have a simple algorithm to assign a unique zxid to a proposal: the leader simply increments the zxid to obtain a unique zxid for each proposal. Leadership activation will ensure that only one leader uses a given epoch, so our simple algorithm guarantees that every proposal will have a unique id.

多数派（Quorum）

一个由相同应用复制出来的组被称为一个多数派（Quorum），一个多数派中的所有服务器的配置文件都是一致的。
Atomic broadcast and leader election use the notion of quorum to guarantee a consistent view of the system. By default, ZooKeeper uses majority quorums, which means that every voting that happens in one of these protocols requires a majority to vote on. One example is acknowledging a leader proposal: the leader can only commit once it receives an acknowledgement from a quorum of servers.
If we extract the properties that we really need from our use of majorities, we have that we only need to guarantee that groups of processes used to validate an operation by voting (e.g., acknowledging a leader proposal) pairwise intersect in at least one server. Using majorities guarantees such a property. However, there are other ways of constructing quorums different from majorities. For example, we can assign weights to the votes of servers, and say that the votes of some servers are more important. To obtain a quorum, we get enough votes so that the sum of weights of all votes is larger than half of the total sum of all weights.
A different construction that uses weights and is useful in wide-area deployments (co-locations) is a hierarchical one. With this construction, we split the servers into disjoint groups and assign weights to processes. To form a quorum, we have to get a hold of enough servers from a majority of groups G, such that for each group g in G, the sum of votes from g is larger than half of the sum of weights in g. Interestingly, this construction enables smaller quorums. If we have, for example, 9 servers, we split them into 3 groups, and assign a weight of 1 to each server, then we are able to form quorums of size 4. Note that two subsets of processes composed each of a majority of servers from each of a majority of groups necessarily have a non-empty intersection. It is reasonable to expect that a majority of co-locations will have a majority of servers available with high probability.
With ZooKeeper, we provide a user with the ability of configuring servers to use majority quorums, weights, or a hierarchy of groups.

与通信相关的概念

Packet
a sequence of bytes sent through a FIFO channel
Proposal（怎么翻译啊？？？）
a unit of agreement. Proposals are agreed upon by exchanging packets with a quorum of ZooKeeper servers. Most proposals contain messages, however the NEW_LEADER proposal is an example of a proposal that does not correspond to a message.
Message
a sequence of bytes to be atomically broadcast to all ZooKeeper servers. A message put into a proposal and agreed upon before it is delivered.

使用超时来容错

FLP proved that consensus cannot be achieved in asynchronous distributed systems if failures are possible. To ensure we achieve consensus in the presence of failures we use timeouts. However, we rely on times for liveness not for correctness. So, if timeouts stop working (clocks malfunction for example) the messaging system may hang, but it will not violate its guarantees.

两阶段 Messaging

Leader activation
In this phase a leader establishes the correct state of the system and gets ready to start making proposals.
Active messaging
In this phase a leader accepts messages to propose and coordinates message delivery.

ZooKeeper is a holistic protocol. We do not focus on individual proposals, rather look at the stream of proposals as a whole. Our strict ordering allows us to do this efficiently and greatly simplifies our protocol. Leadership activation embodies this holistic concept. A leader becomes active only when a quorum of followers (The leader counts as a follower as well. You can always vote for yourself ) has synced up with the leader, they have the same state. This state consists of all of the proposals that the leader believes have been committed and the proposal to follow the leader, the NEW_LEADER proposal. (Hopefully you are thinking to yourself, Does the set of proposals that the leader believes has been committed included all the proposals that really have been committed? The answer is yes. Below, we make clear why.)

ZAB 协议

Zookeeper 的核心是广播，这个机制保证了各个 Server 之间的同步。实现这个机制的协议叫做 Zab 协议。
Zab（ZooKeeper Atomic Broadcast）原子消息广播协议作为数据一致性的核心算法，Zab 协议是专为 Zookeeper 设计的支持崩溃恢复原子消息广播算法。
Zab 协议核心如下：
所有的事务请求必须一个全局唯一的服务器（Leader）来协调处理，集群其余的服务器称为 follower 服务器。Leader 服务器负责将一个客户端请求转化为事务提议（Proposal），并将该 proposal 分发给集群所有的 follower 服务器。之后 Leader 服务器需要等待所有的 follower 服务器的反馈，一旦超过了半数的 follower 服务器进行了正确反馈后，那么 Leader 服务器就会再次向所有的 follower 服务器分发 commit 消息，要求其将前一个 proposal 进行提交。
以上对Zab协议的简单描述有点像两阶段提交协议，只是“Prepare”阶段现在只需要半数以上返回正确反馈就可以进入“Commit”阶段了。

ZAB 的两种模式

Zab 协议包括两种基本的模式：崩溃恢复和消息广播。

当整个服务框架启动过程中或 Leader 服务器出现网络中断、崩溃退出与重启等异常情况时，Zab 协议就会进入恢复模式并选举产生新的 Leader 服务器。
当选举产生了新的 Leader 服务器，同时集群中已经有过半的机器与该 Leader 服务器完成了状态同步之后，Zab 协议就会退出恢复模式，状态同步是指数据同步，用来保证集群在过半的机器能够和 Leader 服务器的数据状态保持一致。
当集群中已经有过半的 Follower 服务器完成了和 Leader 服务器的状态同步，那么整个服务框架就可以进入消息广播模式。
当一台同样遵守 Zab 协议的服务器启动后加入到集群中，如果此时集群中已经存在一个 Leader 服务器在负责进行消息广播，那么加入的服务器就会自觉地进入数据恢复模式：找到 Leader 所在的服务器，并与其进行数据同步，然后一起参与到消息广播流程中去。

Zookeeper 只允许唯一的一个 Leader 服务器来进行事务请求的处理，Leader 服务器在接收到客户端的事务请求后，会生成对应的事务提议并发起一轮广播协议，而如果集群中的其他机器收到客户端的事务请求后，那么这些非 Leader 服务器会首先将这个事务请求转发给 Leader 服务器。

选举算法概述

当 leader 崩溃或者 leader 失去大多数的 follower，这时 zk 进入恢复模式，恢复模式的主要目的是选出一个唯一 Leader。
用一个小组开发作为场景，假设这个小组有两个开发经理（Leader）A 和 B，这两名经理都会发出 proposal 指令并在得到超过半数下属的答复后再发出一个 accept 指令，每条指令都带上当时的时间，下属只会在接受到时间最近的一条 proposal 对应的 accept 后才会真正开始工作，现在考虑这样的情况：

A 发出 proposal 指令，并带上指令；
下属接受该指令；
B 发出 proposal 指令；
下属接受 B 的指令；
A 发出 accept 指令；
被下属忽略（因为 A 的 accept 指令的时间戳晚于 B 的 proposal 指令）；
A 于是又发出下一个 proposal；
下属接受新 proposal（覆盖掉 B 的指令）；
接下来轮到 B 的指令被忽略掉了，这样会陷入一个死循环。

因此，`Leader 必须是唯一的，不然会发生上述脑裂现象。

ZooKeeper 是如何保证事务的顺序一致性的

ZooKeeper 采用了递增的事务 Id 来标识，所有的 proposal 都在被提出的时候加上了 zxid，zxid 实际上是一个 64 位的数字，高 32 位是 epoch 用来标识 leader 是否发生改变，如果有新的 leader 产生出来，epoch 会自增，低 32 位用来递增计数。当新产生 proposal 的时候，会依据数据库的两阶段过程，首先会向其他的 server 发出事务执行请求，如果超过半数的机器都能执行并且能够成功，那么就会开始执行。

Leader Activation

Leader activation includes leader election. We currently have two leader election algorithms in ZooKeeper: LeaderElection and FastLeaderElection (AuthFastLeaderElection is a variant of FastLeaderElection that uses UDP and allows servers to perform a simple form of authentication to avoid IP spoofing). ZooKeeper messaging doesn’t care about the exact method of electing a leader has long as the following holds:

The leader has seen the highest zxid of all the followers.
A quorum of servers have committed to following the leader.

Of these two requirements only the first, the highest zxid amoung the followers needs to hold for correct operation. The second requirement, a quorum of followers, just needs to hold with high probability. We are going to recheck the second requirement, so if a failure happens during or after the leader election and quorum is lost, we will recover by abandoning leader activation and running another election.
After leader election a single server will be designated as a leader and start waiting for followers to connect. The rest of the servers will try to connect to the leader. The leader will sync up with followers by sending any proposals they are missing, or if a follower is missing too many proposals, it will send a full snapshot of the state to the follower.
There is a corner case in which a follower that has proposals, U, not seen by a leader arrives. Proposals are seen in order, so the proposals of U will have a zxids higher than zxids seen by the leader. The follower must have arrived after the leader election, otherwise the follower would have been elected leader given that it has seen a higher zxid. Since committed proposals must be seen by a quorum of servers, and a quorum of servers that elected the leader did not see U, the proposals of you have not been committed, so they can be discarded. When the follower connects to the leader, the leader will tell the follower to discard U.
A new leader establishes a zxid to start using for new proposals by getting the epoch, e, of the highest zxid it has seen and setting the next zxid to use to be (e+1, 0), fter the leader syncs with a follower, it will propose a NEW_LEADER proposal. Once the NEW_LEADER proposal has been committed, the leader will activate and start receiving and issuing proposals.
It all sounds complicated but here are the basic rules of operation during leader activation:

A follower will ACK the NEW_LEADER proposal after it has synced with the leader.
A follower will only ACK a NEW_LEADER proposal with a given zxid from a single server.
A new leader will COMMIT the NEW_LEADER proposal when a quorum of followers have ACKed it.
A follower will commit any state it received from the leader when the NEW_LEADER proposal is COMMIT.
A new leader will not accept new proposals until the NEW_LEADER proposal has been COMMITED.
If leader election terminates erroneously, we don’t have a problem since the NEW_LEADER proposal will not be committed since the leader will not have quorum. When this happens, the leader and any remaining followers will timeout and go back to leader election.

Leader 选举

Leader 选举是保证分布式数据一致性的关键所在。当 Zookeeper 集群中的一台服务器出现以下两种情况之一时，需要进入 Leader 选举。

服务器初始化启动。
服务器运行期间无法和 Leader 保持连接。

Zookeeper 在 3.4.0 版本后只保留了 TCP 版本的 FastLeaderElection 选举算法。当一台机器进入 Leader 选举时，当前集群可能会处于以下两种状态：

集群中已存在 Leader。
集群中不存在 Leader。

对于集群中已经存在 Leader 而言，此种情况一般都是某台机器启动得较晚，在其启动之前，集群已经在正常工作，对这种情况，该机器试图去选举 Leader 时，会被告知当前服务器的 Leader 信息，对于该机器而言，仅仅需要和 Leader 机器建立起连接，并进行状态同步即可。

而在集群中不存在 Leader 情况下则会相对复杂，其步骤如下：

第一次投票。无论哪种导致进行 Leader 选举，集群的所有机器都处于试图选举出一个 Leader 的状态，即 LOOKING 状态，LOOKING 机器会向所有其他机器发送消息，该消息称为投票。投票中包含了 SID（服务器的唯一标识）和 ZXID（事务 ID），(SID, ZXID)形式来标识一次投票信息。假定 Zookeeper 由 5 台机器组成，SID 分别为 1、2、3、4、5，ZXID 分别为 9、9、9、8、8，并且此时 SID 为 2 的机器是 Leader 机器，某一时刻，1、2 所在机器出现故障，因此集群开始进行 Leader 选举。在第一次投票时，每台机器都会将自己作为投票对象，于是 SID 为 3、4、5 的机器投票情况分别为(3, 9)，(4, 8)， (5, 8)。
变更投票。每台机器发出投票后，也会收到其他机器的投票，每台机器会根据一定规则来处理收到的其他机器的投票，并以此来决定是否需要变更自己的投票，这个规则也是整个 Leader 选举算法的核心所在，其中术语描述如下

vote_sid：接收到的投票中所推举 Leader 服务器的 SID。
vote_zxid：接收到的投票中所推举 Leader 服务器的 ZXID。
self_sid：当前服务器自己的 SID。
self_zxid：当前服务器自己的 ZXID。

每次对收到的投票的处理，都是对(vote_sid, vote_zxid)和(self_sid, self_zxid)对比的过程。

规则一：如果 vote_zxid 大于 self_zxid，就认可当前收到的投票，并再次将该投票发送出去。
规则二：如果 vote_zxid 小于 self_zxid，那么坚持自己的投票，不做任何变更。
规则三：如果 vote_zxid 等于 self_zxid，那么就对比两者的 SID，如果 vote_sid 大于 self_sid，那么就认可当前收到的投票，并再次将该投票发送出去。
规则四：如果 vote_zxid 等于 self_zxid，并且 vote_sid 小于 self_sid，那么坚持自己的投票，不做任何变更。

Leader选举
结合上面规则，给出下面的集群变更过程。
3. 确定 Leader。经过第二轮投票后，集群中的每台机器都会再次接收到其他机器的投票，然后开始统计投票，如果一台机器收到了超过半数的相同投票，那么这个投票对应的 SID 机器即为 Leader。此时 Server3 将成为 Leader。

由上面规则可知，通常那台服务器上的数据越新（ZXID 会越大），其成为 Leader 的可能性越大，也就越能够保证数据的恢复。如果 ZXID 相同，则 SID 越大机会越大。

Active Messaging

Leader Activation does all the heavy lifting. Once the leader is coronated he can start blasting out proposals. As long as he remains the leader no other leader can emerge since no other leader will be able to get a quorum of followers. If a new leader does emerge, it means that the leader has lost quorum, and the new leader will clean up any mess left over during her leadership activation.
ZooKeeper messaging operates similar to a classic two-phase commit.
两阶段Messaging
All communication channels are FIFO, so everything is done in order. Specifically the following operating constraints are observed:

The leader sends proposals to all followers using the same order. Moreover, this order follows the order in which requests have been received. Because we use FIFO channels this means that followers also receive proposals in order.
Followers process messages in the order they are received. This means that messages will be ACKed in order and the leader will receive ACKs from followers in order, due to the FIFO channels. It also means that if message $m$ has been written to non-volatile storage, all messages that were proposed before $m$ have been written to non-volatile storage.
The leader will issue a COMMIT to all followers as soon as a quorum of followers have ACKed a message. Since messages are ACKed in order, COMMITs will be sent by the leader as received by the followers in order.
COMMITs are processed in order. Followers deliver a proposals message when that proposal is committed.

Zab 协议的消息广播过程使用是一个原子广播协议，类似一个 2PC 提交过程。具体的：

ZooKeeper 使用单一主进程 Leader 用于处理客户端所有事务请求，并采用 Zab 的原子广播协议，将服务器数据状态变更以事务 Proposal 的形式广播 Follower 上，因此能很好的处理客户端的大量并发请求。
另一方面，由于事务间可能存在着依赖关系，Zab 协议保证 Leader 广播的变更序列被顺序的处理，有些状态的变更必须依赖于比它早生成的那些状态变更。
最后，考虑到主进程 Leader 在任何时候可能崩溃或者异常退出，因此 Zab 协议还要 Leader 进程崩溃的时候可以重新选出 Leader 并且保证数据的完整性；Follower 收到 Proposal 后，写到磁盘，返回 Ack。Leader 收到大多数 ACK 后，广播 Commit 消息，自己也提交该消息。Follower 收到 Commit 之后，提交该消息。

Zab 协议简化了 2PC 事务提交：

去除中断逻辑移除，follower 要么 ack，要么抛弃 Leader。
Leader 不需要所有的 Follower 都响应成功，只要一个多数派 Ack 即可。

Paxos & ZAB 区别

Paxos 是分布式选举算法，ZooKeeper 使用的 ZAB 协议（ZooKeeper Atomic Broadcast）。
二者有相同的地方。比如都有一个 Leader，用来协调 N 个 Follower 的运行；Leader 要等待超半数的 Follower 做出正确反馈之后才进行提案；二者都有一个值来代表 Leader 的周期（zxid 的前 32 位）。
二者不同的地方在于，ZAB 用来构建高可用的分布式数据主备系统（ZooKeeper），Paxos 是用来构建分布式一致性状态机系统的。

Summary

So there you go. Why does it work? Specifically, why does is set of proposals believed by a new leader always contain any proposal that has actually been committed? First, all proposals have a unique zxid, so unlike other protocols, we never have to worry about two different values being proposed for the same zxid; followers (a leader is also a follower) see and record proposals in order; proposals are committed in order; there is only one active leader at a time since followers only follow a single leader at a time; a new leader has seen all committed proposals from the previous epoch since it has seen the highest zxid from a quorum of servers; any uncommited proposals from a previous epoch seen by a new leader will be committed by that leader before it becomes active.

Comparisons

Isn’t this just Multi-Paxos? No, Multi-Paxos requires some way of assuring that there is only a single coordinator. We do not count on such assurances. Instead we use the leader activation to recover from leadership change or old leaders believing they are still active.
Isn’t this just Paxos? Your active messaging phase looks just like phase 2 of Paxos? Actually, to us active messaging looks just like 2 phase commit without the need to handle aborts. Active messaging is different from both in the sense that it has cross proposal ordering requirements. If we do not maintain strict FIFO ordering of all packets, it all falls apart. Also, our leader activation phase is different from both of them. In particular, our use of epochs allows us to skip blocks of uncommitted proposals and to not worry about duplicate proposals for a given zxid.