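ES1_4 Documents

Posted 2019-08-21, updated 2025-07-07, filed under ElasticSearch

Common operations - objects (documents)

Documents

Elasticsearch (ES) is document-oriented: the document is the smallest unit of searchable data.
A document is serialized as JSON and stored in ES.
A JSON object is made up of fields, and each field has a field type.
Every document has a unique ID.
You can supply this ID yourself or let ES generate it.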

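Documents and fields (Document, Field)

A document is the basic unit of information that can be indexed, and it is expressed as JSON.
An index/type can hold any number of documents, each with a unique id.
Each document contains multiple fields, i.e. the fields of the JSON data.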

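Document metadata

A document contains not only its data but also metadata, i.e. information about the document. Three metadata elements are required (_index, _type and _id); the others listed below are also commonly encountered:

_index
An index should be a collection of documents grouped together because they share common characteristics.
Index names must be lowercase, must not begin with an underscore, and must not contain commas.
_type
Lucene has no concept of document types; Elasticsearch records the class of object a document represents in a metadata field named _type. Data in an index may be only loosely related, but it is often useful to define explicit sub-partitions. Documents of different types may have different fields, but ideally they should be very similar.
A _type name may be upper or lower case, must not begin with an underscore or a period, should not contain commas, and is limited to 256 characters.
When retrieving documents of a given type, Elasticsearch applies a filter on the _type field so that only documents of that type are returned.
_id
The document's unique identifier. Combined with _index and _type, it uniquely identifies a document in Elasticsearch.
The id can also be generated by Elasticsearch.
_version
Every document in Elasticsearch has a version number. Each change to a document (including deletion) increments _version. This field ensures that changes are applied in the correct order across nodes.
Version numbers, whether internal or external, must be positive long values in the range (0, 9.2E+18).
_source
The original JSON document sent to Elasticsearch at index time.
_score
The relevance score.
~~_all~~
Combined the content of all fields into a single field; it has been deprecated.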

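Document properties

The most important mapping settings are listed below (the type and index values follow the older ES 1.x/2.x mapping syntax):

type
The field's data type, for example string or date
index
Whether the field should be searchable as full text (analyzed), as an exact value (not_analyzed), or not searchable at all (no)
analyzer
The analyzer used for a full-text field at index time and at search time
_source
Stores the JSON string that represents the document body. Like all stored fields, _source is compressed before it is written to disk. The field serves the following purposes:
1. Search results include the whole document, so there is no need to fetch it from another data store.
2. Partial update requests do not work without the _source field.
3. When your mapping changes and you need to reindex your data, you can do so directly from Elasticsearch instead of retrieving all documents from another (usually slower) data store.
4. When you do not need the whole document, individual fields can be extracted from _source and returned by a get or search request.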
  GET /_search
  {
  "query": { "match_all": {}},
  "_source": [ "title", "created" ]
  }
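5. Debugging queries is easier, because you can see exactly what each document contains instead of guessing from a list of ids.
  The _source field can also be disabled with the following mapping: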
  PUT /my_index
  {
  "mappings": {
  "my_type": {
  "_source": {
  "enabled": false
  }
  }
  }
  }

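Objects and documents

The terms object and document are usually interchangeable, but there is one difference:
An object is simply a JSON object, similar to a hash, hashmap, dictionary or associative array, and it may contain other nested objects.
A document is the top-level or root object, which is serialized as JSON and stored in Elasticsearch with a unique ID and the required document metadata.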

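Root object

The top level of a mapping is called the root object. It may contain the following:

A properties section that lists the mapping of every field the document may contain
Various metadata fields, all starting with an underscore, for example _type, _id and _source
Settings that control how new fields are handled dynamically, for example analyzer, dynamic_date_formats and dynamic_templates
Other settings that apply both to the root object and to fields of type object, for example enabled, dynamic and include_in_all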

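Operation types

Document CRUD:

Index
If the ID does not exist, an index operation creates a new document; otherwise it deletes the existing document and creates a new one, and the version number is incremented
PUT my_index/_doc/1
Create
A create operation fails if the ID already exists
PUT my_index/_create/1
Without an ID, one is generated automatically
POST my_index/_doc
Read
GET my_index/_doc/1
Update
The document must already exist; an update only changes the fields that are supplied
POST my_index/_update/1
Delete
DELETE my_index/_doc/1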
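
The difference between Index and Create can be illustrated with a toy in-memory store. This is only a sketch: the class `CrudSemanticsSketch` and its methods are hypothetical names invented for illustration and have nothing to do with the Elasticsearch client API; the version counter simply mimics the behavior described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory store (hypothetical) contrasting index (upsert) and create semantics. */
class CrudSemanticsSketch {
    /** A stored document: its JSON source plus a version counter. */
    static final class Doc {
        final String source;
        final long version;
        Doc(String source, long version) { this.source = source; this.version = version; }
    }

    private final Map<String, Doc> docs = new HashMap<>();

    /** Index: create when the ID is absent, otherwise replace and bump the version. */
    long index(String id, String source) {
        Doc old = docs.get(id);
        long version = (old == null) ? 1 : old.version + 1;
        docs.put(id, new Doc(source, version));
        return version;
    }

    /** Create: refuse to overwrite when the ID is already taken. */
    long create(String id, String source) {
        if (docs.containsKey(id)) {
            throw new IllegalStateException("document already exists: " + id);
        }
        docs.put(id, new Doc(source, 1));
        return 1;
    }

    public static void main(String[] args) {
        CrudSemanticsSketch store = new CrudSemanticsSketch();
        System.out.println(store.index("1", "{\"views\":0}")); // first write
        System.out.println(store.index("1", "{\"views\":1}")); // replaces the document
        try {
            store.create("1", "{}");
        } catch (IllegalStateException e) {
            System.out.println("create rejected: " + e.getMessage());
        }
    }
}
```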

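Common responses

Problem	Cause
Cannot connect	Network failure or the cluster is down
Connection cannot be closed	Network failure or a node error
429	The cluster is too busy
4xx	The request body is malformed
500	Internal cluster error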

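Update - PUT

To update an existing object you must specify its id; if it does not exist, it is created automatically. After the update, the document's _version is incremented. Internally, Elasticsearch marks the old document as deleted and adds an entirely new one. The old version is no longer accessible, but it does not disappear immediately; Elasticsearch cleans up deleted documents in the background as you continue to index more data.
Concurrent writers can still overwrite each other's changes. Elasticsearch uses _version for optimistic concurrency control: a request that carries a version parameter fails if the document's current _version no longer matches, so a change made by another process in the meantime is not silently lost.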

PUT /website/blog/123
{
  "title": "My first blog entry",
  "text": "Just trying this out...",
  "date": "2014/01/01"
}

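If you already have your own _id and want to create only, you must tell Elasticsearch to accept the index request only when no document with the same _index, _type and _id exists, rather than overwriting it. There are two ways:

# An index operation with an explicit ID is normally an upsert; op_type=create makes it create-only
PUT /website/blog/123?op_type=create
{ ... }
# Use the _create endpoint
PUT /website/blog/123/_create
{ ... }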

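Documents are immutable: they cannot be modified, only replaced. The update API must follow the same rule. From the outside it looks as if part of a document is updated in place. Internally, however, the update API uses the same retrieve-modify-reindex process described earlier. The difference is that this process happens inside the shard, which avoids the network overhead of multiple requests. Shortening the time between the retrieve and reindex steps also reduces the chance of conflicts with changes from other processes.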

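Create - POST

No id needs to be specified; Elasticsearch generates one. Auto-generated IDs are URL-safe, Base64-encoded GUID strings 20 characters long. They are produced by a modified FlakeID scheme, which lets multiple nodes generate unique IDs in parallel with an essentially zero chance of collision.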

POST /website/blog/
{
  "title": "My second blog entry",
  "text": "Still trying this out...",
  "date": "2014/01/01"
}
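
To make the "URL-safe, Base64, 20 characters" shape concrete, here is a minimal sketch. It is not the Elasticsearch generator: the real scheme is FlakeID-like and derives the bytes from more than randomness, whereas this hypothetical `AutoIdSketch` simply encodes 15 random bytes, which is exactly what 20 unpadded Base64 characters can hold.

```java
import java.security.SecureRandom;
import java.util.Base64;

/** Hypothetical sketch: produces IDs with the same shape as auto-generated ones. */
class AutoIdSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** 15 bytes = 120 bits = 20 Base64 characters, so no padding is needed. */
    static String randomId() {
        byte[] bytes = new byte[15];
        RANDOM.nextBytes(bytes);
        // The URL-safe alphabet uses '-' and '_' instead of '+' and '/'.
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println(randomId());
    }
}
```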

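Partial update - POST

In its simplest form, an update request takes a partial document as the doc parameter and merges it into the existing document. Objects are merged together: existing fields are overwritten and new fields are added.

# The document must already exist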
POST /website/blog/1/_update
{
   "doc" : {
      "tags" : [ "testing" ],
      "views": 0
   }
}
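
The merge behavior of the doc parameter can be sketched as a recursive map merge. This is an illustration under assumptions, not Elasticsearch code: `PartialUpdateSketch.merge` is a hypothetical helper, documents are modeled as `Map<String, Object>`, nested objects are merged recursively, and every other value (including arrays) replaces the old one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of merging a partial document into an existing one. */
class PartialUpdateSketch {
    /** Returns a merged copy; neither input map is modified. */
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> source, Map<String, Object> partial) {
        Map<String, Object> merged = new HashMap<>(source);
        for (Map.Entry<String, Object> e : partial.entrySet()) {
            Object existing = merged.get(e.getKey());
            if (existing instanceof Map && e.getValue() instanceof Map) {
                // Both sides are objects: merge field by field.
                merged.put(e.getKey(),
                        merge((Map<String, Object>) existing, (Map<String, Object>) e.getValue()));
            } else {
                // Scalars and arrays: the new value overwrites the old one.
                merged.put(e.getKey(), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new HashMap<>();
        source.put("title", "My first blog entry");
        Map<String, Object> partial = new HashMap<>();
        partial.put("tags", Arrays.asList("testing"));
        partial.put("views", 0);
        System.out.println(merge(source, partial));
    }
}
```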

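Partial updates with scripts: a script in the update API can change the contents of _source, which is exposed to the script as ctx._source. Scripts run in a sandbox, and Painless is the default scripting language. The script below creates the page with views=1 if it does not exist (on the first run the upsert value is indexed as a new document; on later runs the document already exists, so the script is applied instead of the upsert and increments the views counter), deletes the page once it has been viewed 2 times, and otherwise increments the view count and adds a new tag: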

POST /website/blog/zVmOW2EBsZ0GEqF92yf6/_update
{
   "script" : {
      "source" : "if(ctx._source.views == params.count) { ctx.op = 'delete'} ctx._source.views+=1; ctx._source.tags.add(params.new_tag)",
      "params" : {
        "new_tag" : "search",
        "count": 2
      }
    },
    "upsert": {
        "views": 1
    }
}

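Retries:
As noted earlier, an update is a retrieve-modify-reindex process. The smaller the gap between the retrieve and reindex steps, the smaller the chance of a conflicting change, but the possibility cannot be eliminated entirely: a request from another process may still modify the document before update manages to reindex it. To avoid losing data, the update API reads the document's current _version during the retrieve step and passes it to the index request in the reindex step. If another process changed the document between the two steps, the _version does not match and the update request fails.
For many uses of partial update it does not matter that the document has changed. For example, if two processes both increment a page view counter, the order in which they happen hardly matters; if a conflict occurs, all we need to do is try the update again. This can be done automatically with the retry_on_conflict parameter, which sets how many times update should retry before failing. Its default value is 0.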

POST /website/blog/zVmOW2EBsZ0GEqF92yf6/_update?retry_on_conflict=5 
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 0
   }
}
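
The retry behavior can be simulated with a compare-and-set on a versioned value. The sketch below is an assumption-laden model, not the Elasticsearch implementation: `RetryOnConflictSketch` and its `betweenReadAndWrite` hook are hypothetical, the hook exists only to let a rival writer slip in deterministically, and the loop assumes that retry_on_conflict counts retries after the first attempt.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.IntUnaryOperator;

/** Hypothetical model of retrieve-modify-reindex with version checks and retries. */
class RetryOnConflictSketch {
    /** Immutable snapshot of the document: a counter plus its version. */
    static final class Versioned {
        final int views;
        final long version;
        Versioned(int views, long version) { this.views = views; this.version = version; }
    }

    private final AtomicReference<Versioned> doc = new AtomicReference<>(new Versioned(0, 1));

    int views() { return doc.get().views; }

    /** Reindex step: succeeds only if the stored version is still the one that was read. */
    private boolean indexIfVersion(long expectedVersion, int newViews) {
        Versioned current = doc.get();
        return current.version == expectedVersion
                && doc.compareAndSet(current, new Versioned(newViews, expectedVersion + 1));
    }

    /** One initial attempt plus up to retryOnConflict retries (an assumption of this sketch). */
    boolean update(IntUnaryOperator change, int retryOnConflict, Runnable betweenReadAndWrite) {
        for (int attempt = 0; attempt <= retryOnConflict; attempt++) {
            Versioned read = doc.get();                    // retrieve
            int changed = change.applyAsInt(read.views);   // modify
            betweenReadAndWrite.run();                     // a rival may write here
            if (indexIfVersion(read.version, changed)) {   // reindex with version check
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RetryOnConflictSketch store = new RetryOnConflictSketch();
        boolean[] interfered = {false};
        Runnable rival = () -> {
            if (!interfered[0]) {
                interfered[0] = true;
                store.update(v -> v + 1, 0, () -> {});
            }
        };
        System.out.println(store.update(v -> v + 1, 5, rival) + " views=" + store.views());
    }
}
```

With a rival that interferes once, an update with zero retries fails, while any positive retry budget lets both increments land.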

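GET (retrieve)

Adding the pretty parameter to the query string invokes Elasticsearch's pretty-print feature, which makes the JSON response more readable. Note that the _source field is not printed as a string; it appears as formatted JSON: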

GET /website/blog/123?pretty
GET /website/blog/123/_source
GET /website/blog/123?_source=title,text

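Combining several requests into one avoids the network latency and overhead of handling each request separately. If you need to retrieve many documents from Elasticsearch, putting the lookups into a single request with the multi-get (mget) API is faster than requesting the documents one by one.
The mget API expects a docs array; each element carries the metadata of a document to retrieve, namely _index, _type and _id. To retrieve only one or more specific fields, name them with the _source parameter: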

GET /_mget
{
   "docs" : [
      {
         "_index" : "website",
         "_type" : "blog",
         "_id" : "zVmOW2EBsZ0GEqF92yf6"
      },
      {
         "_index" : "website",
         "_type" : "blog",
         "_id" : 1,
         "_source": "views"
      }
   ]
}
GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}

HEAD (existence check)

If you only want to check whether a document exists and do not care about its content, use the HEAD method instead of GET.

HEAD /website/blog/124

DELETE (delete)

DELETE /website/blog/123

bulk (batch operations)

Every line, including the last one, must end with a newline character. The format is:

{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n

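The action/metadata line specifies which operation is applied to which document. The action must be one of the following:
create: create the document if it does not exist. Similar to POST or PUT /_create.
index: create a new document or replace an existing one. Similar to POST or PUT.
update: partially update a document. Similar to POST /_update.
delete: delete a document. Similar to DELETE.
The metadata should specify the _index, _type and _id of the document being indexed, created, updated or deleted. The metadata of each request overrides any defaults given in the request URL.
The request body line consists of the document's _source itself, i.e. the fields and values the document contains. It is required for index, create and update operations.
Why not use a single JSON array? Mainly for efficiency: parsing an array requires more RAM, and the JVM spends time on garbage collection. With the raw line-delimited data, you only need to pay attention to the separators (newlines) between records.
Each sub-request is executed independently, so the failure of one does not affect the others. If any sub-request fails, the top-level error flag is set to true and the error details are reported under the corresponding request. This also means that bulk requests are not atomic and cannot be used to implement transactions.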

POST /_bulk
{ "delete": { "_index": "website", "_type": "blog", "_id": "123" }} 
{ "create": { "_index": "website", "_type": "blog", "_id": "123" }}
{ "title": "My first blog post" }
{ "index": { "_index": "website", "_type": "blog" }}
{ "title": "My second blog post" }
{ "update": { "_index": "website", "_type": "blog", "_id": "123", "_retry_on_conflict" : 3} }
{ "doc" : {"title" : "My updated blog post"} }
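
Because the newline rule is easy to get wrong when a bulk body is assembled by hand, a small builder can enforce it. The following is a sketch only: `BulkBodySketch` is a hypothetical helper, not part of any Elasticsearch client, and it treats the JSON lines as opaque strings.

```java
/** Hypothetical builder for a newline-delimited bulk request body. */
class BulkBodySketch {
    private final StringBuilder body = new StringBuilder();

    /** Adds an action/metadata line and, unless sourceLine is null (e.g. delete), a body line. */
    BulkBodySketch add(String actionLine, String sourceLine) {
        appendLine(actionLine);
        if (sourceLine != null) {
            appendLine(sourceLine);
        }
        return this;
    }

    /** Every line must be single-line JSON and is terminated by '\n', including the last one. */
    private void appendLine(String json) {
        if (json.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("each bulk line must be single-line JSON");
        }
        body.append(json).append('\n');
    }

    String build() {
        return body.toString();
    }

    public static void main(String[] args) {
        String body = new BulkBodySketch()
                .add("{\"delete\":{\"_index\":\"website\",\"_type\":\"blog\",\"_id\":\"123\"}}", null)
                .add("{\"create\":{\"_index\":\"website\",\"_type\":\"blog\",\"_id\":\"123\"}}",
                     "{\"title\":\"My first blog post\"}")
                .build();
        System.out.print(body);
    }
}
```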

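There is an optimal size for bulk requests; beyond it, performance stops improving and may even degrade. The optimum is not a fixed value: it depends on the hardware, the size and complexity of the documents, and the overall indexing and search load.
Fortunately, the sweet spot is easy to find: index typical documents in batches and keep increasing the batch size. When performance starts to drop, the batch is too large. A good starting point is 1,000 to 5,000 documents per batch; if your documents are very large, use fewer documents per batch. The request itself should not be too large either: a good bulk request is roughly 5-15 MB in physical size.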
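
One way to experiment with batch sizes is to cap each batch by both document count and byte size. The sketch below assumes the documents are already serialized JSON strings; `BulkBatcherSketch.batches` and its parameters are hypothetical names, and the limits in `main` are just the starting values mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that splits documents into bulk-sized batches. */
class BulkBatcherSketch {
    /** Starts a new batch whenever adding a document would exceed maxDocs or maxBytes. */
    static List<List<String>> batches(List<String> docs, int maxDocs, long maxBytes) {
        List<List<String>> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long currentBytes = 0;
        for (String doc : docs) {
            long size = doc.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the trailing '\n'
            boolean full = current.size() >= maxDocs || currentBytes + size > maxBytes;
            if (full && !current.isEmpty()) {
                result.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            // A single oversized document still gets a batch of its own.
            current.add(doc);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("{\"title\":\"a\"}", "{\"title\":\"b\"}", "{\"title\":\"c\"}");
        // Start with up to 1,000 documents or about 5 MB per batch, then tune by measurement.
        System.out.println(batches(docs, 1000, 5L * 1024 * 1024).size());
    }
}
```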

ES2_2 Distributed document storage

Posted 2019-08-21, updated 2025-07-07, filed under ElasticSearch

[x] How ES avoids losing data on power failure


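ES2_3 Document operations: source code analysis

Posted 2019-08-21, updated 2025-07-07, filed under ElasticSearch

Write (POST, PUT, DELETE) flow: source code analysis

refresh and flush - the trade-off between real-time visibility and reliability

Persistence flow

Near real-time

Data written to ES cannot be searched immediately; a refresh must happen first. The default refresh interval is 1s, and it can be lowered to about 100ms.

Reliability

Search systems generally have modest reliability requirements. Data reliability is usually ensured by keeping the original data in another storage system; if the search system loses data, it is re-imported from that system and rebuilt.
ES uses a multi-replica model, which avoids data loss when a single machine fails. To improve read and write performance, however, ES only flushes Lucene segments to disk for persistence at intervals. This reduces disk IO, but a crash before the flush can easily lose data. ES solves this the way databases do with a CommitLog: it introduces a TransLog.
The TransLog flush frequency controls when buffered writes reach disk: either per request, flushing on every request, or by time, flushing once per interval. For performance reasons the interval is typically set to 5 seconds or 1 minute; the longer the interval, the lower the reliability.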

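ES persistence flow

We have already discussed how data is located on a particular node and shard.

On each shard, data is first written to Lucene; at this point it is still in memory.
After being written to Lucene's memory the data is not yet searchable: a refresh must first turn the in-memory objects into a complete Segment, which is then reopened before it can be searched.
A plain GET, however, is a lookup by ID and does not have to wait for the periodic refresh, so it is real-time. The original description is that such lookups read directly from the TransLog; in the engine code shown later in this post, a real-time GET instead triggers a refresh, and the TransLog read is used only for updates.
Next the TransLog is written, and after that the TransLog data is flushed to disk.

Unlike a database, which writes the CommitLog first and memory second, ES writes memory (Lucene) first and the TransLog second. The reason is that the Lucene in-memory write involves complex logic, such as analysis and field length limits, and fails easily. Writing Lucene first avoids filling the TransLog with invalid records, which keeps recovery simpler and faster.
Once the TransLog data has been flushed to disk, success is returned to the user.
After a comparatively long interval, Lucene flushes the newly created in-memory Segments to disk, and the TransLog is then cleared.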
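
The steps above can be modeled with a few in-memory lists. This is a deliberately simplified sketch and not how the engine is implemented: `ShardWritePathSketch` is a hypothetical class, documents are plain strings, and the translog is assumed to be durable as soon as `index` returns, matching the description that success is only returned after the TransLog reaches disk.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of one shard's write path: buffer, refresh, translog, flush. */
class ShardWritePathSketch {
    private final List<String> buffer = new ArrayList<>();     // in Lucene memory, not yet searchable
    private final List<String> searchable = new ArrayList<>(); // visible to search after a refresh
    private final List<String> committed = new ArrayList<>();  // segments persisted by a flush
    private final List<String> translog = new ArrayList<>();   // operations since the last flush

    /** Write to Lucene memory first, then append to the translog. */
    void index(String doc) {
        buffer.add(doc);
        translog.add(doc);
    }

    /** Refresh: buffered documents become searchable, but are not yet durable segments. */
    void refresh() {
        searchable.addAll(buffer);
        buffer.clear();
    }

    /** Flush: persist everything searchable, then clear the translog. */
    void flush() {
        refresh();
        committed.clear();
        committed.addAll(searchable);
        translog.clear();
    }

    /** Crash: in-memory state is lost; recovery reloads committed data and replays the translog. */
    void crashAndRecover() {
        buffer.clear();
        searchable.clear();
        searchable.addAll(committed);
        buffer.addAll(translog);
        refresh();
    }

    boolean search(String doc) { return searchable.contains(doc); }

    int translogSize() { return translog.size(); }

    public static void main(String[] args) {
        ShardWritePathSketch shard = new ShardWritePathSketch();
        shard.index("doc-1");
        System.out.println("searchable before refresh: " + shard.search("doc-1"));
        shard.refresh();
        System.out.println("searchable after refresh: " + shard.search("doc-1"));
        shard.crashAndRecover();
        System.out.println("searchable after crash recovery: " + shard.search("doc-1"));
    }
}
```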

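Can ES lose data?

Lucene produces a Segment every second. At that point the Segment is still in cache and has not been persisted; if the node dies, the in-memory data can still be recovered from the TransLog.
Whether TransLog data itself can be lost depends on how it is synced. When the TransLog is fsynced on every request (the request durability setting), an acknowledged write is already on disk. When it is synced on an interval instead (every 5 seconds by default), data safety is not guaranteed and up to 5 seconds of TransLog data may be lost; syncing more often improves reliability at a considerable cost in performance.
Even if the node holding the primary shard goes down and its TransLog is lost, the data can still be recovered from a replica.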

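Document update (partial update)

Lucene does not support partial document updates, so Elasticsearch implements them itself:

On receiving an Update request, read the full document for the id from the Segment or TransLog and record its version as V1;
Merge the V1 document with the partial document in the request into a complete new document, and update the in-memory versionMap; from this point the Update request becomes an Index request;
Acquire the lock;
Read the latest version V2 for the id from the versionMap again; if it is not there, read it from the Segment or TransLog, although it can almost always be found in the versionMap;
Check whether the versions conflict (V1 versus V2). On conflict, go back to the initial Update stage and start over; otherwise continue with the Index request;
In the Index stage, first increment the version to get V3, then add the document to Lucene, which deletes the old document with the same id and adds the new one. After a successful write, store V3 in the versionMap;
Release the lock.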

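Document operation types

Document operations are either single (Index) or batched (Bulk); both are eventually wrapped into a bulk request (BulkRequest).

Request entry points

In ES, the entry points of all actions are registered in ActionModule. For example, the Bulk Request has two registrations: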

actions.register(BulkAction.INSTANCE, TransportBulkAction.class,
        TransportShardBulkAction.class);

registerHandler.accept(new RestBulkAction(settings, restController));

For REST requests, RestBulkAction parses the request and eventually converts it into a TransportAction for processing.

For example, the request PUT localhost:9200/website/blog/123

{
  "title": "My first blog entry",
  "text": "Just trying this out...",
  "date": "2014/01/01"
}

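is first dispatched to RestIndexAction and then forwarded to TransportBulkAction#doExecute; the analysis of the write path below starts from this entry point.

Document write flow

As the earlier discussion of the ES data model shows, a document must be written successfully to the primary shard before it can be replicated to the replica shards.
[Figure: multi-node, multi-shard model]

The first node to receive the request is the coordinating node;
It first selects the target shard according to the _routing rule;
The routing value set on the IndexRequest takes precedence, then the mapping configuration; if neither is set, _id is used as the routing value;
It looks up the node holding that shard in the cluster metadata, and the request is forwarded to the node holding the primary shard;
The primary shard then executes the write;
Once the primary shard succeeds, the request is sent on to the replica shards;
After the replica shards succeed and the result has been returned to the coordinating node, the write is complete and the coordinating node returns the result to the client.

From this overview, the write flow can be broken down by role: coordinating node, primary shard node and replica shard nodes.
[Figure: ES write flow]

Coordinating node flow

1. Automatic index creation
Entry point: TransportBulkAction#doExecute
Find the indices in the request that need to be created automatically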

for (String index : indices) {
    boolean shouldAutoCreate;
    try {
        shouldAutoCreate = shouldAutoCreate(index, state);
    } catch (IndexNotFoundException e) {
        shouldAutoCreate = false;
        indicesThatCannotBeCreated.put(index, e);
    }
    if (shouldAutoCreate) {
        autoCreateIndices.add(index);
    }
}

Execute the create-index request:

void createIndex(String index, TimeValue timeout, ActionListener<CreateIndexResponse> listener) {
    CreateIndexRequest createIndexRequest = new CreateIndexRequest();
    createIndexRequest.index(index);
    createIndexRequest.cause("auto(bulk api)");
    createIndexRequest.masterNodeTimeout(timeout);
    createIndexAction.execute(createIndexRequest, listener);
}

2. Route the request
Entry point: TransportBulkAction.BulkOperation#doRun
The routing logic differs by request type:

switch (docWriteRequest.opType()) {
    // index / create document requests
    case CREATE:
    case INDEX:
        IndexRequest indexRequest = (IndexRequest) docWriteRequest;
        final IndexMetaData indexMetaData = metaData.index(concreteIndex);
        MappingMetaData mappingMd = indexMetaData.mappingOrDefault(indexRequest.type());
        Version indexCreated = indexMetaData.getCreationVersion();
        indexRequest.resolveRouting(metaData);
        indexRequest.process(indexCreated, mappingMd, concreteIndex.getName());
        break;
    // update document request
    case UPDATE:
        TransportUpdateAction.resolveAndValidateRouting(metaData, concreteIndex.getName(), (UpdateRequest) docWriteRequest);
        break;
    // delete document request
    case DELETE:
        docWriteRequest.routing(metaData.resolveWriteIndexRouting(docWriteRequest.parent(), docWriteRequest.routing(), docWriteRequest.index()));
        // check if routing is required, if so, throw error if routing wasn't specified
        if (docWriteRequest.routing() == null && metaData.routingRequired(concreteIndex.getName(), docWriteRequest.type())) {
            throw new RoutingMissingException(concreteIndex.getName(), docWriteRequest.type(), docWriteRequest.id());
        }
        break;
    default: throw new AssertionError("request type not supported: [" + docWriteRequest.opType() + "]");
}

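The shard is then computed from a hash of the routing value (the document ID by default), and each request is assigned to that shard:

// assign each request to its shardId based on the document ID / routing value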
// first, go over all the requests and create a ShardId -> Operations mapping
Map<ShardId, List<BulkItemRequest>> requestsByShard = new HashMap<>();
for (int i = 0; i < bulkRequest.requests.size(); i++) {
    DocWriteRequest request = bulkRequest.requests.get(i);
    if (request == null) {
        continue;
    }
    String concreteIndex = concreteIndices.getConcreteIndex(request.index()).getName();
    ShardId shardId = clusterService.operationRouting().indexShards(clusterState, concreteIndex, request.id(), request.routing()).shardId();
    List<BulkItemRequest> shardRequests = requestsByShard.computeIfAbsent(shardId, shard -> new ArrayList<>());
    shardRequests.add(new BulkItemRequest(i, request));
}

if (requestsByShard.isEmpty()) {
    listener.onResponse(new BulkResponse(responses.toArray(new BulkItemResponse[responses.length()]), buildTookInMillis(startTimeNanos)));
    return;
}
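
The grouping step can be sketched in isolation. This illustration makes two loud assumptions: it uses `String.hashCode()` as a stand-in hash (Elasticsearch uses its own hash function for routing), and `ShardRoutingSketch`, `shardFor` and `groupByShard` are hypothetical names that do not appear in the Elasticsearch code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of routing document IDs to shards and grouping them per shard. */
class ShardRoutingSketch {
    /** shard = hash(routing) mod number_of_primary_shards; floorMod keeps the result non-negative. */
    static int shardFor(String routing, int numPrimaryShards) {
        return Math.floorMod(routing.hashCode(), numPrimaryShards);
    }

    /** Builds a shard -> ids map, analogous to the ShardId -> operations mapping above. */
    static Map<Integer, List<String>> groupByShard(List<String> ids, int numPrimaryShards) {
        Map<Integer, List<String>> byShard = new TreeMap<>();
        for (String id : ids) {
            byShard.computeIfAbsent(shardFor(id, numPrimaryShards), s -> new ArrayList<>()).add(id);
        }
        return byShard;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1", "2", "123", "zVmOW2EBsZ0GEqF92yf6");
        System.out.println(groupByShard(ids, 5));
    }
}
```

Because the shard is derived from a modulo over the number of primary shards, changing that number would send existing IDs to different shards.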

3. Iterate over the shards and dispatch the requests

final AtomicInteger counter = new AtomicInteger(requestsByShard.size());
// ID of the current node
String nodeId = clusterService.localNode().getId();
for (Map.Entry<ShardId, List<BulkItemRequest>> entry : requestsByShard.entrySet()) {
    // for each shard
    final ShardId shardId = entry.getKey();
    final List<BulkItemRequest> requests = entry.getValue();
    // build the bulk request for this shard
    BulkShardRequest bulkShardRequest = new BulkShardRequest(shardId, bulkRequest.getRefreshPolicy(),
            requests.toArray(new BulkItemRequest[requests.size()]));
    bulkShardRequest.waitForActiveShards(bulkRequest.waitForActiveShards());
    bulkShardRequest.timeout(bulkRequest.timeout());
    if (task != null) {
        bulkShardRequest.setParentTask(nodeId, task.getId());
    }
    // execute the request
    shardBulkAction.execute(bulkShardRequest, new ActionListener<BulkShardResponse>() {
        ...
}

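4. Send the shard request to its node
Entry point: TransportReplicationAction.ReroutePhase#doRun
Routes the request to the node holding the primary shard and retries failed operations.

// locate the primary shard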
final ShardRouting primary = primary(state);
if (retryIfUnavailable(state, primary)) {
    return;
}
final DiscoveryNode node = state.nodes().get(primary.currentNodeId());
// if the primary shard is on the current node, execute locally; otherwise call the remote node
if (primary.currentNodeId().equals(state.nodes().getLocalNodeId())) {
    performLocalAction(state, primary, node, indexMetaData);
} else {
    performRemoteAction(state, primary, node);
}

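Primary shard node flow

As described above, the coordinating node sends the request to the node holding the primary shard, which receives it and runs the corresponding handler.
Message receive entry point: TransportReplicationAction.PrimaryOperationTransportHandler#messageReceived
Execution logic on the primary shard node: ReplicationOperation#execute
1. Check that enough shard copies are active
Entry point: ReplicationOperation#checkActiveShardCount
The more shard copies are active, the more copies are synchronized after the write and the less likely data is to be lost. The default is 1, meaning the write proceeds as long as the primary shard is available.
2. Execute on the primary shard
Entry point: ReplicationOperation.Primary#perform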

public PrimaryResult perform(Request request) throws Exception {
    PrimaryResult result = shardOperationOnPrimary(request, indexShard);
    assert result.replicaRequest() == null || result.finalFailure == null : "a replica request [" + result.replicaRequest()
        + "] with a primary failure [" + result.finalFailure + "]";
    return result;
}

3. The primary shard executes the index operation
The code has to dispatch on the request type in TransportShardBulkAction#executeBulkItemRequest. Taking the UPDATE operation as an example:

private static BulkItemResultHolder executeUpdateRequest(UpdateRequest updateRequest, IndexShard primary,
                                                         IndexMetaData metaData, BulkShardRequest request,
                                                         int requestIndex, UpdateHelper updateHelper,
                                                         LongSupplier nowInMillis,
                                                         final MappingUpdatePerformer mappingUpdater) throws Exception {
    BulkItemRequest primaryItemRequest = request.items()[requestIndex];
    assert primaryItemRequest.request() == updateRequest
            : "expected bulk item request to contain the original update request, got: " +
            primaryItemRequest.request() + " and " + updateRequest;

    BulkItemResultHolder holder = null;
    // There must be at least one attempt
    // guarantee at least one execution; further attempts are retries
    int maxAttempts = Math.max(1, updateRequest.retryOnConflict());
    for (int attemptCount = 0; attemptCount < maxAttempts; attemptCount++) {

        holder = executeUpdateRequestOnce(updateRequest, primary, metaData, request.index(), updateHelper,
                nowInMillis, primaryItemRequest, request.items()[requestIndex].id(), mappingUpdater);

        // It was either a successful request, or it was a non-conflict failure
        if (holder.isVersionConflict() == false) {
            return holder;
        }
    }
    // We ran out of tries and haven't returned a valid bulk item response, so return the last one generated
    return holder;
}

The call path is long; it ends in InternalEngine#index

public IndexResult index(Index index) throws IOException {
    assert Objects.equals(index.uid().field(), uidField) : index.uid().field();
    final boolean doThrottle = index.origin().isRecovery() == false;
    try (ReleasableLock releasableLock = readLock.acquire()) {
        ensureOpen();
        assert assertIncomingSequenceNumber(index.origin(), index.seqNo());
        assert assertVersionType(index);
        try (Releasable ignored = versionMap.acquireLock(index.uid().bytes());
            Releasable indexThrottle = doThrottle ? () -> {} : throttle.acquireThrottle()) {
            lastWriteNanos = index.startTime();
            /* A NOTE ABOUT APPEND ONLY OPTIMIZATIONS:
             * if we have an autoGeneratedID that comes into the engine we can potentially optimize
             * and just use addDocument instead of updateDocument and skip the entire version and index lookupVersion across the board.
             * Yet, we have to deal with multiple document delivery, for this we use a property of the document that is added
             * to detect if it has potentially been added before. We use the documents timestamp for this since it's something
             * that:
             *  - doesn't change per document
             *  - is preserved in the transaction log
             *  - and is assigned before we start to index / replicate
             * NOTE: it's not important for this timestamp to be consistent across nodes etc. it's just a number that is in the common
             * case increasing and can be used in the failure case when we retry and resent documents to establish a happens before relationship.
             * for instance:
             *  - doc A has autoGeneratedIdTimestamp = 10, isRetry = false
             *  - doc B has autoGeneratedIdTimestamp = 9, isRetry = false
             *
             *  while both docs are in in flight, we disconnect on one node, reconnect and send doc A again
             *  - now doc A' has autoGeneratedIdTimestamp = 10, isRetry = true
             *
             *  if A' arrives on the shard first we update maxUnsafeAutoIdTimestamp to 10 and use update document. All subsequent
             *  documents that arrive (A and B) will also use updateDocument since their timestamps are less than maxUnsafeAutoIdTimestamp.
             *  While this is not strictly needed for doc B it is just much simpler to implement since it will just de-optimize some doc in the worst case.
             *
             *  if A arrives on the shard first we use addDocument since maxUnsafeAutoIdTimestamp is < 10. A` will then just be skipped or calls
             *  updateDocument.
             */
            final IndexingStrategy plan;

            if (index.origin() == Operation.Origin.PRIMARY) {
                plan = planIndexingAsPrimary(index);
            } else {
                // non-primary mode (i.e., replica or recovery)
                plan = planIndexingAsNonPrimary(index);
            }

            final IndexResult indexResult;
            if (plan.earlyResultOnPreFlightError.isPresent()) {
                indexResult = plan.earlyResultOnPreFlightError.get();
                assert indexResult.getResultType() == Result.Type.FAILURE : indexResult.getResultType();
            } else if (plan.indexIntoLucene) {
                indexResult = indexIntoLucene(index, plan);
            } else {
                indexResult = new IndexResult(
                        plan.versionForIndexing, getPrimaryTerm(), plan.seqNoForIndexing, plan.currentNotFoundOrDeleted);
            }
            if (index.origin() != Operation.Origin.LOCAL_TRANSLOG_RECOVERY) {
                final Translog.Location location;
                if (indexResult.getResultType() == Result.Type.SUCCESS) {
                    location = translog.add(new Translog.Index(index, indexResult));
                } else if (indexResult.getSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO) {
                    // if we have document failure, record it as a no-op in the translog with the generated seq_no
                    location = translog.add(new Translog.NoOp(indexResult.getSeqNo(), index.primaryTerm(), indexResult.getFailure().toString()));
                } else {
                    location = null;
                }
                indexResult.setTranslogLocation(location);
            }
            if (plan.indexIntoLucene && indexResult.getResultType() == Result.Type.SUCCESS) {
                final Translog.Location translogLocation = trackTranslogLocation.get() ? indexResult.getTranslogLocation() : null;
                versionMap.maybePutIndexUnderLock(index.uid().bytes(),
                    new IndexVersionValue(translogLocation, plan.versionForIndexing, plan.seqNoForIndexing, index.primaryTerm()));
            }
            if (indexResult.getSeqNo() != SequenceNumbers.UNASSIGNED_SEQ_NO) {
                localCheckpointTracker.markSeqNoAsCompleted(indexResult.getSeqNo());
            }
            indexResult.setTook(System.nanoTime() - index.startTime());
            indexResult.freeze();
            return indexResult;
        }
    } catch (RuntimeException | IOException e) {
        try {
            maybeFailEngine("index", e);
        } catch (Exception inner) {
            e.addSuppressed(inner);
        }
        throw e;
    }
}

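Replica shard execution

Still inside ReplicationOperation#execute on the primary shard node, the write interface of the replica shards has to be called.
1. Call the replica shards
Entry point: ReplicationOperation#performOnReplicas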

private void performOnReplica(final ShardRouting shard, final ReplicaRequest replicaRequest, final long globalCheckpoint) {
    if (logger.isTraceEnabled()) {
        logger.trace("[{}] sending op [{}] to replica {} for request [{}]", shard.shardId(), opType, shard, replicaRequest);
    }

    totalShards.incrementAndGet();
    pendingActions.incrementAndGet();
    // send the request to the replica shard over the transport layer
    replicasProxy.performOn(shard, replicaRequest, globalCheckpoint, new ActionListener<ReplicaResponse>() {
        ...
}

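2. The flow by which a replica shard receives the request and writes the document is essentially the same as on the primary shard.
The message receive entry point is TransportReplicationAction.ReplicaOperationTransportHandler#messageReceived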

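GET flow: source code analysis

Real-time behavior

GET requests in Elasticsearch are real-time by default. In the engine code shown at the end of this section, a real-time GET checks the in-memory versionMap and, if the document has been written but not yet refreshed, triggers a refresh before reading; reading directly from the TransLog is reserved for update requests.
GET only supports lookup by doc_id, so conditional queries (Search) are still not real-time.

GET execution flow

GET refers to a request for a single document. A document is uniquely identified by its ID, so GET finds one document by ID.
Nodes in an ES cluster act as coordinating nodes and data nodes. A GET request first reaches a coordinating node and is then forwarded to a data node; if execution fails on one node, the request is forwarded to another node for reading.
[Figure: ES GET flow]

Coordinating node flow

Entry point: TransportSingleShardAction.AsyncSingleAction#perform
Note that the AsyncSingleAction constructor first prepares the cluster state, the node list and related information.
1. Compute the shard id of the document, i.e. the shard that holds it;
The constructor first routes the request via TransportGetAction#shards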

private AsyncSingleAction(Request request, ActionListener<Response> listener) {
    this.listener = listener;

    ClusterState clusterState = clusterService.state();
    if (logger.isTraceEnabled()) {
        logger.trace("executing [{}] based on cluster state version [{}]", request, clusterState.version());
    }
    // list of nodes in the cluster
    nodes = clusterState.nodes();
    ClusterBlockException blockException = checkGlobalBlock(clusterState);
    if (blockException != null) {
        throw blockException;
    }

    String concreteSingleIndex;
    if (resolveIndex(request)) {
        concreteSingleIndex = indexNameExpressionResolver.concreteSingleIndex(clusterState, request).getName();
    } else {
        concreteSingleIndex = request.index();
    }
    this.internalRequest = new InternalRequest(request, concreteSingleIndex);
    // resolve the request and apply any custom routing
    resolveRequest(clusterState, internalRequest);

    blockException = checkRequestBlock(clusterState, internalRequest);
    if (blockException != null) {
        throw blockException;
    }
    // compute the target shard iterator from the routing algorithm, or pick target nodes by preference
    this.shardIt = shards(clusterState, internalRequest);
}

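2. Send the request
Check whether the target node is the local node. If so, call the local TransportService#sendLocalRequest directly; if it is a remote node, perform a remote call.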

public <T extends TransportResponse> void sendRequest(final DiscoveryNode node, final String action,
                                                            final TransportRequest request,
                                                            final TransportResponseHandler<T> handler) {
    try {
        Transport.Connection connection = getConnection(node);
        sendRequest(connection, action, request, TransportRequestOptions.EMPTY, handler);
    } catch (NodeNotConnectedException ex) {
        // the caller might not handle this so we invoke the handler
        handler.handleException(ex);
    }
}

public Transport.Connection getConnection(DiscoveryNode node) {
    if (isLocalNode(node)) {
        return localNodeConnection;
    } else {
        return transport.getConnection(node);
    }
}

Data node flow

Entry point: TransportSingleShardAction.ShardTransportHandler#messageReceived

public void messageReceived(final Request request, final TransportChannel channel) throws Exception {
    if (logger.isTraceEnabled()) {
        logger.trace("executing [{}] on shard [{}]", request, request.internalShardId);
    }
    Response response = shardOperation(request, request.internalShardId);
    channel.sendResponse(response);
}

protected GetResponse shardOperation(GetRequest request, ShardId shardId) {
    IndexService indexService = indicesService.indexServiceSafe(shardId.getIndex());
    IndexShard indexShard = indexService.getShard(shardId.id());

    if (request.refresh() && !request.realtime()) {
        indexShard.refresh("refresh_flag_get");
    }

    GetResult result = indexShard.getService().get(request.type(), request.id(), request.storedFields(),
            request.realtime(), request.version(), request.versionType(), request.fetchSourceContext());
    return new GetResponse(result);
}

1. Read the data

private GetResult innerGet(String type, String id, String[] gFields, boolean realtime, long version, VersionType versionType,
                           FetchSourceContext fetchSourceContext, boolean readFromTranslog) {
    fetchSourceContext = normalizeFetchSourceContent(fetchSourceContext, gFields);
    final Collection<String> types;
    // handle the _all option
    if (type == null || type.equals("_all")) {
        types = mapperService.types();
    } else {
        types = Collections.singleton(type);
    }

    Engine.GetResult get = null;
    for (String typeX : types) {
        Term uidTerm = mapperService.createUidTerm(typeX, id);
        if (uidTerm != null) {
            // call the Engine to read the data
            get = indexShard.get(new Engine.Get(realtime, readFromTranslog, typeX, id, uidTerm)
                    .version(version).versionType(versionType));
            if (get.exists()) {
                type = typeX;
                break;
            } else {
                get.release();
            }
        }
    }

    ...

        // filter the result
        // using type, id, DocumentMapper and so on, extract the data from what was just read and filter the requested fields and source
        // return the result wrapped in a GetResult
        return innerGetLoadFromStoredFields(type, id, gFields, fetchSourceContext, get, mapperService);
}

2. Read the data from InternalEngine

public Engine.GetResult get(Engine.Get get) {
    readAllowed();
    return getEngine().get(get, this::acquireSearcher);
}

public GetResult get(Get get, BiFunction<String, SearcherScope, Searcher> searcherFactory) throws EngineException {
    assert Objects.equals(get.uid().field(), uidField) : get.uid().field();
    try (ReleasableLock ignored = readLock.acquire()) {
        ensureOpen();
        SearcherScope scope;
        // handle the realtime option and decide whether a refresh is needed
        if (get.realtime()) {
            VersionValue versionValue = null;
            // versionMap entries are added when documents are indexed; they are never written to disk
            try (Releasable ignore = versionMap.acquireLock(get.uid().bytes())) {
                // we need to lock here to access the version map to do this truly in RT
                versionValue = getVersionFromMap(get.uid().bytes());
            }
            if (versionValue != null) {
                if (versionValue.isDelete()) {
                    return GetResult.NOT_EXISTS;
                }
                // check for a version conflict
                if (get.versionType().isVersionConflictForReads(versionValue.version, get.version())) {
                    throw new VersionConflictEngineException(shardId, get.type(), get.id(),
                        get.versionType().explainConflictForReads(versionValue.version, get.version()));
                }
                if (get.isReadFromTranslog()) {
                    // this is only used for updates - API _GET calls will always read form a reader for consistency
                    // the update call doesn't need the consistency since it's source only + _parent but parent can go away in 7.0
                    if (versionValue.getLocation() != null) {
                        try {
                            Translog.Operation operation = translog.readOperation(versionValue.getLocation());
                            if (operation != null) {
                                // in the case of a already pruned translog generation we might get null here - yet very unlikely
                                TranslogLeafReader reader = new TranslogLeafReader((Translog.Index) operation, engineConfig
                                    .getIndexSettings().getIndexVersionCreated());
                                return new GetResult(new Searcher("realtime_get", new IndexSearcher(reader)),
                                    new VersionsAndSeqNoResolver.DocIdAndVersion(0, ((Translog.Index) operation).version(), reader, 0));
                            }
                        } catch (IOException e) {
                            maybeFailEngine("realtime_get", e); // lets check if the translog has failed with a tragic event
                            throw new EngineException(shardId, "failed to read operation from translog", e);
                        }
                    } else {
                        trackTranslogLocation.set(true);
                    }
                }
                // perform a refresh so the searcher can see the latest write
                refresh("realtime_get", SearcherScope.INTERNAL);
            }
            scope = SearcherScope.INTERNAL;
        } else {
            // we expose what has been externally expose in a point in time snapshot via an explicit refresh
            scope = SearcherScope.EXTERNAL;
        }
        // read the data through the Searcher
        // no version, get the version from the index, we know that we refresh on flush
        return getFromSearcher(get, searcherFactory, scope);
    }
}
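
The real-time branch of the method above can be reduced to a small model: if the versionMap knows about an unrefreshed write for the ID, refresh first, then read through the searcher. The sketch below is only an approximation of that control flow; `RealtimeGetSketch` is hypothetical, it ignores versions, deletes, locking and the translog path used by updates, and it models the versionMap as a plain set of IDs that is cleared on refresh.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of real-time GET: refresh on demand when the versionMap has the ID. */
class RealtimeGetSketch {
    private final Map<String, String> buffer = new HashMap<>();   // indexed but not yet refreshed
    private final Map<String, String> searcher = new HashMap<>(); // what a searcher can see
    private final Set<String> versionMap = new HashSet<>();       // ids written since the last refresh
    private int refreshCount = 0;

    void index(String id, String source) {
        buffer.put(id, source);
        versionMap.add(id);
    }

    void refresh() {
        searcher.putAll(buffer);
        buffer.clear();
        versionMap.clear();
        refreshCount++;
    }

    /** A non-realtime get only sees refreshed data; a realtime get refreshes first if needed. */
    String get(String id, boolean realtime) {
        if (realtime && versionMap.contains(id)) {
            refresh();
        }
        return searcher.get(id);
    }

    int refreshCount() { return refreshCount; }

    public static void main(String[] args) {
        RealtimeGetSketch shard = new RealtimeGetSketch();
        shard.index("123", "{\"title\":\"My first blog entry\"}");
        System.out.println("non-realtime: " + shard.get("123", false));
        System.out.println("realtime: " + shard.get("123", true));
    }
}
```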