ES5_3IngestNode

Posted on 2021-07-01, updated on 2025-07-07, filed under ElasticSearch

Ingest Node

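An Ingest Node provides functionality similar to Logstash:

Preprocessing: it can intercept Index and Bulk API requests
It transforms the data and hands the result back to the Index or Bulk API
For example, it can set a default value for a field, rename a field, or split a field value
It supports Painless scripts for more complex processing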

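Compared with Logstash:

Aspect	Logstash	Ingest Node
Data input and output	Reads from many kinds of data sources and writes to many kinds of destinations	Receives data through the ES REST API and writes to ES
Data buffering	Implements a simple data queue and supports rewriting	No buffering
Data processing	Many plugins, plus custom development	Built-in processors, plus custom plugin development (adding a plugin requires a restart)
Configuration and usage	Adds some architectural complexity	No extra deployment needed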

Building an Ingest Node - Pipeline & Processor


Pipeline
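A pipeline processes every document that passes through it, applying its steps in order
Processor
A processor is an abstraction that encapsulates one processing step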
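
To make the relationship concrete, here is a minimal Python sketch of the idea, not of how Elasticsearch implements it. The names split, set_value and run_pipeline are made up for this illustration, and the split here uses a plain string separator for simplicity.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Processor = Callable[[Doc], Doc]


def split(field: str, separator: str) -> Processor:
    """Build a processor that splits a string field into a list."""
    def run(doc: Doc) -> Doc:
        value = doc[field]
        if not isinstance(value, str):
            # Splitting only makes sense on a string value.
            raise TypeError(f"field [{field}] is not a string")
        return {**doc, field: value.split(separator)}
    return run


def set_value(field: str, value: Any) -> Processor:
    """Build a processor that sets a field to a fixed value."""
    def run(doc: Doc) -> Doc:
        return {**doc, field: value}
    return run


def run_pipeline(processors: List[Processor], doc: Doc) -> Doc:
    """Apply each processor to the document, in order."""
    for processor in processors:
        doc = processor(doc)
    return doc


# A pipeline is just an ordered list of processors.
blog_pipeline = [split("tags", ","), set_value("views", 0)]
result = run_pipeline(blog_pipeline, {"title": "cloud", "tags": "openstack,k8s"})
print(result)
```

The point of the abstraction is that each processor only knows about one step, while the pipeline only knows about ordering.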

Creating a pipeline

Add a pipeline to ES:

PUT _ingest/pipeline/blog_pipeline
{
  "description": "a blog pipeline",
  "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
}

View the pipeline:

GET _ingest/pipeline/blog_pipeline

Test the pipeline:

POST _ingest/pipeline/blog_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}

The output shows that tags has been split into an array
The final document also has a new views field

Index a document through the pipeline:

PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{
  "title": "Introducing cloud computing",
  "tags": "openstack,k8s",
  "content": "You know, for cloud"
}

However, updating documents with _update_by_query may fail:

POST /tech_blogs/_update_by_query?pipeline=blog_pipeline
{
}

{
  ...
  "failures": [
    {
      "index": "tech_blogs",
      "type": "doc",
      "id": "1",
      "cause": {
        "type": "exception",
        "reason": "java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "java.lang.IllegalArgumentException: field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]",
          "caused_by": {
            "type": "illegal_argument_exception",
            "reason": "field [tags] of type [java.util.ArrayList] cannot be cast to [java.lang.String]"
          }
        },
        "header": {
          "processor_type": "split"
        }
      },
      "status": 500
    }
  ]
}

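The error occurs because the split processor is applied again to a field that has already been split, which amounts to running a string split on an array field.
To avoid this, add a query condition that skips documents that have already been processed: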

POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "views"
                }
            }
        }
    }
}
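
The effect of that filter can be sketched in a few lines of Python. This is only an illustration of the logic: apply_blog_pipeline and needs_processing are hypothetical helpers, and the in-memory docs dictionary stands in for the index.

```python
def apply_blog_pipeline(source: dict) -> dict:
    """Stand-in for blog_pipeline: split tags, then set views to 0."""
    if not isinstance(source["tags"], str):
        # Same situation as the error above: tags is already an array.
        raise TypeError("field [tags] is not a string")
    return {**source, "tags": source["tags"].split(","), "views": 0}


def needs_processing(source: dict) -> bool:
    """Mirror of must_not + exists on views: true when views is absent."""
    return source.get("views") is None


docs = {
    "1": {"tags": ["openstack", "k8s"], "views": 0},  # already processed
    "2": {"tags": "hadoop,spark"},                    # not processed yet
}

# Only unprocessed documents go through the pipeline, so nothing is split twice.
for doc_id, source in list(docs.items()):
    if needs_processing(source):
        docs[doc_id] = apply_blog_pipeline(source)

print(docs)
```

Using views as the marker works here only because the pipeline itself sets that field; a different pipeline would need a different marker.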

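Building a pipeline

There are many kinds of processors; only a few are listed here.

Field splitting - split

The _simulate endpoint of the _ingest API runs a pipeline against sample documents without storing anything: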

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "content": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "content": "You know, for cloud"
      }
    }
  ]
}

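The pipeline has a single processor, which splits the tags field of each document into an array on ","
Each document has a tags field, but in the original value several tags are joined into one string

Setting a field value - set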

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "to split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },

      {
        "set":{
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index":"index",
      "_id":"id",
      "_source":{
        "title":"Introducing big data......",
        "tags":"hadoop,elasticsearch,spark",
        "content":"You know, for big data"
      }
    },
    {
      "_index":"index",
      "_id":"idxx",
      "_source":{
        "title":"Introducing cloud computing",
        "tags":"openstack,k8s",
        "content":"You know, for cloud"
      }
    }

    ]
}

When a document is added, the set processor adds a new field, views

ES5_4Painless

Posted on 2021-07-01, updated on 2025-07-07, filed under ElasticSearch

The built-in scripting language.

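ES5_1 Data Modeling

Posted on 2021-06-12, updated on 2025-07-07, filed under ElasticSearch

Document relationships

Normalization

Relational databases generally normalize data, whereas in Elasticsearch you usually denormalize.

Good query performance
No table joins
No row locks

Relationships

ES is not good at handling relationships. Four approaches are commonly used:

Object type
Nested object
Parent / Child relationship
Application-side joins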

	Nested Object	Parent / Child
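Pros	Documents are stored together, so reads are fast	Parent and child documents can be updated independently
Cons	Updating a nested child requires updating the whole document	Needs extra memory to maintain the relation; reads are comparatively slower
Best for	Mostly queries, with child documents updated only occasionally	Frequently updated child documents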

Nested

# Insert two documents
PUT blog/doc/1
{
  "content": "i love you",
  "time": "2021-12-12T12:12:12",
  "user": {
    "userId": 1,
    "userName": "Jack",
    "city": "shanghai"
  }
}
PUT blog/doc/2
{
  "content": "i dont love you",
  "time": "2021-12-12T12:12:12",
  "user": [{
    "userId": 2,
    "userName": "Jack Mike",
    "city": "hebei"
  },
  {
    "userId": 3,
    "userName": "Joe",
    "city": "beijing"
  }]
}

# Search
POST blog/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"content": "you"}},
        {"match": {"user.userName": "Jack"}},
        {"match": {"user.city": "beijing"}}
      ]
    }
  }
}

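The search above looks fine at first glance, but it actually returns data we do not want.

Jack Mike is from hebei and Joe is from beijing, so neither matches the intended condition, "a Jack from beijing".

The reason is:

When ES stores the document, the boundaries between the objects in the array are not preserved; the JSON is flattened into key-value pairs, roughly user.userName: ["Jack Mike", "Joe"] and user.city: ["hebei", "beijing"]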
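
The flattening can be reproduced with a small Python sketch. It is only a model of the behavior described above: flatten and matches are helpers invented for this illustration, and matches is a crude stand-in for a match query (lowercase, whitespace tokens).

```python
from collections import defaultdict


def flatten(doc: dict, prefix: str = "") -> dict:
    """Flatten objects and arrays of objects into dotted keys with value lists."""
    flat = defaultdict(list)
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat[sub_path].extend(sub_values)
            else:
                flat[path].append(item)
    return dict(flat)


def matches(values: list, term: str) -> bool:
    """True if any whitespace token of any value equals the term."""
    return any(term.lower() in str(v).lower().split() for v in values)


doc2 = {
    "content": "i dont love you",
    "user": [
        {"userId": 2, "userName": "Jack Mike", "city": "hebei"},
        {"userId": 3, "userName": "Joe", "city": "beijing"},
    ],
}

flat = flatten(doc2)
# Flattened view: the two conditions are satisfied by different users.
flat_hit = matches(flat["user.userName"], "Jack") and matches(flat["user.city"], "beijing")
# Per-object view: both conditions must hold for the same user.
per_object_hit = any(
    matches([u["userName"]], "Jack") and matches([u["city"]], "beijing")
    for u in doc2["user"]
)
print(flat_hit, per_object_hit)
```

The flattened check succeeds even though no single user satisfies both conditions, which is exactly the unwanted hit.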

The nested data type solves this problem:

# First declare user as a nested field
PUT blog
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text"
        },
        "time": {
          "type": "date"
        },
        "user": {
          "type": "nested",
          "properties": {
            "userId": {
              "type": "long"
            },
            "userName": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}
# Query: the nested clause queries the nested objects, and path names the nested field
POST blog/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"content": "you"}},
        {
          "nested": {
            "path": "user",
            "query": {
              "bool": {
                "must": [
                    {"match": {"user.userName": "Jack"}},
                    {"match": {"user.city": "beijing"}}
                ]
              }
            }
          }
        }
      ]
    }
  }
}

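With the index defined this way, the same search conditions no longer return anything; if the city in the query is changed to "hebei", the document is found again.

The nested data type allows each object in an array of objects to be indexed independently
The nested and properties keywords cause every user object to be indexed as a separate document
Internally, a nested object and its root document are stored as separate Lucene documents, which are joined at query time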

Parent / Child

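Limitations of nested relationships:

Every update requires reindexing the whole document (the root object and its nested objects)

A Parent / Child relationship in ES is similar to a join in a relational database.

The parent document and the child document are two independent documents;
updating the parent does not require reindexing the children, and adding, updating, or deleting a child affects neither the parent nor the other children.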

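Create the index and set the mapping. The blog_comments_relation field is declared with the join type, and relations maps the parent name (blog) to the child name (comment):

PUT my_blogs
{
  "mappings": {
    "doc": {
      "properties": {
        "blog_comments_relation": {
          "type": "join",
          "relations": {
            "blog": "comment"
          }
        },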
        "content": {
          "type": "text"
        },
        "title": {
          "type": "keyword"
        }
      }
    }
  }
}

Index parent and child documents, then query:

# Index a parent document
PUT my_blogs/_doc/blog1
{
  "title":"Learning Elasticsearch",
  "content":"learning ELK @ geektime",
  "blog_comments_relation":{
    "name":"blog"
  }
}

# Index a parent document
PUT my_blogs/_doc/blog2
{
  "title":"Learning Hadoop",
  "content":"learning Hadoop",
  "blog_comments_relation":{
    "name":"blog"
  }
}

# Index a child document
PUT my_blogs/_doc/comment1?routing=blog1
{
  "comment":"I am learning ELK",
  "username":"Jack",
  "blog_comments_relation":{
    "name":"comment",
    "parent":"blog1"
  }
}

# Index a child document
PUT my_blogs/_doc/comment2?routing=blog2
{
  "comment":"I like Hadoop!!!!!",
  "username":"Jack",
  "blog_comments_relation":{
    "name":"comment",
    "parent":"blog2"
  }
}

# Index a child document
PUT my_blogs/_doc/comment3?routing=blog2
{
  "comment":"Hello Hadoop",
  "username":"Bob",
  "blog_comments_relation":{
    "name":"comment",
    "parent":"blog2"
  }
}

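Index each child document with routing set to its parent's ID, so that parent and child end up on the same shard
Also set the parent document's ID in parent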
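
Why the same routing value keeps parent and child together can be sketched as follows. This is a simplified model under loud assumptions: the shard count of 5 is hypothetical, shard_for is a made-up helper, and crc32 is only a stand-in for the hash function Elasticsearch actually uses.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # hypothetical shard count for this sketch


def shard_for(routing: str, num_shards: int = NUM_PRIMARY_SHARDS) -> int:
    """Pick a shard from the routing value (simplified hash-modulo model)."""
    # crc32 is a stand-in hash; it is not what Elasticsearch uses internally.
    return zlib.crc32(routing.encode("utf-8")) % num_shards


parent_shard = shard_for("blog2")       # parent: routing defaults to its own ID
child_shard = shard_for("blog2")        # child indexed with ?routing=blog2
unrouted_shard = shard_for("comment3")  # where comment3 would land without routing

print(parent_shard, child_shard, unrouted_shard)
```

Because the shard depends only on the routing value, a child indexed with its parent's ID as routing always lands with the parent, while the child's own ID may or may not hash to that shard.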

Query as needed:

# Get a parent document by ID
GET my_blogs/_doc/blog2

# parent_id query, returns the child documents of the given parent
POST my_blogs/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "blog2"
    }
  }
}

# has_child query, returns parent documents
POST my_blogs/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query" : {
                "match": {
                    "username" : "Jack"
                }
            }
    }
  }
}

# has_parent query, returns the related child documents
POST my_blogs/_search
{
  "query": {
    "has_parent": {
      "parent_type": "blog",
      "query" : {
                "match": {
                    "title" : "Learning Hadoop"
                }
            }
    }
  }
}

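# Get a child document by ID only: without routing the lookup can go to the wrong shard, so no _source is returned
GET my_blogs/_doc/comment3
# Get a child document by ID and routing: _source is returned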
GET my_blogs/_doc/comment3?routing=blog2

# Update a child document
PUT my_blogs/_doc/comment3?routing=blog2
{
    "comment": "Hello Hadoop??",
    "blog_comments_relation": {
      "name": "comment",
      "parent": "blog2"
    }
}

Queries that return child documents:

has_parent
parent_id

Queries that return parent documents:

has_child
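Data modeling

Process:

Conceptual model
Logical model: entity attributes, relationships between entities, search-related configuration
Data model: number of index shards, mapping field configuration, relationship handling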

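Field modeling

Field types

text:

For full-text fields; the text is tokenized by an analyzer
Aggregation and sorting are not supported by default; they require setting fielddata to true

keyword:

For IDs, enumerations, and text that should not be tokenized
Suitable for filters (exact match), sorting, and aggregations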
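
The difference between the two types comes down to which terms end up in the index. The sketch below is a rough approximation: analyze_text imitates a simple analyzer (lowercase, split on non-alphanumerics) and is not the actual analyzer implementation; both function names are made up for this example.

```python
import re


def analyze_text(value: str) -> list:
    """Rough stand-in for an analyzer: lowercase, split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", value.lower()) if token]


def analyze_keyword(value: str) -> list:
    """A keyword field keeps the whole value as a single term."""
    return [value]


title = "Learning Elasticsearch"
text_terms = analyze_text(title)
keyword_terms = analyze_keyword(title)

print(text_terms)
print(keyword_terms)
# A single word matches a text term, but only the exact full value matches the keyword term.
print("elasticsearch" in text_terms, "elasticsearch" in keyword_terms)
```

This is why keyword suits exact-match filters, sorting, and aggregations, while text suits full-text search.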

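Multi-fields:

By default, a string is mapped as text with a keyword sub-field
For human-language text, adding "English", "pinyin", and "standard" analyzers improves the search experience

Structured data:

Use precise numeric types: do not use long where byte is enough
Map enumerations to keyword, even when the values are numbers, for better performance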

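Search and tokenization

If a field needs no search, sorting, or aggregation, set enabled to false
If a field does not need to be searchable, set index to false
For fields that do need to be searched, tune how much is indexed with index_options / norms
When scoring normalization is not needed, norms can be turned off to save disk space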

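Aggregation and sorting

If a field needs no aggregation or sorting, set doc_values / fielddata to false
For keyword fields that are updated and aggregated frequently, set eager_global_ordinals to true so that caching improves aggregation performance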

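Extra storage

Setting enabled to false on _source saves disk space
In practice _source is rarely turned off; increasing the compression ratio is preferred, because once it is off the _source field can no longer be viewed and Reindex and Update no longer work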

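Data modeling best practices

Relationships

Prefer denormalization
When the data contains multi-valued objects (arrays of objects) that must be queried, use Nested Object
When related documents are updated very frequently, use Parent / Child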

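Avoid too many fields

Problems caused by too many fields:

Hard to maintain
Mapping information is stored in the Cluster State; if it grows too large it hurts cluster performance (the Cluster State has to be synchronized to every node)
Deleting or modifying a field requires a reindex

The default maximum number of fields is 1000; it can be changed with index.mapping.total_fields.limit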

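Avoid regexp queries

Problems with regexp queries:

Regexp, wildcard, and prefix queries are term-level queries, but they do not perform well
Performance is especially poor when the wildcard is at the beginning of the pattern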

Avoid inaccurate aggregations caused by null values

For example, insert the two documents below, one of which has a null rating:

PUT ratings/doc/1
{
 "rating":5
}
PUT ratings/doc/2
{
 "rating":null
}

The aggregation result shows that although total is 2, avg is 5:

POST ratings/_search
{
  "size": 0,
  "aggs": {
    "avg": {
      "avg": {
        "field": "rating"
      }
    }
  }
}

The fix is to give null a default value:

DELETE ratings
PUT ratings
{
  "mappings": {
    "doc": {
      "properties": {
        "rating": {
          "type": "long",
          "null_value": 1.0
        }
      }
    }
  }
}
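
The arithmetic behind both results can be checked with a short Python sketch. The avg function below is a hypothetical model of the behavior described here: null values are skipped unless a null_value replaces them. It assumes the two documents are indexed again after the mapping is recreated.

```python
def avg(values: list, null_value=None):
    """Average that skips nulls, unless null_value is given to replace them."""
    indexed = [null_value if v is None else v for v in values]
    present = [v for v in indexed if v is not None]
    return sum(present) / len(present) if present else None


ratings = [5, None]  # the two documents from the example

without_default = avg(ratings)             # the null rating is ignored
with_default = avg(ratings, null_value=1)  # the null rating counts as 1

print(without_default, with_default)
```

With the default in place, the second document contributes to the average instead of silently dropping out.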

Adding meta information to an index mapping

PUT softwares
{
  "mappings": {
    "_meta": {
      "software_version_mapping": "1.0"
    }
  }
}

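Defining a mapping is an iterative process:

Adding a new field is easy (run update_by_query when necessary)
Changing or deleting a field is not allowed (the data must be rebuilt with Reindex)
It is best to add meta information to the mapping for better version management
Consider keeping the mapping files in git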

ES5_2 Principles for Creating an Index

Posted on 2021-06-12, updated on 2025-07-07, filed under ElasticSearch

Best practices for defining an index.

Getting Started with Hadoop

Posted on 2021-04-11, updated on 2025-07-07, filed under Hadoop

Basic usage of Hadoop.

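Summary of Kafka Internals

Posted on 2021-03-28, updated on 2025-07-07, filed under Message Queue

Kafka

High throughput
Kafka is often used in big-data scenarios because it is optimized for batch transfer: data sent by producers is delivered to consumers in batches.
Distributed
Multiple brokers hold multiple partitions, and each partition also has multiple replicas.
Pub/Sub
Like most MQ middleware, it has Producer, Consumer, and Broker components.
Scala
Related to Java: the compiler is different, but the bytecode usually runs on the HotSpot VM.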

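About DDD

Posted on 2021-01-31, updated on 2025-07-07, filed under Distributed Systems

Domain-Driven Design (DDD) is an architectural approach whose goal is to draw the boundaries of a business correctly, so that the overall architecture is clearer.
This post records how to put DDD into practice, from defining beans to building the architecture.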

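Hot Data Detection

Posted on 2021-01-03, updated on 2025-07-07, filed under Design

An experimental project. I was asked about this twice and had no good answer at the time, so here I try to think through a solution.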

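Designing a Flash-Sale System

Posted on 2020-12-31, updated on 2025-07-07, filed under Design

Introduction

A project built purely to get familiar with system design.

There is not much to say about the scenario: decrement stock and place an order, a very simple flow;
a flash sale generates a large amount of traffic in a short time, with many users ordering "frantically" while only a few actually succeed;
the main focus is how to design the system for high availability and high performance.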

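Building a Web Page Index

Posted on 2020-12-25, updated on 2025-07-07, filed under Design

I often want to search everything inside a folder or a site, but pages nest pages and folders nest folders. I could not think of a particularly good search tool that recursively finds all the content, so the idea is to crawl the data into ES and offer full-text search over it, forming a personal web knowledge base.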