Elasticsearch Mapping 参数

2023-11-20 Lang elastic 评论字数统计: 5.1k(字) 阅读时长: 25(分)

本文基于 Elasticsearch 6.6.0

analyzer

指定分词器(分析器更合理)，对索引和查询都有效。如下，指定ik分词的配置：

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        }
      }
    }
  }
}

normalizer

normalizer用于解析前的标准化配置，比如把所有的字符转化为小写等。例子：

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

BÀR经过normalizer过滤以后转换为bar，文档1和文档2会被搜索到。

boost

boost字段用于设置字段的权重，比如，关键字出现在title字段的权重是出现在content字段中权重的2倍，设置mapping如下，其中content字段的默认权重是1.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "boost": 2 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

同样，在查询时指定权重也是一样的：

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox",
                "boost": 2
            }
        }
    }
}

推荐在查询时指定boost，第一中在mapping中写死，如果不重新索引文档，权重无法修改，使用查询可以实现同样的效果。

coerce

coerce属性用于清除脏数据，coerce的默认值是true。整型数字5有可能会被写成字符串“5”或者浮点数5.0.coerce属性可以用来清除脏数据：

字符串会被强制转换为整数
浮点数被强制转换为整数

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer",
          "coerce": false
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "number_one": "10" 
}

PUT my_index/my_type/2
{
  "number_two": "10" 
}

mapping中指定number_one字段是integer类型，虽然插入的数据类型是String，但依然可以插入成功。number_two字段关闭了coerce，因此插入失败。

copy_to

copy_to属性用于配置自定义的_all字段。换言之，就是多个字段可以合并成一个超级字段。比如，first_name和last_name可以合并为full_name字段。

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" 
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { 
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}

doc_values

doc_values是为了加快排序、聚合操作，在建立倒排索引的时候，额外增加一个列式存储映射，是一个空间换时间的做法。默认是开启的，对于确定不需要聚合或者排序的字段可以关闭。

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": { 
          "type":       "keyword"
        },
        "session_id": { 
          "type":       "keyword",
          "doc_values": false
        }
      }
    }
  }
}

注:text类型不支持doc_values。

dynamic

dynamic属性用于检测新发现的字段，有三个取值：

true:新发现的字段添加到映射中。（默认）
flase:新检测的字段被忽略。必须显式添加新字段。
strict:如果检测到新字段，就会引发异常并拒绝文档。

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic": false, 
      "properties": {
        "user": { 
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": { 
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}

注：取值如果为strict (非布尔值)要加引号。

文档中有一个之前没有出现过的字段被添加到ELasticsearch之后，文档的type mapping中会自动添加一个新的字段。这个可以通过dynamic属性去控制，dynamic属性为false会忽略新增的字段、dynamic属性为strict会抛出异常。如果dynamic为true的话，ELasticsearch会自动根据字段的值推测出来类型进而确定mapping：

JSON格式的数据	自动推测的字段类型
null	没有字段被添加
true or false	boolean类型
floating类型数字	floating类型
integer	long类型
JSON对象	object类型
数组	由数组中第一个非空值决定
string	有可能是date类型（开启日期检测)、double或long类型、text类型、keyword类型

日期检测默认是检测符合以下日期格式的字符串：

1	[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z\|\|yyyy/MM/dd Z"]

例子:

PUT my_index/my_type/1
{
  "create_date": "2015/09/02"
}

GET my_index/_mapping

mapping 如下，可以看到create_date为date类型：

{
  "my_index": {
    "mappings": {
      "my_type": {
        "properties": {
          "create_date": { "type": "date", "format": "yyyy/MM/dd HH:mm:ss||yyyy/MM/dd||epoch_millis" } }
      }
    }
  }
}

关闭日期检测：

PUT my_index
{
  "mappings": {
    "my_type": {
      "date_detection": false
    }
  }
}

PUT my_index/my_type/1 
{
  "create": "2015/09/02"
}

再次查看mapping，create字段已不再是date类型：

GET my_index/_mapping
返回结果：
{
  "my_index": {
    "mappings": {
      "my_type": {
        "date_detection": false,
        "properties": {
          "create": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      }
    }
  }
}

自定义日期检测的格式：

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}

PUT my_index/my_type/1
{
  "create_date": "09/25/2015"
}

开启数字类型自动检测：

PUT my_index
{
  "mappings": {
    "my_type": {
      "numeric_detection": true
    }
  }
}

PUT my_index/my_type/1
{
  "my_float":   "1.0", 
  "my_integer": "1" 
}

enabled

ELasticseaech默认会索引所有的字段，enabled设为false的字段，es会跳过字段内容，该字段只能从_source中获取，但是不可搜。而且字段可以是任意类型。

PUT my_index
{
  "mappings": {
    "session": {
      "properties": {
        "user_id": {
          "type":  "keyword"
        },
        "last_updated": {
          "type": "date"
        },
        "session_data": { 
          "enabled": false
        }
      }
    }
  }
}

PUT my_index/session/session_1
{
  "user_id": "kimchy",
  "session_data": { 
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

PUT my_index/session/session_2
{
  "user_id": "jpountz",
  "session_data": "none", 
  "last_updated": "2015-12-06T18:22:13"
}

fielddata

搜索要解决的问题是“包含查询关键词的文档有哪些？”，聚合恰恰相反，聚合要解决的问题是“文档包含哪些词项”，大多数字段再索引时生成doc_values，但是text字段不支持doc_values。

取而代之，text字段在查询时会生成一个fielddata的数据结构，fielddata在字段首次被聚合、排序、或者使用脚本的时候生成。ELasticsearch通过读取磁盘上的倒排记录表重新生成文档词项关系，最后在Java堆内存中排序。

text字段的fielddata属性默认是关闭的，开启fielddata非常消耗内存。在你开启text字段以前，想清楚为什么要在text类型的字段上做聚合、排序操作。大多数情况下这么做是没有意义的。

“New York”会被分析成“new”和“york”，在text类型上聚合会分成“new”和“york”2个桶，也许你需要的是一个“New York”。这是可以加一个不分词的keyword字段：

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_field": { 
          "type": "text",
          "fields": {
            "keyword": { 
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

上面的mapping中实现了通过my_field字段做全文搜索，my_field.keyword做聚合、排序和使用脚本。

format

format属性主要用于格式化日期：

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}

ignore_above

ignore_above用于指定字段索引和存储的长度最大值，超过最大值的会被忽略：

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 15
        }
      }
    }
  }
}

PUT my_index/my_type/1 
{
  "message": "Syntax error"
}

PUT my_index/my_type/2 
{
  "message": "Syntax error with some long stacktrace"
}

GET my_index/_search 
{
  "size": 0, 
  "aggs": {
    "messages": {
      "terms": {
        "field": "message"
      }
    }
  }
}

mapping中指定了ignore_above字段的最大长度为15，第一个文档的字段长小于15，因此索引成功，第二个超过15，因此不索引，返回结果只有”Syntax error”,结果如下：

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "messages": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": []
    }
  }
}

ignore_malformed

ignore_malformed可以忽略不规则数据，对于login字段，有人可能填写的是date类型，也有人填写的是邮件格式。给一个字段索引不合适的数据类型发生异常，导致整个文档索引失败。如果ignore_malformed参数设为true，异常会被忽略，出异常的字段不会被索引，其它字段正常索引。

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text":       "Some text value",
  "number_one": "foo" 
}

PUT my_index/my_type/2
{
  "text":       "Some text value",
  "number_two": "foo" 
}

上面的例子中number_one接受integer类型，ignore_malformed属性设为true，因此文档一种number_one字段虽然是字符串但依然能写入成功；number_two接受integer类型，默认ignore_malformed属性为false，因此写入失败。

include_in_all

include_in_all属性用于指定字段是否包含在_all字段里面，默认开启，除索引时index属性为no。
例子如下，title和content字段包含在_all字段里，date不包含。

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": { 
          "type": "text"
        },
        "content": { 
          "type": "text"
        },
        "date": { 
          "type": "date",
          "include_in_all": false
        }
      }
    }
  }
}

include_in_all也可用于字段级别，如下my_type下的所有字段都排除在_all字段之外，author.first_name 和author.last_name 包含在in _all中：

PUT my_index
{
  "mappings": {
    "my_type": {
      "include_in_all": false, 
      "properties": {
        "title":          { "type": "text" },
        "author": {
          "include_in_all": true, 
          "properties": {
            "first_name": { "type": "text" },
            "last_name":  { "type": "text" }
          }
        },
        "editor": {
          "properties": {
            "first_name": { "type": "text" }, 
            "last_name":  { "type": "text", "include_in_all": true } 
          }
        }
      }
    }
  }
}

index

index属性指定字段是否索引，不索引也就不可搜索，取值可以为true或者false。

index_options

index_options控制索引时存储哪些信息到倒排索引中，接受以下配置：

参数	作用
docs	只存储文档编号
freqs	存储文档编号和词项频率
positions	文档编号、词项频率、词项的位置被存储，偏移位置可用于临近搜索和短语查询
offsets	文档编号、词项频率、词项的位置、词项开始和结束的字符位置都被存储，offsets设为true会使用Postings highlighter

fields

fields可以让同一文本有多种不同的索引方式，比如一个String类型的字段，可以使用text类型做全文检索，使用keyword类型做聚合和排序。

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "city": "New York"
}

PUT my_index/my_type/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

norms

norms参数用于标准化文档，以便查询时计算文档的相关性。norms虽然对评分有用，但是会消耗较多的磁盘空间，如果不需要对某个字段进行评分，最好不要开启norms。

null_value

值为null的字段不索引也不可以搜索，null_value参数可以让值为null的字段显式的可索引、可搜索。例子：

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "status_code": {
          "type":       "keyword",
          "null_value": "NULL" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "status_code": null
}

PUT my_index/my_type/2
{
  "status_code": [] 
}

GET my_index/_search
{
  "query": {
    "term": {
      "status_code": "NULL" 
    }
  }
}

文档1可以被搜索到，因为status_code的值为null，文档2不可以被搜索到，因为status_code为空数组，但不是null。

position_increment_gap

为了支持近似或者短语查询，text字段被解析的时候会考虑此项的位置信息。举例，一个字段的值为数组类型：

1	"names": [ "John Abraham", "Lincoln Smith"]

为了区别第一个字段和第二个字段，Abraham和Lincoln在索引中有一个间距，默认是100。例子如下，这是查询”Abraham Lincoln”是查不到的：

PUT my_index/groups/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln" 
            }
        }
    }
}

指定间距大于100可以查询到：

GET my_index/groups/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop": 101 
            }
        }
    }
}

在mapping中通过position_increment_gap参数指定间距：

PUT my_index
{
  "mappings": {
    "groups": {
      "properties": {
        "names": {
          "type": "text",
          "position_increment_gap": 0 
        }
      }
    }
  }
}

properties

PUT my_index
{
  "mappings": {
    "my_type": { 
      "properties": {
        "manager": { 
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        },
        "employees": { 
          "type": "nested",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        }
      }
    }
  }
}

对应的文档结构：

PUT my_index/my_type/1 
{
  "region": "US",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}

可以对manager.name、manager.age做搜索、聚合等操作。

GET my_index/_search
{
  "query": {
    "match": {
      "manager.name": "Alice White" 
    }
  },
  "aggs": {
    "Employees": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "Employee Ages": {
          "histogram": {
            "field": "employees.age", 
            "interval": 5
          }
        }
      }
    }
  }
}

search_analyzer

大多数情况下索引和搜索的时候应该指定相同的分析器，确保query解析以后和索引中的词项一致。但是有时候也需要指定不同的分析器，例如使用edge_ngram过滤器实现自动补全。

默认情况下查询会使用analyzer属性指定的分析器，但也可以被search_analyzer覆盖。例子：

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": { 
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete", 
          "search_analyzer": "standard" 
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick Brown Fox" 
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Quick Br", 
        "operator": "and"
      }
    }
  }
}

similarity

BM25 ：ES和Lucene默认的评分模型
classic ：TF/IDF评分
boolean：布尔模型评分
例子

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "default_field": { 
          "type": "text"
        },
        "classic_field": {
          "type": "text",
          "similarity": "classic" 
        },
        "boolean_sim_field": {
          "type": "text",
          "similarity": "boolean" 
        }
      }
    }
  }
}

default_field自动使用BM25评分模型，classic_field使用TF/IDF经典评分模型，boolean_sim_field使用布尔评分模型。

store

默认情况下，自动是被索引的也可以搜索，但是不存储，这也没关系，因为_source字段里面保存了一份原始文档。在某些情况下，store参数有意义，比如一个文档里面有title、date和超大的content字段，如果只想获取title和date，可以这样：

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "text",
          "store": true 
        },
        "date": {
          "type": "date",
          "store": true 
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

GET my_index/_search
{
  "stored_fields": [ "title", "date" ] 
}

查询结果：

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 1,
        "fields": {
          "date": [
            "2015-01-01T00:00:00.000Z"
          ],
          "title": [
            "Some short title"
          ]
        }
      }
    ]
  }
}

Stored fields返回的总是数组，如果想返回原始字段，还是要从_source中取。

term_vector

词向量包含了文本被解析以后的以下信息：

词项集合
词项位置
词项的起始字符映射到原始文档中的位置。

term_vector参数有以下取值：

参数取值	含义
no	默认值，不存储词向量
yes	只存储词项集合
with_positions	存储词项和词项位置
with_offsets	词项和字符偏移位置
with_positions_offsets	存储词项、词项位置、字符偏移位置

例子：

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type":        "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

PUT my_index/my_type/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} 
    }
  }
}

动态Mapping `_default_`

在mapping中使用default字段，那么其它字段会自动继承default中的设置。

PUT my_index
{
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false
      }
    },
    "user": {}, 
    "blogpost": { 
      "_all": {
        "enabled": true
      }
    }
  }
}

上面的 mapping 中，_default_ 中关闭了 _all 字段，user会继承 _default_ 中的配置，因此 user 中的 _all 字段也是关闭的，blogpost 中开启 _all，覆盖了 _default 的默认配置。

当default被更新以后，只会对后面新加的文档产生作用。

dynamic_templates

动态模板可以根据字段名称设置mapping，如下对于string类型的字段，设置mapping为：

1	"mapping": { "type": "long"}

但是匹配字段名称为long_格式的，不匹配_text格式的：

PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "longs_as_strings": {
            "match_mapping_type": "string",
            "match":   "long_*",
            "unmatch": "*_text",
            "mapping": {
              "type": "long"
            }
          }
        }
      ]
    }
  }
}

PUT my_index/my_type/1
{
  "long_num": "5", 
  "long_text": "foo" 
}12345678910111213141516171819202122232425

写入文档以后，long_num字段为long类型，long_text 仍为string类型。

本文链接： https://lxb.wiki/7a9a148e/

版权声明： 本博客所有文章除特别声明外，均采用 CC BY-NC 4.0 许可协议。转载请注明出处！