Anybody care to explain “Tokenized Field” in terms of Databases?

Updated: 2024-10-28 20:28:31

I am reading about SOLR and indexing a MySQL database into SOLR.

What do they mean by "tokenize" and "un-tokenize"?

And what does it mean when fields are "normalized"?

I know how and what it means to normalize a database, but a field? How can a simple field be normalized?

Thanks

Best answer

What do they mean by "tokenize" and "un-tokenize"?

Tokenizing a field enables full-text search, i.e. finding any word that occurs anywhere in the field. An untokenized field matches only on a complete and exact match: if the field's content is "blue moon", it will be found when you search for "blue moon" but not when you search only for "blue".
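To make the difference concrete, here is a toy sketch in Python. It is not how Solr works internally (Solr uses configurable analyzers and an inverted index); the `tokenize`, `matches_tokenized` and `matches_untokenized` helpers are made-up names, and splitting on whitespace is a simplifying assumption standing in for a real analyzer.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def matches_tokenized(field_value, query):
    """True if every word of the query occurs somewhere in the field."""
    field_tokens = set(tokenize(field_value))
    return all(token in field_tokens for token in tokenize(query))


def matches_untokenized(field_value, query):
    """True only if the query equals the entire field value."""
    return field_value == query


field = "blue moon"
print(matches_tokenized(field, "blue"))         # True
print(matches_tokenized(field, "blue moon"))    # True
print(matches_untokenized(field, "blue"))       # False
print(matches_untokenized(field, "blue moon"))  # True
```

An untokenized field is usually what you want for IDs, codes, or values you sort and facet on; a tokenized field is what you want for free text.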

And what does it mean when fields are "normalized"?

This most likely refers to Unicode normalization. Unicode has separate code points for combining diacritics, e.g. U+0300 is the combining grave accent, so the accented letter è can be either a single Unicode character (U+00E8) or a sequence of two (U+0065 followed by U+0300). But of course you want both to be found when you search for è.
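If "normalized" does mean Unicode normalization here, the effect can be seen with Python's standard `unicodedata` module. This is only an illustration of the idea, not what Solr does internally; `normalize_field` is a hypothetical helper, and choosing NFC as the default form is an assumption.

```python
import unicodedata

precomposed = "\u00e8"   # è as a single code point
decomposed = "e\u0300"   # e (U+0065) followed by COMBINING GRAVE ACCENT (U+0300)


def normalize_field(text, form="NFC"):
    """Bring text into one canonical Unicode form before indexing or searching."""
    return unicodedata.normalize(form, text)


# The two strings render the same but compare as different code point sequences.
print(precomposed == decomposed)                                    # False

# After normalizing both sides to the same form, they compare equal.
print(normalize_field(precomposed) == normalize_field(decomposed))  # True
```

The same normalization has to be applied to both the indexed text and the query, otherwise the two forms still will not match.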

Published: 2023-08-02 18:40:00

Article link: https://www.elefans.com/category/jswz/34/1379436.html