比较相似度算法

编程入门行业动态更新时间:2024-10-14 22:14:36

本文介绍了比较相似度算法的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我想使用字符串相似性函数在数据库中查找损坏的数据.

I want to use string similarity functions to find corrupted data in my database.

我遇到了其中几个:

Jaro，
Jaro-Winkler，
Levenshtein，
欧几里得和
Q-gram

我想知道它们之间有什么区别，以及它们在什么情况下最有效?

I wanted to know what is the difference between them and in what situations they work best?

推荐答案

在勘误表和注意到一些适用于类似问题空间的算法的可比性，让我们先探讨这些算法的适用性，然后再确定它们是否适用.在数值上是可比的.

Expanding on my wiki-walk comment in the errata and noting some of the ground-floor literature on the comparability of algorithms that apply to similar problem spaces, let's explore the applicability of these algorithms before we determine if they're numerically comparable.

从Wikipedia， Jaro-Winkler :

From Wikipedia, Jaro-Winkler:

在计算机科学和统计学中，Jaro–Winkler距离 (Winkler，1990)是衡量两个字符串之间相似性的标准.它是 Jaro距离度量的一种变体(Jaro，1989，1995)，以及主要[需要引用]用于记录链接领域(重复检测).两根琴弦的Jaro–Winkler距离越高，字符串越相似. Jaro–Winkler距离度量为设计并最适合诸如人名之类的短字符串.这分数被标准化，以使0等于无相似性，而1为完全匹配.

In computer science and statistics, the Jaro–Winkler distance (Winkler, 1990) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly[citation needed] used in the area of record linkage (duplicate detection). The higher the Jaro–Winkler distance for two strings is, the more similar the strings are. The Jaro–Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

Levenshtein距离:

在信息论和计算机科学中，Levenshtein距离是用于衡量两个之间的差异量的字符串量度序列.术语编辑距离"通常用于专门指代到Levenshtein的距离.

In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.

两个琴弦之间的Levenshtein距离定义为最小将一个字符串转换为另一个字符串所需的编辑次数，其中允许的编辑操作是插入，删除或替换单个字符.它以弗拉基米尔·弗拉基米尔(Vladimir)命名莱文施泰因(Levenshtein)，他在1965年考虑了这一距离.

The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. It is named after Vladimir Levenshtein, who considered this distance in 1965.

欧几里德距离:

在数学中，欧几里德距离或欧几里德度量是两点之间的普通"距离，用一个点来测量尺，并由毕达哥拉斯公式给出.通过使用这个公式随着距离的增加，欧几里德空间(甚至任何内积空间)变为公制空间.关联的规范称为欧几里得规范. 较早的文献将度量标准称为毕达哥拉斯度量标准.

In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, and is given by the Pythagorean formula. By using this formula as distance, Euclidean space (or even any inner product space) becomes a metric space. The associated norm is called the Euclidean norm. Older literature refers to the metric as Pythagorean metric.

Q或n-gram编码:

在计算语言学和概率领域，一个n-gram 是给定文本序列中n个项目的连续序列，或者演讲.有问题的项目可以是音素，音节，字母，单词或碱基对根据应用程序. n-克是从文本或语音语料库中收集.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected from a text or speech corpus.

两个核心 n元语法模型的优点(以及使用它们)是相对简单的，并且具有扩展的能力-只需增加一个模型可以用来存储更多的上下文易于理解的时空权衡，可以进行小型实验非常有效地扩展.

The two core advantages of n-gram models (and algorithms that use them) are relative simplicity and the ability to scale up – by simply increasing n a model can be used to store more context with a well-understood space–time tradeoff, enabling small experiments to scale up very efficiently.

问题在于这些算法解决了所有可能解决算法数据中或嫁接可用的度量标准.实际上，并非所有这些指标甚至都是 metrics ，因为其中一些指标不能满足三角形不等式.

The trouble is these algorithms solve different problems that have different applicability within the space of all possible algorithms to solve the longest common subsequence problem, in your data or in grafting a usable metric thereof. In fact, not all of these are even metrics, as some of them don't satisfy the triangle inequality.

正确地做到这一点:使用校验和和奇偶校验位作为数据.当更简单的解决方案可以解决时，不要尝试解决更困难的问题.

Instead of going out of your way to define a dubious scheme to detect data corruption, do this properly: by using checksums and parity bits for your data. Don't try to solve a much harder problem when a simpler solution will do.

更多推荐

比较相似度算法

本文发布于:2023-11-30 02:19:00，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1648395.html