R: Fuzzy string matching using jarowinkler

Updated: 2024-10-16 20:25:55

Problem description

I have two vectors of type character in R.

I want to compare a reference list against a raw character list using jarowinkler and assign a % similarity score. For example, if I have 10 reference items and 20 raw data items, I want to get the best score for each comparison and what the algorithm matched it to (so 2 vectors of 10). If I have 8 raw data items and 10 reference items, I should end up with only 2 result vectors of 8 items, holding the best match and score per item.

    item, match, matched_to
    ice, 78, ice cream

Below is my code, which isn't much to look at.

    NumItems.Raw = length(words)
    NumItems.Ref = length(Ref.Desc)

    for (item in words)
    {
      for (refitem in Ref.Desc)
      {
        jarowinkler(refitem, item)
        # Find Best match Score
        # Find Best Item in reference table
        # Add both items to vectors
        # decrement NumItems.Raw
        # Loop
      }
    }

Recommended answer

Using a toy example:

    library(RecordLinkage)
    library(dplyr)

    ref <- c('cat', 'dog', 'turtle', 'cow', 'horse', 'pig', 'sheep', 'koala', 'bear', 'fish')
    words <- c('dog', 'kiwi', 'emu', 'pig', 'sheep', 'cow', 'cat', 'horse')

    wordlist <- expand.grid(words = words, ref = ref, stringsAsFactors = FALSE)
    wordlist %>%
      group_by(words) %>%
      mutate(match_score = jarowinkler(words, ref)) %>%
      summarise(match = match_score[which.max(match_score)],
                matched_to = ref[which.max(match_score)])

This gives:

      words     match matched_to
    1   cat 1.0000000        cat
    2   cow 1.0000000        cow
    3   dog 1.0000000        dog
    4   emu 0.5277778       bear
    5 horse 1.0000000      horse
    6  kiwi 0.5350000      koala
    7   pig 1.0000000        pig
    8 sheep 1.0000000      sheep
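As an aside, the question asked for a percentage score, while the scores in this output are on a 0 to 1 scale. The sketch below, which is not part of the original answer, copies the printed values into a data frame and adds a percentage column. The column name match_pct and the 80% cut-off min_pct are hypothetical choices made for illustration only.

```r
# Scores copied from the printed output (rounded as shown there).
res <- data.frame(
  words = c("cat", "cow", "dog", "emu", "horse", "kiwi", "pig", "sheep"),
  match = c(1, 1, 1, 0.5277778, 1, 0.535, 1, 1),
  matched_to = c("cat", "cow", "dog", "bear", "horse", "koala", "pig", "sheep"),
  stringsAsFactors = FALSE
)

# Express the 0-1 Jaro-Winkler score as a whole-number percentage.
res$match_pct <- round(100 * res$match)

# Hypothetical cut-off: treat anything below 80% as "no confident match".
min_pct <- 80
res$matched_to[res$match_pct < min_pct] <- NA

res
```

With a cut-off like this, weak pairings such as emu/bear and kiwi/koala are reported as missing rather than as matches. A suitable threshold depends on the data and would need tuning.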

In response to the OP's comment: the last command (the pipeline starting at wordlist %>%) uses the pipe approach from dplyr. It groups every combination of raw word and reference by the raw word, adds a column match_score holding the jarowinkler score, and returns only a summary per raw word: the highest match score (indexed by which.max(match_score)) and the reference at that same index.
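For readers who prefer to avoid dplyr, the same "score every pair, keep the best per raw word" idea can be sketched in base R with a score matrix. This is an alternative written for illustration, not the answerer's code. It assumes the RecordLinkage package is installed and relies on jarowinkler() accepting two equal-length character vectors. The names scores, best and result are hypothetical.

```r
library(RecordLinkage)

ref <- c("cat", "dog", "turtle", "cow", "horse", "pig", "sheep", "koala", "bear", "fish")
words <- c("dog", "kiwi", "emu", "pig", "sheep", "cow", "cat", "horse")

# Score matrix: one row per raw word, one column per reference item.
# outer() hands jarowinkler() two equal-length vectors covering every pair.
scores <- outer(words, ref, jarowinkler)

# Column index of the best-scoring reference item for each raw word;
# ties go to the first column, as with which.max().
best <- max.col(scores, ties.method = "first")

result <- data.frame(
  words = words,
  match = scores[cbind(seq_along(words), best)],  # best score per row
  matched_to = ref[best],                         # reference item it came from
  stringsAsFactors = FALSE
)

result
```

The result has one row per raw word, in the original order of words rather than sorted alphabetically as in the summarise() output. Because the matrix has length(words) rows regardless of how many reference items there are, this also covers the case of 8 raw items against 10 references described in the question.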

Published: 2023-10-23 05:30:21

Article link: https://www.elefans.com/category/jswz/34/1519930.html