I am building a CSV import tool for the project I'm working on. The client needs to be able to enter the data in Excel, export it as CSV, and upload it to the database. For example, I have this CSV record:
1, John Doe, ACME Comapny (the typo is on purpose)
Of course, the companies are kept in a separate table and linked with a foreign key, so I need to discover the correct company ID before inserting. I plan to do this by comparing the company names in the database with the company names in the CSV. The comparison should return 0 if the strings are exactly the same, and a value that gets bigger as the strings get more different, but strcmp doesn't cut it here because:
"Acme Company" and "Acme Comapny" should have a very small difference index, but "Acme Company" and "Cmea Mpnyaco" should have a very big difference index. "Acme Company" and "Acme Comp." should also have a small difference index, even though the character count is different. Also, "Acme Company" and "Company Acme" should return 0.
So if the client makes a typo while entering data, I could prompt him to choose the name he most probably wanted to insert.
Is there a known algorithm to do this, or maybe we can invent one :) ?
Recommended answer
You might want to check out the Levenshtein Distance algorithm as a starting point. It will rate the "distance" between two words.
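As a rough illustration, here is a minimal Python sketch of Levenshtein distance using the standard dynamic-programming recurrence. The `normalize` and `name_distance` helpers are hypothetical additions, not part of the algorithm itself: they lowercase the names and sort their words so that "Acme Company" and "Company Acme" compare as equal, which is one possible way to meet the word-order requirement from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def normalize(name: str) -> str:
    """Hypothetical helper: ignore case and word order."""
    return " ".join(sorted(name.lower().split()))


def name_distance(a: str, b: str) -> int:
    """Hypothetical helper: Levenshtein distance on normalized names."""
    return levenshtein(normalize(a), normalize(b))


print(name_distance("Acme Company", "Acme Comapny"))  # 2
print(name_distance("Acme Company", "Company Acme"))  # 0
```

Note that plain Levenshtein counts a swapped pair of letters ("pa" vs "ap") as two edits. If transpositions should cost one, the Damerau-Levenshtein variant is the usual alternative.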
This SO thread on implementing a Google-style "Do you mean...?" system may provide some ideas as well.
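To give a feel for what such a prompt could look like, here is a small sketch that ranks known company names against what the client typed. It uses `difflib.get_close_matches` from the Python standard library, which scores candidates with `SequenceMatcher`'s similarity ratio rather than Levenshtein distance, so treat it as a stand-in technique. The function name `suggest_companies`, the `normalize` helper, and the sample company list are all made up for illustration; in the real tool the names would come from the companies table.

```python
import difflib


def normalize(name: str) -> str:
    """Hypothetical helper: ignore case and word order."""
    return " ".join(sorted(name.lower().split()))


def suggest_companies(typed, known_names, limit=3, cutoff=0.6):
    """Return up to `limit` known names that resemble `typed`, best first.

    cutoff is the minimum similarity ratio (0.0 to 1.0); 0.6 is simply
    difflib's default, not a tuned value.
    """
    by_key = {normalize(name): name for name in known_names}
    keys = difflib.get_close_matches(
        normalize(typed), list(by_key), n=limit, cutoff=cutoff
    )
    return [by_key[key] for key in keys]


# Made-up sample data standing in for the companies table.
companies = ["Acme Company", "Globex Corporation", "Initech"]
print(suggest_companies("ACME Comapny", companies))
```

An empty result would mean nothing is close enough, in which case the tool could ask the client whether to create a new company instead.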