I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues which I am also trying to fix but until I have the go-ahead to do so I am stuck with the data I have got.
The table columns available to me are:
    SURNAME         VARCHAR2(43)
    FORENAME        VARCHAR2(38)
    BIRTH_DATE      DATE
    ADDRESS_LINE1   VARCHAR2(60)
    ADDRESS_LINE2   VARCHAR2(60)
    ADDRESS_LINE3   VARCHAR2(60)
    ADDRESS_LINE4   VARCHAR2(60)
    ADDRESS_LINE5   VARCHAR2(60)
    POSTCODE        VARCHAR2(15)
The SOUNDEX function is of relatively limited use for this, but the UTL_MATCH package seems to offer a better level of matching using the Jaro Winkler algorithm.
Rather than re-inventing the wheel, has anyone implemented a reliable method for matching this type of data?
Data Quality issues to contend with:
For example, I am thinking of:
Concatenating all address fields and applying the Jaro Winkler algorithm to the full address combined with a similar test of the full name concatenated together.
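The idea above could be sketched roughly as follows. This is an illustration only, not a tested solution: the table name PERSONS and the key column PERSON_ID are assumptions (neither is given in the question), and the birth date is used to limit which pairs get scored at all.

```sql
-- Sketch only: PERSONS and PERSON_ID are hypothetical names.
-- Pairs are restricted to rows sharing a birth date, then scored on
-- the concatenated full name and the concatenated address lines.
select a.person_id as id_a
     , b.person_id as id_b
     , utl_match.jaro_winkler_similarity(
           upper(a.forename || ' ' || a.surname)
         , upper(b.forename || ' ' || b.surname)) as name_jw
     , utl_match.jaro_winkler_similarity(
           upper(a.address_line1 || ' ' || a.address_line2 || ' ' ||
                 a.address_line3 || ' ' || a.address_line4 || ' ' ||
                 a.address_line5)
         , upper(b.address_line1 || ' ' || b.address_line2 || ' ' ||
                 b.address_line3 || ' ' || b.address_line4 || ' ' ||
                 b.address_line5)) as address_jw
from persons a
     join persons b
       on b.birth_date = a.birth_date
      and b.person_id > a.person_id   -- each pair once, no self-pairs
/
```

Any cut-off applied to NAME_JW and ADDRESS_JW would have to be chosen by inspecting real data. If BIRTH_DATE carries a time component, the equality join would also need a TRUNC on both sides.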
The birth date can be compared directly for a match but due to the large volume of data just matching upon this isn't enough.
Oracle 10g R2 Enterprise Edition.
Any helpful suggestions welcome.
Recommended answer
"I am trying to find a reliable method for matching duplicate person records within the database."
Alas there is no such thing. The most you can hope for is a system with a reasonable element of doubt.
    SQL> select n1
              , n2
              , soundex(n1) as sdx_n1
              , soundex(n2) as sdx_n2
              , utl_match.edit_distance_similarity(n1, n2) as ed
              , utl_match.jaro_winkler_similarity(n1, n2) as jw
         from t94
         order by n1, n2
         /

    N1                   N2                   SDX_ SDX_         ED         JW
    -------------------- -------------------- ---- ---- ---------- ----------
    MARK                 MARKIE               M620 M620         67         93
    MARK                 MARKS                M620 M620         80         96
    MARK                 MARKUS               M620 M622         67         93
    MARKY                MARKIE               M620 M620         67         89
    MARSK                MARKS                M620 M620         60         95
    MARX                 AMRX                 M620 A562         50         91
    MARX                 M4RX                 M620 M620         75         85
    MARX                 MARKS                M620 M620         60         84
    MARX                 MARSK                M620 M620         60         84
    MARX                 MAX                  M620 M200         75         93
    MARX                 MRX                  M620 M620         75         92

    11 rows selected.
The big advantage of SOUNDEX is that it tokenizes the string. This means it gives you something which can be indexed, which is incredibly valuable when it comes to large amounts of data. On the other hand it is old and crude. There are newer algorithms around, such as Metaphone and Double Metaphone. You should be able to find PL/SQL implementations of them via Google.
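As a rough sketch of what "something which can be indexed" might look like in practice: a function-based index on the SOUNDEX token. The table name PERSONS and the index name are assumptions made for illustration; only SURNAME, FORENAME and BIRTH_DATE come from the question.

```sql
-- Sketch only: PERSONS and the index name are hypothetical.
-- A function-based index stores the SOUNDEX token of each surname.
create index persons_surname_sdx_i on persons (soundex(surname))
/

-- A predicate on the same expression can then be answered from the
-- index rather than by computing SOUNDEX for every row in the table.
select surname, forename, birth_date
from persons
where soundex(surname) = soundex('MARKS')
/
```

Whether the optimizer actually picks the index depends on statistics and data distribution, so it is worth checking the execution plan.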
The advantage of scoring is that it allows for a degree of fuzziness, so you can find all rows where name_score >= 90%. The crushing disadvantage is that the scores are relative: a score exists only for a given pair of strings, so you cannot index it. With large volumes, this sort of comparison kills you.
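To illustrate the threshold approach, here is a minimal sketch against the T94 test table used in the query earlier in this answer. The 90 cut-off is taken from the example in the text, not a recommended value.

```sql
-- T94 is the two-column test table queried earlier in this answer.
-- The similarity is evaluated for each row's pair of names, and only
-- pairs scoring 90 or more are kept.
select n1
     , n2
     , utl_match.jaro_winkler_similarity(n1, n2) as jw
from t94
where utl_match.jaro_winkler_similarity(n1, n2) >= 90
order by n1, n2
/
```

Going by the output shown earlier, this would keep seven of the eleven pairs, including MARX vs AMRX (91), which SOUNDEX puts in different buckets (M620 vs A562). When the two names come from different rows of a large table, every candidate pair has to be scored, which is where the cost explodes.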
What this means is:
In my experience concatenating the tokens (first name, last name) is a mixed blessing. It solves certain problems (such as whether the road name appears in address line 1 or address line 2) but causes other problems: consider scoring GRAHAM OLIVER vs OLIVER GRAHAM against scoring OLIVER vs OLIVER, GRAHAM vs GRAHAM, OLIVER vs GRAHAM and GRAHAM vs OLIVER.
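The comparison described above can be tried directly. This sketch scores the concatenated names once, then scores the individual tokens both like-for-like and crossed over, treating the first record as forename GRAHAM, surname OLIVER and the second as forename OLIVER, surname GRAHAM. The column aliases are made up for readability.

```sql
-- Concatenated comparison versus token-by-token comparison,
-- using the example names from the paragraph above.
select utl_match.jaro_winkler_similarity('GRAHAM OLIVER', 'OLIVER GRAHAM') as concat_jw
     , utl_match.jaro_winkler_similarity('GRAHAM', 'OLIVER') as fore_vs_fore
     , utl_match.jaro_winkler_similarity('OLIVER', 'GRAHAM') as sur_vs_sur
     , utl_match.jaro_winkler_similarity('GRAHAM', 'GRAHAM') as fore_vs_sur
     , utl_match.jaro_winkler_similarity('OLIVER', 'OLIVER') as sur_vs_fore
from dual
/
```

One possible design is to take the better of the like-for-like and crossed-over results as the name score, so that transposed forename and surname still register as a match. That is a suggestion for experimentation, not something the scores alone can decide.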
Whatever you do, you will still end up with false positives and missed hits. No algorithm is proof against typos (although Jaro Winkler did pretty well with MARX vs AMRX).