I am trying to weed out a certain kind of customer from our database. I've noticed a trend where people fill out their first name with a name that partially matches how they fill out their company name. An example looks like this:
business_name            first_name
-------------            ----------
locksmith taylorsville   locksmith
locksmith roy            locksmi
locksmith clinton        locks
locksmith farmington     locksmith

These are people I do not want being pulled in a query. They are bad eggs. I'm trying to put together a query with a WHERE statement (presumably) that isolates anyone who has a first name that contains at least a partial match to their business name, but I'm stumped and could use some help.
Solution

You can employ a similarity-based approach. Try the code at the bottom of this answer. It produces a result like the one below:
business_name            partial_business_name   first_name   similarity
locksmith taylorsville   locksmith               locksmith    1.0
locksmith farmington     locksmith               locksmith    1.0
locksmith roy            locksmith               locksmi      0.7777777777777778
locksmith clinton        locksmith               locks        0.5555555555555556

So, you will be able to control what to filter out based on the similarity value.
**Code**
    SELECT business_name, partial_business_name, first_name, similarity
    FROM JS(
      // input table
      (
        SELECT
          business_name,
          REGEXP_EXTRACT(business_name, r'^(\w+)') AS partial_business_name,
          first_name AS first_name
        FROM
          (SELECT 'locksmith taylorsville' AS business_name, 'locksmith' AS first_name),
          (SELECT 'locksmith roy' AS business_name, 'locksmi' AS first_name),
          (SELECT 'locksmith clinton' AS business_name, 'locks' AS first_name),
          (SELECT 'locksmith farmington' AS business_name, 'locksmith' AS first_name)
      ),
      // input columns
      business_name, partial_business_name, first_name,
      // output schema
      "[{name: 'business_name', type:'string'},
        {name: 'partial_business_name', type:'string'},
        {name: 'first_name', type:'string'},
        {name: 'similarity', type:'float'}]
      ",
      // function
      "function(r, emit) {

        var _extend = function(dst) {
          var sources = Array.prototype.slice.call(arguments, 1);
          for (var i=0; i<sources.length; ++i) {
            var src = sources[i];
            for (var p in src) {
              if (src.hasOwnProperty(p)) dst[p] = src[p];
            }
          }
          return dst;
        };

        var Levenshtein = {
          /**
           * Calculate levenshtein distance of the two strings.
           *
           * @param str1 String the first string.
           * @param str2 String the second string.
           * @return Integer the levenshtein distance (0 and above).
           */
          get: function(str1, str2) {
            // base cases
            if (str1 === str2) return 0;
            if (str1.length === 0) return str2.length;
            if (str2.length === 0) return str1.length;

            // two rows
            var prevRow = new Array(str2.length + 1),
                curCol, nextCol, i, j, tmp;

            // initialise previous row
            for (i=0; i<prevRow.length; ++i) {
              prevRow[i] = i;
            }

            // calculate current row distance from previous row
            for (i=0; i<str1.length; ++i) {
              nextCol = i + 1;

              for (j=0; j<str2.length; ++j) {
                curCol = nextCol;

                // substitution
                nextCol = prevRow[j] + ( (str1.charAt(i) === str2.charAt(j)) ? 0 : 1 );
                // insertion
                tmp = curCol + 1;
                if (nextCol > tmp) {
                  nextCol = tmp;
                }
                // deletion
                tmp = prevRow[j + 1] + 1;
                if (nextCol > tmp) {
                  nextCol = tmp;
                }

                // copy current col value into previous (in preparation for next iteration)
                prevRow[j] = curCol;
              }

              // copy last col value into previous (in preparation for next iteration)
              prevRow[j] = nextCol;
            }

            return nextCol;
          }
        };

        var the_partial_business_name, the_first_name;

        // decodeURI throws on malformed escape sequences; fall back to the raw value
        try {
          the_partial_business_name = decodeURI(r.partial_business_name).toLowerCase();
        } catch (ex) {
          the_partial_business_name = r.partial_business_name.toLowerCase();
        }

        try {
          the_first_name = decodeURI(r.first_name).toLowerCase();
        } catch (ex) {
          the_first_name = r.first_name.toLowerCase();
        }

        // similarity = 1 - (edit distance / length of the first word of the business name)
        emit({business_name: r.business_name,
              partial_business_name: the_partial_business_name,
              first_name: the_first_name,
              similarity: 1 - Levenshtein.get(the_partial_business_name, the_first_name) / the_partial_business_name.length});

      }"
    )
    ORDER BY similarity DESC

This code was used in "How to perform trigram operations in Google BigQuery?" and is based on storage.googleapis/thomaspark-sandbox/udf-examples/pataky.js by @thomaspark, where Levenshtein distance is used to measure similarity.
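
To experiment with a cut-off before running anything in BigQuery, the same similarity ratio can be reproduced in plain JavaScript. This is only a minimal sketch, not the code from the answer: the helper names `levenshtein` and `similarity`, the `THRESHOLD` value of 0.5, and the extra `acme plumbing` row are assumptions made for illustration.

```javascript
// Levenshtein distance using the standard two-row dynamic programming scheme.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;

  // prev[j] = distance between the first i chars of a and the first j chars of b
  let prev = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);

  for (let i = 0; i < a.length; i++) {
    const cur = [i + 1];
    for (let j = 0; j < b.length; j++) {
      const cost = a[i] === b[j] ? 0 : 1;
      cur.push(Math.min(
        prev[j] + cost,   // substitution (or match)
        cur[j] + 1,       // insertion
        prev[j + 1] + 1   // deletion
      ));
    }
    prev = cur;
  }
  return prev[b.length];
}

// Same ratio as the query: 1 - distance / length of the first word of business_name.
function similarity(businessName, firstName) {
  const match = /^(\w+)/.exec(businessName);
  if (!match) return 0; // assumption: no leading word means nothing to compare
  const partial = match[1].toLowerCase();
  return 1 - levenshtein(partial, firstName.toLowerCase()) / partial.length;
}

const rows = [
  { business_name: 'locksmith taylorsville', first_name: 'locksmith' },
  { business_name: 'locksmith roy', first_name: 'locksmi' },
  { business_name: 'locksmith clinton', first_name: 'locks' },
  { business_name: 'locksmith farmington', first_name: 'locksmith' },
  { business_name: 'acme plumbing', first_name: 'john' } // hypothetical legitimate customer
];

// Hypothetical cut-off: rows scoring at or above it are treated as "bad eggs".
const THRESHOLD = 0.5;
const kept = rows.filter(r => similarity(r.business_name, r.first_name) < THRESHOLD);

console.log(kept);
```

With this threshold, the four locksmith rows from the question are dropped and only the made-up `acme plumbing` row is kept. Raising or lowering `THRESHOLD` trades false positives against missed matches, so it is worth checking against real data before turning it into a filter on `similarity` in the query.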