I have an interesting problem that I need help with. I am currently working on a feature of my program and stumbled into this issue.
I have a huge list of street names in Indonesia (> 100k rows) stored in a database. Each street name may have more than one word. For example, "Sudirman", "Gatot Subroto", and "Jalan Asia Afrika" are all legitimate street names.
I also have a bunch of texts (> 1 million rows) in the database, which I split into sentences. The feature (a function, to be exact) that I need tests whether a sentence contains a street name, so it is just a true/false test.
I have tried to solve it with these steps:
a. Put the street names into a key/value hash
b. Split each sentence into words
c. Test whether each word is in the hash
This is fast, but it does not work for street names that consist of multiple words.
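The hash approach above can be sketched roughly as follows, to show where it breaks down. This is only an illustration: the function name `contains_street_single_word`, the sample sentences, and the whitespace tokenization are assumptions, not part of the original program.

```python
# Street names stored in a hash-based set (sample data taken from the question).
street_names = {"Sudirman", "Gatot Subroto", "Jalan Asia Afrika"}

def contains_street_single_word(sentence, names):
    """Return True if any single word of the sentence is itself a street name."""
    # Assumption: words are separated by whitespace and casing matches exactly.
    return any(word in names for word in sentence.split())

# A one-word street name is found.
print(contains_street_single_word("Macet di Sudirman pagi ini", street_names))
# A two-word street name is missed, because neither "Gatot" nor "Subroto"
# is a key on its own.
print(contains_street_single_word("Macet di Gatot Subroto pagi ini", street_names))
```

The first call finds the name and the second does not, which is the multi-word limitation described above.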
Another alternative that I thought of is:
a. Split each sentence into words
b. Query the database with a LIKE statement (i.e. SELECT #### FROM street_table WHERE name LIKE '%word%')
c. If the query returns a row, the sentence contains a street name
Now, this solution is going to be very IO intensive.
So my question is: what is the most efficient way to do this test, regardless of the programming language? I do this mainly in Python, but any language will do as long as I can grasp the concepts.
============EDIT 1 =================
Will this be periodic?
Yes, I will call this feature/function at an interval of 1 minute. Each call will take at least 100 rows of text and test them against the street name database.
Recommended answer
A simple solution would be to create a dictionary/multimap of first-word-of-street-name => full-street-name(s). As you iterate over each word in the sentence, you look up the potential street names that start with that word and check whether one of them matches by looking at the words that follow.
This algorithm should be fairly easy to implement and should perform pretty well too.
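A minimal sketch of that idea in Python, under some assumptions: the names `build_index` and `contains_street` are hypothetical, matching is case-insensitive, and sentences are tokenized by whitespace only (punctuation attached to a word would need extra cleanup that is not shown here).

```python
from collections import defaultdict

def build_index(street_names):
    """Map the first word of each street name to the full names, as word tuples."""
    index = defaultdict(list)
    for name in street_names:
        words = tuple(name.lower().split())
        if words:  # skip empty names
            index[words[0]].append(words)
    return index

def contains_street(sentence, index):
    """Return True if any indexed street name occurs as consecutive words."""
    words = sentence.lower().split()
    for i, word in enumerate(words):
        # Only names starting with this word are candidates.
        for candidate in index.get(word, ()):
            # Compare the candidate with the words starting at position i.
            if tuple(words[i:i + len(candidate)]) == candidate:
                return True
    return False

index = build_index(["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"])
print(contains_street("Macet di Gatot Subroto pagi ini", index))
print(contains_street("Gatot pergi ke pasar", index))
```

The index would be built once and kept in memory, so each batch of roughly 100 texts per minute only costs dictionary lookups and short tuple comparisons, with no database query per word.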