URL path similarity / string similarity algorithm

Updated: 2024-10-14 14:16:21
Problem description

My problem is that I need to compare URL paths and decide whether they are similar. Below is example data to process:

    # GROUP 1
    /robots.txt

    # GROUP 2
    /bot.html

    # GROUP 3
    /phpMyAdmin-2.5.6-rc1/scripts/setup.php
    /phpMyAdmin-2.5.6-rc2/scripts/setup.php
    /phpMyAdmin-2.5.6/scripts/setup.php
    /phpMyAdmin-2.5.7-pl1/scripts/setup.php
    /phpMyAdmin-2.5.7/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha/scripts/setup.php
    /phpMyAdmin-2.6.0-alpha2/scripts/setup.php

    # GROUP 4
    //phpMyAdmin/

I tried comparing the paths with Levenshtein distance, but it is not accurate enough for me. I do not need a 100% accurate algorithm, but I think 90% or above is a must.
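For reference, here is a minimal sketch of the Levenshtein comparison mentioned above. The function names `levenshtein` and `similarity`, and the choice to normalise by the longer string's length, are illustrative assumptions and not taken from the original question.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to [0, 1]; 1.0 means identical strings.

    Assumption: dividing by the longer length is one common normalisation.
    """
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(similarity("/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
                     "/phpMyAdmin-2.5.6-rc2/scripts/setup.php"))
    print(similarity("/robots.txt", "/bot.html"))
```

A raw pairwise score like this still needs a cut-off and a grouping strategy before it produces groups, which may be part of why it feels inaccurate when used on its own.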

I think I need some sort of classifier, but the problem is that each new batch of data can contain paths that should be assigned to a new, previously unknown class.

Could you please point me in the right direction?

Thanks.

Recommended answer

While checking @jakub.gieryluk's suggestion I accidentally found a solution that satisfies me: the "Hobohm clustering algorithm, originally devised to reduce redundancy of biological sequence data sets."

Tests of the Perl library implemented by Bruno Vecchi gave me really good results. The only problem is that I need a Python implementation, but I believe I can either find one on the Internet or reimplement the code myself.
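As a starting point for a Python version, here is a minimal sketch of greedy, representative-based clustering in the spirit of Hobohm's approach. It is not a port of the Perl library. The use of `difflib.SequenceMatcher.ratio()` as the similarity measure, the longest-first ordering, the `greedy_cluster` name, and the `0.75` threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: difflib's ratio)."""
    return SequenceMatcher(None, a, b).ratio()


def greedy_cluster(paths, threshold=0.75):
    """Greedy clustering around representatives.

    Each path joins the first cluster whose representative is at least
    `threshold` similar to it; otherwise it starts a new cluster and
    becomes that cluster's representative.
    """
    clusters = []  # list of (representative, members) pairs
    # Longest first, so each representative is the longest member seen.
    for path in sorted(paths, key=len, reverse=True):
        for representative, members in clusters:
            if similarity(path, representative) >= threshold:
                members.append(path)
                break
        else:
            # No representative was similar enough: open a new cluster.
            clusters.append((path, [path]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample = [
        "/robots.txt",
        "/bot.html",
        "/phpMyAdmin-2.5.6-rc1/scripts/setup.php",
        "/phpMyAdmin-2.5.7/scripts/setup.php",
        "/phpMyAdmin-2.6.0-alpha2/scripts/setup.php",
        "//phpMyAdmin/",
    ]
    for group in greedy_cluster(sample):
        print(group)
```

Because a path that matches no representative simply opens a new cluster, this style of clustering does not need the set of classes to be known in advance. The threshold would need tuning against real log data.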

The next thing is that I have not yet checked the active learning ability of this algorithm ;)

Published: 2023-11-30 02:16:43
Link: https://www.elefans.com/category/jswz/34/1648391.html