BeautifulSoup findAll getting stuck without processing

Problem description

I'm trying to understand BeautifulSoup and want to find all the links within facebook and iterate over each and every link within it...

Here is my code... it works fine, but once it finds LinkedIn and iterates over it, it gets stuck at a point after this URL - www.linkedin.com/redir/redirect?url=http%3A%2F%2Fbusiness%2Elinkedin%2Ecom%2Ftalent-solutions%3Fsrc%3Dli-footer&urlhash=f9Nj

When I run Linkedin separately, I don't have any problem...

Could this be a limitation within my operating system? I'm using Ubuntu Linux...

import urllib2
import BeautifulSoup
import re

def main_process(response):
    print "Main process started"
    soup = BeautifulSoup.BeautifulSoup(response)
    limit = '5'
    count = 0
    main_link = valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", "https://www.facebook.com")
    if main_link:
        main_link = main_link.group(1)
    print 'main_link = ', main_link
    result = {}
    result[main_link] = {'incoming': [], 'outgoing': []}
    print 'result = ', result
    for link in soup.findAll('a', href=True):
        if count < 10:
            valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", link.get('href'))
            if valid_link:
                #print 'Main link = ', link.get('href')
                print 'Links object = ', valid_link.group(1)
                connecting_link = valid_link.group(1)
                connecting_link = connecting_link.encode('ascii')
                if main_link <> connecting_link:
                    print 'outgoing link = ', connecting_link
                    result = add_new_link(connecting_link, result)
                    #Check if the outgoing is already added, if its then don't add it
                    populate_result(result, main_link, connecting_link)
                    print 'result = ', result
                    print 'connecting'
                    request = urllib2.Request(connecting_link)
                    response = urllib2.urlopen(request)
                    soup = BeautifulSoup.BeautifulSoup(response)
                    for sublink in soup.findAll('a', href=True):
                        print 'sublink = ', sublink.get('href')
                        valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", sublink.get('href'))
                        if valid_link:
                            print 'valid_link = ', valid_link.group(1)
                            valid_link = valid_link.group(1)
                            if valid_link <> connecting_link:
                                populate_result(result, connecting_link, valid_link)
        count += 1
    print 'final result = ', result
    # print 'found a url with national-park in the link'

def add_new_link(connecting_link, result):
    result[connecting_link] = {'incoming': [], 'outgoing': []}
    return result

def populate_result(result, link, dest_link):
    if len(result[link]['outgoing']) == 0:
        result[link]['outgoing'].append(dest_link)
    else:
        found_in_list = 'Y'
        try:
            result[link]['outgoing'].index(dest_link)
            found_in_list = 'Y'
        except ValueError:
            found_in_list = 'N'
        if found_in_list == 'N':
            result[link]['outgoing'].append(dest_link)
    return result

if __name__ == "__main__":
    request = urllib2.Request("https://www.facebook.com")
    print 'process start'
    try:
        response = urllib2.urlopen(request)
        main_process(response)
    except urllib2.URLError, e:
        print "URLERROR"
    print "program ended"

Recommended answer

The problem is that re.search() hangs on certain URLs on this line:

valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", sublink.get('href'))

For example, it hangs on the https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto URL:

>>> import re
>>> s = "https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto"
>>> re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", s)
hanging "forever"...

It looks like this introduces a catastrophic backtracking case that causes the regex search to hang.
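
To make the failure mode concrete: the repeated group (?:\w+.)+ is ambiguous because the unescaped dot also matches word characters, so the engine can split the same stretch of the URL into chunks in exponentially many ways while it keeps retrying the trailing \.com. A minimal sketch of one possible replacement pattern (my own illustration, not taken from the answer or the link below): escaping the dot forces each chunk to end at a literal '.', which removes the ambiguity:

>>> import re
>>> s = "https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto"
>>> # Escaped dot: each (?:\w+\.) chunk must end at a literal '.', so there is
>>> # only one way to split the host name and the search returns immediately.
>>> re.search(r"^(https?://(?:\w+\.)+\w+)(?:/.*)?$", s).group(1)
'https://www.facebook.com'

The capture still yields the scheme-plus-host prefix that the question's code stores in connecting_link.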

One solution would be to use a different regex for validating the URL; see plenty of options here:

  • How do you validate a URL with a regular expression in Python?
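
Another option, going beyond regexes entirely (an assumption on my part, not something the answer spells out), is to validate links with the standard-library urlparse module, which cannot backtrack at all. A rough sketch in the same Python 2 style as the question's code; is_valid_link and base_link are hypothetical helper names:

# Rough sketch: validate and normalise links with urlparse instead of a regex.
# Python 2 style to match the question's urllib2/BeautifulSoup code; on Python 3
# the import would be: from urllib.parse import urlparse
from urlparse import urlparse

def is_valid_link(href):
    # Accept only absolute http(s) URLs that actually have a host part.
    parts = urlparse(href)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

def base_link(href):
    # Equivalent of the regex's group(1): scheme plus host, e.g. "https://www.facebook.com"
    parts = urlparse(href)
    return parts.scheme + '://' + parts.netloc

print is_valid_link("https://www.facebook.com/campaign/landing.php?placement=pflo")  # True
print is_valid_link("/relative/path")                                                # False
print base_link("https://www.facebook.com/campaign/landing.php?placement=pflo")      # https://www.facebook.com

Parsing never hangs, no matter how long the query string is, which sidesteps the backtracking problem altogether.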

Hope that helps.
