I'm trying to understand BeautifulSoup and want to find all the links within facebook and iterate over each and every link within it...
Here is my code... it works fine, but once it finds LinkedIn and iterates over it, it gets stuck at a point after this URL - www.linkedin/redir/redirect?url=http%3A%2F%2Fbusiness%2Elinkedin%2Ecom%2Ftalent-solutions%3Fsrc%3Dli-footer&urlhash=f9Nj
When I run LinkedIn separately, I don't have any problem...
Could this be a limitation within my operating system? I'm using Ubuntu Linux...
import urllib2
import BeautifulSoup
import re

def main_process(response):
    print "Main process started"
    soup = BeautifulSoup.BeautifulSoup(response)
    limit = '5'
    count = 0
    main_link = valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", "www.facebook")
    if main_link:
        main_link = main_link.group(1)
    print 'main_link = ', main_link
    result = {}
    result[main_link] = {'incoming': [], 'outgoing': []}
    print 'result = ', result
    for link in soup.findAll('a', href=True):
        if count < 10:
            valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", link.get('href'))
            if valid_link:
                #print 'Main link = ', link.get('href')
                print 'Links object = ', valid_link.group(1)
                connecting_link = valid_link.group(1)
                connecting_link = connecting_link.encode('ascii')
                if main_link <> connecting_link:
                    print 'outgoing link = ', connecting_link
                    result = add_new_link(connecting_link, result)
                    # Check if the outgoing is already added, if it is then don't add it
                    populate_result(result, main_link, connecting_link)
                    print 'result = ', result
                    print 'connecting'
                    request = urllib2.Request(connecting_link)
                    response = urllib2.urlopen(request)
                    soup = BeautifulSoup.BeautifulSoup(response)
                    for sublink in soup.findAll('a', href=True):
                        print 'sublink = ', sublink.get('href')
                        valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", sublink.get('href'))
                        if valid_link:
                            print 'valid_link = ', valid_link.group(1)
                            valid_link = valid_link.group(1)
                            if valid_link <> connecting_link:
                                populate_result(result, connecting_link, valid_link)
        count += 1
    print 'final result = ', result
    # print 'found a url with national-park in the link'

def add_new_link(connecting_link, result):
    result[connecting_link] = {'incoming': [], 'outgoing': []}
    return result

def populate_result(result, link, dest_link):
    if len(result[link]['outgoing']) == 0:
        result[link]['outgoing'].append(dest_link)
    else:
        found_in_list = 'Y'
        try:
            result[link]['outgoing'].index(dest_link)
            found_in_list = 'Y'
        except ValueError:
            found_in_list = 'N'
        if found_in_list == 'N':
            result[link]['outgoing'].append(dest_link)
    return result

if __name__ == "__main__":
    request = urllib2.Request("facebook")
    print 'process start'
    try:
        response = urllib2.urlopen(request)
        main_process(response)
    except urllib2.URLError, e:
        print "URLERROR"
    print "program ended"

Answer:
The problem is re.search() hanging on certain URLs at this line:
valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", sublink.get('href'))

For example, it hangs on the www.facebook/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto URL:

>>> import re
>>> s = "www.facebook/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto"
>>> re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", s)
hanging "forever"...
It looks like the pattern introduces a catastrophic backtracking case that causes the regex search to hang.
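To see the effect in isolation, here is a minimal sketch of catastrophic backtracking using the classic (a+)+ pattern (not the exact regex from the question): each extra character roughly doubles the number of ways the engine can split the input before the match finally fails.

import re
import time

# Classic catastrophic-backtracking demo: the nested quantifier in (a+)+
# lets the engine try exponentially many splits of the run of 'a's before
# giving up on the trailing '!' that can never satisfy '$'.
for n in range(16, 25, 2):
    s = 'a' * n + '!'
    start = time.time()
    re.search(r"^(a+)+$", s)    # always fails, but slower and slower
    print n, 'chars ->', round(time.time() - start, 3), 'seconds'

The same nesting appears in (?:\w+.)+ above: the unescaped . can also match a word character, so the engine has exponentially many ways to carve up the URL when the overall match fails.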
One solution would be to use a different regex for validating the URL; there are plenty of options here (a regex-free alternative is also sketched below the link):
- How do you validate a URL with a regular expression in Python?
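For instance, here is a minimal sketch of a regex-free check using the Python 2 stdlib's urlparse (my own suggestion, not a fix prescribed in the answer above; is_valid_link is a hypothetical helper name): parsing does a linear scan with no backtracking, so it cannot hang on pathological input.

import urlparse

def is_valid_link(href):
    # Hypothetical helper: accept only absolute http(s) URLs.
    # urlparse never backtracks, so there is no catastrophic case.
    parts = urlparse.urlparse(href)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

print is_valid_link('http://business.linkedin.com/talent-solutions')   # True
print is_valid_link('/redir/redirect?url=...&urlhash=f9Nj')            # False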
Hope that helps.