I'm trying to understand BeautifulSoup and want to find all the links within facebook and iterate over each and every link within it...
Here is my code... it works fine, but once it finds LinkedIn and iterates over it, it gets stuck at a point after this URL - www.linkedin/redir/redirect?url=http%3A%2F%2Fbusiness%2Elinkedin%2Ecom%2Ftalent-solutions%3Fsrc%3Dli-footer&urlhash=f9Nj
When I run LinkedIn separately, I don't have any problem...
Could this be a limitation within my operating system? I'm using Ubuntu Linux...
import urllib2
import BeautifulSoup
import re

def main_process(response):
    print "Main process started"
    soup = BeautifulSoup.BeautifulSoup(response)
    limit = '5'
    count = 0
    main_link = valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", "www.facebook")
    if main_link:
        main_link = main_link.group(1)
    print 'main_link = ', main_link
    result = {}
    result[main_link] = {'incoming': [], 'outgoing': []}
    print 'result = ', result
    for link in soup.findAll('a', href=True):
        if count < 10:
            valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", link.get('href'))
            if valid_link:
                #print 'Main link = ', link.get('href')
                print 'Links object = ', valid_link.group(1)
                connecting_link = valid_link.group(1)
                connecting_link = connecting_link.encode('ascii')
                if main_link <> connecting_link:
                    print 'outgoing link = ', connecting_link
                    result = add_new_link(connecting_link, result)
                    # Check if the outgoing is already added, if it is then don't add it
                    populate_result(result, main_link, connecting_link)
                    print 'result = ', result
                    print 'connecting'
                    request = urllib2.Request(connecting_link)
                    response = urllib2.urlopen(request)
                    soup = BeautifulSoup.BeautifulSoup(response)
                    for sublink in soup.findAll('a', href=True):
                        print 'sublink = ', sublink.get('href')
                        valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", sublink.get('href'))
                        if valid_link:
                            print 'valid_link = ', valid_link.group(1)
                            valid_link = valid_link.group(1)
                            if valid_link <> connecting_link:
                                populate_result(result, connecting_link, valid_link)
        count += 1
    print 'final result = ', result
    # print 'found a url with national-park in the link'

def add_new_link(connecting_link, result):
    result[connecting_link] = {'incoming': [], 'outgoing': []}
    return result

def populate_result(result, link, dest_link):
    if len(result[link]['outgoing']) == 0:
        result[link]['outgoing'].append(dest_link)
    else:
        found_in_list = 'Y'
        try:
            result[link]['outgoing'].index(dest_link)
            found_in_list = 'Y'
        except ValueError:
            found_in_list = 'N'
        if found_in_list == 'N':
            result[link]['outgoing'].append(dest_link)
    return result

if __name__ == "__main__":
    request = urllib2.Request("facebook")
    print 'process start'
    try:
        response = urllib2.urlopen(request)
        main_process(response)
    except urllib2.URLError, e:
        print "URLERROR"
    print "program ended"

Answer:
The problem is re.search() hanging on certain URLs at this line:
valid_link = re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", sublink.get('href'))

For example, it hangs on the www.facebook/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto URL:

>>> import re
>>> s = "www.facebook/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto"
>>> re.search("^(https?://(?:\w+.)+\w+)(?:/.*)?$", s)
hanging "forever"...
It looks like the pattern introduces a catastrophic backtracking case that causes the regex search to hang.
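To see the effect in isolation, here is a minimal sketch of catastrophic backtracking using the classic (a+)+ pattern (not the exact regex from the question): each extra character roughly doubles the number of ways the engine can split the input before the match finally fails.

import re
import time

# Classic catastrophic-backtracking demo: the nested quantifier in (a+)+
# lets the engine try exponentially many splits of the run of 'a's before
# giving up on the trailing '!' that can never satisfy '$'.
for n in range(16, 25, 2):
    s = 'a' * n + '!'
    start = time.time()
    re.search(r"^(a+)+$", s)    # always fails, but slower and slower
    print n, 'chars ->', round(time.time() - start, 3), 'seconds'

The same nesting appears in (?:\w+.)+ above: the unescaped . can also match a word character, so the engine has exponentially many ways to carve up the URL when the overall match fails.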
One solution would be to use a different regex for validating the URL; there are plenty of options here (a regex-free alternative is also sketched below the link):
- How do you validate a URL with a regular expression in Python?
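For instance, here is a minimal sketch of a regex-free check using the Python 2 stdlib's urlparse (my own suggestion, not a fix prescribed in the answer above; is_valid_link is a hypothetical helper name): parsing does a linear scan with no backtracking, so it cannot hang on pathological input.

import urlparse

def is_valid_link(href):
    # Hypothetical helper: accept only absolute http(s) URLs.
    # urlparse never backtracks, so there is no catastrophic case.
    parts = urlparse.urlparse(href)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)

print is_valid_link('http://business.linkedin.com/talent-solutions')   # True
print is_valid_link('/redir/redirect?url=...&urlhash=f9Nj')            # False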
Hope that helps.