asyncio web scraping 101: fetching multiple urls with aiohttp

Updated: 2024-10-27 00:25:00

Problem description

In an earlier question, one of the authors of aiohttp kindly suggested a way to fetch multiple urls with aiohttp using the new async with syntax from Python 3.5:

    import aiohttp
    import asyncio

    async def fetch(session, url):
        with aiohttp.Timeout(10):
            async with session.get(url) as response:
                return await response.text()

    async def fetch_all(session, urls, loop):
        results = await asyncio.wait([loop.create_task(fetch(session, url))
                                      for url in urls])
        return results

    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        # breaks because of the first url
        urls = ['SDFKHSKHGKLHSKLJHGSDFKSJH',
                'google',
                'twitter']
        with aiohttp.ClientSession(loop=loop) as session:
            the_results = loop.run_until_complete(
                fetch_all(session, urls, loop))
            # do something with the the_results

However, when one of the session.get(url) requests breaks (as above, because of SDFKHSKHGKLHSKLJHGSDFKSJH), the error is not handled and the whole thing breaks.

I looked for ways to insert tests about the result of session.get(url), for instance a place for a try ... except ..., or for an if response.status != 200: check, but I just do not understand how to work with async with, await and the various objects.

Since async with is still very new, there are not many examples. It would be very helpful to many people if an asyncio wizard could show how to do this. After all, one of the first things most people will want to test with asyncio is getting multiple resources concurrently.

Goal

The goal is that we can inspect the_results and quickly see either:

this url failed (and why: status code, maybe exception name), or
this url worked, and here is a useful response object

Recommended answer

I would use gather instead of wait. With return_exceptions=True, gather returns exceptions as objects in the result list instead of raising them. You can then check each result to see whether it is an instance of some exception.
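To see this behaviour in isolation, here is a minimal offline sketch. The fake_fetch coroutine and the example.com address are hypothetical stand-ins for a real HTTP request, used only so the example runs without a network or aiohttp; asyncio.run assumes Python 3.7 or later.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request, so the example runs offline.
    await asyncio.sleep(0)
    if not url.startswith("http"):
        raise ValueError("invalid url: " + url)
    return "body of " + url


async def fetch_all(urls):
    # return_exceptions=True places raised exceptions in the result list
    # instead of propagating the first one to the caller.
    return await asyncio.gather(
        *(fake_fetch(url) for url in urls),
        return_exceptions=True,
    )


urls = ["SDFKHSKHGKLHSKLJHGSDFKSJH", "http://example.com"]
results = asyncio.run(fetch_all(urls))

# gather keeps the results in the same order as the coroutines passed in.
for url, result in zip(urls, results):
    print(url, "ERR" if isinstance(result, Exception) else "OK")
```

Without return_exceptions=True, the ValueError from the first coroutine would propagate out of gather and the other result would not be returned to the caller.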

    import aiohttp
    import asyncio

    async def fetch(session, url):
        with aiohttp.Timeout(10):
            async with session.get(url) as response:
                return await response.text()

    async def fetch_all(session, urls, loop):
        results = await asyncio.gather(
            *[fetch(session, url) for url in urls],
            return_exceptions=True  # default is False; that would raise
        )

        # for testing purposes only
        # gather returns results in the order of coros
        for idx, url in enumerate(urls):
            print('{}: {}'.format(url, 'ERR' if isinstance(results[idx], Exception) else 'OK'))
        return results

    if __name__ == '__main__':
        loop = asyncio.get_event_loop()
        # breaks because of the first url
        urls = [
            'SDFKHSKHGKLHSKLJHGSDFKSJH',
            'google',
            'twitter']
        with aiohttp.ClientSession(loop=loop) as session:
            the_results = loop.run_until_complete(
                fetch_all(session, urls, loop))
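The code above targets the aiohttp API of its time. On aiohttp 3.x it will likely need adjusting, since timeouts are configured through aiohttp.ClientTimeout and ClientSession is used with async with. The following is a sketch of the same idea under those assumptions, not the answer author's code; the example.com address is a placeholder chosen for illustration.

```python
import asyncio

import aiohttp


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def fetch_all(urls):
    # Assumes aiohttp 3.x: ClientTimeout(total=10) limits each request
    # made through this session to 10 seconds overall.
    timeout = aiohttp.ClientTimeout(total=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        return await asyncio.gather(
            *(fetch(session, url) for url in urls),
            return_exceptions=True,
        )


if __name__ == "__main__":
    # Placeholder urls for illustration; the first one is deliberately invalid.
    urls = ["SDFKHSKHGKLHSKLJHGSDFKSJH", "https://example.com"]
    results = asyncio.run(fetch_all(urls))
    for url, result in zip(urls, results):
        print("{}: {}".format(url, "ERR" if isinstance(result, Exception) else "OK"))
```

The session is created inside the coroutine so that it is opened and closed on the running event loop, which also removes the need to pass loop around.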

Test:

    $ python test.py
    SDFKHSKHGKLHSKLJHGSDFKSJH: ERR
    google: OK
    twitter: OK
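To get closer to the goal stated in the question (seeing why each url failed), the list returned by gather can be turned into a small report. The sketch below is one possible shape, not part of the original answer: summarize and FakeResponseError are hypothetical names, and the sample data is made up so the example runs offline. It relies on the assumption that fetch calls response.raise_for_status(), in which case aiohttp raises ClientResponseError, which carries a status attribute.

```python
def summarize(urls, results):
    """Map each url to ('ok', body) or ('failed', reason)."""
    report = {}
    for url, result in zip(urls, results):
        if isinstance(result, Exception):
            reason = type(result).__name__
            # Exceptions such as aiohttp.ClientResponseError expose .status;
            # most other exceptions do not, hence the getattr default.
            status = getattr(result, "status", None)
            if status is not None:
                reason = "{} (status {})".format(reason, status)
            report[url] = ("failed", reason)
        else:
            report[url] = ("ok", result)
    return report


class FakeResponseError(Exception):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status):
        super().__init__(status)
        self.status = status


# Made-up sample data in the shape gather(..., return_exceptions=True) returns.
urls = ["bad-url", "missing-page", "good-page"]
results = [ValueError("invalid url"), FakeResponseError(404), "<html>...</html>"]

for url, outcome in summarize(urls, results).items():
    print(url, outcome)
```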


Published: 2023-11-23 09:36:31

Article link: https://www.elefans.com/category/jswz/34/1620962.html