I'm scraping an XML sitemap which contains special characters like é, which results in
ERROR: Spider error processing <GET [URL with '%C3%A9' instead of 'é']>

How do I get Scrapy to keep the original URL as is, i.e. with the special character in it?
Scrapy==1.3.3
Python==3.5.2 (I need to stick to these versions)
Update: As per stackoverflow/a/17082272/6170115, I was able to get the URL with the correct character using unquote:
Example usage:
>>> from urllib.parse import unquote
>>> unquote('ros%C3%A9')
'rosé'

I also tried my own Request subclass without safe_url_string, but I end up with:
UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)

Full traceback:

[scrapy.core.scraper] ERROR: Error downloading <GET [URL with characters like ù]>
Traceback (most recent call last):
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 61, in download_request
    return agent.download_request(request)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 260, in download_request
    agent = self._get_agent(request, timeout)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 241, in _get_agent
    scheme = _parse(request.url)[0]
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 37, in _parse
    return _parsed_url_args(parsed)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 19, in _parsed_url_args
    path = b(path)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 17, in <lambda>
    b = lambda s: to_bytes(s, encoding='ascii')
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/python.py", line 120, in to_bytes
    return text.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)

Any tips?
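For context, the traceback above is not really Scrapy-specific: the downloader encodes the URL path as ASCII bytes, so any raw non-ASCII character in the path fails, which is exactly why percent-encoding exists. A minimal reproduction (the path string is a made-up example):

```python
from urllib.parse import quote

# Hypothetical path containing raw accented characters.
path = "/menu/crème-brûlée"

# Encoding it as ASCII fails, just like in the Scrapy traceback.
try:
    path.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character ...

# Percent-encoding first (which is what safe_url_string does) keeps
# the path pure ASCII, so the downloader can handle it.
print(quote(path))  # /menu/cr%C3%A8me-br%C3%BBl%C3%A9e
```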
Solution

I don't think you can do that, as Scrapy runs URLs through safe_url_string from the w3lib library before storing the Request URL. You would somehow have to reverse that.
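In practice, reversing it at the point of use is usually enough: leave the escaped URL alone so the downloader keeps working, and apply unquote wherever you need the human-readable form (e.g. when storing scraped items). A sketch, using a stand-in object instead of a real scrapy.http.Response so the round trip is visible:

```python
from urllib.parse import unquote


def parse(response):
    # In a real spider this would be a callback method; response.url
    # is the percent-escaped form that Scrapy stores internally.
    return {"url": unquote(response.url)}


class FakeResponse:
    # Stand-in for scrapy.http.Response, just enough for the demo.
    url = "http://example.com/ros%C3%A9"


print(parse(FakeResponse()))  # {'url': 'http://example.com/rosé'}
```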