I'm writing a small Python script to grab images via Google Images. I've got things to the point where I have the URLs of the images I want in a handy list. Now I just need to download them...
For each image URL I do this:
    print("Retrieving:{0}".format(sFinalImageURL))
    sExt = sFinalImageURL.split('.')[-1]
    #u = urllib.request.urlopen(sFinalImageURL)
    try:
        u = urllib.request.urlopen(sFinalImageURL)
    except:
        print("error: cannot retrieve image")
        continue
    raw_data = u.read()
    print("read {0} bytes".format(len(raw_data)))
    u.close()
    global sImagesFolder
    try:
        f = open("{0}/{1}_{2}.{3}".format(sImagesFolder,sImage,i,sExt),'wb')
        f.write(raw_data)
        f.close()
    except:
        print("couldn't write to {0}/{1}_{2}.{3}".format(sImagesFolder,sImage,i,sExt))
    print()

Here's the problem I'm having:
Trying to open some of the URLs gives me a 403, even though I can open the same URLs directly in my browser. So there's something in the HTTP request headers that the image server doesn't like... any ideas?
Here is some output:

    Retrieving:upload.wikimedia/wikipedia/commons/thumb/4/43/Timba%2B1.jpg/220px-Timba%2B1.jpg
    error: cannot retrieve image
    Retrieving:upload.wikimedia/wikipedia/commons/thumb/2/26/YellowLabradorLooking_new.jpg/260px-YellowLabradorLooking_new.jpg
    error: cannot retrieve image
    Retrieving:1.bp.blogspot/-7SsJ1n3RdoA/Tf07NOgD5nI/AAAAAAAAABo/tl8qLLIU01Y/s1600/english-shepherd-dog-0003.jpg
    read 11123 bytes
    Retrieving:completedogfood/wp-content/uploads/2010/07/complete-dog-food.bmp
    read 419630 bytes

Recommended answer
It seems Wikipedia only serves requests that look like they come from a real browser. The problem can be solved by specifying the User-Agent string of a real browser, because Python's urllib sends something like Python-urllib/3.2 by default.
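One way to check what urllib identifies itself as is to look at the headers it attaches by default. This is only a minimal sketch using the standard library; the variable names are illustrative and the version suffix printed depends on the Python release in use:

```python
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# headers that are added to every request made through that opener.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# Prints the default identification string, which starts with "Python-urllib/".
print(default_headers["User-agent"])
```

A server that filters on this header can recognise the script as a non-browser client, which would be consistent with the 403 responses in the question.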
Here's an example that works (with the User-Agent string of the browser that I use):
    import urllib.request

    url = 'upload.wikimedia/wikipedia/commons/thumb/4/43/Timba%2B1.jpg/220px-Timba%2B1.jpg'
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19'
    # Pass the browser-like User-Agent as a request header instead of urllib's default.
    u = urllib.request.urlopen(urllib.request.Request(url, headers={'User-Agent': user_agent}))
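The same idea could be folded into the download loop from the question. The following is a rough sketch, not the asker's actual code: build_request and fetch_image are hypothetical helper names, and USER_AGENT simply reuses the string from the example above. It also catches urllib.error.HTTPError explicitly, so the status code (such as 403) is reported instead of being hidden by a bare except.

```python
import urllib.error
import urllib.request

# Assumption: any browser-like string may work; this is the one from the answer.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) "
    "Ubuntu/12.04 Chromium/18.0.1025.168 Chrome/18.0.1025.168 Safari/535.19"
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that sends the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_image(url, user_agent=USER_AGENT):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent)) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        print("error: HTTP {0} for {1}".format(err.code, url))
    except urllib.error.URLError as err:
        # The server could not be reached at all.
        print("error: cannot reach {0}: {1}".format(url, err.reason))
    return None
```

Inside the loop, raw_data = fetch_image(sFinalImageURL) followed by a check for None would take the place of the first try/except. Note that urllib.request.Request needs a full URL including the scheme (for example http://), otherwise it raises ValueError.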