How to scrape Douban with Python
I haven't been learning Python for long, mainly to use it for crawling. For writing APIs I'd suggest Node.js, and full-site page rendering is a job for PHP, but for crawlers Python is still the tool of choice:
No Python crawling framework is used here; the built-in urllib module fetches the pages directly, then the content is parsed and the images are downloaded:
Fetch the Douban image API directly, parse it, and download the images:
# -*- coding: utf-8 -*-
import json
import urllib
import re

def getHtml(url):
    # Fetch a URL and return the raw response body
    response = urllib.urlopen(url)
    return response.read()

def downloadPic(url, start):
    source = getHtml(url)
    s = json.loads(source)
    imgArr = s['subjects']
    index = 0
    for i in imgArr:
        # print i['title'], i['url']
        ext = re.findall(r'.*\.(\w+)$', i['cover'])
        if len(ext) > 0:
            ext = ext[0]
        else:
            ext = 'jpg'
        path = './img/douban_%s_%s.%s' % (start, index, ext)
        print path
        # Write in binary mode so the image bytes are stored unchanged
        f = open(path, 'wb')
        f.write(getHtml(i['cover']))
        f.close()
        index = index + 1

def downMore(num=0):
    for i in range(num):
        p = i * 20
        url = '=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=' + str(p)
        print 'url is %s' % url
        downloadPic(url, i)

downMore(13)
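The extension-guessing step above appears in both scripts, so it can be pulled into a small helper. A minimal sketch; the name `guess_ext` is my own, not from the original script:

```python
import re

def guess_ext(url, default='jpg'):
    # Take whatever follows the last dot as the extension,
    # falling back to a default when the URL has none
    found = re.findall(r'.*\.(\w+)$', url)
    return found[0] if found else default
```

For a cover URL ending in `.webp` this returns `'webp'`; for a URL with no usable extension it falls back to `'jpg'`.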
Fetch the Douban movie page source directly, parse it, and download the images:
# -*- coding: utf-8 -*-
import urllib
import re

def getHtml(url):
    # Fetch a URL and return the raw response body
    response = urllib.urlopen(url)
    return response.read()

def trimempty(n):
    # Drop matches too short to be a real image URL
    return len(n) > 12

url = ''
data = getHtml(url)
# print data
# Capture the src attribute of every <img> tag in the page source
imgs = re.findall(r'<img src="([^"]*)"', data)
if len(imgs) > 50:
    imgs = imgs[0:50]
imgs = filter(trimempty, imgs)
print imgs
index = 0
for i in imgs:
    extArr = re.findall(r'.*\.(\w+)$', i)
    if len(extArr) > 0:
        ext = extArr[0]
    else:
        ext = 'jpg'
    path = './img/%s.%s' % (index, ext)
    print path
    # Binary mode again, so the downloaded bytes are not mangled
    f = open(path, 'wb')
    f.write(getHtml(i))
    f.close()
    index = index + 1
Before running the code, create an img folder in the current directory first; then run it and you're done!
One small issue: images downloaded on Windows displayed abnormally, while on Linux they downloaded perfectly. The culprit is opening the output file in text mode ('w'), which on Windows rewrites newline bytes inside the image data; opening it in binary mode ('wb') fixes the corruption on both platforms, so there is no need to avoid running Python on Windows.
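The round-trip below illustrates why binary mode matters. The 8-byte PNG file signature deliberately contains both \r and \n, precisely so that newline translation by text-mode I/O is detected early; writing and reading with 'wb'/'rb' keeps the bytes intact. A sketch; the temp-file path is arbitrary:

```python
import os
import tempfile

# The PNG signature mixes \r and \n to expose text-mode corruption
data = b'\x89PNG\r\n\x1a\n'

path = os.path.join(tempfile.gettempdir(), 'sig.bin')
with open(path, 'wb') as f:   # binary write: bytes go out unchanged
    f.write(data)
with open(path, 'rb') as f:   # binary read: bytes come back unchanged
    assert f.read() == data
```

With 'w' instead of 'wb', a Windows run would store \r\r\n where the signature has \r\n, and image viewers would reject the file.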