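Web Scraping Review
Crawler types: general-purpose crawlers, focused crawlers, and incremental crawlers.
One thing to watch out for when capturing traffic with Fiddler: it requires installing a certificate, so when the project requests an HTTPS page, SSL asks for a trusted certificate and the request may be rejected.
You can either disable certificate verification when sending the requests call, or temporarily turn off the Fiddler proxy. This pitfall comes up again at the end of the post.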
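As a rough sketch of the second option, the snippet below bypasses the proxy from code instead of closing Fiddler. It assumes requests is picking Fiddler up through system or environment proxy settings; `make_direct_session` is a hypothetical helper name, not something from the original post.

```python
import requests


def make_direct_session():
    """Hypothetical helper: build a session that ignores environment proxy settings."""
    session = requests.Session()
    # With trust_env disabled, requests no longer reads proxy settings
    # (for example HTTP_PROXY / HTTPS_PROXY) from the environment.
    session.trust_env = False
    return session


session = make_direct_session()
# Usage (url and headers are whatever the script already defines):
# page_text = session.get(url, headers=headers).text
```

Note that `trust_env = False` also stops requests from reading other environment-based settings, so this is best kept to debugging sessions.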
Parsing data out of HTML tags with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = '/'
ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url=url, headers=ua).text  # HTML of the chapter list page
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.listmain > dl > dd > a')  # the <a> tags that hold every chapter URL
with open("秦吏.txt", 'w', encoding='utf-8') as f:
    for a in a_list:
        title = a.string  # text of the <a> tag, used as the chapter title
        detail_url = '' + a['href']  # build the chapter detail URL
        detail_page = requests.get(url=detail_url, headers=ua).text
        dsp = BeautifulSoup(detail_page, 'lxml')  # chapter detail page
        content = dsp.find('div', id='content').text  # chapter body
        f.write(title + '\n' + content)  # persist the data
        print(title + ": download complete")
print('The end')  # no explicit f.close() needed, the with block closes the file
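To see what `select` and `find` return without hitting a live site, here is a minimal offline sketch. The HTML string and its paths are made up for illustration, and it uses the built-in `html.parser` instead of `lxml` so that it only depends on bs4.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure the selectors above expect.
html = """
<div class="listmain">
  <dl>
    <dd><a href="/book/1.html">Chapter 1</a></dd>
    <dd><a href="/book/2.html">Chapter 2</a></dd>
  </dl>
</div>
<div id="content">Body text of a chapter.</div>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS child selectors: every <a> directly under dd, under dl, under .listmain.
chapters = [(str(a.string), a["href"]) for a in soup.select(".listmain > dl > dd > a")]

# find() returns the first matching tag; .text gives its text content.
content = soup.find("div", id="content").text

print(chapters)
print(content)
```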
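Using XPath:
div[@class="song"] matches div elements whose class is "song"
div[@class="song"]/li/a/@href extracts the URL (the href attribute) inside it
div[@class="song"]/li/a/text() extracts the text inside it
div[contains(@class,'ng')] matches div elements whose class attribute value contains "ng"
div[starts-with(@class,'ta')] matches div elements whose class attribute value starts with "ta"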
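The expressions above can be tried against a small inline document. This is only a sketch: the HTML is invented for the demo, and each expression is prefixed with `//` so that it searches the whole tree.

```python
from lxml import etree

# Invented HTML for the demo: one div with class "song", one with class "tang".
html = """
<div class="song">
  <li><a href="/a.html">First</a></li>
  <li><a href="/b.html">Second</a></li>
</div>
<div class="tang">poems</div>
"""

tree = etree.HTML(html)

# Attribute and text extraction return plain lists of strings.
hrefs = tree.xpath('//div[@class="song"]/li/a/@href')
texts = tree.xpath('//div[@class="song"]/li/a/text()')

# contains() and starts-with() test the attribute value.
contains_ng = [d.get("class") for d in tree.xpath("//div[contains(@class,'ng')]")]
starts_ta = [d.get("class") for d in tree.xpath("//div[starts-with(@class,'ta')]")]

print(hrefs, texts, contains_ng, starts_ta)
```

Both "song" and "tang" contain "ng", so the `contains()` query matches both divs, while `starts-with()` only matches "tang".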
A small XPath example:
import requests
from lxml import etree

url = "/?PGTID=0d100000-0000-335c-5dda-1cebcdf9ae5f&ClickID=2"
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url, headers=user_agent).text
tree = etree.HTML(page_text)  # the whole page, parsed into an element tree
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')  # listings, returned as a list
with open("58.csv", 'w', encoding='utf-8') as fp:
    for li in li_list:
        title = li.xpath("./div[@class='list-info']/h2/a/text()")[0]
        price = li.xpath("./div[@class='price']//text()")  # price text is split across nodes
        sum_price = ''.join(price)
        fp.write("home:" + title + "price:" + sum_price + '\n')
print("Data collection complete!")
Fixing garbled text (mojibake) from a website:
import os
import requests
from lxml import etree

url = '/'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url, headers=user_agent).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')


def getpic(title, photo):
    if not os.path.exists('./photo'):  # create the folder if it does not exist yet
        os.mkdir('./photo')
    with open('photo/' + title, 'wb') as fp:
        fp.write(photo)
    return "Current resource downloaded"


for li in li_list:
    title = li.xpath('./a/b/text()')[0] + ".jpg"
    title = title.encode('iso-8859-1').decode('gbk')  # garbled text: recover the raw bytes, then decode them as GBK
    print(title)
    p_url = li.xpath('./a/img/@src')[0]
    picture_url = '' + p_url
    photo = requests.get(url=picture_url, headers=user_agent).content
    ret = getpic(title, photo)
    print(ret)
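Why the `encode('iso-8859-1').decode('gbk')` round trip works can be shown offline. The sketch below assumes the page was served as GBK bytes but decoded as ISO-8859-1, which is the situation the fix above implies; the sample string is arbitrary.

```python
# An arbitrary sample title; any text that GBK can encode would do.
original = "风景图片"

# What the server sends: GBK-encoded bytes.
raw = original.encode("gbk")

# What you get when those bytes are wrongly decoded as ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# ISO-8859-1 maps every byte to exactly one character, so encoding back
# recovers the original bytes, which can then be decoded correctly.
repaired = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(repaired)
```

An alternative with the same effect is to set `response.encoding = 'gbk'` before reading `response.text`, so the text is decoded correctly in the first place.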
Batch-downloading résumé templates: handling garbled characters
import os
import random
import time
import requests
from lxml import etree

url = '.html'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
response = requests.get(url, headers=user_agent)  # résumé list page
response.encoding = 'utf-8'  # fix the garbled text by setting the encoding explicitly
page_text = response.text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="container"]/div')
if not os.path.exists('./jl'):
    os.mkdir('./jl')
for div in div_list:
    title = div.xpath('./a/img/@alt')[0]  # résumé name
    link = div.xpath('./a/@href')[0]  # résumé detail URL
    detail_page = requests.get(url=link, headers=user_agent).text  # résumé detail page
    dpage = etree.HTML(detail_page)
    down_list = dpage.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
    down_url = random.choice(down_list)  # pick one of the download links at random
    word = requests.get(url=down_url, headers=user_agent).content
    print("Starting download >> " + title)
    with open('./jl/' + title + '.zip', 'wb') as fp:  # the with block closes the file
        fp.write(word)
    time.sleep(1)  # pause between downloads
Testing a proxy IP: if data gets written to the file, the proxy IP is usable
import requests

url = '=ip'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
proxy = {"https": '112.85.170.79:9999'}
page = requests.get(url, headers=user_agent, proxies=proxy).text
with open('./ip.html', 'w', encoding='utf-8') as f:
    f.write(page)
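Checking the saved file by hand works, but the same test can be turned into a yes/no answer. The following is a sketch only: `proxy_works` is a hypothetical helper, and the timeout and the "any non-error status counts as working" rule are assumptions rather than part of the original script.

```python
import requests


def proxy_works(proxy, test_url, timeout=5):
    """Hypothetical helper: return True if test_url can be fetched through proxy.

    proxy is a string such as "http://112.85.170.79:9999".
    """
    proxies = {"http": proxy, "https": proxy}
    try:
        response = requests.get(test_url, proxies=proxies, timeout=timeout)
    except requests.RequestException:
        # Refused connections, timeouts and proxy errors all mean "not usable".
        return False
    return response.ok  # True for status codes below 400


# Usage (test_url is whatever IP-echo page you normally check against):
# print(proxy_works("http://112.85.170.79:9999", test_url))
```

Setting a timeout matters here, because dead proxies often hang instead of failing quickly.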
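Now for the daily headache: when the User-Agent header is specified, the server redirects to the HTTPS version of the URL, which produces an SSL verification error. To keep the redirect from failing verification, simply turn verification off:
page_text = requests.get(url=url, headers=user_agent, verify=False).text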
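With `verify=False`, every request also emits an `InsecureRequestWarning`. A minimal sketch of setting this up once per session and silencing the warning follows; `make_unverified_session` is a hypothetical helper name, and the short User-Agent string is a placeholder.

```python
import requests
import urllib3

# Silence the warning that urllib3 emits for unverified HTTPS requests.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def make_unverified_session(user_agent):
    """Hypothetical helper: a session that skips certificate verification."""
    session = requests.Session()
    session.verify = False  # applies to every request made through this session
    session.headers.update({"User-Agent": user_agent})
    return session


session = make_unverified_session("Mozilla/5.0")
# Usage:
# page_text = session.get(url).text
```

Skipping verification removes the protection HTTPS gives against tampering, so this is reasonable for local debugging behind Fiddler but not something to leave on in a long-running crawler.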
Reposted from: .html