Learning Python Web Scraping: Day 25

Updated: 2024-10-27 00:30:43

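First, the previous chapter had one more section, on sending email with Python. I could not get an SMTP (Simple Mail Transfer Protocol) client configured on my machine, so I was unable to do that exercise.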

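On to the next chapter: reading documents.
This chapter focuses on document handling, including downloading files into a folder and reading documents to extract data. It also covers the different document encodings, so that a program can read HTML pages that are not in English.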

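Different websites may use different encodings to turn their text into bytes, so decoding correctly matters a great deal: decoding with the wrong encoding can change the resulting string completely.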
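To make the point concrete, here is a minimal sketch that decodes the same bytes with two different codecs. The sample word and the choice of `latin-1` as the "wrong" codec are my own assumptions for illustration, not something taken from the book's exercise.

```python
# The same bytes produce different text depending on the codec used to decode them.
text = "Привет"                          # sample Russian word (assumed for illustration)
raw = text.encode("utf-8")               # the bytes a UTF-8 page would deliver

decoded_ok = raw.decode("utf-8")         # matching codec: round-trips to the original
decoded_wrong = raw.decode("latin-1")    # mismatched codec: no error, but wrong characters

print(decoded_ok)
print(repr(decoded_wrong))
print(decoded_ok == text, decoded_wrong == text)
```

Note that the mismatched decode does not raise an exception. It silently yields a different string, which is why this kind of bug is easy to miss.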

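Exercise 1: how default reading differs between an English and a non-English site
Running the code makes the difference obvious. The first snippet prints unreadable output: the source page is in Russian, but we print the response without decoding it, so only the ASCII-range characters come out legibly.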

# from urllib.request import urlopen
# html = urlopen(".txt")
# print(html.read())

After decoding the response as UTF-8, the correct content appears.

# from urllib.request import urlopen
# html = urlopen(".txt")
# print(str(html.read(), 'utf-8'))

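from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("(programming_language)")
# Name the parser explicitly so the result does not depend on what is installed
bsObj = BeautifulSoup(html, "html.parser")
content = bsObj.find('div', {'id': 'mw-content-text'}).get_text()
# Encode to UTF-8 bytes, then decode again, to make the encoding explicit
content = bytes(content, 'utf-8')
print(content.decode('utf-8'))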

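If you do a lot of web scraping, especially on international sites, it is a good idea to check the page's meta tags first and read the content with the encoding the site itself declares.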
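One way to follow that advice is sketched below using only the standard library. `MetaCharsetParser` and `detect_charset` are hypothetical helper names I made up, the sample page is built inline rather than fetched, and the fallback to UTF-8 is an assumption rather than a rule.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Records the first charset declared in a page's <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        # HTML5 form: <meta charset="utf-8">
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
            return
        # Older form: <meta http-equiv="Content-Type" content="text/html; charset=...">
        content = (attrs.get("content") or "").lower()
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if http_equiv == "content-type" and "charset=" in content:
            self.charset = content.split("charset=", 1)[1].split(";")[0].strip()


def detect_charset(raw_html, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is declared."""
    parser = MetaCharsetParser()
    # The declaration itself is plain ASCII, so a lossy ASCII decode is enough to find it.
    parser.feed(raw_html.decode("ascii", "ignore"))
    return parser.charset or default


# Inline sample page (assumed for illustration), encoded the way it declares itself.
sample = '<html><head><meta charset="ISO-8859-1"></head><body>café</body></html>'.encode("iso-8859-1")
encoding = detect_charset(sample)
print(encoding)
print(sample.decode(encoding))
```

In a real scraper the bytes would come from `urlopen(...).read()`, and the HTTP `Content-Type` header may also name a charset, so this sketch covers only the meta tag case.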

Exercise 3: fetch a CSV file from the web and print each row to the command line

# from urllib.request import urlopen
# from io import StringIO
# import csv
# data = urlopen(".csv").read().decode("ascii", "ignore")
# dataFile = StringIO(data)
# csvFile = csv.reader(dataFile)
# for row in csvFile:
#     print(row)

The output looks like this:

['Name', 'Year']
["Monty Python's Flying Circus", '1970']
['Another Monty Python Record', '1971']
["Monty Python's Previous Record", '1972']
['The Monty Python Matching Tie and Handkerchief', '1973']
['Monty Python Live at Drury Lane', '1974']
['An Album of the Soundtrack of the Trailer of the Film of Monty Python and the Holy Grail', '1975']
['Monty Python Live at City Center', '1977']
['The Monty Python Instant Record Collection', '1977']
["Monty Python's Life of Brian", '1979']
["Monty Python's Cotractual Obligation Album", '1980']
["Monty Python's The Meaning of Life", '1983']
['The Final Rip Off', '1987']
['Monty Python Sings', '1989']
['The Ultimate Monty Python Rip Off', '1994']
['Monty Python Sings Again', '2014']

As the output shows, the reader object returned by csv.reader is iterable, and each item it yields is a Python list.
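The same behaviour can be checked without a network connection. The sketch below feeds `csv.reader` an in-memory string; the two data rows are borrowed from the output above, and the variable names are my own.

```python
import csv
from io import StringIO

# A small in-memory CSV in the same shape as the output above.
data = "Name,Year\nMonty Python Sings,1989\nThe Final Rip Off,1987\n"

csv_reader = csv.reader(StringIO(data))
header = next(csv_reader)     # the first row holds the column names
rows = list(csv_reader)       # every remaining row is a list of strings

print(header)
for name, year in rows:
    # Values arrive as str, so numeric columns need an explicit conversion.
    print(name, int(year))
```

Pulling the header off with `next()` before looping is a common way to keep column names out of the data rows.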

There is also another kind of object, the DictReader: csv.DictReader converts each row of the CSV file into a Python dictionary keyed by the column names.

from urllib.request import urlopen
from io import StringIO
import csv

data = urlopen(".csv").read().decode("ascii", "ignore")
dataFile = StringIO(data)
dictReader = csv.DictReader(dataFile)
print(dictReader.fieldnames)
for row in dictReader:
    print(row)
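What dictionaries buy you is access by column name instead of position. Here is a small offline sketch of that, assuming the same two-column layout; the cutoff year and the variable names are arbitrary choices of mine.

```python
import csv
from io import StringIO

# In-memory sample with the same columns as the album list.
data = "Name,Year\nMonty Python Sings,1989\nMonty Python Sings Again,2014\n"

dict_reader = csv.DictReader(StringIO(data))
# Keep only albums released after 2000, addressing columns by header name.
recent = [row["Name"] for row in dict_reader if int(row["Year"]) > 2000]

print(dict_reader.fieldnames)
print(recent)
```

Because the rows are keyed by header, this code keeps working if the column order in the file changes.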

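With text files covered, next comes reading PDF documents. The Python 3.x standard library has no PDF support, so a third-party library is needed. I use PDFMiner, which I installed from its Python source.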

Exercise 4: read an arbitrary PDF into a string, then use StringIO to turn it into a file object

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def readPDF(PDFfile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, PDFfile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen(".pdf")
outPutString = readPDF(pdfFile)
print(outPutString)
pdfFile.close()

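This example uses quite a few objects from pdfminer. Getting a solid grasp of pdfminer would mean reading its documentation, but given the time I am moving on for now.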
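The StringIO step named in the Exercise 4 title can be tried on its own, without pdfminer. In this sketch `extracted_text` is a placeholder standing in for whatever string a PDF reader returns; it is not real PDF output.

```python
from io import StringIO

# Placeholder for the string a PDF reader would return (assumed for illustration).
extracted_text = "first line\nsecond line\nthird line\n"

# Wrap the string so it can be used anywhere a text file object is expected.
text_file = StringIO(extracted_text)

first_line = text_file.readline()   # reads up to and including the first newline
remaining = text_file.read()        # reads everything after the current position

text_file.seek(0)                   # rewind, exactly as with a real file
lines = [line.rstrip("\n") for line in text_file]
text_file.close()

print(repr(first_line))
print(lines)
```

This is the same trick used in the CSV exercise: once the text is wrapped in StringIO, anything that expects a file object can consume it.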

That's all for today. Checking in~

Published: 2024-03-07 04:18:59

Article link: https://www.elefans.com/category/jswz/34/1716827.html