Python Web Scraping Self-Study Guide: Writing Scraped Data to a MySQL Database

Updated: 2024-10-26 12:31:17

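Recap
The previous section showed how to write scraped information to a JSON file. This section shows how to write it to a MySQL database instead. Only pipelines.py has to change: whenever you want to change where the output goes, the pipeline file is the one to edit.
Open the pipeline file. This is what we wrote in the previous section: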

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: .html
    import json


    class DemoPipeline(object):
        # Earlier version that only printed each item:
        # def process_item(self, item, spider):
        #     print("Blogs's name:" + item['name'])
        #     print("The number of blogs' redding:" + item['red_number'])
        #     print("The date of blogs' publish:", item['publish_date'])

        def __init__(self):
            # Open the output file in binary mode; lines are encoded by hand below.
            self.json_file = open("./demo.json", "wb+")

        def close_spider(self, spider):
            print('-------------------- closing the file --------------------')
            self.json_file.close()

        def process_item(self, item, spider):
            # One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
            text = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.json_file.write(text.encode("utf-8"))
            # Return the item so any later pipeline still receives it.
            return item

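What about this time? We load the data into a MySQL database, which makes it easier to store and to mine later. If you do not yet know how to connect Python to MySQL, see this blog post: Connecting Python to MySQL.
This time the class initializer (the constructor) connects to the database, checks whether the table exists, and creates it if it does not.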
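Why check whether the table exists? Because without it there is nowhere to put the data: every INSERT would fail with a "table doesn't exist" error (MySQL error 1146). The pipeline therefore has to check for the table, and create it when it is missing, before it writes any items.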
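An alternative to checking by hand is to let the database do the check: `CREATE TABLE IF NOT EXISTS` is valid in MySQL and creates the table only when it is missing. The following is a minimal sketch of that idea, not the pipeline used in this post. It assumes Python's built-in sqlite3 module as a stand-in so that it runs without a MySQL server, and the helper name `ensure_blogs_table` is made up for this example.

```python
import sqlite3

# SQLite syntax is used here. For MySQL the id column would be written
# "id INTEGER PRIMARY KEY AUTO_INCREMENT" instead.
CREATE_BLOGS = (
    "CREATE TABLE IF NOT EXISTS blogs("
    "id INTEGER PRIMARY KEY AUTOINCREMENT,"
    "name VARCHAR(200),"
    "red_number INTEGER,"
    "publish_date VARCHAR(200))"
)


def ensure_blogs_table(con):
    """Create the blogs table only if it is missing (hypothetical helper)."""
    con.execute(CREATE_BLOGS)


con = sqlite3.connect(":memory:")
ensure_blogs_table(con)
ensure_blogs_table(con)  # the second call does nothing and raises no error

# List the tables named "blogs" to confirm that exactly one exists.
tables = [
    row[0]
    for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name='blogs'"
    )
]
print(tables)
```

With this statement the pipeline no longer needs a separate lookup before creating the table, at the cost of sending one extra statement each time the spider starts.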
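Recall how Scrapy works: the spider creates requests; the Scrapy engine hands each request to the downloader; the downloader sends it out to the internet and receives the response; the engine passes that response back to the spider, which parses it and yields items; the engine then sends each item to the pipeline. The pipeline processes the item and can store it anywhere, such as the file system or a database.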
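The last part of that flow, items reaching the pipeline, can be imitated in a few lines of plain Python. This is only a toy model under stated assumptions: it does not import Scrapy, the `run` function stands in for the engine, `ListPipeline` and `toy_spider` are hypothetical names, and the sample items are invented. What it does share with Scrapy is the hook names `process_item(item, spider)` and `close_spider(spider)`.

```python
class ListPipeline:
    """Stand-in pipeline that uses the same hook names Scrapy calls."""

    def __init__(self):
        self.seen = []
        self.closed = False

    def process_item(self, item, spider):
        # A real pipeline would write to a file or database here.
        self.seen.append(dict(item))
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once, after the spider has finished.
        self.closed = True


def toy_spider():
    """Yield invented sample items, the way a spider's parse method would."""
    yield {"name": "post-1", "red_number": "10", "publish_date": "2020-01-01"}
    yield {"name": "post-2", "red_number": "25", "publish_date": "2020-01-02"}


def run(items, pipeline, spider=None):
    """Toy 'engine': feed every item to the pipeline, then close it."""
    for item in items:
        pipeline.process_item(item, spider)
    pipeline.close_spider(spider)


pipeline = ListPipeline()
run(toy_spider(), pipeline)
print(len(pipeline.seen), pipeline.closed)
```

The point of the model is the calling convention: `process_item` runs once per item, and `close_spider` runs once at the end, which is why both take the spider as an argument.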
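I recommend MySQL Workbench, the graphical tool that ships with MySQL, because it makes these steps easier to see. Open a new connection and create a database named demo.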

The database contains no tables yet. We create the table and insert the data from code. Here is the modified pipelines.py:

    # -*- coding: utf-8 -*-

    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: .html
    import mysql.connector


    def table_exists(cursor, table_name):
        """Return True if table_name exists in the demo database."""
        cursor.execute("show tables from demo")
        # fetchall() returns one tuple per table, for example ('blogs',).
        for row in cursor.fetchall():
            if table_name in row:
                return True
        return False


    class DemoPipeline(object):
        def __init__(self):
            # Connect to the database.
            self.con = mysql.connector.connect(
                host='localhost', port=3306, user='root', password='11131432',
                database='demo', use_unicode=True)
            # Commit every statement automatically. Without this line you must
            # call self.con.commit() yourself in process_item(self, item, spider).
            self.con.autocommit = True
            # Create a cursor.
            self.cu = self.con.cursor()
            # If the blogs table is already in the demo database, do nothing;
            # otherwise create it.
            # Note: MySQL maps the LONG type to MEDIUMTEXT, not to an integer type.
            if not table_exists(self.cu, 'blogs'):
                self.cu.execute(
                    "create table blogs(id integer primary key auto_increment,"
                    "name varchar(200),red_number long,publish_date varchar(200))")

        def close_spider(self, spider):
            # Scrapy passes the spider to this hook, so the parameter is required.
            print('-------------------- closing the connection --------------------')
            self.cu.close()
            self.con.close()

        def process_item(self, item, spider):
            self.cu.execute(
                "INSERT INTO `demo`.`blogs`(`name`,`red_number`,`publish_date`) "
                "VALUES(%s,%s,%s)",
                (item['name'], item['red_number'], item['publish_date']))
            # self.con.commit()  # manual commit; required only when autocommit is off
            # Return the item so any later pipeline still receives it.
            return item
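As the header comment in the file says, the pipeline only runs if it is listed in the `ITEM_PIPELINES` setting. A minimal sketch of that entry in settings.py follows. The package name `demo` and the module name `pipelines` are assumptions based on Scrapy's default project layout and on the class name used in this post, so adjust them to your own project.

```python
# settings.py (sketch)

# Map each pipeline's import path to an order value. Scrapy runs pipelines
# from the lowest value to the highest; values are conventionally 0 to 1000.
ITEM_PIPELINES = {
    # "demo" is an assumed project package name; replace it with yours.
    "demo.pipelines.DemoPipeline": 300,
}
```

With a single pipeline the exact number does not matter. It starts to matter once several pipelines are chained, for example one that cleans items before another stores them.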

Run the spider, then open Workbench to look at the imported rows.
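If you would rather check from Python than from Workbench, a small read-back helper is enough. The sketch below is an assumption, not part of the original pipeline: `fetch_blogs` is a hypothetical name, and it accepts any open DB-API connection, such as the one `mysql.connector.connect(...)` returns.

```python
def fetch_blogs(con, limit=5):
    """Return up to `limit` rows from the blogs table (hypothetical helper).

    `con` is any open DB-API connection, for example the one created
    with mysql.connector.connect(...) in the pipeline.
    """
    cur = con.cursor()
    try:
        # int() guarantees that the LIMIT value is a number.
        cur.execute(
            "SELECT name, red_number, publish_date FROM blogs "
            "ORDER BY id LIMIT {:d}".format(int(limit)))
        return cur.fetchall()
    finally:
        cur.close()


# Usage with the connection settings from this post
# (requires a running MySQL server and the demo database):
#
# import mysql.connector
# con = mysql.connector.connect(host='localhost', port=3306, user='root',
#                               password='11131432', database='demo')
# for row in fetch_blogs(con):
#     print(row)
# con.close()
```

Each returned row is a tuple in the column order of the SELECT: name, red_number, publish_date.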

It worked, hahaha!!!

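## Summary
Getting scraped data into a database is not hard. What matters most is knowing how to connect to the database and understanding how Scrapy runs. The next section covers how to deal with anti-scraping measures and get past a site's defenses to reach its data. To find out what happens next, stay tuned for the next installment.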

Published: 2023-07-27 22:08:42

Article link: https://www.elefans.com/category/jswz/34/1225580.html