Web scraping practice: scraping novel information from Qidian (起点中文网)

Updated: 2023-04-04 08:50:15

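Preface:

Scrape the basic information for every novel listed on Qidian (title, author, category, serialization status (ongoing or completed), synopsis, and word count) and store the scraped data in an Excel sheet.

This post tidies up the code, walks through the approach, and verifies that the code works (2019.12.15).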

Environment:
Python3 (Anaconda3)
PyCharm
Chrome browser

Main modules:
xlwt
lxml
requests
time

1.

The first page of Qidian's full novel listing, and the fields we want to scrape from it, are as follows

2.

Analyze the pages to request

http://a.qidian.com/?page=1  # page 1
http://a.qidian.com/?page=2  # page 2
http://a.qidian.com/?page=3  # page 3
...

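Looking at the site, the full listing exposes only five pages; anything beyond that cannot be accessed normally. So we build the URLs with a list comprehension, using range(1, 6) to cover pages 1 through 5.
PS: Oddly, although the site lists nearly a million novels, only five pages with eighty-odd novels in total could be scraped in the end. But that is a story for later.

urls = ['http://a.qidian.com/?page={}'.format(str(i)) for i in range(1, 6)]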
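As a quick check on the list comprehension, the sketch below builds the page URLs and prints them. It assumes the listing is paginated through a `page` query parameter with no space after the `?`; `BASE_URL` and `build_urls` are names introduced here for illustration only.

```python
# Hypothetical helper names; the URL pattern is an assumption based on the pages listed above.
BASE_URL = 'http://a.qidian.com/?page={}'


def build_urls(first_page, last_page):
    """Return listing URLs for first_page..last_page, both inclusive."""
    # range() excludes its stop value, so add 1 to include last_page.
    return [BASE_URL.format(page) for page in range(first_page, last_page + 1)]


urls = build_urls(1, 5)
for url in urls:
    print(url)
```

Writing the bounds as inclusive page numbers makes it harder to drop the last page by accident, since `range(1, 5)` on its own stops at page 4.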

3.

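Parse and extract the data. Opening the developer tools shows that each novel's data sits in an li under the ul whose class is "all-img-list cf", so we can select those li elements first to simplify the later parsing.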

# Locate the enclosing tag and loop over its items
infos = selector.xpath('//ul[@class="all-img-list cf"]/li')

for info in infos:
    title = info.xpath('div[2]/h4/a/text()')[0]
    author = info.xpath('div[2]/p[1]/a[1]/text()')[0]
    style_1 = info.xpath('div[2]/p[1]/a[2]/text()')[0]
    style_2 = info.xpath('div[2]/p[1]/a[3]/text()')[0]
    style = style_1+'·'+style_2
    complete = info.xpath('div[2]/p[1]/span/text()')[0]
    introduce = info.xpath('div[2]/p[2]/text()')[0].strip()
    word = info.xpath('div[2]/p[3]/span/text()')[0].strip('万字')
    info_list = [title, author, style, complete, introduce, word]
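Indexing every XPath result with `[0]` raises an `IndexError` as soon as one item lacks a field. A minimal sketch of a more forgiving approach follows. `first_text` and `parse_word_count` are hypothetical helpers, not part of lxml; they assume only that a `text()` query returns a list of strings and that the word count is written as a number followed by `万字`.

```python
def first_text(nodes, default=''):
    """Return the first XPath text result, stripped, or default if there is none."""
    return nodes[0].strip() if nodes else default


def parse_word_count(text):
    """Convert a string such as '123.45万字' to a float counted in units of 万 (10,000)."""
    # str.strip('万字') removes any leading/trailing '万' or '字' characters,
    # not the literal substring; that is enough for this assumed format.
    number = text.strip().strip('万字')
    try:
        return float(number)
    except ValueError:
        return None  # unexpected format; leave the decision to the caller


# Plain lists stand in for the results of info.xpath(...) here.
print(first_text(['  Some Title  ']))
print(first_text([], default='unknown'))
print(parse_word_count('123.45万字'))
```

In the loop above, a call such as `first_text(info.xpath('div[2]/p[1]/a[3]/text()'))` could replace the bare `[0]` lookup where a field may be absent.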

We collect all the parsed data in a single module-level list

# Append the row to the list
all_info_list.append(info_list)

4.

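Dump the data in the list to an Excel sheet.
Unlike plain-text or Word output, here we define a header row and write it into the sheet.
Before writing, create the workbook (the Excel file) first and then the worksheet (the Sheet).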

# Define the header row
header = ['title', 'author', 'style', 'complete', 'introduce', 'word']
# Create the workbook
book = xlwt.Workbook(encoding='utf-8')
# Create the worksheet
sheet = book.add_sheet('Sheet1')
for h in range(len(header)):
    # Write the header row
    sheet.write(0, h, header[h])

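Write the data into the Excel sheet row by row and column by column

i = 1  # row index (row 0 holds the header)
for row in all_info_list:  # named row to avoid shadowing the built-in list
    j = 0  # column index
    # Write the scraped data
    for data in row:
        sheet.write(i, j, data)
        j += 1  # next column; data is written left to right, like handwriting
    i += 1  # move on to the next row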
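The manual counters `i` and `j` can also be replaced with `enumerate`. The sketch below shows that traversal under a few assumptions: `RecordingSheet` is a hypothetical stand-in that only mimics the `write(row, col, value)` call used above, so the snippet runs without xlwt installed, and the sample row is made-up placeholder data.

```python
class RecordingSheet:
    """Stand-in for a worksheet: remembers every write(row, col, value) call."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


def write_rows(sheet, header, rows):
    """Write the header into row 0 and each data row below it."""
    for col, name in enumerate(header):
        sheet.write(0, col, name)
    # start=1 keeps the data rows below the header row
    for row_idx, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            sheet.write(row_idx, col, value)


header = ['title', 'author', 'style', 'complete', 'introduce', 'word']
rows = [['Title A', 'Author A', 'Genre·Subgenre', 'ongoing', 'Synopsis...', '12.3']]
sheet = RecordingSheet()
write_rows(sheet, header, rows)
print(sheet.cells[(0, 0)], sheet.cells[(1, 0)])
```

With a real xlwt worksheet, `write_rows(sheet, header, all_info_list)` would issue the same `sheet.write` calls as the loops above.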

Finally, save the Excel file

# Save the file
book.save('xiaoshuo.xls')

That completes scraping novel information from Qidian.

A. Complete code

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Import the required libraries
import xlwt
import requests
from lxml import etree
import time


# Initialize the list that will hold the scraped data
all_info_list = []


# Fetch one listing page and collect the information for each novel on it
def get_info(url):

    html = requests.get(url)
    selector = etree.HTML(html.text)

    # Locate the enclosing tag and loop over its items
    infos = selector.xpath('//ul[@class="all-img-list cf"]/li')

    for info in infos:
        title = info.xpath('div[2]/h4/a/text()')[0]
        author = info.xpath('div[2]/p[1]/a[1]/text()')[0]
        style_1 = info.xpath('div[2]/p[1]/a[2]/text()')[0]
        style_2 = info.xpath('div[2]/p[1]/a[3]/text()')[0]
        style = style_1+'·'+style_2
        complete = info.xpath('div[2]/p[1]/span/text()')[0]
        introduce = info.xpath('div[2]/p[2]/text()')[0].strip()
        word = info.xpath('div[2]/p[3]/span/text()')[0].strip('万字')
        info_list = [title, author, style, complete, introduce, word]
        # Append the row to the list
        all_info_list.append(info_list)

    # Sleep 1 second once per page, so the pause spaces out the requests
    time.sleep(1)


# Program entry point
if __name__ == '__main__':

    # range(1, 6) covers pages 1 through 5
    urls = ['http://a.qidian.com/?page={}'.format(str(i)) for i in range(1, 6)]
    # Fetch all the data
    for url in urls:
        get_info(url)

    # Define the header row
    header = ['title', 'author', 'style', 'complete', 'introduce', 'word']
    # Create the workbook
    book = xlwt.Workbook(encoding='utf-8')
    # Create the worksheet
    sheet = book.add_sheet('Sheet1')
    for h in range(len(header)):
        # Write the header row
        sheet.write(0, h, header[h])

    i = 1  # row index (row 0 holds the header)
    for row in all_info_list:  # named row to avoid shadowing the built-in list
        j = 0  # column index
        # Write the scraped data
        for data in row:
            sheet.write(i, j, data)
            j += 1
        i += 1
    # Save the file
    book.save('xiaoshuo.xls')
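The script spaces out its work with `time.sleep`. One possible way to tie that pause to page requests is to place it in the loop over URLs. In this sketch `crawl`, `fetch`, `delay`, and `fake_fetch` are hypothetical names, and `fetch` is injected so the example runs without touching the network; a function such as `get_info` could be passed in instead.

```python
import time


def crawl(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause between requests, not before the first one
        results.append(fetch(url))
    return results


# A fake fetch function stands in for a real request so nothing is sent.
def fake_fetch(url):
    return 'fetched ' + url


pages = crawl(['page-1', 'page-2'], fake_fetch, delay=0.01)
print(pages)
```

Keeping the delay as a parameter makes it easy to slow the crawl down if the site starts rejecting requests.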