Crawling 51CTO Blogs with Python


1. Background

I have recently been learning Python's requests module and wanted to build a tool that, given a search keyword and a number of pages, finds every blog post on those result pages and writes the results to an Excel file. The requests module fetches the pages, BeautifulSoup extracts the target data (blog title and blog URL), and xlsxwriter lays out the Excel template and writes the extracted content into it. The same approach can later be used to scrape other kinds of data and store them in files or a database.

2. Code

2.1 Structure

  • The getexcel module creates the Excel file and worksheet, lays out the Excel template, and writes the extracted content into it
  • The geturl module builds the page URLs from the search keyword, fetches each page, and uses BeautifulSoup to extract the blog titles and URLs

2.2 Code

GitHub link

  • getexcel3.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# @Author  : kelly

import xlsxwriter


class create_excel:
    def __init__(self):
        self.tag_list = ["blog_name", "blog_url"]

    def create_workbook(self, search=" "):
        # Name the Excel file after the search keyword
        excel_name = search + '.xlsx'
        workbook = xlsxwriter.Workbook(excel_name)
        worksheet_M = workbook.add_worksheet(search)
        print('create %s....' % excel_name)
        return workbook, worksheet_M

    def col_row(self, worksheet):
        # Size the title row and the two data columns
        worksheet.set_row(0, 17)
        worksheet.set_column('A:A', 58)
        worksheet.set_column('B:B', 58)

    def shell_format(self, workbook):
        # Header (merged title) format
        merge_format = workbook.add_format({
            'bold': 1,
            'border': 1,
            'align': 'center',
            'valign': 'vcenter',
            'fg_color': '#FAEBD7'})
        # Column-name format
        name_format = workbook.add_format({
            'bold': 1,
            'border': 1,
            'align': 'center',
            'valign': 'vcenter',
            'fg_color': '#E0FFFF'})
        # Body format
        normal_format = workbook.add_format({'align': 'center'})
        return merge_format, name_format, normal_format

    # Write the merged title row
    def write_title(self, worksheet, search, merge_format):
        title = search + " search results"
        worksheet.merge_range('A1:B1', title, merge_format)
        print('write title success')

    # Write the column names
    def write_tag(self, worksheet, name_format):
        tag_row = 1
        tag_col = 0
        for num in self.tag_list:
            worksheet.write(tag_row, tag_col, num, name_format)
            tag_col += 1
        print('write tag success')

    # Write the scraped content, one blog per row
    def write_context(self, worksheet, con_dic, normal_format):
        row = 2
        for k, v in con_dic.items():
            worksheet.write(row, 0, k, normal_format)
            worksheet.write(row, 1, v, normal_format)
            row += 1
        print('write context success')

    # Close the Excel file
    def workbook_close(self, workbook):
        workbook.close()


if __name__ == '__main__':
    print('This is create excel mode')
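One design note on this module: xlsxwriter can only create new workbooks, it cannot open or append to an existing .xlsx file, so the workbook has to be created, written, and closed in a single pass as done here.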
  • geturl3.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup


class get_urldic:
    # Read the search keyword and page count, then build one search URL per page
    def get_url(self):
        urlList = []
        # NOTE: the search-URL prefix was lost when this post was reproduced;
        # it should be the 51CTO blog-search endpoint, ending just before the keyword
        first_url = '='
        after_url = '&type=&page='
        try:
            search = input("Please input search name:")
            page = int(input("Please input page:"))
        except Exception as e:
            print('Input error:', e)
            exit()
        for num in range(1, page + 1):
            url = first_url + search + after_url + str(num)
            urlList.append(url)
        print("Please wait....")
        return urlList, search

    # Fetch the HTML of every search-result page
    def get_html(self, urlList):
        response_list = []
        for url in urlList:
            response = requests.get(url).content
            response_list.append(response)
        return response_list

    # Extract blog_name and blog_url from each page
    def get_soup(self, html_doc):
        result = {}
        for page in html_doc:
            soup = BeautifulSoup(page, 'html.parser')
            context = soup.find_all('a', class_='m-1-4 fl')
            for i in context:
                title = i.get_text()
                result[title.strip()] = i['href']
        return result


if __name__ == '__main__':
    blog = get_urldic()
    urllist, search = blog.get_url()
    html_doc = blog.get_html(urllist)
    result = blog.get_soup(html_doc)
    for k, v in result.items():
        print('search blog_name is:%s,blog_url is:%s' % (k, v))
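If the site rejects bare requests, a common tweak (not part of the original code; the header value below is only an example) is to send a browser-like User-Agent and a timeout:

# Hypothetical hardening of get_html; any real browser UA string works
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=10).content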
  • main.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import geturl3
import getexcel3


# Get the {blog_name: blog_url} dictionary
def get_dic():
    blog = geturl3.get_urldic()
    urllist, search = blog.get_url()
    html_doc = blog.get_html(urllist)
    result = blog.get_soup(html_doc)
    return result, search


# Write the results to Excel
def write_excel(urldic, search):
    excel = getexcel3.create_excel()
    workbook, worksheet = excel.create_workbook(search)
    excel.col_row(worksheet)
    merge_format, name_format, normal_format = excel.shell_format(workbook)
    excel.write_title(worksheet, search, merge_format)
    excel.write_tag(worksheet, name_format)
    excel.write_context(worksheet, urldic, normal_format)
    excel.workbook_close(workbook)


def main():
    url_dic, search_name = get_dic()
    write_excel(url_dic, search_name)


if __name__ == '__main__':
    main()
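main.py imports geturl3 and getexcel3 by module name, so all three files need to sit in the same directory (or on PYTHONPATH) for the imports to work.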

3. Test Results

3.1 Running the Program

Run main.py and enter the search keyword and the number of pages to query.
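Assuming kafka as the keyword and 2 pages, for example, a session looks roughly like this (the progress messages come from the print statements in the code above):

$ python3 main.py
Please input search name:kafka
Please input page:2
Please wait....
create kafka.xlsx....
write title success
write tag success
write context success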

3.2 Results

With the keyword kafka and the chosen page count, you can see that a kafka.xlsx file has been generated.

Open kafka.xlsx to check the results.

That completes the workflow of using requests, xlsxwriter, and BeautifulSoup to scrape 51CTO for a given keyword and page count. requests fetches the full HTML of each page into a list, BeautifulSoup parses the HTML and collects all the matching <a> tags, and the final step of pulling each tag's text and URL out of them is a little convoluted, so it is worth stepping through in a debugger to watch how the variables change.
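To see just that extraction step in isolation, here is a minimal sketch that runs on an inline HTML fragment instead of a real 51CTO page; the two post titles and URLs are made up, but the m-1-4 fl class is the one the scraper above looks for:

from bs4 import BeautifulSoup

# Stand-in for one fetched search-result page (hypothetical content)
html_doc = """
<a class="m-1-4 fl" href="https://blog.51cto.com/u_1/p_1"> Kafka basics </a>
<a class="m-1-4 fl" href="https://blog.51cto.com/u_2/p_2"> Kafka tuning </a>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
result = {}
for a in soup.find_all('a', class_='m-1-4 fl'):
    # get_text() returns the anchor's text (with surrounding whitespace),
    # a['href'] returns the link target
    result[a.get_text().strip()] = a['href']

print(result)
# {'Kafka basics': 'https://blog.51cto.com/u_1/p_1', 'Kafka tuning': 'https://blog.51cto.com/u_2/p_2'}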
