Python 3 Web Scraping: Basic Library Usage

Table of Contents

  • Python 3 Web Scraping: Basic Library Usage
    • 1. HTTP Fundamentals
      • 1. URL, URN & URI
      • 2. HTTP & HTTPS
      • 3. Requests
      • 4. Responses
    • 2. Basic Library Usage
      • 1. urllib
      • 2. requests
      • 3. Regular Expressions
      • 4. XPath
      • 5. Beautiful Soup
      • 6. pyquery

1. HTTP Fundamentals

1. URL, URN & URI

  • URL: Uniform Resource Locator, e.g. https://github.com/favicon.ico

  • URN: Universal Resource Name, which names a resource without saying how to locate it; e.g. urn:isbn:0451450523 identifies a book by its ISBN

  • URI: Uniform Resource Identifier

    URLs and URNs are both kinds of URI. URNs are rarely used nowadays, so almost every URI you run into in practice is a URL.
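
As a quick illustration of the parts a URL is made of, the urllib.parse module can split one into scheme, host, path and so on (a minimal sketch; the example URL is arbitrary):

from urllib.parse import urlparse

# split an example URL into its components
parts = urlparse('https://github.com/favicon.ico?raw=true#top')
print(parts.scheme)    # https
print(parts.netloc)    # github.com
print(parts.path)      # /favicon.ico
print(parts.query)     # raw=true
print(parts.fragment)  # top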

2. HTTP & HTTPS

The scheme at the start of a URL is usually http or https, but you may also see URLs beginning with ftp, sftp, smb and so on.

  • HTTP: Hyper Text Transfer Protocol
  • HTTPS: HTTP over SSL/TLS, i.e. HTTP with an encrypted, secure transport

3. Requests

  • Request methods

    • A GET request carries its parameters in the URL, so the data is visible there; a POST request's URL does not
      contain the data, which is instead sent as a form in the request body
    • The amount of data a GET request can carry is limited by the maximum URL length (typically a few kilobytes,
      depending on browser and server), while POST has no such limit
    • Forms, sensitive information and file uploads should be submitted with POST
  • Request URL

    The Uniform Resource Locator (URL) uniquely identifies the resource we want to request

  • Request headers

    Important additional information for the server, for example:

    • Accept: which content types the client can accept
    • Accept-Language: which languages the client can accept
    • Accept-Encoding: which content encodings the client can accept
    • Host: the host and port of the requested resource, i.e. the location of the origin server or gateway for the
      request URL. From HTTP/1.1 onward every request must include it
    • Cookie (often plural, Cookies): data the site stores on the client so it can recognize the user and track the
      session; its main job is to keep the current session alive
    • Referer: identifies the page this request came from; the server can use this information for things such as
      traffic-source statistics and hotlink protection
    • User-Agent (UA for short): a special string that lets the server identify the client's operating system and
      browser, including versions. Setting it in a crawler disguises the request as a browser; without it the
      request is very likely to be recognized as a crawler
    • Content-Type: also called the Internet Media Type or MIME type; in HTTP message headers it describes the media
      type of the request. For example, text/html means HTML, image/gif a GIF image, application/json JSON; see a
      MIME-type reference for the full mapping
  • Request body

    The request body usually carries the form data of a POST request; for a GET request the body is empty.
    Common Content-Type values and how the data is submitted (see the sketch after this list):

    Content-Type                        How the data is submitted
    application/x-www-form-urlencoded   form data
    multipart/form-data                 form file upload
    application/json                    serialized JSON data
    text/xml                            XML data
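
A minimal sketch of the GET/POST difference in practice, using the requests library introduced later against httpbin.org, which simply echoes back what it receives; the URL and form values are only illustrative:

import requests

# GET: the parameters end up in the URL itself
r_get = requests.get('https://httpbin.org/get', params={'name': 'germey'})
print(r_get.url)              # https://httpbin.org/get?name=germey

# POST: the same data travels in the request body as a form
r_post = requests.post('https://httpbin.org/post', data={'name': 'germey'})
print(r_post.json()['form'])  # {'name': 'germey'}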

4. Responses

A response is what the server returns to the client. It consists of three parts: the response status code (Response Status Code), the response headers (Response Headers) and the response body (Response Body).

  • Status code

    Common status codes and what they mean

    Code  Meaning                        Details
    100   Continue                       The requester should continue; the server has received part of the request and is waiting for the rest
    101   Switching Protocols            The requester asked the server to switch protocols and the server has agreed to switch
    200   OK                             The server successfully processed the request
    201   Created                        The request succeeded and the server created a new resource
    202   Accepted                       The server has accepted the request but has not processed it yet
    203   Non-Authoritative Information  The server processed the request, but the returned information may come from another source
    204   No Content                     The server processed the request but returned no content
    205   Reset Content                  The server processed the request and the content was reset
    206   Partial Content                The server successfully processed part of the request
    300   Multiple Choices               The server can perform several different actions for the request
    301   Moved Permanently              The requested page has moved permanently to a new location (permanent redirect)
    302   Found                          The requested page temporarily points to another page (temporary redirect)
    303   See Other                      If the original request was a POST, the redirect target should be fetched with GET
    304   Not Modified                   The page has not changed since the last request; the cached copy can be used
    305   Use Proxy                      The requester should access the page through a proxy
    307   Temporary Redirect             The requested resource is temporarily served from a different location
    400   Bad Request                    The server could not parse the request
    401   Unauthorized                   The request was not authenticated, or authentication failed
    403   Forbidden                      The server refuses to serve this request
  • Response headers

    The response headers carry the server's answer to the request, for example:

    • Date: when the response was generated.
    • Last-Modified: when the resource was last modified.
    • Content-Encoding: the encoding of the response content.
    • Server: information about the server, such as its name and version.
    • Content-Type: the type of the returned document, e.g. text/html for an HTML document, application/x-javascript for a JavaScript file, image/jpeg for an image.
    • Set-Cookie: tells the browser to store this content in its cookies and send them back with the next request.
    • Expires: when the response expires; it lets a proxy server or the browser cache the content, so that a later
      visit can be served from cache, reducing server load and loading time.
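
A quick way to look at these three parts of a real response is to inspect a requests response object; a minimal sketch (httpbin.org is only a convenient echo service):

import requests

r = requests.get('https://httpbin.org/get')
print(r.status_code)                   # the status code, e.g. 200
print(r.headers.get('Content-Type'))   # one of the response headers, e.g. application/json
print(r.text[:200])                    # the beginning of the response body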

2. Basic Library Usage

1. urllib

Detecting a timeout

import urllib.request
import urllib.error
import socket

try:
    # the target URL was stripped when the article was published; httpbin.org is used here as a stand-in
    response = urllib.request.urlopen('https://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Building an HTTP request with custom headers

from urllib import request, parse

# the target URL was stripped when the article was published; httpbin.org/post is assumed
url = 'https://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {'name': 'Germey'}
data = bytes(parse.urlencode(dict), encoding='utf-8')
req = request.Request(url=url, headers=headers, data=data, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
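
urllib signals HTTP error statuses with HTTPError (a subclass of URLError), so it is worth telling the two apart when a request fails; a minimal sketch, with httpbin's status endpoint assumed as the target:

from urllib import request, error

try:
    response = request.urlopen('https://httpbin.org/status/404')
except error.HTTPError as e:
    # HTTP-level error: the server answered, but with an error status
    print('HTTPError:', e.code, e.reason)
except error.URLError as e:
    # network-level error: DNS failure, refused connection, timeout, ...
    print('URLError:', e.reason)
else:
    print('OK:', response.status)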

2. requests

A simple scrape of Zhihu explore-page question titles

import requests
import re
import sys
#set a real browser User-Agent, otherwise the site returns 400; grab a current one from the Network tab of Chrome's developer tools (F12)
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' 
}
# the explore-page URL was stripped when the article was published; https://www.zhihu.com/explore is assumed
r=requests.get('https://www.zhihu.com/explore',headers=headers)
# if anything looks wrong, send the same request to httpbin.org first and inspect what is actually being sent
#r=requests.get('https://httpbin.org/get',headers=headers)
if r.status_code != 200 :
    print( "return status_code : %s" % r.status_code )
    sys.exit()
pattern=re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles=re.findall(pattern,r.text)
print(titles)
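
requests can also build the query string for you and decode JSON responses directly, which avoids assembling URLs by hand; a small sketch against httpbin.org:

import requests

data = {'name': 'germey', 'age': 22}
r = requests.get('https://httpbin.org/get', params=data)
print(r.url)     # the query string is appended automatically
print(r.json())  # the JSON body decoded into a dict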

Fetching images, video and audio

import requests
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' 
}
# the URL was stripped when the article was published; GitHub's favicon is assumed
r=requests.get('https://github.com/favicon.ico' , headers=headers )
#print( r.text )    -- the body decoded as text
#print( r.content ) -- the body as raw bytes
with open( 'favicon.ico' , 'wb' ) as f:
    f.write( r.content )
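
For larger video or audio files it is safer not to hold the whole body in memory at once; requests can stream the response in chunks. A minimal sketch (httpbin's bytes endpoint stands in for a real media URL):

import requests

# stream=True defers downloading the body until it is iterated over
with requests.get('https://httpbin.org/bytes/102400', stream=True) as r:
    with open('large_file.bin', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)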

Checking the status code

import requests
import sys
# the URL was stripped when the article was published; httpbin.org is used as a stand-in
r = requests.get('https://httpbin.org/get')
sys.exit() if r.status_code != requests.codes.ok else print('Request Successfully')
#class 'requests.structures.CaseInsensitiveDict'
print( type(r.headers) , r.headers)
#class 'requests.cookies.RequestsCookieJar'
print( type(r.cookies) , r.cookies)
print( type(r.url) , r.url )
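
Instead of comparing status codes by hand, requests can raise an exception for error responses; a small sketch using httpbin's status endpoint:

import requests

r = requests.get('https://httpbin.org/status/404')
try:
    r.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as e:
    print('request failed:', e)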

requests.codes

import requests
codes_dict=requests.codes.__dict__
# swap key and value: (status_code, name)
si=[(status_code,info) for info,status_code in codes_dict.items()]
dist_si={}
# merge duplicates: collect every name that maps to the same status code
for code_dict in si:
    code_dict_key=code_dict[0]
    code_dict_val=code_dict[1]
    print( code_dict_key , code_dict_val )
    if dist_si.get(code_dict_key):
        dist_si[code_dict_key].append(code_dict_val)
    else:
        dist_si[code_dict_key]=[code_dict_val]
for (status_code,info) in dist_si.items():
    print( status_code,info )
================================== output ==================================
status_codes ['name']
# informational status codes
100 ['continue', 'CONTINUE']
101 ['switching_protocols', 'SWITCHING_PROTOCOLS']
102 ['processing', 'PROCESSING']
103 ['checkpoint', 'CHECKPOINT']
122 ['uri_too_long', 'URI_TOO_LONG', 'request_uri_too_long', 'REQUEST_URI_TOO_LONG']
# success status codes
200 ['ok', 'OK', 'okay', 'OKAY', 'all_ok', 'ALL_OK', 'all_okay', 'ALL_OKAY', 'all_good', 'ALL_GOOD', '\\o/', '✓']
201 ['created', 'CREATED']
202 ['accepted', 'ACCEPTED']
203 ['non_authoritative_info', 'NON_AUTHORITATIVE_INFO', 'non_authoritative_information', 'NON_AUTHORITATIVE_INFORMATION']
204 ['no_content', 'NO_CONTENT']
205 ['reset_content', 'RESET_CONTENT', 'reset', 'RESET']
206 ['partial_content', 'PARTIAL_CONTENT', 'partial', 'PARTIAL']
207 ['multi_status', 'MULTI_STATUS', 'multiple_status', 'MULTIPLE_STATUS', 'multi_stati', 'MULTI_STATI', 'multiple_stati', 'MULTIPLE_STATI']
208 ['already_reported', 'ALREADY_REPORTED']
226 ['im_used', 'IM_USED']
# redirection status codes
300 ['multiple_choices', 'MULTIPLE_CHOICES']
301 ['moved_permanently', 'MOVED_PERMANENTLY', 'moved', 'MOVED', '\\o-']
302 ['found', 'FOUND']
303 ['see_other', 'SEE_OTHER', 'other', 'OTHER']
304 ['not_modified', 'NOT_MODIFIED']
305 ['use_proxy', 'USE_PROXY']
306 ['switch_proxy', 'SWITCH_PROXY']
307 ['temporary_redirect', 'TEMPORARY_REDIRECT', 'temporary_moved', 'TEMPORARY_MOVED', 'temporary', 'TEMPORARY']
308 ['permanent_redirect', 'PERMANENT_REDIRECT', 'resume_incomplete', 'RESUME_INCOMPLETE', 'resume', 'RESUME']
# client error status codes
400 ['bad_request', 'BAD_REQUEST', 'bad', 'BAD']
401 ['unauthorized', 'UNAUTHORIZED']
402 ['payment_required', 'PAYMENT_REQUIRED', 'payment', 'PAYMENT']
403 ['forbidden', 'FORBIDDEN']
404 ['not_found', 'NOT_FOUND', '-o-', '-O-']
405 ['method_not_allowed', 'METHOD_NOT_ALLOWED', 'not_allowed', 'NOT_ALLOWED']
406 ['not_acceptable', 'NOT_ACCEPTABLE']
407 ['proxy_authentication_required', 'PROXY_AUTHENTICATION_REQUIRED', 'proxy_auth', 'PROXY_AUTH', 'proxy_authentication', 'PROXY_AUTHENTICATION']
408 ['request_timeout', 'REQUEST_TIMEOUT', 'timeout', 'TIMEOUT']
409 ['conflict', 'CONFLICT']
410 ['gone', 'GONE']
411 ['length_required', 'LENGTH_REQUIRED']
412 ['precondition_failed', 'PRECONDITION_FAILED']
428 ['precondition', 'PRECONDITION', 'precondition_required', 'PRECONDITION_REQUIRED']
413 ['request_entity_too_large', 'REQUEST_ENTITY_TOO_LARGE']
414 ['request_uri_too_large', 'REQUEST_URI_TOO_LARGE']
415 ['unsupported_media_type', 'UNSUPPORTED_MEDIA_TYPE', 'unsupported_media', 'UNSUPPORTED_MEDIA', 'media_type', 'MEDIA_TYPE']
416 ['requested_range_not_satisfiable', 'REQUESTED_RANGE_NOT_SATISFIABLE', 'requested_range', 'REQUESTED_RANGE', 'range_not_satisfiable', 'RANGE_NOT_SATISFIABLE']
417 ['expectation_failed', 'EXPECTATION_FAILED']
418 ['im_a_teapot', 'IM_A_TEAPOT', 'teapot', 'TEAPOT', 'i_am_a_teapot', 'I_AM_A_TEAPOT']
421 ['misdirected_request', 'MISDIRECTED_REQUEST']
422 ['unprocessable_entity', 'UNPROCESSABLE_ENTITY', 'unprocessable', 'UNPROCESSABLE']
423 ['locked', 'LOCKED']
424 ['failed_dependency', 'FAILED_DEPENDENCY', 'dependency', 'DEPENDENCY']
425 ['unordered_collection', 'UNORDERED_COLLECTION', 'unordered', 'UNORDERED']
426 ['upgrade_required', 'UPGRADE_REQUIRED', 'upgrade', 'UPGRADE']
429 ['too_many_requests', 'TOO_MANY_REQUESTS', 'too_many', 'TOO_MANY']
431 ['header_fields_too_large', 'HEADER_FIELDS_TOO_LARGE', 'fields_too_large', 'FIELDS_TOO_LARGE']
444 ['no_response', 'NO_RESPONSE', 'none', 'NONE']
449 ['retry_with', 'RETRY_WITH', 'retry', 'RETRY']
450 ['blocked_by_windows_parental_controls', 'BLOCKED_BY_WINDOWS_PARENTAL_CONTROLS', 'parental_controls', 'PARENTAL_CONTROLS']
451 ['unavailable_for_legal_reasons', 'UNAVAILABLE_FOR_LEGAL_REASONS', 'legal_reasons', 'LEGAL_REASONS']
499 ['client_closed_request', 'CLIENT_CLOSED_REQUEST']
# server error status codes
500 ['internal_server_error', 'INTERNAL_SERVER_ERROR', 'server_error', 'SERVER_ERROR', '/o\\', '✗']
501 ['not_implemented', 'NOT_IMPLEMENTED']
502 ['bad_gateway', 'BAD_GATEWAY']
503 ['service_unavailable', 'SERVICE_UNAVAILABLE', 'unavailable', 'UNAVAILABLE']
504 ['gateway_timeout', 'GATEWAY_TIMEOUT']
505 ['http_version_not_supported', 'HTTP_VERSION_NOT_SUPPORTED', 'http_version', 'HTTP_VERSION']
506 ['variant_also_negotiates', 'VARIANT_ALSO_NEGOTIATES']
507 ['insufficient_storage', 'INSUFFICIENT_STORAGE']
509 ['bandwidth_limit_exceeded', 'BANDWIDTH_LIMIT_EXCEEDED', 'bandwidth', 'BANDWIDTH']
510 ['not_extended', 'NOT_EXTENDED']
511 ['network_authentication_required', 'NETWORK_AUTHENTICATION_REQUIRED', 'network_auth', 'NETWORK_AUTH', 'network_authentication', 'NETWORK_AUTHENTICATION']

File upload ("Content-Type": "multipart/form-data")

import requests
files={'file':open('1.pem','rb')
}
# the URL was stripped when the article was published; httpbin.org/post is assumed
r=requests.post( 'https://httpbin.org/post' , files=files )
print(r.text)

Cookies

  • Getting cookies

    import requests
    # the URL was stripped when the article was published; www.baidu.com is assumed as a simple demo target
    r=requests.get('https://www.baidu.com')
    for key , val in r.cookies.items():
        print( "%s=%s" % (key , val) )
    
  • Setting cookies by hand (copy them from the browser's developer tools, F12)

    ##################### method 1: put the whole Cookie string into the request headers #####################
    import requests
    headers={'Cookie':'_zap=cc672834-3e63-4a4e-9246-93b54dc74a76; __DAYU_PP=yuUeiiVeaVZEayUab2rFffffffffd3f1f0f5bc9c; d_c0="AMCkrWxHuw2PTh4QnK1aQBQcA2l7rd2aSjY=|1528686380"; l_n_c=1; q_c1=35d4a692ec7d4c3c88351f8b8959668b|1553738732000|1516775913000; _xsrf=d632891773e10dc462a07feb2f829368; n_c=1; _xsrf=aDKGdn6TfOkYfk43vsekRV75FfebYNba; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; __utmc=51854390; __utmz=51854390.1553738668.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); BL_D_PROV=; BL_T_PROV=; tgw_l7_route=66cb16bc7f45da64562a077714739c11; l_cap_id="YzJjYzEyY2ExZGMxNGJkMmFjNmNkNTM3MDg1ZWRiM2E=|1553762062|9d1547776eebfb3b42ca92369b2d3a9df4245339"; r_cap_id="Yjg3NTg0YjRhNmZjNDEyMDk2MmFkMjI4NzgyODgzYzU=|1553762062|efff30851f845765634ec9bae5bde07dce11315e"; cap_id="M2M0MjNjMzUyNzdlNGQxMThlNTRhOGVhOTY5ZDkwMjM=|1553762062|48aac3689381c89f5ecccbdc02c001de923e6fe2"; __utma=51854390.1821104099.1553738668.1553738668.1553761992.2; __utmb=51854390.0.10.1553761992; capsion_ticket="2|1:0|10:1553762071|14:capsion_ticket|44:ODBmZjRiMWMzN2MxNDM1OTlkMDUzNTA5NTNjM2ZlMDI=|6a6ccc9cf7d944da04671d627a7be433a0911b39d8918dc4ae65184d1d7fff89"; z_c0="2|1:0|10:1553762113|4:z_c0|92:Mi4xVHg3NkRnQUFBQUFBd0tTdGJFZTdEU1lBQUFCZ0FsVk5RZFdKWFFBU2RTWmpnTUIwSXF3ODZ1TEFNTlJraFJsbjh3|fb442f693e4ef8cc9837064a6e4e1bdd766d26db24f0bb4b0b765f36e7672ac8"; tst=r; __utmv=51854390.100--|2=registration_date=20190328=1^3=entry_date=20180124=1' ,'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    # the URL was stripped when the article was published; https://www.zhihu.com is assumed (the cookies above are Zhihu cookies)
    r=requests.get('https://www.zhihu.com' ,headers=headers)
    print(r.text)
    ##################### method 2: load the cookies into a RequestsCookieJar #####################
    import requests
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    cookies='_zap=cc672834-3e63-4a4e-9246-93b54dc74a76; __DAYU_PP=yuUeiiVeaVZEayUab2rFffffffffd3f1f0f5bc9c; d_c0="AMCkrWxHuw2PTh4QnK1aQBQcA2l7rd2aSjY=|1528686380"; l_n_c=1; q_c1=35d4a692ec7d4c3c88351f8b8959668b|1553738732000|1516775913000; _xsrf=d632891773e10dc462a07feb2f829368; n_c=1; _xsrf=aDKGdn6TfOkYfk43vsekRV75FfebYNba; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; __utmc=51854390; __utmz=51854390.1553738668.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); BL_D_PROV=; BL_T_PROV=; tgw_l7_route=66cb16bc7f45da64562a077714739c11; l_cap_id="YzJjYzEyY2ExZGMxNGJkMmFjNmNkNTM3MDg1ZWRiM2E=|1553762062|9d1547776eebfb3b42ca92369b2d3a9df4245339"; r_cap_id="Yjg3NTg0YjRhNmZjNDEyMDk2MmFkMjI4NzgyODgzYzU=|1553762062|efff30851f845765634ec9bae5bde07dce11315e"; cap_id="M2M0MjNjMzUyNzdlNGQxMThlNTRhOGVhOTY5ZDkwMjM=|1553762062|48aac3689381c89f5ecccbdc02c001de923e6fe2"; __utma=51854390.1821104099.1553738668.1553738668.1553761992.2; __utmb=51854390.0.10.1553761992; capsion_ticket="2|1:0|10:1553762071|14:capsion_ticket|44:ODBmZjRiMWMzN2MxNDM1OTlkMDUzNTA5NTNjM2ZlMDI=|6a6ccc9cf7d944da04671d627a7be433a0911b39d8918dc4ae65184d1d7fff89"; z_c0="2|1:0|10:1553762113|4:z_c0|92:Mi4xVHg3NkRnQUFBQUFBd0tTdGJFZTdEU1lBQUFCZ0FsVk5RZFdKWFFBU2RTWmpnTUIwSXF3ODZ1TEFNTlJraFJsbjh3|fb442f693e4ef8cc9837064a6e4e1bdd766d26db24f0bb4b0b765f36e7672ac8"; tst=r; __utmv=51854390.100--|2=registration_date=20190328=1^3=entry_date=20180124=1'
    jar=requests.cookies.RequestsCookieJar()
    for cookie in cookies.split(';'):
        key,val = cookie.split('=',1)
        jar.set(key.strip(),val)
    # again, https://www.zhihu.com is assumed for the stripped URL
    r=requests.get('https://www.zhihu.com' ,cookies=jar,headers=headers)
    print(r.text)
  • Keeping a session alive (a Session carries cookies across requests automatically)

    import requests
    s=requests.Session()
    # the URLs were stripped when the article was published; httpbin's cookie endpoints are assumed:
    # the first request sets a cookie, the second shows that the session sends it back
    r=s.get('https://httpbin.org/cookies/set/number/123456789')
    print(r.text)
    r=s.get('https://httpbin.org/cookies')
    print(r.text)
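
    A Session can also carry default headers for every request it sends, which keeps crawler code tidy; a small sketch, with httpbin.org assumed as the target:

    import requests

    s = requests.Session()
    # headers set on the session are sent with every subsequent request
    s.headers.update({'User-Agent': 'my-crawler/0.1'})
    r = s.get('https://httpbin.org/headers')
    print(r.json()['headers']['User-Agent'])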
    

SSL certificate verification

Further reading:

Understanding server certificates, CAs and SSL

SSL/TLS explained in detail

requests ships with its own CA list rather than using the operating system's certificate store the way IE or Chrome do; the bundle is provided by the certifi module. The CA file in my test environment:

(site_test) wujun@wujun-VirtualBox:~$ sudo find ./ -name cacert.pem 
./env_site_test/lib/python3.6/site-packages/pip/_vendor/certifi/cacert.pem
(site_test) wujun@wujun-VirtualBox:~$ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import certifi
>>> 
# suppress certificate warnings by capturing them into the logging system
import logging
import requests

logging.captureWarnings(True)
# by the time I tried this, 12306 no longer used a self-signed certificate, so there was nothing to bypass;
# verify=False is what skips certificate verification for a self-signed site
response=requests.get('https://www.12306.cn', verify=False)
# for mutual (two-way) TLS you must also supply the client certificate and private key;
# requests expects the private key to be unencrypted
#response=requests.get('https://www.12306.cn',cert=('/path/ser.crt','/path/key'))
print(response.status_code)
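
Because requests reads its CA list from certifi, you can also point verify at a bundle explicitly, for example an internal CA of your own; a minimal sketch (httpbin.org is only a stand-in target):

import certifi
import requests

print(certifi.where())  # path of the CA bundle requests uses by default
# verify also accepts a path to a custom CA bundle
r = requests.get('https://httpbin.org/get', verify=certifi.where())
print(r.status_code)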

Proxies

  • If you need a SOCKS proxy, install the extra dependency first: pip install 'requests[socks]'
import requests

proxies = {
    'http': 'http://211.149.172.228:9999',
    'https': 'https://182.150.35.173:80',
    # with HTTP Basic Auth, or through a SOCKS5 proxy, an entry would look like:
    # 'https': 'socks5://user:password@10.10.10.10:3128/',
}
# connect timeouts are best set slightly above a multiple of 3 seconds (the default TCP retransmission window);
# timeout can be refined into a (connect, read) tuple, and the default timeout=None blocks until the server responds
# the URL was stripped when the article was published; httpbin.org is used as a stand-in
requests.get('https://httpbin.org/get', proxies=proxies, timeout=(4, 5))

A tcpdump capture shows that the destination address in the IP header has changed to the proxy's address (211.149.172.228).


Authentication

  • Basic auth

    import requests
    from requests.auth import HTTPBasicAuth 
    # test user: test_name, password: 123456; the basic-auth path marks httpbin's Basic-auth test endpoint
    # (the URL was stripped when the article was published; httpbin.org is assumed)
    r=requests.get( 'https://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','123456'))
    r.text
    '''
    output test 1, with the correct password (200 OK):
    >>> r=requests.get( 'https://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','123456'))
    >>> print(r.headers)
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Sun, 31 Mar 2019 12:27:00 GMT', 'Server': 'nginx', 'Content-Length': '68', 'Connection': 'keep-alive'}
    >>> print(r.status_code)
    200
    output test 2, with a wrong password (401 Unauthorized):
    >>> r=requests.get( 'https://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','1234567'))
    >>> print(r.status_code)
    401
    >>> print(r.headers)    
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Date': 'Sun, 31 Mar 2019 12:30:17 GMT', 'Server': 'nginx', 'WWW-Authenticate': 'Basic realm="Fake Realm"', 'Content-Length': '0', 'Connection': 'keep-alive'}
    >>> 
    request test: what a Basic auth request looks like on the wire (httpbin's /get endpoint echoes the request back):
    >>> r=requests.get( 'https://httpbin.org/get' ,auth = HTTPBasicAuth('test_name','1234567'))
    >>> print(r.text)
    {"args": {}, "headers": {"Accept": "*/*", "Accept-Encoding": "gzip, deflate", "Authorization": "Basic dGVzdF9uYW1lOjEyMzQ1Njc=", "Host": "httpbin", "User-Agent": "python-requests/2.18.4"}, "origin": "218.88.16.199, 218.88.16.199", "url": ""
    }
    '''
    

    1. As the output shows, when the server requires Basic auth it responds with 401, and the WWW-Authenticate header announces that the "Fake Realm" realm requires authentication

    2. The client then adds "Authorization": "Basic dGVzdF9uYW1lOjEyMzQ1Njc=": user:password is base64-encoded and placed after "Basic" before being sent to the server (see the sketch below)

    3. If the username and password do not match, the server responds with 401 again, asking once more for Basic authentication
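
    A minimal sketch of step 2 done by hand, showing what HTTPBasicAuth builds for you (httpbin's /get endpoint is assumed as the echo target):

    import base64

    import requests

    # base64-encode "user:password" and place it after "Basic", exactly what HTTPBasicAuth does internally
    token = base64.b64encode(b'test_name:123456').decode('ascii')
    r = requests.get('https://httpbin.org/get', headers={'Authorization': 'Basic ' + token})
    print(r.json()['headers']['Authorization'])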

  • Digest auth

    import requests
    from requests.auth import HTTPDigestAuth
    # the URL was stripped when the article was published; httpbin's digest-auth endpoint is assumed
    url = 'https://httpbin.org/digest-auth/auth/user/pass'
    r=requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
    r.status_code
    print(r.headers)
    '''
    # output test 1
    >>> r.status_code
    200
    >>> print(r.headers)
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Mon, 01 Apr 2019 03:14:28 GMT', 'Server': 'nginx', 'Set-Cookie': 'fake=fake_value; Path=/, stale_after=never; Path=/', 'Content-Length': '59', 'Connection': 'keep-alive'}
    # output test 2: the server returns 401 (the URL for this call was stripped in the original)
    import requests
    from requests.auth import HTTPDigestAuth
    text=requests.get('', auth=HTTPDigestAuth('user', 'pass')).headers
    for head,response_msg in text.items():
        print(head,response_msg)
    Access-Control-Allow-Credentials true
    Access-Control-Allow-Origin *
    Content-Type text/html; charset=utf-8
    Date Mon, 01 Apr 2019 04:26:46 GMT
    Server nginx
    Set-Cookie stale_after=never; Path=/, last_nonce=d0d5882d37dcf4b76dee54e9c0d2bb5a; Path=/, fake=fake_value; Path=/
    WWW-Authenticate Digest realm="me@kennethreitz", nonce="3969731c4f2ce3545a8266fe7d41a67c", qop="auth", opaque="3f15a8256cb961c0e0add04854f1f15d", algorithm=MD5, stale=FALSE
    Content-Length 0
    Connection keep-alive
    >>> 
    input test 1: what the request message looks like
    (see the tcpdump screenshot in the original post; not reproduced here)
    '''

    1. The tcpdump capture shows that requests actually sends two requests: the first fetches the server's nonce, digest algorithm and related parameters, and only the second carries the username and password

    2. In the second request, the response field inside the Authorization header is the computed digest. See also OAuth 2.0: Bearer Token Usage (RFC 6750) for token-based authentication
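
    Token-based schemes such as OAuth 2.0 bearer tokens are simpler on the client side: the token just travels in the Authorization header. A minimal sketch; the token value is a made-up placeholder and httpbin.org is assumed as the target:

    import requests

    token = 'example-bearer-token'  # placeholder; a real token would come from the OAuth flow
    r = requests.get('https://httpbin.org/get', headers={'Authorization': 'Bearer ' + token})
    print(r.json()['headers']['Authorization'])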

prepared request

  • A Request can be built first and sent later, which makes it easier to queue and schedule requests

    from requests import Request, Session
    # the URL was stripped when the article was published; httpbin.org/post is assumed
    url='https://httpbin.org/post'
    data={'name':'wujun'
    }
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    s=Session()
    req=Request('POST',url,data=data,headers=headers)
    prepped=s.prepare_request(req)
    r=s.send(prepped)
    print(r.text)
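
    One benefit of the two-step flow is that the prepared object can be inspected or adjusted before it is sent; a small sketch continuing the code above:

    # prepped is a PreparedRequest: the final headers and encoded body are already materialized
    print(prepped.url)
    print(prepped.headers)
    print(prepped.body)                 # name=wujun
    prepped.headers['X-Debug'] = '1'    # tweak it just before sending
    r = s.send(prepped, timeout=5)
    print(r.status_code)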
    

3. Regular Expressions

Online regex testers are handy for trying these patterns out.

Pattern   Description
\w        matches a letter, digit or underscore
\W        matches any character that is not a letter, digit or underscore
\s        matches any whitespace character, equivalent to [ \t\n\r\f]
\S        matches any non-whitespace character
\d        matches any digit, equivalent to [0-9]
\D        matches any non-digit character
\A        matches the start of the string
\Z        matches the end of the string; if there is a trailing newline, it matches just before it
\z        matches the very end of the string, including a trailing newline
\G        matches the position where the last match finished
\n        matches a newline character
\t        matches a tab character
^         matches the start of a line
$         matches the end of a line
.         matches any character except a newline; with the re.DOTALL flag it matches newlines as well
[...]     a character set: matches any one of the listed characters
[^...]    matches any character not listed inside the brackets
*         matches 0 or more of the preceding expression
+         matches 1 or more of the preceding expression
?         matches 0 or 1 of the preceding expression, non-greedy
{n}       matches exactly n of the preceding expression
{n,m}     matches n to m of the preceding expression, greedy
a|b       matches a or b
( )       matches the expression inside the parentheses and also marks it as a group
  • match()

    Wrap the parts you want to capture in parentheses and read them back in order with group()

    import re
    content = 'Hello 1234567 World_tHIS is Regex Demo'
    result= re.match('^Hello\s(\d+)\s',content)
    print(result)
    print(result.group(1))
    print(result.span())
    # non-greedy mode 1: prints 1234567
    result=re.match('^Hello.*?(\d+).*Demo$',content)
    >>> print(result.group(1))
    1234567
    # non-greedy mode 2: prints '' (perhaps unexpected, but a trailing (.*?) matches as few characters as possible, i.e. nothing)
    result=re.match('^Hello.*Regex (.*?)',content)
    >>> print(result.group(1))

    # greedy mode: prints 7, because the greedy .* before (\d+) swallows 123456
    result=re.match('^Hello.*(\d+).*Demo$',content)
    >>> print(result.group(1))                         
    7
    # newlines: add the re.S flag, which makes . match every character including newlines
    content = '''Hello 1234567 World_tHIS 
    is Regex Demo'''
    result= re.match('^Hello\s(\d+)\s',content,re.S)
    >>> print(result.group(1))
    1234567
    # escaping: use a backslash "\" to match special characters literally
  • search()

    It scans the whole string and returns the first successful match.

    import re
    content = 'extra Hello 1234567 World_tHIS is Regex Demo'
    result= re.search('Hello\s(\d+)\s',content)
    >>> print(result.group(1))
    1234567
    
  • findall()

    Extracts every match; mind the difference between greedy and non-greedy patterns

    import re
    html='''
    <li data-view="5"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
    <li data-view="5"><a href="/4.mp3" 
    singer="beyond">光辉岁月</a></li>
    '''
    result= re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a></li>',html,re.S)
    for r in result:
        print(r[0],r[1],r[2])
    
  • sub()

    String replacement

    import re
    content='123wujun456'
    result=re.sub('\d+' , '' , content)
    >>> print(result)
    wujun
    
  • compile()

    Compiles a regex string into a pattern object so it can be reused in later matches

    import re
    content1 = '2019-12-15 12:00'
    content2 = '2019-12-16 12:00'
    content3 = '2019-12-17 12:00'
    pattern = re.compile(r'\d{2}:\d{2}', re.S)
    result1=re.sub(pattern ,'' , content1 )
    result2=re.sub(pattern ,'' , content2 )
    result3=re.sub(pattern ,'' , content3 )
    >>> print( result1 , result2 , result3)
    2019-12-15  2019-12-16  2019-12-17 
  • Scraping the Maoyan movies TOP 100 board

    import json
    import re

    import requests


    def write_to_file(content):
        with open('result.txt', 'a', encoding='utf-8') as f:
            f.write(json.dumps(content, ensure_ascii=False) + '\n')


    def get_one_page(url):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            print(response.status_code)
            return None
        return response.text


    def parse_one_page(html):
        pattern = re.compile(
            '<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?<a.*?>(.*?)</a>'
            '.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>'
            '.*?score.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>', re.S)
        items = re.findall(pattern, html)
        for item in items:
            yield {
                'index': item[0],
                'image': item[1],
                'title': item[2].strip(),
                'actor': item[3].strip()[3:] if len(item[3].strip()) > 3 else '',
                'time': item[4][5:] if len(item[4]) > 5 else '',
                'score': item[5] + item[6]
            }


    if __name__ == "__main__":
        # the board URL was stripped when the article was published; the Maoyan TOP100 board is assumed
        for pages in range(10):
            url = 'https://maoyan.com/board/4?offset=' + str(pages * 10)
            html = get_one_page(url)
            for content in parse_one_page(html):
                print(content)
                write_to_file(content)

4. XPath

  • A first XPath program
from lxml import etree
text='''
<div>
<ul>
<li class ="item-0"><a href="link1.html">first item</a></li>
<li class ="item-1"><a href="link2.html">second item</a></li>
<li class ="item-inactive"><a href="link3.html">third item</a></li>
<li class ="item-1"><a href="link4.html">fourth item</a></li>
<li class ="item-0"><a href="link5.html">程序</a>
<li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
</ul>
</div>
'''
html=etree.HTML(text)
# etree automatically fixes up the HTML (e.g. closes the unclosed <li> tags)
result=etree.tostring(html)
# convert bytes to str
print(result.decode('utf-8'))

### alternatively, parse an HTML file directly ('demo.html' is a placeholder file name)
html=etree.parse('demo.html', etree.HTMLParser())
result=etree.tostring(html)
print(result.decode('utf-8'))

### attribute matching: see the sketch below
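
A short sketch of attribute matching, reusing the text snippet defined above:

html = etree.HTML(text)
# li nodes whose class attribute is exactly "item-0"
print(html.xpath('//li[@class="item-0"]'))
# contains() handles elements that carry several classes, such as "item-3 item-4"
print(html.xpath('//li[contains(@class, "item-3")]/a/text()'))
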
  • Selecting by position

    from lxml import etree
    text='''
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    </div>
    '''
    html=etree.HTML(text)
    # the first li node
    html.xpath('//li[1]')
    # the last one
    html.xpath('//li[last()]')
    # nodes at positions less than 3
    html.xpath('//li[position()<3]')
    # the third li node from the end (last()-2); the second from the end would be last()-1
    html.xpath('//li[last()-2]')
  • Selecting with node axes

    from lxml import etree
    text='''
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    <ul>
    <li class ="a-item-0"><a href="link1.html">first item</a></li>
    </ul>
    <ul>
    <li class ="b-item-0"><a href="link1.html">first item</a></li>
    </ul>
    </div>
    '''
    html=etree.HTML(text)
    # all ancestor nodes
    html.xpath('//li[1]/ancestor::*')
    # the body ancestor node
    html.xpath('//li[1]/ancestor::body')
    # all attributes of the selected node
    html.xpath('//li[1]/attribute::*')
    # direct child <a> nodes whose href contains link1.html
    html.xpath('//li[1]/child::a[contains(@href , "link1.html")]')
    # all descendant nodes
    html.xpath('//li[1]/descendant::*')
    # all sibling nodes that follow the current node
    html.xpath('//li[1]/following-sibling::*')
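
    Beyond axes, the extraction steps most crawlers actually need are text() for text content and @attribute for attribute values; a short sketch reusing the html object above:

    # text of every <a> inside an <li>
    html.xpath('//li/a/text()')
    # value of the href attribute of every <a>
    html.xpath('//li/a/@href')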
    

5. Beautiful Soup

  • Basic usage

    text='''
    <html><head><title>The Dormouse's story </title></head>
    <body>	
    <p class = "title 1 2 3" name = "dromouse"> <b>The Dormouse's story</b></p>
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    <ul>
    <li class ="a-item-0"><a href="link1.html">first item</a></li>
    <ul>
    <li class ="b-item-0"><a href="link1.html">first item</a></li>
    </div>
    '''
    from bs4 import BeautifulSoup
    # use the lxml parser
    soup= BeautifulSoup(text,'lxml')
    # print the prettified (auto-corrected) HTML
    print(soup.prettify())
    # soup.title is of type Tag; string is one of its attributes
    print(type(soup.title))
    <class 'bs4.element.Tag'>
    # text of the li tag (only the first li is selected)
    print(soup.li.string)
    # without asking for a specific attribute you get the whole element
    print(soup.head)
    
  • Extracting information

    # node name
    >>> print(soup.head.name)
    head
    # attributes (attrs)
    >>> print(soup.p.attrs['name'])
    dromouse
    >>> print(soup.p['name'])      
    dromouse
    >>> print(soup.p['class'])
    ['title', '1', '2', '3']
    >>> 
    # getting the text content
    >>> print(soup.title.string)
    The Dormouse's story
    # nested selection
    >>> print(soup.p.b.string)  
    The Dormouse's story
    >>> 
    # child nodes (contents)
    >>> soup.div.contents
    ['\n', <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>, '\n', <ul>
    <li class="a-item-0"><a href="link1.html">first item</a></li>
    </ul>, '\n', <ul>
    <li class="b-item-0"><a href="link1.html">first item</a></li>
    </ul>, '\n']
    >>> soup.div.children
    <list_iterator object at 0x7f1fbcea9908>
    >>> for i , child  in enumerate(soup.div.children): 
    ...     print(i, child)
    ... 
    0 
    1 <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>
    2 
    3 <ul>
    <li class="a-item-0"><a href="link1.html">first item</a></li>
    </ul>
    4 
    5 <ul>
    <li class="b-item-0"><a href="link1.html">first item</a></li>
    </ul>
    6 
    # all descendant nodes
    >>> for i , child  in enumerate(soup.div.descendants):
    ...     print(i,child)
    # parent node: the parent of the first li
    >>> soup.li.parent
    <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>
    # all ancestor nodes
    >>> list(enumerate(soup.div.parents))
    # sibling nodes
    text='''
    <p>a<a>a</a>c<a></a>d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    >>> soup.a.previous_sibling
    'a'
    >>> soup.a.next_sibling
    'c'
    >>> list(enumerate(soup.a.previous_siblings))
    [(0, 'a')]
    >>> list(enumerate(soup.a.next_siblings))
    [(0, 'c'), (1, <a></a>), (2, 'd')]
    >>> 
    # extracting info from siblings and parents
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    soup.a.previous_sibling
    soup.a.next_sibling.string
    list(soup.a.parents)
    list(soup.a.parents)[0]
    list(soup.a.parents)[0].attrs['class']
  • find_all()

    # search by node name
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    print(soup.find_all(name='a'))
    print(type(soup.find_all(name='a')[0]))
    for a in soup.find_all(name='a'):
        print(a.string)
    # search by attribute
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    <p id = "1" class="12345">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    # print(soup.find_all(attrs={'class':'12345'}))  or equivalently:  print( soup.find_all(class_="12345") )
    print( soup.find_all(id="1") )
    print(type(soup.find_all(attrs={'class':'1234'})[0]))
    for a in soup.find_all(attrs={'class':'1234'}):
        print(a.string)
    # text: match the *text* of nodes with a regular expression
    text='''
    <p>
    Hello,this is link
    </p>
    <p>
    Hello,this is link,too
    </p>
    '''
    soup= BeautifulSoup(text,'lxml')
    import re
    print(soup.find_all(text=re.compile('link')))
    
  • find()

    Compared with find_all(), it returns only the first matching Tag instead of a list

  • Other methods

    Method                    What it returns
    find_parents              all ancestor nodes
    find_parent               the direct parent node
    find_next_siblings        all following sibling nodes
    find_next_sibling         the first following sibling node
    find_previous_siblings    all preceding sibling nodes
    find_previous_sibling     the first preceding sibling node
    find_all_next             all matching nodes after the current node
    find_next                 the first matching node after the current node
    find_all_previous         all matching nodes before the current node
    find_previous             the first matching node before the current node
  • CSS selectors

See a CSS selector reference (for example the W3Schools page on CSS selectors) for the full syntax.

# search by node
text='''
<div class ='panle'>
<div class = 'panle-heading' >
<p class="1234">a
<a>a1</a>
<a>a2</a>
d</p>
</div>
<div>
<ul class='ul-1'>
<li id = "item-1">test1</li>
<li id = "item-3">test2</li>
</ul>
<ul class='ul-2'>
<li id = "item-1">test1</li>
<li id = "item-3">test2</li>
</ul>
</div>
'''
from bs4 import BeautifulSoup
soup= BeautifulSoup(text,'lxml')
print(soup.select('.panle .panle-heading'))
print(soup.select('ul li'))
print(soup.select('.ul-1 #item-1'))
print(type(soup.select('ul')[0]))
print(soup.select('ul')[0])
>>> for ul in soup.select('ul'):
...     print( ul.select('li'))
... 
[<li id="item-1">test1</li>, <li id="item-3">test2</li>]
[<li id="item-1">test1</li>, <li id="item-3">test2</li>]
>>> print(soup.select('ul li')[0].get_text())
test1
>>> print(soup.select('ul li')[0].string)
test1
>>> 

6. pyquery

  • Initializing from a string

    text='''
    <div class ='panle'>
    <div class = 'panle-heading' >
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    </div>
    <div>
    <ul class='ul-1'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    <ul class='ul-2'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc=pq(text)
    >>> print(doc('li'))
    <li id="item-1">test1</li>
    <li id="item-3">test2</li>
    <li id="item-1">test1</li>
    <li id="item-3">test2</li>
    
  • Initializing from a URL

    from pyquery import PyQuery as pq
    # the URL was stripped when the article was published; the Sina homepage is assumed, judging from the output below
    >>> html=pq(url='https://www.sina.com.cn',encoding='utf-8')
    >>> print(html('title'))                                   
    <title>新浪首页</title>
    
  • Initializing from a file

    from pyquery import PyQuery as pq
    html=pq(filename='demo.html',encoding='utf-8') 
    
  • CSS

    text='''
    <div id='AAA' class ='panle'>
    <div class = 'panle-heading' >
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    </div>
    <div>
    <ul class='ul-1'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    <ul class='ul-2'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc=pq(text)
    >>> print(doc('.panle .panle-heading a')) 
    <a>a1</a>
    <a>a2</a>
    d
    >>> print(type(doc('.panle .panle-heading a')) )
    <class 'pyquery.pyquery.PyQuery'>
  • Finding nodes

    1. Child nodes: find() searches all descendants, children() returns only direct children

      # reuse the HTML text from above
      from pyquery import PyQuery as pq
      doc=pq(text)
      items=doc('.ul-1')
      >>> print(type(items))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(items)
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      >>> lis=items.find('li')
      >>> print(type(lis))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(lis)
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      >>> lis=items.children()
      >>> print(lis)
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      # filter by id
      >>> lis=items.children('#item-1')
      >>> print(lis)                   
      <li id="item-1">test1</li>
    2. Parent nodes: parent() returns the direct parent, parents() all ancestors

      # reuse the HTML text from above
      from pyquery import PyQuery as pq
      doc=pq(text)
      items=doc('.ul-1')
      container=items.parent()
      print(type(container))
      print(container)
      >>> items=doc('.ul-1')
      >>> container=items.parent()
      >>> print(type(container))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(container)
      <div>
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      <ul class="ul-2">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      </div>
      >>> container=items.parents('.panle')        
      >>> print(container)                 
      <div id="AAA" class="panle">
      <div class="panle-heading">
      <p class="1234">a
      <a>a1</a>
      <a>a2</a>
      d</p>
      </div>
      <div>
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      <ul class="ul-2">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      </div>
      </div>
    3. Sibling nodes

      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> li=doc( '#item-1')     
      >>> print(li)
      <li id="item-1">test1</li>
      <li id="item-1">test1</li>
      >>> print(li.siblings())
      <li id="item-3">test2</li>
      <li id="item-3">test2</li>
    4. Iterating over matches

      text='''
      <div class= "div0 div1">
      <li id="1" >li-1</li>
      <li>li-2</li>
      <li>li-3</li>
      <li>li-3</li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> lis=doc('li').items()
      >>> print(type(lis))     
      <class 'generator'>
      >>> for li in lis:
      ...     print(li,type(li))
      ... 
      <li id="1">li-1</li><class 'pyquery.pyquery.PyQuery'>
      <li>li-2</li><class 'pyquery.pyquery.PyQuery'>
      <li>li-3</li><class 'pyquery.pyquery.PyQuery'>
      <li>li-3</li><class 'pyquery.pyquery.PyQuery'>
    5. Getting attribute information

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> a=doc('li')
      >>> print(a , type(a))
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li><class 'pyquery.pyquery.PyQuery'>
      >>> print(a.attr('id'))
      1
      >>> print(a.attr.id)
      1
      # iterate over every match
      >>> a=doc('li').items()
      >>> for li in a:
      ...     print(li.attr.id)
      ... 
      1
      2
      3
      4
    6. Getting text

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      li_text=doc('li')
      >>> print(a,li_text.text())
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      li-1 li-2 li-3 li-4
      >>> print(a,li_text.html())
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      <span class="bold1">li-1</span>
      >>> text=li_text.items()       
      >>> for html in text:
      ...     print(html.html())
      ... 
      <span class="bold1">li-1</span>
      <span class="bold2">li-2</span>
      <span class="bold3">li-3</span>
      <span class="bold4">li-4</span>
    7. Manipulating nodes

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> li_text=doc('div')      
      >>> print(li_text)          
      <div class="div0 div1">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      >>> li_text.removeClass('div0')
      [<div.div1>]
      >>> print(li_text)             
      <div class="div1">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      >>> li_text.addClass('div2')   
      [<div.div1.div2>]
      >>> print(li_text)          
      <div class="div1 div2">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      >>> li_text=doc('#1')    
      >>> print(li_text)
      <li id="1"><span class="bold1">li-1</span></li>
      >>> print(li_text.attr('name','modify'))
      <li id="1" name="modify"><span class="bold1">li-1</span></li>
      >>> print(li_text.text('test modify'))  
      <li id="1" name="modify">test modify</li>
      >>> print(li_text.html('<b>AAA</b>'))     
      <li id="1" name="modify"><b>AAA</b></li>
      >>>
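
      pyquery can also delete nodes before extracting text with remove(), which helps when an element mixes wanted and unwanted content; a minimal sketch on a fresh, made-up snippet:

      from pyquery import PyQuery as pq
      html_snippet = '''
      <div class="wrap">Hello, World
      <p>This is a paragraph.</p>
      </div>
      '''
      doc2 = pq(html_snippet)
      wrap = doc2('.wrap')
      # remove() drops the matched nodes from the document tree
      wrap.find('p').remove()
      print(wrap.text())   # Hello, World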
