Experiment 1: Domain Information Collection Tool
1. Experiment Process
1.1 Preparation Before Requests
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def baidu_search(url):
    global page  # global variable tracking the current page number
    Subdomain = []  # empty list for storing the collected subdomains
    # Request headers, used to get past the anti-crawling checks
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'referer': 'https://www.baidu.com/',
        'cookie': 'BAIDUID=46000A409626DC9216D4FC620835FE6D:FG=1; BIDUPSID=891607BCB0E6B201975FE9BF1A7BCEE8; PSTM=1654070845; COOKIE_SESSION=91460_0_5_5_1_22_1_0_5_5_1_2_0_0_0_0_1661996381_0_1662087822%7C6%230_0_1662087822%7C1; ZFY=CtmtdbRHWfJVCg1pX5WHvhMPzw5eY6eOTdNdsOuDU1M:C; BDUSS=9BV0VpTWJBdDBRZ3gwTmE3OW5KQ1l0Q1d3b01udVBBejkwSnQ0Ry1FU0Ytc1JpSVFBQUFBJCQAAAAAAAAAAAEAAABOVu43AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIVtnWKFbZ1iWm; baikeVisitId=738da5b5-d3dc-4558-a511-37b5c8f77d15; Hm_lvt_aec699bb6442ba076c8981c6dc490771=1661996364; BDRCVFR[gltLrB7qNCt]=mk3SLVN4HKm; delPer=0; BD_CK_SAM=1; PSINO=2; H_PS_PSSID=36544_37114_37353_37299_36885_34812_37397_37258_26350_37384_22160; BD_UPN=13314752; H_PS_645EC=0ca8xpSR3xDvSnc9On3ivRZZP4rpC8Xw1fFCjFb9MDPnLuUFEVsn19Blj9oBNkv%2FRcwy; BA_HECTOR=810k000l01010l0085a43pi61higjcj19; BDORZ=FFFB88E999055A3F8A630C64834BD6D0',
        'Host': 'www.baidu.com',
        'Connection': 'close',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }
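Before wiring up the rest of the crawler, a quick sanity check helps confirm that these Firefox headers get a normal response. This is just a sketch (the test keyword is arbitrary, and it assumes the headers dict defined above is in scope):

    import requests
    # Assumes the `headers` dict defined above; the query keyword is arbitrary.
    test = requests.get('https://www.baidu.com/s?wd=test', headers=headers)
    print(test.status_code)  # expect 200 if the request is not blocked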
The request headers caused a lot of trouble here. At first, using Chrome/Edge request headers, every run ended in RemoteConnectionError. Searching around turned up the following causes for this error:
- The number of HTTP connections exceeds the maximum limit. Connections are keep-alive by default, so the server ends up holding too many connections and cannot open new ones.
- The IP has been banned.
- The program is sending requests too fast.
Suggested solutions (a combined sketch of these follows after the list):
- Don't use persistent connections in the headers:
    'Connection': 'close'
  or
    requests.adapters.DEFAULT_RETRIES = 5
- Rotate the User-Agent at random:
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; …) Gecko/20100101 Firefox/61.0",
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
        "Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.5; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15",
    ]
    headers['User-Agent'] = random.choice(user_agent_list)
- Slow down the request rate:
    time.sleep(5)
- Use a proxy IP:
    self.proxies = {"http": ip, "https": ip}
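For reference, these mitigations can be folded into a single request helper. This is only a sketch, not the code used in the experiment; the helper name `throttled_get`, the retry count, the delay, and the trimmed User-Agent pool are all arbitrary choices:

    import random
    import time
    import requests

    requests.adapters.DEFAULT_RETRIES = 5  # retry a few times before giving up

    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36",
    ]

    def throttled_get(url):  # hypothetical helper combining the mitigations above
        headers = {
            'User-Agent': random.choice(user_agents),  # rotate the UA per request
            'Connection': 'close',                     # avoid piling up keep-alive connections
        }
        time.sleep(5)  # slow down the request rate
        return requests.get(url, headers=headers, timeout=10)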
I tried all of these without success. In the end, after talking with a classmate, I switched to Firefox request headers and it worked, though I have no idea why :(
1.2 Crawling Subdomains
    resp = requests.get(url, headers=headers)  # request the url and fetch the page source
    print('-----Now on page %d:------' % page)
    # Build a BeautifulSoup object; the first argument is the page source,
    # the second is the HTML parser Beautiful Soup should use
    soup = BeautifulSoup(resp.content, 'html.parser')
    job_bt = soup.find_all('h3')  # find_all() collects every <h3> tag in the source
    for i in job_bt:
        link = i.a.get('href')  # extract the 'href' of each result link
        try:
            # follow Baidu's redirect link to recover the real target URL
            response = requests.get(link, headers=headers)
            link = response.url
        except requests.RequestException:
            continue
        # urlparse parses a URL: .scheme is the protocol name, .netloc the network location
        domain = str(urlparse(link).scheme + "://" + urlparse(link).netloc)
        if domain in Subdomain:  # skip domains that are already in Subdomain
            pass
        else:  # otherwise record the new domain in the subdomain list
            Subdomain.append(domain)
            print(domain)
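As a quick illustration of what the urlparse() call above returns (the example URL here is made up):

    from urllib.parse import urlparse

    parts = urlparse('https://news.baidu.com/world?id=1')  # hypothetical URL
    print(parts.scheme)  # https
    print(parts.netloc)  # news.baidu.com
    print(parts.scheme + "://" + parts.netloc)  # https://news.baidu.com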
Inspecting the page elements shows that each result title in Baidu is rendered inside an <h3> tag.
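For example, a single result entry looks roughly like the following (simplified markup reconstructed from memory of the page source, so treat it as an assumption):

    from bs4 import BeautifulSoup

    # Simplified stand-in for one Baidu result; the real markup carries more attributes.
    html = '<h3 class="t"><a href="http://www.baidu.com/link?url=abc123">Some title</a></h3>'
    soup = BeautifulSoup(html, 'html.parser')
    for h3 in soup.find_all('h3'):
        print(h3.a.get('href'))  # prints the redirect link wrapped inside the title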
1.3 Pagination
    # only crawl the first ten pages
    if page < 10:
        try:
            page += 1  # advance the page counter
            nextPage = soup.find("a", {"class": "n"})  # find the element with class="n"
            nextUrl = "https://www.baidu.com" + nextPage.get('href')
            print('Turning to the next page: %s' % nextUrl)
            baidu_search(nextUrl)  # recurse into baidu_search() for the next page
        except AttributeError:  # nextPage is None when there is no "next page" link
            pass
In the page source you can see that the "next page" link has class="n", and its href carries the URL of the next page.
The output shows that all ten pages were crawled.
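Putting it all together, the script can be started like this (a sketch; the page counter initialization and the site: query are my assumptions about how it was run):

    if __name__ == '__main__':
        page = 1  # global page counter used inside baidu_search()
        # Hypothetical starting query; searching "site:<domain>" surfaces subdomains.
        baidu_search('https://www.baidu.com/s?wd=site%3Abaidu.com')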