Selenium scraper detected when loading the first page

Updated: 2024-10-27 22:33:50
Problem description

I'm trying to scrape this site: www.zocdoc/

First I tried using the requests library and got this response from the site:

b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=20&xinfo=13-8874904-0%200NNN%20RT%281557792003687%20128%29%20q%280%20-1%20-1%20-1%29%20r%280%20-1%29%20B15%284%2c200%2c0%29%20U5&incident_id=787000970007113277-35368596172637725&edet=15&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 787000970007113277-35368596172637725</iframe></body></html>'

So I switched to Selenium, which usually works. I'm using this simple code to test it:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Path to the local chromedriver binary
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
url = "www.zocdoc/"
driver.get(url)

But this isn't working either. I'm getting this result:

How can the site detect the robot so quickly?

Solution

As the posted image suggests, the site is protected behind an Imperva WAF (Web Application Firewall) or a related product.
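The response body quoted in the question carries the same signal, so the check can also be done in code. The following is a minimal sketch, assuming the block page always contains the `_Incapsula_Resource` path or the "Incapsula incident ID" text seen in that one response; the function names are hypothetical and the marker list is not an official Imperva specification.

```python
import re

# Markers taken from the block page quoted in the question (an assumption:
# other Imperva block pages may look different).
INCAPSULA_MARKERS = ("_Incapsula_Resource", "Incapsula incident ID")
INCIDENT_RE = re.compile(r"Incapsula incident ID:\s*([0-9-]+)")


def _as_text(body):
    """Decode bytes to str so both response.content and response.text work."""
    if isinstance(body, bytes):
        return body.decode("utf-8", errors="replace")
    return body


def looks_like_incapsula_block(body):
    """Return True if the body contains one of the known block-page markers."""
    text = _as_text(body)
    return any(marker in text for marker in INCAPSULA_MARKERS)


def incident_id(body):
    """Return the incident ID string from a block page, or None if absent."""
    match = INCIDENT_RE.search(_as_text(body))
    return match.group(1) if match else None


# Shortened copy of the response shown in the question.
sample = (
    b'<iframe src="/_Incapsula_Resource?CWUDNSAI=20">Request unsuccessful. '
    b"Incapsula incident ID: 787000970007113277-35368596172637725</iframe>"
)
print(looks_like_incapsula_block(sample))
print(incident_id(sample))
```

With the requests library this could be called as `looks_like_incapsula_block(response.content)` right after the request, before any parsing is attempted.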

If you ping the site you'll see that all requests go through addresses related to Imperva.

ping www.zocdoc

Pinging ux639.x.incapdns [45.60.62.232] with 32 bytes of data:
Reply from 45.60.62.232: bytes=32 time=46ms TTL=59
Reply from 45.60.62.232: bytes=32 time=47ms TTL=59
Reply from 45.60.62.232: bytes=32 time=46ms TTL=59
Reply from 45.60.62.232: bytes=32 time=46ms TTL=59

As you can see, www.zocdoc resolves to a host in the incapdns namespace, which, according to WHOIS, is owned by Imperva Inc.
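The same lookup can be done from Python instead of ping. This is only a sketch: `is_incapdns_name` and `canonical_name` are hypothetical helper names, and it assumes the resolver reports the canonical name as the first element returned by `socket.gethostbyname_ex`. The check matches the `incapdns` label on its own, because the ping output above prints the name without its top-level domain.

```python
import socket


def is_incapdns_name(hostname):
    """Return True if any DNS label of hostname is exactly 'incapdns'."""
    labels = hostname.lower().rstrip(".").split(".")
    return "incapdns" in labels


def canonical_name(hostname):
    """Resolve hostname and return the primary name reported by the resolver.

    Needs network access, so it is not called in the example below.
    """
    name, _aliases, _addresses = socket.gethostbyname_ex(hostname)
    return name


# Name copied from the ping output above.
print(is_incapdns_name("ux639.x.incapdns"))
```

For a live check, pass the full hostname of the site to `canonical_name` and feed the result to `is_incapdns_name`.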

As for how the detection works, I believe that has been covered in the following post: "Can a website detect when you are using selenium with chromedriver?"

Published: 2023-06-13 05:21:21
Article link: https://www.elefans.com/category/jswz/34/674846.html