I'm trying to scrape a site (https://shop.advanceautoparts.com/), and for the past couple of weeks I could access it normally through CasperJS. When I try now (as of about 2 days ago), I get an odd message saying that the website is offline:
When I open it in a normal browser or PhantomJS, I get the normal site. I've tried running it from different computers, changing my IP, and changing the user agent, but nothing works.
EDIT
After trying the same thing in PhantomJS, I got the same message once I had run the code about 5 times. Is this something the site is doing to prevent scraping?
Best answer
I suspect the site knows you are scraping based on your user agent, since you are hitting it multiple times.
Maybe try randomising your user agent and see what happens. (see list here)
var casper = require('casper').create({ pageSettings: { userAgent: "USE SOME OTHER USER AGENT HERE" } });
However, the site might also be blocking by IP address after a number of simultaneous requests. So also try a) slowing down your script or b) navigating to different pages.
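A rough sketch of both suggestions (a random user agent per run, plus a random pause between steps). The helper names `pickUserAgent` and `randomDelay` are made up for illustration, and the user agent strings are placeholders rather than a vetted list, so swap in your own. No guarantee this gets past whatever the site is doing.

```javascript
// Placeholder user agent strings (assumption: replace with a list you trust).
var USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30',
  'Mozilla/5.0 (X11; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0'
];

// Hypothetical helper: return one entry of `list` at random.
// `rnd` is optional and defaults to Math.random.
function pickUserAgent(list, rnd) {
  var r = (rnd || Math.random)();
  return list[Math.floor(r * list.length)];
}

// Hypothetical helper: random whole number of milliseconds in [minMs, maxMs].
function randomDelay(minMs, maxMs, rnd) {
  var r = (rnd || Math.random)();
  return minMs + Math.floor(r * (maxMs - minMs + 1));
}

// Usage inside a CasperJS script (this part does not run under plain Node):
//   var casper = require('casper').create({
//     pageSettings: { userAgent: pickUserAgent(USER_AGENTS) }
//   });
//   casper.start('https://shop.advanceautoparts.com/');
//   casper.wait(randomDelay(2000, 5000));  // pause 2-5 s before the next step
//   casper.run();
```

The pause range of 2 to 5 seconds is a guess; if the block is rate-based you may need to go slower.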
EDIT
I have knocked together a testing script and all works for me. The important bit is:
casper.waitUntilVisible("#header-top", function() {
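For context, here is one way that call could sit in a complete script. This is only a sketch and not the actual test script mentioned above: the wrapper `checkHeader`, the 10 second timeout, and the `blocked.png` filename are all assumptions. It waits for `#header-top` and, if the element never shows up (as you would expect on the "offline" page), saves a screenshot so you can see what was served.

```javascript
// Hypothetical wrapper; `casper` is an instance from require('casper').create(...).
function checkHeader(casper, url, timeoutMs) {
  casper.start(url);
  casper.waitUntilVisible('#header-top', function () {
    // The header is visible, so the real page loaded.
    this.echo('OK: ' + this.getTitle());
  }, function () {
    // Timed out: the header never appeared. Keep a screenshot for inspection.
    this.echo('Timed out waiting for #header-top');
    this.capture('blocked.png');
  }, timeoutMs);
  casper.run();
}

// In a CasperJS script:
//   checkHeader(require('casper').create(), 'https://shop.advanceautoparts.com/', 10000);
```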
HTH