How to protect a website from bulk scraping / downloading? [duplicate]

This question already has an answer here:

Top techniques to avoid 'data scraping' from a website database (14 answers)

I have a LAMP server where I run a website, which I want to protect against bulk scraping / downloading. I know that there is no perfect solution for this and that an attacker will always find a way. But I would like to have at least some "protection" that makes stealing the data harder than having nothing at all.

This website has roughly 5,000 subpages with valuable text data and a couple of pictures on each page. I would like to be able to analyze incoming HTTP requests on the fly, and if there is suspicious activity (e.g. tens of requests within one minute from a single IP), automatically blacklist that IP address from further access to the site.
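As a rough sketch of the per-IP rate limit described above (not part of the original question), something like the following could run at the top of every PHP page on a LAMP server, assuming the APCu extension is enabled; the window, limit, ban time and cache key names are arbitrary placeholders:

```php
<?php
// Per-IP fixed-window rate limiter (sketch). Assumes the APCu extension is
// enabled; include this before any page output, e.g. via auto_prepend_file.

$ip      = $_SERVER['REMOTE_ADDR'];
$window  = 60;    // counting window in seconds
$limit   = 30;    // requests allowed per window before blacklisting
$banTime = 3600;  // how long a blacklisted IP stays blocked, in seconds

// Reject requests from IPs that are already blacklisted.
if (apcu_fetch("ban:$ip")) {
    http_response_code(429);
    exit('Too many requests.');
}

// Create the per-IP counter with a TTL if it does not exist yet, then bump it.
apcu_add("cnt:$ip", 0, $window);
$count = apcu_inc("cnt:$ip");

// Over the limit: blacklist the IP and reject this request as well.
if ($count > $limit) {
    apcu_store("ban:$ip", true, $banTime);
    http_response_code(429);
    exit('Too many requests.');
}
```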

I fully realize that what I am asking for has many flaws, but I am not looking for a bullet-proof solution, just a way to stop script kiddies from "playing" with easily scraped data.

Thank you for your on-topic answers and possible solution ideas.

Accepted answer

Sorry - but I'm not aware of any anti-leeching code available off-the-shelf which does a good job.

How do you limit access without placing burdens on legitimate users, and without providing a mechanism for DoSing your site? As with spam prevention, the best solution is to use several approaches and maintain a "badness" score.

You've already mentioned looking at the rate of requests - but bear in mind that increasingly users will be connecting from behind NAT (e.g. IPv6 PoPs), so many legitimate users may share one IP. A better approach is to check per session - you don't need to require your users to register and log in (although OpenID makes this a lot simpler), but you could redirect them to a defined starting point whenever they make a request without a current session and log them in with no username/password. Checking the referer (and that the referer really does point to the current content item) is a good idea too, as is tracking 404 rates. Use road blocks: when the score exceeds a threshold, redirect to a captcha or require a login. The user agent can also be indicative of an attack, but it should feed into the scoring mechanism, not act as a yes/no criterion for blocking.
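As a rough illustration of the scoring idea (not the answerer's actual code), a PHP sketch combining these signals might look like the following; the weights, the threshold of 10 and the /challenge.php road-block URL are arbitrary placeholders:

```php
<?php
// "Badness" scoring sketch: several weak signals feed one per-session score,
// and only the combined score triggers the road block. Weights, the threshold
// and /challenge.php are made-up values for illustration.

session_start();
$score = $_SESSION['badness'] ?? 0;

// Missing referer, or a referer pointing at another site, is mildly suspicious.
$referer = $_SERVER['HTTP_REFERER'] ?? '';
if ($referer === '' || parse_url($referer, PHP_URL_HOST) !== ($_SERVER['HTTP_HOST'] ?? '')) {
    $score += 1;
}

// Empty or obviously script-like user agents score higher, but never block on their own.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if ($ua === '' || preg_match('/curl|wget|python|scrapy/i', $ua)) {
    $score += 2;
}

// A 404 handler could add to the same score, e.g.: $_SESSION['badness'] += 3;

$_SESSION['badness'] = $score;

// Road block: above the threshold, divert to a captcha/login page instead of the content.
if ($score > 10 && basename($_SERVER['SCRIPT_NAME']) !== 'challenge.php') {
    header('Location: /challenge.php');
    exit;
}
```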

Another approach, rather than interrupting the flow, is to start substituting the content once the thresholds are triggered. You can do the same when the same external host keeps appearing in your referer headers.
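A minimal sketch of the content-substitution variant, reusing the session score from the previous sketch; get_article() and get_decoy_article() are hypothetical helpers standing in for however the site actually loads its content:

```php
<?php
// Content substitution sketch: past a higher threshold, serve decoy text
// instead of the real article rather than blocking outright.
// get_article() and get_decoy_article() are hypothetical helpers.

session_start();
$articleId = (int)($_GET['id'] ?? 0);

if (($_SESSION['badness'] ?? 0) > 20) {
    echo get_decoy_article($articleId);  // low-value placeholder text for scrapers
} else {
    echo get_article($articleId);        // the real content
}
```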

Do not tar-pit connections unless you've got a lot of resources server-side!
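For completeness, tar-pitting would amount to something like the sketch below; each delayed request keeps a PHP worker occupied for the whole delay, which is exactly why the answer warns against it unless you have spare server-side capacity (the threshold and delay are arbitrary):

```php
<?php
// Tar-pit sketch: deliberately delay suspicious sessions before serving anything.
// Each delayed request holds a PHP worker for the full sleep, so this is only
// viable with plenty of spare server-side resources.

session_start();
if (($_SESSION['badness'] ?? 0) > 15) {
    sleep(10); // arbitrary 10-second delay
}
```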
