I'm half-tempted to write my own, but I don't really have enough time right now. I've seen the Wikipedia list of open source crawlers but I'd prefer something written in Python. I realize that I could probably just use one of the tools on the Wikipedia page and wrap it in Python. I might end up doing that - if anyone has any advice about any of those tools, I'm open to hearing about them. I've used Heritrix via its web interface and I found it to be quite cumbersome. I definitely won't be using a browser API for my upcoming project.
Thanks in advance. Also, this is my first SO question!
Recommended answer
- Mechanize is my favorite; great high-level browsing capabilities (super-simple form filling and submission).
- Twill is a simple scripting language built on top of Mechanize.
- BeautifulSoup + urllib2 also works quite nicely.
- Scrapy looks like an extremely promising project; it's new.
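
If you do end up rolling your own, the core of a crawler is small. Here is a minimal sketch, not taken from any of the libraries above: it uses only the Python 3 standard library (`html.parser` standing in for BeautifulSoup, and `urllib.request`, the Python 3 successor to urllib2). The names `LinkParser`, `extract_links`, `fetch`, and `crawl` are made up for illustration, and it assumes you don't need robots.txt handling, rate limiting, or retries.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs found in html, in document order."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href))[0] for href in parser.links]


def fetch(url):
    """Download a page as text. Assumption: UTF-8 pages, no retries."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl from start_url; returns visited URLs in order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        visited.append(url)
        for link in extract_links(url, html):
            # Only follow http(s) links we have not queued before.
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Passing the downloader in as `fetch_page` means you can try the crawl logic against a dict of canned pages before pointing it at a real site, e.g. `crawl("http://example.test/", fetch_page=lambda u: pages[u])`.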