Getting nodes from an HTML page using HtmlAgilityPack

Updated: 2024-10-17 02:54:00

My program collects information about Steam users' profiles (games, badges, and so on). I use HtmlAgilityPack to collect the data from the HTML pages, and so far it has worked well for me.

The problem is that it works on some pages, but on others it returns null nodes or throws an exception:

object reference not set to an instance of an object

Here's an example.

This part works well (when I'm getting badges):

    using System.Net;        // WebClient
    using HtmlAgilityPack;   // HtmlDocument, HtmlNodeCollection

    WebClient client = new WebClient();
    string html = client.DownloadString("http://steamcommunity.com/profiles/*id*/badges/");
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//div[@class=\"badge_row is_link\"]");

This returns the exact number of badges, and then I can do whatever I want with them.

But in this one I do exactly the same thing (only for games), and it keeps throwing the error I mentioned above:

    WebClient client = new WebClient();
    string html = client.DownloadString("http://steamcommunity.com/profiles/*id*/games/?tab=all");
    var doc = new HtmlDocument();
    doc.LoadHtml(html);
    HtmlNodeCollection div = doc.DocumentNode.SelectNodes("//*[@id='game_33120']");

I know the node is on the page (I checked in Google Chrome's code view), so I don't understand why it works in the first case but not in the second.

Best answer

When you right-click the page and choose View Source, do you still see an element with id='game_33120'? My guess is that you won't, because the page is being built dynamically on the client side. The HTML that comes down in the response therefore doesn't contain the element you're looking for; it only appears once the JavaScript code has run in the browser. In that case SelectNodes finds nothing to match and returns null, which would explain the exception.
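
If you want to make the View Source check from code, a rough sketch like the one below could do it. It only runs a substring test on the raw markup (it is not an HTML parse), and the names RawHtmlCheck and MentionsId are made up for this illustration. The sample strings stand in for whatever DownloadString returns.

```csharp
using System;

static class RawHtmlCheck
{
    // Returns true when the raw markup contains id="<id>" or id='<id>'.
    // This is a crude substring test, good enough to tell whether the
    // server-sent HTML mentions the element at all.
    public static bool MentionsId(string html, string id)
    {
        return html.IndexOf("id=\"" + id + "\"", StringComparison.OrdinalIgnoreCase) >= 0
            || html.IndexOf("id='" + id + "'", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    static void Main()
    {
        // Stand-ins for downloaded pages (hypothetical markup, not Steam's actual HTML).
        string serverRendered = "<div id=\"game_33120\">...</div>";
        string clientRendered = "<div></div><script>/* rows are built later */</script>";

        Console.WriteLine(MentionsId(serverRendered, "game_33120"));
        Console.WriteLine(MentionsId(clientRendered, "game_33120"));
    }
}
```

If the check comes back false for the downloaded string, no XPath will find the element, so it is worth testing the result of SelectNodes for null before using it.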

It appears that the original response includes a section of JavaScript that defines a variable called rgGames, which is a JavaScript array of the games that will be rendered on the screen. You should be able to extract the information from that.
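
One possible way to pull that array out is sketched below. It rests on several assumptions: the page contains a statement of the form "var rgGames = [ ... ];" on a single line, the array is valid JSON, and each entry has a numeric "appid" and a string "name" field (check the real page source for the actual field names). The class and method names are hypothetical. System.Text.Json ships with .NET Core 3.0 and later; on .NET Framework it comes as a NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

static class RgGamesSketch
{
    // Assumption: "rgGames = [ ... ];" sits on one line, so a greedy match
    // up to the last "];" on that line captures the whole array.
    static readonly Regex RgGamesPattern = new Regex(@"rgGames\s*=\s*(\[.*\]);");

    // Returns (appid, name) pairs, or an empty list when no rgGames array is found.
    public static List<KeyValuePair<int, string>> ExtractGames(string html)
    {
        var games = new List<KeyValuePair<int, string>>();
        Match match = RgGamesPattern.Match(html);
        if (!match.Success)
        {
            return games;
        }

        using (JsonDocument json = JsonDocument.Parse(match.Groups[1].Value))
        {
            foreach (JsonElement game in json.RootElement.EnumerateArray())
            {
                // "appid" and "name" are assumed field names; entries without them are skipped.
                if (game.TryGetProperty("appid", out JsonElement appId) &&
                    game.TryGetProperty("name", out JsonElement name))
                {
                    games.Add(new KeyValuePair<int, string>(appId.GetInt32(), name.GetString()));
                }
            }
        }
        return games;
    }

    static void Main()
    {
        // Stand-in for the string returned by DownloadString.
        string html = "<script>var rgGames = [{\"appid\":33120,\"name\":\"Example Game\"}];</script>";
        foreach (KeyValuePair<int, string> game in ExtractGames(html))
        {
            Console.WriteLine(game.Key + ": " + game.Value);
        }
    }
}
```

A regular expression is a fragile way to read JavaScript, so treat this as a starting point: if the array spans several lines or the page layout changes, the pattern will need adjusting.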
