Extracting values scoped inside a Nokogiri each block [closed]

Updated: 2024-10-28 08:21:16

I'm trying to write a function that scrapes actors' filmographies from Wikipedia pages. Here is an example of the code:

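require 'nokogiri'
require 'open-uri'

# URI.open replaces the bare open(url) call, which no longer fetches URLs on Ruby 3.x
doca = Nokogiri::HTML(URI.open("http://en.wikipedia.org/wiki/Kevin_Bacon"))
grandparent = doca.xpath('//div[@id="mw-content-text"]').children
child = []
grandparent.each { |node|
  node.children.each { |x|
    if x['id'] == "Films"
      # the table follows the heading that contains the "Films" anchor
      child = node.next_element.children
      break
    end
  }
}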

Each element of the child array now contains one row of the filmography table. What I really want is to save the href link for each film into an array, but I'm having trouble accessing the links because they are deeply nested within each row. Any help is greatly appreciated.

Best answer

How about:

doca.xpath('//div[@id="mw-content-text"]/table//td[2]//i/a').map { |a| a['href'] }

That selects every link (a) that is a direct child of an italic element (i) at any depth inside the second cell (td[2]) of a row, in a table directly inside the div with id mw-content-text, then maps each link to its href attribute (i.e. its link value). You could be more specific, depending on what you want to include or exclude.

If you want the links to be absolute rather than relative, you can merge each link value with the page URL:

require 'uri'

url = "http://en.wikipedia.org/wiki/Kevin_Bacon"
doca.xpath('//div[@id="mw-content-text"]/table//td[2]//a').map { |a| URI(url).merge(a['href']) }
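As a rough illustration of how the merge resolves different kinds of link values, the sketch below uses only the standard uri library. The sample href values are hypothetical, not taken from the page. Note that merge returns URI objects, so to_s is called to get plain strings.

```ruby
require 'uri'

base = URI("http://en.wikipedia.org/wiki/Kevin_Bacon")

# Hypothetical href values of the kinds a page may contain.
samples = ["/wiki/Footloose", "#Films", "https://example.org/other"]

# A relative path replaces the base path, a fragment is appended to the
# base URL, and an already absolute URL is returned unchanged.
absolute = samples.map { |href| base.merge(href).to_s }
puts absolute
```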

UPDATE:

Alternatively, if you want to search for the links the way you described, you could do this:

doca.xpath('//div[@id="mw-content-text"]//table[preceding-sibling::*[1][span[@id="Films"]]]//a').map { |a| a['href'] }

This says: inside the div with id mw-content-text, find all links that are descendants of a table whose immediately preceding sibling element has a direct child span with id "Films". Somewhat more complicated.

Published: 2023-07-22 00:57:00
Article link: https://www.elefans.com/category/jswz/34/1215488.html