R data scraping / crawling with dynamic/multiple URLs


I am trying to get all decrees of the Federal Supreme Court of Switzerland, available at: https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=simple_query&query_words=&lang=de&top_subcollection_aza=all&from_date=&to_date=&x=12&y=12
Unfortunately, no API is provided. The CSS selector for the data I want to retrieve is .para.

I am aware of http://relevancy.bger.ch/robots.txt.

User-agent: *
Disallow: /javascript
Disallow: /css
Disallow: /hashtables
Disallow: /stylesheets
Disallow: /img
Disallow: /php/jurivoc
Disallow: /php/taf
Disallow: /php/azabvger
Sitemap: http://relevancy.bger.ch/sitemaps/sitemapindex.xml
Crawl-delay: 2

To me it looks like the URL I am looking at is allowed to be crawled; is that correct? In any case, the Federal Court explains that these rules are aimed at big search engines and that individual crawling is tolerated.
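If you want to verify that reading programmatically, here is a minimal sketch assuming the robotstxt package (it is not mentioned in the question or the answer below); it checks the host you actually plan to request:

# a minimal sketch, assuming the robotstxt package is installed;
# paths_allowed() returns TRUE when no Disallow rule for bot "*" matches the path
library(robotstxt)

paths_allowed(
  paths  = "/ext/eurospider/live/de/php/aza/http/index.php",
  domain = "www.bger.ch",
  bot    = "*"
)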

I can retrieve the data for a single decree (using https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/)

library(rvest)  # needed for read_html(), html_nodes(), html_text()

url <- 'https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&page=1&from_date=&to_date=&sort=relevance&insertion_date=&top_subcollection_aza=all&query_words=&rank=1&azaclir=aza&highlight_docid=aza%3A%2F%2F18-12-2017-6B_790-2017&number_of_ranks=113971'
webpage <- read_html(url)
decree_html <- html_nodes(webpage, '.para')
rank_data <- html_text(decree_html)
decree1_data <- html_text(decree_html)

However, since rvest extracts data from only one specific page and my data is on multiple pages, I tried Rcrawler (https://github.com/salimk/Rcrawler), but I do not know how to crawl the site structure on www.bger.ch to get all URLs of the decrees.

I checked out the following posts, but still could not find a solution:

R web scraping across multiple pages

Rvest: Scrape multiple URLs

Best answer


I don't do error handling below since that's beyond the scope of this question.

Let's start with the usual suspects:

library(rvest)
library(httr)
library(tidyverse)

We'll define a function that will get us a page of search results by page number. I've hard-coded the search parameters since you provided the URL.

In this function, we:

- get the page HTML
- get the links to the documents we want to scrape
- get document metadata
- make a data frame
- add attributes to the data frame for the page number grabbed and whether there are more pages to grab

It's pretty straightforward:

get_page <- function(page_num=1) {

  GET(
    url = "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php",
    query = list(
      type="simple_query",
      lang="de",
      top_subcollection_aza="all",
      query_words="",
      from_date="",
      to_date="",
      x="12",
      y="12",
      page=page_num
    )
  ) -> res

  warn_for_status(res) # should be "stop" and you should do error handling

  pg <- content(res)

  links <- html_nodes(pg, "div.ranklist_content ol li")

  data_frame(
    link = html_attr(html_nodes(links, "a"), "href"),
    title = html_text(html_nodes(links, "a"), trim=TRUE),
    court = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'court')]"), trim=TRUE),
    # these are "dangerous" if they aren't there but you can wrap error handling around this
    subject = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'subject')]"), trim=TRUE),
    object = html_text(html_nodes(links, xpath=".//a/../../div/div[contains(@class, 'object')]"), trim=TRUE)
  ) -> xdf

  # this looks for the link in the bottom paginator; if there's no link then we're done
  attr(xdf, "page") <- page_num
  attr(xdf, "has_next") <- html_node(pg, xpath="boolean(.//a[contains(., 'Vorwärts')])")

  xdf
}

Make a helper function since I can't stand typing attr(...) and it reads better in use:

has_next <- function(x) { attr(x, "has_next") }
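Before looping, a quick smoke test of both pieces can help (the object name below is just illustrative; the site returns 10 hits per result page here, which is why 6 pages yield the 60 rows shown further down):

# fetch the first page of results and inspect it
first_page <- get_page(1)

nrow(first_page)          # rows scraped from page 1
attr(first_page, "page")  # 1
has_next(first_page)      # TRUE as long as a "Vorwärts" link is on the page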

Now, make a scraping loop. I stop at 6 just b/c I don't need everything; you should remove that logic if you want to scrape it all. Consider doing this in batches since internet connections are unstable things (see the checkpoint sketch after the loop):

pg_num <- 0
all_links <- list()

repeat {
  cat(".") # poor dude's progress bar
  pg_num <- pg_num + 1
  pg_df <- get_page(pg_num)
  all_links <- append(all_links, list(pg_df)) # append before testing has_next so the final page isn't dropped
  if (!has_next(pg_df)) break
  if (pg_num == 6) break # this is here for me since I don't need ~11,000 documents
  Sys.sleep(2) # robots.txt crawl delay
}
cat("\n")
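If you take the batching advice, one way to make the crawl restartable is to checkpoint each page to disk as it arrives. This is only a sketch; the pages/ directory and file naming are my own assumptions, not part of the original answer:

# write every page's data frame to its own RDS file so a dropped connection
# only costs the page in flight, not the whole crawl
dir.create("pages", showWarnings = FALSE)

pg_num <- 0
repeat {
  pg_num <- pg_num + 1
  pg_df  <- get_page(pg_num)
  saveRDS(pg_df, file.path("pages", sprintf("page_%05d.rds", pg_num)))
  if (!has_next(pg_df)) break
  Sys.sleep(2) # robots.txt crawl delay
}

# reassemble the list later from whatever made it to disk
all_links <- lapply(list.files("pages", full.names = TRUE), readRDS)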

Turn the list of data frames into one big one. NOTE: You should do validity tests before this since web scraping is fraught with peril. You should also save off this data frame to an RDS file so you don't have to do it again.

lots_of_links <- bind_rows(all_links)

glimpse(lots_of_links)
## Observations: 60
## Variables: 5
## $ link    <chr> "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&...
## $ title   <chr> "18.12.2017 6B 790/2017", "14.12.2017 6G 2/2017", "13.12.2017 5A 975/2017", "13.12.2017 5D 257/2017", "...
## $ court   <chr> "Strafrechtliche Abteilung", "Cour de droit pénal", "II. zivilrechtliche Abteilung", "II. zivilrechtlic...
## $ subject <chr> "Straf- und Massnahmenvollzug", "Procédure pénale", "Familienrecht", "Schuldbetreibungs- und Konkursrec...
## $ object  <chr> "Bedingte Entlassung aus der Verwahrung, Beschleunigungsgebot", "Demande d'interprétation et de rectifi...
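The RDS save suggested above is a one-liner (the file name here is arbitrary):

# persist the combined link table so a later session can skip the crawl
saveRDS(lots_of_links, "bger_links.rds")
lots_of_links <- readRDS("bger_links.rds") # reload in a fresh session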

With all the links in hand, we'll get the documents.

Define a helper function. NOTE: we aren't parsing here; do that separately. We'll store the HTML text of the inner content <div> so you can parse it later.

get_documents <- function(urls) {
  map_chr(urls, ~{
    cat(".") # poor dude's progress bar
    Sys.sleep(2) # robots.txt crawl delay
    read_html(.x) %>%
      xml_node("div.content") %>%
      as.character() # we do this b/c we aren't parsing it yet, but xml2 objects don't serialize at all
  })
}

Here's how to use it. Again, remove head() but also consider doing it in batches.

head(lots_of_links) %>% # I'm not waiting for 60 documents
  mutate(content = get_documents(link)) -> links_and_docs
cat("\n")

glimpse(links_and_docs)
## Observations: 6
## Variables: 6
## $ link    <chr> "https://www.bger.ch/ext/eurospider/live/de/php/aza/http/index.php?lang=de&type=highlight_simple_query&...
## $ title   <chr> "18.12.2017 6B 790/2017", "14.12.2017 6G 2/2017", "13.12.2017 5A 975/2017", "13.12.2017 5D 257/2017", "...
## $ court   <chr> "Strafrechtliche Abteilung", "Cour de droit pénal", "II. zivilrechtliche Abteilung", "II. zivilrechtlic...
## $ subject <chr> "Straf- und Massnahmenvollzug", "Procédure pénale", "Familienrecht", "Schuldbetreibungs- und Konkursrec...
## $ object  <chr> "Bedingte Entlassung aus der Verwahrung, Beschleunigungsgebot", "Demande d'interprétation et de rectifi...
## $ content <chr> "<div class=\"content\">\n \n<div class=\"para\"> </div>\n<div class=\"para\">Bundesgericht </div>...
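Since parsing is deliberately deferred above, here is one possible follow-up: a sketch (the decree_text column name is my own) that re-parses the stored HTML strings and pulls out the .para text the question asked for. rvest and the tidyverse are already loaded at this point.

# re-parse each stored content <div> and collapse its .para paragraphs
# into one text string per decree
links_and_docs %>%
  mutate(decree_text = map_chr(content, ~{
    read_html(.x) %>%
      html_nodes(".para") %>%
      html_text(trim = TRUE) %>%
      paste(collapse = "\n")
  })) -> decrees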

You still need error & validity checking in various places and may need to re-scrape pages if there are server errors or parsing issues. But this is how to build a site-specific crawler of this nature.
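For the error handling left out here, one lightweight option (my own suggestion, not part of the original answer) is purrr::safely(), which captures failures instead of aborting the whole run:

# wrap the page fetcher so a single failing request doesn't kill the loop;
# safely() returns list(result = ..., error = NULL) on success and
# list(result = NULL, error = <condition>) on failure
safe_get_page <- safely(get_page)

res <- safe_get_page(3)
if (is.null(res$error)) {
  pg_df <- res$result
} else {
  message("page 3 failed: ", conditionMessage(res$error))
}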
