Simple question: this code, x <- read_html(url), sometimes hangs and keeps trying to read the page indefinitely. I don't know how to handle this, for example by setting a maximum response time. I could use try, catch, or whatever to retry, but it just hangs and nothing happens. Does anyone know how to deal with it?
There's no problem with the page itself. The hang only occurs sometimes, and when I retry manually it works.
Solution
You can fetch the page with the GET function from the httr package, which accepts a timeout, and pass the response on to read_html
e.g. if your original code was
    library(rvest)
    library(dplyr)

    my_url <- "stackoverflow/questions/48722076/how-to-set-timeout-in-rvest"
    x <- my_url %>% read_html(.)

then you could replace it with
    library(httr)

    # Allow 10 seconds
    my_url %>% GET(., timeout(10)) %>% read_html

    # Allow 30 seconds
    my_url %>% GET(., timeout(30)) %>% read_html

Example
To put it to the test, try setting an extremely short timeout period (e.g. a hundredth of a second)
    # Allow an unreasonably short amount of time so the request errors rather than hangs indefinitely
    my_url %>% GET(., timeout(0.01)) %>% read_html
    # Error in curl::curl_fetch_memory(url, handle = handle) :
    #   Timeout was reached: Resolving timed out after 10 milliseconds

You can find some more examples here
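If the same timeout is needed in several places, the pattern above could be bundled into a small helper. This is only a sketch: the name read_html_timeout and its seconds argument are made up for illustration, and the stop_for_status() call is an extra assumption (treating HTTP error codes as failures) that was not part of the answer above.

```r
library(httr)   # GET(), timeout(), stop_for_status()
library(rvest)  # read_html()

# Hypothetical helper: fetch a page with a time limit, then parse it.
read_html_timeout <- function(url, seconds = 10) {
  # GET() raises an R error if the request takes longer than `seconds`
  response <- GET(url, timeout(seconds))
  # Assumption: also treat HTTP 4xx/5xx responses as errors
  stop_for_status(response)
  # Parse the body of the response into an xml_document
  read_html(response)
}
```

A call such as read_html_timeout(my_url, seconds = 30) should then either return the parsed page or raise an error that can be caught with tryCatch(), rather than hanging.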
Using it in a loop (e.g. skip to the next URL if one times out)
Try running this code. It supposes you have a number (3 in this case) of URLs to visit. The second URL below delays 3 seconds before providing the HTML, which is a great way to test the functionality you're looking for. We set the timeout to 2 seconds, so we know that request will fail. tryCatch() evaluates the expression given as its first argument; if that expression raises an error, it runs the error handler instead, which in this case simply assigns "Timed out!" to the list element
    my_urls <- c("stackoverflow/questions/48722076/how-to-set-timeout-in-rvest",
                 "httpbin/delay/3", # This url will delay 3 seconds
                 "httpbin/delay/1")
    x <- list()

    # Set timeout for 2 seconds (so second url will fail)
    for (i in seq_along(my_urls)) {
      print(paste0("Scraping url number ", i))
      tryCatch(
        x[[i]] <- my_urls[i] %>% GET(., timeout(2)) %>% read_html,
        error = function(e) { x[[i]] <<- "Timed out!" }
      )
    }

Now we inspect the output - the first and third sites returned content, the second timed out
    # > x
    # [[1]]
    # {xml_document}
    # <html itemscope="" itemtype="schema/QAPage" class="html__responsive">
    # [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>r - how to set timeout ...
    # [2] <body class="question-page unified-theme">\r\n <div id="notify-container"></div>\r\n <div id="custom ...
    #
    # [[2]]
    # [1] "Timed out!"
    #
    # [[3]]
    # {xml_document}
    # <html>
    # [1] <body><p>{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {}, \n "headers": {\n "Accept": ...

Obviously you can set the timeout value to whatever you want. 30 - 60 seconds could be sensible depending on the use.
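Since the question mentions that a manual retry usually works, the same tryCatch() idea could be extended to retry a few times before giving up. The following is a rough sketch, not a tested recipe: fetch_with_retry, fetch_page, max_tries and wait are hypothetical names, and the fetching function is passed in as an argument so that the retry logic does not depend on any particular package.

```r
# Hypothetical helper: call fetch(url) up to max_tries times,
# pausing `wait` seconds between failed attempts.
fetch_with_retry <- function(url, fetch, max_tries = 3, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    # Turn an error into a returned condition object instead of stopping
    result <- tryCatch(fetch(url), error = function(e) e)
    if (!inherits(result, "error")) {
      return(result)  # success: hand back whatever fetch() produced
    }
    message("Attempt ", attempt, " failed: ", conditionMessage(result))
    if (attempt < max_tries) Sys.sleep(wait)
  }
  NULL  # every attempt failed
}

# Assumed fetcher built on the timeout approach from the answer above
# (10 seconds is an arbitrary choice)
fetch_page <- function(url) {
  xml2::read_html(httr::GET(url, httr::timeout(10)))
}
```

With these definitions, fetch_with_retry(my_url, fetch_page) would return the parsed page from the first attempt that succeeds, or NULL if all attempts time out or fail, so a loop over many URLs can check is.null() and move on.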