Load a table from Wikipedia into R

I'm trying to load the table of Supreme Court Justices into R from the following URL: https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States

I'm using the following code:

    library(RCurl)  # provides getURL()
    library(XML)    # provides htmlParse() and readHTMLTable()

    scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
    scotusData <- getURL(scotusURL, ssl.verifypeer = FALSE)
    scotusDoc <- htmlParse(scotusData)
    scotusData <- scotusDoc['//table[@class="wikitable"]']
    scotusTable <- readHTMLTable(scotusData[[1]], stringsAsFactors = FALSE)

R returns scotusTable as NULL. The goal is to get a data.frame in R that I can use to make a ggplot of SCOTUS justice tenure on the Court. I previously had the script working and producing an awesome plot; however, after the recent decisions something changed on the page and the script no longer works. I went through the HTML on Wikipedia to try to find any changes, but I'm not a web developer, so whatever broke my script isn't immediately apparent to me.

Additionally, is there a method in R that would allow me to cache the data from this page so I'm not constantly referencing the URL? That would seem to be the ideal way to avoid this issue in the future. Appreciate the help.

As an aside, SCOTUS is an ongoing hobby/side project of mine, so if there's some other data source out there that's better than Wikipedia, I'm all ears.

Edit: Sorry, I should have listed my dependencies. I'm using the XML, plyr, RCurl, data.table, and ggplot2 libraries.

Best answer

If you don't mind using a different package, you can try the "rvest" package.

    library(rvest)
    scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"

Option 1: Grab the tables from the page and use the html_table function to extract the tables you're interested in.

    # read_html() supersedes html(), which rvest deprecated in version 0.3.0
    temp <- scotusURL %>% read_html() %>% html_nodes("table")
    html_table(temp[1])  ## Just the "legend" table
    html_table(temp[2])  ## The table you're interested in
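
Positional indexes such as temp[2] stop working whenever a table is added or removed above the one you want. As a hedged alternative, the sketch below picks a table by a column name instead. The helper name find_table_by_column and the column name "Justice" are assumptions made for illustration (check the actual header on the live page), and the example runs against a small inline HTML string rather than Wikipedia itself.

```r
library(rvest)

# Hypothetical helper (not part of rvest): return the first table on the
# page whose header contains `column`, or NULL if no table matches.
find_table_by_column <- function(page, column) {
  tables <- html_table(html_nodes(page, "table"))
  for (tbl in tables) {
    if (column %in% names(tbl)) {
      return(tbl)
    }
  }
  NULL
}

# Stand-in for the real page: a "legend" table followed by the data table.
sample_html <- '<html><body>
  <table><tr><th>Key</th></tr><tr><td>legend</td></tr></table>
  <table>
    <tr><th>Justice</th><th>State</th></tr>
    <tr><td>John Jay</td><td>NY</td></tr>
  </table>
</body></html>'

page <- read_html(sample_html)
justices <- find_table_by_column(page, "Justice")
```

Against the live page you would pass read_html(scotusURL) as the page argument; the lookup then keeps working if the tables are reordered, as long as the column header is unchanged.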

Option 2: Inspect the table element and copy the XPath to read that table directly (right-click, inspect element, scroll to the relevant "table" tag, right click on that, and select "Copy XPath").

    # read_html() is the current name for the deprecated html()
    scotusURL %>%
      read_html() %>%
      html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>%
      html_table()

Another option I like is loading the data in a Google spreadsheet and reading it using the "googlesheets" package.

In Google Drive, create a new spreadsheet named, for instance, "Supreme Court". In the first worksheet, enter:

    =importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

This will automatically scrape this table into your Google spreadsheet.

From there, in R you can do:

    library(googlesheets)
    SC <- gs_title("Supreme Court")
    gs_read(SC)
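
On the caching part of the question, one possible approach in base R is to save the scraped table to a local .rds file and download it again only when that file is missing or too old. This is a minimal sketch, not an established recipe: the helper name cached_fetch, the max_age_days parameter, and the file name are all hypothetical, and a stand-in function replaces the real scraping call so the example runs offline.

```r
# Hypothetical helper: `fetch` is any zero-argument function that returns
# the data to cache (for example, a function that scrapes the table).
cached_fetch <- function(cache_file, fetch, max_age_days = 7) {
  if (file.exists(cache_file)) {
    age_days <- as.numeric(
      difftime(Sys.time(), file.mtime(cache_file), units = "days")
    )
    # Reuse the cached copy while it is still fresh enough.
    if (age_days < max_age_days) {
      return(readRDS(cache_file))
    }
  }
  # Cache is missing or stale: fetch again and overwrite the file.
  result <- fetch()
  saveRDS(result, cache_file)
  result
}

# Offline demonstration with a stand-in for the scraping function.
fetch_count <- 0
fake_fetch <- function() {
  fetch_count <<- fetch_count + 1
  data.frame(justice = c("A", "B"), years = c(5, 10))
}

cache_path <- tempfile(fileext = ".rds")
first  <- cached_fetch(cache_path, fake_fetch)  # calls fake_fetch()
second <- cached_fetch(cache_path, fake_fetch)  # served from the cache file
```

In the real script, fake_fetch would be replaced by a function wrapping whichever scraping option you choose, and cache_path by a permanent file name such as "scotus.rds". If the page layout changes again, you still have the last good copy on disk.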
