I have an XML file from which I want to extract specific nodes in R, using xmlToDataFrame from the XML package. I can get the function to extract data from individual nodes, for example:
    xml <- xmlParse("file.xml")
    df <- xmlToDataFrame(getNodeSet(xml, "//lat"))

However, I was wondering whether it is possible to extract multiple nodes at the same time. Specifically, I am looking to make a five-column data frame with data from these nodes: //nucleotides, //lat, //lon, //bin_uri, //record_id.
The structure of the XML file is as follows (only one record is shown, but the file contains many that I need to extract):
    <record>
      <record_id>634750</record_id>
      <processid>CCSMA054-07</processid>
      <bin_uri>AAG2098</bin_uri>
      <collection_event>
        <collectors>Arctic Ecology</collectors>
        <coordinates>
          <lat>58.805</lat>
          <lon>-94.214</lon>
        </coordinates>
        <country>Canada</country>
        <province>Manitoba</province>
      </collection_event>
      <sequences>
        <sequence>
          <sequenceID>3336699</sequenceID>
          <markercode>COI-5P</markercode>
          <genbank_accession>HQ938393</genbank_accession>
          <nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
        </sequence>
      </sequences>
    </record>

Best answer
Consider simply running several XPath expressions with xpathSApply() and then binding the results together into a data frame:
    library(XML)

    doc <- xmlParse("D:/Freelance Work/Scripts/BoldXML.xml")

    # Each call returns one value per matching node, in document order
    record_id   <- xpathSApply(doc, "//record/record_id", xmlValue)
    bin_uri     <- xpathSApply(doc, "//record/bin_uri", xmlValue)
    lat         <- xpathSApply(doc, "//record/collection_event/coordinates/lat", xmlValue)
    lon         <- xpathSApply(doc, "//record/collection_event/coordinates/lon", xmlValue)
    nucleotides <- xpathSApply(doc, "//record/sequences/sequence/nucleotides", xmlValue)

    # Bind the vectors column-wise; this works only if all five have the same length
    df <- data.frame(record_id = unlist(record_id),
                     bin_uri = unlist(bin_uri),
                     lat = unlist(lat),
                     lng = unlist(lon),
                     nucleotides = unlist(nucleotides))

Alternatively, you can simplify your raw XML using XSLT, the special-purpose language for restructuring XML files. While R does not have a universal XSLT package, practically all general-purpose languages (C#, Java, PHP, Perl, Python, VB) maintain XSLT libraries, and you can call scripts written in them from R with system(). Command-line environments such as Windows PowerShell and Bash on Linux can also be used to run XSLT.
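One caveat about the xpathSApply() approach above: the five vectors line up only when every record contains all five nodes. If some records lack coordinates, for instance, the lat and lon vectors come out shorter and data.frame() stops with an error. A minimal sketch of a per-record variant follows, assuming records sit under a wrapper element (the `<records>` root, the second sample record, and the helper `first_value` are made up for illustration). It takes only the first match per record, so a record with several sequences would need different handling.

```r
library(XML)

# Inline sample (hypothetical): the second record deliberately has no coordinates.
xml_text <- '<records>
  <record>
    <record_id>634750</record_id>
    <bin_uri>AAG2098</bin_uri>
    <collection_event>
      <coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
    </collection_event>
    <sequences><sequence><nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides></sequence></sequences>
  </record>
  <record>
    <record_id>634751</record_id>
    <bin_uri>AAG2099</bin_uri>
    <sequences><sequence><nucleotides>ACGT</nucleotides></sequence></sequences>
  </record>
</records>'
doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: text of the first node matching `path` below `node`, or NA.
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# XPath expressions relative to each <record> node, named by output column.
paths <- c(
  record_id   = "./record_id",
  bin_uri     = "./bin_uri",
  lat         = "./collection_event/coordinates/lat",
  lon         = "./collection_event/coordinates/lon",
  nucleotides = "./sequences/sequence/nucleotides"
)

# Build one single-row data frame per record, then stack the rows.
records <- getNodeSet(doc, "//record")
rows <- lapply(records, function(rec) {
  as.data.frame(lapply(paths, function(p) first_value(rec, p)),
                stringsAsFactors = FALSE)
})
df <- do.call(rbind, rows)

# xmlValue returns text, so convert the coordinates explicitly.
df$lat <- as.numeric(df$lat)
df$lon <- as.numeric(df$lon)
```

With this layout the record without coordinates keeps its own row, with NA in lat and lon, instead of shifting later values into the wrong record.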
XSLT Script (save as .xsl or .xslt)
    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
      <xsl:strip-space elements="*"/>

      <xsl:template match="/">
        <root>
          <xsl:apply-templates select="*"/>
        </root>
      </xsl:template>

      <xsl:template match="record">
        <xsl:copy>
          <xsl:copy-of select="record_id"/>
          <xsl:copy-of select="bin_uri"/>
          <xsl:copy-of select="collection_event/coordinates/lat"/>
          <xsl:copy-of select="collection_event/coordinates/lon"/>
          <xsl:copy-of select="sequences/sequence/nucleotides"/>
        </xsl:copy>
      </xsl:template>
    </xsl:transform>

XML (after transformation)
    <?xml version="1.0" encoding="utf-8"?>
    <root>
      <record>
        <record_id>634750</record_id>
        <bin_uri>AAG2098</bin_uri>
        <lat>58.805</lat>
        <lon>-94.214</lon>
        <nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
      </record>
    </root>

R Script:
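    library(XML)

    # Placeholder: replace the string with a real command that applies the XSLT script to the original XML
    result <- system('..some command line call to an external script that parses original xml and above xslt script and transforms former with the latter..', intern = TRUE)

    # Read the transformed file; each <record> now has flat children and maps to one row
    doc <- xmlParse("C:/Path/To/Transformed/XML.xml")
    df <- xmlToDataFrame(getNodeSet(doc, "//record"))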
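As one possible way to fill in the external call, here is a sketch that assumes the xsltproc command-line tool (part of libxslt) is installed and on the PATH. The answer does not name a processor, so this choice, the temporary file names, and the `<records>` wrapper element are all assumptions for illustration. The stylesheet is the one from the answer, with indentation removed only to keep the example short.

```r
library(XML)

# Temporary files standing in for the real input, stylesheet, and output paths.
xml_in  <- tempfile(fileext = ".xml")
xsl_in  <- tempfile(fileext = ".xsl")
xml_out <- tempfile(fileext = ".xml")

# One sample record inside a hypothetical <records> root element.
writeLines('<records><record>
<record_id>634750</record_id>
<processid>CCSMA054-07</processid>
<bin_uri>AAG2098</bin_uri>
<collection_event>
<coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
</collection_event>
<sequences><sequence>
<nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
</sequence></sequences>
</record></records>', xml_in)

# The stylesheet from the answer: keep five fields as direct children of <record>.
writeLines('<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<root><xsl:apply-templates select="*"/></root>
</xsl:template>
<xsl:template match="record">
<xsl:copy>
<xsl:copy-of select="record_id"/>
<xsl:copy-of select="bin_uri"/>
<xsl:copy-of select="collection_event/coordinates/lat"/>
<xsl:copy-of select="collection_event/coordinates/lon"/>
<xsl:copy-of select="sequences/sequence/nucleotides"/>
</xsl:copy>
</xsl:template>
</xsl:transform>', xsl_in)

if (nzchar(Sys.which("xsltproc"))) {
  # Command-line form: xsltproc -o <output> <stylesheet> <input>
  status <- system2("xsltproc",
                    c("-o", shQuote(xml_out), shQuote(xsl_in), shQuote(xml_in)))
  stopifnot(status == 0)

  # The transformed records are flat, so xmlToDataFrame maps each one to a row.
  doc <- xmlParse(xml_out)
  df <- xmlToDataFrame(getNodeSet(doc, "//record"), stringsAsFactors = FALSE)
  print(df)
} else {
  message("xsltproc not found; substitute whichever XSLT processor is available")
}
```

system2() is used here rather than system() only because it takes the arguments as a vector, which makes quoting paths with spaces easier; the same command could be passed to system() as a single string.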