Extracting specific internal nodes from an xml file and construct a dataframe in R

Updated: 2024-10-26 10:29:27

I have an XML file from which I want to extract specific nodes in R, using xmlToDataFrame from the XML package. I can get the function to extract data from individual nodes. For example:

    xml <- xmlParse("file.xml")
    df <- xmlToDataFrame(getNodeSet(xml, "//lat"))

However, I was wondering whether it is possible to extract multiple nodes at the same time. Specifically, I am looking to build a five-column data frame from the following nodes in the XML: //nucleotides, //lat, //lon, //bin_uri, //record_id.

The structure of the xml file is as follows (just one record_id but there are many in the file that I need to extract):

    <record>
      <record_id>634750</record_id>
      <processid>CCSMA054-07</processid>
      <bin_uri>AAG2098</bin_uri>
      <collection_event>
        <collectors>Arctic Ecology</collectors>
        <coordinates>
          <lat>58.805</lat>
          <lon>-94.214</lon>
        </coordinates>
        <country>Canada</country>
        <province>Manitoba</province>
      </collection_event>
      <sequences>
        <sequence>
          <sequenceID>3336699</sequenceID>
          <markercode>COI-5P</markercode>
          <genbank_accession>HQ938393</genbank_accession>
          <nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
        </sequence>
      </sequences>
    </record>

Best answer

Consider simply running a separate XPath expression for each field with xpathSApply() and then binding the results together into a data frame:

    library(XML)

    doc <- xmlParse("D:/Freelance Work/Scripts/BoldXML.xml")

    # One XPath query per field; each returns the text of every matching node
    record_id   <- xpathSApply(doc, "//record/record_id", xmlValue)
    bin_uri     <- xpathSApply(doc, "//record/bin_uri", xmlValue)
    lat         <- xpathSApply(doc, "//record/collection_event/coordinates/lat", xmlValue)
    lon         <- xpathSApply(doc, "//record/collection_event/coordinates/lon", xmlValue)
    nucleotides <- xpathSApply(doc, "//record/sequences/sequence/nucleotides", xmlValue)

    # Bind the five vectors into one data frame (column named lon to match the node)
    df <- data.frame(record_id = unlist(record_id),
                     bin_uri = unlist(bin_uri),
                     lat = unlist(lat),
                     lon = unlist(lon),
                     nucleotides = unlist(nucleotides))
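One caveat with the approach above: the five vectors only line up if every record contains exactly one of each node. If some records lack, say, coordinates, the vectors have different lengths and data.frame() either fails or pairs values with the wrong record. A minimal sketch of a per-record variant follows. The helper name first_value, the wrapping <records> root, and the second record are hypothetical and exist only for illustration; the first record is abridged from the question.

```r
library(XML)

# Illustrative input. The first record is abridged from the question; the
# second is a made-up record without <coordinates>, included only to show
# how a missing node is handled.
xml_text <- '<records>
  <record>
    <record_id>634750</record_id>
    <bin_uri>AAG2098</bin_uri>
    <collection_event>
      <coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
    </collection_event>
    <sequences><sequence><nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides></sequence></sequences>
  </record>
  <record>
    <record_id>000001</record_id>
    <bin_uri>HYPOTHETICAL</bin_uri>
    <collection_event><country>Canada</country></collection_event>
    <sequences><sequence><nucleotides>ACGT</nucleotides></sequence></sequences>
  </record>
</records>'

doc <- xmlParse(xml_text, asText = TRUE)

# Hypothetical helper: text of the first node matching `path` relative to
# `node`, or NA when the record has no such node.
first_value <- function(node, path) {
  vals <- xpathSApply(node, path, xmlValue)
  if (length(vals) == 0) NA_character_ else vals[[1]]
}

# Build one single-row data frame per <record>, then stack the rows.
records <- getNodeSet(doc, "//record")
rows <- lapply(records, function(rec) {
  data.frame(
    record_id   = first_value(rec, "./record_id"),
    bin_uri     = first_value(rec, "./bin_uri"),
    lat         = first_value(rec, "./collection_event/coordinates/lat"),
    lon         = first_value(rec, "./collection_event/coordinates/lon"),
    nucleotides = first_value(rec, "./sequences/sequence/nucleotides"),
    stringsAsFactors = FALSE
  )
})
df <- do.call(rbind, rows)
```

Because each row is built from a single record, a missing node becomes NA in that row instead of shifting later values. Note that this sketch keeps only the first sequence per record; records with several sequences would need a different layout.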

Alternatively, you can simplify your raw XML using XSLT, a special-purpose language for restructuring XML files. While R does not have a universal XSLT package, practically all general-purpose languages (C#, Java, PHP, Perl, Python, VB) maintain XSLT libraries, and you can call scripts written in them from R with system(). In addition, command-line environments such as PowerShell on Windows and Bash on Linux can run XSLT.

XSLT Script (save as .xsl or .xslt)

    <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
      <xsl:strip-space elements="*"/>

      <xsl:template match="/">
        <root>
          <xsl:apply-templates select="*"/>
        </root>
      </xsl:template>

      <xsl:template match="record">
        <xsl:copy>
          <xsl:copy-of select="record_id"/>
          <xsl:copy-of select="bin_uri"/>
          <xsl:copy-of select="collection_event/coordinates/lat"/>
          <xsl:copy-of select="collection_event/coordinates/lon"/>
          <xsl:copy-of select="sequences/sequence/nucleotides"/>
        </xsl:copy>
      </xsl:template>
    </xsl:transform>

XML (after transformation)

    <?xml version="1.0" encoding="utf-8"?>
    <root>
      <record>
        <record_id>634750</record_id>
        <bin_uri>AAG2098</bin_uri>
        <lat>58.805</lat>
        <lon>-94.214</lon>
        <nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides>
      </record>
    </root>

R Script:

    result <- system('..some command line call to an external script that parses original xml and above xslt script and transforms former with the latter..', intern = TRUE)

    doc <- xmlParse("C:/Path/To/Transformed/XML.xml")
    df <- xmlToDataFrame(getNodeSet(doc, "//record"))
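The answer leaves the external command unspecified. As one possible way to fill that step, the sketch below assumes the xsltproc command-line tool (part of libxslt) is installed and on the PATH; that tool choice, the helper name run_xslt, the temporary file paths, and the wrapping <records> root are all assumptions for illustration, not part of the original answer. The stylesheet and the record are taken from above.

```r
library(XML)

# Hypothetical helper: apply a stylesheet with xsltproc (assumed installed),
# writing the transformed XML to out_path.
run_xslt <- function(xml_path, xsl_path, out_path) {
  status <- system2("xsltproc",
                    args = c("-o", shQuote(out_path),
                             shQuote(xsl_path), shQuote(xml_path)))
  if (status != 0) stop("xsltproc failed with exit status ", status)
  invisible(out_path)
}

# Stylesheet from the answer above.
xsl_text <- '<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
  <xsl:strip-space elements="*"/>
  <xsl:template match="/">
    <root><xsl:apply-templates select="*"/></root>
  </xsl:template>
  <xsl:template match="record">
    <xsl:copy>
      <xsl:copy-of select="record_id"/>
      <xsl:copy-of select="bin_uri"/>
      <xsl:copy-of select="collection_event/coordinates/lat"/>
      <xsl:copy-of select="collection_event/coordinates/lon"/>
      <xsl:copy-of select="sequences/sequence/nucleotides"/>
    </xsl:copy>
  </xsl:template>
</xsl:transform>'

# Record abridged from the question, wrapped in an assumed <records> root.
xml_text <- '<records>
  <record>
    <record_id>634750</record_id>
    <processid>CCSMA054-07</processid>
    <bin_uri>AAG2098</bin_uri>
    <collection_event>
      <coordinates><lat>58.805</lat><lon>-94.214</lon></coordinates>
      <country>Canada</country>
    </collection_event>
    <sequences><sequence><nucleotides>CTCAGAGTTCTCACCTGGC</nucleotides></sequence></sequences>
  </record>
</records>'

df <- NULL
if (nzchar(Sys.which("xsltproc"))) {   # run only when the tool is available
  xml_path <- tempfile(fileext = ".xml")
  xsl_path <- tempfile(fileext = ".xsl")
  out_path <- tempfile(fileext = ".xml")
  writeLines(xml_text, xml_path)
  writeLines(xsl_text, xsl_path)

  run_xslt(xml_path, xsl_path, out_path)

  # The transformed file is flat, so each <record> maps to one row.
  doc <- xmlParse(out_path)
  df <- xmlToDataFrame(nodes = getNodeSet(doc, "//record"),
                       stringsAsFactors = FALSE)
}
```

system2() is used here rather than system() so that the arguments can be quoted individually; any other XSLT 1.0 processor could be substituted in run_xslt.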

Published: 2023-08-07 19:56:00
Article link: https://www.elefans.com/category/jswz/34/1465934.html