I have a sample XML document:

    <?xml version="1.0" encoding="UTF-8"?>
    <tag_1>
      <tag_2>A</tag_2>
      <tag_3>B</tag_3>
      <tag_4>C</tag_4>
      <tag_5>D</tag_5>
    </tag_1>

I am interested in extracting only specific data.
For example:

    tag_1/tag_5 -> D

Here tag_1/tag_5 is my data definition (the only data I want). It is dynamic in nature, which means that tomorrow tag_1/tag_4 could be my data definition.

In reality my XML is a large data set, and these XML payloads arrive at a rate of 50,000 to 80,000 per hour.

I would like to know whether there is already a high-performance XML reader tool, or some special logic I can implement, that extracts data according to the data definition.

Currently I have an implementation using a StAX parser, but it takes nearly a day to parse 80,000 XML documents.
    import com.ximpleware.*;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class VTDParser {
        private final Logger LOG = LoggerFactory.getLogger(VTDParser.class);
        private final VTDGen vg;

        public VTDParser() {
            vg = new VTDGen();
        }

        public String parse(final String data, final String xpath) {
            vg.setDoc(data.getBytes());
            try {
                vg.parse(true); // namespace-aware parse
            } catch (final ParseException e) {
                LOG.error(e.toString());
            }
            final VTDNav vn = vg.getNav();
            final AutoPilot ap = new AutoPilot(vn);
            try {
                ap.selectXPath(xpath); // the XPath is recompiled on every call
            } catch (final XPathParseException e) {
                LOG.error(e.toString());
            }
            try {
                while (ap.evalXPath() != -1) {
                    final int val = vn.getText();
                    if (val != -1) {
                        return vn.toNormalizedString(val);
                    }
                }
            } catch (XPathEvalException | NavException e) {
                LOG.error(e.toString());
            }
            return null;
        }
    }

Best answer
This is my modification of your code. It compiles the XPath once and reuses it many times: the XPath is compiled without binding it to a VTDNav instance, and resetXPath is called before the parse method exits. I did not, however, show how to pre-index the XML documents with VTD to avoid repetitive parsing, and I suspect that might be the difference maker for your project. Here is a paper on the capabilities of vtd-xml:
http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf
    import com.ximpleware.*;
    import java.nio.charset.StandardCharsets;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class VTDParser {
        private static final Logger LOG = LoggerFactory.getLogger(VTDParser.class);
        private final VTDGen vg;
        private final AutoPilot ap;

        public VTDParser() throws VTDException {
            vg = new VTDGen();
            ap = new AutoPilot();
            // This is how you compile an XPath without binding it to an XML document.
            // "/a/b/c" is a placeholder; use your own data definition here.
            ap.selectXPath("/a/b/c");
        }

        public String parse(final String data) {
            vg.setDoc(data.getBytes(StandardCharsets.UTF_8));
            try {
                vg.parse(true);
            } catch (final ParseException e) {
                LOG.error(e.toString());
                return null; // nothing to navigate if parsing failed
            }
            final VTDNav vn = vg.getNav();
            ap.bind(vn); // bind the precompiled XPath to the freshly parsed document
            try {
                while (ap.evalXPath() != -1) {
                    final int val = vn.getText();
                    if (val != -1) {
                        return vn.toNormalizedString(val);
                    }
                }
            } catch (XPathEvalException | NavException e) {
                LOG.error(e.toString());
            } finally {
                ap.resetXPath(); // reset the XPath on every exit path so it can be reused
            }
            return null;
        }
    }
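As a point of comparison for the StAX approach mentioned in the question, the sketch below shows one way to extract a single value by a dynamic path using only the JDK's javax.xml.stream API, stopping as soon as the value is found. The class name StaxPathExtractor, the method extract, and the slash-separated path format are assumptions made for illustration; they are not part of the original question or answer. The sketch ignores namespaces, assumes the target element contains text only, and makes no claim about relative speed against vtd-xml.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxPathExtractor {
    // Hypothetical helper: the factory is created once and reused for every document.
    private static final XMLInputFactory FACTORY = XMLInputFactory.newInstance();

    /**
     * Returns the text of the first element whose path from the root equals
     * the given slash-separated path (for example "tag_1/tag_5"), or null
     * if no such element exists.
     */
    public static String extract(final String xml, final String path) throws XMLStreamException {
        final String target = "/" + path;
        final StringBuilder current = new StringBuilder(); // path of the element being visited
        final XMLStreamReader reader = FACTORY.createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                final int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    current.append('/').append(reader.getLocalName());
                    if (target.contentEquals(current)) {
                        // Stop here: the rest of the document is never read.
                        return reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    // Drop the last path segment when leaving an element.
                    current.setLength(current.lastIndexOf("/"));
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(final String[] args) throws XMLStreamException {
        final String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<tag_1><tag_2>A</tag_2><tag_3>B</tag_3>"
                + "<tag_4>C</tag_4><tag_5>D</tag_5></tag_1>";
        System.out.println(extract(xml, "tag_1/tag_5"));
    }
}
```

Because the path is an ordinary method argument, switching the data definition from tag_1/tag_5 to tag_1/tag_4 needs no code change. Paths that use predicates or attributes would still need an XPath engine such as the one in the answer above.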