Question
Is the java parser generated by ANTLR capable of streaming arbitrarily large files?
I tried constructing a Lexer with an UnbufferedCharStream and passing that to the parser. I got an UnsupportedOperationException because of a call to size on the UnbufferedCharStream, and the exception contained an explanation that you can't call size on an UnbufferedCharStream.
Lexer lexer = new Lexer(new UnbufferedCharStream(new CharArrayReader("".toCharArray())));
CommonTokenStream stream = new CommonTokenStream(lexer);
Parser parser = new Parser(stream);
I basically have a file I exported from hadoop using pig. It has a large number of rows separated by '\n', and each column is split by a '\t'. This is easy to parse in java: I use a buffered reader to read each line, then split on '\t' to get each column. But I also want some sort of schema validation. The first column should be a properly formatted date, followed by some price columns, followed by some hex columns.
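For comparison, the plain-Java approach described above could look roughly like this. This is only a sketch: `TsvValidator`, the exact column layout (one date, then price columns, then hex columns), and the date/price formats are all assumptions, since the question doesn't give the real schema.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

public class TsvValidator {
    // Assumed column formats: ISO date, two-decimal price, hex string.
    private static final Pattern DATE  = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");
    private static final Pattern PRICE = Pattern.compile("\\d+\\.\\d{2}");
    private static final Pattern HEX   = Pattern.compile("[0-9a-fA-F]+");

    // Validates one tab-separated row: a date, then priceCols prices, then hexCols hex columns.
    static boolean validRow(String row, int priceCols, int hexCols) {
        String[] cols = row.split("\t");
        if (cols.length != 1 + priceCols + hexCols) return false;
        if (!DATE.matcher(cols[0]).matches()) return false;
        for (int i = 1; i <= priceCols; i++)
            if (!PRICE.matcher(cols[i]).matches()) return false;
        for (int i = 1 + priceCols; i < cols.length; i++)
            if (!HEX.matcher(cols[i]).matches()) return false;
        return true;
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the real file; memory use stays per-line.
        String data = "2023-01-15\t9.99\tdeadbeef\nnot-a-date\t1.00\tff\n";
        try (BufferedReader r = new BufferedReader(new StringReader(data))) {
            String line;
            while ((line = r.readLine()) != null) {
                System.out.println(line.split("\t")[0] + " -> " + validRow(line, 1, 1));
            }
        }
    }
}
```

Since it reads one line at a time and discards it, this handles arbitrarily large files; the trade-off is that the "schema" lives in ad-hoc regexes rather than a grammar.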
When I look at the generated parser code, it seems I could call it like so:
parser.lines().line()
This would give me a List which, conceptually, I could iterate over. But the list seems to have a fixed size by the time I get it, which means the parser has probably already parsed the entire file.
Is there another part of the API that would allow you to stream really large files? Some way of using a Visitor or Listener that gets called as the file is being read? It can't keep the entire file in memory; it won't fit.
Answer
You could do it like this:
InputStream is = new FileInputStream(inputFile); // inputFile is the path to your input file
ANTLRInputStream input = new ANTLRInputStream(is);
GeneratedLexer lex = new GeneratedLexer(input);
lex.setTokenFactory(new CommonTokenFactory(true)); // tokens copy their own text
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
GeneratedParser parser = new GeneratedParser(tokens);
parser.setBuildParseTree(false); // !! don't build a parse tree
parser.top_level_rule();
And if the file is quite big, forget about listeners or visitors - I would create objects directly in the grammar. Just put them all in some structure (e.g. a HashMap or Vector) and retrieve them as needed. This way you avoid creating the parse tree, and that is what really takes a lot of memory.
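As a sketch of "creating objects directly in the grammar", embedded actions in a combined grammar can handle each row as soon as it is parsed. Everything here is assumed: the grammar name, the column layout, and the token formats are hypothetical, and for true streaming you would process each row inside the action and discard it rather than accumulate it in a list.

```antlr
grammar Exported;

@parser::members {
    // Hypothetical collector; for huge files, process and discard instead.
    public java.util.List<String[]> rows = new java.util.ArrayList<>();
}

lines : line+ EOF ;

line
    : date=DATE '\t' price=PRICE '\t' hex=HEX '\n'
      { rows.add(new String[] { $date.text, $price.text, $hex.text }); }
    ;

DATE  : [0-9][0-9][0-9][0-9] '-' [0-9][0-9] '-' [0-9][0-9] ;
PRICE : [0-9]+ '.' [0-9]+ ;
HEX   : [0-9a-fA-F]+ ;
```

Combined with `setBuildParseTree(false)` above, the parser validates the schema as it goes without ever holding the whole file (or a tree for it) in memory.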