Parse array of numbers between empty lines

Updated: 2024-10-24 14:25:35

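I'm trying to write a parser that scans arrays of numbers separated by empty lines in a text file.

1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906

1 154
2 321
3 519

1 235 623 684
2 871 699 557
3 918 686 49

Here is the full text file.

I wrote the following parser with parsec: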

import Text.ParserCombinators.Parsec

emptyLine = do
  spaces
  newline

emptyLines = many1 emptyLine

data1 = do
  dat <- many1 digit
  return (dat)

datan = do
  many1 (oneOf " \t")
  dat <- many1 digit
  return (dat)

dataline = do
  dat1 <- data1
  dat2 <- many datan
  many (oneOf " \t")
  newline
  return (dat1:dat2)

parseSeries = do
  dat <- many1 dataline
  return dat

parseParag = try parseSeries

parseListing = do
  --cont <- parseSeries `sepBy` emptyLines
  cont <- between emptyLines emptyLines parseSeries
  eof
  return cont

main = do
  fichier <- readFile ("test_listtst.txt")
  case parse parseListing "(test)" fichier of
    Left error -> do
      putStrLn "!!! Error !!!"
      print error
    Right serie -> do
      mapM_ print serie

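But it fails with the following error:

!!! Error !!!
"(test)" (line 6, column 1):
unexpected "1"
expecting space or new-line

I don't understand why.

Do you have any idea of what's wrong with my parser?

Do you have an example of how to parse structured data separated by empty lines?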

Best answer

Do you have any idea of what's wrong with my parser?
 
A few things:

As other answerers have already pointed out, the spaces parser is designed to consume a sequence of characters that satisfy Data.Char.isSpace; the newline ('\n') is such a character. Therefore, your emptyLine parser always fails, because newline expects a newline character that has already been consumed.

You probably shouldn't use the newline parser in your "line" parsers anyway, because those parsers will fail on the last line of the file if the latter doesn't end with a newline.

Why not use parsec 3 (Text.Parsec.*) rather than parsec 2 (Text.ParserCombinators.*)?

Why not parse the numbers as Integers or Ints as you go, rather than keep them as Strings?

Personal preference, but you rely too much on the do notation for my taste, to the detriment of readability. For instance,

data1 = do
  dat <- many1 digit
  return (dat)

can be simplified to

data1 = many1 digit

You would do well to add a type signature to all your top-level bindings.

Be consistent in how you name your parsers: why "parseListing" instead of simply "listing"?

Have you considered using a different type of input stream (e.g. Text) for better performance?
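
To make the first two points concrete, here is a minimal sketch. The names brokenEmptyLine, fixedEmptyLine and lineEnd are made up for illustration and appear neither in the question nor in the parser further down; brokenEmptyLine simply restates the emptyLine from the question.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Same shape as the question's emptyLine: `spaces` also eats '\n',
-- so by the time `newline` runs there is no newline left to match.
brokenEmptyLine :: Parser Char
brokenEmptyLine = spaces >> newline

-- Illustrative fix (an assumption, not the asker's code):
-- skip only blanks and tabs, then require the newline.
fixedEmptyLine :: Parser Char
fixedEmptyLine = skipMany (oneOf " \t") >> newline

-- Illustrative line terminator that also accepts the end of input,
-- so a last line without a trailing '\n' still parses.
lineEnd :: Parser ()
lineEnd = (newline >> return ()) <|> eof
```

On the input "  \n1", parse brokenEmptyLine "" "  \n1" yields a Left, whereas parse fixedEmptyLine "" "  \n1" yields Right '\n'. Using something like lineEnd instead of newline at the end of a "line" parser is one way to cope with a file whose last line is not newline-terminated.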
 
 
Do you have an example on how to parse a structured bunch of data separated by empty lines?
 
Below is a much simplified version of the kind of parser you want. Note that the input is not supposed to begin with (but may end with) empty lines, and "data lines" are not supposed to contain leading spaces, but may contain trailing spaces (in the sense of the spaces parser). 
module Main where

import Data.Char ( isSpace )
import Text.Parsec
import Text.Parsec.String ( Parser )

eolChar :: Char
eolChar = '\n'

eol :: Parser Char
eol = char eolChar

whitespace :: Parser String
whitespace = many $ satisfy $ \c -> isSpace c && c /= eolChar

emptyLine :: Parser String
emptyLine = whitespace

emptyLines :: Parser [String]
emptyLines = sepEndBy1 emptyLine eol

cell :: Parser Integer
cell = read <$> many1 digit

dataLine :: Parser [Integer]
dataLine = sepEndBy1 cell whitespace
--             ^
-- replace with sepBy1 if no trailing whitespace is allowed in a "data line"

dataLines :: Parser [[Integer]]
dataLines = sepEndBy1 dataLine eol

listing :: Parser [[[Integer]]]
listing = sepEndBy dataLines emptyLines

main :: IO ()
main = do
    fichier <- readFile "test_listtst.txt"
    case parse listing "(test)" fichier of
        Left err    -> putStrLn "!!! Error !!!" >> print err
        Right serie -> mapM_ print serie
 
Test: 
λ> main
[[1,235,623,684],[2,871,699,557],[3,918,686,49],[4,53,564,906]]
[[1,154],[2,321],[3,519]]
[[1,235,623,684],[2,871,699,557],[3,918,686,49]]
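
The parser above does not accept leading empty lines and never checks that the whole input was consumed. If you need both, one possible variation looks like the sketch below. All names here (blanks, lineEnd, blankLine, strictListing) are illustrative and not part of the code above, and data lines are still assumed to have no leading whitespace.

```haskell
import Data.Char (isSpace)
import Text.Parsec
import Text.Parsec.String (Parser)

-- Whitespace other than the newline character (possibly none).
blanks :: Parser ()
blanks = skipMany (satisfy (\c -> isSpace c && c /= '\n'))

-- One number, followed by optional non-newline whitespace.
cell :: Parser Integer
cell = read <$> many1 digit <* blanks

-- End of a line: a newline, or the end of the input for the last line.
lineEnd :: Parser ()
lineEnd = (newline >> return ()) <|> eof

-- A non-empty row of numbers, including its line ending.
dataLine :: Parser [Integer]
dataLine = many1 cell <* lineEnd

-- A line holding only whitespace; `try` undoes the consumed blanks
-- when no newline follows.
blankLine :: Parser ()
blankLine = try (blanks >> newline >> return ())

-- Blocks of data lines separated by blank lines; blank lines may come
-- first or last, and nothing may be left over at the end.
strictListing :: Parser [[[Integer]]]
strictListing =
    skipMany blankLine
        *> many (many1 dataLine <* skipMany blankLine)
        <* blanks
        <* eof
```

With this variant, parse strictListing "" "\n1 2\n3 4\n\n5" gives Right [[[1,2],[3,4]],[[5]]], and leftover non-numeric input is reported as an error instead of being silently ignored.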