Error parsing a char (――) in Haskell

I'm writing a parser for large chunks of English text using attoparsec. Everything has worked so far except for parsing "――". I know it is just two dashes together, "--". The weird thing is that the parser catches it in this code:

wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass "――?!,:")) >> pure () 

but not in this case:

specialChars = ['――', '?', '!', ',', ':']
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass specialChars)) >> pure ()

I'm using the list specialChars because I have a lot of characters to consider and I use it in multiple places. For example, given the input "I am ――Walt Whitman._", the output is supposed to be {"I", "am", "Walt", "Whitman."}. I believe the problem is that "――" is not a Char. How do I fix this?

A Char is one character, full stop. ―― is two characters, so it is two Chars. You can fit as many Chars as you want into a String, but you certainly cannot fit two Chars into one Char.

Since satisfy considers one character at a time, it probably isn't what you want if you need to parse a sequence of two characters as a single unit. The inClass function just produces a predicate on characters (inClass partially applied to one argument produces a function of type Char -> Bool), so inClass "――" is the same as inClass ['―', '―'], which is just the same as inClass ['―'] since duplicates are irrelevant. That is also why your first version appears to work: many1 simply consumes each ― on its own. That won't help you much.
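To make the point about duplicates concrete, here is a small sketch that assumes the Text flavour of attoparsec (Data.Attoparsec.Text); the names doubled and single are made up for illustration.

```haskell
import Data.Attoparsec.Text (inClass)

-- Both predicates are built from Strings; the first just lists the same Char twice.
-- (doubled and single are illustrative names, not part of attoparsec.)
doubled, single :: Char -> Bool
doubled = inClass "――"
single  = inClass "―"

main :: IO ()
main = do
  -- The String literal holds two Char values, so it cannot be a single Char literal.
  print (length "――")
  -- Compare the two predicates on a few sample characters.
  print (map doubled "―-a?" == map single "―-a?")
```

Running this should report a length of 2 and show that the two predicates agree on the sample characters, which is why neither of them can recognise the pair as one unit.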

Consider using string instead of, or in combination with, inClass, since it is designed to handle sequences of characters. For example, something like this might better suit your needs:

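import Data.Functor (void)
-- space and satisfy return a Char while string returns the matched text; void makes all three alternatives Parser ()
wordSeparator :: Parser ()
wordSeparator = many1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:"))) >> pure ()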
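For anyone who wants to try this end to end, the following is a minimal sketch rather than a definitive solution. It assumes Data.Attoparsec.Text with OverloadedStrings, and the word and wordsP parsers are hypothetical helpers that do not appear in the question. As a simplification, a word is not allowed to contain ― at all, so a lone ― in the input would make the parse fail.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
  ( Parser, endOfInput, inClass, option, parseOnly, satisfy
  , sepBy, skipMany1, space, string, takeWhile1 )
import Data.Char (isSpace)
import Data.Functor (void)
import Data.Text (Text)

-- Single-character separators only; each element here is exactly one Char.
specialChars :: String
specialChars = "?!,:"

-- One or more separators: whitespace, the two-character dash, or a special char.
wordSeparator :: Parser ()
wordSeparator =
  skipMany1 (void space <|> void (string "――") <|> void (satisfy (inClass specialChars)))

-- Hypothetical helper: a word is a run of characters that cannot start a separator.
word :: Parser Text
word = takeWhile1 (\c -> not (isSpace c) && c /= '―' && not (inClass specialChars c))

-- Hypothetical helper: optional separators around a separator-delimited list of words.
wordsP :: Parser [Text]
wordsP = option () wordSeparator *> (word `sepBy` wordSeparator)
           <* option () wordSeparator <* endOfInput

main :: IO ()
main = print (parseOnly wordsP "I am ――Walt Whitman.")
-- prints: Right ["I","am","Walt","Whitman."]
```

Keeping specialChars as a String of single characters and handling the multi-character separator with string is one way to reuse the list in several places without trying to squeeze two characters into a Char.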
