Parsing only some lines with pyparsing

I'm trying to parse a file, or rather only some portions of it. The file contains information about the hardware in a server, and each line starts with a keyword denoting the type of hardware. For example:

pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1

The example above isn't a real file, but it is simple and enough to demonstrate the need. I only want to parse the lines starting with "pci" and skip the others. I wrote a grammar for lines starting with "pci":

grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )

I've also written a grammar for lines not starting with "pci":

grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )

And then I built a grammar that combines the two above:

grammar = ( grammar_pci | grammar_non_pci )

Then I read the file and pass it to parseString:

with open("foo.txt","r") as f:
  data = grammar.parseString(f.read())
print(data)

But no data is printed. What am I missing? How can I parse the data while skipping the lines that don't start with a specific keyword?

Thanks.

Read the file one line at a time; if a line starts with "pci", add it to the list data, otherwise discard it:

data = []

with open("foo.txt", "r") as f:
    for line in f:
        if line.startswith('pci'):
            data.append(line)

print(data)

If you still need to do further parsing with your grammar, you can now parse the items of the list data, knowing that each one does indeed start with "pci".
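As a rough illustration of this filter-then-parse approach, the sketch below reads the question's sample lines from an in-memory string rather than foo.txt, and uses str.split as a simple stand-in for the per-line grammar. The names SAMPLE and pci_fields are made up for this example.

```python
import io

# The sample lines from the question, held in memory instead of foo.txt.
SAMPLE = """\
pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1
"""


def pci_fields(lines):
    """Keep only lines starting with 'pci' and split each into its fields.

    str.split() is a stand-in here; a pyparsing expression could be
    applied to each kept line instead.
    """
    return [line.split() for line in lines if line.startswith("pci")]


data = pci_fields(io.StringIO(SAMPLE))
print(data)
```

Filtering before parsing keeps the per-line grammar simple, since it never has to deal with the lines it should ignore.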

You are off to a good start, but you are missing a few steps, mostly having to do with filling in gaps and repetition.

First, look at your expression for grammar_non_pci:

grammar_non_pci = Suppress( Regex( r"(?!pci)" ) )

This correctly detects a line that does not start with "pci", but it doesn't actually parse the line's content: the negative lookahead is zero-width, so it consumes no characters.

The easiest way to add this is to add a ".*" to the regex, so that it will parse not only the "not starting with pci" lookahead, but also the rest of the line.

grammar_non_pci = Suppress( Regex( r"(?!pci).*" ) )
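The difference between the two patterns can be checked with Python's re module alone, outside pyparsing. This is only a small sketch using one line from the question's sample; the variable names are illustrative.

```python
import re

line = "fcs1 g4045-L1"

# The bare lookahead succeeds but is zero-width: it consumes nothing.
zero_width = re.match(r"(?!pci)", line)
print(zero_width.end())    # 0

# With ".*" appended, the same match also consumes the rest of the line.
whole_line = re.match(r"(?!pci).*", line)
print(whole_line.group())  # fcs1 g4045-L1

# A line that does start with "pci" is rejected by the lookahead.
pci_line = re.match(r"(?!pci).*", "pci24 u2480-L0")
print(pci_line)            # None
```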

Second, your grammar just processes a single instance of an input line.

grammar = ( grammar_pci | grammar_non_pci )

The grammar needs to repeat so that it covers every line:

grammar = OneOrMore( grammar_pci | grammar_non_pci, stopOn=StringEnd())

[EDIT: since you are up to pyparsing 3.0.9, this can also be written as follows]
grammar = (grammar_pci | grammar_non_pci)[1, ...: StringEnd()]

Since grammar_non_pci could actually match on an empty string, it could repeat forever at the end of the file; that's why the stopOn argument is needed.

With these changes, your sample text should parse correctly.

But there is one issue that you'll need to clean up, and that is the definition of the "pci"-prefixed word in grammar_pci.

grammar_pci = Group ( Word( "pci" + nums ) + Word( alphanums + "-" ) )

Pyparsing's Word class takes 1 or 2 strings of characters, and uses them as the sets of valid characters for the initial word character and the body word characters. "pci" + nums gives the string "pci0123456789", and will match any word made up of any of those characters. So it will match not only "pci00" but also "cip123", "cci123", "p0c0i", or "12345".
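The character-set behaviour described above can be confirmed with a quick check. This is a minimal sketch, assuming pyparsing is installed; the helper accepts and the name loose_word are made up for this example.

```python
from pyparsing import ParseException, Word, nums

# Word treats its argument as a set of allowed characters, not a prefix.
loose_word = Word("pci" + nums)


def accepts(expr, text):
    """Return True if expr matches the whole of text (hypothetical helper)."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except ParseException:
        return False


for candidate in ["pci00", "cip123", "cci123", "p0c0i", "12345", "fcs1"]:
    print(candidate, accepts(loose_word, candidate))
```

Every candidate except "fcs1" is accepted, because each of them is built only from the characters p, c, i and the digits.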

To resolve this, use "pci" + Word(nums) wrapped in Combine to represent only word groups that start with "pci":

grammar_pci = Group ( Combine("pci" + Word( nums )) + Word( alphanums + "-" ) )

Since you seem comfortable using Regex items, you could also write this as

grammar_pci = Group ( Regex(r"pci\d+") + Word( alphanums + "-" ) )

These changes should get you moving forward on your parser.
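Putting the pieces together, a complete version might look like the sketch below. It assumes pyparsing is installed and parses the question's sample from an in-memory string (SAMPLE is a name made up here) instead of foo.txt. It keeps the camelCase names used above (parseString, stopOn, asList), which recent pyparsing releases also spell in snake_case.

```python
from pyparsing import (
    Combine, Group, OneOrMore, Regex, StringEnd, Suppress, Word,
    alphanums, nums,
)

# The sample lines from the question, held in memory instead of foo.txt.
SAMPLE = """\
pci24 u2480-L0
fcs1 g4045-L1
pci25 h6045-L0
en192 v7024-L3
pci26 h6045-L1
"""

# "pci" followed directly by digits, then the location code.
grammar_pci = Group(Combine("pci" + Word(nums)) + Word(alphanums + "-"))

# Any line not starting with "pci": consume it, but emit no tokens.
grammar_non_pci = Suppress(Regex(r"(?!pci).*"))

# Repeat over all lines; stopOn keeps the empty-matching regex from
# being retried once the end of the input is reached.
grammar = OneOrMore(grammar_pci | grammar_non_pci, stopOn=StringEnd())

result = grammar.parseString(SAMPLE)
rows = result.asList()
print(rows)
```

On the sample input this should leave only the three "pci" rows, each as a two-item list of device name and location code.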
