Parsing semi-structured text strings in Python

I am trying to parse pseudo-English scripts and convert them into another machine-readable language. However, the scripts were written by many people over the years, and each had their own style of writing.

Some examples:

  1. On Device 1 Set word 45 and 46 to hex 331
  2. On Device 1 set words 45 and 46 bits 3..7 to 280
  3. on Device 1 set word 45 to oct 332
  4. on device 1 set speed to 60kts Words 3-4 to hex 34 (there are many more different ways used in the source text)

The issue is that the text is not always logical or consistent.

I have looked at regular expressions and matching certain words. This works out OK, but then I need to know the next word (e.g. for 'Word 24' I would match 'Word' and then try to figure out whether the next token is a number). In the case of 'Words' I need to look for the words to set, as well as their values.

In example 1, it should produce "Set word 45 to hex 331" and "Set word 46 to hex 331" or, if possible, "Set word 45 to hex 331 and word 46 to hex 331".

I tried using the findall method of re; that only gives me the matched words, and then I have to find the next word (i.e. the value) manually.

Alternatively, I could split the string on spaces and process each word manually, and then do something like the following.

Assuming the list (sp) is

['On', 'device1:', 'set', 'Word', '1', '', 'to', '88', 'and', 'word', '2', 'to', '2151']

import re

# sp is the token list shown above; stop one short so sp[i + 1] always exists
for i in range(len(sp) - 1):
    if re.search("[Ww]ord", sp[i]):
        print("Found word, next val is", sp[i + 1])

Is there a better way to do what I want? I looked a little into tokenizing, but I'm not sure that would work, as the language is not structured in the first place.

I suggest you develop a program that gradually explores the syntax that people have used to write the scripts.

E.g., each instruction in your examples seems to break down into a device-part and a settings-part. So you could try matching each line against the regex ^(.+) set (.+). If you find lines that don't match that pattern, print them out. Examine the output, find a general pattern that matches some of them, add a corresponding regex to your program (or modify an existing regex), and repeat. Proceed until you've recognized (in a very general way) every line in your input.
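A minimal sketch of that loop, assuming the instructions are already available as a list of strings. The names PATTERNS and unmatched are only illustrative, and the third sample line is invented to show a miss:

```python
import re

# Patterns recognized so far; extend this list as new styles turn up.
# IGNORECASE is an assumption made here because the samples mix "Set" and "set".
PATTERNS = [re.compile(r"^(.+) set (.+)$", re.IGNORECASE)]

def unmatched(lines):
    """Return the lines that none of the known patterns recognize."""
    return [line for line in lines
            if not any(p.match(line) for p in PATTERNS)]

lines = [
    "On Device 1 Set word 45 and 46 to hex 331",
    "on Device 1 set word 45 to oct 332",
    "On device 1 ste word 3 to 5",  # invented line containing a typo
]

# Print whatever still needs a pattern, then inspect it by hand.
for line in unmatched(lines):
    print("UNMATCHED:", line)
```

Each pass through the data should shrink the printed list; when it is empty, every line is covered by at least one pattern.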

(Since capitalization appears to be inconsistent, you can either do case-insensitive matches, or convert each line to lowercase before you start processing it. More generally, you may find other 'normalizations' that simplify subsequent processing. E.g., if people were inconsistent about spaces, you can convert every run of whitespace characters into a single space.)
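Those two normalizations could look roughly like this. The function name normalize is hypothetical, and stripping leading and trailing whitespace is an extra step I have assumed to be harmless:

```python
import re

def normalize(line):
    """Lowercase a line and collapse each run of whitespace to one space."""
    return re.sub(r"\s+", " ", line.strip().lower())

print(normalize("On  Device 1   Set word 45\tand 46 to hex 331"))
```

Running every line through a function like this before matching means the later regexes only have to deal with lowercase text and single spaces.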

(If your input has typographical errors, e.g. someone wrote "ste" for "set", then you can either change the regex to allow for that ( ... (set|ste) ... ), or go to (a copy of) the input file and just fix the typo.)

Then go back to the lines that matched ^(.+) set (.+), print out just the first group for each, and repeat the above process for just those substrings. Then repeat the process for the second group in each "set" instruction. And so on, recursively.
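As a sketch of one such refinement step for the first group, the code below counts the device-parts that a guessed sub-pattern does not yet cover. DEVICE_RE and the function name are assumptions for illustration; the second sample line is adapted from the token list in the question:

```python
import re
from collections import Counter

SET_RE = re.compile(r"^(.+) set (.+)$", re.IGNORECASE)
# Guessed sub-pattern for the device-part; refine it as the output suggests.
DEVICE_RE = re.compile(r"^on device (\d+)$", re.IGNORECASE)

def unrecognized_device_parts(lines):
    """Count the device-parts (group 1) that DEVICE_RE does not cover."""
    counts = Counter()
    for line in lines:
        m = SET_RE.match(line)
        if m and not DEVICE_RE.match(m.group(1)):
            counts[m.group(1)] += 1
    return counts

sample = [
    "On Device 1 Set word 45 to hex 331",
    "on device1: set Word 1 to 88",
]
print(unrecognized_device_parts(sample))
```

Counting rather than just printing puts the most common unrecognized variants at the top, which helps decide which pattern to add next. The same approach applies to the second group.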

Eventually, your program will be, in effect, a parser for the script language. At that point, you can start to add code to convert each recognized construct into the output language.
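For instance, once a settings-part such as "word 45 and 46 to hex 331" is recognized, it could be expanded into one assignment per word. This is only a sketch: the pattern covers just that one style, the (word, value) pairs are a stand-in for whatever the real output language needs, and treating a value with no "hex" or "oct" prefix as decimal is my assumption.

```python
import re

# Covers only "word(s) N [and N ...] to [hex|oct] VALUE"; other styles need more patterns.
WORDS_RE = re.compile(
    r"^words? (\d+(?: and \d+)*) to (?:(hex|oct) )?(\w+)$", re.IGNORECASE
)
BASES = {"hex": 16, "oct": 8}

def expand(settings):
    """Turn 'word 45 and 46 to hex 331' into one (word, value) pair per word."""
    m = WORDS_RE.match(settings)
    if m is None:
        raise ValueError(f"unrecognized settings-part: {settings!r}")
    words, numsys, digits = m.groups()
    # Assumption: a value without a number-system keyword is decimal.
    base = BASES[numsys.lower()] if numsys else 10
    value = int(digits, base)
    return [(int(w), value) for w in re.findall(r"\d+", words)]

print(expand("word 45 and 46 to hex 331"))
```

Raising on anything unrecognized keeps unknown styles visible instead of silently producing wrong output.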

Depending on your experience with Python, you can find ways to make the code concise.

Depending on what you actually want from these strings, you could use a parser library, e.g. parsimonious:

from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar

grammar = Grammar(
    r"""
    command     = set operand to? number (operator number)* middle? to? numsys? number
    operand     = (~r"words?" / "speed") ws
    middle      = (~r"[Ww]ords" / "bits")+ ws number
    to          = ws "to" ws
    number      = ws ~r"[-\d.]+" "kts"? ws
    numsys      = ws ("oct" / "hex") ws
    operator    = ws "and" ws
    set         = ~"[Ss]et" ws
    ws          = ~r"\s*"
    """
)

class HorribleStuff(NodeVisitor):
    def __init__(self):
        self.cmds = []

    def generic_visit(self, node, visited_children):
        pass

    def visit_operand(self, node, visited_children):
        self.cmds.append(('operand', node.text))

    def visit_number(self, node, visited_children):
        self.cmds.append(('number', node.text))


examples = ['Set word 45 and 46 to hex 331',
            'set words 45 and 46 bits 3..7 to 280',
            'set word 45 to oct 332',
            'set speed to 60kts Words 3-4 to hex 34']


for example in examples:
    tree = grammar.parse(example)
    hs = HorribleStuff()
    hs.visit(tree)
    print(hs.cmds)

This would yield

[('operand', 'word '), ('number', '45 '), ('number', '46 '), ('number', '331')]
[('operand', 'words '), ('number', '45 '), ('number', '46 '), ('number', '3..7 '), ('number', '280')]
[('operand', 'word '), ('number', '45 '), ('number', '332')]
[('operand', 'speed '), ('number', '60kts '), ('number', '3-4 '), ('number', '34')]
