Complex parsing of a string in Python

I want to parse a string with a format like this:

[{text1}]{quantity}[{text2}]

This rule means that the string starts with optional text, followed by a {quantity} whose syntax I describe just below, followed by more optional text.

The {quantity} can take a variety of forms, with {n} being any positive integer:

{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}

Also, it should accept this additional rule:

{n} {text2} 

In this rule, {n} is followed by a space, then {text2}.

In the cases where PC or PCS appears:

  • it may or may not be followed by a dot
  • matching is case insensitive
  • a space can optionally appear between {n} and PCS
  • the following are all stripped: PC or PCS, the optional dot, and the optional space

The desired output is normalized to two variables:

  • {n} as an integer
  • [{text1}] [{text2}], that is, first {text1} (if present), then a space, then {text2} (if present), concatenated into one string. A space to separate the text pieces is only used if both are present.

If the {quantity} includes anything besides a positive integer, {n} consists only of the integer, and the rest of {quantity} (e.g. " PCS.") is stripped from both {n} and the resultant text string.

In the text parts, more integers could appear. Any integer other than the {quantity} found should be regarded as just part of the text, not interpreted as another quantity.

I am a former C/C++ programmer. If I had to solve this with those languages, I would probably use rules in lex and yacc, or else I would have to write a lot of nasty code to hand-parse it.

I would like to learn a clean approach for coding this efficiently in Python, probably using rules in some form to easily support more cases. I think I could use lex and yacc with Python, but I wonder if there is an easier way. I'm a Python newbie; I don't even know where to start with this.

I am not asking anyone to write code for a complete solution. Rather, I need an approach or two, and perhaps some sample code showing part of how to do it.

Pyparsing lets you build up a parser by stitching together smaller parsers using the '+' and '|' operators (among others). You can also attach names to the individual elements in the parser, to make it easier to get at the values afterward.

from pyparsing import (pyparsing_common, CaselessKeyword, Optional, ungroup, restOfLine, 
    oneOf, SkipTo)

int_qty = pyparsing_common.integer

# compose an expression for the quantity, in its various forms
"""
{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}
"""
LOT = CaselessKeyword("lot")
OF = CaselessKeyword("of")
pieces = oneOf("PC PCS PC. PCS.", caseless=True)
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + Optional(pieces).suppress()

# compose expression for entire line
line_expr = SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2")

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    """

line_expr.runTests(tests)

Prints:

Send me 1000 widgets pronto!
['Send me', 1000, ' widgets pronto!']
- qty: 1000
- text1: ['Send me']
- text2:  widgets pronto!


Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
['Deliver a ', 50, ' barrels of maple syrup by Monday, June 10.']
- qty: 50
- text1: ['Deliver a ']
- text2:  barrels of maple syrup by Monday, June 10.


My shipment was short by 25 pcs.
['My shipment was short by', 25, '']
- qty: 25
- text1: ['My shipment was short by']
- text2: 
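
As a rough sketch of how the named results could be folded into the two values the question asks for, the snippet below rebuilds the same expressions and adds a helper called `normalize`. That name is illustrative and not part of pyparsing. The helper assumes the first token is always the skipped text, as in the printed results above, and that the line contains a quantity (otherwise `parseString` raises a `ParseException`).

```python
from pyparsing import (pyparsing_common, CaselessKeyword, Optional,
                       oneOf, SkipTo, restOfLine)

# Same expressions as in the answer above.
int_qty = pyparsing_common.integer
LOT = CaselessKeyword("lot")
OF = CaselessKeyword("of")
pieces = oneOf("PC PCS PC. PCS.", caseless=True)
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + Optional(pieces).suppress()
line_expr = SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2")


def normalize(line):
    """Hypothetical helper: return (n, text) for one input line."""
    result = line_expr.parseString(line)
    # text1 is read positionally because, as the printed output shows,
    # the named SkipTo result may be wrapped in a list.
    text1 = result[0].strip()
    text2 = result["text2"].strip()
    # Join with a single space only when both pieces are non-empty.
    text = " ".join(part for part in (text1, text2) if part)
    return result["qty"], text


print(normalize("Send me 1000 widgets pronto!"))
print(normalize("My shipment was short by 25 pcs."))
```

Stripping both pieces before joining keeps the result independent of whether the skipped text happens to carry trailing whitespace.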

EDIT: Pyparsing supports two forms of alternatives for matching: MatchFirst, which stops on the first matched alternative (defined using the '|' operator), and Or, which evaluates all alternatives and selects the longest match (defined using the '^' operator). So if you need the quantity expressions to be tried in a particular priority order, you define that order explicitly:

from pyparsing import White, FollowedBy  # additional names used by this revision

qty_pcs_expr = int_qty("qty") + White().suppress() + pieces.suppress()
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + FollowedBy(White())

# compose expression for entire line
line_expr = (SkipTo(qty_pcs_expr)("text1") + qty_pcs_expr + restOfLine("text2") |
             SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2"))

Here are the new tests:

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    2. I expect 22 pcs delivered in the morning
    On May 15 please deliver 1000 PCS.
    """

Giving:

2. I expect 22 pcs delivered in the morning
['2. I expect ', 22, ' delivered in the morning']
- qty: 22
- text1: ['2. I expect ']
- text2:  delivered in the morning


On May 15 please deliver 1000 PCS.
['On May 15 please deliver ', 1000, '']
- qty: 1000
- text1: ['On May 15 please deliver ']
- text2: 
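
The difference between '|' and '^' mentioned in the EDIT can be seen on a tiny case. This is only an illustrative sketch using two literals in which one is a prefix of the other; the variable names are made up for the example.

```python
from pyparsing import Literal

pc = Literal("PC")
pcs = Literal("PCS")

# '|' builds a MatchFirst: alternatives are tried left to right and the
# first one that matches wins, even if a later one would match more text.
first_match = pc | pcs

# '^' builds an Or: every alternative is tried and the longest match wins.
longest_match = pc ^ pcs

print(first_match.parseString("PCS").asList())    # ['PC']
print(longest_match.parseString("PCS").asList())  # ['PCS']
```

With '|', the order in which the alternatives are written is therefore the priority order, which is what the two-branch line_expr above relies on.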

I don't know if you want to use re, but here's a regular expression which I think works. You can change the input string to test it. The match's groups come back as a tuple holding the three values [{text1}]{quantity}[{text2}]. The first and last items in the tuple will be empty if text1 and text2 are missing.

import re

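s = "aSOETIHSIBSROG1PCS.ecsrGIR"  # renamed from str to avoid shadowing the built-in

# \d+ accepts multi-digit quantities; groups() returns (text1, quantity, text2)
match = re.search(r'([a-zA-Z]+|)(\d+PCS?\.?|Lot of \d+)([a-zA-Z]+|)', s)
print(match.groups())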

#Output
('aSOETIHSIBSROG', '1PCS.', 'ecsrGIR')
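
The middle group above still carries the PCS suffix as text. One possible way to get from a regex match to the integer and the joined text described in the question is sketched below. The function name `parse_quantity` and the pattern are assumptions made for illustration, and the sketch simply takes the leftmost integer in the line, so it does not give a "22 pcs" priority over an earlier bare number.

```python
import re

# Optional "lot of" prefix, the integer, then an optional PC/PCS suffix
# with an optional dot. \b stops "pc" from matching inside a longer word.
QTY = re.compile(r"(?:lot\s+of\s+)?(?P<n>\d+)(?:\s*pcs?\b\.?)?", re.IGNORECASE)


def parse_quantity(line):
    """Hypothetical helper: return (n, text), or None if no integer is found."""
    m = QTY.search(line)
    if m is None:
        return None
    text1 = line[:m.start()].strip()
    text2 = line[m.end():].strip()
    # A separating space is used only when both text pieces are present.
    text = " ".join(part for part in (text1, text2) if part)
    return int(m.group("n")), text


print(parse_quantity("aSOETIHSIBSROG1PCS.ecsrGIR"))  # (1, 'aSOETIHSIBSROG ecsrGIR')
print(parse_quantity("Lot of 50 barrels"))           # (50, 'barrels')
```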

Here's a rules processor using regex to match your cases. I create a custom match result class to hold the relevant values extracted from the input string. The rules processor tries the following rules in succession:

  • rule1 - tries to match {n} followed by one of: pc, pc., pcs, pcs.
  • rule2 - tries to match {n} prefaced by "lot of"
  • rule3 - matches {n} followed by {text2}

When run, the code listed after this output prints:

abc 23 PCS. def
amount=23 qtype=PCS. text1="abc" text2="def" rule=1
abc 23pc def
amount=23 qtype=pc text1="abc" text2="def" rule=1
abc 24pc.def
amount=24 qtype=pc. text1="abc" text2="def" rule=1
abc 24 pcs def
amount=24 qtype=pcs text1="abc" text2="def" rule=1
abc lot of 24 def
amount=24 qtype=lot of text1="abc" text2="def" rule=2
3 abcs
amount=3 qtype=None text1="" text2="abcs" rule=3

The code:

import re

class Match:
    def __init__(self, amount, qtype, text1, text2, rule):
        self.amount = int(amount)
        self.qtype = qtype
        self.text1 = text1
        self.text2 = text2
        self.rule = rule

    def __str__(self):
        return 'amount={} qtype={} text1="{}" text2="{}" rule={}'.format(
            self.amount, self.qtype, self.text1, self.text2, self.rule)

#{n} pc pc. pcs pcs.
def rule1(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*(?P<qtype>PCS?\.?)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), m.group('qtype'),
                     text1=s[:m.start()], text2=s[m.end():], rule=1)
    return None

#lot of {n}
def rule2(s):
    m = re.search(r"\s*lot of\s*(?P<amount>\d+)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), 'lot of',
                     text1=s[:m.start()], text2=s[m.end():], rule=2)
    return None

#{n} {text2}
def rule3(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*", s)
    if m:
        return Match(m.group('amount'), None,
                     text1=s[:m.start()], text2=s[m.end():], rule=3)
    return None

RULES = [rule1, rule2, rule3]

def process(s):
    for rule in RULES:
        m = rule(s)
        if m: return m
    return None


tests = [
"abc 23 PCS. def",
"abc 23pc def",
"abc 24pc.def",
"abc 24 pcs def",
"abc lot of 24 def",
"3 abcs"
]


for t in tests:
    m = process(t)
    print(t)
    print(m)
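
Since the three rule functions share the same shape, one possible variation is to keep the rules as a table of patterns so that supporting another case means adding one entry. The sketch below is an assumption about how that could look, not part of the code above; `RULE_PATTERNS` and `process_table` are made-up names, and a plain tuple stands in for the Match class to keep the example short.

```python
import re

# One pattern per rule, tried in order; list position gives the priority.
RULE_PATTERNS = [
    r"\s*(?P<amount>\d+)\s*(?P<qtype>PCS?\.?)\s*",   # {n} pc / pc. / pcs / pcs.
    r"\s*(?P<qtype>lot of)\s*(?P<amount>\d+)\s*",    # lot of {n}
    r"\s*(?P<amount>\d+)\s*",                        # {n} {text2}
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in RULE_PATTERNS]


def process_table(s):
    """Return (amount, qtype, text1, text2, rule_number), or None."""
    for number, pattern in enumerate(COMPILED, start=1):
        m = pattern.search(s)
        if m:
            # qtype is None for patterns that do not define that group.
            qtype = m.groupdict().get("qtype")
            return int(m.group("amount")), qtype, s[:m.start()], s[m.end():], number
    return None


for t in ["abc 23 PCS. def", "abc lot of 24 def", "3 abcs"]:
    print(process_table(t))
```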
