Complex parsing of a string in Python

I want to parse a string with a format like this:

[{text1}]{quantity}[{text2}]

This rule means that at the beginning there is some text that can optionally be there or not, followed by a {quantity} whose syntax I describe just below, followed by more optional text.

The {quantity} can take a variety of forms, with {n} being any positive integer:

{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}

Also, it should accept this additional rule:

{n} {text2} 

In this rule, {n} is followed by a space then {text2}

In the cases where PC or PCS appears

  • it may or may not be followed by a dot
  • case insensitive
  • a space can optionally appear between {n} and PCS
  • The following are all stripped: PC or PCS, the optional dot, and the optional space

The desired output is normalized to two variables:

  • {n} as an integer
  • [{text1}] [{text2}], that is, first {text1} (if present), then a space, then {text2} (if present), concatenated to one string. A space to separate the text pieces is only used if there are two of them.

If the {quantity} includes anything besides a positive integer, {n} consists only of the integer, and the rest of {quantity} (e.g. " PCS.") is stripped from both {n} and the resultant text string.

In the text parts, more integers could appear. Any other than the {quantity} found should be regarded as just part of the text, not interpreted as another quantity.

I am a former C/C++ programmer. If I had to solve this with those languages, I would probably use rules in lex and yacc, or else I would have to write a lot of nasty code to hand-parse it.

I would like to learn a clean approach for coding this efficiently in Python, probably using rules in some form to easily support more cases. I think I could use lex and yacc with Python, but I wonder if there is an easier way. I'm a Python newbie; I don't even know where to start with this.

I am not asking anyone to write code for a complete solution, rather, I need an approach or two, and perhaps some sample code showing part of how to do it.

Pyparsing lets you build up a parser by stitching together smaller parsers using '+' and '|' operators (among others). You can also attach names to the individual elements in the parser, to make it easier to get at the values afterward.

from pyparsing import (pyparsing_common, CaselessKeyword, Optional, ungroup, restOfLine, 
    oneOf, SkipTo)

int_qty = pyparsing_common.integer

# compose an expression for the quantity, in its various forms
"""
{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}
"""
LOT = CaselessKeyword("lot")
OF = CaselessKeyword("of")
pieces = oneOf("PC PCS PC. PCS.", caseless=True)
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + Optional(pieces).suppress()

# compose expression for entire line
line_expr = SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2")

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    """

line_expr.runTests(tests)

Prints:

Send me 1000 widgets pronto!
['Send me', 1000, ' widgets pronto!']
- qty: 1000
- text1: ['Send me']
- text2:  widgets pronto!


Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
['Deliver a ', 50, ' barrels of maple syrup by Monday, June 10.']
- qty: 50
- text1: ['Deliver a ']
- text2:  barrels of maple syrup by Monday, June 10.


My shipment was short by 25 pcs.
['My shipment was short by', 25, '']
- qty: 25
- text1: ['My shipment was short by']
- text2: 

EDIT: Pyparsing supports two forms of alternatives for matching: MatchFirst, which stops on the first matched alternative (defined using the '|' operator), and Or, which evaluates all alternatives and selects the longest match (defined using the '^' operator). So if one form of the quantity expression needs priority over another, define that priority explicitly by listing it first:

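from pyparsing import White, FollowedBy

# quantity with an explicit unit, e.g. "22 pcs"; this form is tried first
qty_pcs_expr = int_qty("qty") + White().suppress() + pieces.suppress()
# bare quantity (optionally prefixed by "Lot of") that is followed by whitespace
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + FollowedBy(White())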

# compose expression for entire line
line_expr = (SkipTo(qty_pcs_expr)("text1") + qty_pcs_expr + restOfLine("text2") |
             SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2"))

Here are the new tests:

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    2. I expect 22 pcs delivered in the morning
    On May 15 please deliver 1000 PCS.
    """

Giving:

2. I expect 22 pcs delivered in the morning
['2. I expect ', 22, ' delivered in the morning']
- qty: 22
- text1: ['2. I expect ']
- text2:  delivered in the morning


On May 15 please deliver 1000 PCS.
['On May 15 please deliver ', 1000, '']
- qty: 1000
- text1: ['On May 15 please deliver ']
- text2: 
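
The first-match versus longest-match distinction is not specific to pyparsing. As a rough point of comparison (an analogy only, not pyparsing code), alternation in Python's standard `re` module is also first-match, which is why the order of alternatives matters there too. The variable names below are made up for illustration.

```python
import re

# Ordered alternation: the first alternative that matches wins,
# even if a later alternative would have matched more text.
first_match = re.match(r"PC|PCS", "PCS").group()

# Listing the longer alternative first yields the longer match.
longest_first = re.match(r"PCS|PC", "PCS").group()

# An optional suffix avoids depending on the order at all.
optional_s = re.match(r"PCS?", "PCS").group()

print(first_match, longest_first, optional_s)  # PC PCS PCS
```

This is the same reason the more specific quantity form is listed before the more general one in the '|' expression above.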

I don't know if you want to use re, but here's a regular expression which I think works. You can change the value of s to test it. Calling groups() on the match returns a tuple with the three values [{text1}]{quantity}[{text2}]. The first and last items in the tuple will be empty strings if text1 and text2 are missing.

import re

s = "aSOETIHSIBSROG1PCS.ecsrGIR"

# optional letters, then the quantity (one or more digits), then optional letters
match = re.search(r'([a-zA-Z]*)(\d+PCS?\.?|Lot of \d+)([a-zA-Z]*)', s)
print(match.groups())

#Output
('aSOETIHSIBSROG', '1PCS.', 'ecsrGIR')
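
The tuple above still holds the quantity as text such as '1PCS.', while the question asks for {n} as an integer. A minimal sketch of that extra step follows. It assumes a slightly widened variant of the pattern (case-insensitive, with an optional space before PC/PCS); `PATTERN` and `split_quantity` are hypothetical names, not part of the answer above.

```python
import re

# Assumed variant of the pattern: multi-digit numbers, an optional space
# before PC/PCS, an optional trailing dot, and case-insensitive matching.
PATTERN = re.compile(
    r"([a-zA-Z]*)(\d+ ?PCS?\.?|Lot of \d+)([a-zA-Z]*)",
    re.IGNORECASE,
)

def split_quantity(s):
    """Return (text1, n, text2) with n as an int, or None if nothing matches."""
    m = PATTERN.search(s)
    if m is None:
        return None
    text1, quantity, text2 = m.groups()
    # Keep only the digits of the quantity; the unit and the dot are dropped.
    n = int(re.search(r"\d+", quantity).group())
    return text1, n, text2

print(split_quantity("aSOETIHSIBSROG1PCS.ecsrGIR"))
print(split_quantity("Lot of 50"))
```

Like the original pattern, this sketch only allows letters in the text parts, so text containing spaces or punctuation would need wider character classes.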

Here's a rules processor using regex to match your two cases. I create a custom match result class to hold relevant extracted values from the input string. The rules processor tries the following rules in succession:

  • rule1 - tries to match {n} followed by one of pc, pc., pcs, or pcs.
  • rule2 - tries to match {n} prefaced by "lot of"
  • rule3 - matches {n} followed by {text2}

When run, the script prints:

abc 23 PCS. def
amount=23 qtype=PCS. text1="abc" text2="def" rule=1
abc 23pc def
amount=23 qtype=pc text1="abc" text2="def" rule=1
abc 24pc.def
amount=24 qtype=pc. text1="abc" text2="def" rule=1
abc 24 pcs def
amount=24 qtype=pcs text1="abc" text2="def" rule=1
abc lot of 24 def
amount=24 qtype=lot of text1="abc" text2="def" rule=2
3 abcs
amount=3 qtype=None text1="" text2="abcs" rule=3

The full script:

import re

class Match:
    def __init__(self, amount, qtype, text1, text2, rule):
        self.amount = int(amount)
        self.qtype = qtype
        self.text1 = text1
        self.text2 = text2
        self.rule = rule

    def __str__(self):
        return 'amount={} qtype={} text1="{}" text2="{}" rule={}'.format(
            self.amount, self.qtype, self.text1, self.text2, self.rule)

#{n} pc pc. pcs pcs.
def rule1(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*(?P<qtype>PCS?\.?)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), m.group('qtype'),
                     text1=s[:m.start()], text2=s[m.end():], rule=1)
    return None

#lot of {n}
def rule2(s):
    m = re.search(r"\s*lot of\s*(?P<amount>\d+)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), 'lot of',
                     text1=s[:m.start()], text2=s[m.end():], rule=2)
    return None

#{n} {text2}
def rule3(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*", s)
    if m:
        return Match(m.group('amount'), None,
                     text1=s[:m.start()], text2=s[m.end():], rule=3)
    return None

RULES = [rule1, rule2, rule3]

def process(s):
    for rule in RULES:
        m = rule(s)
        if m: return m
    return None


tests = [
"abc 23 PCS. def",
"abc 23pc def",
"abc 24pc.def",
"abc 24 pcs def",
"abc lot of 24 def",
"3 abcs"
]


for t in tests:
    m = process(t)
    print(t)
    print(m)
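
Since the three rule functions differ only in their pattern, one possible variation is to keep the rules in a table, so that supporting a new case means adding one entry. The sketch below is an assumption about how that could look, not part of the script above. `RULE_TABLE` and `process_table` are hypothetical names, and the function returns the normalized pair the question asks for ({n} as an integer and the joined text) plus the number of the rule that matched.

```python
import re

# (rule number, compiled pattern); list order defines priority.
RULE_TABLE = [
    (1, re.compile(r"\s*(?P<amount>\d+)\s*PCS?\.?\s*", re.IGNORECASE)),
    (2, re.compile(r"\s*lot of\s*(?P<amount>\d+)\s*", re.IGNORECASE)),
    (3, re.compile(r"\s*(?P<amount>\d+)\s*")),
]

def process_table(s):
    """Return (n, text, rule number) for the first rule that matches, else None."""
    for number, pattern in RULE_TABLE:
        m = pattern.search(s)
        if m:
            text1, text2 = s[:m.start()], s[m.end():]
            # Join with a single space only when both text pieces are present.
            text = " ".join(part for part in (text1, text2) if part)
            return int(m.group("amount")), text, number
    return None

for t in ["abc 23 PCS. def", "abc lot of 24 def", "3 abcs"]:
    print(t, "->", process_table(t))
```

Because the rules are tried in order, a string such as "2. I expect 22 pcs delivered" is handled by rule 1 (quantity 22) before rule 3 gets a chance to pick up the leading "2".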
