Complex parsing of a string in Python

Question

I want to parse a string with a format like this:

[{text1}]{quantity}[{text2}]

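This rule means that there may optionally be some text at the start, followed by a {quantity} whose syntax is described below, followed by more optional text.

{quantity} can take several forms, where {n} is any positive integer: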

{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}

In addition, it should accept the following extra rule:

{n} {text2}

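In this rule, {n} is followed by a space and then {text2}.

In the cases where PC or PCS appears:

it may or may not be followed by a dot
it is case-insensitive
there may optionally be a space between {n} and PCS
the following are all stripped: PC or PCS, the optional dot, and the optional space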

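The desired output is normalized into two variables:

{n} as an integer
[{text1}] [{text2}], that is, joined into a single string: first {text1} (if present), then a space, then {text2} (if present). The space separating the two texts is used only if both of them are present.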
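
As a rough illustration of the output contract just described, here is a minimal sketch. It assumes the three pieces have already been split apart by some parser; the helper name `normalize` is hypothetical and only shows the joining rule.

```python
def normalize(text1, n, text2):
    """Return (n as an int, text1 and text2 joined by one space if both exist)."""
    # Keep only the text parts that are non-empty after trimming whitespace.
    parts = [part.strip() for part in (text1, text2) if part and part.strip()]
    # " ".join inserts the separating space only when two parts remain.
    return int(n), " ".join(parts)


print(normalize("abc", "23", "def"))
print(normalize("", "3", "abcs"))
```

When only one of the texts is present, the join returns it unchanged, so no stray space appears.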

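If the {quantity} includes anything besides the positive integer, {n} consists of the integer only, and the rest of the {quantity} (e.g. " PCS.") is stripped from both {n} and the resulting text string.

More integers may appear in the text parts. Anything other than the {quantity} that was found should be treated as part of the text and not be interpreted as another quantity.

I am a former C/C++ programmer. If I had to solve this problem in those languages, I would probably either use rules in lex and yacc, or else have to write a lot of nasty code to parse it by hand.

I would like to learn a clean approach for coding this efficiently in Python, probably using rules in some form so that more cases can be supported easily. I think I could use lex and yacc with Python, but I wonder whether there is an easier way. I am a Python newbie; I don't even know where to start.

I am not asking anyone to write code for a complete solution; rather, I need an approach or two, and perhaps some sample code showing how to do it.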

Answer 1

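With pyparsing, you build up a parser by stitching together smaller parsers using the '+' and '|' operators (among others). You can also attach names to individual elements in the parser, which makes it easier to get at the values afterwards.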

from pyparsing import (pyparsing_common, CaselessKeyword, Optional, ungroup, restOfLine, 
    oneOf, SkipTo)

int_qty = pyparsing_common.integer

# compose an expression for the quantity, in its various forms
"""
{n}
{n}PCS
{n}PC
{n}PCS.
{n}PC.
Lot of {n}
"""
LOT = CaselessKeyword("lot")
OF = CaselessKeyword("of")
pieces = oneOf("PC PCS PC. PCS.", caseless=True)
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + Optional(pieces).suppress()

# compose expression for entire line
line_expr = SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2")

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    """

line_expr.runTests(tests)

Prints:

Send me 1000 widgets pronto!
['Send me', 1000, ' widgets pronto!']
- qty: 1000
- text1: ['Send me']
- text2:  widgets pronto!


Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
['Deliver a ', 50, ' barrels of maple syrup by Monday, June 10.']
- qty: 50
- text1: ['Deliver a ']
- text2:  barrels of maple syrup by Monday, June 10.


My shipment was short by 25 pcs.
['My shipment was short by', 25, '']
- qty: 25
- text1: ['My shipment was short by']
- text2:

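EDIT: Pyparsing supports two forms of alternative matching: MatchFirst, which stops on the first matching alternative (defined using the '|' operator), and Or, which evaluates all alternatives and selects the longest match (defined using the '^' operator). So if you need a precedence among the quantity expressions, you define it explicitly: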

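# White and FollowedBy are needed in addition to the earlier imports
from pyparsing import White, FollowedBy

qty_pcs_expr = int_qty("qty") + White().suppress() + pieces.suppress()
qty_expr = Optional(LOT + OF).suppress() + int_qty("qty") + FollowedBy(White())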

# compose expression for entire line
line_expr = (SkipTo(qty_pcs_expr)("text1") + qty_pcs_expr + restOfLine("text2") |
             SkipTo(qty_expr)("text1") + qty_expr + restOfLine("text2"))

Here are the new tests:

tests = """
    Send me 1000 widgets pronto!
    Deliver a Lot of 50 barrels of maple syrup by Monday, June 10.
    My shipment was short by 25 pcs.
    2. I expect 22 pcs delivered in the morning
    On May 15 please deliver 1000 PCS.
    """

Giving:

2. I expect 22 pcs delivered in the morning
['2. I expect ', 22, ' delivered in the morning']
- qty: 22
- text1: ['2. I expect ']
- text2:  delivered in the morning


On May 15 please deliver 1000 PCS.
['On May 15 please deliver ', 1000, '']
- qty: 1000
- text1: ['On May 15 please deliver ']
- text2:
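
To make the first-match versus longest-match distinction concrete without needing pyparsing installed, here is a rough analogy in plain `re`. The names `ALTERNATIVES`, `match_first` and `match_longest` are made up for this sketch, and this is not how pyparsing implements MatchFirst or Or; it only mimics the observable difference.

```python
import re

# Two alternatives that can both match at the same position.
ALTERNATIVES = [
    re.compile(r"\d+"),                           # bare number
    re.compile(r"\d+\s*pcs?\.?", re.IGNORECASE),  # number plus PC/PCS suffix
]


def match_first(text, pos=0):
    """Return the text of the first alternative that matches at pos."""
    for pattern in ALTERNATIVES:
        m = pattern.match(text, pos)
        if m:
            return m.group()
    return None


def match_longest(text, pos=0):
    """Try every alternative at pos and return the longest matched text."""
    candidates = (pattern.match(text, pos) for pattern in ALTERNATIVES)
    matched = [m.group() for m in candidates if m]
    return max(matched, key=len) if matched else None


print(match_first("25 pcs."))    # the bare-number alternative wins
print(match_longest("25 pcs."))  # the suffix alternative wins
```

With first-match semantics the order of the alternatives decides the result, which is why listing the more specific quantity expression first matters.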

Answer 2

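I don't know whether you want to use re, but here is a regular expression that I think works. You can change the test string to try other inputs. The match yields a tuple of three values: [{text1}], {quantity}, [{text2}]. If text1 or text2 is missing, the first or last item of the tuple will be an empty string.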

import re

s = "aSOETIHSIBSROG1PCS.ecsrGIR"

# \d+ accepts multi-digit quantities; groups() is called once, on the match object
match = re.search(r'([a-zA-Z]+|)(\d+PCS?\.?|Lot of \d+)([a-zA-Z]+|)', s)
print(match.groups())

#Output
('aSOETIHSIBSROG', '1PCS.', 'ecsrGIR')
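
The tuple above still carries the quantity as the raw string '1PCS.'. A possible extension, sketched here under a few assumptions, uses named groups so that {n} comes back as an integer and the suffix is dropped. The names `QUANTITY` and `parse_quantity` are hypothetical, and the sketch simply takes the first quantity it finds, which may not be the right choice when the leading text contains other integers.

```python
import re

# Either "lot of N", or N optionally followed by a space, PC/PCS and a dot.
QUANTITY = re.compile(
    r"\blot\s+of\s+(?P<lot_n>\d+)|(?P<n>\d+)(?:\s?PCS?\b\.?)?",
    re.IGNORECASE,
)


def parse_quantity(s):
    """Return (n, joined_text) for the first quantity in s, or None."""
    m = QUANTITY.search(s)
    if m is None:
        return None
    # Exactly one of the two named groups is set when the pattern matches.
    n = int(m.group("lot_n") or m.group("n"))
    # Everything outside the matched quantity is text; drop empty pieces.
    parts = [s[:m.start()].strip(), s[m.end():].strip()]
    return n, " ".join(p for p in parts if p)


print(parse_quantity("aSOETIHSIBSROG1PCS.ecsrGIR"))
print(parse_quantity("My shipment was short by 25 pcs."))
```

The `\b` after the suffix keeps a word such as "pcsx" from being treated as PCS, in which case only the integer is consumed.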

Answer 3

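Here is a rules processor that uses regular expressions to match the different cases. I created a custom match-result class to hold the relevant values extracted from the input string. The rules processor tries the following rules in succession:

rule1 - tries to match {n} followed by one of pc, pc., pcs or pcs.
rule2 - tries to match {n} preceded by "lot of"
rule3 - matches {n} followed by {text2}

When run, this produces: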

abc 23 PCS. def
amount=23 qtype=PCS. text1="abc" text2="def" rule=1
abc 23pc def
amount=23 qtype=pc text1="abc" text2="def" rule=1
abc 24pc.def
amount=24 qtype=pc. text1="abc" text2="def" rule=1
abc 24 pcs def
amount=24 qtype=pcs text1="abc" text2="def" rule=1
abc lot of 24 def
amount=24 qtype=lot of text1="abc" text2="def" rule=2
3 abcs
amount=3 qtype=None text1="" text2="abcs" rule=3

import re

class Match:
    def __init__(self, amount, qtype, text1, text2, rule):
        self.amount = int(amount)
        self.qtype = qtype
        self.text1 = text1
        self.text2 = text2
        self.rule = rule

    def __str__(self):
        return 'amount={} qtype={} text1="{}" text2="{}" rule={}'.format(
            self.amount, self.qtype, self.text1, self.text2, self.rule)

#{n} pc pc. pcs pcs.
def rule1(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*(?P<qtype>PCS?\.?)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), m.group('qtype'),
                     text1=s[:m.start()], text2=s[m.end():], rule=1)
    return None

#lot of {n}
def rule2(s):
    m = re.search(r"\s*lot of\s*(?P<amount>\d+)\s*", s, re.IGNORECASE)
    if m:
        return Match(m.group('amount'), 'lot of',
                     text1=s[:m.start()], text2=s[m.end():], rule=2)
    return None

#{n} {text2}
def rule3(s):
    m = re.search(r"\s*(?P<amount>\d+)\s*", s)
    if m:
        return Match(m.group('amount'), None,
                     text1=s[:m.start()], text2=s[m.end():], rule=3)
    return None

RULES = [rule1, rule2, rule3]

def process(s):
    for rule in RULES:
        m = rule(s)
        if m: return m
    return None


tests = [
"abc 23 PCS. def",
"abc 23pc def",
"abc 24pc.def",
"abc 24 pcs def",
"abc lot of 24 def",
"3 abcs"
]


for t in tests:
    m = process(t)
    print(t)
    print(m)
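
Since the three rule functions differ only in their pattern and label, one possible variation is to drive them from a table, so that supporting another case means adding one entry. This is a sketch rather than part of the code above: `RULE_PATTERNS` and `process_table` are hypothetical names, and it returns a plain tuple instead of the Match class.

```python
import re

# Rules are tried in order; each entry is (label, compiled pattern).
RULE_PATTERNS = [
    ("pcs", re.compile(r"\s*(?P<amount>\d+)\s*PCS?\.?\s*", re.IGNORECASE)),
    ("lot of", re.compile(r"\s*lot of\s*(?P<amount>\d+)\s*", re.IGNORECASE)),
    ("bare", re.compile(r"\s*(?P<amount>\d+)\s*")),
]


def process_table(s):
    """Return (amount, label, text1, text2) for the first rule that matches."""
    for label, pattern in RULE_PATTERNS:
        m = pattern.search(s)
        if m:
            # Text before and after the matched quantity becomes text1/text2.
            return int(m.group("amount")), label, s[:m.start()], s[m.end():]
    return None


for t in ["abc 24pc.def", "abc lot of 24 def", "3 abcs"]:
    print(t, "->", process_table(t))
```

The order of the entries still encodes precedence, just as the order of the RULES list does.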

3 answers

Answer 1
2 accepted 2016-06-15 20:04:35

Answer 2
1 2016-06-15 20:13:56

Answer 3
0 2016-06-15 20:47:15
