Tokenize string with field of specific length in pyparsing

I am writing a simple parser for ASCII data in which every line must be interpreted as a sequence of fields, each a block of 8 characters:

"""
|--1---||--2---||--3---||--4---||--5---||--6---||--7---||--8---||--9---|
GRID         119           18.27  562.33  528.87
"""

This line should be interpreted as:

1: GRID + 4 blank spaces
2: 5 blank spaces + 119
3: 8 blank spaces
4: 3 blank spaces + 18.27
5: 2 blank spaces + 562.33
6: 2 blank spaces + 528.87
7: 8 blank spaces
8: 8 blank spaces
9: 8 blank spaces

This is what I tried:

EOL = LineEnd().suppress()
card_keyword = Keyword("GRID").leaveWhitespace().suppress()
number_card_fields = (number + ZeroOrMore(White()))
empty_card_fields = 8 * White()
card_fields = (number_card_fields | empty_card_fields)
card = (card_keyword + OneOrMore(card_fields)).setParseAction(self._card_to_dict)


def _card_to_dict(self, toks):
    _FIELDS_MAPPING = {
        0: "id", 1: "cp", 2: "x1", 3: "x2", 4: "x3", 5: "cd", 6: "ps", 7: "seid"
    }
    mapped_card = {self._FIELDS_MAPPING[idx]: token_field for idx, token_field in enumerate(toks)}
    return mapped_card

test2 = """
GRID         119           18.27  562.33  528.87                        
"""
print(card.searchString(test2))

This returns:

[[{'id': 119, 'cp': '           ', 'x1': 18.27, 'x2': '  ', 'x3': 562.33, 'cd': '  ', 'ps': 528.87, 'seid': '                        \n'}]]

What I would like to get instead is:

[[{'id': 119, 'cp': '        ', 'x1': 18.27, 'x2': 562.33, 'x3': 528.87, 'cd': '        ', 'ps': '        ', 'seid': '        '}]]

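I think the problem lies in number_card_fields = (number + ZeroOrMore(White())). I don't know how to tell pyparsing that this expression must be exactly 8 characters long.

Can someone help me? Thanks in advance for your valuable support.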

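Pyparsing lets you define a Word of an exact length. Since your lines consist of fixed-size fields, your "words" are made up of any printable or space character, with an exact size of 8: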

field = Word(printables + " ", exact=8)

Here is a parser for your input line:

import pyparsing as pp
# clear out whitespace characters - pretty much disables whitespace skipping
pp.ParserElement.setDefaultWhitespaceChars('')

# define an expression that matches exactly 8 printable or space characters
field = pp.Word(pp.printables + " ", exact=8).setName('field')

# a line has one or more fields
parser = field[1, ...]

# try it out
line = "GRID         119           18.27  562.33  528.87"

print(parser.parseString(line).asList())

Prints:

['GRID    ', '     119', '        ', '   18.27', '  562.33', '  528.87']
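As a cross-check on the field boundaries, the same split can be reproduced with plain string slicing. This is only a sketch for verifying the expected chunks, not part of the pyparsing solution; `split_fixed_width` is a hypothetical helper name, and the sample line is rebuilt from padded pieces so the column layout is explicit.

```python
def split_fixed_width(line, width=8):
    """Cut a line into consecutive chunks of `width` characters."""
    return [line[i:i + width] for i in range(0, len(line), width)]

# Rebuild the sample line field by field: keyword left-aligned,
# numbers right-aligned, one blank field in third position.
line = (
    "GRID".ljust(8) + "119".rjust(8) + " " * 8
    + "18.27".rjust(8) + "562.33".rjust(8) + "528.87".rjust(8)
)

print(split_fixed_width(line))
# ['GRID    ', '     119', '        ', '   18.27', '  562.33', '  528.87']
```

One difference to keep in mind: slicing returns a shorter final chunk when the line length is not a multiple of 8, whereas Word(..., exact=8) does not match a trailing chunk of fewer than 8 characters.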

I find those spaces annoying, so we can add a parse action to field to strip them:

# add a parse action to field to strip leading and trailing spaces
field.addParseAction(lambda t: t[0].strip())
print(parser.parseString(line).asList())

Which now gives:

['GRID', '119', '', '18.27', '562.33', '528.87']

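It looks like you expect 8 fields in total, and that you want the numeric fields converted to floats. Here is a modified version of your _card_to_dict parse action: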

def str_to_value(s):
    if not s:
        return None
    try:
        return float(s)
    except ValueError:
        return s

def _card_to_dict(toks):
    _FIELDS_MAPPING = {
        0: "id", 1: "cp", 2: "x1", 3: "x2", 4: "x3", 5: "cd", 6: "ps", 7: "seid"
    }
    
    # this is one way to do it, but you can just add the names to toks
    # mapped_card = {self._FIELDS_MAPPING[idx]: token_field for idx, token_field in enumerate(toks)}
    for idx, token_field in enumerate(toks):
        toks[_FIELDS_MAPPING[idx]] = str_to_value(token_field)

parser.addParseAction(_card_to_dict)
result = parser.parseString(line)

You can convert this result to a dict:

print(result.asDict())

Prints:

{'cd': 528.87, 'x2': 18.27, 'id': 'GRID', 'cp': 119.0, 'x1': None, 'x3': 562.33}
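The mix of strings, floats and None in that dictionary comes from str_to_value. Its behavior can be checked in isolation on the stripped tokens, without pyparsing. The token list below is copied from the output shown earlier; nothing else is assumed.

```python
def str_to_value(s):
    """Blank -> None, numeric text -> float, anything else unchanged."""
    if not s:
        return None
    try:
        return float(s)
    except ValueError:
        return s

# Stripped tokens as printed by the parser above.
tokens = ['GRID', '119', '', '18.27', '562.33', '528.87']

print([str_to_value(t) for t in tokens])
# ['GRID', 119.0, None, 18.27, 562.33, 528.87]
```

Note that the integer-looking field '119' becomes the float 119.0; if integers should stay integers, an int(s) attempt before float(s) would be one way to handle it.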

If you dump the result using:

print(result.dump())

You get:

['GRID', '119', '', '18.27', '562.33', '528.87']
- cd: 528.87
- cp: 119.0
- id: 'GRID'
- x1: None
- x2: 18.27
- x3: 562.33

This shows how you can access the parsed results directly, without converting to a dict:

print(result['x2'])
print(result.id)

Prints:

18.27
GRID
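In the output above the keyword GRID ends up under the name id, because it is token 0, and the trailing blank fields are missing because the sample line is only 48 characters long. If the goal is the mapping from the question (id is 119, with eight named fields after the keyword), one possible variation is to pad the line to 9 fields and name only the tokens after the first. This is a sketch under those assumptions: `parse_card`, `FIELD_NAMES`, `FIELD_WIDTH` and `FIELDS_PER_LINE` are hypothetical names, and the conversion is done in plain Python after parsing rather than in a parse action.

```python
import pyparsing as pp

# Same setting as above: an empty set disables whitespace skipping.
pp.ParserElement.setDefaultWhitespaceChars('')

FIELD_WIDTH = 8
FIELDS_PER_LINE = 9  # assumption: keyword + 8 data fields, as in the question
FIELD_NAMES = ["id", "cp", "x1", "x2", "x3", "cd", "ps", "seid"]

field = pp.Word(pp.printables + " ", exact=FIELD_WIDTH)
parser = pp.OneOrMore(field)

def str_to_value(s):
    """Blank -> None, numeric text -> float, anything else unchanged."""
    if not s:
        return None
    try:
        return float(s)
    except ValueError:
        return s

def parse_card(line):
    """Return (keyword, {field name: value}) for one fixed-width line."""
    # Pad with spaces so trailing blank fields are present in the input.
    padded = line.rstrip("\n").ljust(FIELD_WIDTH * FIELDS_PER_LINE)
    tokens = [t.strip() for t in parser.parseString(padded).asList()]
    keyword, values = tokens[0], tokens[1:]
    return keyword, {n: str_to_value(v) for n, v in zip(FIELD_NAMES, values)}

line = (
    "GRID".ljust(8) + "119".rjust(8) + " " * 8
    + "18.27".rjust(8) + "562.33".rjust(8) + "528.87".rjust(8)
)
keyword, card = parse_card(line)
print(keyword)  # GRID
print(card)
# {'id': 119.0, 'cp': None, 'x1': 18.27, 'x2': 562.33, 'x3': 528.87,
#  'cd': None, 'ps': None, 'seid': None}
```

The conversion is kept outside a parse action on purpose: when a pyparsing parse action returns None, the original tokens are left unchanged, so a per-field action that returns str_to_value(t[0]) would silently keep blank strings instead of producing None.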
