
pyparsing syntax tree from named value list

I'd like to parse tag/value descriptions using the delimiters ":", "," and "•".

E.g. the input would be:

Name:Test•Title: Test•Keywords: A,B,C

The expected result should be the name/value dict

{
"name": "Test",
"title": "Test",
"keywords": "A,B,C"
}

potentially already splitting the keywords in "A,B,C" into a list. (This is a minor detail, since Python's built-in str.split method will happily do this.)

Also applying a mapping

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}

as a mapping between names and dict keys would be helpful but could be a separate step.
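The two post-processing steps mentioned above (renaming the keys and splitting the keywords) can be sketched independently of any parser. This is only a minimal illustration; the `raw` dict is an assumed stand-in for whatever the parser eventually returns.

```python
# Hypothetical post-processing step; `raw` stands in for the parser output.
keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}
raw = {"Name": "Test", "Title": "Test", "Keywords": "A,B,C"}

# Rename the keys; tags missing from the mapping keep their original name.
mapped = {keys.get(tag, tag): value for tag, value in raw.items()}

# Split the comma-separated keywords into a list.
mapped["keywords"] = [k.strip() for k in mapped["keywords"].split(",")]

print(mapped)
```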

I tried the code below, see https://trinket.io/python3/8dbbc783c7

# pyparsing named values
# Wolfgang Fahl
# 2023-01-28 for Stackoverflow question
import pyparsing as pp
notes_text="Name:Test•Title: Test•Keywords: A,B,C"
keys={
  "Name": "name",
  "Titel": "title",
  "Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"
name_values_grammar=pp.delimited_list(
  pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
  +":"+pp.Suppress(pp.Optional(pp.White()))
  +pp.delimited_list(
    pp.OneOrMore(pp.Word(pp.printables+" ", exclude_chars=",:"))
        ,delim=",")("value")
    ,delim=runDelim).setResultsName("tag", list_all_matches=True)
results=name_values_grammar.parseString(notes_text)
print(results.dump())

and variations of it, but I am not even close to the expected result. Currently the dump shows:

['Name', ':', 'Test']
 - key: 'Name'
 - tag: [['Name', ':', 'Test']]
  [0]:
    ['Name', ':', 'Test']
 - value: ['Test']

It seems I don't know how to define the grammar and work with the ParseResults in a way that yields the needed dict.

The main questions for me are:

  • Should I use parse actions?
  • How is the naming of part results done?
  • How is the navigation of the resulting tree done?
  • How is it possible to get the list back from delimitedList?
  • What does list_all_matches=True achieve? Its behavior seems strange.

I searched for answers to the above questions here on Stack Overflow and couldn't find a consistent picture of what to do.

PyParsing seems to be a great tool, but I find it very unintuitive. Fortunately there are lots of answers here, so I hope to learn how to get this example working.

Trying it myself, I took a stepwise approach:

First I checked the delimitedList behavior, see https://trinket.io/python3/25e60884eb

# Try out pyparsing delimitedList
# WF 2023-01-28
from pyparsing import printables, OneOrMore, Word, delimitedList

notes_text="A,B,C"

comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")

grammar = comma_separated_values
result=grammar.parseString(notes_text)
print(f"result:{result}")
print(f"dump:{result.dump()}")
print(f"asDict:{result.asDict()}")
print(f"asList:{result.asList()}")

which returns

result:['A', 'B', 'C']
dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
asDict:{'clist': ['A', 'B', 'C']}
asList:['A', 'B', 'C']

which looks promising. The key success factor seems to be naming this list "clist"; the default behavior looks fine.

https://trinket.io/python3/bc2517e25a shows in more detail where the problem is.

# Try out pyparsing delimitedList
# see https://stackoverflow.com/q/75266188/1497139
# WF 2023-01-28
from pyparsing import printables, oneOf, OneOrMore,Optional, ParseResults, Suppress,White, Word, delimitedList

def show_result(title:str,result:ParseResults):
  """
  show pyparsing result details
  
  Args:
     title(str): label printed before the result details
     result(ParseResults): the parse result to show
  """
  print(f"result for {title}:")
  print(f"  result:{result}")
  print(f"  dump:{result.dump()}")
  print(f"  asDict:{result.asDict()}")
  print(f"  asList:{result.asList()}")
  # asXML is deprecated and doesn't work any more
  # print(f"asXML:{result.asXML()}")

notes_text="Name:Test•Title: Test•Keywords: A,B,C"
comma_text="A,B,C"

keys={
  "Name": "name",
  "Titel": "title",
  "Keywords": "keywords",
}
keywords=list(keys.keys())
runDelim="•"

comma_separated_values=delimitedList(Word(printables+" ", exclude_chars=",:"),delim=",")("clist")

cresult=comma_separated_values.parseString(comma_text)
show_result("comma separated values",cresult)

grammar=delimitedList(
   oneOf(keywords,as_keyword=True)
  +Suppress(":"+Optional(White()))
  +comma_separated_values
  ,delim=runDelim
)("namevalues")

nresult=grammar.parseString(notes_text)
show_result("name value list",nresult)

#ogrammar=OneOrMore(
#   oneOf(keywords,as_keyword=True)
#  +Suppress(":"+Optional(White()))
#  +comma_separated_values
#)
#oresult=grammar.parseString(notes_text)
#show_result("name value list with OneOf",nresult)

output:

result for comma separated values:
  result:['A', 'B', 'C']
  dump:['A', 'B', 'C']
- clist: ['A', 'B', 'C']
  asDict:{'clist': ['A', 'B', 'C']}
  asList:['A', 'B', 'C']
result for name value list:
  result:['Name', 'Test']
  dump:['Name', 'Test']
- clist: ['Test']
- namevalues: ['Name', 'Test']
  asDict:{'clist': ['Test'], 'namevalues': ['Name', 'Test']}
  asList:['Name', 'Test']

While the first result makes sense to me, the second is unintuitive. I'd expected a nested result: a dict with a dict of lists.

What causes this unintuitive behavior and how can it be mitigated?

The issues with the grammar are that you are encapsulating OneOrMore in delimited_list when you only want the outer one, and that you aren't telling the parser how your data needs to be structured to give the names meaning.

You also don't need the whitespace suppression, as pyparsing skips whitespace between tokens automatically.

Adding parse_all=True to the parse_string call will help you see where not everything is being consumed.
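As a rough illustration of that point, the sketch below uses a made-up mini-grammar that only knows the tags "Name" and "Keywords". The question's keys dict spells one tag "Titel", so the same effect appears to be at work there: oneOf never matches "Title" and the parser quietly stops after the first tag.

```python
import pyparsing as pp

# Hypothetical mini-grammar: "Title" is deliberately missing from the tag list.
tag = (
    pp.one_of(["Name", "Keywords"], as_keyword=True)
    + pp.Suppress(":")
    + pp.Word(pp.alphas)
)
grammar = pp.delimited_list(tag, delim="•")
text = "Name:Test•Title:Test"

# Without parse_all the parser stops silently after the first tag.
partial = grammar.parse_string(text).as_list()
print(partial)

# With parse_all=True the unconsumed rest raises an exception instead.
try:
    grammar.parse_string(text, parse_all=True)
    failed_at = None
except pp.ParseException as err:
    failed_at = err.loc
    print(f"parsing stopped at position {failed_at}: {err}")
```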

name_values_grammar = pp.delimited_list(
        pp.Group(
                pp.oneOf(keywords,as_keyword=True).setResultsName("key",list_all_matches=True)
                + pp.Suppress(pp.Literal(':'))
                + pp.delimited_list(
                    pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
                    , delim=',')
            )
            , delim='•'
        ).setResultsName('tag', list_all_matches=True)

Should I use parse actions? As you can see, you don't technically need to, but you've ended up with a data structure that might be less efficient for what you want. If the grammar gets more complicated, I think using some parse actions would make sense. Take a look below for some examples that map the key names (only if they are found) and clean up list parsing for a more complicated grammar.

How is the naming of part results done? By default in a ParseResults object, the last part that is labelled with a name will be returned when you ask for that name. Asking for all matches to be returned using list_all_matches works, but is only really useful for some simple structures. See below for examples.
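A small sketch of the difference, using a made-up results name "item" and a plain comma-separated list:

```python
import pyparsing as pp

word = pp.Word(pp.alphas)

# Default: each new match overwrites the previous one under the same name.
last_only = pp.delimited_list(word("item")).parse_string("x,y,z")
print(last_only["item"])

# list_all_matches=True accumulates every match in a list instead.
all_items = pp.delimited_list(
    word.set_results_name("item", list_all_matches=True)
).parse_string("x,y,z")
print(all_items["item"].as_list())
```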

How is the navigation of the resulting tree done? By default, everything gets flattened. You can use pyparsing.Group to tell the parser not to flatten its contents into the parent list (and therefore retain useful structure and part names).
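To make the flattening visible, here is a minimal comparison on an invented "x:a,y:b" input; the names "key" and "value" are just labels chosen for this sketch.

```python
import pyparsing as pp

pair = pp.Word(pp.alphas)("key") + pp.Suppress(":") + pp.Word(pp.alphas)("value")

# Without Group, all tokens land in one flat list.
flat = pp.delimited_list(pair).parse_string("x:a,y:b")
print(flat.as_list())

# With Group, each pair becomes a sub-result that keeps its own names.
nested = pp.delimited_list(pp.Group(pair)).parse_string("x:a,y:b")
print(nested.as_list())
print(nested[1]["key"], nested[1]["value"])
```

Indexing by position selects a group, and indexing by name then selects a part inside that group.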

How is it possible to get the list back from delimitedList? If you don't wrap the delimited_list result in another list, then the flattening that is done will remove the structure. Parse actions or a Group around the inner structure come to the rescue again.
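One way this could look for the input from the question is sketched below: the inner delimited_list is wrapped in its own Group so the value list survives, and the result is then walked to build the dict. The results names "key" and "vals" are chosen for this sketch ("values" is avoided because ParseResults already has a method of that name).

```python
import pyparsing as pp

keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}
notes_text = "Name:Test•Title: Test•Keywords: A,B,C"

value = pp.Word(pp.printables, exclude_chars=":,")
tag = pp.Group(
    pp.one_of(list(keys), as_keyword=True)("key")
    + pp.Suppress(":")
    # The inner Group keeps the comma-separated values together as one list.
    + pp.Group(pp.delimited_list(value, delim=","))("vals")
)
grammar = pp.delimited_list(tag, delim="•")

result = grammar.parse_string(notes_text, parse_all=True)
# Walk the outer groups and map each tag name to its dict key.
named = {keys[t["key"]]: t["vals"].as_list() for t in result}
print(named)
```

In this sketch every value comes back as a list, including the single-item ones, which keeps the result uniform at the cost of an extra unwrapping step for "name" and "title".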

What does list_all_matches=True achieve - its behavior seems strange? Whether it seems strange depends on the structure of the grammar. Consider the different outputs in:

import pyparsing as pp

print(
    pp.delimited_list(
            pp.Word(pp.printables, exclude_chars=',').setResultsName('word', list_all_matches=True)
        ).parse_string('x,y,z').dump()
    )

print(
    pp.delimited_list(
                pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
                + pp.Suppress(pp.Literal(':'))
                + pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
        )
        .parse_string('x:a,y:b,z:c').dump()
    )

print(
    pp.delimited_list(
        pp.Group(
                pp.Word(pp.printables, exclude_chars=':,').setResultsName('key', list_all_matches=True)
                + pp.Suppress(pp.Literal(':'))
                + pp.Word(pp.printables, exclude_chars=':,').setResultsName('value', list_all_matches=True)
            )
        ).setResultsName('tag', list_all_matches=True)
        .parse_string('x:a,y:b,z:c').dump()
    )

The first one makes sense, giving you a list of all the tokens you would expect. The third one also makes sense, since you have a structure you can walk. But with the second one you end up with two lists that are not necessarily (in a more complicated grammar) going to be easy to match up.

Here's a different way of building the grammar so that it supports quoted strings containing delimiters (so they don't become lists) and keywords that aren't in your mapping. It's harder to do this without parse actions.

import pyparsing as pp
import json

test_string = "Name:Test•Title: Test•Extra: '1,2,3'•Keywords: A,B,C,'D,E',F"

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}

g_key = pp.Word(pp.alphas)
g_item = pp.Word(pp.printables, exclude_chars='•,\'') | pp.QuotedString(quote_char="'")
g_value = pp.delimited_list(g_item, delim=',')
l_key_value_sep = pp.Suppress(pp.Literal(':'))
g_key_value = g_key + l_key_value_sep + g_value
g_grammar = pp.delimited_list(g_key_value, delim='•')

g_key.add_parse_action(lambda x: keys[x[0]] if x[0] in keys else x)
g_value.add_parse_action(lambda x: [x] if len(x) > 1 else x)
g_key_value.add_parse_action(lambda x: (x[0], x[1].as_list()) if isinstance(x[1],pp.ParseResults) else (x[0], x[1]))

key_values = dict()
for k,v in g_grammar.parse_string(test_string, parse_all=True):
    key_values[k] = v

print(json.dumps(key_values, indent=2))
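For the simpler case without quoted strings, a possible alternative to the parse actions is pyparsing's Dict class, which uses the first token of each group as a key. This is a sketch under that assumption and not part of the approach above; the key mapping is applied afterwards as a separate step.

```python
import pyparsing as pp

keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}
notes_text = "Name:Test•Title: Test•Keywords: A,B,C"

key = pp.Word(pp.alphas)
value = pp.Word(pp.printables, exclude_chars=":,")
# Each entry is one group: the tag followed by one or more values.
entry = pp.Group(key + pp.Suppress(":") + pp.delimited_list(value, delim=","))
grammar = pp.Dict(pp.delimited_list(entry, delim="•"))

parsed = grammar.parse_string(notes_text, parse_all=True).as_dict()
# Rename the tags; tags missing from the mapping keep their original name.
named = {keys.get(k, k): v for k, v in parsed.items()}
print(named)
```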

Another approach using regular expressions would be:

import typing

def _extractByKeyword(keyword: str, string: str) -> typing.Union[str, None]:
    """
    Extract the value for the given key from the given string.
    designed for simple key value strings without further formatting
    e.g.
        Title: Hello World
        Goal: extraction
    For keyword="Goal" the string "extraction" would be returned

    Args:
        keyword: extract the value associated to this keyword
        string: string to extract from

    Returns:
        str: value associated to given keyword
        None: keyword not found in given string
    """
    if string is None or keyword is None:
        return None
    # https://stackoverflow.com/a/2788151/1497139
    # value: any run of whitespace, word characters, commas, underscores and hyphens
    pattern = rf"({keyword}:(?P<value>[\s\w,_-]*))(\s+\w+:|\n|$)"
    import re
    match = re.search(pattern, string)
    value = None
    if match is not None:
        value = match.group('value')
        if isinstance(value, str):
            value = value.strip()
    return value

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}

notes_text="Name:Test Title: Test Keywords: A,B,C"

lod = {v: _extractByKeyword(k, notes_text) for k,v in keys.items()}

The extraction function was tested with:

import typing
from dataclasses import dataclass
from unittest import TestCase

class TestExtraction(TestCase):

    def test_extractByKeyword(self):
        """
        tests the keyword extraction
        """
        @dataclass
        class TestParam:
            expected: typing.Union[str, None]
            keyword: typing.Union[str, None]
            string: typing.Union[str, None]

        testParams = [
            TestParam("test", "Goal", "Title:Title\nGoal:test\nLabel:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test Label:title"),
            TestParam("test", "Goal", "Title:Title\nGoal:test"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces\nLabel:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces Label:title"),
            TestParam("test with spaces", "Goal", "Title:Title\nGoal:test with spaces"),
            TestParam("SQL-DML", "Goal", "Title:Title\nGoal:SQL-DML"),
            TestParam("SQL_DML", "Goal", "Title:Title\nGoal:SQL_DML"),
            TestParam(None, None, "Title:Title\nGoal:test"),
            TestParam(None, "Label", None),
            TestParam(None, None, None),
        ]
        for testParam in testParams:
            with self.subTest(testParam=testParam):
                actual = _extractByKeyword(testParam.keyword, testParam.string)
                self.assertEqual(testParam.expected, actual)

For the time being I am using a simple work-around, see https://trinket.io/python3/7ccaa91f7e

# Try out parsing name value list
# WF 2023-01-28
import json
notes_text="Name:Test•Title: Test•Keywords: A,B,C"

keys={
  "Name": "name",
  "Title": "title",
  "Keywords": "keywords",
}
result={}
key_values=notes_text.split("•")
for key_value in key_values:
  key,value=key_value.split(":")
  value=value.strip()
  result[keys[key]]=value # could do another split here if need be
  
print(json.dumps(result,indent=2))

output:

{
  "name": "Test",
  "title": "Test",
  "keywords": "A,B,C"
}
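The work-around raises an error as soon as a value contains a colon or a tag is missing from keys. A slightly more defensive variant might look like the sketch below; the helper name parse_notes, the list_tags parameter and the "Url" tag are all invented for this illustration.

```python
def parse_notes(text, keys, list_tags=("Keywords",)):
    """Hypothetical helper: split-based parsing with a few extra guards."""
    result = {}
    for part in text.split("•"):
        if ":" not in part:
            continue  # skip fragments without a tag
        # maxsplit=1 keeps colons inside the value intact
        tag, value = part.split(":", 1)
        tag, value = tag.strip(), value.strip()
        if tag in list_tags:
            value = [item.strip() for item in value.split(",")]
        # tags missing from the mapping keep their original name
        result[keys.get(tag, tag)] = value
    return result

keys = {"Name": "name", "Title": "title", "Keywords": "keywords"}
parsed = parse_notes(
    "Name:Test•Title: Test•Keywords: A,B,C•Url: http://example.org", keys
)
print(parsed)
```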
