Advanced Python Regex: how to evaluate and extract nested lists and numbers from a multiline string?

I was trying to separate the elements of a multiline string:

lines = '''c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5'''

My aim is to get a list lst such that:

# first value is index
lst[0] = ['c0', 'c1', 'c2', 'c3', 'c4','c5']
lst[1] = [0, 10, 100.5, [1.5, 2], [[10, 10.4], ['c', 10, 'eee']], [['a' , 'bg'], [5.5, 'ddd', 'edd']], 100.5 ]
lst[2] = [1, 20, 200.5, [2.5, 2], [[20, 20.4], ['d', 20, 'eee']], [['a' , 'bg'], [7.5, 'udd', 'edd']], 200.5 ]

My attempt so far is this:

import re

lines = '''c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5'''


# get n elements for n lines and remove empty lines
lines = lines.split('\n')
lines = list(filter(None,lines))    

lst = []
lst.append(lines[0].split())


for i in range(1,len(lines)): 
  change = re.sub('([a-zA-Z]+)', r"'\1'", lines[i])
  lst.append(change)

for i in lst[1]:
  print(i)

How can I fix the regex?

Update
Test datasets

data = """
    orig  shifted  not_equal  cumsum  lst
0     10      NaN       True       1  [[10, 10.4], [c, 10, eee]] 
1     10     10.0      False       1  [[10, 10.4], [c, 10, eee]] 
2     23     10.0       True       2  [[10, 10.4], [c, 10, eee]] 
"""

# Gives: ValueError: malformed node or string:

data = """
    Name Result Value
0   Name1   5   2
1   Name1   5   3
2   Name2   11  1
"""
# gives same error


data = """
product  value
0       A     25
1       B     45
2       C     15
3       C     14
4       C     13
5       B     22
"""
# gives same error

data = '''
    c0 c1
0   10 100.5
1   20 200.5
'''
# works perfect

As noted in the comments, this task is impossible to do with regex. Regex in the classic sense (as in Python's re module) is fundamentally unable to handle arbitrarily nested constructs. What you need is a parser.
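To make the idea of "a parser" concrete before bringing in any tooling, here is a minimal hand-written recursive-descent sketch that uses only the standard library. It is not the approach taken in this answer; the function name parse_line and the choice to try int and then float for each token are assumptions made for illustration.

```python
def parse_line(line):
    """Parse one line into a list of numbers, strings and nested lists."""
    pos = 0
    n = len(line)

    def skip_spaces():
        nonlocal pos
        while pos < n and line[pos] == " ":
            pos += 1

    def parse_atom():
        # read up to the next space, comma or bracket
        nonlocal pos
        start = pos
        while pos < n and line[pos] not in " ,[]":
            pos += 1
        text = line[start:pos]
        if not text:
            raise ValueError(f"unexpected {line[start]!r} at position {start}")
        # try int first, then float (float() also accepts "NaN" and "inf")
        for convert in (int, float):
            try:
                return convert(text)
            except ValueError:
                pass
        return text  # anything else stays a string

    def parse_list():
        # called with pos on "["; recursion handles any nesting depth
        nonlocal pos
        pos += 1
        items = []
        while True:
            skip_spaces()
            if pos >= n:
                raise ValueError("unclosed '['")
            if line[pos] == "]":
                pos += 1
                return items
            if line[pos] == ",":
                pos += 1
                continue
            items.append(parse_item())

    def parse_item():
        return parse_list() if line[pos] == "[" else parse_atom()

    result = []
    skip_spaces()
    while pos < n:
        result.append(parse_item())
        skip_spaces()
    return result


row = "0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5"
print(parse_line(row))
```

The recursion in parse_list is exactly the part a classic regular expression cannot express: each "[" starts a nested call that ends at its matching "]".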

One way to create a parser is a PEG (parsing expression grammar), which lets you set up a list of tokens and their relations to each other in a declarative language. This parser definition is then turned into an actual parser that can handle the described input. When parsing succeeds, you get back a tree structure with all the items properly nested.

For demonstration purposes, I've used the JavaScript implementation peg.js, which has an online demo page where you can live-test parsers against some input. This parser definition:

{
    // [value, [[delimiter, value], ...]] => [value, value, ...]
    const list = values => [values[0]].concat(values[1].map(i => i[1]));
}
document
    = line*
line "line"
    = value:(item (whitespace item)*) whitespace? eol { return list(value) }
item "item"
    = number / string / group
group "group"
    = "[" value:(item (comma item)*) whitespace? "]" { return list(value) }
comma "comma"
    = whitespace? "," whitespace?
number "number"
    = value:$[0-9.]+ { return +value }
string "string"
    = $([^ 0-9\[\]\r\n,] [^ \[\]\r\n,]*)
whitespace "whitespace"
    = $" "+
eol "eol"
    = [\r]? [\n] / eof
eof "eof"
    = !.

can understand this kind of input:

c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]]
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd1]]

and produces this object tree (JSON notation):

[
    ["c0", "c1", "c2", "c3", "c4", "c5"],
    [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]], [["a", "bg"], [5.5, "ddd", "edd"]]],
    [1, 20, 200.5, [2.5, 2], [[20, 20.4], ["d", 20, "eee"]], [["a", "bg"], [7.5, "udd", "edd1"]]]
]

i.e.

  • an array of lines,
  • each of which is an array of values,
  • each of which can be a number, a string, or another array of values

This tree structure can then be handled by your program.
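As a rough illustration of what "handling" such a tree could look like on the Python side, the sketch below loads a shortened copy of the JSON shown above and walks it recursively. The helper name walk and the trimmed sample data are assumptions for this example only.

```python
import json

# shortened copy of the JSON tree shown above (assumption: trimmed for brevity)
tree_json = """[
    ["c0", "c1", "c2", "c3"],
    [0, 10, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]]]
]"""


def walk(value, depth=0):
    """Yield (depth, leaf) pairs for every number or string in the tree."""
    if isinstance(value, list):
        for child in value:
            yield from walk(child, depth + 1)
    else:
        yield depth, value


rows = json.loads(tree_json)
header, *records = rows          # first line holds the column names
leaves = list(walk(records[0]))  # every leaf of the first data line

for depth, leaf in leaves:
    print("  " * depth + repr(leaf))
```

Because every value is either a leaf or a list of values, one recursive function is enough to reach any nesting depth.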

The above would work, for example, with node.js to turn your input into JSON. The following minimal JS program accepts data from STDIN and writes the parsed result to STDOUT:

// reference the parser.js file, e.g. downloaded from https://pegjs.org/online
const parser = require('./parser');

var chunks = [];

// handle STDIN events to slurp up all the input into one big string
process.stdin.on('data', buffer => chunks.push(buffer.toString()));
process.stdin.on('end', function () {
    var text = chunks.join('');
    var data = parser.parse(text);
    var json = JSON.stringify(data, null, 4);
    process.stdout.write(json);
});

// start reading from STDIN
process.stdin.resume();

Save it as text2json.js or something like that and redirect (or pipe) some text into it:

# input redirection (this works on Windows, too)
node text2json.js < input.txt > output.json

# common alternative, but I'd recommend input redirection over this
cat input.txt | node text2json.js > output.json

There are PEG parser generators for Python as well, for example https://github.com/erikrose/parsimonious . The parser creation language differs between implementations, so the above can only be used for peg.js, but the principle is exactly the same.


EDIT: I've dug into Parsimonious and recreated the above solution in Python code. The approach is the same and the parser grammar is the same, with a few tiny syntactical changes.

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

grammar = Grammar(
    r"""
    document   = line*
    line       = whitespace? item (whitespace item)* whitespace? eol
    item       = group / number / boolean / string
    group      = "[" item (comma item)* whitespace? "]"
    comma      = whitespace? "," whitespace?
    number     = "NaN" / ~"[0-9.]+"
    boolean    = "True" / "False"
    string     = ~"[^ 0-9\[\]\r\n,][^ \[\]\r\n,]*"
    whitespace = ~" +"
    eol        = ~"\r?\n" / eof
    eof        = ~"$"
    """)

class DataExtractor(NodeVisitor):
    @staticmethod
    def concat_items(first_item, remaining_items):
        """ helper to concat the values of delimited items (lines or groups) """
        return first_item + list(map(lambda i: i[1][0], remaining_items))

    def generic_visit(self, node, processed_children):
        """ in general we just want to see the processed children of any node """
        return processed_children

    def visit_line(self, node, processed_children):
        """ line nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_group(self, node, processed_children):
        """ group nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_number(self, node, processed_children):
        """ number nodes return floats (nan is a special value of floats) """
        return float(node.text)

    def visit_boolean(self, node, processed_children):
        """ boolean nodes return True or False """
        return node.text == "True"

    def visit_string(self, node, processed_children):
        """ string nodes just return their own text """
        return node.text

The DataExtractor is responsible for traversing the tree and pulling out data from the nodes, returning lists of strings, numbers, booleans, or NaN.

The concat_items() function performs the same task as the list() function in the JavaScript code above. The other functions also have their equivalents in the peg.js approach. The difference is that peg.js integrates them directly into the parser definition, while Parsimonious expects them in a separate class, so it's a bit wordier in comparison, but not too bad.

Usage, assuming an input file called "data.txt", also mirrors the JS code:

de = DataExtractor()

with open("data.txt", encoding="utf8") as f:
    text = f.read()

tree = grammar.parse(text)
data = de.visit(tree)
print(data)

Input:

orig shifted not_equal cumsum lst
0 10 NaN True 1 [[10, 10.4], [c, 10, eee]]
1 10 10.0 False 1 [[10, 10.4], [c, 10, eee]]
2 23 10.0 True 2 [[10, 10.4], [c, 10, eee]]

Output:

[
    ['orig', 'shifted', 'not_equal', 'cumsum', 'lst'],
    [0.0, 10.0, nan, True, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [1.0, 10.0, 10.0, False, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]], 
    [2.0, 23.0, 10.0, True, 2.0, [[10.0, 10.4], ['c', 10.0, 'eee']]]
]
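Note that every number in this output is a float (0.0 rather than 0), because visit_number always calls float(). If integers are preferred, one possible post-processing step is sketched below. The helper name normalize is an assumption, not part of the answer's code, and the conversion cannot tell an original 10 from an original 10.0, so both end up as the int 10.

```python
import math


def normalize(value):
    """Recursively turn floats with no fractional part into ints."""
    if isinstance(value, list):
        return [normalize(v) for v in value]
    # NaN and infinity are floats too, but is_integer() is False for them
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value  # strings, booleans and non-integral floats pass through


row = [0.0, 10.0, float("nan"), True, 1.0, [[10.0, 10.4], ["c", 10.0, "eee"]]]
clean = normalize(row)
print(clean)

# NaN never compares equal to itself, so test for it with math.isnan()
print(math.isnan(clean[2]))
```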

In the long run, I would expect this approach to be more maintainable and flexible than regex hackery. For example, adding explicit support for NaN and for booleans was easy (the peg.js solution above does not have it; there they are parsed as strings).

I honestly disagree that it is impossible to do with a regular expression. More precisely, one might say that it is not possible with regular expressions alone.
See the following code, which yields what you want, and read the explanation further down.

Code

import regex as re
from ast import literal_eval

data = """
c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5
"""

# regex definition
rx = re.compile(r'''
    (?(DEFINE)
        (?<item>[.\w]+)
        (?<list>\[(?:[^][\n]*|(?R))+\])
    )
    (?&list)|(?&item)
    ''', re.X)

# unquoted item
item_rx = re.compile(r"(?<!')\b([a-z][.\w]*)\b(?!')")

# afterwork party
def afterwork(match):
    match = item_rx.sub(r"'\1'", match)
    return literal_eval(match)

matrix = [
    [afterwork(item.group(0)) for item in rx.finditer(line)]
    for line in data.split("\n")
    if line
]

print(matrix)

This yields

[['c0', 'c1', 'c2', 'c3', 'c4', 'c5'], [0, 10, 100.5, [1.5, 2], [[10, 10.4], ['c', 10, 'eee']], [['a', 'bg'], [5.5, 'ddd', 'edd']], 100.5], [1, 20, 200.5, [2.5, 2], [[20, 20.4], ['d', 20, 'eee']], [['a', 'bg'], [7.5, 'udd', 'edd']], 200.5]]

Explanation

First, we import the third-party regex module and the function literal_eval from the ast module, which is needed to turn the matched text into actual Python objects. The regex module has far more power than the built-in re module: it provides recursion and the powerful (yet not very well known) DEFINE construct for subroutines.

We define two types of elements, the first being a "simple" item and the second a "list" item; see the demo on regex101.com.

In a second step we add quotes to those elements that need them (that is, unquoted elements starting with a lowercase letter). Everything is fed into literal_eval and then collected by the list comprehension.
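Since item_rx only matches lowercase words, tokens such as Name1, A or NaN from the test datasets in the question reach literal_eval unquoted, which raises ValueError: malformed node or string. One possible adjustment of the quoting step is sketched below using only the standard library. The names WORD_RX, KEEP, quote_words and to_python are assumptions for illustration and not part of the answer's code; tokenising each line is left to the pattern above.

```python
import re
from ast import literal_eval

# a bare word: starts with a letter or underscore and is neither part of a
# number (e.g. the "e" in 1e5) nor already wrapped in single quotes
WORD_RX = re.compile(r"(?<!['\w.])([A-Za-z_][.\w]*)(?!['\w])")

# names that literal_eval understands and that must stay unquoted
KEEP = {"True", "False", "None"}


def quote_words(text):
    """Wrap every bare word in quotes, except True, False and None."""
    def repl(match):
        word = match.group(1)
        return word if word in KEEP else repr(word)
    return WORD_RX.sub(repl, text)


def to_python(token):
    """Convert one token (a simple item or a bracketed list) to a Python object."""
    if token == "NaN":
        # literal_eval cannot produce NaN; only a top-level NaN is handled
        # here, a NaN inside a list would come back as the string 'NaN'
        return float("nan")
    return literal_eval(quote_words(token))


print(to_python("[[10, 10.4], [c, 10, eee]]"))
print(to_python("Name1"), to_python("True"), to_python("100.5"))
```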
