
Advanced Python Regex: how to evaluate and extract nested lists and numbers from a multiline string?

I was trying to separate the elements from a multiline string:

lines = '''c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5'''

My aim is to get a list lst such that:

# first value is index
lst[0] = ['c0', 'c1', 'c2', 'c3', 'c4','c5']
lst[1] = [0, 10, 100.5, [1.5, 2], [[10, 10.4], ['c', 10, 'eee']], [['a' , 'bg'], [5.5, 'ddd', 'edd']], 100.5 ]
lst[2] = [1, 20, 200.5, [2.5, 2], [[20, 20.4], ['d', 20, 'eee']], [['a' , 'bg'], [7.5, 'udd', 'edd']], 200.5 ]

My attempt so far is this:

import re

lines = '''c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5'''


# get n elements for n lines and remove empty lines
lines = lines.split('\n')
lines = list(filter(None,lines))    

lst = []
lst.append(lines[0].split())


for i in range(1,len(lines)): 
  change = re.sub('([a-zA-Z]+)', r"'\1'", lines[i])
  lst.append(change)

for i in lst[1]:
  print(i)

How can I fix the regex?

Update
Test datasets

data = """
    orig  shifted  not_equal  cumsum  lst
0     10      NaN       True       1  [[10, 10.4], [c, 10, eee]] 
1     10     10.0      False       1  [[10, 10.4], [c, 10, eee]] 
2     23     10.0       True       2  [[10, 10.4], [c, 10, eee]] 
"""

# Gives: ValueError: malformed node or string:

data = """
    Name Result Value
0   Name1   5   2
1   Name1   5   3
2   Name2   11  1
"""
# gives same error


data = """
product  value
0       A     25
1       B     45
2       C     15
3       C     14
4       C     13
5       B     22
"""
# gives same error

data = '''
    c0 c1
0   10 100.5
1   20 200.5
'''
# works perfectly

As noted in the comments, this task is impossible to do with regex. Regex is fundamentally unable to handle nested constructs. What you need is a parser.

One way to create a parser is PEG, which lets you set up a list of tokens and their relations to each other in a declarative language. This parser definition is then turned into an actual parser that can handle the described input. When parsing succeeds, you get back a tree structure with all the items properly nested.

For demonstration purposes, I've used the JavaScript implementation peg.js, which has an online demo page where you can live-test parsers against some input. This parser definition:

{
    // [value, [[delimiter, value], ...]] => [value, value, ...]
    const list = values => [values[0]].concat(values[1].map(i => i[1]));
}
document
    = line*
line "line"
    = value:(item (whitespace item)*) whitespace? eol { return list(value) }
item "item"
    = number / string / group
group "group"
    = "[" value:(item (comma item)*) whitespace? "]" { return list(value) }
comma "comma"
    = whitespace? "," whitespace?
number "number"
    = value:$[0-9.]+ { return +value }
string "string"
    = $([^ 0-9\[\]\r\n,] [^ \[\]\r\n,]*)
whitespace "whitespace"
    = $" "+
eol "eol"
    = [\r]? [\n] / eof
eof "eof"
    = !.

can understand this kind of input:

c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]]
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd1]]

and produces this object tree (JSON notation):

[
    ["c0", "c1", "c2", "c3", "c4", "c5"],
    [0, 10, 100.5, [1.5, 2], [[10, 10.4], ["c", 10, "eee"]], [["a", "bg"], [5.5, "ddd", "edd"]]],
    [1, 20, 200.5, [2.5, 2], [[20, 20.4], ["d", 20, "eee"]], [["a", "bg"], [7.5, "udd", "edd1"]]]
]

i.e.

  • an array of lines,
  • each of which is an array of values,
  • each of which can be either a number, or a string, or another array of values

This tree structure can then be handled by your program.

The above would work, for example, with node.js to turn your input into JSON. The following minimal JS program accepts data from STDIN and writes the parsed result to STDOUT:

// reference the parser.js file, e.g. downloaded from https://pegjs.org/online
const parser = require('./parser');

var chunks = [];

// handle STDIN events to slurp up all the input into one big string
process.stdin.on('data', buffer => chunks.push(buffer.toString()));
process.stdin.on('end', function () {
    var text = chunks.join('');
    var data = parser.parse(text);
    var json = JSON.stringify(data, null, 4);
    process.stdout.write(json);
});

// start reading from STDIN
process.stdin.resume();

Save it as text2json.js or something like that and redirect (or pipe) some text into it:

# input redirection (this works on Windows, too)
node text2json.js < input.txt > output.json

# common alternative, but I'd recommend input redirection over this
cat input.txt | node text2json.js > output.json
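The JSON written to output.json can then be loaded back in Python as plain nested lists. A minimal sketch, using an inline sample in place of the file (the abridged structure below mirrors the parser output shown above):

```python
import json

# abridged sample of what text2json.js emits
text = '''[
    ["c0", "c1", "c2"],
    [0, 10, [[10, 10.4], ["c", 10, "eee"]]]
]'''

rows = json.loads(text)
header, *body = rows
print(header)         # ['c0', 'c1', 'c2']
print(body[0][2][1])  # ['c', 10, 'eee']
```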

There are PEG parser generators for Python as well, for example https://github.com/erikrose/parsimonious. The parser definition language differs between implementations, so the above can only be used with peg.js, but the principle is exactly the same.


EDIT: I've dug into Parsimonious and recreated the above solution in Python. The approach and the parser grammar are the same, with a few tiny syntactic changes.

from parsimonious.grammar import Grammar
from parsimonious.nodes import NodeVisitor

grammar = Grammar(
    r"""
    document   = line*
    line       = whitespace? item (whitespace item)* whitespace? eol
    item       = group / number / boolean / string
    group      = "[" item (comma item)* whitespace? "]"
    comma      = whitespace? "," whitespace?
    number     = "NaN" / ~"[0-9.]+"
    boolean    = "True" / "False"
    string     = ~"[^ 0-9\[\]\r\n,][^ \[\]\r\n,]*"
    whitespace = ~" +"
    eol        = ~"\r?\n" / eof
    eof        = ~"$"
    """)

class DataExtractor(NodeVisitor):
    @staticmethod
    def concat_items(first_item, remaining_items):
        """ helper to concat the values of delimited items (lines or goups) """
        return first_item + list(map(lambda i: i[1][0], remaining_items))

    def generic_visit(self, node, processed_children):
        """ in general we just want to see the processed children of any node """
        return processed_children

    def visit_line(self, node, processed_children):
        """ line nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_group(self, node, processed_children):
        """ group nodes return an array of their processed_children """
        _, first_item, remaining_items, _, _ = processed_children
        return self.concat_items(first_item, remaining_items)

    def visit_number(self, node, processed_children):
        """ number nodes return floats (nan is a special value of floats) """
        return float(node.text)

    def visit_boolean(self, node, processed_children):
        """ boolean nodes return return True or False """
        return node.text == "True"

    def visit_string(self, node, processed_children):
        """ string nodes just return their own text """
        return node.text

The DataExtractor is responsible for traversing the tree and pulling out data from the nodes, returning lists of strings, numbers, booleans, or NaN.

The concat_items() helper performs the same task as the list() function in the JavaScript code above, and the other visitor methods also have their equivalents in the peg.js approach. The difference is that peg.js integrates them directly into the parser definition, while Parsimonious expects them in a separate class, so it's a bit wordier in comparison, but not too bad.
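The shape transformation concat_items() performs can be illustrated on its own; the mock node shapes below are assumptions based on how Parsimonious nests the processed children of a `first (delimiter rest)*` rule:

```python
def concat_items(first_item, remaining_items):
    # [value] + [[delimiter, [value]], ...]  ->  [value, value, ...]
    return first_item + [i[1][0] for i in remaining_items]

first = [10.0]                          # a processed `item` node
rest = [[" ", [20.0]], [" ", ["eee"]]]  # a processed `(whitespace item)*` node
print(concat_items(first, rest))        # [10.0, 20.0, 'eee']
```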

Usage, assuming an input file called "data.txt", also mirrors the JS code:

de = DataExtractor()

with open("data.txt", encoding="utf8") as f:
    text = f.read()

tree = grammar.parse(text)
data = de.visit(tree)
print(data)

Input:

orig shifted not_equal cumsum lst
0 10 NaN True 1 [[10, 10.4], [c, 10, eee]]
1 10 10.0 False 1 [[10, 10.4], [c, 10, eee]]
2 23 10.0 True 2 [[10, 10.4], [c, 10, eee]]

Output:

[
    ['orig', 'shifted', 'not_equal', 'cumsum', 'lst'],
    [0.0, 10.0, nan, True, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]],
    [1.0, 10.0, 10.0, False, 1.0, [[10.0, 10.4], ['c', 10.0, 'eee']]], 
    [2.0, 23.0, 10.0, True, 2.0, [[10.0, 10.4], ['c', 10.0, 'eee']]]
]

In the long run, I would expect this approach to be more maintainable and flexible than regex hackery. Adding explicit support for NaN and for booleans (which the peg.js solution above does not have; there they are parsed as strings), for example, was easy.

I honestly disagree that it is impossible to do with a regular expression. One might state more precisely that it is not possible with regular expressions alone.
See the following code, which yields what you want, and read the explanation further down.

Code

import regex as re
from ast import literal_eval

data = """
c0 c1 c2 c3 c4 c5
0   10 100.5 [1.5, 2]     [[10, 10.4], [c, 10, eee]]  [[a , bg], [5.5, ddd, edd]] 100.5
1   20 200.5 [2.5, 2]     [[20, 20.4], [d, 20, eee]]  [[a , bg], [7.5, udd, edd]] 200.5
"""

# regex definition
rx = re.compile(r'''
    (?(DEFINE)
        (?<item>[.\w]+)
        (?<list>\[(?:[^][\n]*|(?R))+\])
    )
    (?&list)|(?&item)
    ''', re.X)

# unquoted item
item_rx = re.compile(r"(?<!')\b([a-z][.\w]*)\b(?!')")

# afterwork party
def afterwork(match):
    match = item_rx.sub(r"'\1'", match)
    return literal_eval(match)

matrix = [
    [afterwork(item.group(0)) for item in rx.finditer(line)]
    for line in data.split("\n")
    if line
]

print(matrix)

This yields

[['c0', 'c1', 'c2', 'c3', 'c4', 'c5'], [0, 10, 100.5, [1.5, 2], [[10, 10.4], ['c', 10, 'eee']], [['a', 'bg'], [5.5, 'ddd', 'edd']], 100.5], [1, 20, 200.5, [2.5, 2], [[20, 20.4], ['d', 20, 'eee']], [['a', 'bg'], [7.5, 'udd', 'edd']], 200.5]]

Explanation

First, we import the newer regex module and the literal_eval function from the ast module, which is needed to transform the found matches into actual Python objects. The regex module is far more powerful than the built-in re module: among other things, it provides recursion and the powerful (yet not very well known) DEFINE construct for subroutines.

We define two types of elements, the first being a "simple" item, the latter being a "list" item; see the demo on regex101.com.

In a second step we add quotes around those elements that need them (that is, unquoted elements starting with a letter). Everything is then fed into literal_eval and collected in the list comprehension.
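The quoting step can be tried in isolation; this sketch uses only the standard library, since the quoting regex itself needs no recursion:

```python
import re
from ast import literal_eval

# quote every bareword that starts with a lowercase letter
item_rx = re.compile(r"(?<!')\b([a-z][.\w]*)\b(?!')")

s = "[[10, 10.4], [c, 10, eee]]"
quoted = item_rx.sub(r"'\1'", s)
print(quoted)                # [[10, 10.4], ['c', 10, 'eee']]
print(literal_eval(quoted))  # [[10, 10.4], ['c', 10, 'eee']]
```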
