How to add character to beginning of regex matched string?

I have some data (below) which I am trying to align.

| 24 | 11 | 506  | -1  | -829.99||
| 24 | 11 | 1910 | 506 | 1      | 829.99|3|
| 12 | 11 | 1933 | 531 | 2      | 7.78  |N|

It seems whenever the 3rd to last value for each row is negative, the row is missing a "|" delimiter. I am trying to use regex to add a vertical bar mid-way through the records to re-align the data like so:

| 24 | 11 |      | 506 | -1     | -829.99||
| 24 | 11 | 1910 | 506 | 1      | 829.99 | 3|
| 12 | 11 | 1933 | 531 | 2      | 7.78   | N|

Disregard the white space; I included it only to make the data more readable for the purpose of this question.

I know the expression below will locate the correct text group and place an additional "|" after it, but can it be modified to put the "|" before the group?

re.sub(r'(\|*\|*\|\|)', r'\1|', DATA)

Just getting started with regex so any help is appreciated!

PS - I am using python to do the actual regex substitutions/additions for this data munging task.

There are some problems with your regex. The asterisk * means that the preceding element (a single character or a group) can repeat zero or more times. Therefore, \|* matches "" (the empty string), "|", "||", and so on, and \|*\|*\|\| matches two consecutive bars "||" preceded by any number of additional bars (zero or more). In your data, that means it matches only the last two bars of the first row.

To see this with re.sub, you can surround the back-reference (i.e. \1) with some other characters (I used curly braces below, i.e. {\1}).

import re

data="""| 24 | 11 | 506  | -1  | -829.99||
| 24 | 11 | 1910 | 506 | 1      | 829.99|3|
| 12 | 11 | 1933 | 531 | 2      | 7.78  |N|
"""
print("using regex above, with curly braces around captured match:")
print(re.sub(r'(\|*\|*\|\|)', r'{\1}', data))

print("desired output:")
print(re.sub(r'(\|[^|]+\|[^|]+\|[^|]+\|\|)', r'|\1', data))

Output:

using regex above, with curly braces around captured match:
| 24 | 11 | 506  | -1  | -829.99{||}
| 24 | 11 | 1910 | 506 | 1      | 829.99|3|
| 12 | 11 | 1933 | 531 | 2      | 7.78  |N|

desired output:
| 24 | 11 || 506  | -1  | -829.99||
| 24 | 11 | 1910 | 506 | 1      | 829.99|3|
| 12 | 11 | 1933 | 531 | 2      | 7.78  |N|

The solution looks for bars separated by one or more characters that are not bars. [^|] matches any character other than |. Note that inside the brackets the bar does not need escaping. The + means "one or more of the previous element".
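One detail worth keeping in mind is that [^|] also matches a newline, so on larger inputs it may be safer to exclude "\n" as well. Below is a minimal sketch of that variant, wrapped in a helper; the names SHORT_ROW and realign_rows are hypothetical and used only for illustration.

```python
import re

# Hypothetical names. The pattern is the one from the answer above, with
# "\n" added to the excluded characters so a match stays within one row.
SHORT_ROW = re.compile(r'(\|[^|\n]+\|[^|\n]+\|[^|\n]+\|\|)')

def realign_rows(text):
    """Insert a '|' before the third-from-last non-empty field of rows ending in '||'."""
    return SHORT_ROW.sub(r'|\1', text)

data = (
    "| 24 | 11 | 506  | -1  | -829.99||\n"
    "| 24 | 11 | 1910 | 506 | 1      | 829.99|3|\n"
    "| 12 | 11 | 1933 | 531 | 2      | 7.78  |N|\n"
)
fixed = realign_rows(data)
print(fixed)
```

This keys on the trailing "||" rather than on the negative value, so it assumes that only the misaligned rows end with an empty last field.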

Does this work for you? It gives me the desired output.

re.sub(r'(\|.*\|.*)(\|.*\|.*\|.*\|\|\n)', r'\g<1>|\g<2>', DATA)

I kept everything before 506 in group 1 and everything from 506 onward in group 2, and added a '|' in between. Note that the pattern expects the row to end with a newline.
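As an alternative to capturing two groups and joining them back together, a zero-width lookahead can mark the insertion point so that nothing has to be copied into the replacement. This is only a sketch, assuming the same sample data and that the misaligned rows are the only ones ending in "||"; the name insertion_point is hypothetical.

```python
import re

data = (
    "| 24 | 11 | 506  | -1  | -829.99||\n"
    "| 24 | 11 | 1910 | 506 | 1      | 829.99|3|\n"
    "| 12 | 11 | 1933 | 531 | 2      | 7.78  |N|\n"
)

# Zero-width lookahead: succeeds at the position just before the last
# three non-empty fields of a row ending in "||", consuming no text.
insertion_point = re.compile(r'(?=\|[^|\n]+\|[^|\n]+\|[^|\n]+\|\|)')

# Replacing the empty match with "|" inserts the missing delimiter there.
fixed = insertion_point.sub('|', data)
print(fixed)
```

Because the lookahead consumes nothing, the replacement is just the character to insert, which directly addresses the original question of putting the "|" before the matched group.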
