
re.sub in Python 3

I have the following types of text

1. DIMENSIONS:  | ORIGIN: | Position corrected and IL (0) was changed based on RPS: 3482 -230 | Pipe: 
2. DIMENSIONS: 2 x 1350 RCP | ORIGIN: PCD13180 | Position corrected and IL (0) was changed based on RPS: 1390 -20800/1350RCP
3. DIMENSIONS: 3 x 375 RCP | Pipe: 35mm | ORIGIN:
4. DIMENSIONS:  | ORIGIN:
5. Review attribution | DIMENSIONS:  | ORIGIN:
6. Pipe: | DIMENSIONS:  | ORIGIN: 2010 PureData Survey

REQUIRED OUTPUT

1. Position corrected and IL (0) was changed based on RPS: 3482 -230
2. DIMENSIONS: 2 x 1350 RCP | ORIGIN: PCD13180 | Position corrected and IL (0) was changed based on RPS: 1390 -20800/1350RCP
3. DIMENSIONS: 3 x 375 RCP | Pipe: 35mm
4. 
5. Review attribution
6. ORIGIN: 2010 PureData Survey

Basically, I want to get rid of any keys that have no value, such as DIMENSIONS, ORIGIN, Pipe, etc.

I think we have to do this separately for each key...I would prefer this as there are lots more keys I need to use it for.

According to https://regex101.com/r/OX1W3b/6

(.*)DIMENSIONS:  \|(.*)

works, but I am not sure how to use it in Python:

import re
str='DIMENSIONS:  | ORIGIN: | Position corrected and IL (0) was changed based on RPS: 3482 -230'
x=re.sub(".*DIMENSIONS.*","(.*)DIMENSIONS:  \|(.*)",str)
print(x)

This just prints the second argument of re.sub unchanged, because that argument is treated as a literal replacement string and not as a regex.

In Google Sheets I would use =REGEXEXTRACT(A1,"(.*)DIMENSIONS: \\|(.*)")

Is there something similar in Python? re.sub needs a value to replace with, but I want that value to come from the regex capture groups.

Note: this is similar to my question on GIS SE; I am asking here because it is more of a Python question than a GIS question.

I'd say just split each line on | into separate fields, check if there's no value, and then rejoin on | :

s = '''DIMENSIONS:  | ORIGIN: | Position corrected and IL (0) was changed based on RPS: 3482 -230 | Pipe: 
DIMENSIONS: 2 x 1350 RCP | ORIGIN: PCD13180 | Position corrected and IL (0) was changed based on RPS: 1390 -20800/1350RCP
DIMENSIONS: 3 x 375 RCP | Pipe: 35mm | ORIGIN:
DIMENSIONS:  | ORIGIN:
Review attribution | DIMENSIONS:  | ORIGIN:
Pipe: | DIMENSIONS:  | ORIGIN: 2010 PureData Survey'''.splitlines()

result = []
for line in s:
    line = line.split('|')
    lst = []
    for field in line:
        if not field.strip().endswith(':'):
            lst.append(field)
    result.append('|'.join(lst).strip())

Or, in one line:

result = ['|'.join([field for field in line.split('|') if not field.strip().endswith(':')]).strip() for line in s]

Note that this gives you a list of lines. You can rejoin them with '\n'.join(result) if necessary.
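As a rough sketch, the same split/filter/rejoin logic can be wrapped in a function that takes the whole text and returns it rejoined with '\n'. The name drop_empty_keys and the shortened sample are made up for illustration and are not part of the original answer.

```python
def drop_empty_keys(text):
    """Remove 'KEY:' fields that have no value from each '|'-separated line."""
    cleaned = []
    for line in text.splitlines():
        # Keep a field unless, once stripped, it ends with ':' (a key with no value).
        kept = [field for field in line.split('|') if not field.strip().endswith(':')]
        cleaned.append('|'.join(kept).strip())
    # Rejoin the cleaned lines into a single string.
    return '\n'.join(cleaned)


# Three of the sample lines from the question (hypothetical shortened input).
sample = (
    "DIMENSIONS:  | ORIGIN: | Position corrected and IL (0) was changed based on RPS: 3482 -230 | Pipe: \n"
    "DIMENSIONS: 3 x 375 RCP | Pipe: 35mm | ORIGIN:\n"
    "DIMENSIONS:  | ORIGIN:"
)
print(drop_empty_keys(sample))
```

A line in which every key is empty comes back as an empty string, which matches item 4 of the required output.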

This is the part that parses each line:

'|'.join([field for field in line.split('|') if not field.strip().endswith(':')]).strip()

For example, if line is DIMENSIONS: 3 x 375 RCP | Pipe: 35mm | ORIGIN: , that gives us this:

DIMENSIONS: 3 x 375 RCP | Pipe: 35mm
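If you would rather stay with re.sub as in the question, capture groups can be written back in the replacement string as \1, \2 and so on. The sketch below first does that for the single DIMENSIONS pattern, then tries one generic pattern for any empty key. The names EMPTY_FIELD and strip_empty_fields are hypothetical, and the generic pattern assumes that "|" only ever appears as the field separator and that it is applied to one line at a time.

```python
import re

text = 'DIMENSIONS:  | ORIGIN: | Position corrected and IL (0) was changed based on RPS: 3482 -230'

# The question's pattern, with the captured groups written back as \1 and \2.
# \s* replaces the two literal spaces so any amount of whitespace is accepted.
print(re.sub(r'(.*)DIMENSIONS:\s*\|(.*)', r'\1\2', text))

# Hypothetical generic pattern: a field (optionally preceded by "|") that ends
# in ":" plus optional whitespace, right before the next "|" or the end of line.
EMPTY_FIELD = re.compile(r'(?:^|\|)[^|]*:\s*(?=\||$)')


def strip_empty_fields(line):
    # Delete every empty "KEY:" field together with the "|" in front of it,
    # then trim any separator or spaces left over at either end.
    return EMPTY_FIELD.sub('', line).strip(' |')


print(strip_empty_fields(text))
print(strip_empty_fields('Pipe: | DIMENSIONS:  | ORIGIN: 2010 PureData Survey'))
```

The single-key version only removes DIMENSIONS and leaves a leading space behind, so it would have to be repeated for each key. The generic version is closer to the split-based approach above, which is probably still the easier one to read.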
