
Python re.sub() optimization

I have a Python list in which each string takes one of the following four forms (the names vary, of course):

Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n

I want these to be corrected to:

Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n

Easy enough to do with 4 re.sub():

import re

with open("path/to/file", 'r') as fileset:
    dataset = [line.strip() for line in fileset]    # removes some misc. white noise
for item in dataset:
    # The four substitutions run in sequence, so a later pattern can also
    # match the output of an earlier one.
    item = re.sub(r'(.*):\W(.*);\W', r'\g<1>,\g<2>,', item)
    item = re.sub(r'(.*);\W(.*)', r'title,\g<1>,\g<2>', item)
    item = re.sub(r'(.*):\W(.*)', r'\g<1>,\g<2>,fname', item)
    item = re.sub(r'(.*)', r'title,\g<1>,fname', item)

While this is fine for the dataset I'm using, I want to be more efficient.
Is there a single operation that can simplify this process?

Please pardon me if I forgot a quote or some such; I'm not at my workstation right now, and I'm aware I've stripped the newline (\n).

Thank you,

Brief

Instead of running two loops, you can do everything in a single pass: read the file line by line and apply one re.sub() call to each line. Adapted from How to iterate over the file in Python (and using the code in my Code section):

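with open("path/to/file", 'r') as f:
    for x in f:
        # r and repl are defined in the Code section below.
        # Strip the line ending first, otherwise [^;]+ would capture the trailing newline.
        print(re.sub(r, repl, x.rstrip("\r\n")))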

See Python - How to use regexp on file, line by line, in Python for other alternatives.


Code

For the sake of readability, I've replaced your file with a list.

See regex in use here

^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?

Note: You don't need the \r\n exclusions in Python as long as each line has its line ending stripped before matching; I include them so the pattern can be shown on regex101. Your regex would then just be ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?
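To illustrate why the \r\n exclusions matter when the whole input is one multi-line string (as it is on regex101), here is a minimal sketch. It assumes the re.MULTILINE flag so that ^ anchors at the start of every line; the names text, pattern and result are only illustrative.

```python
import re

# Hypothetical multi-line input, one record per line
text = "Mr: Smith\nMr: Smith; John\nSmith\nSmith; John\n"

# Same pattern as above, with \r\n excluded so groups 1 and 2 cannot run across a line break
pattern = re.compile(r"^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?", re.MULTILINE)

def repl(m):
    # Optional groups that did not take part in the match are None,
    # so fall back to the placeholders
    return (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname")

result = pattern.sub(repl, text)
print(result)
```

With stripped, single-line strings the shorter pattern is enough, since there is no line break left for a group to swallow.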

Usage

See code in use here

import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title" ) + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))

Explanation

  • ^ Assert position at the start of the line
  • (?:([^:]+):\W*)? Optionally match the following
    • ([^:]+) Capture any character except : one or more times into capture group 1
    • : Match this literally
    • \W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
  • ([^;]+) Capture any character except ; one or more times into capture group 2
  • (?:;\W*(.+))? Optionally match the following
    • ; Match this literally
    • \W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
    • (.+) Capture any character one or more times into capture group 3
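The breakdown above can be checked by looking at what each group actually captures. This is a small sketch using the four sample strings from the question; the names samples and captured are made up for the illustration.

```python
import re

r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Map each sample to the tuple of its three capture groups;
# an optional group that did not match shows up as None
captured = {s: re.match(r, s).groups() for s in samples}

for s, groups in captured.items():
    print(repr(s), "->", groups)
```

The None entries are what make the `or "title"` and `or "fname"` fallbacks in repl work.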

Given the above explanation of the regex, re.sub(r, repl, s) works as follows:

  • repl is a callback: re.sub() calls it with each match object and substitutes the string it returns, which is built from:
    • group 1 if it captured anything, title otherwise
    • group 2 (it's supposedly always set - using OP's logic here again)
    • group 3 if it captured anything, fname otherwise
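If the substitution runs over many lines, one option is to compile the pattern once and wrap the call in a helper. The sketch below assumes that each line should be stripped first, as in the question; convert_lines and PATTERN are hypothetical names, not part of the original answer.

```python
import re

# Compile once and reuse the pattern object for every line
PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def repl(m):
    return (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname")

def convert_lines(lines):
    """Hypothetical helper: strip each line, then apply the single substitution."""
    return [PATTERN.sub(repl, line.strip()) for line in lines]

converted = convert_lines(["Mr: Smith\n", "Mr: Smith; John\n", "Smith\n", "Smith; John\n"])
print(converted)
```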

IMHO, regexes are overkill here; you can use the classic string methods to split your item string into chunks. For that, you can use partition (or rpartition).

First, split your item string into "records", like this:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']

Then, you can create a short function to normalize each "record". Here is an example:

def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

This function is easier to understand than a collection of regexes and, in most cases, it is likely to be faster.

For better integration, you can define another function to handle each item:
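The speed claim is easy to measure on your own data. The following is a rough sketch with timeit, not a rigorous benchmark: regex_normalize stands in for the single-substitution regex approach, partition_normalize repeats the logic of normalize_record, and both names as well as the repetition count are assumptions made for this example.

```python
import re
import timeit

PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def regex_normalize(record):
    # Single-substitution regex approach
    return PATTERN.sub(
        lambda m: (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname"),
        record.strip(),
    )

def partition_normalize(record):
    # Same logic as normalize_record
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    return "{0},{1},{2}".format(title.strip() or 'title', name.strip(), fname.strip() or 'fname')

records = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Both approaches should agree before their timings are compared
same = [regex_normalize(r) for r in records] == [partition_normalize(r) for r in records]

regex_time = timeit.timeit(lambda: [regex_normalize(r) for r in records], number=10000)
partition_time = timeit.timeit(lambda: [partition_normalize(r) for r in records], number=10000)
print(same, regex_time, partition_time)
```

The absolute numbers depend on the machine and the Python version, so treat them as a relative comparison only.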

def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"

Demo:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)

You get:

'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'
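To tie this back to the file from the question, here is a sketch that applies the same record logic line by line. It writes a temporary sample file so that it can run on its own; in practice you would open your real path instead. Skipping blank lines is an assumption on my part, since the question does not say whether the file contains any.

```python
import os
import tempfile

def normalize_record(record):
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    return "{0},{1},{2}".format(title.strip() or 'title', name.strip(), fname.strip() or 'fname')

# Write a small sample file; the real "path/to/file" would be used instead
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Mr: Smith\nMr: Smith; John\nSmith\nSmith; John\n")
    path = tmp.name

with open(path) as fileset:
    # Skip blank lines so they do not turn into "title,,fname"
    rows = [normalize_record(line) for line in fileset if line.strip()]

os.remove(path)
print(rows)
```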

