Python re.sub() optimization

Question

I have a Python list in which each string takes one of the following four forms (the names will differ, of course):

Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n

I want these to be corrected to:

Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n

This is easy enough to do with four re.sub() calls:

import re

with open("path/to/file", 'r') as fileset:
    dataset = [line.strip() for line in fileset]    # removes some misc. white noise
for i, item in enumerate(dataset):
    # each pattern is anchored and excludes ',' so an already converted line is not matched again
    item = re.sub(r'^(.*):\W(.*);\W', r'\g<1>,\g<2>,', item)
    item = re.sub(r'^([^:,]*);\W(.*)$', r'title,\g<1>,\g<2>', item)
    item = re.sub(r'^(.*):\W([^;,]*)$', r'\g<1>,\g<2>,fname', item)
    item = re.sub(r'^([^:;,]+)$', r'title,\g<1>,fname', item)
    dataset[i] = item

While this is fine for the dataset I'm using, I want to be more efficient.
Is there a single operation that can simplify this process?

Please pardon me if I forgot a quote or some such; I'm not at my workstation right now, and I'm aware that I've stripped the newline (\n).

Thank you,

Answer 1

Brief

Instead of reading the file into a list and then looping over it, you can process the file in a single loop with one re.sub() call per line. Adapted from How to iterate over the file in Python (and using r and repl from my Code section):

with open("path/to/file", 'r') as f:
    for x in f:
        # strip the trailing newline so it does not end up inside a captured group
        print(re.sub(r, repl, x.strip()))

See Python - How to use regexp on file, line by line, in Python for other alternatives.

Code

For ease of viewing, I've replaced your file with a list.

See regex in use here

^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?

Note: You don't need all of that in Python as long as each line is stripped of its newline first; I only added the \r\n exclusions to demonstrate it on regex101. Your regex would then simply be ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?
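The \r\n exclusions become useful once the pattern is applied to a whole multi-line string rather than to one stripped line at a time. A minimal sketch, assuming the file content has been read into a single string (the `text` value is made up for illustration) and using re.MULTILINE so that ^ anchors at the start of every line:

```python
import re

# Hypothetical sample standing in for the content of the whole file.
text = "Mr: Smith\nMr: Smith; John\nSmith\nSmith; John\n"

# Same pattern as on regex101: the \r\n exclusions keep groups 1 and 2 on their own line.
pattern = re.compile(r"^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?", re.MULTILINE)

def repl(m):
    # Fall back to the placeholders when an optional group did not participate.
    return ",".join([m.group(1) or "title", m.group(2), m.group(3) or "fname"])

result = pattern.sub(repl, text)
print(result)
```

This sketch only covers \n line endings; with \r\n endings the final (.+) would also pick up the \r, so stripping each line first remains the simpler route.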

Usage

See code in use here

import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title" ) + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))

Explanation

^ Assert position at the start of the line
(?:([^:]+):\W*)? Optionally match the following
- ([^:]+) Capture any character except : one or more times into capture group 1
- : Match this literally
- \W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
([^;]+) Capture any character except ; one or more times into capture group 2
(?:;\W*(.+))? Optionally match the following
- ; Match this literally
- \W* Match any number of non-word characters (copied from OP's original code, I assume \s* can be used instead)
- (.+) Capture any character one or more times into capture group 3

Given the above explanation of the regex, re.sub(r, repl, s) works as follows:

repl is a callback that re.sub() calls with each match object; it returns the following, joined by commas:
- group 1 if it captured anything, title otherwise
- group 2 (it's supposedly always set - using OP's logic here again)
- group 3 if it captured anything, fname otherwise
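To see why the `or` fallbacks in repl work, it can help to look at what each group holds for the four input forms. A small sketch using the same pattern and sample strings as in the Usage section (the names `samples` and `captured` are only for illustration):

```python
import re

r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Groups that did not participate in the match come back as None,
# which is why `m.group(n) or "default"` works as a fallback.
captured = {s: re.match(r, s).groups() for s in samples}
for s, groups in captured.items():
    print(s, "->", groups)
```

An optional group that is skipped yields None rather than an empty string, so the `or` expression picks the placeholder.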

Answer 2

IMHO, regular expressions are overkill here; you can use classic string functions to split each item into chunks. For that, you can use partition (or rpartition).

First, split your item string into "records", like this:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']

Then, you can create a short function to normalize each "record". Here is an example:

def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

This function is easier to understand than a collection of regular expressions, and it is often faster as well.
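The speed claim is easy to check on your own data. A rough sketch with timeit, assuming the single-regex approach from the other answer as the point of comparison (`normalize_regex` and `samples` are names made up here; actual timings depend on the machine and the input):

```python
import re
import timeit

PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def normalize_regex(record):
    # Single re.sub() with a callback, as in the regex-based answer.
    return PATTERN.sub(
        lambda m: ",".join([m.group(1) or "title", m.group(2), m.group(3) or "fname"]),
        record,
    )

def normalize_record(record):
    # partition()-based version, condensed from the function above.
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    return "{0},{1},{2}".format(title.strip() or 'title', name.strip(), fname.strip() or 'fname')

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Both approaches should agree before their timings are compared.
assert [normalize_regex(s) for s in samples] == [normalize_record(s) for s in samples]

for func in (normalize_regex, normalize_record):
    seconds = timeit.timeit(lambda: [func(s) for s in samples], number=10000)
    print("{0}: {1:.4f}s".format(func.__name__, seconds))
```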

For better integration, you can define another function to handle each item:

def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"

Demo:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)

You get:

'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'
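Since the question starts from a list of lines rather than one long string, normalize_record can also be applied to that list directly. A minimal sketch, assuming `dataset` looks like the result of readlines() in the question (the values below are a made-up stand-in):

```python
def normalize_record(record):
    # Same logic as the function defined earlier in this answer, condensed.
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    return "{0},{1},{2}".format(title.strip() or 'title', name.strip(), fname.strip() or 'fname')

# Hypothetical stand-in for what fileset.readlines() returns in the question.
dataset = ["Mr: Smith\n", "Mr: Smith; John\n", "Smith\n", "Smith; John\n"]

# Skip blank lines so they do not turn into 'title,,fname'.
dataset = [normalize_record(line) + "\n" for line in dataset if line.strip()]
print(dataset)
```

The strip() calls inside normalize_record take care of the trailing newline, so the lines do not need to be cleaned beforehand.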

solution1
2 ACCEPTED 2018-01-05 21:32:40

solution2
1 2018-01-05 22:15:12