
Python re.sub() optimization

I have a Python list in which each string takes one of the following four forms (the names will of course differ):

Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n

I want these to be corrected to:

Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n

This is easy enough to do with four re.sub() calls:

import re

with open("path/to/file", 'r') as fileset:
    dataset = [line.strip() for line in fileset]    # removes some misc. white noise

cleaned = []
for item in dataset:
    # Each pattern excludes commas so an already-converted line is not rewritten again.
    item = re.sub(r'^([^,:;]+):\W([^,;]+);\W', r'\g<1>,\g<2>,', item)
    item = re.sub(r'^([^,:;]+);\W(.*)$', r'title,\g<1>,\g<2>', item)
    item = re.sub(r'^([^,:;]+):\W(.*)$', r'\g<1>,\g<2>,fname', item)
    item = re.sub(r'^([^,:;]+)$', r'title,\g<1>,fname', item)
    cleaned.append(item)

While this is fine for the dataset I'm using, I want to be more efficient.
Is there a single operation that can simplify this process?

Please pardon me if I forgot a quote or some such; I'm not at my workstation now, and I'm aware I've stripped the newline (\n).

Thank you,

Brief

Instead of running two loops, you can reduce it to a single loop that substitutes each line as it is read. Adapted from How to iterate over the file in Python (and using the code in my Code section):

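import re

# r and repl are defined in the Code section
with open("path/to/file", 'r') as f:
    for x in f:
        # drop the trailing newline so it is not captured into group 2
        print(re.sub(r, repl, x.rstrip("\n")))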

See Python - How to use regexp on file, line by line, in Python for other alternatives.


Code

For ease of viewing, I've changed your file to an array (a Python list).

See regex in use here

^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?

Note: You don't need all of that in Python; I added \r\n to the character classes only to show it on regex101, so your regex would actually just be ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?

Usage

See code in use here

import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title" ) + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))

Explanation

  • ^ Assert position at the start of the line
  • (?:([^:]+):\W*)? Optionally match the following
    • ([^:]+) Capture any character except : one or more times into capture group 1
    • : Match this literally
    • \W* Match any number of non-word characters (copied from OP's original code; I assume \s* can be used instead)
  • ([^;]+) Capture any character except ; one or more times into capture group 2
  • (?:;\W*(.+))? Optionally match the following
    • ; Match this literally
    • \W* Match any number of non-word characters (copied from OP's original code; I assume \s* can be used instead)
    • (.+) Capture any character one or more times into capture group 3
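
To check the breakdown above, here is a minimal sketch that prints what each group captures for the four input shapes. The names pattern, samples and captured are my own and only serve the illustration:

```python
import re

# Same pattern as in the Usage section; the variable names are illustrative only.
pattern = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# groups() reports None for an optional group that did not take part in the match.
captured = [pattern.match(s).groups() for s in samples]

for s, groups in zip(samples, captured):
    print(repr(s), "->", groups)
```

An optional group that is skipped shows up as None, which is what the replacement function relies on to fall back to its placeholders.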

Given the above explanation of the regex, re.sub(r, repl, s) works as follows:

  • repl is a callback to the repl function, which returns:
    • group 1 if it captured anything, title otherwise
    • group 2 (it's supposedly always set - using OP's logic here again)
    • group 3 if it captured anything, fname otherwise
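
The same callback logic can also be sketched with named groups, which some readers may find easier to follow. The group names title, name and fname, and the variables pattern, lines and result, are my own choices and not part of the answer above, which uses numbered groups:

```python
import re

# Named groups are an assumption made for readability; behaviour matches the numbered version.
pattern = re.compile(
    r"^(?:(?P<title>[^:]+):\W*)?(?P<name>[^;]+)(?:;\W*(?P<fname>.+))?"
)

def repl(m):
    # A group that did not participate in the match is None, so `or` supplies the placeholder.
    return ",".join([
        m.group("title") or "title",
        m.group("name"),
        m.group("fname") or "fname",
    ])

lines = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]
result = [pattern.sub(repl, line) for line in lines]
print(result)
```

Compiling the pattern once with re.compile also avoids re-parsing it on every call, although the re module caches recently used patterns anyway.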

IMHO, regexes are just too complex here; you can use classic string functions to split your string item into chunks. For that, you can use partition (or rpartition).

First, split your item string into "records", like this:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']

Then, you can create a short function to normalize each "record". Here is an example:

def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

This function is easier to understand than a collection of regexes. And, in most cases, it is faster.
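
The speed claim is easy to check on your own machine. Below is a rough timing sketch using timeit; the helper names with_regex and with_partition and the repetition count are my own assumptions, and the numbers will vary with the Python version and the data:

```python
import re
import timeit

PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def with_regex(record):
    # Regex approach from the first answer, wrapped in a function for timing.
    return PATTERN.sub(
        lambda m: (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname"),
        record,
    )

def with_partition(record):
    # partition/rpartition approach, same logic as normalize_record.
    name, _, fname = record.partition(";")
    title, _, name = name.rpartition(":")
    return "{0},{1},{2}".format(
        title.strip() or "title", name.strip(), fname.strip() or "fname"
    )

records = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Both approaches should agree before their timings are compared.
assert [with_regex(r) for r in records] == [with_partition(r) for r in records]

for func in (with_regex, with_partition):
    seconds = timeit.timeit(lambda: [func(r) for r in records], number=20000)
    print(func.__name__, round(seconds, 3))
```

Treat the output as indicative only; for a small dataset, either approach is likely fast enough.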

For better integration, you can define another function to handle each item:

def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"

Demo:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)

You get:

'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'
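
To tie this back to the file-based loop in the question, here is a minimal sketch that applies normalize_record line by line. The io.StringIO object is only a stand-in for open("path/to/file"), and the names source and normalized are my own:

```python
import io

def normalize_record(record):
    # Same logic as the function shown earlier in this answer.
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

# Stand-in for a real file; replace with open("path/to/file") in practice.
source = io.StringIO("Mr: Smith\nMr: Smith; John\nSmith\nSmith; John\n")

# Iterating over a file object yields one line at a time; blank lines are skipped.
normalized = [normalize_record(line) for line in source if line.strip()]
print("\n".join(normalized))
```

Because normalize_record strips each part, the trailing newline on every line read from the file needs no separate handling.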

