Python re.sub() optimization

Question

I have a Python list in which each string is one of the following four possible options (the names will of course vary):

Mr: Smith\n
Mr: Smith; John\n
Smith\n
Smith; John\n

I would like these corrected to:

Mr,Smith,fname\n
Mr,Smith,John\n
title,Smith,fname\n
title,Smith,John\n

This is easy with four re.sub() calls:

import re

with open("path/to/file", 'r') as fileset:
    dataset = [line.strip() for line in fileset]    # removes some misc. white noise

for item in dataset:
    item = re.sub(r'(.*):\W(.*);\W', r'\g<1>,\g<2>,', item)
    item = re.sub(r'(.*);\W(.*)', r'title,\g<1>,\g<2>', item)
    item = re.sub(r'(.*):\W(.*)', r'\g<1>,\g<2>,fname', item)
    item = re.sub(r'^([^,]*)$', r'title,\g<1>,fname', item)    # only lines the calls above left untouched

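While this works fine for the dataset I'm using, I'd like to be more efficient.
Is there a single operation that would simplify this process?

Please forgive me if I've left out quotes or anything else; I'm not at my workstation right now, and I know I've stripped the newlines (\n).

Thanks,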

Answer 1

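Brief

You can reduce this to a single line instead of running two loops. Adapted from How to iterate over a file in Python (and using the code from my Code section):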

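import re

# r and repl are defined in the Usage section below
with open("path/to/file", 'r') as f:
    for x in f:
        # drop the trailing newline so it does not end up inside a capture group
        print(re.sub(r, repl, x.rstrip("\r\n")))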

See Python - How to use regexp line by line in Python for other alternatives.

Code

For ease of viewing, I've changed the file to an array.

See the regex in use here

^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?

Note: in Python you don't need everything that is there for the sake of displaying it on regex101, so your regex is really just ^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?
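As a side note, the longer form with \r\n in the character classes becomes useful if the substitution is run once over the whole file contents rather than line by line. A minimal sketch, assuming the text has already been read into a single string; the names `pattern`, `fill_defaults` and `text` are illustrative and not part of the original post:

```python
import re

# Longer pattern from the regex101 demo: the \r\n exclusions keep each match on one line.
# re.MULTILINE lets ^ match at the start of every line, not only at the start of the string.
pattern = re.compile(r"^(?:([^:\r\n]+):\W*)?([^;\r\n]+)(?:;\W*(.+))?", re.MULTILINE)

def fill_defaults(m):
    # Optional groups that did not take part in the match are None,
    # so fall back to the placeholder words.
    return "{},{},{}".format(m.group(1) or "title", m.group(2), m.group(3) or "fname")

# Hypothetical file contents, already read into one string.
text = "Mr: Smith\nMr: Smith; John\nSmith\nSmith; John\n"
converted = pattern.sub(fill_defaults, text)
print(converted)
```

One caveat with this variant: \W* can also match line breaks, so a line that ends directly in : or ; would pull in the start of the next line. Replacing \W* with [ \t]* is a stricter choice if that can occur in the data.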

Usage

See the code in use here

import re

a = [
    "Mr: Smith",
    "Mr: Smith; John",
    "Smith",
    "Smith; John"
]
r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

def repl(m):
    return (m.group(1) or "title" ) + "," + m.group(2) + "," + (m.group(3) or "fname")

for s in a:
    print(re.sub(r, repl, s))

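Explanation

^ asserts position at the start of the line
(?:([^:]+):\W*)? optionally matches the following
- ([^:]+) captures any character except : one or more times into capture group 1
- : matches the colon literally
- \W* matches any number of non-word characters (copied from the OP's original code; I assume \s* could be used instead)
([^;]+) captures any character except ; one or more times into capture group 2
(?:;\W*(.+))? optionally matches the following
- ; matches the semicolon literally
- \W* matches any number of non-word characters (copied from the OP's original code; I assume \s* could be used instead)
- (.+) captures any character (except a newline) one or more times into capture group 3

Given the explanation of the regex parts above, re.sub(r, repl, s) works as follows:

repl is a callback to the repl function, which returns:
- group 1 if it captured anything, otherwise title
- group 2 (it should always be set, again following the OP's logic)
- group 3 if it captured anything, otherwise fname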
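To make the fallback logic concrete, the short sketch below prints what each capture group holds for the four sample inputs. It reuses the pattern from the Usage section; the names `samples` and `captured` are only illustrative:

```python
import re

r = r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?"

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Collect the three capture groups for each sample line.
captured = {s: re.match(r, s).groups() for s in samples}
for s, groups in captured.items():
    print(repr(s), "->", groups)

# 'Mr: Smith'       -> ('Mr', 'Smith', None)
# 'Mr: Smith; John' -> ('Mr', 'Smith', 'John')
# 'Smith'           -> (None, 'Smith', None)
# 'Smith; John'     -> (None, 'Smith', 'John')
```

An optional group that did not participate in the match is reported as None, which is falsy, so `m.group(1) or "title"` and `m.group(3) or "fname"` pick the placeholder exactly in those cases.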

Answer 2

IMHO, a regex is too complicated here; you can use classic string functions to split your string items. For this, you can use partition (or rpartition).
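For readers less familiar with these two methods, here is a small illustration of what they return, in particular when the separator is missing. The variable names are only for this illustration:

```python
# str.partition splits at the first separator, str.rpartition at the last one.
# Both always return a 3-tuple: (before, separator, after).
with_fname = "Smith; John".partition(";")    # ('Smith', ';', ' John')
with_title = "Mr: Smith".rpartition(":")     # ('Mr', ':', ' Smith')

# When the separator is missing, they differ in where the whole string ends up.
no_fname = "Smith".partition(";")            # ('Smith', '', '')
no_title = "Smith".rpartition(":")           # ('', '', 'Smith')

for result in (with_fname, with_title, no_fname, no_title):
    print(result)
```

This difference is what makes the pairing convenient: with partition(';') a missing first name leaves the rest of the record in the first slot, and with rpartition(':') a missing title leaves the name in the last slot.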

First, split your item string into "records", like this:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
records = item.splitlines()
# -> ['Mr: Smith', ' Mr: Smith; John', ' Smith', ' Smith; John']

Then, you can create a short function to normalize each "record". Here is an example:

def normalize_record(record):
    # type: (str) -> str
    name, _, fname = record.partition(';')
    title, _, name = name.rpartition(':')
    title = title.strip() or 'title'
    name = name.strip()
    fname = fname.strip() or 'fname'
    return "{0},{1},{2}".format(title, name, fname)

This function is easier to understand than a collection of regexes. And, in most cases, it is faster.
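The speed claim is easy to check on your own data. Below is a rough sketch using timeit; the helper names `with_regex` and `with_partition` are made up for this comparison, and the actual numbers depend on the Python version and the input, so none are quoted here:

```python
import re
import timeit

PATTERN = re.compile(r"^(?:([^:]+):\W*)?([^;]+)(?:;\W*(.+))?")

def with_regex(record):
    # Single-pass regex approach, as in Answer 1.
    return PATTERN.sub(
        lambda m: (m.group(1) or "title") + "," + m.group(2) + "," + (m.group(3) or "fname"),
        record,
    )

def with_partition(record):
    # String-method approach, same logic as normalize_record above.
    name, _, fname = record.partition(";")
    title, _, name = name.rpartition(":")
    return "{0},{1},{2}".format(title.strip() or "title", name.strip(), fname.strip() or "fname")

samples = ["Mr: Smith", "Mr: Smith; John", "Smith", "Smith; John"]

# Both approaches should agree before their speed is compared.
assert [with_regex(s) for s in samples] == [with_partition(s) for s in samples]

for func in (with_regex, with_partition):
    seconds = timeit.timeit(lambda: [func(s) for s in samples], number=10000)
    print("{0}: {1:.3f} s".format(func.__name__, seconds))
```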

For better integration, you can define another function to process each item:

def normalize(row):
    records = row.splitlines()
    return "\n".join(normalize_record(record) for record in records) + "\n"

Demo:

item = "Mr: Smith\n Mr: Smith; John\n Smith\n Smith; John\n"
item = normalize(item)

You get:

'Mr,Smith,fname\nMr,Smith,John\ntitle,Smith,fname\ntitle,Smith,John\n'

Answer 1
2, accepted, 2018-01-05 21:32:40

Answer 2
1, 2018-01-05 22:15:12
