
How to split this using Regular expression in Python

I have a string of this type:

"Cat/Wheat , Com, Ogl/oyher Face Express/Star,"

I want it like this:

["Cat,Wheat,Com,Ogl,oyher,Face,Express,Star"]

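Basically, split on "," and "/".

I tried using the split function, but that required a double for loop, which is not very efficient.

I did some research and found regular expressions: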

re.split('\W+',string , 1)

But this doesn't work. What should I add to the filter?

It's not clear why you pass the maxsplit argument 1 to split(), since that prevents it from splitting everything you want.
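To see the effect, here is a minimal sketch of the call from the question with the limit left in place. The variable name `parts` is only for illustration; the third positional argument of re.split() is maxsplit, written as a keyword here for clarity.

```python
import re

s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"

# maxsplit=1 stops after the first separator that matches (the "/"),
# so everything after it comes back as one unsplit remainder.
parts = re.split(r'\W+', s, maxsplit=1)
print(parts)
# ['Cat', 'Wheat , Com, Ogl/oyher Face Express/Star,']
```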

Without it, you get:

> import re
> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
> re.split(r'\W+', s)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star', '']

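Note the awkward empty string at the end. You could filter it out, but you may prefer re.findall(), which matches what you want instead of splitting on what you don't: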

> import re
> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
> re.findall(r'\w+', s)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
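For comparison, the filtering route could look roughly like this. It is only a sketch: the name `tokens` is illustrative, and a list comprehension is one of several ways to drop the empty string.

```python
import re

s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"

# Split on runs of non-word characters, then keep only non-empty pieces.
tokens = [t for t in re.split(r'\W+', s) if t]
print(tokens)
# ['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
```

The result matches the re.findall() version; the difference is that the empty string left by the trailing comma has to be removed explicitly.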

To get a single comma-separated string (if that is what you want), you can join the result:

> import re
> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
> ",".join(re.findall(r'\w+', s))
'Cat,Wheat,Com,Ogl,oyher,Face,Express,Star'

> import re
> data = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
> words = re.findall(r"[\w']+", data)
> print(words)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
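The pattern above uses the character class [\w'] rather than plain \w, so apostrophes count as part of a word. On the sample string this makes no difference, but on text with contractions it does. A small sketch, where the sample sentence is made up for illustration:

```python
import re

text = "don't stop/me now"

# Plain \w+ treats the apostrophe as a separator.
plain = re.findall(r"\w+", text)
# [\w']+ keeps the apostrophe inside the word.
with_apostrophe = re.findall(r"[\w']+", text)

print(plain)            # ['don', 't', 'stop', 'me', 'now']
print(with_apostrophe)  # ["don't", 'stop', 'me', 'now']
```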

If speed is what you are after, you are better off with a series of plain Python string operations:

def multisplit(s, splits=('/', ','), base_split=' '):
    for split in splits:
        s = s.replace(split, base_split)
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))
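The return line branches on whether base_split is pure whitespace: `base_split.split()` is an empty list in that case, so the function falls back to `s.split()`, which already collapses runs of whitespace. Otherwise it splits on base_split and filters out the empty strings. The sketch below repeats the function (with the branch written out as an if statement) so that it runs on its own; the `'|'` separator and the second input are illustrative choices.

```python
def multisplit(s, splits=('/', ','), base_split=' '):
    # Replace every separator with one common separator.
    for split in splits:
        s = s.replace(split, base_split)
    if not base_split.split():
        # base_split is whitespace: str.split() collapses repeated blanks.
        return s.split()
    # Otherwise split on base_split and drop the empty pieces.
    return list(filter(bool, s.split(base_split)))


print(multisplit("Cat/Wheat , Com, Ogl/oyher Face Express/Star,"))
# ['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
print(multisplit("a/b,,c d", base_split='|'))
# ['a', 'b', 'c d']
```

Note that with a non-whitespace base_split, spaces are no longer treated as separators, which is why 'c d' stays together in the second call.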

Or, even faster (for somewhat larger inputs):

import functools

def multisplit2(s, splits=('/', ','), base_split=' '):
    # fold over the separators, replacing each one (r) with base_split
    s = functools.reduce(lambda t, r: t.replace(r, base_split), splits, s)
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))

A quick comparison with the re-based solutions shows that the proposed approach is roughly 4 to 13 times faster in the timings below:

import re


def re_findall(s):
    return re.findall(r"[\w']+", s)

def re_split(s):
    return list(filter(bool, re.split(r'\W+', s)))


s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
print(re_findall(s))
print(re_split(s))
print(multisplit(s))
# ['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']

%timeit re_findall(s)
# 2.54 µs ± 9.14 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit re_split(s)
# 3.05 µs ± 6.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit multisplit(s)
# 631 ns ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit multisplit2(s)
# 908 ns ± 12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit re_findall(s * 1000)
# 1.55 ms ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit re_split(s * 1000)
# 1.96 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit multisplit(s * 1000)
# 222 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit multisplit2(s * 1000)
# 149 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
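
The %timeit lines above are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The repetition count of 10000 is an arbitrary choice, multisplit is simplified here to the whitespace case, and the numbers will differ from machine to machine (they also include the small overhead of the lambda).

```python
import re
import timeit


def re_findall(s):
    return re.findall(r"[\w']+", s)


def multisplit(s, splits=('/', ',')):
    # Simplified variant: separators are replaced by a space only.
    for split in splits:
        s = s.replace(split, ' ')
    return s.split()


s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
number = 10000  # arbitrary repetition count for this sketch

timings = {}
for func in (re_findall, multisplit):
    # timeit.timeit returns the total seconds for `number` calls.
    total = timeit.timeit(lambda: func(s), number=number)
    timings[func.__name__] = total
    print(f"{func.__name__}: {total / number * 1e6:.2f} µs per call")
```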

