
How to split this using a regular expression in Python

I have a string of this form:

"Cat/Wheat , Com, Ogl/oyher Face Express/Star,"

I want to get this:

["Cat,Wheat,Com,Ogl,oyher,Face,Express,Star"]

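Basically, split at "," and "/".

I tried the split function, but that required a nested for loop, which is not very efficient.

I did some research and found regular expressions: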

re.split('\W+',string , 1)

But this does not work. What should I add to the pattern?

It is not clear why you pass the maxsplit argument of 1 to split(); it prevents the call from splitting everything you need.

Without it, you get:

> import re
> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
> re.split(r'\W+', s)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star', '']

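That leaves an empty string at the end, produced by the trailing comma. You can filter it out, but you may prefer re.findall(), which matches what you want rather than splitting on what you don't want: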

> import re
> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
> re.findall(r'\w+', s)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
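
As a small illustration of the two points above (a sketch, not part of the original answer): with maxsplit=1 only the first separator is used, and the trailing empty string from re.split() can be dropped with a list comprehension. The variable names first_only and parts are illustrative.

```python
import re

s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"

# maxsplit=1 stops after the first separator, so the rest stays in one piece.
first_only = re.split(r'\W+', s, maxsplit=1)
print(first_only)  # ['Cat', 'Wheat , Com, Ogl/oyher Face Express/Star,']

# Without maxsplit, drop the empty string produced by the trailing comma.
parts = [p for p in re.split(r'\W+', s) if p]
print(parts)  # ['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
```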

To get a single comma-separated string (if that is what you want), you can join the matches:

> import re
> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
> ",".join(re.findall(r'\w+', s))
'Cat,Wheat,Com,Ogl,oyher,Face,Express,Star'

>> import re
>> data = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
>> words = re.findall(r"[\w']+", data)
>> print(words)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
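
The only difference between r"[\w']+" and r'\w+' is the apostrophe inside the character class, which keeps contractions such as "don't" in one piece. A short sketch with a made-up sample string (the string and variable names are assumptions, not from the question):

```python
import re

text = "don't stop/it's fine, ok"  # hypothetical sample, not from the question

plain = re.findall(r"\w+", text)
with_apostrophe = re.findall(r"[\w']+", text)

print(plain)            # ['don', 't', 'stop', 'it', 's', 'fine', 'ok']
print(with_apostrophe)  # ["don't", 'stop', "it's", 'fine', 'ok']
```

For the string in the question, which contains no apostrophes, both patterns give the same result.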

If speed is what you are after, you are better off with a series of plain Python string operations:

def multisplit(s, splits=('/', ','), base_split=' '):
    # Replace every separator with base_split so a single split suffices.
    for split in splits:
        s = s.replace(split, base_split)
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))

Or, even faster (for somewhat larger inputs):

import functools

def multisplit2(s, splits=('/', ','), base_split=' '):
    # Same idea, folding the replacements over the string with reduce().
    s = functools.reduce(lambda t, r: t.replace(r, base_split), splits, s)
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))
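
To see what the final line of these functions does, here is a sketch of multisplit() (with the closing parenthesis restored and the conditional unrolled) called with both a whitespace and a non-whitespace base_split. The '|' separator and the sample strings are assumptions chosen for illustration.

```python
def multisplit(s, splits=('/', ','), base_split=' '):
    # Normalize every separator to base_split, then split once.
    for split in splits:
        s = s.replace(split, base_split)
    # Whitespace base_split: str.split() collapses runs and drops empties.
    if not base_split.split():
        return s.split()
    # Otherwise consecutive separators yield empty strings, so filter them out.
    return list(filter(bool, s.split(base_split)))


default_result = multisplit("Cat/Wheat , Com,")
print(default_result)  # ['Cat', 'Wheat', 'Com']

pipe_result = multisplit("a//b,c,", base_split='|')
print(pipe_result)     # ['a', 'b', 'c']

# Spaces are not treated as separators when base_split is not whitespace.
spaced = multisplit("a b/c", base_split='|')
print(spaced)          # ['a b', 'c']
```

With the default base_split=' ' the spaces already present in the input act as separators too, which is why the question's string splits cleanly.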

A quick comparison with the re-based solutions shows that the suggested approaches are roughly 5 to 10 times faster:

import re


def re_findall(s):
    return re.findall(r"[\w']+", s)

def re_split(s):
    return list(filter(bool, re.split(r'\W+', s)))


s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
print(re_findall(s))
print(re_split(s))
print(multisplit(s))
# ['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']

%timeit re_findall(s)
# 2.54 µs ± 9.14 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit re_split(s)
# 3.05 µs ± 6.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit multisplit(s)
# 631 ns ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit multisplit2(s)
# 908 ns ± 12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit re_findall(s * 1000)
# 1.55 ms ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit re_split(s * 1000)
# 1.96 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit multisplit(s * 1000)
# 222 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit multisplit2(s * 1000)
# 149 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
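
The %timeit lines are IPython magics. Outside IPython, a rough equivalent can be written with the standard-library timeit module. The sketch below is an assumption about how one might reproduce the comparison, not the answerer's benchmark script. It redefines the functions with the syntax and variable-name fixes applied, and the loop counts are arbitrary. Absolute numbers will vary by machine and Python version.

```python
import functools
import re
import timeit


def re_findall(s):
    return re.findall(r"[\w']+", s)


def re_split(s):
    return list(filter(bool, re.split(r'\W+', s)))


def multisplit(s, splits=('/', ','), base_split=' '):
    for split in splits:
        s = s.replace(split, base_split)
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))


def multisplit2(s, splits=('/', ','), base_split=' '):
    s = functools.reduce(lambda t, r: t.replace(r, base_split), splits, s)
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))


s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
funcs = (re_findall, re_split, multisplit, multisplit2)

# All four approaches should agree on the sample input before timing them.
results = [f(s) for f in funcs]
assert all(r == results[0] for r in results)

for text, label in ((s, "short"), (s * 1000, "long")):
    # Loop counts are arbitrary choices for this sketch.
    n = 1000 if label == "short" else 20
    for f in funcs:
        # Take the best of 3 repeats and report the time per call.
        best = min(timeit.repeat(lambda: f(text), number=n, repeat=3)) / n
        print(f"{f.__name__:12s} {label:5s} {best * 1e6:10.2f} µs per call")
```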

