How to split this using Regular expression in Python
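I have a string like this:
"Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
and I want this:
["Cat,Wheat,Com,Ogl,oyher,Face,Express,Star"]
Basically, split at "," and "/".
I tried the split function, but that needed a double for loop, which is not very efficient.
I did some research and found regular expressions:
re.split('\W+',string , 1)
But this does not work. What should I add to the pattern?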
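It's not clear why you are passing a maxsplit argument of 1 to split(); it stops the split after the first separator, which prevents it from splitting everything you need.
Without it, you get: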
>>> import re
>>> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
>>> re.split(r'\W+', s)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star', '']
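To make the effect of maxsplit concrete, here is a minimal sketch comparing the question's call with the unrestricted one. The variable names `first_only` and `all_parts` are illustrative only.

```python
import re

s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"

# maxsplit=1 stops after the first separator, so only "Cat" is split off
# and the rest of the string comes back as a single piece.
first_only = re.split(r'\W+', s, maxsplit=1)
print(first_only)  # ['Cat', 'Wheat , Com, Ogl/oyher Face Express/Star,']

# Without maxsplit, every run of non-word characters is a separator.
# The trailing comma leaves one empty string at the end of the list.
all_parts = re.split(r'\W+', s)
print(all_parts[-1] == '')  # True
```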
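Note that pesky empty string at the end, left by the trailing comma. You could filter it out, but you may prefer re.findall(), which matches what you want rather than splitting on what you don't want: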
>>> import re
>>> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
>>> re.findall(r'\w+', s)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
To get a single comma-separated string (if that is what you want), you can join the matches:
>>> import re
>>> s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
>>> ",".join(re.findall(r'\w+', s))
'Cat,Wheat,Com,Ogl,oyher,Face,Express,Star'
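If you would rather name the separators explicitly than rely on `\W`, a character class is one option. The sketch below is an assumption about what the question wants: it treats commas, slashes, and whitespace as separators, since the desired output also splits on spaces. The names `parts`, `words`, and `joined` are illustrative.

```python
import re

s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"

# Split on runs of commas, slashes, and whitespace only.
parts = re.split(r'[,/\s]+', s)

# Drop the empty string left behind by the trailing comma.
words = [p for p in parts if p]
joined = ",".join(words)
print(joined)  # Cat,Wheat,Com,Ogl,oyher,Face,Express,Star

# Unlike \W+, other punctuation such as a hyphen is kept inside a word.
print(re.split(r'[,/\s]+', "Sea-Food/Rice"))  # ['Sea-Food', 'Rice']
```

Whether hyphens or other punctuation should split a word depends on your data, so choose between this and `\W+` accordingly.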
>>> import re
>>> data = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
>>> words = re.findall(r"[\w']+", data)
>>> print(words)
['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
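The pattern `[\w']+` differs from `\w+` only in that it also accepts apostrophes. The sample string makes no use of that, so here is a small sketch with a made-up input (`text` is hypothetical, not from the question) where the two patterns diverge:

```python
import re

# Hypothetical input containing apostrophes, for illustration only.
text = "Farmer's Market/Cat's Cradle, Star"

plain = re.findall(r"\w+", text)
with_apostrophe = re.findall(r"[\w']+", text)

print(plain)            # ['Farmer', 's', 'Market', 'Cat', 's', 'Cradle', 'Star']
print(with_apostrophe)  # ["Farmer's", 'Market', "Cat's", 'Cradle', 'Star']
```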
If you are after speed, you may be better off with a series of plain Python string operations:
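def multisplit(s, splits=('/', ','), base_split=' '):
    # Replace every delimiter with base_split, then split once.
    for split in splits:
        s = s.replace(split, base_split)
    # A whitespace base_split lets str.split() drop empty strings itself;
    # otherwise split on base_split and filter out the empty strings.
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))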
Or, even faster (for slightly larger inputs):
import functools
def multisplit2(s, splits=('/', ','), base_split=' '):
    # Same idea, folding the replacements with functools.reduce.
    s = functools.reduce(lambda t, r: t.replace(r, base_split), splits, s)
    return s.split() if not base_split.split() else list(filter(bool, s.split(base_split)))
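A related pure-string variant, not part of the answer above and not benchmarked here, is to map the delimiters to spaces in one pass with str.translate. This is only a sketch: the name `translate_split` is hypothetical, and it assumes every delimiter is a single character.

```python
def translate_split(s, splits="/,"):
    # Build a table that maps each delimiter character to a space.
    table = str.maketrans(splits, " " * len(splits))
    # After translation, str.split() with no argument splits on runs of
    # whitespace and discards empty strings.
    return s.translate(table).split()


s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
result = translate_split(s)
print(result)
# ['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
```

Multi-character delimiters would not work with this table, so the replace-based functions remain the more general form.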
A quick comparison with the re-based solutions suggests that the string-based approach is roughly 5 to 10 times faster on these inputs:
import re
def re_findall(s):
    return re.findall(r"[\w']+", s)
def re_split(s):
    # Raw string so that \W is passed to the regex engine unchanged.
    return list(filter(bool, re.split(r'\W+', s)))
s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
print(re_findall(s))
print(re_split(s))
print(multisplit(s))
# ['Cat', 'Wheat', 'Com', 'Ogl', 'oyher', 'Face', 'Express', 'Star']
%timeit re_findall(s)
# 2.54 µs ± 9.14 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit re_split(s)
# 3.05 µs ± 6.54 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit multisplit(s)
# 631 ns ± 10.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit multisplit2(s)
# 908 ns ± 12 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit re_findall(s * 1000)
# 1.55 ms ± 5.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit re_split(s * 1000)
# 1.96 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit multisplit(s * 1000)
# 222 µs ± 7.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit multisplit2(s * 1000)
# 149 µs ± 1.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
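The `%timeit` lines above are IPython magics and will not run in a plain Python script. A rough equivalent using the standard `timeit` module might look like the sketch below. It uses a simplified `multisplit` that only handles a whitespace `base_split`, and the repeat counts are arbitrary choices. Absolute numbers will vary by machine and Python version.

```python
import re
import timeit


def re_findall(s):
    return re.findall(r"[\w']+", s)


def multisplit(s, splits=('/', ','), base_split=' '):
    # Simplified variant: assumes base_split is whitespace.
    for split in splits:
        s = s.replace(split, base_split)
    return s.split()


s = "Cat/Wheat , Com, Ogl/oyher Face Express/Star,"
number = 10_000  # calls per timing run (arbitrary)

for func in (re_findall, multisplit):
    # Take the best of 5 runs to reduce noise from other processes.
    best = min(timeit.repeat(lambda: func(s), number=number, repeat=5))
    print(f"{func.__name__}: {best / number * 1e9:.0f} ns per call")
```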