Parse all substrings before a pattern without a loop?
I have a long string that consists of many numbers separated by spaces (and sometimes there's even a newline thrown in). I'd like to go through the string and append to a new list every number that comes immediately before the start of a run of 0.000000000000000000e+00 values. So here's a sample of my string:
my_string = '1.249132165057832031e+13 1.638194600635518555e+13 2.127995187558799219e+13 2.744617593148214062e+13 -2.558800658636701519e+28 5.918883595148564680e+30 3.603563681248702509e+31 4.325917213186498068e+31 4.911908042151239481e+31 4.463331378152286632e+31 3.684371076399113503e+31 2.500614504012405068e+31 9.997365425073173512e+30 -7.046725649106466938e+30 -2.192076417151744811e+31 -2.531287564917444482e+31 -6.962936418905874724e+30 3.281685507310205847e+31 9.241630178064907840e+31 1.730544785932614751e+32 2.619210949875333106e+32 2.984440142196566918e+32 8.964375812060072923e+31 -8.515727465135046667e+32 -3.425309034394939997e+33 -8.145884847188906515e+33 -9.922370830834364410e+33 -2.119464668318252366e+28 -1.689726703118075140e+27 1.440101653069986610e+26 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 6.186324149659251562e+13 8.113154959294240625e+13 1.053889122977165625e+14 1.359271226298647969e+14 -2.097046363337115528e+28 4.850777756495711585e+30 2.953274256558218597e+31 3.545273642763729060e+31 4.025456872055449111e+31 3.657581460085835446e+31 3.018816679659856350e+31 2.048223110003727437e+31 8.176806147340775115e+30 -5.796250740354887641e+30 -1.798839398031696094e+31 -2.076444435341100150e+31 -5.711669151245612857e+30 2.691583747083509247e+31 7.579958708961477309e+31 1.419395486743453834e+32 2.148287875274468622e+32 2.447859658750551118e+32 7.352862842410293685e+31 -6.984595303325589259e+32 -2.809449882735912952e+33 -6.681296633318354125e+33 -8.138406580426555140e+33 -1.740744048703962454e+28 -1.411749034480591280e+27 8.079362883576220633e+25 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00'
and from this string, all I want in the end would be:
new_list = ['1.440101653069986610e+26', '8.079362883576220633e+25']
I was thinking I'd use regex, but this seems a little tricky, since a bunch of 0.000000000000000000e+00 occurrences are grouped together and I only want the nonzero number right before the first zero in each group. I also cannot assume that there's always an equal number of zeros grouped together.
I also thought of splitting on the spaces and iterating through, but my full string is actually far too long to do this efficiently. How can I do this?
You can use a negative lookbehind assertion:
In [55]: re.findall(r'(\S+)(?<!0\.000000000000000000e\+00)\s+0\.000000000000000000e\+00', my_string)
Out[55]: ['1.440101653069986610e+26', '8.079362883576220633e+25']
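To make the lookbehind easier to follow, here is a minimal sketch on a shortened input. The names ZERO, pattern and sample are made up for this illustration, and the sample data is not from the question. The lookbehind has to be fixed-width in Python's re module, which works here because the zero literal always has the same length.

```python
import re

# The exact zero literal from the question, escaped for use in a regex.
ZERO = r"0\.000000000000000000e\+00"

# (\S+) captures a token, the negative lookbehind rejects the token if it
# ends with the zero literal itself, and the tail requires a zero to follow.
pattern = re.compile(r"(\S+)(?<!" + ZERO + r")\s+" + ZERO)

# Hypothetical sample: two non-zero values, a run of three zeros (one after
# a newline), another non-zero value, and a final single zero.
sample = ("1.5e+03 2.5e+04 0.000000000000000000e+00 0.000000000000000000e+00\n"
          "0.000000000000000000e+00 7.0e+01 0.000000000000000000e+00")

result = pattern.findall(sample)
print(result)  # ['2.5e+04', '7.0e+01']
```

Only the token directly in front of each run is returned; the zeros inside a run are rejected by the lookbehind.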
Using a negative lookahead assertion, the regex could be refined to improve performance, as mentioned in a comment by @revo:
([-+]?\d\.(?!0+e\+0+)\S+)\s+(?:0\.0+e\+00\s*)+
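A small sketch of how that refined pattern behaves, assuming (as in the question's data) that every value is written in scientific notation with exactly one digit before the decimal point. The sample string is invented for the illustration.

```python
import re

# Assumption: one digit before the decimal point, as in the question's data.
# The lookahead (?!0+e\+0+) rejects tokens whose fraction is all zeros, and
# the trailing group consumes the whole run of zero values in one go.
pattern = re.compile(r"([-+]?\d\.(?!0+e\+0+)\S+)\s+(?:0\.0+e\+00\s*)+")

sample = "-3.25e+10 0.000e+00 0.000e+00\n4.5e+01 0.000e+00"

result = pattern.findall(sample)
print(result)  # ['-3.25e+10', '4.5e+01']
```

One caveat with this pattern: the lookahead only inspects the fractional digits and the start of the exponent, so a non-zero value such as 1.000e+05 would be skipped as well. That is harmless for data like the sample in the question, but worth checking against the real input.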
I also cannot assume that there's always an equal number of zeros grouped together.
How can we differentiate, say, 2 consecutive zero values from "a group of zeros"?
Well, given you're looking for at least 5 0.000 patterns, you could match a non-blank pattern (the number) in a capturing group, followed by a non-capturing group that matches the repeated zero pattern (so the zeros are consumed but not returned):
re.findall(r"(\S+)\s+(?:0\.0+e\+00\s+){5,}", my_string)
If there cannot be any zeroes other than those runs, it can be generalized to:
re.findall(r"(\S+)\s+(?:0\.0+e\+00\s+)+", my_string)
(you need the + at the end of the non-capturing group to consume and discard all the zeroes)
result (in both cases):
['1.440101653069986610e+26', '8.079362883576220633e+25']
This also takes care of newlines, and it tolerates a variable number of zeroes in the decimal part.
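The tolerance to newlines and to differently padded zeros can be checked on a short, made-up input. The sketch below also swaps in one small variation that is not part of the answer: `(?:\s+|$)` instead of `\s+` after each zero, so that a run ending exactly at the end of the string (with no trailing whitespace) is still recognized.

```python
import re

# Assumption: a zero value is spelled 0.<one or more zeros>e+00.
# (?:\s+|$) is a variation on the answer's \s+ so that the last zero of a
# run may sit at the very end of the string.
pattern = re.compile(r"(\S+)\s+(?:0\.0+e\+00(?:\s+|$))+")

# Zeros with different padding, a newline inside the run, and a final
# single zero with no whitespace after it.
sample = "1.5e+03 2.5e+04 0.00e+00 0.000e+00\n0.00e+00 7.0e+01 0.0e+00"

result = pattern.findall(sample)
print(result)  # ['2.5e+04', '7.0e+01']
```

With the original `\s+`, the last value in this sample would be missed, because the single zero after it is not followed by whitespace.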
This is about 10-70 times faster than the other solutions.
my_values = my_string.split()
output = [x for x,y in zip(my_values,my_values[1:])
if (y == '0.000000000000000000e+00' and x != '0.000000000000000000e+00')]
print(output)
Or, with islice to save memory, as kindly suggested by @Jean-François Fabre:
import itertools
my_values = my_string.split()
output = [x for x,y in zip(my_values,itertools.islice(my_values,1,None))
          if (y == '0.000000000000000000e+00' and x != '0.000000000000000000e+00')]
print(output)
This works by pairing each element with its successor as (x, y). x should be different from 0.00.. while y should be equal to it. Because the y check comes first, the condition short-circuits to False in most cases and iteration continues. Returns:
['1.440101653069986610e+26', '8.079362883576220633e+25']
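The pairing idea can be wrapped in a small helper to see it work on a short input. The function name values_before_zero_runs, the zero parameter and the sample data are all invented for this sketch.

```python
from itertools import islice

def values_before_zero_runs(text, zero="0.000000000000000000e+00"):
    """Return each non-zero token that is immediately followed by `zero`."""
    # split() with no argument splits on any whitespace, including newlines.
    tokens = text.split()
    # Pair every token x with its successor y without copying the list.
    return [x for x, y in zip(tokens, islice(tokens, 1, None))
            if y == zero and x != zero]

Z = "0.000000000000000000e+00"
sample = f"1.5e+03 2.5e+04 {Z} {Z}\n{Z} 7.0e+01 {Z}"

result = values_before_zero_runs(sample)
print(result)  # ['2.5e+04', '7.0e+01']
```

Note that the comparison is done on the text, so it only recognizes zeros spelled exactly like the zero argument.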
However, another idea (which I would consider the smartest here) would be to use pandas and pd.to_numeric(). When you work with numbers you most likely want to use a library like numpy or pandas. This would be safer, as you could also handle errors smoothly. Also note that in both cases I convert the numbers back to strings (which you could skip).
import pandas as pd
data = pd.Series(pd.to_numeric(my_string.split()))
output = data[(data != 0) & (data.shift(-1) == 0)].astype(str).tolist()
print(output)
# ['1.4401016530699866e+26', '8.07936288357622e+25']  (floats re-formatted by str, so the digits differ from the input text)
And numpy:
import numpy as np
data = np.loadtxt(my_string.split())
output = list(map(str,data[(data != 0) & (np.roll(data, -1) == 0)]))
print(output)
# ['1.4401016530699866e+26', '8.07936288357622e+25']  (floats re-formatted by str, so the digits differ from the input text)
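One detail of the numpy version is that np.roll wraps around, so the last element gets compared against the first one. A possible alternative, sketched here with made-up sample data and not taken from the answer, compares two shifted slices instead, which never looks past the end of the array.

```python
import numpy as np

# Hypothetical short input; the real string would be parsed the same way.
sample = "1.5e+03 2.5e+04 0.0e+00 0.0e+00\n0.0e+00 7.0e+01 0.0e+00"

# Parse the whitespace-separated tokens into a float array.
data = np.array(sample.split(), dtype=float)

# data[:-1] is every element except the last, data[1:] is its successor.
mask = (data[:-1] != 0) & (data[1:] == 0)

result = data[:-1][mask].tolist()
print(result)  # [25000.0, 70.0]
```

This keeps the values as floats; converting them back to strings would not reproduce the original 18-digit formatting.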
fastest --> slowest
100000 loops, best of 3: 9.28 µs per loop <-- Anton vBR list comprehension
10000 loops, best of 3: 98.4 µs per loop <-- Revos Regex
1000 loops, best of 3: 256 µs per loop <-- Anton vBR numpy
1000 loops, best of 3: 425 µs per loop <-- Tzot Regex
1000 loops, best of 3: 513 µs per loop <-- Jean-François Fabre Regex
1000 loops, best of 3: 782 µs per loop <-- liliscent
1000 loops, best of 3: 794 µs per loop <-- Anton vBR pandas
If you want the float values and not their string representations:
import re
list(
    filter(
        None,
        map(
            float,
            re.findall(r"\S+(?=\s0\.0+e)", my_string)
        )))
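Reading the expression from the inside out:
re.findall(r"\S+(?=\s0\.0+e)", my_string) : every token that is immediately followed by one whitespace character and a zero value
map(float, ...) : converts each matched token to a float
filter(None, ...) : drops the falsy results, i.e. the 0.0 values that were themselves followed by another zero
list(...) : collects what is left into a list
Result: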
>>> list(filter(None, map(float, re.findall(r"\S+(?=\s0\.0+e)", my_string))))
[1.4401016530699866e+26, 8.07936288357622e+25]
However, if you still want the string values themselves, let me know; in that case, the map & filter subexpressions need to be modified.
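One possible way to keep the strings, offered only as a sketch and not as the answerer's own modification, is to run the same findall and then filter on the numeric value while returning the original text. The function name nonzero_tokens_before_zero and the sample are hypothetical.

```python
import re

def nonzero_tokens_before_zero(text):
    """Return the original text of each non-zero token followed by a zero."""
    # Same lookahead as above: a token followed by whitespace and 0.0...e
    candidates = re.findall(r"\S+(?=\s0\.0+e)", text)
    # Zeros inside a run also match the lookahead, so drop them by value
    # while keeping the untouched string for everything else.
    return [tok for tok in candidates if float(tok) != 0.0]

sample = "1.5e+03 2.5e+04 0.00e+00 0.00e+00 7.0e+01 0.00e+00"

result = nonzero_tokens_before_zero(sample)
print(result)  # ['2.5e+04', '7.0e+01']
```

Because the filter uses float(tok), it does not depend on how many zeros the zero tokens are padded with.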