Parse all substrings before a pattern without a loop?

I have a long string that consists of many numbers separated by spaces (and sometimes there's even a newline thrown in). I'd like to go through the string and collect, in a new list, every number that comes right before a run of 0.000000000000000000e+00 values starts. Here's a sample of my string:

my_string = '1.249132165057832031e+13 1.638194600635518555e+13 2.127995187558799219e+13 2.744617593148214062e+13 -2.558800658636701519e+28 5.918883595148564680e+30 3.603563681248702509e+31 4.325917213186498068e+31 4.911908042151239481e+31 4.463331378152286632e+31 3.684371076399113503e+31 2.500614504012405068e+31 9.997365425073173512e+30 -7.046725649106466938e+30 -2.192076417151744811e+31 -2.531287564917444482e+31 -6.962936418905874724e+30 3.281685507310205847e+31 9.241630178064907840e+31 1.730544785932614751e+32 2.619210949875333106e+32 2.984440142196566918e+32 8.964375812060072923e+31 -8.515727465135046667e+32 -3.425309034394939997e+33 -8.145884847188906515e+33 -9.922370830834364410e+33 -2.119464668318252366e+28 -1.689726703118075140e+27 1.440101653069986610e+26 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 6.186324149659251562e+13 8.113154959294240625e+13 1.053889122977165625e+14 1.359271226298647969e+14 -2.097046363337115528e+28 4.850777756495711585e+30 2.953274256558218597e+31 3.545273642763729060e+31 4.025456872055449111e+31 3.657581460085835446e+31 3.018816679659856350e+31 2.048223110003727437e+31 8.176806147340775115e+30 -5.796250740354887641e+30 -1.798839398031696094e+31 -2.076444435341100150e+31 -5.711669151245612857e+30 2.691583747083509247e+31 7.579958708961477309e+31 1.419395486743453834e+32 2.148287875274468622e+32 2.447859658750551118e+32 7.352862842410293685e+31 -6.984595303325589259e+32 -2.809449882735912952e+33 -6.681296633318354125e+33 -8.138406580426555140e+33 -1.740744048703962454e+28 -1.411749034480591280e+27 8.079362883576220633e+25 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 
0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00 0.000000000000000000e+00'

From this string, all I want in the end is:

new_list = ['1.440101653069986610e+26', '8.079362883576220633e+25']

I was thinking of using a regex, but this seems a little tricky since there are a bunch of 0.000000000000000000e+00 occurrences grouped together and I only want the nonzero number right before the first zero of each group. I also cannot assume that there's always an equal number of zeros grouped together.

I also thought of splitting on the spaces and iterating through, but my full string is actually far too long to do this efficiently. How can I do this?

You can use a negative lookbehind assertion:

In [55]: re.findall(r'(\S+)(?<!0\.000000000000000000e\+00)\s+0\.000000000000000000e\+00', my_string)
Out[55]: ['1.440101653069986610e+26', '8.079362883576220633e+25']
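To see what the lookbehind contributes, here is a minimal sketch on a shortened, made-up sample (the values 1.5e+01, 2.5e+02 and 3.5e+03 and the names ZERO and sample are illustrative assumptions, not taken from the question's data). It runs the same pattern with and without the (?<!...) part:

```python
import re

# Assumed zero token: "0." followed by 18 zeros and "e+00", as in the question.
ZERO = "0." + "0" * 18 + "e+00"
# Hypothetical shortened sample; the newline mimics the stray line break.
sample = f"1.5e+01 2.5e+02 {ZERO} {ZERO} {ZERO} 3.5e+03\n{ZERO}"

zero_re = re.escape(ZERO)  # escapes "." and "+" so they match literally

# Without the lookbehind, a zero that precedes another zero is captured too.
without_lookbehind = re.findall(rf"(\S+)\s+{zero_re}", sample)

# (?<!...) rejects a captured token that is itself the zero token.
with_lookbehind = re.findall(rf"(\S+)(?<!{zero_re})\s+{zero_re}", sample)

print(without_lookbehind)  # ['2.5e+02', ZERO, '3.5e+03'] with ZERO spelled out
print(with_lookbehind)     # ['2.5e+02', '3.5e+03']
```

The lookbehind has to be fixed-width in Python's re module, which is why it works here only because the zero token always has the same spelling.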

Using a negative lookahead assertion, the regex can be refined to improve performance, as mentioned in a comment by @revo:

([-+]?\d\.(?!0+e\+0+)\S+)\s+(?:0\.0+e\+00\s*)+


I also cannot assume that there's always an equal number of zeros grouped together.

How can we differentiate, say, 2 consecutive zero values from "a group of zeros"?

Well, given you're looking for at least 5 0.000 patterns, you could follow a non-blank pattern (for the number) with a non-capturing group that matches the repeated zero pattern without capturing it:

re.findall(r"(\S+)\s+(?:0\.0+e\+00\s+){5,}", my_string)

If zeros never occur outside these groups, it can be generalized to:

re.findall(r"(\S+)\s+(?:0\.0+e\+00\s+)+", my_string)

(you need the + at the end of the non-capturing group to consume and discard all the zeros)

Result (in both cases):

['1.440101653069986610e+26', '8.079362883576220633e+25']

This also takes care of newlines and tolerates a variable number of zeros in the decimal part.
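One edge case may be worth checking: the pattern requires whitespace after every zero, so a number followed by a single zero at the very end of the string would be missed. The sketch below uses a made-up sample to show this, together with a variant (my assumption, not part of the answer) that also accepts the end of the string after a zero:

```python
import re

# Hypothetical sample: zeros written with different numbers of decimals,
# a newline inside the run, and one lone zero at the very end of the string.
sample = "1.5e+01 2.5e+02 0.000e+00 0.0e+00\n0.000000e+00 3.5e+03 0.00e+00"

# Pattern from the answer: every zero must be followed by whitespace.
strict = re.findall(r"(\S+)\s+(?:0\.0+e\+00\s+)+", sample)

# Assumed variant: a zero may also be followed by the end of the string.
relaxed = re.findall(r"(\S+)\s+(?:0\.0+e\+00(?:\s+|$))+", sample)

print(strict)   # ['2.5e+02']
print(relaxed)  # ['2.5e+02', '3.5e+03']
```

With the string from the question the two patterns should give the same result, since its last run contains more than one zero.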

List comprehension and zip

This is about 10-70 times faster than the other solutions.

my_values = my_string.split()
output = [x for x,y in zip(my_values,my_values[1:]) 
           if (y == '0.000000000000000000e+00' and x != '0.000000000000000000e+00')]
print(output)

Or, with islice to save memory, as kindly suggested by @Jean-François Fabre:

import itertools
my_values = my_string.split()
# islice walks the list from its second element on without copying it
output = [x for x, y in zip(my_values, itertools.islice(my_values, 1, None))
          if (y == '0.000000000000000000e+00' and x != '0.000000000000000000e+00')]
print(output)

This works by grouping the elements in consecutive pairs (x, y). x should differ from 0.00… while y should be equal to it. Doing the y check first means the condition evaluates to False quickly in most cases and iteration continues. Returns:

['1.440101653069986610e+26', '8.079362883576220633e+25']
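The pairing idea can be sketched as a small helper run on a shortened, made-up sample. The function name values_before_zero_runs and the sample values are illustrative assumptions. Note that the comparison is on strings, so the zero token must be spelled exactly as it appears in the data:

```python
ZERO = "0." + "0" * 18 + "e+00"  # assumed exact spelling of the zero token


def values_before_zero_runs(text, zero=ZERO):
    """Return each non-zero token that is immediately followed by `zero`."""
    tokens = text.split()  # splits on spaces and newlines alike
    # zip pairs every token with its successor: (t0, t1), (t1, t2), ...
    return [x for x, y in zip(tokens, tokens[1:]) if y == zero and x != zero]


sample = f"1.5e+01 2.5e+02 {ZERO} {ZERO} 3.5e+03\n{ZERO}"
print(values_before_zero_runs(sample))  # ['2.5e+02', '3.5e+03']
```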

Pandas and numpy

However, another idea (which I would consider the smartest here) is to use pandas and pd.to_numeric(). When you work with numbers you most likely want to use a library like numpy or pandas. This would be safer, as you could also handle errors smoothly. Also note that in both cases I convert the numbers back to strings (which you could skip).

import pandas as pd

data = pd.Series(pd.to_numeric(my_string.split()))
output = data[(data != 0) & (data.shift(-1) == 0)].astype(str).tolist()
print(output)

# ['1.4401016530699866e+26', '8.07936288357622e+25']  (digits shortened by the round trip through float)

And numpy:

import numpy as np

data = np.loadtxt(my_string.split())
# np.roll(data, -1) shifts every element one position to the left
output = list(map(str, data[(data != 0) & (np.roll(data, -1) == 0)]))
print(output)

# ['1.4401016530699866e+26', '8.07936288357622e+25']  (digits shortened by the round trip through float)

Time comparison

fastest --> slowest

100000 loops, best of 3: 9.28 µs per loop  <-- Anton vBR list comprehension
10000 loops, best of 3: 98.4 µs per loop   <-- Revos Regex
1000 loops, best of 3: 256 µs per loop     <-- Anton vBR numpy
1000 loops, best of 3: 425 µs per loop     <-- Tzot Regex
1000 loops, best of 3: 513 µs per loop     <-- Jean-François Fabre Regex 
1000 loops, best of 3: 782 µs per loop     <-- liliscent 
1000 loops, best of 3: 794 µs per loop     <-- Anton vBR pandas
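Timings like these depend on the machine, the Python version and the input, so they are best treated as rough. A minimal sketch of how two of the approaches might be compared with timeit follows. The synthetic input and the function names by_zip and by_regex are assumptions made for illustration, not the setup behind the numbers above:

```python
import re
import timeit

ZERO = f"{0.0:.18e}"  # '0.000000000000000000e+00'
# Synthetic input (an assumption, not the question's data):
# 20 blocks of three non-zero values followed by five zeros.
nonzero = " ".join(f"{v:.18e}" for v in (1.5, -2.25, 3.75))
block = nonzero + " " + " ".join([ZERO] * 5)
text = " ".join([block] * 20)


def by_zip():
    tokens = text.split()
    return [x for x, y in zip(tokens, tokens[1:]) if y == ZERO and x != ZERO]


def by_regex():
    return re.findall(r"(\S+)\s+(?:0\.0+e\+00\s+)+", text)


# Both approaches should agree before their timings are compared.
assert by_zip() == by_regex()

for func in (by_zip, by_regex):
    seconds = timeit.timeit(func, number=1000)
    print(f"{func.__name__}: {seconds / 1000 * 1e6:.2f} µs per call")
```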

If you want the float values and not their string representations:

import re

list(
    filter(
        None,
        map(
            float,
            re.findall(r"\S+(?=\s0\.0+e)", my_string)
)))
  • re.findall(r"\S+(?=\s0\.0+e)", my_string)
    finds all occurrences of non-whitespace character sequences followed by a whitespace character and 0.00000…e
  • map(float, ^ )
    assumes all of the above matches can be converted to float
  • filter(None, ^ )
    filters out all zero floats
  • list( ^ )
    makes the above into a list (a no-op in Python 2, conversion of the lazy filter object into a list in Python 3)

Result:

>>> list(filter(None, map(float, re.findall(r"\S+(?=\s0\.0+e)", my_string))))
[1.4401016530699866e+26, 8.07936288357622e+25]

However, if you still want the string values themselves, let me know; in that case, the map and filter subexpressions need to be modified.
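One possible modification, sketched under my own assumptions (it is not the answerer's stated version): keep the matched strings and call float() only to test for zero. The sample and its values are made up for illustration:

```python
import re

ZERO = "0." + "0" * 18 + "e+00"
sample = f"1.5e+01 2.5e+02 {ZERO} {ZERO} 3.5e+03\n{ZERO}"  # hypothetical input

# Same lookahead as before: tokens followed by whitespace and 0.0…e
candidates = re.findall(r"\S+(?=\s0\.0+e)", sample)

# Keep the original strings; float() is used only to test for zero.
strings = [s for s in candidates if float(s) != 0.0]
print(strings)  # ['2.5e+02', '3.5e+03']
```

This preserves every digit of the original text, which the float conversion does not.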
