Add leading zeros based on a condition in Python

Question

I have a DataFrame with 5 million rows. Suppose the DataFrame looks like this:

>>> df = pd.DataFrame(data={"Random": "86 7639103627 96 32 1469476501".split()})
>>> df
       Random
0          86
1  7639103627
2          96
3          32
4  1469476501

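Note that the Random column is stored as strings.

If a number in the Random column has fewer than 9 digits, I want to add leading zeros to make it 9 digits long. If it has 9 or more digits, I want to add leading zeros to make it 20 digits long.

This is what I did: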

for i in range(0, len(df['Random'])):
    if len(df['Random'][i]) < 9:
        df['Random'][i] = df['Random'][i].zfill(9)
    else:
        df['Random'][i] = df['Random'][i].zfill(20)

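Since there are more than 5 million rows, this takes a very long time! (Throughput was 5 it/sec when measured with tqdm, and the estimated completion time was several days!)

Is there a simpler, faster way to do this?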

Answer 1

Let's combine np.where with zfill. As an alternative, you can look at str.pad.

df.Random=np.where(df.Random.str.len()<9,df.Random.str.zfill(9),df.Random.str.zfill(20))
df
Out[9]: 
                 Random
0             000000086
1  00000000007639103627
2             000000096
3             000000032
4  00000000001469476501
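The str.pad alternative mentioned above is not spelled out in the answer. A minimal sketch of how it might look on the question's sample data, assuming the column holds plain digit strings (the variable names short, long_ and padded are illustrative, not from the original):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Random": "86 7639103627 96 32 1469476501".split()})

# Left-pad with "0" to each target width; for plain digit strings this
# gives the same result as .str.zfill(width).
short = df["Random"].str.pad(9, side="left", fillchar="0")
long_ = df["Random"].str.pad(20, side="left", fillchar="0")

# Pick the 9-wide value for short inputs and the 20-wide value otherwise.
padded = np.where(df["Random"].str.len() < 9, short, long_)
print(padded.tolist())
```

Like the zfill version, this still builds both padded columns in full before np.where selects between them.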

Answer 2

Using 'apply' together with the fill_zeros function written below, I get a runtime of 603 ms on a DataFrame with 1,000,000 rows.

import pandas as pd
from random import randint

data = {
    'Random': [str(randint(0, 100_000_000)) for i in range(0, 1_000_000)]
}

df = pd.DataFrame(data)

def fill_zeros(x):
    if len(x) < 9:
        return x.zfill(9)
    else:
        return x.zfill(20)

%timeit df['Random'].apply(fill_zeros)

603 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Compared with:

%timeit np.where(df.Random.str.len()<9,df.Random.str.zfill(9),df.Random.str.zfill(20))
1.57 s ± 6.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
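Before timing anything, it may be worth confirming that the function does what the question asks at the 9-digit boundary. The following is a small pure-Python sanity check, restating fill_zeros in a compact form that should behave the same as the version above; the sample inputs are made up for illustration:

```python
def fill_zeros(x):
    # Fewer than 9 characters: pad to 9; otherwise pad to 20.
    return x.zfill(9) if len(x) < 9 else x.zfill(20)

# 8 digits is still "short", exactly 9 digits already counts as "long".
eight = fill_zeros("12345678")
nine = fill_zeros("123456789")
# zfill never truncates, so inputs longer than 20 characters are unchanged.
very_long = fill_zeros("1" * 25)

print(eight, nine, very_long, sep="\n")
```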

Answer 3

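Since you asked about efficiency: string operations are one of the common "gotchas" with pandas. They are vectorized, in the sense that you can apply them to an entire Series in one call, but that does not mean they are more efficient than loops. This is a case where a loop is actually faster than the string accessor, which is more about convenience than speed.

When in doubt, make sure you time your functions on your actual data, because something you assume to be clunky and slow may turn out faster than something that looks clean!

I will propose a very basic looping function that I believe will beat any approach that uses the string accessor.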
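One way to follow that advice without extra dependencies is the standard library's timeit module. The sketch below is only illustrative: the sample Series and the helper names with_accessor and with_loop are assumptions, and the data should be replaced with a slice of the real column.

```python
import timeit
import pandas as pd

# Hypothetical sample data; substitute a slice of your real column.
s = pd.Series(["86", "7639103627", "96", "32", "1469476501"] * 20_000)

def with_accessor(series):
    # String-accessor version: builds both padded columns, then selects.
    short = series.str.len() < 9
    return series.str.zfill(9).where(short, series.str.zfill(20))

def with_loop(series):
    # Plain Python loop over the values.
    return pd.Series(
        [el.zfill(9) if len(el) < 9 else el.zfill(20) for el in series],
        index=series.index,
        name=series.name,
    )

# Check that both give the same answer before comparing speed.
assert with_accessor(s).tolist() == with_loop(s).tolist()

for fn in (with_accessor, with_loop):
    # Best of three single runs, in seconds.
    best = min(timeit.repeat(lambda: fn(s), number=1, repeat=3))
    print(f"{fn.__name__}: {best:.3f} s")
```

The absolute numbers depend on the machine, the pandas version and the length distribution of the strings, so only the comparison on your own data is meaningful.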

def loopy(series):
    return pd.Series(
        (
            el.zfill(9) if len(el) < 9 else el.zfill(20)
            for el in series
        ),
        name=series.name,
    )

# to compare more fairly with the apply version
def cache_loopy(series, _len=len, _zfill=str.zfill):
    return pd.Series(
      (_zfill(el, 9 if _len(el) < 9 else 20) for el in series), name=series.name)

Now let's check the timings, using the code from Martijn's answer together with simple_benchmark.

Functions

def loopy(series):
    series.copy()    # not necessary but just to make timings fair
    return pd.Series(
        (
            el.zfill(9) if len(el) < 9 else el.zfill(20)
            for el in series
        ),
        name=series.name,
    )

def str_accessor(series):
    target = series.copy()
    mask = series.str.len() < 9
    unmask = ~mask
    target[mask] = target[mask].str.zfill(9)
    target[unmask] = target[unmask].str.zfill(20)
    return target

def np_where_str_accessor(series):
    target = series.copy()
    return np.where(target.str.len()<9,target.str.zfill(9),target.str.zfill(20))

def fill_zeros(x, _len=len, _zfill=str.zfill):
    # len() and str.zfill() are cached as parameters for performance
    return _zfill(x, 9 if _len(x) < 9 else 20)

def apply_fill(series):
    series = series.copy()
    return series.apply(fill_zeros)

def cache_loopy(series, _len=len, _zfill=str.zfill):
    series.copy()
    return pd.Series(
      (_zfill(el, 9 if _len(el) < 9 else 20) for el in series), name=series.name)

Setup

import pandas as pd
import numpy as np
from random import choices, randrange
from simple_benchmark import benchmark

def randvalue(chars="0123456789", _c=choices, _r=randrange):
    # _c and _r are cached as parameters; use _r rather than the global name
    return "".join(_c(chars, k=_r(5, 30))).lstrip("0")

fns = [loopy, str_accessor, np_where_str_accessor, apply_fill, cache_loopy]
args = { 2**i: pd.Series([randvalue() for _ in range(2**i)]) for i in range(14, 21)}

b = benchmark(fns, args, 'Series Length')

b.plot()

Answer 4

You need to vectorize this: use boolean indexing to select the rows, and call .str.zfill() on each resulting subset:

# select the right rows to avoid wasting time operating on longer strings
shorter = df.Random.str.len() < 9
longer = ~shorter
# assign through .loc; chained assignment can raise SettingWithCopyWarning
df.loc[shorter, "Random"] = df.loc[shorter, "Random"].str.zfill(9)
df.loc[longer, "Random"] = df.loc[longer, "Random"].str.zfill(20)

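Note: I did not use np.where() because we don't want to double the work. The vectorized df.Random.str.zfill() is faster than looping over the rows, but running it twice over the whole column still takes more time than running it once on each subset of rows.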
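To make the "double the work" point concrete, here is a small sketch that counts how often a padding function runs under each approach. It uses Series.map with a counting wrapper as a stand-in for .str.zfill(); the helper counting_zfill and the calls counter are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

s = pd.Series("86 7639103627 96 32 1469476501".split())
calls = {"n": 0}

def counting_zfill(width):
    # Returns a padding function that also counts how often it is called.
    def pad(value):
        calls["n"] += 1
        return value.zfill(width)
    return pad

short = s.str.len() < 9

# np.where: both candidate columns are built in full before selection.
calls["n"] = 0
via_where = np.where(short, s.map(counting_zfill(9)), s.map(counting_zfill(20)))
where_calls = calls["n"]

# Boolean masks: each row is padded exactly once.
calls["n"] = 0
target = s.copy()
target[short] = s[short].map(counting_zfill(9))
target[~short] = s[~short].map(counting_zfill(20))
mask_calls = calls["n"]

print(where_calls, mask_calls)
```

Both approaches produce the same values; the np.where version simply pads every row twice and discards half of the results.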

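Speed comparison on 1 million rows of strings with random lengths (randrange(5, 30) draws 5 to 29 characters, before leading zeros are stripped):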

In [1]: import numpy as np, pandas as pd

In [2]: import platform; print(platform.python_version_tuple(), platform.platform(), pd.__version__, np.__version__, sep="\n")
('3', '7', '3')
Darwin-17.7.0-x86_64-i386-64bit
0.24.2
1.16.4

In [3]: !sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

In [4]: from random import choices, randrange

In [5]: def randvalue(chars="0123456789", _c=choices, _r=randrange):
   ...:     return "".join(_c(chars, k=randrange(5, 30))).lstrip("0")
   ...:

In [6]: df = pd.DataFrame(data={"Random": [randvalue() for _ in range(10**6)]})

In [7]: %%timeit
   ...: target = df.copy()
   ...: shorter = target.Random.str.len() < 9
   ...: longer = ~shorter
   ...: target.Random[shorter] = target.Random[shorter].str.zfill(9)
   ...: target.Random[longer] = target.Random[longer].str.zfill(20)
   ...:
   ...:
825 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %%timeit
   ...: target = df.copy()
   ...: target.Random = np.where(target.Random.str.len()<9,target.Random.str.zfill(9),target.Random.str.zfill(20))
   ...:
   ...:
929 ms ± 69.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

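(The target = df.copy() line is needed to make sure each repeated test run is isolated from the previous one.)

Conclusion: on 1 million rows, using np.where() is about 10% slower.

However, using df.Random.apply(), as suggested by jackbicknell14, beats either approach by a huge margin: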

In [9]: def fill_zeros(x, _len=len, _zfill=str.zfill):
   ...:     # len() and str.zfill() are cached as parameters for performance
   ...:     return _zfill(x, 9 if _len(x) < 9 else 20)

In [10]: %%timeit
    ...: target = df.copy()
    ...: target.Random = target.Random.apply(fill_zeros)
    ...:
    ...:
299 ms ± 2.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's about 3 times faster!

Answer 5

df.Random.str.zfill(9).where(df.Random.str.len() < 9, df.Random.str.zfill(20))
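This answer gives only the expression, so a brief illustration may help. Series.where keeps the original values where the condition is True and takes values from the second argument elsewhere. The expression returns a new Series, so it has to be assigned back to the column. A minimal sketch on the question's sample data (the name is_short is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Random": "86 7639103627 96 32 1469476501".split()})

is_short = df["Random"].str.len() < 9

# Keep the 9-wide value where is_short is True; otherwise use the 20-wide one.
df["Random"] = df["Random"].str.zfill(9).where(is_short, df["Random"].str.zfill(20))
print(df)
```

As with np.where(), both padded versions of the whole column are computed before one of them is chosen for each row.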

5 answers

Answer 1
3 2019-09-09 15:30:56

Answer 2
3 2019-09-09 15:37:46

Answer 3
2 2019-09-09 16:38:00

Answer 4
1 accepted 2019-09-09 15:25:41

Answer 5
0 2019-09-09 16:41:10
