Splitting a string into list and converting the items to int

I have a pandas DataFrame with a column, values, like this:

0       16 0
1    7 1 2 0
2          5
3          1
4         18

What I want is to create another column, modified_values, that contains a list of all the different numbers I get after splitting each value. The new column would look like this:

0       [16, 0]
1    [7, 1, 2, 0]
2          [5]
3          [1]
4         [18]

Note that the values in these lists should be int, not str.

Things that I am aware of:

1) I can split the column in a vectorized way like this: df['values'].str.split(" "). This gives me lists, but the objects inside them are strings. I can add another operation on top of that, like df['values'].str.split(" ").apply(func to convert values to int), but that wouldn't be vectorized.

2) I can directly do this: df['modified_values'] = df['values'].apply(func that splits as well as converts to int)

The second one will surely be much slower than the first, but I am wondering whether the same thing can be achieved in a vectorized way.
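For concreteness, the two options listed above could be sketched as follows. The helper name to_ints is hypothetical and stands in for the "func" placeholders in the question; the sample data is taken from the column shown earlier.

```python
import pandas as pd

df = pd.DataFrame({"values": ["16 0", "7 1 2 0", "5", "1", "18"]})

def to_ints(parts):
    """Hypothetical helper: convert a list of digit strings to a list of ints."""
    return [int(p) for p in parts]

# Option 1: split through the .str accessor, then convert each list with apply.
# Bracket access is used because df.values is the DataFrame's NumPy array
# attribute, not the column named "values".
option_1 = df["values"].str.split(" ").apply(to_ints)

# Option 2: split and convert inside a single apply call.
option_2 = df["values"].apply(lambda s: [int(p) for p in s.split(" ")])

print(option_1.tolist())
print(option_2.tolist())
```

Both options loop over the rows in Python; neither is vectorised in the NumPy sense.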

No native "vectorised" solution is possible

I'm highlighting this because it's a common mistake to assume pd.Series.str methods are vectorised. They aren't. They offer convenience and error-handling at the cost of efficiency. For clean data only, e.g. no NaN values, a list comprehension is likely your best option:

import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', '5', '1', '18']})

df['B'] = [list(map(int, i.split())) for i in df['A']]

print(df)

         A             B
0     16 0       [16, 0]
1  7 1 2 0  [7, 1, 2, 0]
2        5           [5]
3        1           [1]
4       18          [18]
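The list comprehension above assumes clean data. If the column may contain missing values, one possible guard is sketched below. The helper split_to_ints and the choice of returning an empty list for a missing value are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"A": ["16 0", None, "5"]})

def split_to_ints(value):
    """Hypothetical helper: list of ints, or an empty list for a missing value."""
    if pd.isna(value):
        return []
    return [int(part) for part in value.split()]

df["B"] = [split_to_ints(v) for v in df["A"]]
print(df)
```

Whether an empty list, None, or NaN is the right placeholder depends on how the column is used afterwards.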

Performance benchmarking

To illustrate the performance issues with pd.Series.str, you can see on a larger dataframe that the more operations you pass to Pandas, the more performance deteriorates:

df = pd.concat([df]*10000)

%timeit [list(map(int, i.split())) for i in df['A']]            # 55.6 ms
%timeit [list(map(int, i)) for i in df['A'].str.split()]        # 80.2 ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x)))  # 93.6 ms
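The %timeit lines above need IPython. A rough equivalent with the standard timeit module might look like the sketch below; the dictionary keys are only labels chosen here, and the timings will differ by machine and pandas version, so none are quoted.

```python
import timeit

import pandas as pd

base = pd.DataFrame({"A": ["16 0", "7 1 2 0", "5", "1", "18"]})
df = pd.concat([base] * 10000, ignore_index=True)

# The three variants from the benchmark, wrapped as zero-argument callables.
candidates = {
    "list comprehension": lambda: [list(map(int, i.split())) for i in df["A"]],
    "str.split + comprehension": lambda: [list(map(int, i)) for i in df["A"].str.split()],
    "str.split + apply": lambda: df["A"].str.split().apply(lambda x: list(map(int, x))),
}

for name, func in candidates.items():
    # Best of three single runs, reported in milliseconds.
    seconds = min(timeit.repeat(func, number=1, repeat=3))
    print(f"{name}: {seconds * 1000:.1f} ms")
```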

list as elements in pd.Series is also anti-Pandas

As described here, holding lists in series gives 2 layers of pointers and is not recommended:

Don't do this. Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.

The main reason holding lists in series is not recommended is that you lose the vectorised functionality that comes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
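A small sketch of the point about object dtype follows. The "long" layout built with explode is one possible alternative, shown here as an assumption about what might suit the use case rather than something the original answer proposes.

```python
import pandas as pd

df = pd.DataFrame({"A": ["16 0", "7 1 2 0", "5", "1", "18"]})

# A series of lists is stored as object dtype: an array of pointers to lists.
as_lists = pd.Series([list(map(int, i.split())) for i in df["A"]])
print(as_lists.dtype)  # object

# Alternative layout: one integer per row, with the original index repeated.
long_form = df["A"].str.split().explode().astype("int64")
print(long_form.dtype)  # int64

# Numeric methods work directly on the long layout, e.g. a per-row sum.
print(long_form.groupby(level=0).sum().tolist())  # [16, 10, 5, 1, 18]
```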

See also What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.

The double for comprehension takes about a third less time than the map comprehension from jpp's answer (note that it produces a single flat list rather than one list per row). The Numba trick is about 250 times faster than the map comprehension from jpp's answer, but you get a pandas DataFrame with floats and NaNs rather than a series of lists. Numba is included in Anaconda.

Benchmarks:

%timeit pd.DataFrame(nb_calc(df.A))            # numba trick       0.144 ms
%timeit [int(x) for i in df['A'] for x in i.split()]            # 23.6   ms
%timeit [list(map(int, i.split())) for i in df['A']]            # 35.6   ms
%timeit [list(map(int, i)) for i in df['A'].str.split()]        # 50.9   ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x)))  # 56.6   ms
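The double for comprehension and the map comprehension in the benchmark do not return the same shape of result, which the following sketch makes visible on the five-row sample:

```python
import pandas as pd

df = pd.DataFrame({"A": ["16 0", "7 1 2 0", "5", "1", "18"]})

# Double for comprehension: every number ends up in one flat list.
flat = [int(x) for i in df["A"] for x in i.split()]

# map comprehension: one list of ints per row.
nested = [list(map(int, i.split())) for i in df["A"]]

print(flat)    # [16, 0, 7, 1, 2, 0, 5, 1, 18]
print(nested)  # [[16, 0], [7, 1, 2, 0], [5], [1], [18]]
```

The flat list cannot be assigned back as a column because its length no longer matches the number of rows.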

Code for the Numba function:

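import numba
import numpy as np

@numba.jit(nopython=True, nogil=True)
def str2int_nb(nb_a):
    # nb_a: 2-D int8 array of character codes, one row per string, zero-padded
    n1 = nb_a.shape[0]
    n2 = nb_a.shape[1]
    res = np.empty(nb_a.shape)
    res[:] = np.nan
    j_res_max = 0
    for i in range(n1):
        j_res = 0  # column of the number currently being parsed in row i
        s = 0      # value accumulated so far for that number
        for j in range(n2):
            x = nb_a[i,j]
            if x == 32:    # space: store the finished number, start the next
                res[i,j_res]=np.float64(s)
                s=0
                j_res+=1
            elif x == 0:   # zero padding: end of this string
                break
            else:          # digit: the character '0' has code 48
                s=s*10+x-48
        res[i,j_res]=np.float64(s)  # store the last number of the row
        if j_res>j_res_max:
            j_res_max = j_res

    # drop the columns that no row has filled
    return res[:,:j_res_max+1]

def nb_calc(s):
    # fixed-width unicode array, viewed as one uint32 code point per character
    a_temp = s.values.astype("U")
    nb_a = a_temp.view("uint32").reshape(len(s),-1).astype(np.int8)
    return str2int_nb(nb_a)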

Numba does not support strings, so I first convert to an array of int8 and only then work with it. The conversion to int8 actually takes 3/4 of the execution time.
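To make the conversion step easier to follow, here is a sketch of it using NumPy only on three of the sample strings. It assumes plain ASCII digits and spaces, so every code point fits in int8.

```python
import numpy as np

strings = np.array(["16 0", "7 1 2 0", "5"])

# Fixed-width unicode array: shorter strings are padded with NUL characters
# up to the length of the longest one (7 characters here).
a_temp = strings.astype("U")

# Each character occupies 4 bytes, so a uint32 view yields one code point per
# character; reshape gives one row per string.
codes = a_temp.view("uint32").reshape(len(strings), -1).astype(np.int8)

print(codes.tolist())
# [[49, 54, 32, 48, 0, 0, 0], [55, 32, 49, 32, 50, 32, 48], [53, 0, 0, 0, 0, 0, 0]]
```

Code 32 is the space, 48 to 57 are the digits '0' to '9', and 0 is the padding, which are the three cases the Numba loop distinguishes.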

The output of my numba function looks like this:

      0    1    2    3
-----------------------
0  16.0  0.0  NaN  NaN
1   7.0  1.0  2.0  0.0
2   5.0  NaN  NaN  NaN
3   1.0  NaN  NaN  NaN
4  18.0  NaN  NaN  NaN
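If a list of ints per row is still needed afterwards, the NaN-padded matrix can be converted back, at the cost of a Python-level loop. The sketch below rebuilds the matrix shown above by hand instead of calling the Numba function, so it runs without Numba installed.

```python
import numpy as np

nan = np.nan
# Hand-written copy of the matrix shown above.
res = np.array([
    [16.0, 0.0, nan, nan],
    [7.0, 1.0, 2.0, 0.0],
    [5.0, nan, nan, nan],
    [1.0, nan, nan, nan],
    [18.0, nan, nan, nan],
])

# Drop the NaN padding in each row and cast what remains back to int.
lists = [row[~np.isnan(row)].astype(int).tolist() for row in res]
print(lists)  # [[16, 0], [7, 1, 2, 0], [5], [1], [18]]
```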
