Splitting a string into list and converting the items to int
I have a pandas dataframe where I have a column values like this:
0 16 0
1 7 1 2 0
2 5
3 1
4 18
What I want is to create another column, modified_values, that contains a list of all the different numbers that I will get after splitting each value. The new column will be like this:
0 [16, 0]
1 [7, 1, 2, 0]
2 [5]
3 [1]
4 [18]
Beware: the values in this list should be int and not strings.
Things that I am aware of:
1) I can split the column like this: df['values'].str.split(" "). This will give me the list, but the objects inside the list will be strings. I can add another operation on top of that, like df['values'].str.split(" ").apply(func to convert values to int), but that wouldn't be vectorized.
2) I can directly do this: df['modified_values'] = df['values'].apply(func that splits as well as converts to int).
The second one will surely be much slower than the first, but I am wondering if the same thing can be achieved in a vectorized way.
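To make option 2 concrete, here is a minimal sketch (the lambda stands in for the "func that splits as well as converts to int" mentioned above):

```python
import pandas as pd

df = pd.DataFrame({'values': ['16 0', '7 1 2 0', '5', '1', '18']})

# Option 2 spelled out: one apply that both splits and converts to int
df['modified_values'] = df['values'].apply(lambda s: [int(x) for x in s.split()])
print(df['modified_values'].tolist())
# [[16, 0], [7, 1, 2, 0], [5], [1], [18]]
```

This gives the desired column of int lists in a single pass, at the cost of a Python-level loop under the hood.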
I'm highlighting this because it's a common mistake to assume pd.Series.str methods are vectorised. They aren't. They offer convenience and error-handling at the cost of efficiency. For clean data only, e.g. no NaN values, a list comprehension is likely your best option:
import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', '5', '1', '18']})
df['B'] = [list(map(int, i.split())) for i in df['A']]
print(df)
A B
0 16 0 [16, 0]
1 7 1 2 0 [7, 1, 2, 0]
2 5 [5]
3 1 [1]
4 18 [18]
To illustrate the performance issues with pd.Series.str, you can see for larger dataframes that the more operations you pass to Pandas, the more performance deteriorates:
df = pd.concat([df]*10000)
%timeit [list(map(int, i.split())) for i in df['A']] # 55.6 ms
%timeit [list(map(int, i)) for i in df['A'].str.split()] # 80.2 ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x))) # 93.6 ms
Storing list as elements in pd.Series is also anti-Pandas. As described here, holding lists in series gives 2 layers of pointers and is not recommended:
Don't do this. Pandas was never designed to hold lists in series / columns. You can concoct expensive workarounds, but these are not recommended.
The main reason holding lists in series is not recommended is that you lose the vectorised functionality which goes with using NumPy arrays held in contiguous memory blocks. Your series will be of object dtype, which represents a sequence of pointers, much like list. You will lose benefits in terms of memory and performance, as well as access to optimized Pandas methods.
See also: What are the advantages of NumPy over regular Python lists? The arguments in favour of Pandas are the same as for NumPy.
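A quick check makes the object dtype visible (a small sketch, reusing the toy frame from above):

```python
import pandas as pd

df = pd.DataFrame({'A': ['16 0', '7 1 2 0', '5', '1', '18']})
df['B'] = [list(map(int, i.split())) for i in df['A']]

# The column of lists is stored as pointers to Python objects,
# not as a contiguous numeric NumPy block
print(df['B'].dtype)            # object
print(df['A'].str.len().dtype)  # int64 — a genuinely numeric result, by contrast
```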
The double for comprehension is 33% faster than the map comprehension from jpp's answer. The Numba trick is 250 times faster than the map comprehension from jpp's answer, but you get a pandas DataFrame with floats and nan's and not a series of lists. Numba is included in Anaconda.
Benchmarks:
%timeit pd.DataFrame(nb_calc(df.A)) # numba trick 0.144 ms
%timeit [int(x) for i in df['A'] for x in i.split()] # 23.6 ms
%timeit [list(map(int, i.split())) for i in df['A']] # 35.6 ms
%timeit [list(map(int, i)) for i in df['A'].str.split()] # 50.9 ms
%timeit df['A'].str.split().apply(lambda x: list(map(int, x))) # 56.6 ms
Code for the Numba function:
import numba
import numpy as np

@numba.jit(nopython=True, nogil=True)
def str2int_nb(nb_a):
    n1 = nb_a.shape[0]
    n2 = nb_a.shape[1]
    res = np.empty(nb_a.shape)
    res[:] = np.nan
    j_res_max = 0
    for i in range(n1):
        j_res = 0
        s = 0
        for j in range(n2):
            x = nb_a[i, j]
            if x == 32:            # space: finish the current number
                res[i, j_res] = np.float64(s)
                s = 0
                j_res += 1
            elif x == 0:           # zero padding: end of this string
                break
            else:                  # ASCII digit: accumulate
                s = s * 10 + x - 48
        res[i, j_res] = np.float64(s)
        if j_res > j_res_max:
            j_res_max = j_res
    return res[:, :j_res_max + 1]

def nb_calc(s):
    # Numba cannot consume the strings directly, so view the fixed-width
    # UTF-32 buffer as code points first, then narrow to int8
    a_temp = s.values.astype("U")
    nb_a = a_temp.view("uint32").reshape(len(s), -1).astype(np.int8)
    return str2int_nb(nb_a)
Numba does not support strings, so I first convert to an array of int8 and only then work with it. The conversion to int8 actually takes 3/4 of the execution time.
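The int8 conversion step can be seen in isolation (a small sketch of the same view trick used in nb_calc above; the exact values assume a little-endian machine and ASCII input):

```python
import numpy as np

a = np.array(['16 0', '5'])  # fixed-width '<U4' strings
codes = a.view('uint32').reshape(len(a), -1).astype(np.int8)

# Each row holds the string's code points, zero-padded to the common width:
# '1'=49, '6'=54, ' '=32, '0'=48; '5'=53 followed by padding zeros
print(codes)
```

The narrowing to int8 is safe here because all ASCII code points fit below 128.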
The output of my numba function looks like this:
0 1 2 3
-----------------------
0 16.0 0.0 NaN NaN
1 7.0 1.0 2.0 0.0
2 5.0 NaN NaN NaN
3 1.0 NaN NaN NaN
4 18.0 NaN NaN NaN
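If a series of int lists is ultimately needed, one possible post-processing step (my own suggestion, not benchmarked in the answer) is to drop the NaN padding row by row; note this apply is itself a Python-level loop, so it eats into the Numba speedup:

```python
import numpy as np
import pandas as pd

# Same shape as the numba output above: floats, NaN-padded to the widest row
wide = pd.DataFrame([[16.0, 0.0, np.nan, np.nan],
                     [7.0, 1.0, 2.0, 0.0],
                     [5.0, np.nan, np.nan, np.nan]])

# Rebuild a series of int lists by discarding the NaN padding per row
lists = wide.apply(lambda r: [int(v) for v in r.dropna()], axis=1)
print(lists.tolist())  # [[16, 0], [7, 1, 2, 0], [5]]
```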