Python pandas dataframe: in an array column, if the first item contains a specific string, remove that item from the array
I have a dataframe with a column like the one below, which contains arrays of different sizes:
column
["a_id","b","c","d"]
["d_ID","e","f"]
["h","i","j","k","l"]
["id_m","n","o","p"]
["ID_q","r","s"]
I want to remove the first item from the array in every row if that first item contains "ID" or "id". So the expected output looks like this:
column
["b","c","d"]
["e","f"]
["h","i","j","k","l"]
["n","o","p"]
["r","s"]
How do we check for this in the column containing array elements in the dataframe?
Edit: It seems I misread your question. This solution is meant to remove any element that has 'id' in it, not just the first.
Option 1
I believe the most straightforward solution is to use apply:
df
col
0 [a_id, b, c, d]
1 [d_ID, e, f]
2 [h, i, j, k, l]
3 [id_m, n, o, p]
4 [ID_q, r, s]
df.col = df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))
df
col
0 [b, c, d]
1 [e, f]
2 [h, i, j, k, l]
3 [n, o, p]
4 [r, s]
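For readers who want to run Option 1 end to end, here is a minimal self-contained sketch. The helper name drop_leading_id is hypothetical (not from the original answer), and the empty-list guard is an added assumption: the one-line lambda above would raise IndexError on y[0] if a row held an empty list.

```python
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    "col": [
        ["a_id", "b", "c", "d"],
        ["d_ID", "e", "f"],
        ["h", "i", "j", "k", "l"],
        ["id_m", "n", "o", "p"],
        ["ID_q", "r", "s"],
    ]
})

def drop_leading_id(items):
    # Hypothetical helper: drop the first item only when it contains
    # "id" in any letter case; empty lists are returned unchanged.
    if items and "id" in items[0].lower():
        return items[1:]
    return items

df["col"] = df["col"].apply(drop_leading_id)
result = df["col"].tolist()
print(df)
```

Using a named function instead of a lambda makes the guard easier to read and to test on its own.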
Option 2
Alternatively, use a list comprehension:
df.col = [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]
df
col
0 [b, c, d]
1 [e, f]
2 [h, i, j, k, l]
3 [n, o, p]
4 [r, s]
Timings
df = pd.concat([df] * 100000)
%%timeit
m = df['col'].str[0].str.contains('ID', case=False)
df['col'].mask(m, df['col'].str[1:])
1 loop, best of 3: 917 ms per loop
%timeit [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]
1 loop, best of 3: 272 ms per loop
%timeit df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))
1 loop, best of 3: 309 ms per loop
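The %timeit lines above only work inside IPython or Jupyter. As a rough sketch, the same comparison can be reproduced in a plain script with the standard-library timeit module. The function names, the smaller three-row base frame, and the repeat counts below are assumptions chosen for illustration, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

base = pd.DataFrame({"col": [["a_id", "b"], ["h", "i"], ["ID_q", "r", "s"]]})
# ignore_index=True gives the enlarged frame a clean, unique index
big = pd.concat([base] * 1000, ignore_index=True)

def with_listcomp(frame):
    return [(y[1:] if "id" in y[0].lower() else y) for y in frame["col"]]

def with_apply(frame):
    return frame["col"].apply(lambda y: y[1:] if "id" in y[0].lower() else y).tolist()

def with_mask(frame):
    m = frame["col"].str[0].str.contains("ID", case=False)
    return frame["col"].mask(m, frame["col"].str[1:]).tolist()

# Sanity check: all three approaches should agree before being timed
assert with_listcomp(big) == with_apply(big) == with_mask(big)

for fn in (with_listcomp, with_apply, with_mask):
    seconds = timeit.timeit(lambda: fn(big), number=3)
    print(f"{fn.__name__}: {seconds / 3:.4f} s per run")
```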
Use str[0] to select the first value in each list, then check it for ID with contains:
m = df['column'].str[0].str.contains('ID', case=False)
print (m)
0 True
1 True
2 False
3 True
4 True
Name: column, dtype: bool
Then remove it using mask with str[1:]:
df['column'] = df['column'].mask(m, df['column'].str[1:])
print (df)
column
0 [b, c, d]
1 [e, f]
2 [h, i, j, k, l]
3 [n, o, p]
4 [r, s]
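Put together, the two steps can be sketched as a single runnable script. One detail is an assumption beyond the original answer: contains treats its pattern as a regular expression by default, so regex=False is passed here to match "id" as a literal substring, which matters if the search string ever contains regex metacharacters.

```python
import pandas as pd

df = pd.DataFrame({
    "column": [
        ["a_id", "b", "c", "d"],
        ["d_ID", "e", "f"],
        ["h", "i", "j", "k", "l"],
        ["id_m", "n", "o", "p"],
        ["ID_q", "r", "s"],
    ]
})

# .str[0] indexes into each list and returns its first element
first = df["column"].str[0]

# Case-insensitive literal match (regex=False is an added assumption)
m = first.str.contains("id", case=False, regex=False)

# Where the mask is True, take the list without its first item
df["column"] = df["column"].mask(m, df["column"].str[1:])
result = df["column"].tolist()
print(df)
```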