Python pandas dataframe: In an array column, if the first item contains a specific string, remove that item from the array

I have a dataframe with a column like the one below, which contains arrays of different sizes:

column
["a_id","b","c","d"]
["d_ID","e","f"]
["h","i","j","k","l"]
["id_m","n","o","p"]
["ID_q","r","s"]

I want to remove the first item from the array in every row if that item contains "ID" or "id". The expected output looks like this:

column
["b","c","d"]
["e","f"]
["h","i","j","k","l"]
["n","o","p"]
["r","s"]

How can we check for this in a dataframe column that contains arrays?

Edit: It seems I misread your question at first. Note that the code below only ever inspects the first element (y[0]): it drops that element when it contains 'id' in any letter case, and it does not remove matching elements further along the list.

Option 1
I believe the most straightforward solution is to use apply:

df

               col
0  [a_id, b, c, d]
1     [d_ID, e, f]
2  [h, i, j, k, l]
3  [id_m, n, o, p]
4     [ID_q, r, s]


df.col = df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))

df
               col
0        [b, c, d]
1           [e, f]
2  [h, i, j, k, l]
3        [n, o, p]
4           [r, s]

Option 2
Alternatively, use a list comprehension:

df.col = [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]

df

               col
0        [b, c, d]
1           [e, f]
2  [h, i, j, k, l]
3        [n, o, p]
4           [r, s]
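
Both options above index y[0] directly, so they would raise an IndexError on an empty list and fail on a missing value. As a minimal sketch, assuming such rows should simply be passed through unchanged, the check can be pulled into a small helper. The name drop_leading_id is made up for illustration and is not part of pandas.

```python
def drop_leading_id(items):
    """Return items without its first element when that element contains 'id'.

    The comparison is case-insensitive, like the .lower() check used above.
    Anything that is not a non-empty list starting with a string is returned
    unchanged (an assumption about how edge cases should be handled).
    """
    if isinstance(items, list) and items and isinstance(items[0], str):
        if 'id' in items[0].lower():
            return items[1:]
    return items


rows = [
    ["a_id", "b", "c", "d"],
    ["d_ID", "e", "f"],
    ["h", "i", "j", "k", "l"],
    ["id_m", "n", "o", "p"],
    ["ID_q", "r", "s"],
    [],  # empty list: left untouched instead of raising IndexError
]
cleaned = [drop_leading_id(r) for r in rows]
print(cleaned)
```

With the dataframe from this answer, the same helper could be used as df.col = df.col.apply(drop_leading_id).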

Timings

df = pd.concat([df] * 100000)

%%timeit
m = df['col'].str[0].str.contains('ID', case=False)
df['col'].mask(m, df['col'].str[1:])
1 loop, best of 3: 917 ms per loop

%timeit [(y[1:] if 'id' in y[0].lower() else y) for y in df.col]
1 loop, best of 3: 272 ms per loop

%timeit df.col.apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))
1 loop, best of 3: 309 ms per loop
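
The %timeit and %%timeit magics only work inside IPython or Jupyter. As a rough sketch of how the comparison might be repeated in a plain script, the standard timeit module can be used instead. The function names, the smaller repeat factor of 1,000 and ignore_index=True are choices made for this example, and the numbers it prints will differ from machine to machine.

```python
import timeit

import pandas as pd

base = pd.DataFrame({'col': [
    ['a_id', 'b', 'c', 'd'],
    ['d_ID', 'e', 'f'],
    ['h', 'i', 'j', 'k', 'l'],
    ['id_m', 'n', 'o', 'p'],
    ['ID_q', 'r', 's'],
]})
# 1,000 copies keeps the run short; the timings above used 100,000.
big = pd.concat([base] * 1000, ignore_index=True)


def with_mask(df):
    # Vectorised string accessor plus mask.
    m = df['col'].str[0].str.contains('ID', case=False)
    return df['col'].mask(m, df['col'].str[1:])


def with_listcomp(df):
    # Plain Python list comprehension over the column.
    return [(y[1:] if 'id' in y[0].lower() else y) for y in df['col']]


def with_apply(df):
    # Series.apply with the same per-row rule.
    return df['col'].apply(lambda y: (y[1:] if 'id' in y[0].lower() else y))


for func in (with_mask, with_listcomp, with_apply):
    # Average wall-clock time over three calls.
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f'{func.__name__}: {seconds * 1000:.1f} ms per call')
```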

Use str[0] to select the first value of each list, then check it for "ID" with str.contains (case=False makes the match case-insensitive):

m = df['column'].str[0].str.contains('ID', case=False)
print (m)
0     True
1     True
2    False
3     True
4     True
Name: column, dtype: bool

Then remove it by using mask with str[1:]:

df['column'] = df['column'].mask(m, df['column'].str[1:])
print (df)
            column
0        [b, c, d]
1           [e, f]
2  [h, i, j, k, l]
3        [n, o, p]
4           [r, s]
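
For convenience, the two steps above can be put together into one script that also builds the sample data. This is a sketch rather than the answer's exact code: the intermediate name first and the na=False argument are additions, the latter so that a row without a first element counts as "no match".

```python
import pandas as pd

df = pd.DataFrame({'column': [
    ['a_id', 'b', 'c', 'd'],
    ['d_ID', 'e', 'f'],
    ['h', 'i', 'j', 'k', 'l'],
    ['id_m', 'n', 'o', 'p'],
    ['ID_q', 'r', 's'],
]})

# First element of every list; .str indexing also works on list values.
first = df['column'].str[0]

# Boolean mask: True where the first element contains "id" in any case.
# na=False is an added safeguard for rows with no first element.
m = first.str.contains('ID', case=False, na=False)

# Where m is True, take the list without its first element.
df['column'] = df['column'].mask(m, df['column'].str[1:])
print(df)
```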

