
Removing rows from a DataFrame when a list-valued column contains a specific string

Suppose I have a DataFrame df2 with a column called 'elements', where each row holds a list of objects, as shown below:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element B, Element Mo, Element Y]
3       [Element Al, Element B, Element Lu]
4       [Element B, Element Dy, Element Os]

I would like to search through the column and, if a row contains (for example) Element Mo, delete that whole row so that the result looks like this:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element Al, Element B, Element Lu]
3       [Element B, Element Dy, Element Os]

I'm currently trying to do it with a for loop and if statements like this:

for entry in df2['elements']:
    if 'Element Mo' in entry:
        df2.drop(index=[entry],axis=0, inplace=True)
    else:
        continue

But it does not work and raises KeyError: [] not found in axis.

Update:

I just realized that the if/in approach I showed does not only find exact matches; it also matches strings that merely contain the target string. For example, with the updated df below:

print(df2['elements'])

0       [Element B, Element Cr, Element Re]
1       [Element B, Element Rh, Element Sc]
2       [Element B, Element Mo, Element Y]
3       [Element Al, Element B, Element Lu]
4       [Element Mop, Element B, Element Lu]      
5       [Element B, Element Dy, Element Os]

If I run a for loop with if/in statements like this:

for ind in df2.index.values:
    entry = df2.loc[ind, 'elements']
    if 'Element Mo' in entry:
        df2.drop(index=ind ,axis=0, inplace=True)

Both rows 2 and 4 will be dropped from the df because the string 'Element Mop' contains the string 'Element Mo', but I don't want this to happen. I tried updating the code above with a regex like the one below, but it doesn't work.

for ind in df2.index.values:
    entry = df2.loc[ind, 'elements']
    if '\bElement Mo\b' in entry:
        df2.drop(index=ind ,axis=0, inplace=True)

Edit #2: Here is the dictionary of the first 25 items of the column:

df2_dict = df2['elements'].head(25).to_dict()

{0: '[Element B, Element Cr, Element Re]', 1: '[Element B, Element Rh, Element Sc]', 2: '[Element B, Element Mo, Element Y]', 3: '[Element Al, Element B, Element Lu]', 4: '[Element B, Element Dy, Element Os]', 5: '[Element B, Element Fe, Element Sc]', 6: '[Element B, Element Cr, Element W]', 7: '[Element B, Element Ni]', 9: '[Element B, Element Pr, Element Re]', 10: '[Element B, Element Cr, Element V]', 11: '[Element B, Element Co, Element Si]', 12: '[Element B, Element Co, Element Yb]', 13: '[Element B, Element Lu, Element Yb]', 14: '[Element B, Element Ru, Element Yb]', 15: '[Element B, Element Mn, Element Pd]', 16: '[Element B, Element Co, Element Tm]', 17: '[Element B, Element Fe, Element W]', 19: '[Element B, Element Ru, Element Y]', 20: '[Element B, Element Ga, Element Ta]', 21: '[Element B, Element Ho, Element Re]', 22: '[Element B, Element Si]', 23: '[Element B, Element Ni, Element Te]', 24: '[Element B, Element Nd, Element S]', 25: '[Element B, Element Ga, Element Rh, Element Sc]', 26: '[Element B, Element Co, Element La]'}

The actual issue here is that if I try to drop rows that contain the string 'Element S' (in row 24), all entries with elements like 'Element Sc' or 'Element Si' are also removed.

Here is one way to do it (using a DataFrame df whose lists are in a column col1):

string='Element Mo'

df[df['col1'].apply(lambda x: string not in x)]
col1
0   [Element B, Element Cr, Element Re]
1   [Element B, Element Rh, Element Sc]
3   [Element Al, Element B, Element Lu]
4   [Element B, Element Dy, Element Os]
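
To see how this behaves on data like the question's, here is a minimal self-contained sketch. It assumes each cell holds a real Python list of strings (the sample frame is built by hand for illustration). In that case `in` tests list membership, so 'Element Mop' does not count as a match for 'Element Mo'.

```python
import pandas as pd

# Hand-built sample; assumes each cell is a real list of strings.
df = pd.DataFrame({
    "col1": [
        ["Element B", "Element Cr", "Element Re"],
        ["Element B", "Element Mo", "Element Y"],
        ["Element Mop", "Element B", "Element Lu"],
    ]
})

string = "Element Mo"

# Keep rows whose list does not contain the exact item.
kept = df[df["col1"].apply(lambda x: string not in x)]

# Optional: renumber the index 0..n-1, as in the desired output.
kept = kept.reset_index(drop=True)
print(kept)
```

If rows such as the 'Element Mop' one are still being dropped, the cells are probably strings rather than lists, and `in` is then doing a substring test.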

A pandas Series is sort of like a dictionary, where the keys are the index and the values are the series values.

In your loop, entry is a value from the Series (a list), not an index label, so drop cannot find it in the index. You could instead loop over the index and use each label to look up the value, e.g.:

for ind in df2.index.values:
    entry = df2.loc[ind, "elements"]
    if 'Element Mo' in entry:
        df2.drop(index=ind, axis=0, inplace=True)

However, a vectorized solution would be far better. That isn't really possible with a Series of lists (storing lists in cells works against the pandas data model), but you can at least build a boolean mask once and subset the frame in a single step instead of dropping rows one at a time. For example:

in_values = df2["elements"].apply(lambda x: "Element Mo" in x)
dropped = df2.loc[~in_values]
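
The dictionary in Edit #2 suggests the cells may actually be strings such as '[Element B, Element Mo, Element Y]' rather than lists. If so, `in` does a substring test, which would explain why 'Element S' also matches 'Element Sc'. Note too that `in` never interprets a regex, and '\b' in a non-raw string is a backspace character, so the regex attempt in the question cannot match. Below is a sketch of two possible workarounds under that assumption: split each string into a list before testing, or use a word-boundary regex with `Series.str.contains`. The helper name `to_list` and the sample rows are hypothetical.

```python
import re
import pandas as pd

# Assumption: cells are strings, as the Edit #2 dictionary suggests.
df2 = pd.DataFrame({
    "elements": [
        "[Element B, Element Nd, Element S]",
        "[Element B, Element Ga, Element Rh, Element Sc]",
        "[Element B, Element Si]",
    ]
})
target = "Element S"

def to_list(cell):
    # Hypothetical helper: strip the brackets and split on commas.
    return [item.strip() for item in cell.strip("[]").split(",")]

# Option 1: parse to lists, then test exact membership.
in_values = df2["elements"].apply(lambda x: target in to_list(x))
dropped = df2.loc[~in_values]

# Option 2: regex with word boundaries on the raw strings.
pattern = r"\b" + re.escape(target) + r"\b"
in_values_re = df2["elements"].str.contains(pattern)
dropped_re = df2.loc[~in_values_re]

print(dropped)
```

Both options keep the 'Element Sc' and 'Element Si' rows and drop only the row that contains 'Element S' itself. Parsing once and storing real lists is likely the cleaner choice if the column is searched repeatedly.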

Here's an alternative option using apply (below, any row that contains 2 is removed):

import pandas as pd

df = pd.DataFrame(
    [[-1,0,1],
     [1,2,3],
     [4,5,2],
     [6,7,8]])

# True for rows that do not contain the value 2
ix = ~df.apply(lambda x: 2 in x.values, axis=1)
df[ix]

returns:

   0  1  2
0 -1  0  1
3  6  7  8
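
For a frame of scalar cells like this one, a possible alternative that avoids the row-wise apply is `DataFrame.isin` combined with `any(axis=1)`. This is only a sketch and is not part of the answer above; it does not carry over to list-valued cells.

```python
import pandas as pd

df = pd.DataFrame(
    [[-1, 0, 1],
     [1, 2, 3],
     [4, 5, 2],
     [6, 7, 8]])

# True for every row in which at least one cell equals 2.
has_two = df.isin([2]).any(axis=1)

# Keep only the rows without a 2.
result = df[~has_two]
print(result)
```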

