
Slice pandas dataframe using .loc with both index values and multiple column values, then set values

I have a dataframe, and I would like to select a subset of the dataframe using both index and column values. I can do both of these separately, but cannot figure out the syntax to do them simultaneously. Example:

import pandas as pd

# sample dataframe:
cid=[1,2,3,4,5,6,17,18,91,104]
c1=[1,2,3,1,2,3,3,4,1,3]
c2=[0,0,0,0,1,1,1,1,0,1]

df=pd.DataFrame(list(zip(c1,c2)),columns=['col1','col2'],index=cid)
df

Returns:

    col1    col2
1   1   0
2   2   0
3   3   0
4   1   0
5   2   1
6   3   1
17  3   1
18  4   1
91  1   0
104 3   1

Using .loc, I can select by index:

rel_index=[5,6,17]
relc1=[2,3]
relc2=[1]
df.loc[rel_index]

Returns:

    col1    col2
5   2   1
6   3   1
17  3   1

Or I can select by column values:

df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]

Returning:

    col1    col2
5   2   1
6   3   1
17  3   1
104 3   1

However, I cannot do both. When I try the following:

df.loc[rel_index,df['col1'].isin(relc1) & df['col2'].isin(relc2)]

Returns:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

I have tried a few other variations (such as "&" instead of the ","), but these return the same or other errors.

Once I collect this slice, I am hoping to reassign values on the main dataframe. I imagine this will be trivial once the above is done, but I note it here in case it is not. My goal is to assign something like df2 in the following:

c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)

to the slice referenced by index and multiple column conditions (overwriting what was in the original dataframe).

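The reason for the IndexingError is that df.loc[rows, columns] treats whatever follows the comma as a column selector. Your boolean Series is indexed by the row labels, so pandas cannot align it with the column labels col1 and col2.

The two pieces also differ in size: rel_index has a length of 3, whereas df['col1'].isin(relc1) has a length of 10, one entry per row.

To combine them, the index condition also needs to be a boolean mask with a length of 10. If you look at the output of df['col1'].isin(relc1), it is a boolean Series with one True/False value per row.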
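To make this concrete, here is a minimal sketch that inspects the mask and shows that the slot after the comma selects columns. The names value_mask and subset are mine, not from the question, and the try/except only reports whichever exception your pandas version raises for the failing call.

```python
import pandas as pd

# Same sample data as in the question
cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
df = pd.DataFrame(
    {"col1": [1, 2, 3, 1, 2, 3, 3, 4, 1, 3],
     "col2": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]},
    index=cid,
)
rel_index = [5, 6, 17]
relc1 = [2, 3]

# The value condition yields one boolean per row, labelled by the row index
value_mask = df["col1"].isin(relc1)
print(len(rel_index), len(value_mask))
print(value_mask.index.equals(df.index))

# The slot after the comma selects columns by label
subset = df.loc[rel_index, ["col1"]]
print(subset)

# Passing the row-labelled mask in the column slot fails to align
try:
    df.loc[rel_index, value_mask]
except Exception as exc:
    print(type(exc).__name__)
```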

You can get a boolean array of the proper length by using df.index.isin([5,6,17]) in place of df.loc[rel_index],

so you end up with:

df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)]

which returns:

    col1  col2
5      2     1
6      3     1
17     3     1
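The same selection can be written with named intermediate masks, which may be easier to debug. This is only a sketch; index_mask, value_mask, combined and subset are names I made up for illustration.

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
df = pd.DataFrame(
    {"col1": [1, 2, 3, 1, 2, 3, 3, 4, 1, 3],
     "col2": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]},
    index=cid,
)
rel_index = [5, 6, 17]
relc1 = [2, 3]
relc2 = [1]

# Boolean array with one entry per row: is the row label in rel_index?
index_mask = df.index.isin(rel_index)

# Boolean Series with one entry per row: do both column conditions hold?
value_mask = df["col1"].isin(relc1) & df["col2"].isin(relc2)

# Both masks have the same length, so they combine element-wise
combined = value_mask & index_mask
subset = df.loc[combined]
print(subset)
```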

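That said, I'm not sure why your index would ever look like this. Typically the index follows the default 0, 1, 2, ... format, and you would slice by position with df.iloc. Keep in mind that df.iloc selects by position rather than by label, so it would not find labels such as 17 or 104.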

Alternatively, you could first filter by value, then assign the resulting dataframe to a new variable df2:

df2 = df.loc[df['col1'].isin(relc1) & df['col2'].isin(relc2)]

Then df2.loc[rel_index] would work without issue, as long as every label in rel_index is still present in df2.
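One caveat with the two-step route: .loc with a list of labels raises a KeyError if any label is missing from the filtered frame. A minimal sketch of one way around that, assuming a hypothetical list wanted that contains a label (18) which does not pass the value filter; filtered, wanted and present are illustrative names only:

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
df = pd.DataFrame(
    {"col1": [1, 2, 3, 1, 2, 3, 3, 4, 1, 3],
     "col2": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]},
    index=cid,
)
relc1 = [2, 3]
relc2 = [1]

# Step 1: filter by column values only
filtered = df.loc[df["col1"].isin(relc1) & df["col2"].isin(relc2)]

# Step 2: keep only the requested labels that survived step 1
wanted = [5, 6, 17, 18]  # hypothetical: row 18 has col1 == 4, so it was filtered out
present = filtered.index.intersection(wanted)
subset = filtered.loc[present]
print(subset)
```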

As for your overall goal, you can simply do the following:

c3=[1,2,3]
c4=[5,6,7]
df2=pd.DataFrame(list(zip(c3,c4)),columns=['col1','col2'],index=rel_index)

df.loc[df.index.isin([5,6,17]) & df['col1'].isin(relc1) & df['col2'].isin(relc2)] = df2
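As far as I understand it, assigning a DataFrame through .loc aligns on row and column labels, so a row that the mask selects but df2 lacks would presumably end up as NaN. A slightly more defensive sketch takes the matching labels first and assigns only those; target is a name I introduced for illustration.

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
df = pd.DataFrame(
    {"col1": [1, 2, 3, 1, 2, 3, 3, 4, 1, 3],
     "col2": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]},
    index=cid,
)
rel_index = [5, 6, 17]
relc1 = [2, 3]
relc2 = [1]

# Replacement values, labelled with the same index as the target rows
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [5, 6, 7]}, index=rel_index)

# Rows that satisfy the index condition and both column conditions
mask = df["col1"].isin(relc1) & df["col2"].isin(relc2) & df.index.isin(rel_index)
target = df.loc[mask].index

# Overwrite only those rows, taking the matching rows from df2
df.loc[target] = df2.loc[target]
print(df)
```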

@Rexovas explains it quite well. Here is an alternative, where you compute the filters on the index before assigning. It is a bit long and involves a MultiIndex, but once you get your head around MultiIndex, it should be intuitive:

(df
# move columns into the index
.set_index(['col1', 'col2'], append = True)
# filter based on the index
.loc(axis = 0)[rel_index, relc1, relc2]
# return cols 1 and 2
.reset_index(level = [-2, -1])
# assign values
.assign(col1 = c3, col2 = c4)
)

    col1  col2
5      1     5
6      2     6
17     3     7
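Note that the chain above returns a new dataframe and leaves df itself unchanged, and that .assign with plain lists relies on the filter selecting exactly as many rows as c3 and c4 have entries. A minimal sketch of writing the result back to the original frame, assuming that is the goal; result is a name I introduced:

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 17, 18, 91, 104]
df = pd.DataFrame(
    {"col1": [1, 2, 3, 1, 2, 3, 3, 4, 1, 3],
     "col2": [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]},
    index=cid,
)
rel_index = [5, 6, 17]
relc1 = [2, 3]
relc2 = [1]
c3 = [1, 2, 3]
c4 = [5, 6, 7]

result = (
    df
    # move both columns into the index, next to the original labels
    .set_index(["col1", "col2"], append=True)
    # filter on all three index levels at once
    .loc(axis=0)[rel_index, relc1, relc2]
    # turn the last two levels back into columns
    .reset_index(level=[-2, -1])
    # replace the column values (needs exactly len(c3) selected rows)
    .assign(col1=c3, col2=c4)
)

# Write the new values back into the original frame by label
df.loc[result.index] = result
print(df)
```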
