
Taking columns from a dataframe based on row values of another dataframe in Python?

I am working with two dataframes and trying to create several dataframes from df1 based on the row values of df2. I have not been able to find any documentation on how to do this.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({
    'A': 'foo bar bro bir fin car zoo loo'.split(),
    'B': 'one one two three two two one three'.split(),
    'C': np.arange(8), 'D': np.arange(8) * 2
})
print(df1)


df2 = pd.DataFrame({
    'col1': 'foo bar bro bir'.split(),
    'col2': 'B B C B '.split(),
    'col3': 'D C D D '.split()
})
print(df2)

How do I create a dataframe called 'foo' that takes only columns B and D from df1 (the columns named in the 'foo' row of df2)? The same goes for the dataframes 'bar', 'bro' and 'bir'. For example, df_foo and df_bar should look like this:

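# wrap the scalars in lists: pd.DataFrame raises a ValueError for all-scalar values without an index
df_foo = pd.DataFrame({'B': ['one'], 'D': [0]})

df_bar = pd.DataFrame({'B': ['one'], 'C': [1]})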

I could not find any documentation on how this can be done.

What about using .loc for label-based indexing? For example:

df1_ = df1.set_index('A')              # use column A as the row labels.
print(df1_.loc[['foo'], ['B', 'D']])   # use `.loc` with lists of row and column labels.
# 
#        B  D
# A          
# foo  one  0

So, to build a new dataframe that uses the rows of df2 as lookups into df1, you can do

df_all = pd.concat(
    (
        df1_.loc[[row.col1], [row.col2, row.col3]]
        for _, row in df2.iterrows()
    ),
    sort=True,  # sort the combined columns alphabetically (B, C, D)
)
print(df_all)
#          B    C    D
# A                   
# foo    one  NaN  0.0
# bar    one  1.0  NaN
# bro    NaN  2.0  4.0
# bir  three  NaN  6.0

And finally, an example with 'bar' (replace 'bar' with 'foo' or any other label):

df_bar = df_all.loc['bar'].dropna()
print(df_bar)
# B    one
# C    1.0
# Name: bar, dtype: object

# or, to keep working with dataframes
print(df_all.loc[['bar'], :].dropna(axis=1))
#        B    C
# A            
# bar  one  1.0
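
Since the question asks for one dataframe per label, another option is to skip the concatenation and keep the small frames in a dict. The following is only a sketch under the same df1 and df2 as above; the name `frames` is made up for illustration and is not part of the original answer. Because nothing is concatenated, no NaN values appear and the integer columns keep their dtype.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'A': 'foo bar bro bir fin car zoo loo'.split(),
    'B': 'one one two three two two one three'.split(),
    'C': np.arange(8), 'D': np.arange(8) * 2
})
df2 = pd.DataFrame({
    'col1': 'foo bar bro bir'.split(),
    'col2': 'B B C B'.split(),
    'col3': 'D C D D'.split()
})

df1_ = df1.set_index('A')

# `frames` is a hypothetical name: one single-row DataFrame per row of df2,
# keyed by the label found in col1.
frames = {
    row.col1: df1_.loc[[row.col1], [row.col2, row.col3]]
    for row in df2.itertuples(index=False)
}

print(frames['foo'])  # columns B and D for the row labelled 'foo'
print(frames['bar'])  # columns B and C for the row labelled 'bar'
```

A dict is usually easier to work with than separate variables named df_foo, df_bar and so on, because the labels come from data and can be looped over.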

If df2 names more than two columns per row (say df1 has 70-80 columns to choose from), you can avoid hard-coding col2 and col3:

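idx     = 'col1'                                   # column of df2 holding the row labels
cols    = [c for c in df2.columns if c != idx]     # every other column names a df1 column
df_agno = pd.concat(
    (
        df1_.loc[[row[idx]], row[cols].tolist()]
        for _, row in df2.iterrows()
    ),
    sort=True,  # sort the combined columns alphabetically
)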
print(df_agno)
#          B    C    D
# A                   
# foo    one  NaN  0.0
# bar    one  1.0  NaN
# bro    NaN  2.0  4.0
# bir  three  NaN  6.0

print(df_agno.loc[['bar'], :].dropna(axis=1))
#        B    C
# A            
# bar  one  1.0
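
With many columns, a typo in df2 becomes more likely, and `.loc` with a list of labels raises a KeyError as soon as one label is missing. The sketch below checks df2 against df1 before any lookup. The helper name `find_missing` and its parameters are invented for this example and are not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'A': 'foo bar bro bir fin car zoo loo'.split(),
    'B': 'one one two three two two one three'.split(),
    'C': np.arange(8), 'D': np.arange(8) * 2
})
df2 = pd.DataFrame({
    'col1': 'foo bar bro bir'.split(),
    'col2': 'B B C B'.split(),
    'col3': 'D C D D'.split()
})


def find_missing(df1, df2, key='A', idx='col1'):
    """Return (row labels, column names) requested in df2 but absent from df1.

    Hypothetical helper: `key` is the df1 column used as row label,
    `idx` is the df2 column that holds those labels.
    """
    cols = [c for c in df2.columns if c != idx]
    missing_rows = sorted(set(df2[idx]) - set(df1[key]))
    requested = set(df2[cols].to_numpy().ravel())
    missing_cols = sorted(requested - set(df1.columns))
    return missing_rows, missing_cols


print(find_missing(df1, df2))  # two empty lists when everything in df2 exists in df1

# A deliberately broken copy of df2 to show what gets reported.
df2_bad = df2.copy()
df2_bad.loc[0, 'col1'] = 'xyz'
df2_bad.loc[1, 'col3'] = 'Z'
print(find_missing(df1, df2_bad))
```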
