How to count the values in duplicated rows in pandas

Although this seems like an easy problem, I have been struggling with it for a while. I have two dataframes, and I want to find the rows that are duplicated between them with respect to certain columns, then sum the values of both dataframes with respect to another column. I will do my best to show what I mean. The following tables describe the structure of the two dataframes, which I will call df1 and df2.

df1:

make   2019-12-01  2019-06-04
BMW    0           3
VW     1           3

df2:

make   2018-12-01  2019-06-04
TESLA  0           2
VW     2           2

This is my attempt:

df = pd.concat([df1, df2], axis=1)
df_2 = df[df.duplicated(subset=['make'], keep=False)]
df_2 = pd.DataFrame(df_2)
valuePosition1 = df_2.columns.get_loc(2019-12-01)
valuePosition2 = df_2.columns.get_loc(2018-12-01)
flow = min(df_2.iloc[:, valuePosition1].sum(), df_2.iloc[:, valuePosition2].sum())
print(flow)

The answer should be 1, as there is a VW in both df1[2019-12-01] and df2[2018-12-01], but I keep getting weird errors:

The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

The message doesn't seem to relate to what I am doing, and I am really at my wits' end. Both dataframes are also very big, so I need a quick way to do this.

Any guidance or help would be appreciated!

It is better to concatenate along the row axis (concat(..., axis=0)), since duplicated compares rows with each other. Its documentation describes the return value as:

"Return boolean Series denoting duplicate rows."

You can also use loc (which is primarily label based) rather than iloc (which is primarily integer position based) considering you know the columns you're interested in.
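To make the difference between the two axes concrete, here is a minimal sketch that builds the question's tables inline instead of reading them from files. The names wide, tall and mask are only illustrative and do not come from the original post.

```python
import pandas as pd

# The two tables from the question, built inline for illustration
df1 = pd.DataFrame({"make": ["BMW", "VW"],
                    "2019-12-01": [0, 1],
                    "2019-06-04": [3, 3]})
df2 = pd.DataFrame({"make": ["TESLA", "VW"],
                    "2018-12-01": [0, 2],
                    "2019-06-04": [2, 2]})

# axis=1 places the frames side by side: 'make' becomes two separate columns
wide = pd.concat([df1, df2], axis=1)
print(wide.shape)  # (2, 6)

# axis=0 stacks the frames: every make gets its own row under a single 'make' column
tall = pd.concat([df1, df2], axis=0)
print(tall.shape)  # (4, 4)

# Row-wise duplicate detection now flags both VW rows
mask = tall.duplicated(subset=["make"], keep=False)
print(tall.loc[mask, "make"].tolist())  # ['VW', 'VW']
```

With the side-by-side layout, BMW and TESLA end up in the same row, so no row-wise comparison on make can find the shared VW entry.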

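import pandas as pd

# Whitespace-separated sample files holding the two tables from the question
df1 = pd.read_csv('sample1.csv', sep=r'\s+')
df2 = pd.read_csv('sample2.csv', sep=r'\s+')

# Stack the frames row-wise so that each make occupies its own row
df = pd.concat([df1, df2], axis=0)
print(df)

# keep=False marks every occurrence of a repeated make, not only the later ones
dfd = df[df.duplicated(subset=['make'], keep=False)]
print(dfd)

# sum() skips NaN by default, so each column only counts its own frame's values
flow = min(dfd.loc[:, '2019-12-01'].sum(),
           dfd.loc[:, '2018-12-01'].sum())
print(flow)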

Output from df

    make  2019-12-01  2019-06-04  2018-12-01
0    BMW         0.0           3         NaN
1     VW         1.0           3         NaN
0  TESLA         NaN           2         0.0
1     VW         NaN           2         2.0

Output from dfd

  make  2019-12-01  2019-06-04  2018-12-01
1   VW         1.0           3         NaN
1   VW         NaN           2         2.0

Output from flow

1.0
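Since the question mentions that both dataframes are very big, one possible alternative is an inner merge on make, which keeps only the makes present in both frames and avoids building the NaN-padded stacked frame. This is a sketch of a different technique, not the approach above, and it assumes each make appears at most once per dataframe. The name common is illustrative, and whether it is actually faster on your data would need to be measured.

```python
import pandas as pd

# The two tables from the question, built inline for illustration
df1 = pd.DataFrame({"make": ["BMW", "VW"],
                    "2019-12-01": [0, 1],
                    "2019-06-04": [3, 3]})
df2 = pd.DataFrame({"make": ["TESLA", "VW"],
                    "2018-12-01": [0, 2],
                    "2019-06-04": [2, 2]})

# Inner merge keeps only makes found in both frames.
# validate="one_to_one" raises an error if a make repeats within either frame,
# which would otherwise multiply rows and inflate the sums.
common = df1[["make", "2019-12-01"]].merge(
    df2[["make", "2018-12-01"]],
    on="make",
    how="inner",
    validate="one_to_one",
)

flow = min(common["2019-12-01"].sum(), common["2018-12-01"].sum())
print(flow)
```

Unlike duplicated, which also flags a make repeated inside a single frame, the merge only matches makes across the two frames, so the two approaches can differ when a frame contains repeated makes.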
