How do I compare each row with all the others and if it's the same I concatenate to a new dataframe? Python

I have a DataFrame with 3 columns:

import pandas as pd

data = {'Country': ['A', 'A', 'A', 'B', 'B'],
        'Capital': ['CC', 'CD', 'CE', 'CF', 'CG'],
        'Population': [5, 35, 20, 34, 65]}

df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])

I want to compare each row with all the others and, if two rows have the same Country, concatenate the pair into a new DataFrame (and write it to a new CSV file).

new_data = {'Country': ['A', 'A', 'B'],
            'Capital': ['CC', 'CD', 'CF'],
            'Population': [5, 35, 34],
            'Country_2': ['A', 'A', 'B'],
            'Capital_2': ['CD', 'CE', 'CG'],
            'Population_2': [35, 20, 65]}

df_new = pd.DataFrame(new_data, columns=['Country', 'Capital', 'Population', 'Country_2', 'Capital_2', 'Population_2'])

NOTE: This is a simplification of my data. I have more than 5000 rows and would like to do this automatically. I tried comparing dictionaries, and also comparing one row at a time, but I couldn't get it to work. Thank you for your attention.

>>> df.join(df.groupby('Country').shift(-1), rsuffix='_2')\
...   .dropna(how='any')
  Country Capital  Population Capital_2  Population_2
0       A      CC           5        CD          35.0
1       A      CD          35        CE          20.0
3       B      CF          34        CG          65.0

This pairs every row with the next one using join + shift, but restricts the shift to rows within the same country using groupby. Here is what groupby + shift does on its own:

>>> df.groupby('Country').shift(-1)
  Capital  Population
0      CD        35.0
1      CE        20.0
2     NaN         NaN
3      CG        65.0
4     NaN         NaN

Once these shifted values are joined to the right of your data with the _2 suffix, the rows that contain NaNs (the last row of each country, which has no following row to pair with) are dropped with dropna().

Finally, note that Country_2 is not repeated, since it is the same as Country, but it would be very easy to add. Also note that Population_2 comes out as float because of the intermediate NaNs.
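As a minimal sketch of that last step, assuming the same sample data as in the question, the following adds Country_2, restores the integer dtype of Population_2, reorders the columns to match the requested layout, and produces CSV text. The names `pairs`, `wanted_order` and `csv_text` are only illustrative and are not part of the original answer.

```python
import pandas as pd

data = {'Country': ['A', 'A', 'A', 'B', 'B'],
        'Capital': ['CC', 'CD', 'CE', 'CF', 'CG'],
        'Population': [5, 35, 20, 34, 65]}
df = pd.DataFrame(data)

# Pair each row with the next row of the same country, then drop the
# rows that have no partner (last row of each country).
pairs = df.join(df.groupby('Country').shift(-1), rsuffix='_2').dropna(how='any')

# Country_2 is by construction identical to Country.
pairs['Country_2'] = pairs['Country']

# The NaNs introduced by shift turned Population_2 into float; cast it back.
pairs['Population_2'] = pairs['Population_2'].astype('int64')

# Reorder the columns to match the layout asked for in the question.
wanted_order = ['Country', 'Capital', 'Population',
                'Country_2', 'Capital_2', 'Population_2']
pairs = pairs[wanted_order].reset_index(drop=True)

# With no path argument, to_csv returns the CSV as a string; pass a file
# path instead (for example 'pairs.csv', a hypothetical name) to write to disk.
csv_text = pairs.to_csv(index=False)
print(csv_text)
```

Casting back to an integer dtype is only safe after dropna(), since an integer column of this kind cannot hold NaN.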

To get all combinations of rows within each country (not only consecutive pairs), you can try:

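from itertools import combinations, chain

import numpy as np

# For each country, build every 2-row combination and lay each pair side by side.
# The result is stored under a new name so the original df is not overwritten.
df_pairs = (
    pd.concat(
        [pd.DataFrame(
            # Flatten the pairs of rows, then reshape so that one pair fills one output row.
            np.array(list(chain(*(combinations(k.values, 2))))).reshape(-1, len(df.columns) * 2),
            columns=df.columns.append(df.columns.map(lambda x: x + '_2')))
        for g, k in df.groupby('Country')]
        )
)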
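An alternative way to get the same all-pairs result is a self-merge on Country, sketched below under the assumption that the sample data from the question is used. This is a different technique from the itertools approach above, not a rewrite of it; the helper column `row` and the names `left`, `merged` and `all_pairs` are hypothetical.

```python
import pandas as pd

data = {'Country': ['A', 'A', 'A', 'B', 'B'],
        'Capital': ['CC', 'CD', 'CE', 'CF', 'CG'],
        'Population': [5, 35, 20, 34, 65]}
df = pd.DataFrame(data)

# Tag each row with its position so every unordered pair is kept only once.
left = df.assign(row=range(len(df)))

# Self-merge on Country: each row is matched with every row of the same country.
merged = left.merge(left, on='Country', suffixes=('', '_2'))

# Keep pairs where the first row precedes the second; this removes
# self-pairs and mirrored duplicates such as (CD, CC).
all_pairs = (
    merged[merged['row'] < merged['row_2']]
    .sort_values(['row', 'row_2'])
    .drop(columns=['row', 'row_2'])
    .reset_index(drop=True)
)

# The merge key appears only once, so add Country_2 explicitly after Population.
all_pairs.insert(3, 'Country_2', all_pairs['Country'])

print(all_pairs)
```

Unlike the NumPy reshape, the merge keeps the original column dtypes. It does, however, build every row-by-row match within a country before filtering, so memory use may become a concern if a single country has a very large number of rows.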
