
How to compare pairs of values in two dataframes of different sizes in python?

I have two dataframes of different sizes:

  1. sdfn with columns 'ConceptID1' and 'ConceptID2'
      ConceptID1  ConceptID2
0           5743        4513
1           5743        7099
2           4513        7099
3          10242        7042
4          10242        7099
...          ...         ...
2601       12028       12043
2602       12371       12043
2603      266632       54106
2604      266632       51135
2605       54106       51135

  2. jdfn with columns 'Gene1' and 'Gene2'
      Gene1   Gene2
0      1535     353
1      9970     332
2     23581  112401
3       846  112401
4    150160  112401
..      ...     ...
384   79626   51284
385   79626   51311
386    7305   51311
387   80342   79626
388    7305   79626

I need to find the pairs that appear in both dataframes, with the two values in either order.

I tried this:

for index, row in sdfn.iterrows():
    for index, row in jdfn.iterrows():
        if ((sdfn['ConceptID1']==jdfn['Gene1']) and (sdfn['ConceptID2']==jdfn['Gene2'])) or (sdfn['ConceptID1']==jdfn['Gene2']) and ((sdfn['ConceptID2']==jdfn['Gene1'])):
            print(sdfn['ConceptID1'], jdfn['Gene1'], sdfn['ConceptID2'], jdfn['Gene2'])

The result:

Traceback (most recent call last):
  File "", line 3, in
    if ((sdfn['ConceptID1']==jdfn['Gene1']) and (sdfn['ConceptID2']==jdfn['Gene2'])) or (sdfn['ConceptID1']==jdfn['Gene2']) and ((sdfn['ConceptID2']==jdfn['Gene1'])):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/pandas/core/ops/__init__.py", line 1142, in wrapper
    raise ValueError("Can only compare identically-labeled " "Series objects")
ValueError: Can only compare identically-labeled Series objects

The issue is that the loop variables are never used inside the loop body, so the condition compares entire dataframe columns with each other instead of individual values.

sdfn['ConceptID1'], sdfn['ConceptID2'], jdfn['Gene1'], jdfn['Gene2']

each refer to a whole column, which pandas represents as a Series object. The two dataframes have different lengths, so their Series carry different index labels, hence the complaint about identically-labeled Series in the error message.
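
To see the error in isolation, here is a minimal sketch with two made-up Series of different lengths. The names concept_ids and gene_ids and their values are illustrative only, not the real data. Comparing the Series with == raises the same ValueError, while comparing single values taken from them yields an ordinary boolean.

```python
import pandas as pd

# Hypothetical toy columns of different lengths, standing in for
# sdfn['ConceptID1'] and jdfn['Gene1'].
concept_ids = pd.Series([5743, 5743, 4513])
gene_ids = pd.Series([1535, 9970])

error_message = None
try:
    # Series-to-Series comparison requires matching index labels.
    concept_ids == gene_ids
except ValueError as exc:
    error_message = str(exc)
    print(error_message)

# Individual values are scalars, so == gives a single True/False.
scalar_result = bool(concept_ids.iloc[0] == gene_ids.iloc[0])
print(scalar_result)
```

This is why the comparison has to be made on the row objects that iterrows() yields rather than on the dataframe columns.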

Give each loop its own variable names, then use those row variables in the comparison:

for sind, srow in sdfn.iterrows():
    for jind, jrow in jdfn.iterrows():
        # A pair matches in the same order or with the two IDs swapped.
        same_order = srow['ConceptID1'] == jrow['Gene1'] and srow['ConceptID2'] == jrow['Gene2']
        swapped = srow['ConceptID1'] == jrow['Gene2'] and srow['ConceptID2'] == jrow['Gene1']
        if same_order or swapped:
            print(srow['ConceptID1'], jrow['Gene1'], srow['ConceptID2'], jrow['Gene2'])

Note that in your posted code both loops use the same names, index and row. The inner loop reassigns them on every iteration, so the outer loop's row is overwritten and only the current jdfn row is reachable inside the loop body. Without distinct names there is no way to compare a row of sdfn against a row of jdfn.
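
As a side note, the nested loops visit every combination of rows (roughly 2606 x 389 comparisons for the frames shown), which can be slow with iterrows(). One possible alternative, sketched here under assumptions and not part of the original answer, is to sort the two values within each row so that (a, b) and (b, a) share one key, then join the frames on that key with merge. The helper name match_unordered_pairs, the temporary columns _lo and _hi, and the small sample frames are all made up for illustration.

```python
import numpy as np
import pandas as pd

def match_unordered_pairs(left, left_cols, right, right_cols):
    """Return rows of left and right whose two-value pairs match in either order."""
    keyed = []
    for frame, cols in ((left, left_cols), (right, right_cols)):
        # Sort each row's two values so (a, b) and (b, a) get the same key.
        sorted_vals = np.sort(frame[list(cols)].to_numpy(), axis=1)
        keyed.append(frame.assign(_lo=sorted_vals[:, 0], _hi=sorted_vals[:, 1]))
    # Inner join keeps only the keys present in both frames.
    merged = keyed[0].merge(keyed[1], on=["_lo", "_hi"], how="inner")
    return merged.drop(columns=["_lo", "_hi"])

# Hypothetical sample data, not the real sdfn/jdfn contents.
sdfn = pd.DataFrame({"ConceptID1": [5743, 5743, 4513],
                     "ConceptID2": [4513, 7099, 7099]})
jdfn = pd.DataFrame({"Gene1": [4513, 7099, 1535],
                     "Gene2": [5743, 4513, 353]})

matches = match_unordered_pairs(sdfn, ("ConceptID1", "ConceptID2"),
                                jdfn, ("Gene1", "Gene2"))
print(matches)
```

If a pair occurs more than once in either frame, the join returns one row per combination, much as the nested loop would print one line per matching row pair.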

Hope this helps!

