
Iterate over rows in Pandas dataframe to find values in other file and extract index

I have a CSV file imported as a pandas dataframe, with filenames in one column. I also have a NumPy array containing the same filenames, but at different indexes. How can I iterate over the filenames in the dataframe, find each one in the NumPy array, and extract the index at which it appears there?

So for example:

>>> d = {'col1': ["Apple", "Peach"], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
    col1  col2
0  Apple     3
1  Peach     4

>>> b = np.array(["Apple", "Banana", "Pear", "Peach"])
>>> b
array(['Apple', 'Banana', 'Pear', 'Peach'], dtype='<U6')

Now I would like to know, for every item in the df, at which index it sits in the array, so I can append something at that position in another array.

I have tried something like this:

for i,j in df:
    if j in b:
        print(b.get_loc)

IIUC, we can turn both the array and the df column into dicts keyed by their indices, and use a function to find the matching pairs:

import collections as colls

import numpy as np
import pandas as pd

d = {'col_1': ['Apple', 'Peach'], 'col_2': [3, 4]}
df = pd.DataFrame(data=d)
b = np.array(['Apple', 'Banana', 'Pear', 'Peach'])

d_1 = df['col_1'].to_dict()
d_2 = dict(enumerate(b))


def dicts_to_tuples(*dicts):
    result = colls.defaultdict(list)
    for curr_dict in dicts:
        for k, v in curr_dict.items():
            result[v].append(k)
    return [tuple(v) for v in result.values() if len(v) > 1]


print(d_1)  # {0: 'Apple', 1: 'Peach'}
print(d_2)  # {0: 'Apple', 1: 'Banana', 2: 'Pear', 3: 'Peach'}
print(dicts_to_tuples(d_1, d_2))  # [(0, 0), (1, 3)]

The rest is down to you.
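For the original goal (finding where each name from the dataframe sits in the array), a plain lookup dictionary may already be enough. The following is a minimal sketch rather than part of the answer above: the names position_of and positions are made up for illustration, and it assumes each name occurs only once in b.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': ['Apple', 'Peach'], 'col_2': [3, 4]})
b = np.array(['Apple', 'Banana', 'Pear', 'Peach'])

# Map each name in b to its position.
# Assumption: names in b are unique; with duplicates the last position wins.
position_of = {name: i for i, name in enumerate(b.tolist())}

# Look up every dataframe name; a name missing from b yields None.
positions = [position_of.get(name) for name in df['col_1']]
print(positions)  # [0, 3]
```

Because the lookup follows the dataframe's row order, the result stays aligned with the df even when its names are ordered differently from b.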

You could even turn the array into a dataframe and perform a merge:

df2 = pd.DataFrame({'col1': b}).reset_index()  # 'index' column = position of each name in b
merge_ = pd.merge(df, df2, on='col1', how='inner')

Does this solution work? It depends on whether you need to know the corresponding key; if not, this gives just the list of indices:

mask = np.isin(b, df['col1'])  # np.isin replaces the deprecated np.in1d
idx = np.arange(len(mask)) 
idx[mask]

# array([0, 3])
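The mask and index array above can also be collapsed into a single expression. A minimal sketch, assuming a NumPy version that provides np.isin (1.13 or later); the name matches is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['Apple', 'Peach'], 'col2': [3, 4]})
b = np.array(['Apple', 'Banana', 'Pear', 'Peach'])

# Indices of b whose value appears in df['col1'],
# listed in the order they occur in b (not in df row order).
matches = np.flatnonzero(np.isin(b, df['col1']))
print(matches.tolist())  # [0, 3]
```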

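You can also do this to get a dict of the locations. Note that idx[mask] follows the order of b, so the assignment below lines up only if every name in the df occurs exactly once in b and in the same relative order:

df['idx'] = idx[mask]

df.set_index('idx')['col1'].to_dict()
# {0: 'Apple', 3: 'Peach'}

df.set_index('col1')['idx'].to_dict()
# {'Apple': 0, 'Peach': 3}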
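If the dataframe rows are not in the same order as the array, or some names are missing from it, an index-based lookup may be safer. The sketch below is an alternative technique, not the method from the answers above: it uses pd.Index.get_indexer, which reports -1 for names that are not found and requires the values in b to be unique. The example data is deliberately reordered and extended with a made-up name ('Kiwi'), and the target array other is hypothetical, standing in for the "other array" mentioned in the question.

```python
import numpy as np
import pandas as pd

# Illustrative data: rows reordered, plus one name that is not in b.
df = pd.DataFrame({'col1': ['Peach', 'Apple', 'Kiwi'], 'col2': [4, 3, 7]})
b = np.array(['Apple', 'Banana', 'Pear', 'Peach'])

# Position of each dataframe name in b, in dataframe row order.
# get_indexer returns -1 for names that do not occur in b.
pos = pd.Index(b).get_indexer(df['col1'])
print(pos.tolist())  # [3, 0, -1]

# Hypothetical target array aligned with b:
# write col2 at the matched positions and skip unmatched rows.
found = pos >= 0
other = np.zeros(len(b), dtype=int)
other[pos[found]] = df['col2'].to_numpy()[found]
print(other.tolist())  # [3, 0, 0, 4]
```

Filtering on found before indexing matters here, because -1 would otherwise be read by NumPy as "last element" and silently overwrite the final slot.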


 