
Compare two dataframe columns for matching strings or substrings, then count, in pandas (Need for speed edition)

I have two dataframes (A and B). I want to compare each string in A against the strings in B and find where it matches, or is contained in, a string in B, then count the number of times each row of A was matched or contained in B.

    Dataframe A
 
0   "4012, 4065, 4682"
1   "4712, 2339, 5652, 10007"
2   "4618, 8987"
3   "7447, 4615, 4012"
4   "6515"
5   "4065, 2339, 4012"


    Dataframe B

0   "6515, 4012, 4618, 8987"         <- matches (DF A, Index 2 & 4) (2: 4618, 8987), (4: 6515)
1   "4065, 5116, 2339, 8757, 4012"   <- matches (DF A, Index 5) (4065, 2339, 4012)
2   "1101"
3   "6515"                           <- matches (DF A, Index 4) (6515)
4   "4012, 4615, 7447"               <- matches (DF A, Index 3) (7447, 4615, 4012)
5   "7447, 6515, 4012, 4615"         <- matches (DF A, Index 3 & 4) (3: 7447, 4615, 4012 ), (4: 6515)


    Desired Output:

    Itemset             Count

2   4618, 8987            1
3   7447, 4615, 4012      2
4   6515                  3
5   4065, 2339, 4012      1

Basically, I want to count a hit whenever a row of A exactly matches a row of B (in any order) or whenever all of A's items are contained in a row of B (again in any order). My goal is to count how many times each row of A is validated by B. These are all strings, by the way.
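
In set terms, what I'm after is a subset test: every item of the A row must appear somewhere in the B row, order ignored. A minimal sketch with plain Python sets (values taken from index 3 of A and index 5 of B above):

# Subset test: every item of the A row appears in the B row, order ignored
a = {"7447", "4615", "4012"}            # itemset from dataframe A, index 3
b = {"7447", "6515", "4012", "4615"}    # itemset from dataframe B, index 5
print(a.issubset(b))  # True -> this B row validates the A row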

EDIT (Need for speed edition): This is a redo of my previous question: Compare two dataframe columns for matching strings or are substrings then count in pandas

I have millions of rows in both dfA and dfB to make these comparisons against. In my previous post, the following code got the job done:

import pandas as pd

dfA = pd.DataFrame(["4012, 4065, 4682",
                    "4712, 2339, 5652, 10007",
                    "4618, 8987",
                    "7447, 4615, 4012",
                    "6515",
                    "4065, 2339, 4012",],
                    columns=['values'])

dfB = pd.DataFrame(["6515, 4012, 4618, 8987",
                    "4065, 5116, 2339, 8757, 4012",
                    "1101",
                    "6515",
                    "4012, 4615, 7447",
                    "7447, 6515, 4012, 4615"],
                    columns=['values'])

# Split each comma-separated string into a list of item strings
dfA['values_list'] = dfA['values'].str.split(', ')
dfB['values_list'] = dfB['values'].str.split(', ')

# For each dfA row, count the dfB rows whose list contains every item
dfA['overlap_A'] = [sum(all(val in cell for val in row)
                        for cell in dfB['values_list'])
                    for row in dfA['values_list']]

However, with the total number of rows to check, I am running into performance problems and need a faster way to compute these counts. It seems like NumPy may be needed here, but the snippet below is about the extent of my NumPy knowledge, as I work primarily in pandas. Does anyone have suggestions to make this faster?

dfA_array = dfA['values_list'].to_numpy()
dfB_array = dfB['values_list'].to_numpy()
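
One constant-factor tweak would be to precompute each row as a set, so the membership tests become O(1) — a sketch, still quadratic in the number of rows, which is why I am asking for something better:

# Sketch: precompute sets once so each membership test is O(1)
# (still O(len(dfA) * len(dfB)) subset tests overall)
setsB = [set(lst) for lst in dfB['values_list']]
dfA['overlap_A'] = [sum(set(row).issubset(s) for s in setsB)
                    for row in dfA['values_list']]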

Give this a try. Your algorithm is O(N²K): the square of the row count times the number of words per line. The version below should improve that to O(NK) by building an inverted index from each value to the set of dfB rows that contain it, then intersecting those row sets for each dfA row:

from collections import defaultdict
from functools import reduce

# Inverted index: value -> set of dfB row indices that contain it
d = defaultdict(set)
for i, t in enumerate(dfB['values']):
    for s in t.split(', '):
        d[s].add(i)

# A dfA row's count is the size of the intersection of its values' row sets
dfA['count'] = dfA['values'].apply(
    lambda x: len(reduce(set.intersection,
                         [d[s] for s in x.split(', ')])))
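
As a quick sanity check against the sample frames above (my expected numbers, read off the desired output, with the two unmatched rows counted as 0):

# Expected per-row counts for the sample data: rows 0 and 1 match nothing
print(dfA['count'].tolist())  # [0, 0, 1, 2, 3, 1]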
