
Problems with pd.merge

Hope you all are having an excellent week.

So, I was finishing a script that worked really well for a specific use case. The base is as follows:

Function cosine_similarity_join:

def cosine_similarity_join(a:pd.DataFrame, b:pd.DataFrame, col_name):

    a_len = len(a[col_name])

    # all of the "documents" in a 1D array
    corpus = np.concatenate([a[col_name].to_numpy(), b[col_name].to_numpy()])
    
    # vectorize the array
    tfidf, vectorizer = fit_vectorizer(corpus, 3)

    # in this matrix each row represents the str in a and the col is the str from b, value is the cosine similarity
    res = cosine_similarity(tfidf[:a_len], tfidf[a_len:])

    res_series = pd.DataFrame(res).stack().rename("score")
    res_series.index.set_names(['a', 'b'], inplace=True)
    
    # join scores to b
    b_scored = pd.merge(left=b, right=res_series, left_index=True, right_on='b').droplevel('b')

    # find the indices on which to match, (highest score in each row)
    best_match = np.argmax(res, axis=1)

    # Join the rest of 
    res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))
    print(res)

    df = res.reset_index()
    df = df.iloc[df.groupby(by="RefCol")["score"].idxmax()].reset_index(drop=True)

    return df

This works like a charm when I do something like:

resulting_df = cosine_similarity_join(df1,df2,'My_col')

But in my case, I need something along the lines of:

big_df = pd.read_csv('some_really_big_df.csv')
some_other_df = pd.read_csv('some_other_small_df.csv')

counter = 0
size = 10000
total_size = len(big_df)

while counter < total_size:

    small_df = big_df[counter:counter+size]
    resulting_df = cosine_similarity_join(small_df,some_other_df,'My_col')
    counter += size
    

I have already traced the problem to one specific line in the function:

res = pd.merge(left=a, right=b_scored, left_index=True, right_index=True, suffixes=('', '_Guess'))

Basically, this res dataframe comes out empty and I just cannot understand why (when I replicate the values outside of the loop, it works just fine)...

I have been looking at the problem for hours now and would gladly welcome a fresh perspective on it.

Thank you all in advance!

Found the problem!

I just needed to reset the index before the join. When I create a new small df by slicing the big df, the slice keeps the index labels it had in the big one (for example 10000 to 19999 for the second chunk, assuming the default index from read_csv), while the scores inside the function are indexed from 0. The index-based merge then finds no matching labels and returns an empty dataframe.
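For anyone who wants to see the effect in isolation, here is a minimal sketch. The tiny frames and the names scores, first_chunk and second_chunk are made up for illustration; scores only stands in for the positionally indexed b_scored from the function.

```python
import pandas as pd

# Hypothetical stand-in for big_df: six rows with the default 0..5 index.
big_df = pd.DataFrame({"My_col": ["a", "b", "c", "d", "e", "f"]})

# Hypothetical stand-in for the scores: labelled 0..2 by position,
# the way pd.DataFrame(res) labels its rows.
scores = pd.DataFrame({"score": [0.9, 0.8, 0.7]})

first_chunk = big_df[0:3]    # index labels 0, 1, 2
second_chunk = big_df[3:6]   # index labels 3, 4, 5 (not 0, 1, 2)

# Inner merge on the index: only rows with matching labels survive.
ok = pd.merge(first_chunk, scores, left_index=True, right_index=True)
empty = pd.merge(second_chunk, scores, left_index=True, right_index=True)

# Renumbering the slice from 0 makes the labels line up again.
fixed = pd.merge(second_chunk.reset_index(drop=True), scores,
                 left_index=True, right_index=True)

print(len(ok), len(empty), len(fixed))  # 3 0 3
```

This would also explain why the call worked outside the loop: a frame read straight from a file, like the first chunk here, already starts at 0.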

So basically all I needed to do was:

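while counter < total_size:

    small_df = big_df[counter:counter+size]
    # renumber rows from 0; the old labels are kept in an "index" column
    # (use reset_index(drop=True) to discard them)
    small_df = small_df.reset_index()
    resulting_df = cosine_similarity_join(small_df,some_other_df,'My_col')
    counter += size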
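If the per-chunk results need to be kept (the loop above overwrites resulting_df on every pass), one possible variation is sketched below. iter_chunks and join_stub are hypothetical names, not part of the original script; join_stub only imitates a function that, like cosine_similarity_join, expects its input to be indexed 0..n-1.

```python
import pandas as pd

def iter_chunks(df, size):
    """Yield consecutive slices of df, each renumbered from 0."""
    # range() stops before len(df), so no empty trailing chunk is produced.
    for start in range(0, len(df), size):
        yield df.iloc[start:start + size].reset_index(drop=True)

def join_stub(chunk):
    """Hypothetical stand-in for cosine_similarity_join."""
    positional = pd.DataFrame({"pos": range(len(chunk))})
    return pd.merge(chunk, positional, left_index=True, right_index=True)

# Made-up data: seven rows, processed three at a time.
big_df = pd.DataFrame({"My_col": list("abcdefg")})

results = [join_stub(chunk) for chunk in iter_chunks(big_df, size=3)]
combined = pd.concat(results, ignore_index=True)

print(len(combined))  # 7
```

Using drop=True here discards the old labels instead of keeping them as an extra column; whether that is preferable depends on whether the original row positions are needed later.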

I'll leave it here in case it helps someone :)

Cheers!
