Jaccard Similarity for Texts in a pandas DataFrame

I want to measure the Jaccard similarity between texts in a pandas DataFrame. More precisely, I have some groups of entities, and there is some text for each entity over a period of time. I want to analyse the text similarity (here, the Jaccard similarity) over time, separately for each entity.

A minimal example to illustrate my point:


import pandas as pd

entries = [
    {'Entity_Id':'Firm1', 'date':'2001-02-05', 'text': 'This is a text'},
    {'Entity_Id':'Firm1', 'date':'2001-03-07', 'text': 'This is a text'},
    {'Entity_Id':'Firm1', 'date':'2003-01-04', 'text': 'No similarity'},
    {'Entity_Id':'Firm1', 'date':'2007-10-12', 'text': 'Some similarity'},
    {'Entity_Id':'Firm2', 'date':'2001-10-10', 'text': 'Another firm'},
    {'Entity_Id':'Firm2', 'date':'2005-12-03', 'text': 'Another year'},
    {'Entity_Id':'Firm3', 'date':'2002-05-05', 'text': 'Something different'}
    ]

df = pd.DataFrame(entries)

Entity_Id   date         text

Firm1   2001-02-05   'This is a text' 
Firm1   2001-03-07   'This is a text'
Firm1   2003-01-04   'No similarity'
Firm1   2007-10-12   'Some similarity'
Firm2   2001-10-10   'Another firm'
Firm2   2005-12-03   'Another year'
Firm3   2002-05-05   'Something different'

My desired output would be something like this:

Entity_Id   date         text                   Jaccard

Firm1   2001-02-05   'This is a text'       NaN
Firm1   2001-03-07   'This is a text'       1
Firm1   2003-01-04   'No similarity'        0
Firm1   2007-10-12   'Some similarity'      0.33
Firm2   2001-10-10   'Another firm'         NaN 
Firm2   2005-12-03   'Another year'         0.33  
Firm3   2002-05-05   'Something different'  NaN 

That is, I would like to compare all text elements within a group of firms, regardless of the time interval between the texts, and always against the previous text. The first entry for each firm is therefore always empty, as there is no earlier text to compare with. (For example, 'No similarity' and 'Some similarity' share one of three distinct tokens, hence 0.33.)

My approach is to shift the texts, per entity identifier, by one time interval (the next available date), and then to identify and mark the first report of each entity. (I fill the NaNs in text_shifted with the original text and delete it later on; this is needed so the whole column can be tokenized.)

df = df.sort_values(['Entity_Id', 'date'], ascending=True)
# previous text within each entity; NaN for each entity's first report
df['text_shifted'] = df.groupby(['Entity_Id'])['text'].shift(1)
# mark first reports so their Jaccard value can be blanked out later
df['IsNaN'] = df['text_shifted'].isnull().astype(int)
df['text_shifted'] = df['text_shifted'].fillna(df['text'])

I then compute the Jaccard similarity as follows:

def jaccard_similarity(query, document):
    # query and document are token lists; duplicates are ignored via sets
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection) / len(union)

However, I have to tokenize the input first. But if I do something like:

import nltk
df['text_tokens'] = df.text.apply(nltk.word_tokenize)
df['shift_tokens'] = df.text_shifted.apply(nltk.word_tokenize)
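
putting the pieces together would then be something like this sketch (the column names follow from the snippets above; dropping the helper columns at the end is optional):

# Jaccard between each text and the previous text of the same entity
df['Jaccard'] = [
    jaccard_similarity(a, b)
    for a, b in zip(df['text_tokens'], df['shift_tokens'])
]
# blank out the first report of each entity: nothing to compare against
df.loc[df['IsNaN'] == 1, 'Jaccard'] = float('nan')
df = df.drop(columns=['text_shifted', 'IsNaN', 'text_tokens', 'shift_tokens'])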

But this takes ages in the non-simplified data, where each text has roughly 5,000 words and I have about 100,000 texts.

Is there any way I can speed up the process? Can I avoid the tokenization, or better still, use sklearn to calculate the similarity?

If I use the cosine similarity, as suggested here: Cosine Similarity row-wise, I get my results pretty quickly. But I am stuck doing the same with Jaccard.
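
What I have in mind is something along the lines of the following (untested) sketch: let sklearn's CountVectorizer do the tokenization and compute the Jaccard similarity of consecutive rows directly on a sparse binary bag-of-words matrix. Here jaccard_to_previous is just an illustrative helper name, and CountVectorizer's default tokenizer lowercases and drops one-character tokens, so the values can differ slightly from the nltk-based version:

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def jaccard_to_previous(texts):
    # texts: the 'text' Series of one entity, already sorted by date
    if len(texts) < 2:
        return pd.Series([np.nan] * len(texts), index=texts.index)
    # binary bag-of-words: 1 if a token occurs in a text, 0 otherwise
    X = CountVectorizer(binary=True).fit_transform(texts)
    # |A ∩ B| for each consecutive pair of rows
    inter = np.asarray(X[1:].multiply(X[:-1]).sum(axis=1)).ravel()
    # |A ∪ B| = |A| + |B| - |A ∩ B|
    sizes = np.asarray(X.sum(axis=1)).ravel()
    union = sizes[1:] + sizes[:-1] - inter
    return pd.Series([np.nan, *(inter / union)], index=texts.index)

df = df.sort_values(['Entity_Id', 'date'])
df['Jaccard'] = df.groupby('Entity_Id')['text'].transform(jaccard_to_previous)

This only ever compares each row to its direct predecessor, so it stays linear in the number of texts instead of building a full pairwise matrix.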

One way to speed up the process could be parallel processing using Pandas on Ray.
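
A minimal sketch of that route, assuming Modin (the project that began as Pandas on Ray) is installed with pip install "modin[ray]"; it aims to be a drop-in pandas replacement, so only the import changes:

import nltk
import modin.pandas as pd  # drop-in for `import pandas as pd`, backed by Ray

df = pd.DataFrame(entries)
# the same apply as before, but Modin partitions the frame and
# runs the tokenization across all available cores
df['text_tokens'] = df.text.apply(nltk.word_tokenize)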

You can try the NLTK implementation of jaccard_distance for Jaccard similarity. I couldn't find any significant improvement in processing time though (for calculating similarity); it may work out better on a larger dataset.
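
For reference, a drop-in replacement for the custom function could look like this (note that nltk's jaccard_distance expects sets and returns a distance, so it has to be inverted):

from nltk.metrics.distance import jaccard_distance

def jaccard_similarity_nltk(query_tokens, document_tokens):
    # jaccard_distance returns 1 - |A ∩ B| / |A ∪ B|, so invert it
    return 1 - jaccard_distance(set(query_tokens), set(document_tokens))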

I tried comparing the NLTK implementation to your custom Jaccard similarity function (on 200 text samples with an average length of 4 words/tokens).

NLTK jaccard_distance:

CPU times: user 3.3 s, sys: 30.3 ms, total: 3.34 s
Wall time: 3.38 s

Custom Jaccard similarity implementation:

CPU times: user 3.67 s, sys: 19.2 ms, total: 3.69 s
Wall time: 3.71 s
