
How to find the correlation between two strings in pandas

I have a DataFrame of string values:

   Keyword
    plant
    cell
    cat
    Pandas

I want to find the relationship, or correlation, between two of these string values.

I have tried pandas' corr = df1.corrwith(df2, axis=0), but that is only useful for finding the correlation between numerical values. I want to see whether two strings are related by finding a correlation distance between them. How can I do that?

There are a few steps here. The first thing you need to do is extract some sort of vector for each word.

A good way to do this is with gensim's word2vec support (you need to download the pretrained vector files first):

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

After loading the pretrained vectors, you need to extract the vector for each word:

vector = model['plant']

Or, for the pandas column in the example:

df['Vectors'] = df['Keyword'].apply(lambda x: model[x])
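One thing to watch for: model[x] raises a KeyError for any word that is not in the model's vocabulary. A minimal sketch of a safer lookup follows. It assumes a plain dict of made-up 3-dimensional vectors as a stand-in for the loaded model, since a dict supports the same `word in model` and `model[word]` operations. The dict contents and the names in_vocab and known are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded word2vec model. The vectors are
# made up; a real model would return 300-dimensional vectors.
model = {
    "plant": np.array([0.9, 0.1, 0.0]),
    "cell": np.array([0.7, 0.3, 0.1]),
    "cat": np.array([0.0, 0.2, 0.9]),
}

df = pd.DataFrame({"Keyword": ["plant", "cell", "cat", "Pandas"]})

# Flag the keywords that the model knows about before looking them up.
in_vocab = df["Keyword"].apply(lambda word: word in model)

# Keep only the known keywords, then attach their vectors.
known = df[in_vocab].copy()
known["Vectors"] = known["Keyword"].apply(lambda word: model[word])

print(known["Keyword"].tolist())
```

In this toy setup "Pandas" is dropped because the stand-in dict has no entry for it. Whether a given word is present in a real pretrained model depends on that model's vocabulary.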

Once this is done, you can calculate the distance between two vectors in a number of ways, e.g. Euclidean distance:

from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances(list(df['Vectors']))

distances will be a square matrix with one row and one column per word. The diagonal is 0 (each word compared with itself), and entry (i, j) is the distance between word i and word j. The closer a distance is to 0, the more similar the words are.
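To make the matrix easier to read, it can help to label its rows and columns with the words. The sketch below assumes made-up 3-dimensional vectors and computes the pairwise Euclidean distances with plain NumPy rather than sklearn, so that it runs on its own. The names table, masked and nearest are illustrative.

```python
import numpy as np
import pandas as pd

words = ["plant", "cell", "cat"]
# Made-up vectors standing in for real word2vec vectors.
vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.1],
    [0.0, 0.2, 0.9],
])

# Pairwise Euclidean distances via broadcasting:
# entry (i, j) is the norm of vectors[i] - vectors[j].
diff = vectors[:, None, :] - vectors[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Label rows and columns with the words.
table = pd.DataFrame(distances, index=words, columns=words)
print(table.round(3))

# For each word, find the closest other word, hiding the zero diagonal.
masked = table.mask(np.eye(len(words), dtype=bool))
nearest = masked.idxmin(axis=1)
print(nearest)
```

With these toy vectors, "plant" and "cell" come out as each other's nearest neighbour, which matches the intuition that a smaller distance means more similar words.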

You can swap in different models and different distance metrics; treat this as a starting point.
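As an illustration of a different metric, here is a rough sketch of cosine distance and of correlation distance (1 minus the Pearson correlation of the vector components), which is closer to what the question asks for. The vectors are made up, and the helper names cosine_distance and correlation_distance are my own, not part of gensim or pandas.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 when the vectors point the same way."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def correlation_distance(a, b):
    """1 - Pearson correlation between the components of a and b."""
    return 1.0 - np.corrcoef(a, b)[0, 1]

# Made-up vectors standing in for real word2vec vectors.
plant = np.array([0.9, 0.1, 0.0])
cell = np.array([0.7, 0.3, 0.1])
cat = np.array([0.0, 0.2, 0.9])

print("cosine      plant-cell:", cosine_distance(plant, cell))
print("cosine      plant-cat: ", cosine_distance(plant, cat))
print("correlation plant-cell:", correlation_distance(plant, cell))
print("correlation plant-cat: ", correlation_distance(plant, cat))
```

Both distances ignore vector length, which is often preferable for word vectors. If SciPy is available, scipy.spatial.distance provides cosine and correlation functions that compute the same quantities.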

The above approach to loading the model may not work for everyone, so here is the approach that worked for me. I am using Google Colab, hence the '!' before each shell command.

Download the file (i.e. the model) using wget:

!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

Next, unzip the file with gzip:

!gzip -d GoogleNews-vectors-negative300.bin.gz

Then use the models module from gensim to load the downloaded file. This gives you the word-vector model for further use. The file path below is for Google Colab and will differ if you run this locally:

from gensim import models
model = models.KeyedVectors.load_word2vec_format(
    '/content/GoogleNews-vectors-negative300.bin', binary=True)

