I have a DataFrame (df) of string values:
Keyword
plant
cell
cat
Pandas
And I want to find the relationship or correlation between these two string values.
I have tried pandas with corr = df1.corrwith(df2, axis=0), but that only computes the correlation between numerical values. I want to see whether two strings are related by computing some kind of correlation distance between them. How can I do that?
There are a few steps here. The first thing you need to do is extract some sort of vector for each word.
A good way is to use gensim with pretrained word2vec vectors (you need to download the vectors file first):
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)
After loading the pretrained vectors, you need to extract the vector for each word:
vector = model['plant']
Or, for the pandas column in the example:
df['Vectors'] = df['Keyword'].apply(lambda x: model[x])
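Note that model[x] raises a KeyError for any keyword that is not in the pretrained vocabulary, and the GoogleNews vocabulary is case-sensitive. Below is a minimal sketch of a safer lookup. The names toy_model and lookup_vector are hypothetical, and the plain dict with made-up 3-dimensional vectors only stands in for the loaded model so the snippet runs without the download; the real KeyedVectors object supports the same "word in model" check.

```python
# Toy stand-in for the loaded KeyedVectors model (made-up 3-d vectors,
# not real word2vec values).
toy_model = {
    "plant": [0.9, 0.1, 0.0],
    "cell": [0.7, 0.3, 0.1],
    "cat": [0.1, 0.9, 0.2],
}


def lookup_vector(word, model):
    """Return the vector for `word`, or None if it is not in the vocabulary."""
    if word in model:
        return model[word]
    # The vocabulary is case-sensitive, so try the lowercase form as a fallback.
    lowered = word.lower()
    if lowered in model:
        return model[lowered]
    return None


keywords = ["plant", "cell", "cat", "Pandas"]
vectors = {word: lookup_vector(word, toy_model) for word in keywords}

# Keywords without a vector have to be dropped or handled before computing distances.
missing = [word for word, vec in vectors.items() if vec is None]
print("No vector found for:", missing)
```

With the real model, the same helper could be passed to apply, e.g. df['Keyword'].apply(lambda w: lookup_vector(w, model)), followed by dropping the rows where the result is None.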
Once this is done, you can calculate the distance between two vectors using any of several metrics, e.g. Euclidean distance:
from sklearn.metrics.pairwise import euclidean_distances
distances = euclidean_distances(list(df['Vectors']))
distances will be a square matrix with 0 on the diagonal, where entry (i, j) is the distance between word i and word j. The closer a distance is to 0, the more similar the words are.
You can swap in different models and different distance metrics, but this should work as a starting point.
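As one example of a different metric, cosine similarity is commonly used with word vectors because it compares direction rather than magnitude. The sketch below is dependency-free and uses made-up 3-dimensional vectors (toy_vectors is a hypothetical stand-in for the real 300-dimensional ones), so treat it as an illustration rather than a definitive implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for model['plant'], model['cell'], model['cat'].
toy_vectors = {
    "plant": [0.9, 0.1, 0.0],
    "cell": [0.7, 0.3, 0.1],
    "cat": [0.1, 0.9, 0.2],
}

words = list(toy_vectors)
# Pairwise similarity matrix: similarity[i][j] compares words[i] with words[j].
similarity = [
    [cosine_similarity(toy_vectors[a], toy_vectors[b]) for b in words]
    for a in words
]

for word, row in zip(words, similarity):
    print(word, [round(s, 3) for s in row])
```

Here higher values mean more similar words, the opposite of a distance. With the real data, sklearn.metrics.pairwise.cosine_similarity computes the same matrix from list(df['Vectors']), and gensim's model.similarity('plant', 'cell') returns the value for a single pair.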
The way of loading the model shown above may not work in every environment, so here is the approach that worked for me. I am using Google Colab, hence the '!' before each command.
Download the file (i.e. the model) using wget, like so:
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"
Next, use gzip to unzip the file with this command:
!gzip -d GoogleNews-vectors-negative300.bin.gz
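If you are not on Colab, or the gzip command is not available on your machine, the same decompression can be done with Python's standard library. This is a minimal sketch; the helper name gunzip is hypothetical, and unlike gzip -d it leaves the original .gz file in place.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dest=None):
    """Decompress a .gz file and return the path of the decompressed file.

    If `dest` is not given, the trailing '.gz' is stripped from `src`.
    """
    src = Path(src)
    dest = Path(dest) if dest is not None else src.with_suffix("")
    # Stream in chunks so the multi-gigabyte model never has to fit in memory.
    with gzip.open(src, "rb") as f_in, open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dest
```

For the file downloaded above, the call would be gunzip("GoogleNews-vectors-negative300.bin.gz"), which writes GoogleNews-vectors-negative300.bin next to it.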
Next, use the models module from gensim to load the downloaded file with the code below. This gives you the word vector model for further use. I am using Google Colab, so the file path will be different if you run this locally:
from gensim import models
model = models.KeyedVectors.load_word2vec_format(
'/content/GoogleNews-vectors-negative300.bin', binary=True)