
Correlation between variables using numpy

I need to calculate the correlation between the presence of uppercase letters, special punctuation, and specific words in texts labelled as fake or not fake.

For example:

Text      Label        Uppercase       Special Punctuation    Specific Word
text1       1                1                       0                   1
text2       0                0                       0                   0
text3       1                1                       1                   1
text4       1                1                       1                   1
text5       0                0                       0                   1

Uppercase, Special Punctuation and Specific Word can each take only one of two values: 1 or 0. I would like to determine how these features correlate with the label (fake=1/not fake=0). I thought of using Pearson correlation as follows:

import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

May I ask whether this is the right function to use, or whether there are other correlation functions in Python that calculate the correlation between binary variables?

.corr() should work if you have numeric values.
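Since the title mentions numpy, here is a minimal sketch of the same computation with np.corrcoef on the example table from the question. For two 0/1 variables, the Pearson coefficient coincides with the phi coefficient of the 2x2 contingency table, which the sketch checks. The array names and the helper phi are made up for this illustration.

```python
import numpy as np

# Columns copied from the example table in the question
label       = np.array([1, 0, 1, 1, 0])
uppercase   = np.array([1, 0, 1, 1, 0])
punctuation = np.array([0, 0, 1, 1, 0])
word        = np.array([1, 0, 1, 1, 1])

# np.corrcoef treats each row as one variable and each column as one observation
corr = np.corrcoef([label, uppercase, punctuation, word])


def phi(x, y):
    """Phi coefficient from the 2x2 contingency counts (illustrative helper)."""
    n11 = np.sum((x == 1) & (y == 1))
    n10 = np.sum((x == 1) & (y == 0))
    n01 = np.sum((x == 0) & (y == 1))
    n00 = np.sum((x == 0) & (y == 0))
    denom = np.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom


print(corr[0])                   # Pearson correlation of Label with every column
print(phi(label, punctuation))   # matches corr[0, 2]
```

So plain Pearson is a reasonable choice for binary columns: no special "binary" correlation function is needed to get the phi coefficient.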

If your variables are strings, just convert them to integers and then compute the correlation. This should work:

df[['Uppercase','Special Punctuation', 'Specific Word']].astype(int).corr()
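To make this concrete, here is a small sketch that rebuilds the example table with string values, converts it, and also includes the Label column so that each feature's correlation with the label can be read off. The variable names features, numeric and label_corr are assumptions for the illustration, not part of the question.

```python
import pandas as pd

# Example table from the question, stored as strings on purpose
df = pd.DataFrame({
    "Text": ["text1", "text2", "text3", "text4", "text5"],
    "Label": ["1", "0", "1", "1", "0"],
    "Uppercase": ["1", "0", "1", "1", "0"],
    "Special Punctuation": ["0", "0", "1", "1", "0"],
    "Specific Word": ["1", "0", "1", "1", "1"],
})

features = ["Uppercase", "Special Punctuation", "Specific Word"]

# Convert only the 0/1 columns; the free-text "Text" column is left out
numeric = df[["Label"] + features].astype(int)

# Full Pearson correlation matrix, then the column for Label without itself
label_corr = numeric.corr()["Label"].drop("Label")
print(label_corr)
```

Leaving the Text column out of the selection avoids asking pandas to correlate a non-numeric column.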

The function is correct, but I don't understand why you're taking absolute values. The sign of the correlation tells you the direction of the association. I'm not familiar with your context, so I'll just flag this without going any further.
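A short sketch of what .abs() hides. The "No Uppercase" column is a hypothetical feature invented for this illustration (it is not in the question); it is simply the inverse of Uppercase, so it is perfectly negatively correlated with the label in this toy data.

```python
import pandas as pd

df = pd.DataFrame({
    "Label": [1, 0, 1, 1, 0],
    "Uppercase": [1, 0, 1, 1, 0],
})

# Hypothetical feature, not in the question: 1 when the text has no uppercase
df["No Uppercase"] = 1 - df["Uppercase"]

signed = df.corr()["Label"]          # keeps the direction of the association
unsigned = df.corr().abs()["Label"]  # direction is lost

print(signed)
print(unsigned)
```

After .abs(), both features look identical, although one points towards fake and the other towards not fake.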

Correlation can be calculated in subtly different ways, namely 'pearson', 'kendall' and 'spearman'. The default method is 'pearson'. You can use one of the other methods by specifying the 'method' argument, like this:

corr_matrix = df.corr(method = 'kendall')
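As a rough sketch of how the method argument behaves on the question's data, the snippet below compares 'pearson' with 'spearman'. For columns that only contain 0 and 1, the ranks are a linear rescaling of the values, so the two matrices should agree. It sticks to these two methods because, as far as I know, 'kendall' additionally needs SciPy to be installed; it is selected in the same way.

```python
import pandas as pd

# Numeric columns of the example table from the question
df = pd.DataFrame({
    "Label": [1, 0, 1, 1, 0],
    "Uppercase": [1, 0, 1, 1, 0],
    "Special Punctuation": [0, 0, 1, 1, 0],
    "Specific Word": [1, 0, 1, 1, 1],
})

pearson = df.corr(method="pearson")    # the default
spearman = df.corr(method="spearman")  # rank-based

# Largest absolute difference between the two matrices
max_diff = (pearson - spearman).abs().max().max()
print(max_diff)
```

For binary features the choice of method therefore matters much less than it does for continuous or ordinal data.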

More information can be found in the documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
