Calculating the frequency of each word in the transition matrix, using numpy and pandas only
I am trying to calculate the transition frequency of each word in a transition matrix, using numpy and pandas only.
I have a list of word pairs:
star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]
I built a count matrix from this list, following this question:
chewbacca darth han leia luke obi
chewbacca 0 0 0 0 2 1
darth 0 0 0 1 0 0
han 0 0 0 0 1 0
leia 0 0 0 0 1 0
luke 0 0 0 0 0 0
obi 0 0 0 0 0 0
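For reference, one way such a count matrix could be built with numpy and pandas only is sketched below. This is an assumption about the construction, not the code from the linked question; the names `pairs`, `labels`, `codes` and `count_df` are illustrative.

```python
import numpy as np
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

pairs = np.array(star_wars)                      # shape (6, 2): source, target

# Sorted unique words plus an integer code for every word in `pairs`
labels, codes = np.unique(pairs, return_inverse=True)
codes = codes.reshape(pairs.shape)               # safe across numpy versions

# Accumulate one count per (source, target) pair; np.add.at handles repeats
counts = np.zeros((len(labels), len(labels)), dtype=int)
np.add.at(counts, (codes[:, 0], codes[:, 1]), 1)

count_df = pd.DataFrame(counts, index=labels, columns=labels)
print(count_df)
```

In this sketch the first element of each tuple is the row and the second is the column, so `('luke', 'han')` is counted in the `luke` row.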
Now I am trying to convert these word counts into probabilities, using this question:
Using a crosstab works on the initial list, but it gives me transitions between pairs only:
pd.crosstab(pd.Series(star_wars[1:]),
pd.Series(star_wars[:-1]), normalize = 1)
The output is wrong, and this approach also does not work for the matrix I created. Just an example:
col_0 (chewbacca, luke) (chewbacca, obi) (darth, leia) (luke, han)
row_0
(chewbacca, luke) 0.0 1.0 0.0 1.0
(chewbacca, obi) 0.5 0.0 0.0 0.0
(leia, luke) 0.5 0.0 0.0 0.0
(luke, han) 0.0 0.0 1.0 0.0
I also created a function:
from itertools import islice

def my_function(seq, n=2):
    # Sliding window: yield each run of n consecutive items of seq as a tuple
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        # Drop the oldest item and append the next one
        result = result[1:] + (elem,)
        yield result
Apply the function and calculate probabilities:
pairs = pd.DataFrame(my_function(star_wars), columns=['Columns', 'Rows'])
counts = pairs.groupby('Columns')['Rows'].value_counts()
probs = (counts/counts.sum()).unstack()
print(probs)
But it gives me probabilities for pairs (I am not even sure they are correct):
Rows (chewbacca, luke) (chewbacca, obi) (leia, luke) \
Columns
(chewbacca, luke) NaN 0.2 0.2
(chewbacca, obi) 0.2 NaN NaN
(darth, leia) NaN NaN NaN
(luke, han) 0.2 NaN NaN
Rows (luke, han)
Columns
(chewbacca, luke) NaN
(chewbacca, obi) NaN
(darth, leia) 0.2
(luke, han) NaN
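The 0.2 values above appear because the sliding window runs over the list of tuples, so it pairs up consecutive tuples, and `counts / counts.sum()` divides by the grand total (5) rather than by each group's total. A minimal sketch of per-row normalization, assuming each tuple is already a (source, target) pair; the column names `source` and `target` are illustrative:

```python
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

# One row per transition; no sliding window is needed
pairs = pd.DataFrame(star_wars, columns=['source', 'target'])

# normalize=True divides each count by the total of its own 'source' group
probs = (pairs.groupby('source')['target']
              .value_counts(normalize=True)
              .unstack(fill_value=0))
print(probs)
```

This only contains words that actually occur as a source (rows) or a target (columns); words missing from either axis would still have to be added with `reindex`.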
Another attempt, just using crosstab
Desired output: a matrix with probabilities, not counts.
For example:
chewbacca darth han leia luke obi
chewbacca 0 0 0 0 0.66 0.33
darth 0 0 0 1 0 0
han 0 0 0 0 1 0
leia 0 0 0 0 1 0
luke 0 0 0 0 0 0
obi 0 0 0 0 0 0
I appreciate your time and help!
We can still do it with crosstab:
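df = pd.DataFrame(star_wars)
# normalize='index' divides each row of counts by its row total
s = pd.crosstab(df[0], df[1], normalize='index')
# Add the words that are missing from the rows or the columns, filled with zeros
words = df.stack().unique()
s = s.reindex(index=words, columns=words, fill_value=0)
s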
1 darth leia luke han chewbacca obi
0
darth 0 1.0 0.000000 0.0 0 0.000000
leia 0 0.0 1.000000 0.0 0 0.000000
luke 0 0.0 0.000000 1.0 0 0.000000
han 0 0.0 0.000000 0.0 0 0.000000
chewbacca 0 0.0 0.666667 0.0 0 0.333333
obi 0 0.0 0.000000 0.0 0 0.000000
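The output above lists the words in order of first appearance. If the alphabetical layout from the question is preferred, a small variation could reindex with sorted labels instead. This is a sketch, not part of the original answer; `labels` and `probs` are illustrative names.

```python
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

df = pd.DataFrame(star_wars)

# Every word that occurs in either position, in alphabetical order
labels = sorted(set(df[0]) | set(df[1]))

probs = (pd.crosstab(df[0], df[1], normalize='index')
           .reindex(index=labels, columns=labels, fill_value=0.0))
print(probs)

# Sanity check: each row sums to 1, or to 0 for words with no outgoing pair
print(probs.sum(axis=1))
```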
To get the probabilities from the transition matrix, you only need to divide each row by its row sum. Here df is the count matrix from the question.
>>> df / df.values.sum(axis=1).reshape((-1,1))
chewbacca darth han leia luke obi
chewbacca 0.0 0.0 0.0 0.0 0.666667 0.333333
darth 0.0 0.0 0.0 1.0 0.000000 0.000000
han 0.0 0.0 0.0 0.0 1.000000 0.000000
leia 0.0 0.0 0.0 0.0 1.000000 0.000000
luke NaN NaN NaN NaN NaN NaN
obi NaN NaN NaN NaN NaN NaN
Of course, you must avoid dividing by zero in the last two rows. If a row sum is zero, every entry in that row is zero, so you can replace that row sum with any nonzero value (1 here) and the row stays all zeros.
>>> row_sums = df.values.sum(axis=1)
>>> row_sums[row_sums == 0] = 1
>>> df / row_sums.reshape((-1,1))
chewbacca darth han leia luke obi
chewbacca 0.0 0.0 0.0 0.0 0.666667 0.333333
darth 0.0 0.0 0.0 1.0 0.000000 0.000000
han 0.0 0.0 0.0 0.0 1.000000 0.000000
leia 0.0 0.0 0.0 0.0 1.000000 0.000000
luke 0.0 0.0 0.0 0.0 0.000000 0.000000
obi 0.0 0.0 0.0 0.0 0.000000 0.000000
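The same row normalization can also be written without modifying the row sums. The sketch below shows two alternatives that should give the same result: `DataFrame.div` followed by `fillna`, and `np.divide` with its `where` and `out` arguments. The `counts` frame is copied from the matrix in the question; the other names are illustrative.

```python
import numpy as np
import pandas as pd

labels = ['chewbacca', 'darth', 'han', 'leia', 'luke', 'obi']
counts = pd.DataFrame(
    [[0, 0, 0, 0, 2, 1],
     [0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]],
    index=labels, columns=labels)

# pandas: divide each row by its sum; the 0/0 rows become NaN, then 0
probs_pd = counts.div(counts.sum(axis=1), axis=0).fillna(0.0)

# numpy: only divide where the row sum is nonzero, leave zeros elsewhere
values = counts.to_numpy(dtype=float)
sums = values.sum(axis=1, keepdims=True)
probs_np = np.divide(values, sums, out=np.zeros_like(values), where=sums != 0)

print(probs_pd)
```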