Calculate the frequency of each word in the transition matrix, using numpy and pandas only

Question

I am trying to calculate the frequency of each word in the transition matrix, using numpy and pandas only.

I have a list of word pairs:

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'), 
         ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

I built a count matrix from this list, using this question.

             chewbacca  darth  han  leia  luke  obi
chewbacca          0      0    0     0     2    1
darth              0      0    0     1     0    0
han                0      0    0     0     1    0
leia               0      0    0     0     1    0
luke               0      0    0     0     0    0
obi                0      0    0     0     0    0
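
A minimal sketch of one way to build a count matrix like this from the pair list, assuming each tuple is read as (row word, column word). The column names `src` and `dst` and the variable `words` are made up for illustration. Under that reading, the pair ('luke', 'han') is counted in the luke row.

```python
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]

# Split the pairs into a "from" column and a "to" column (hypothetical names).
pairs = pd.DataFrame(star_wars, columns=['src', 'dst'])

# Every word that occurs anywhere, sorted so rows and columns share one order.
words = sorted(set(pairs['src']) | set(pairs['dst']))

# Count each (src, dst) pair, then pad the missing rows and columns with zeros.
counts = (pd.crosstab(pairs['src'], pairs['dst'])
            .reindex(index=words, columns=words, fill_value=0))
print(counts)
```

Reindexing both axes with the same word list keeps the matrix square, which is what a later row-wise normalization expects.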

Now I am trying to convert these word counts into probabilities, using this question:

Using crosstab works on the initial data, but it only gives me pairs:

pd.crosstab(pd.Series(star_wars[1:]),
        pd.Series(star_wars[:-1]), normalize = 1)

The output is wrong, and this approach does not work for the matrix I created either. Just an example:

col_0              (chewbacca, luke)  (chewbacca, obi)  (darth, leia)  (luke, han)
row_0
(chewbacca, luke)                0.0               1.0            0.0          1.0
(chewbacca, obi)                 0.5               0.0            0.0          0.0
(leia, luke)                     0.5               0.0            0.0          0.0
(luke, han)                      0.0               0.0            1.0          0.0

I also created a function:

from itertools import islice

def my_function(seq, n = 2):
    # Yield overlapping windows of n consecutive items from seq
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        # Slide the window forward by one item
        result = result[1:] + (elem,)
        yield result

Apply the function and calculate probabilities:

pairs = pd.DataFrame(my_function(star_wars), columns=['Columns', 'Rows'])
counts = pairs.groupby('Columns')['Rows'].value_counts()
probs = (counts/counts.sum()).unstack()

print(probs)

But it gives me a calculation over pairs (I am not even sure it is correct):

Rows               (chewbacca, luke)  (chewbacca, obi)  (leia, luke)  \
Columns                                                                
(chewbacca, luke)                NaN               0.2           0.2   
(chewbacca, obi)                 0.2               NaN           NaN   
(darth, leia)                    NaN               NaN           NaN   
(luke, han)                      0.2               NaN           NaN   

Rows               (luke, han)  
Columns                         
(chewbacca, luke)          NaN  
(chewbacca, obi)           NaN  
(darth, leia)              0.2  
(luke, han)                NaN

Another attempt, just using crosstab

Desired output: a matrix with probabilities, not counts.

For example:

            chewbacca  darth  han  leia  luke  obi
chewbacca          0      0    0     0   0.66 0.33
darth              0      0    0     1     0    0
han                0      0    0     0     1    0
leia               0      0    0     0     1    0
luke               0      0    0     0     0    0
obi                0      0    0     0     0    0

Appreciate your time and help!

Answer 1

We can still do it with crosstab:

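df = pd.DataFrame(star_wars)
# Row-normalized counts: each row of the crosstab sums to 1
s = pd.crosstab(df[0], df[1], normalize='index')
# Add the words missing from either axis, filled with 0
words = df.stack().unique()
s = s.reindex(index=words, fill_value=0).reindex(columns=words, fill_value=0)
s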
1          darth  leia      luke  han  chewbacca       obi
0                                                         
darth          0   1.0  0.000000  0.0          0  0.000000
leia           0   0.0  1.000000  0.0          0  0.000000
luke           0   0.0  0.000000  1.0          0  0.000000
han            0   0.0  0.000000  0.0          0  0.000000
chewbacca      0   0.0  0.666667  0.0          0  0.333333
obi            0   0.0  0.000000  0.0          0  0.000000
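
As a hedged illustration of what `normalize='index'` does here, the sketch below compares it with dividing the plain crosstab counts by their row sums. The names `counts`, `by_hand` and `normalized` are made up for this example.

```python
import numpy as np
import pandas as pd

star_wars = [('darth', 'leia'), ('luke', 'han'), ('chewbacca', 'luke'),
             ('chewbacca', 'obi'), ('chewbacca', 'luke'), ('leia', 'luke')]
df = pd.DataFrame(star_wars)

# Plain counts of (first word, second word) pairs.
counts = pd.crosstab(df[0], df[1])

# Divide every row by its own total; axis=0 aligns the sums with the row index.
by_hand = counts.div(counts.sum(axis=1), axis=0)

# The same normalization done by crosstab itself.
normalized = pd.crosstab(df[0], df[1], normalize='index')

print(np.allclose(by_hand.to_numpy(), normalized.to_numpy()))
```

Every row of the plain crosstab contains at least one pair, so no division by zero occurs at this stage. The all-zero rows only appear after the reindex step.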

Answer 2

To get the probabilities from the transition matrix, you only need to divide each row by its row sum.

>>> df / df.values.sum(axis=1).reshape((-1,1))
           chewbacca  darth  han  leia      luke       obi
chewbacca        0.0    0.0  0.0   0.0  0.666667  0.333333
darth            0.0    0.0  0.0   1.0  0.000000  0.000000
han              0.0    0.0  0.0   0.0  1.000000  0.000000
leia             0.0    0.0  0.0   0.0  1.000000  0.000000
luke             NaN    NaN  NaN   NaN       NaN       NaN
obi              NaN    NaN  NaN   NaN       NaN       NaN

Of course, you should make sure not to divide by zero in the last two rows. If a row sum is zero, then all entries of that row are zero, so you can replace that row sum with any nonzero value.

>>> row_sums = df.values.sum(axis=1)
>>> row_sums[row_sums == 0] = 1
>>> df / row_sums.reshape((-1,1))
           chewbacca  darth  han  leia      luke       obi
chewbacca        0.0    0.0  0.0   0.0  0.666667  0.333333
darth            0.0    0.0  0.0   1.0  0.000000  0.000000
han              0.0    0.0  0.0   0.0  1.000000  0.000000
leia             0.0    0.0  0.0   0.0  1.000000  0.000000
luke             0.0    0.0  0.0   0.0  0.000000  0.000000
obi              0.0    0.0  0.0   0.0  0.000000  0.000000
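
The same zero-safe division can be sketched without mutating the row sums in place. This is a minimal example, assuming `df` holds the count matrix copied from the question. The names `safe_sums`, `probs` and `probs_np` are made up for illustration.

```python
import numpy as np
import pandas as pd

words = ['chewbacca', 'darth', 'han', 'leia', 'luke', 'obi']
# Count matrix copied from the question.
df = pd.DataFrame(
    [[0, 0, 0, 0, 2, 1],
     [0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]],
    index=words, columns=words)

# pandas version: swap zero sums for 1 so all-zero rows stay zero, not NaN.
row_sums = df.sum(axis=1)
safe_sums = row_sums.where(row_sums != 0, 1)
probs = df.div(safe_sums, axis=0)

# numpy version: divide only where the row sum is nonzero, leave zeros elsewhere.
arr = df.to_numpy(dtype=float)
sums = arr.sum(axis=1, keepdims=True)
probs_np = np.divide(arr, sums, out=np.zeros_like(arr), where=sums != 0)

print(probs)
```

Both variants leave the original counts untouched, and `probs` keeps the word labels on its rows and columns.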

Answer 1
1 accepted 2020-07-17 23:20:16

Answer 2
1 2020-07-17 23:22:30