Creating a frequency matrix for bigrams from a list of tuples using numpy or pandas

Question

I am new to Python. I have a list of tuples in which I have created bigrams.

This question is very close to what I need.

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

Now I am trying to convert it into a frequency matrix.

The desired output is:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0

How can this be done using numpy or pandas? Unfortunately, all I can find are nltk-based approaches.

Answer 1

You can create a frequency DataFrame and index into it by word:

import pandas as pd

# Sorted vocabulary of every word that appears in any bigram
words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
for first, second in my_list:
    df.at[first, second] += 1

Output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   0     0      0
of               0   0    0   0    0   0     0      0
the              0   0    0   0    0   0     0      0
to               0   0    0   0    0   0     0      0
use              0   0    1   0    0   0     0      0
we               1   0    0   0    0   0     0      0
what             0   0    0   1    0   0     0      0
words            0   1    0   0    0   0     0      0
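Since the question also asks about numpy, here is a minimal sketch of the same ordered count without a Python-level loop. It assumes every word is a plain string; the names `pairs`, `codes`, `matrix` and `df_np` are illustrative and not part of the original answer. `np.unique` with `return_inverse=True` maps each word to its position in the sorted vocabulary, and `np.add.at` accumulates the counts so that repeated bigrams are tallied correctly.

```python
import numpy as np
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# (n, 2) array: column 0 holds the first word, column 1 the second word.
pairs = np.array(my_list)

# Sorted vocabulary, plus the vocabulary position of every word in `pairs`.
words, codes = np.unique(pairs, return_inverse=True)
# The inverse may come back flat or already shaped like the input,
# depending on the NumPy version, so reshape explicitly.
codes = codes.reshape(pairs.shape)

# Unbuffered in-place add: a bigram that occurs k times contributes k.
matrix = np.zeros((len(words), len(words)), dtype=int)
np.add.at(matrix, (codes[:, 0], codes[:, 1]), 1)

df_np = pd.DataFrame(matrix, index=words, columns=words)
print(df_np)
```

A plain `matrix[rows, cols] += 1` would count a repeated bigram only once, which is why the sketch uses `np.add.at`.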

Note that here the order within each bigram matters. If you do not care about order, first sort the contents of each tuple:

my_list = [tuple(sorted(i)) for i in my_list]
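To make the effect of sorting concrete, here is a small sketch on a hypothetical sample (not the data from the question) that contains the same pair in both orders. It only tallies the tuples with `collections.Counter`; the names `sample`, `ordered` and `unordered` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: the pair (we, consider) appears in both orders.
sample = [('we', 'consider'), ('consider', 'we'), ('use', 'the')]

# Order-sensitive: the two orientations are counted separately.
ordered = Counter(sample)

# Order-insensitive: sorting each tuple maps both orientations to one key.
unordered = Counter(tuple(sorted(pair)) for pair in sample)

print(ordered)
print(unordered)
```

After sorting, every pair is stored with its alphabetically smaller word first, so the resulting matrix only has entries on or above the diagonal when rows and columns share the same sorted vocabulary.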

Another approach is to count with Counter, though I expect it to perform similarly (again, if the order within each bigram matters, remove sorted from frequency_list):

from collections import Counter
import pandas as pd

frequency_list = Counter(tuple(sorted(i)) for i in my_list)
words = sorted({item for t in my_list for item in t})
df = pd.DataFrame(0, columns=words, index=words)
for (first, second), count in frequency_list.items():
    df.at[first, second] = count

Output:

          consider  of  the  to  use  we  what  words
consider         0   0    0   0    0   1     0      0
of               0   0    0   0    0   0     0      1
the              0   0    0   0    1   0     0      0
to               0   0    0   0    0   0     1      0
use              0   0    0   0    0   0     0      0
we               0   0    0   0    0   0     0      0
what             0   0    0   0    0   0     0      0
words            0   0    0   0    0   0     0      0

Answer 2

If you are not too concerned about speed, you can use a for loop.

import pandas as pd
import numpy as np
from itertools import product

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

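# Row labels are the first words, column labels are the second words
index = pd.DataFrame(my_list)[0].unique()
columns = pd.DataFrame(my_list)[1].unique()
df = pd.DataFrame(np.zeros(shape=(len(index), len(columns))),
                  columns=columns, index=index, dtype=int)

for idx, col in product(index, columns):
    # .loc[row, col] avoids chained assignment, which may not write through
    df.loc[idx, col] = my_list.count((idx, col))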

print(df)

Output:

       consider  to  the  of
we            1   0    0   0
what          0   1    0   0
use           0   0    1   0
words         0   0    0   1
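If speed does become a concern, one loop-free alternative is `pd.crosstab`, which tabulates two columns against each other. The sketch below is an assumption about how it could be applied here, not part of the original answer; the column names `first` and `second` and the variables `counts` and `square` are made up for illustration. The optional `reindex` step expands the result to the square layout requested in the question.

```python
import pandas as pd

my_list = [('we', 'consider'), ('what', 'to'), ('use', 'the'), ('words', 'of')]

# One row per bigram; the column names are illustrative.
pairs = pd.DataFrame(my_list, columns=['first', 'second'])

# Rows: distinct first words, columns: distinct second words (both sorted).
counts = pd.crosstab(pairs['first'], pairs['second'])

# Optional: expand to a square matrix over the full vocabulary,
# filling combinations that never occur with 0.
words = sorted(set(pairs['first']) | set(pairs['second']))
square = counts.reindex(index=words, columns=words, fill_value=0)

print(square)
```

Unlike the loop above, which calls `my_list.count` once per cell, this tabulates all bigrams in a single call.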

Answer 1
1 Accepted 2020-07-17 05:40:27

Answer 2
1 2020-07-17 06:15:54