How to construct a PPMI matrix from a text corpus?

Question

I am trying to use an SVD model for word embedding on the Brown corpus. For this, I want to first generate a word-word co-occurrence matrix and then convert it to a PPMI matrix for the SVD matrix multiplication process.

I have tried to create a co-occurrence matrix using scikit-learn's CountVectorizer:

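from sklearn.feature_extraction.text import CountVectorizer

# corpus: a list of documents (strings)
count_model = CountVectorizer(ngram_range=(1, 1))  # unigrams only

X = count_model.fit_transform(corpus)  # document-term count matrix
X[X > 0] = 1                           # binarize: presence within a document
Xc = X.T @ X                           # term-term co-occurrence across documents
Xc.setdiag(0)                          # ignore self co-occurrence
print(Xc.todense())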

But:

(1) I am not sure how I can control the context window with this method. I want to experiment with various context sizes and see how they impact the process.

(2) How do I then compute the PPMI properly, assuming that PMI(a, b) = log( p(a, b) / (p(a) p(b)) )?

Any help on the thought process and implementation would be greatly appreciated!

Thanks (-:

Answer 1

I tried to play with the provided code, but I couldn't apply a moving window to it. So I wrote my own function that does so. This function takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:

from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

Let's try it out given the following two simple sentences:

>>> text = ["I go to school every day by bus .",
            "i go to theatre every night by bus"]
>>> 
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
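
Since the question is about trying several context sizes on a larger corpus, a dense DataFrame may become too large to hold in memory. The following is only a sketch of a sparse variant under my own assumptions: `co_occurrence_sparse` is a hypothetical name, it keeps the same whitespace tokenization and forward window as the function above, and it returns a SciPy CSR matrix together with the sorted vocabulary.

```python
from collections import Counter

from scipy.sparse import coo_matrix


def co_occurrence_sparse(sentences, window_size):
    """Symmetric word-word counts as a sparse matrix (hypothetical helper)."""
    tokenized = [s.lower().split() for s in sentences]  # naive tokenization
    vocab = sorted({tok for sent in tokenized for tok in sent})
    index = {tok: i for i, tok in enumerate(vocab)}

    counts = Counter()
    for sent in tokenized:
        ids = [index[tok] for tok in sent]
        for i, a in enumerate(ids):
            # look at the next `window_size` tokens only; storing both
            # (a, b) and (b, a) makes the window effectively symmetric
            for b in ids[i + 1 : i + 1 + window_size]:
                counts[(a, b)] += 1
                if a != b:
                    counts[(b, a)] += 1

    rows = [a for a, _ in counts]
    cols = [b for _, b in counts]
    data = list(counts.values())
    n = len(vocab)
    return vocab, coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()


text = ["I go to school every day by bus .",
        "i go to theatre every night by bus"]

# compare how dense the matrix gets for a few window sizes
for w in (1, 2, 5):
    vocab, X = co_occurrence_sparse(text, w)
    print(w, X.nnz, X.sum())
```

With `window_size=2` this should give the same counts as the DataFrame above; only the storage format differs.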

Now we have our co-occurrence matrix. Let's find the (Positive) Pointwise Mutual Information, or PPMI for short. I used the code provided by Professor Christopher Potts from Stanford, found in his slides.

PPMI is the same as the following pmi function with positive=True:

import numpy as np

def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    # expected count of each pair if rows and columns were independent
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # zero counts give log(0) = -inf; set those cells to 0
    if positive:
        df[df < 0] = 0.0  # PPMI: clip negative PMI values to 0
    return df
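
To connect this with the formula in the question, here is how I read what pmi computes. The symbols are my own notation, not taken from the slides: X is the co-occurrence matrix, r_i and c_j are its row and column sums, and N is the sum of all entries.

```latex
\begin{aligned}
E_{ij} &= \frac{r_i \, c_j}{N} \\
\mathrm{PMI}_{ij} &= \log\frac{X_{ij}}{E_{ij}}
  = \log\frac{X_{ij}/N}{(r_i/N)\,(c_j/N)}
  = \log\frac{p(i,j)}{p(i)\,p(j)} \\
\mathrm{PPMI}_{ij} &= \max\bigl(0,\ \mathrm{PMI}_{ij}\bigr)
\end{aligned}
```

For the toy matrix above, the pair (".", "bus") has X = 1, r = 2, c = 5 and N = 56, so PMI = log(56 / 10) = log(5.6) ≈ 1.7228. Cells with a zero count are set to 0 by the code rather than to minus infinity.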

Let's try it out:

>>> ppmi = pmi(df, positive=True)
>>> ppmi
                .       bus        by  ...    school   theatre        to
.        0.000000  1.722767  1.386294  ...  0.000000  0.000000  0.000000
bus      1.722767  0.000000  1.163151  ...  0.000000  0.000000  0.000000
by       1.386294  1.163151  0.000000  ...  0.000000  0.000000  0.000000
day      0.000000  1.029619  0.693147  ...  1.252763  0.000000  0.000000
every    0.000000  0.000000  0.693147  ...  0.559616  0.559616  0.559616
go       0.000000  0.000000  0.000000  ...  0.847298  0.847298  0.847298
i        0.000000  0.000000  0.000000  ...  0.000000  0.000000  1.252763
night    0.000000  1.029619  0.693147  ...  0.000000  1.252763  0.000000
school   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
theatre  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
to       0.000000  0.000000  0.000000  ...  0.559616  0.559616  0.000000

[11 rows x 11 columns]
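
The question's end goal is SVD-based embeddings, so here is a minimal sketch of that last step. Everything in it is an assumption on my side: `svd_embeddings` is a hypothetical helper, the toy matrix uses made-up values standing in for a real PPMI DataFrame, and scaling the left singular vectors by the singular values is just one common choice.

```python
import numpy as np
import pandas as pd


def svd_embeddings(ppmi, k):
    """Return k-dimensional word vectors from a PPMI DataFrame (hypothetical helper)."""
    # thin SVD: ppmi = U @ diag(S) @ Vt, singular values in descending order
    U, S, Vt = np.linalg.svd(ppmi.to_numpy(dtype=float), full_matrices=False)
    # keep the top-k components; scaling by S is an assumption, U[:, :k] alone also works
    vectors = U[:, :k] * S[:k]
    return pd.DataFrame(vectors, index=ppmi.index)


# toy stand-in for a PPMI matrix (values are made up for illustration)
words = ["bus", "by", "day"]
toy = pd.DataFrame([[0.0, 1.2, 1.0],
                    [1.2, 0.0, 0.7],
                    [1.0, 0.7, 0.0]], index=words, columns=words)

emb = svd_embeddings(toy, k=2)
print(emb)
```

With the real data you would call something like svd_embeddings(ppmi, k=100); each row of the result is the vector for the word in its index.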

Answer 1
3 2019-11-06 08:15:55