简体   繁体   中英

How to construct PPMI matrix from a text corpus?

I am trying to use an SVD model for word embedding on the Brown corpus. For this, I want to first generate a word-word co-occurence matrix and then convert to PPMI matrix for the SVD matrix multiplication process.

I have tried to create a co-occurence using SkLearn CountVectorizer

count_model = CountVectorizer(ngram_range=(1,1))

X = count_model.fit_transform(corpus)
X[X > 0] = 1
Xc = (X.T * X)
Xc.setdiag(0)
print(Xc.todense())

But:

(1) Am not sure how I can control the context window with this method? I want to experiment with various context sizes and see how the impact the process.

(2) How do I then compute the PPMI properly assuming that PMI(a, b) = log p(a, b)/p(a)p(b)

Any help on the thought process and implementation would be greatly appreciated!

Thanks (-:

I tried to play with the provided code, but I couldn't apply the moving window to it. So, I did my own function that does so. This function takes a list of sentences and returns a pandas.DataFrame object representing the co-occurrence matrix and a window_size number:

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over sentences
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                key = tuple( sorted([t, token]) )
                d[key] += 1

    # formulate the dictionary into dataframe
    vocab = sorted(vocab) # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df

Let's try it out given the following two simple sentences:

>>> text = ["I go to school every day by bus .",
            "i go to theatre every night by bus"]
>>> 
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]

Now, we have our co-occurrence matrix. Let's find the (Positive) Point-wise Mutual Information or PPMI for short. I used the code provided professor Christopher Potts from Stanford found in this slides that can be summarized in the following image

采购经理人指数

The PPMI is the same as the following pmi with positive=True :

def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # log(0) = 0
    if positive:
        df[df < 0] = 0.0
    return df

Let's try it out:

>>> ppmi = pmi(df, positive=True)
>>> ppmi
                .       bus        by  ...    school   theatre        to
.        0.000000  1.722767  1.386294  ...  0.000000  0.000000  0.000000
bus      1.722767  0.000000  1.163151  ...  0.000000  0.000000  0.000000
by       1.386294  1.163151  0.000000  ...  0.000000  0.000000  0.000000
day      0.000000  1.029619  0.693147  ...  1.252763  0.000000  0.000000
every    0.000000  0.000000  0.693147  ...  0.559616  0.559616  0.559616
go       0.000000  0.000000  0.000000  ...  0.847298  0.847298  0.847298
i        0.000000  0.000000  0.000000  ...  0.000000  0.000000  1.252763
night    0.000000  1.029619  0.693147  ...  0.000000  1.252763  0.000000
school   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
theatre  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
to       0.000000  0.000000  0.000000  ...  0.559616  0.559616  0.000000

[11 rows x 11 columns]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM