I am trying to build an SVD-based word-embedding model on the Brown corpus. For this, I want to first generate a word-word co-occurrence matrix and then convert it to a PPMI matrix for the SVD step.
I have tried to create a co-occurrence matrix using scikit-learn's CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(corpus)
X[X > 0] = 1      # binarize: count each word at most once per document
Xc = (X.T * X)    # word-word co-occurrence counts (document-level context)
Xc.setdiag(0)     # zero out self co-occurrences
print(Xc.todense())
But:
(1) I am not sure how I can control the context window with this method. I want to experiment with various context sizes and see how they impact the results.
(2) How do I then compute the PPMI properly, given that PMI(a, b) = log p(a, b) / (p(a) p(b))?
Any help on the thought process and implementation would be greatly appreciated!
Thanks (-:
I tried to play with the provided code, but I couldn't apply the moving window to it, so I wrote my own function that does. It takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:
from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use a proper tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # tokens within `window_size` positions to the right
            next_token = text[i + 1 : i + 1 + window_size]
            for t in next_token:
                key = tuple(sorted([t, token]))
                d[key] += 1

    # formulate the dictionary into a dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
Let's try it out given the following two simple sentences:
>>> text = ["I go to school every day by bus .",
"i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text, 2)
>>> df
. bus by day every go i night school theatre to
. 0 1 1 0 0 0 0 0 0 0 0
bus 1 0 2 1 0 0 0 1 0 0 0
by 1 2 0 1 2 0 0 1 0 0 0
day 0 1 1 0 1 0 0 0 1 0 0
every 0 0 2 1 0 0 0 1 1 1 2
go 0 0 0 0 0 0 2 0 1 1 2
i 0 0 0 0 0 2 0 0 0 0 2
night 0 1 1 0 1 0 0 0 0 1 0
school 0 0 0 1 1 1 0 0 0 0 1
theatre 0 0 0 0 1 1 0 1 0 0 1
to 0 0 0 0 2 2 2 0 1 1 0
[11 rows x 11 columns]
Now, we have our co-occurrence matrix. Let's compute the (Positive) Pointwise Mutual Information, or PPMI for short. I used code provided by Professor Christopher Potts of Stanford, found in his slides.
The PPMI is the same as the following pmi function with positive=True:
def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # log(0) = 0
    if positive:
        df[df < 0] = 0.0
    return df
Let's try it out:
>>> ppmi = pmi(df, positive=True)
>>> ppmi
. bus by ... school theatre to
. 0.000000 1.722767 1.386294 ... 0.000000 0.000000 0.000000
bus 1.722767 0.000000 1.163151 ... 0.000000 0.000000 0.000000
by 1.386294 1.163151 0.000000 ... 0.000000 0.000000 0.000000
day 0.000000 1.029619 0.693147 ... 1.252763 0.000000 0.000000
every 0.000000 0.000000 0.693147 ... 0.559616 0.559616 0.559616
go 0.000000 0.000000 0.000000 ... 0.847298 0.847298 0.847298
i 0.000000 0.000000 0.000000 ... 0.000000 0.000000 1.252763
night 0.000000 1.029619 0.693147 ... 0.000000 1.252763 0.000000
school 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.559616
theatre 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.559616
to 0.000000 0.000000 0.000000 ... 0.559616 0.559616 0.000000
[11 rows x 11 columns]
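To close the loop on the SVD part of the question, here is a minimal sketch of one common way to turn the PPMI matrix into dense embeddings: factor it with SVD and keep the top-k singular dimensions. The function name `svd_embeddings`, the parameter `k`, and the choice of scaling U by the singular values are illustrative assumptions, not part of the code above (some implementations use sqrt(S) or drop S entirely).

```python
import numpy as np
import pandas as pd

def svd_embeddings(ppmi_df, k=2):
    """Return a (vocab_size x k) DataFrame of word vectors from a PPMI matrix."""
    # full_matrices=False keeps U at shape (n, n); we then slice to k dims
    U, S, Vt = np.linalg.svd(ppmi_df.values, full_matrices=False)
    # scale the top-k left singular vectors by their singular values
    vectors = U[:, :k] * S[:k]
    return pd.DataFrame(vectors, index=ppmi_df.index)

# hypothetical usage with the `ppmi` DataFrame computed above:
# embeddings = svd_embeddings(ppmi, k=5)
```

Each row of the result is the embedding for one vocabulary word, so nearest-neighbor queries can be done with cosine similarity over the rows.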