How to construct PPMI matrix from a text corpus?
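I am trying to build word embeddings on the Brown corpus with an SVD model. To do that, I first want to generate a word-word co-occurrence matrix and then convert it to a PPMI matrix for the SVD matrix multiplication step.
I tried to create the co-occurrence matrix with scikit-learn's CountVectorizer: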
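from sklearn.feature_extraction.text import CountVectorizer

count_model = CountVectorizer(ngram_range=(1,1))  # unigram counts
X = count_model.fit_transform(corpus)  # corpus: iterable of document strings
X[X > 0] = 1  # binarize: did the word occur in the document?
Xc = (X.T * X)  # document-level co-occurrence counts
Xc.setdiag(0)  # ignore a word co-occurring with itself
print(Xc.todense())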
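But:
(1) I am not sure how to control the context window with this approach. I want to try various context sizes and see how they affect the process.
(2) Assuming PMI(a, b) = log( p(a, b) / (p(a) p(b)) ), how do I compute PPMI correctly?
Any help with the thought process and the implementation would be greatly appreciated!
Thanks (-: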
I tried the code provided, but I couldn't apply a moving window to it, so I wrote my own function that does. This function takes a list of sentences and a window_size number, and returns a pandas.DataFrame object representing the co-occurrence matrix:
from collections import defaultdict

import numpy as np
import pandas as pd

def co_occurrence(sentences, window_size):
    d = defaultdict(int)
    vocab = set()
    for text in sentences:
        # preprocessing (use tokenizer instead)
        text = text.lower().split()
        # iterate over the tokens of the sentence
        for i in range(len(text)):
            token = text[i]
            vocab.add(token)  # add to vocab
            # the next `window_size` tokens to the right
            next_token = text[i+1 : i+1+window_size]
            for t in next_token:
                # sorted key, so (a, b) and (b, a) share one count
                key = tuple( sorted([t, token]) )
                d[key] += 1
    # formulate the dictionary into dataframe
    vocab = sorted(vocab)  # sort vocab
    df = pd.DataFrame(data=np.zeros((len(vocab), len(vocab)), dtype=np.int16),
                      index=vocab,
                      columns=vocab)
    for key, value in d.items():
        df.at[key[0], key[1]] = value
        df.at[key[1], key[0]] = value
    return df
Let's try it out on the following two simple sentences:
>>> text = ["I go to school every day by bus .",
...         "i go to theatre every night by bus"]
>>>
>>> df = co_occurrence(text, 2)
>>> df
         .  bus  by  day  every  go  i  night  school  theatre  to
.        0    1   1    0      0   0  0      0       0        0   0
bus      1    0   2    1      0   0  0      1       0        0   0
by       1    2   0    1      2   0  0      1       0        0   0
day      0    1   1    0      1   0  0      0       1        0   0
every    0    0   2    1      0   0  0      1       1        1   2
go       0    0   0    0      0   0  2      0       1        1   2
i        0    0   0    0      0   2  0      0       0        0   2
night    0    1   1    0      1   0  0      0       0        1   0
school   0    0   0    1      1   1  0      0       0        0   1
theatre  0    0   0    0      1   1  0      1       0        0   1
to       0    0   0    0      2   2  2      0       1        1   0

[11 rows x 11 columns]
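The question also asked how the context size changes things. Here is a minimal sketch for experimenting with that, assuming the same lowercase/whitespace tokenization as above. `pair_counts` is a hypothetical helper I am introducing only for this illustration; it keeps just the counting step of `co_occurrence` and skips the DataFrame.

```python
from collections import Counter

def pair_counts(sentences, window_size):
    """Count unordered token pairs that lie within `window_size` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            # pair the token with up to `window_size` tokens to its right
            for neighbour in tokens[i + 1 : i + 1 + window_size]:
                counts[tuple(sorted((token, neighbour)))] += 1
    return counts

text = ["I go to school every day by bus .",
        "i go to theatre every night by bus"]

for window_size in (1, 2, 3):
    counts = pair_counts(text, window_size)
    # window size, total number of counted pairs, count for ("i", "school")
    print(window_size, sum(counts.values()), counts[("i", "school")])
# 1 15 0
# 2 28 0
# 3 39 1
```

"i" and "school" are three tokens apart, so they only start to co-occur once the window reaches 3. A larger window gives a denser matrix and picks up looser, more topical associations.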
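Now we have the co-occurrence matrix. Let's find the (positive) pointwise mutual information, or PPMI for short. I used the code provided by Stanford professor Christopher Potts in this slide deck, which can be summarized in the following image.
PPMI is the same as pmi with positive=True: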
import numpy as np

def pmi(df, positive=True):
    col_totals = df.sum(axis=0)
    total = col_totals.sum()
    row_totals = df.sum(axis=1)
    # count expected in each cell if rows and columns were independent
    expected = np.outer(row_totals, col_totals) / total
    df = df / expected
    # Silence distracting warnings about log(0):
    with np.errstate(divide='ignore'):
        df = np.log(df)
    df[np.isinf(df)] = 0.0  # log(0) = 0
    if positive:
        df[df < 0] = 0.0
    return df
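To connect `pmi` to the formula in the question, here is how I read the code (my own derivation, not something taken from the slides). The symbols are introduced only for this illustration: C_ab is a cell of the co-occurrence matrix, R_a and K_b are its row and column totals, and N is the grand total.

```latex
\begin{align*}
\mathrm{PMI}(a, b)
  &= \log \frac{p(a, b)}{p(a)\, p(b)}
   = \log \frac{C_{ab} / N}{(R_a / N)(K_b / N)}
   = \log \frac{C_{ab}}{E_{ab}},
  \qquad E_{ab} = \frac{R_a K_b}{N}, \\
\mathrm{PPMI}(a, b) &= \max\bigl(\mathrm{PMI}(a, b),\, 0\bigr).
\end{align*}
```

E_ab is what the code calls `expected`. For the pair (".", "bus") in the example matrix, C = 1, R = 2, K = 5 and N = 56, so PMI = ln(1 / (10/56)) = ln 5.6 ≈ 1.7228. That is positive, so PPMI has the same value.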
Let's try it out:
>>> ppmi = pmi(df, positive=True)
>>> ppmi
                .       bus        by  ...    school   theatre        to
.        0.000000  1.722767  1.386294  ...  0.000000  0.000000  0.000000
bus      1.722767  0.000000  1.163151  ...  0.000000  0.000000  0.000000
by       1.386294  1.163151  0.000000  ...  0.000000  0.000000  0.000000
day      0.000000  1.029619  0.693147  ...  1.252763  0.000000  0.000000
every    0.000000  0.000000  0.693147  ...  0.559616  0.559616  0.559616
go       0.000000  0.000000  0.000000  ...  0.847298  0.847298  0.847298
i        0.000000  0.000000  0.000000  ...  0.000000  0.000000  1.252763
night    0.000000  1.029619  0.693147  ...  0.000000  1.252763  0.000000
school   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
theatre  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.559616
to       0.000000  0.000000  0.000000  ...  0.559616  0.559616  0.000000

[11 rows x 11 columns]
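Since the goal in the question was SVD-based embeddings, here is a minimal sketch of that last step. It is not part of the code from the slides. `svd_embeddings` is a hypothetical helper, `k` is an assumed embedding size, and scaling the left singular vectors by the singular values is one common convention, not the only one. The toy matrix is just the top-left 3 x 3 block (".", "bus", "by") of the PPMI output above, used purely for illustration.

```python
import numpy as np

def svd_embeddings(ppmi_matrix, k):
    """Return one k-dimensional vector per row via a truncated SVD."""
    matrix = np.asarray(ppmi_matrix, dtype=float)
    # reduced SVD: matrix = u @ diag(s) @ vt, singular values in descending order
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # keep the k largest components; scale each column by its singular value
    return u[:, :k] * s[:k]

# top-left block of the PPMI table (rows/columns: ".", "bus", "by")
toy_ppmi = np.array([[0.000000, 1.722767, 1.386294],
                     [1.722767, 0.000000, 1.163151],
                     [1.386294, 1.163151, 0.000000]])

vectors = svd_embeddings(toy_ppmi, k=2)
print(vectors.shape)  # (3, 2)
```

With the real matrix you would probably pass `ppmi.values` (or `ppmi.to_numpy()`) and pick a much larger `k`. For a vocabulary the size of the Brown corpus, a dense SVD like this may be too slow or memory-hungry, so a sparse truncated SVD would be worth considering.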