
Fasttext model representations for numbers

I would like to create a fasttext model for numbers.我想为数字创建一个 fasttext model。 Is this a good approach?这是一个好方法吗?

Use Case:

I have a given set of about 100,000 integer invoice numbers. Our OCR sometimes produces false invoice numbers like 1000o00 or 383I338, so my idea was to use fasttext to predict the nearest invoice number based on my 100,000 integers. As the correct invoice numbers are known in advance, I trained a fasttext model on all invoice numbers to create a word-embedding space containing only invoice numbers.

But it is not working, and I don't know if my idea is completely wrong. I would assume that even though I have no sentences, embedding into a vector space should still work, and therefore the model should also find a similarity between 383I338 and 3831338.

Here is some of my code:

import pandas as pd
from random import seed
from random import randint
import fasttext
# seed random number generator
seed(9999)
number_of_vnr = 100000
min_vnr = 1111    
max_vnr = 999999999

# generate vnr integers
versicherungsscheinnummern = [randint(min_vnr, max_vnr) for i in range(number_of_vnr)]

# save numbers as csv
df_vnr = pd.DataFrame(versicherungsscheinnummern, columns=['VNR'])
df_vnr['VNR'].dropna().astype(str).to_csv('vnr_str.csv', index=False)

# train unsupervised model on the invoice numbers
model = fasttext.train_unsupervised('vnr_str.csv', model="cbow", minn=2, maxn=5)

Even data that is in the training set is not found:

model.get_nearest_neighbors("833803015")
[(0.10374893993139267, '</s>')]

The model has no words:

model.words
["'</s>'"]

I doubt FastText is the right approach for this.

Unlike natural languages, where word roots/prefixes/suffixes (character n-grams) can be hints to meaning, most invoice-number schemes are just incrementing numbers.

Every '###' or '####' is going to have a similar frequency. (Well, perhaps there'd be a little bit of a bias towards lower digits to the left, for Benford's-Law-like reasons.) Unless the exact same invoice numbers repeat often throughout the corpus, so that the whole token and its fragments acquire a word-like meaning from the surrounding tokens, FastText's post-training nearest-neighbors are unlikely to offer any hints about correct numbers. (For it to have a chance to help, you'd want the same invoice numbers not just to repeat many times, but for a lot of those appearances to have similar OCR errors; I strongly suspect your corpus instead has each invoice number appearing only in individual texts.) A rough illustration of the frequency point follows below.
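To make the "similar frequency" point concrete, here is a rough, standalone sketch (not from the original answer) that counts character n-grams, in the same 2-5 length range as the question's minn/maxn, over randomly generated numbers. The char_ngrams helper is a simplified stand-in for fastText's internal subword extraction (which also adds word-boundary markers); the names here are illustrative only.

from collections import Counter
from random import seed, randint

seed(9999)
# generate random "invoice numbers" in the same range as the question's script
numbers = [str(randint(1111, 999999999)) for _ in range(100000)]

def char_ngrams(token, minn=2, maxn=5):
    # all character n-grams of length minn..maxn (simplified version of fastText subwords)
    return [token[i:i + n] for n in range(minn, maxn + 1) for i in range(len(token) - n + 1)]

counts = Counter(g for num in numbers for g in char_ngrams(num))
print(counts.most_common(5))  # the top entries are short 2-grams with very similar counts
print(len(counts))            # far more distinct fragments than there are "meanings" to learn

No fragment stands out much above the rest, which is why the subwords carry essentially no semantic signal for this kind of data.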

Is the real goal to correct the invoice numbers, or just to have them be less noisy in a model that's trained on many more meaningful, text-like tokens? (If the latter, it might be better just to discard anything that looks like an invoice number, with or without OCR glitches, or that is similarly so rare it's likely an OCR scanno.)

That said, statistical and edit-distance methods could potentially help if the real need is correcting OCR errors; just not semantic-context-dependent methods like FastText. You might get useful ideas from Peter Norvig's classic writeup, "How to Write a Spelling Corrector".
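As a concrete starting point, here is a minimal sketch of the edit-distance idea using only the standard library. known_invoice_numbers is a placeholder for the real set of ~100,000 valid numbers, and difflib's similarity ratio stands in for a true Levenshtein distance.

import difflib

# placeholder for the real set of known-good invoice numbers (as strings)
known_invoice_numbers = ["3831338", "1000000", "833803015"]

def correct_invoice_number(ocr_token, candidates=known_invoice_numbers):
    # return the closest known invoice number, or None if nothing is reasonably close
    matches = difflib.get_close_matches(ocr_token, candidates, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(correct_invoice_number("383I338"))  # -> '3831338'
print(correct_invoice_number("1000o00"))  # -> '1000000'

For 100,000 candidates this linear scan is still workable for occasional lookups; if throughput matters, a dedicated Levenshtein library or a pre-filter on number length/prefix could narrow the candidate set first.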
