
Fasttext model representations for numbers

I would like to create a fasttext model for numbers.我想为数字创建一个 fasttext model。 Is this a good approach?这是一个好方法吗?

Use Case:

I have a given set of about 100,000 integer invoice numbers. Our OCR sometimes produces false invoice numbers like 1000o00 or 383I338, so my idea was to use fasttext to predict the nearest invoice number based on my 100,000 integers. As the correct invoice numbers are known in advance, I trained a fasttext model on all invoice numbers to create a word-embedding space containing only invoice numbers.

But it is not working, and I don't know if my idea is completely wrong. I would assume that even though I have no sentences, embedding into a vector space should still work, and therefore the model should also find a similarity between 383I338 and 3831338.

Here is some of my code:

import pandas as pd
from random import seed
from random import randint
import fasttext
# seed random number generator
seed(9999)
number_of_vnr = 100000
min_vnr = 1111    
max_vnr = 999999999

# generate vnr integers
versicherungsscheinnummern = [randint(min_vnr, max_vnr) for i in range(number_of_vnr)]

# save numbers as csv
df_vnr = pd.DataFrame(versicherungsscheinnummern, columns=['VNR'])
df_vnr['VNR'].dropna().astype(str).to_csv('vnr_str.csv', index=False)

# train unsupervised model on the invoice numbers
model = fasttext.train_unsupervised('vnr_str.csv', model="cbow", minn=2, maxn=5)

Even data that is in the training set is not found:

model.get_nearest_neighbors("833803015")
[(0.10374893993139267, '</s>')]

The model has no words:

model.words
["'</s>'"]

I doubt FastText is the right approach for this.

Unlike natural languages, where word roots/prefixes/suffixes (character n-grams) can be hints to meaning, most invoice-number schemes are just incrementing numbers.

Every '###' or '####' is going to have a similar frequency. (Well, perhaps there'd be a little bit of a bias towards lower digits to the left, for Benford's-Law-like reasons.) Unless the exact same invoice numbers repeat often throughout the corpus, so that the whole token and its fragments acquire a word-like meaning from the surrounding tokens, FastText's post-training nearest-neighbors are unlikely to offer any hints about correct numbers. (For it to have a chance to help, you'd want the same invoice numbers not just to repeat many times, but for a lot of those appearances to have similar OCR errors; I strongly suspect your corpus instead has each invoice number appearing only in individual texts.) A rough illustration of the frequency point follows below.
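To make the "similar frequency" point concrete, here is a rough, standalone sketch (not from the original answer) that counts character n-grams, in the same 2-5 length range as the question's minn/maxn, over randomly generated numbers. The char_ngrams helper is a simplified stand-in for fastText's internal subword extraction (which also adds word-boundary markers); the names here are illustrative only.

from collections import Counter
from random import seed, randint

seed(9999)
# generate random "invoice numbers" in the same range as the question's script
numbers = [str(randint(1111, 999999999)) for _ in range(100000)]

def char_ngrams(token, minn=2, maxn=5):
    # all character n-grams of length minn..maxn (simplified version of fastText subwords)
    return [token[i:i + n] for n in range(minn, maxn + 1) for i in range(len(token) - n + 1)]

counts = Counter(g for num in numbers for g in char_ngrams(num))
print(counts.most_common(5))  # the top entries are short 2-grams with very similar counts
print(len(counts))            # far more distinct fragments than there are "meanings" to learn

No fragment stands out much above the rest, which is why the subwords carry essentially no semantic signal for this kind of data.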

Is the real goal to correct the invoice numbers, or just to have them be less noisy in a model that's trained on many more meaningful, text-like tokens? (If the latter, it might be better just to discard anything that looks like an invoice number, with or without OCR glitches, or that is similarly so rare it's likely an OCR scanno.)

That said, statistical and edit-distance methods could potentially help if the real need is correcting OCR errors; just not semantic-context-dependent methods like FastText. You might get useful ideas from Peter Norvig's classic writeup, "How to Write a Spelling Corrector".
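As a concrete starting point, here is a minimal sketch of the edit-distance idea using only the standard library. known_invoice_numbers is a placeholder for the real set of ~100,000 valid numbers, and difflib's similarity ratio stands in for a true Levenshtein distance.

import difflib

# placeholder for the real set of known-good invoice numbers (as strings)
known_invoice_numbers = ["3831338", "1000000", "833803015"]

def correct_invoice_number(ocr_token, candidates=known_invoice_numbers):
    # return the closest known invoice number, or None if nothing is reasonably close
    matches = difflib.get_close_matches(ocr_token, candidates, n=1, cutoff=0.8)
    return matches[0] if matches else None

print(correct_invoice_number("383I338"))  # -> '3831338'
print(correct_invoice_number("1000o00"))  # -> '1000000'

For 100,000 candidates this linear scan is still workable for occasional lookups; if throughput matters, a dedicated Levenshtein library or a pre-filter on number length/prefix could narrow the candidate set first.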
