Is it possible to create a Bag of Words from characters or substrings instead of words?

Is it possible to create a BoW that works on substrings instead of whole words?

I'm working on a Python program where I input an array of names (rather than full sentences) and try to apply BoW to it. The problem is that, because BoW is designed for words in sentences, the program treats each name as a sentence.

Example: if I have the words `Farahoka, Csanoha, April, Bas, Phrahonee` and I'm looking for the substring `aho`.

How could I do this?

Edit: It seems my question was not clear, so I'll do my best to explain what the task is and what I need to do.

I have a list of names in an array, and I'm trying to find a way to vectorize the letters, or perhaps a way to split the names into syllables.

Example:

In BoW, if I have `The sky is blue today`, it is split into `[The, sky, is, blue, today]`. In my problem I'm trying to do something similar: separate or find substrings within words.

Using the previous example, I want to take the word `today` and search for the substring `ay`.

Is it possible to do this without using things like `if 'ay' in today` or `endswith('ay')`?

In theory I need to use a unigram model for this in order to learn weights for a predictor, but everything I can find online focuses on words rather than substrings.

You don't have much choice but to loop over the elements.

The exact output you expect is unclear, but you could do one of the following:

Searching for matches:

words = ['Farahoka', 'Csanoha', 'April', 'Bas', 'Phrahonee']

[w for w in words if 'aho' in w]
# ['Farahoka', 'Phrahonee']
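If the goal is to feed a predictor rather than just filter the list, the same membership test can be turned into a 0/1 feature per name. The following is only a sketch under assumptions: the `substrings` list and the helper name `presence_features` are hypothetical and not part of the question, and matching is done case-insensitively by choice.

```python
words = ['Farahoka', 'Csanoha', 'April', 'Bas', 'Phrahonee']

# Hypothetical list of substrings to use as features (an assumption,
# the question only mentions 'aho').
substrings = ['aho', 'an', 'ri']


def presence_features(word, substrings):
    """Return one 0/1 flag per substring: 1 if it occurs in the word."""
    lowered = word.lower()  # case-insensitive matching (a design choice)
    return [1 if s in lowered else 0 for s in substrings]


# One row per name, one column per substring.
features = [presence_features(w, substrings) for w in words]

for word, row in zip(words, features):
    print(word, row)
```

Each row can then be used as the input vector for whatever model learns the weights.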

Your best bet is to iterate over each word and check whether your substring is in that word.

substring = 'aho'
words = ['Farahoka', 'Csanoha', 'April', 'Bas', 'Phrahonee']


for word in words:
    if substring in word:
        print(f'{substring} in {word}')
    else:
        print(f'{substring} not found in {word}')
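For something closer to a Bag of Words over substrings, one common approach is to count character n-grams instead of words. The sketch below uses only the standard library and assumes trigrams (`n = 3`) and lowercasing; the helper name `char_ngrams` and the variables `bags`, `vocabulary` and `matrix` are illustrative choices, not something given in the question.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return every contiguous substring of length n, in order."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


words = ['Farahoka', 'Csanoha', 'April', 'Bas', 'Phrahonee']

# One "bag" per name: how often each trigram occurs in that name.
bags = [Counter(char_ngrams(w, 3)) for w in words]

# Shared vocabulary: every trigram seen in any name, in sorted order.
vocabulary = sorted(set().union(*bags))

# Count matrix: one row per name, one column per trigram in the vocabulary.
matrix = [[bag[gram] for gram in vocabulary] for bag in bags]

# Column for the trigram 'aho' across all names.
aho_column = [row[vocabulary.index('aho')] for row in matrix]
print(aho_column)
```

If scikit-learn is available, `CountVectorizer(analyzer='char', ngram_range=(3, 3))` should build a comparable count matrix, although its preprocessing defaults may differ from this sketch.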
