How to extract specific words from pieces of text, using a dictionary of words in categories?
I want to extract specific words from text in a data frame. I've put these words in lists in a dictionary, where each key is a category. From this I want to create one column per category that stores the matching words. As always, it's best illustrated by example:
I have a data frame:
import pandas as pd
df = pd.DataFrame({'Text': ["This car is fast, agile and large and wide", "This wagon is slow, sluggish, small and compact with alloy wheels"]})
Which creates the table:
Text
0 This car is fast, agile and large and wide
1 This wagon is slow, sluggish, small and compact with alloy wheels
And a dictionary of words, grouped by category, that I want to extract from the text. The words are all natural-language words without symbols and can include phrases, such as "alloy wheels" in this example (this doesn't have to be a dictionary, I just felt this was the best approach):
myDict = {
    "vehicle": ["car", "wagon"],
    "speed": ["fast", "agile", "slow", "sluggish"],
    "size": ["large", "small", "wide", "compact"],
    "feature": ["alloy wheels"]
}
And from this I want to create a table that looks like this:
| Text | vehicle | speed | size | feature |
| ----------------------------------------------------------------- | ------- | -------------- | -------------- | ------------ |
| This car is fast, agile and large and wide | car | fast, agile | large, wide | NaN |
| This wagon is slow, sluggish, small and compact with alloy wheels | wagon   | slow, sluggish | small, compact | alloy wheels |
Cheers for the help in advance! Would love to use regex but any solutions welcome!
There are many ways you could tackle this. One approach I'd start with is to define a function that returns the list of words from a category that appear in your sentence.
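def get_matching_words(sentence, category_dict, category):
    matching_words = list()
    for word in category_dict[category]:
        # Note: splitting on spaces leaves punctuation attached ("fast,")
        # and cannot match multi-word phrases such as "alloy wheels".
        if word in sentence.split(" "):
            matching_words.append(word)
    return matching_words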
Then, you want to apply this function to your pandas dataframe.
df["vehicle"] = df["Text"].apply(lambda x: get_matching_words(x, myDict, "vehicle"))
df["speed"] = df["Text"].apply(lambda x: get_matching_words(x, myDict, "speed"))
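Rather than writing one line per category, the same call can be made in a loop over the dictionary keys. The following is a minimal sketch, not the answerer's exact code: it assumes a plain substring check instead of the whitespace split, so that "fast," and "alloy wheels" still match, and it reuses the `df` and `myDict` from the question.

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "This car is fast, agile and large and wide",
    "This wagon is slow, sluggish, small and compact with alloy wheels",
]})

myDict = {
    "vehicle": ["car", "wagon"],
    "speed": ["fast", "agile", "slow", "sluggish"],
    "size": ["large", "small", "wide", "compact"],
    "feature": ["alloy wheels"],
}

def get_matching_words(sentence, category_dict, category):
    # Assumption: substring test, so punctuation next to a word and
    # multi-word phrases do not prevent a match.
    return [word for word in category_dict[category] if word in sentence]

# Create one column per category, named after the dictionary key.
for category in myDict:
    df[category] = df["Text"].apply(
        lambda text: get_matching_words(text, myDict, category)
    )

print(df)
```

Each new column holds a list of matches, which is empty when nothing from that category appears in the text.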
The only thing to add here would be to join the list into a string instead of returning a list.
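def get_matching_words(sentence, category_dict, category):
    matching_words = list()
    for word in category_dict[category]:
        # Substring check: also matches phrases and words followed by punctuation
        if word in sentence:
            matching_words.append(word)
    # Join with ", " to match the format in the desired table
    return ", ".join(matching_words)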
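Since the question asks about regex, here is one possible sketch of that route. It is an alternative to the function above, not part of the original answer: the helper names `build_pattern` and `extract_category` are made up for illustration. Word boundaries (`\b`) keep "car" from matching inside a longer word such as "scar", which a plain substring check would allow, and rows with no match get NaN as in the desired table.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({"Text": [
    "This car is fast, agile and large and wide",
    "This wagon is slow, sluggish, small and compact with alloy wheels",
]})

myDict = {
    "vehicle": ["car", "wagon"],
    "speed": ["fast", "agile", "slow", "sluggish"],
    "size": ["large", "small", "wide", "compact"],
    "feature": ["alloy wheels"],
}

def build_pattern(words):
    # Hypothetical helper: one alternation per category.
    # Longest entries first, so a phrase wins over a shorter word it contains.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(word) for word in ordered)
    return re.compile(rf"\b(?:{alternation})\b", flags=re.IGNORECASE)

def extract_category(text, pattern):
    # Hypothetical helper: matches come back in the order they occur in the text.
    matches = pattern.findall(text)
    return ", ".join(matches) if matches else np.nan

for category, words in myDict.items():
    pattern = build_pattern(words)
    df[category] = df["Text"].apply(lambda text: extract_category(text, pattern))

print(df)
```

Compiling one pattern per category means each text is scanned once per category rather than once per word, and `re.escape` guards against any dictionary entry that happens to contain regex metacharacters.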