Factorizing text features for classification

Question

I have a DataFrame, df, made up of text and numeric features, similar to the one shown below.

Feature 1     Feature 2         Feature 3           Feature 4         Label
 10            20                keyword             Human             1
  2             3                Keywords            Dog               0
  8             2                Stackoverflow       cat               0

Currently, I use the factorize function to convert the text features into numeric ones, and then use the new DataFrame for classification.

df[' Feature 3'] = df[' Feature 3'].factorize()[0]
df[' Feature 4'] = df[' Feature 4'].factorize()[0]

After running the code above, my DataFrame looks like this:

 Feature 1     Feature 2         Feature 3           Feature 4         Label
 10            20                0                    0                 1
  2             3                1                    1                 0
  8             2                2                    2                 0

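The factorize function treats "keyword" and "Keywords" as different words. Is there a function that would treat similar words such as "keyword" and "Keywords" as the same word?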

The output DataFrame should actually look like this:

 Feature 1     Feature 2         Feature 3           Feature 4         Label
 10            20                0                    0                 1
  2             3                0                    1                 0
  8             2                1                    2                 0
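As a small illustration of the behavior described above, here is a minimal sketch on a standalone Series (the variable names and the repeated "keyword" entry are made up for the example). It shows that factorize returns a pair of codes and unique values, which is why the question takes element [0], and that the comparison is exact, so differences in case or plural form produce separate codes.

```python
import pandas as pd

# Hypothetical sample values; the last entry repeats the first on purpose.
s = pd.Series(["keyword", "Keywords", "Stackoverflow", "keyword"])

# factorize returns (codes, uniques); codes are assigned in order of
# first appearance, and strings are compared exactly.
codes, uniques = s.factorize()

print(codes.tolist())    # [0, 1, 2, 0]
print(uniques.tolist())  # ['keyword', 'Keywords', 'Stackoverflow']
```

Only the exact repeat of "keyword" shares a code; "Keywords" gets its own.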

Answer 1

You may want to look at stemmers.

NLTK gives examples of how to use them here, but in short, stemmers cut words down to their stems, for example:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

words = ['jog', 'jogging', 'jogged']

[stemmer.stem(word) for word in words]

Yields:

['jog', 'jog', 'jog']

Or, in your case:

words = ['keyword', 'keywords']

[stemmer.stem(word) for word in words]

Yields:

['keyword', 'keyword']
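To tie this back to the DataFrame in the question, the following is a rough sketch rather than a definitive recipe. It assumes the columns are named "Feature 3" and "Feature 4" (without the leading space that appears in the question's code) and rebuilds the sample table by hand. Each string is lowercased and stemmed first, and only then factorized.

```python
import pandas as pd
from nltk.stem.porter import PorterStemmer

# Toy frame mirroring the question's table; the column names are an
# assumption (no leading space, unlike the question's code).
df = pd.DataFrame({
    "Feature 1": [10, 2, 8],
    "Feature 2": [20, 3, 2],
    "Feature 3": ["keyword", "Keywords", "Stackoverflow"],
    "Feature 4": ["Human", "Dog", "cat"],
    "Label": [1, 0, 0],
})

stemmer = PorterStemmer()

for col in ["Feature 3", "Feature 4"]:
    # Lowercase explicitly so capitalization never affects the stem,
    # then reduce each word to its stem.
    stems = df[col].map(lambda s: stemmer.stem(s.lower()))
    # factorize assigns integer codes in order of first appearance.
    df[col] = pd.factorize(stems)[0]

print(df)
```

With this preprocessing, "keyword" and "Keywords" end up with the same code, which matches the desired output shown in the question.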

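Edit:

I should point out that the words in the list don't need to be similar to one another for this to work:

words = ['drinking', 'running', 'walking', 'walked']

[stemmer.stem(word) for word in words]

Output: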

['drink', 'run', 'walk', 'walk']

Solution 1
5 2019-03-04 14:50:34
