Python scikit-learn: prediction on dataset with text and numeric variables
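I have a dataset of projects, and I want to use Python and scikit-learn to predict the outcome (success/failure). The dataset contains several data types: the project title, project description, and so on are text columns, while the project cost is a numeric field.
I want to transform the text columns with TF-IDF so that I can use them as input to the model. Here is my code: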
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
tfidf_transformer = TfidfTransformer()
X_train['Project Title'] = tfidf_transformer.fit_transform(X_train['Project Title'])
But I get this error:
TypeError: no supported conversion for types: (dtype('O'),)
Any idea why this error is raised?
Edit: sample data below
Project Title: Stand Up to Bullying: Together We Can!
Project Essay: Did you know that 1-7 students in grades K-12 ...
Project Short Description: Did you know that 1-7 students in grades K-12 ...
Project Need Statement: My students need 25 copies of "Bullying in Sch...
Project Cost: 361.80
Project Type ID: 0
Project Subject Category Tree ID: 0
Project Subject Subcategory Tree ID: 0
Project Resource Category ID: 0
Project Grade Level Category ID: 0
Project Current Status ID: 0
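The problem is that you are using TfidfTransformer, which will
"Transform a count matrix to a normalized tf or tf-idf representation",
instead of TfidfVectorizer, which will
"Convert a collection of raw documents to a matrix of TF-IDF features".
TfidfTransformer expects numeric term counts, so passing it a column of raw strings (dtype object) fails with the TypeError above.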
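To make the relationship between the two classes concrete, here is a minimal sketch. The `titles` list is made up for illustration, and all estimators use their default parameters. It shows that running CountVectorizer and then TfidfTransformer should produce the same matrix as TfidfVectorizer alone:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical titles, for illustration only.
titles = [
    "stand up to bullying",
    "books for our classroom",
    "classroom science kits",
]

# Two-step route: raw text -> token counts -> tf-idf weights.
counts = CountVectorizer().fit_transform(titles)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw text -> tf-idf weights.
one_step = TfidfVectorizer().fit_transform(titles)

# With default settings both routes yield the same matrix.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

In other words, TfidfTransformer is only the second half of the job, which is why it cannot accept a column of strings.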
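import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({'Project Title': ['hello stackoverflow', 'text column', 'scikit learn', 'machine learning projects']})
# Unigrams and bigrams; the vectorizer takes the raw strings directly.
vect = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vect.fit_transform(X['Project Title'])
# get_feature_names_out() needs scikit-learn >= 1.0; older releases use get_feature_names().
X_tfidf = pd.DataFrame(tfidf.toarray(), columns=vect.get_feature_names_out())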
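Since the question is about combining text and numeric columns, one possible way to put this together is a ColumnTransformer inside a pipeline. This is only a sketch under assumptions: the rows, the `outcome` label column, and the choice of StandardScaler and LogisticRegression are all hypothetical. Only the `Project Title` and `Project Cost` column names come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; "outcome" is an assumed label column (1 = success).
df = pd.DataFrame({
    "Project Title": [
        "stand up to bullying", "books for our classroom",
        "classroom science kits", "new art supplies",
        "reading corner rugs", "math games for all",
        "science lab goggles", "music class drums",
    ],
    "Project Cost": [361.80, 120.00, 540.25, 89.99, 210.50, 75.00, 430.10, 999.00],
    "outcome": [0, 1, 0, 1, 0, 1, 0, 1],
})
X = df[["Project Title", "Project Cost"]]
y = df["outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

preprocess = ColumnTransformer([
    # A string selector passes a 1-D column, which TfidfVectorizer expects.
    ("title", TfidfVectorizer(), "Project Title"),
    # A list selector passes a 2-D frame, which StandardScaler expects.
    ("cost", StandardScaler(), ["Project Cost"]),
])

# The pipeline fits the vectorizer and scaler on the training split only.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

Keeping the vectorizer inside the pipeline avoids assigning a sparse matrix back into a DataFrame column, and the test split is transformed with the vocabulary learned from the training split.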