Apply TfidfVectorizer to every row of a dataframe column that holds lists of lists

Question

I have a pandas dataframe with 2 columns, and I want to use sklearn's TfidfVectorizer on one of them for text classification. However, this column holds a list of lists in each row, and TfidfVectorizer expects raw text as input. Another question provides a solution for the case of a single list of lists, but I would like to know how to apply that approach to every row of my dataframe, where each row contains its own list of lists. Thank you in advance.

Input:

0    [[this, is, the], [first, row], [of, dataframe]]
1    [[that, is, the], [second], [row, of, dataframe]]
2    [[etc], [etc, etc]]

Wanted output:

0    ['this is the', 'first row', 'of dataframe']
1    ['that is the', 'second', 'row of dataframe']
2    ['etc', 'etc etc']

Answer 1

You could use apply:

import pandas as pd

df = pd.DataFrame(data=[[[['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']]],
                        [[['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']]]],
                  columns=['paragraphs'])


df['result'] = df['paragraphs'].apply(lambda xs: [' '.join(x) for x in xs])
print(df['result'])

Output

0     [this is the, first row, of dataframe]
1    [that is the, second, row of dataframe]
Name: result, dtype: object

Further, if you want to apply the vectorizer together with the joining step above, you could do something like this:

from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize(xs, vectorizer=TfidfVectorizer(min_df=1, stop_words="english")):
    # Join each inner token list into one string, then fit TF-IDF on this row alone
    text = [' '.join(x) for x in xs]
    return vectorizer.fit_transform(text)


df['vectors'] = df['paragraphs'].apply(vectorize)
print(df['vectors'].values)
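
Note that the function above refits the vectorizer on each row, so every row gets its own vocabulary and the resulting matrices are not directly comparable. If the goal is a single feature matrix for a classifier, one possible variation (a sketch, not part of the original answer) is to flatten each row into one string and fit a single TfidfVectorizer on the whole column. The names `docs` and `X` below are illustrative, and treating each dataframe row as one document is an assumption about the intended setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    'paragraphs': [
        [['this', 'is', 'the'], ['first', 'row'], ['of', 'dataframe']],
        [['that', 'is', 'the'], ['second'], ['row', 'of', 'dataframe']],
    ]
})

# Assumption: each dataframe row is one document.
# Join the tokens of every inner list, then join the inner lists together.
docs = df['paragraphs'].apply(lambda xs: ' '.join(' '.join(x) for x in xs))

# Fit once on all rows so that every row shares the same vocabulary.
vectorizer = TfidfVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(docs)  # sparse matrix with one row per dataframe row

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

Here `X` can be passed to a scikit-learn classifier together with the label column, since row i of `X` corresponds to row i of the dataframe.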

Answer 1
1 accepted 2018-11-08 10:19:35
