Select column names in a Pandas DataFrame only where the row values meet a certain condition in Python

Question

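I want to get, for each row of my dataset, the names of the columns whose value is 0.6 or more.

The dataset is structured so that each row is a sentence, and for each sentence I have the tf-idf value of every relevant word in it.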

Sample dataset:
                                     heel  syrup  word3  word4  word5
    What is a better exercise?          0      0      0      0   0.34
    how many days hv to take syrup      0   0.95      0      0      0
    Can I take this solution ?          0      0      0      0   0.23

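The dataset is very large: about 10K rows (sentences) and 5K columns (words). From this I want to create a new column that keeps, for each sentence, the words whose tf-idf value is greater than 0.6. The code I implemented is: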

dataset = pd.read_csv(r'Desktop/tfidf_val.csv')

dataset.apply(lambda x: x.index[x.astype(bool)].tolist(), 1)

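But I get a MemoryError, and I am also not sure whether this code is correct. Any idea how to solve this, or whether something is wrong with the code?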

Answer 1

You can use matrix multiplication to concatenate the words quickly:

thresh = 0.2                 # change this to 0.6 as asked
(df > thresh) @ (df.columns + ',')

Output:

What is a better exercise?        word5,
how many days hv to take syrup    syrup,
Can I take this solution ?        word5,
dtype: object
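
For reference, here is a self-contained sketch of the same trick on the sample data. It assumes the sentences are the index of the frame (so every remaining column is numeric), uses the 0.6 cutoff from the question with `>=`, and strips the trailing comma. The names `mask` and `words` are just for illustration.

```python
import pandas as pd

# Sample data from the question: sentences as the index, words as columns.
df = pd.DataFrame(
    {
        "heel": [0.0, 0.0, 0.0],
        "syrup": [0.0, 0.95, 0.0],
        "word3": [0.0, 0.0, 0.0],
        "word4": [0.0, 0.0, 0.0],
        "word5": [0.34, 0.0, 0.23],
    },
    index=[
        "What is a better exercise?",
        "how many days hv to take syrup",
        "Can I take this solution ?",
    ],
)

thresh = 0.6  # cutoff asked for in the question

# Boolean mask: True where the tf-idf value is at or above the cutoff.
mask = df >= thresh

# True * "word," gives "word," and False * "word," gives "", so the dot
# product concatenates the names of the matching columns for each row.
words = mask.dot(df.columns + ",").str.rstrip(",")

print(words)
```

With the 0.6 cutoff only `syrup` in the second sentence qualifies; the other two rows come back as empty strings.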

Or, if you want lists (as in your code):

df.apply(lambda x: list(x.index[x>thresh]), axis=1)

Output:

What is a better exercise?        [word5]
how many days hv to take syrup    [syrup]
Can I take this solution ?        [word5]
dtype: object
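
On the memory side, a row-wise `apply` builds a Series object for every row, which adds overhead on a frame of this size. One possible alternative, sketched below, does the comparison once in NumPy and then slices the matching column names per row. This is only a sketch under assumptions: the helper name `words_at_or_above` is made up for illustration, the sentences are assumed to be the index (for example by reading the file with `index_col=0`), and I have not measured whether it avoids the MemoryError on the real data.

```python
import numpy as np
import pandas as pd


def words_at_or_above(df: pd.DataFrame, thresh: float = 0.6) -> pd.Series:
    """Return, per row, the list of column names whose value is >= thresh.

    Hypothetical helper for illustration; assumes all columns are numeric
    and the frame has at least one row.
    """
    values = df.to_numpy()
    # np.nonzero scans in row-major order, so `rows` comes back sorted.
    rows, cols = np.nonzero(values >= thresh)
    # Number of hits in each row, including rows with no hit at all.
    counts = np.bincount(rows, minlength=len(df))
    # Flat array of matching column names, cut at the row boundaries.
    names = df.columns.to_numpy()[cols]
    pieces = np.split(names, np.cumsum(counts)[:-1])
    return pd.Series([p.tolist() for p in pieces], index=df.index)


# Small demo frame shaped like the sample in the question.
demo = pd.DataFrame(
    {
        "heel": [0.0, 0.0, 0.0],
        "syrup": [0.0, 0.95, 0.0],
        "word5": [0.34, 0.0, 0.23],
    },
    index=[
        "What is a better exercise?",
        "how many days hv to take syrup",
        "Can I take this solution ?",
    ],
)

result = words_at_or_above(demo, thresh=0.6)
print(result)
```

Note that this compares against the threshold directly, whereas `x.astype(bool)` in the question selects every non-zero value rather than values of 0.6 or more.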

Solution 1
0 2020-11-12 19:54:06