Select column names in Pandas dataframe only if row values meet a certain criteria in Python

Question

I want to get the names of those columns which have a value of 0.6 or more from each row of my dataset.

Each row of the dataset is a sentence, and each column holds the tf-idf value of a word for that sentence.

Sample dataset:
                                     heel  syrup  word3  word4  word5
    What is a better exercise?          0      0      0      0   0.34
    how many days hv to take syrup      0   0.95      0      0      0
    Can I take this solution ?          0      0      0      0   0.23

The dataset is large: around 10K rows (sentences) and 5K columns (words). I want to add a new column that, for each sentence, keeps the words whose tf-idf value is greater than 0.6. The code I tried is:

import pandas as pd

dataset = pd.read_csv(r'Desktop/tfidf_val.csv')

dataset.apply(lambda x: x.index[x.astype(bool)].tolist(), 1)

but I am getting a MemoryError, and I am also not sure whether this code is correct. Any idea how to solve this, or whether there is an issue with the code?

Answer 1

You can use matrix multiplication to quickly concatenate the words:

thresh = 0.2                 # change this to 0.6 as asked
(df > thresh) @ (df.columns + ',')

Output:

What is a better exercise?        word5,
how many days hv to take syrup    syrup,
Can I take this solution ?        word5,
dtype: object
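
The multiplication works because Python evaluates True * 'syrup,' as 'syrup,' and False * 'syrup,' as an empty string, so the dot product of each boolean row with the column labels concatenates the matching names. Below is a minimal sketch on the sample data from the question. It assumes the sentences form the index and uses the asker's threshold of 0.6; the `.str.rstrip(',')` call is an addition here to drop the trailing comma, and the names `mask` and `joined` are only illustrative.

```python
import pandas as pd

# Sample data from the question, with the sentences as the index (assumption).
df = pd.DataFrame(
    [[0, 0, 0, 0, 0.34],
     [0, 0.95, 0, 0, 0],
     [0, 0, 0, 0, 0.23]],
    index=["What is a better exercise?",
           "how many days hv to take syrup",
           "Can I take this solution ?"],
    columns=["heel", "syrup", "word3", "word4", "word5"],
)

thresh = 0.6

# Boolean frame: True where the tf-idf value exceeds the threshold.
mask = df > thresh

# Dot product with the comma-suffixed labels joins the matching names;
# rstrip removes the comma left after the last name.
joined = mask.dot(df.columns + ",").str.rstrip(",")
print(joined)
```

With the threshold at 0.6, only the second sentence keeps a word (syrup); the other two rows come back as empty strings.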

Or if you want lists (as in your code):

df.apply(lambda x: list(x.index[x>thresh]), axis=1)

Output:

What is a better exercise?        [word5]
how many days hv to take syrup    [syrup]
Can I take this solution ?        [word5]
dtype: object
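
If the row-wise apply is still too slow or heavy on a frame of about 10K x 5K, one possible alternative is to loop over the underlying NumPy array, which avoids building a pandas Series for every row. This is a sketch under assumptions, not a guaranteed fix for the MemoryError: `words_above` is a hypothetical helper name, and the commented-out `read_csv` line assumes the first CSV column holds the sentences.

```python
import pandas as pd

# Assumption: the first column of the CSV holds the sentences, so it is
# used as the index and every remaining column is numeric.
# df = pd.read_csv(r'Desktop/tfidf_val.csv', index_col=0)

# Sample data from the question, used here so the sketch runs on its own.
df = pd.DataFrame(
    [[0, 0, 0, 0, 0.34],
     [0, 0.95, 0, 0, 0],
     [0, 0, 0, 0, 0.23]],
    index=["What is a better exercise?",
           "how many days hv to take syrup",
           "Can I take this solution ?"],
    columns=["heel", "syrup", "word3", "word4", "word5"],
)


def words_above(frame, thresh):
    """Return a Series of lists with the column names whose value exceeds thresh."""
    values = frame.to_numpy()          # 2-D array of tf-idf values
    cols = frame.columns.to_numpy()    # 1-D array of column names
    # Boolean indexing on the label array picks the matching names per row.
    return pd.Series(
        [cols[row > thresh].tolist() for row in values],
        index=frame.index,
    )


result = words_above(df, 0.6)
print(result)
```

The result can be assigned back as a new column, for example `df['words'] = words_above(df, 0.6)`, provided the helper is called before that column is added.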
