
Select column names in a Pandas DataFrame only where row values meet a certain criterion in Python

For each row of my dataset, I want to get the names of the columns that have a value of 0.6 or more.

Each row of the dataset is a sentence, and for each sentence I have the tf-idf value of each relevant word in that sentence.

Sample dataset:
                                     heel  syrup  word3  word4  word5
    What is a better exercise?          0      0      0      0   0.34
    how many days hv to take syrup      0   0.95      0      0      0
    Can I take this solution ?          0      0      0      0   0.23

The dataset is large: around 10K rows (sentences) and 5K columns (words). From it I want to make a new column that, for each sentence, keeps the words with a tf-idf value greater than 0.6. The code I implemented is:

import pandas as pd

dataset = pd.read_csv(r'Desktop/tfidf_val.csv')

dataset.apply(lambda x: x.index[x.astype(bool)].tolist(), 1)

but I am getting a MemoryError, and I am also not sure whether this code is correct. Any idea how to solve this, or whether there is an issue with the code?

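You can use matrix multiplication to quickly concatenate the words. A True in the boolean mask times a column name keeps that name, a False gives an empty string, and the dot product joins the results for each row: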

thresh = 0.2                 # change this to 0.6 as asked
(df > thresh) @ (df.columns + ',')

Output:

What is a better exercise?        word5,
how many days hv to take syrup    syrup,
Can I take this solution ?        word5,
dtype: object
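
For reference, here is a self-contained sketch of the same idea on a toy frame that mirrors the sample in the question. It assumes the sentences are the index of the frame (for a CSV, something like index_col=0 in pd.read_csv would do that), uses the 0.6 cutoff, and strips the trailing comma. The names toy, mask and joined are only illustrative.

```python
import pandas as pd

# Toy frame mirroring the question's sample; sentences are the index (assumption).
toy = pd.DataFrame(
    {
        "heel": [0.0, 0.0, 0.0],
        "syrup": [0.0, 0.95, 0.0],
        "word3": [0.0, 0.0, 0.0],
        "word4": [0.0, 0.0, 0.0],
        "word5": [0.34, 0.0, 0.23],
    },
    index=[
        "What is a better exercise?",
        "how many days hv to take syrup",
        "Can I take this solution ?",
    ],
)

thresh = 0.6  # cutoff from the question; use >= below if 0.6 itself should count

mask = toy > thresh
# Each True contributes "name,", each False contributes an empty string.
joined = mask.dot(toy.columns + ",").str.rstrip(",")
print(joined)
```

With this cutoff only the second sentence keeps a word (syrup); the other two rows end up as empty strings.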

Or if you want lists (as in your code):

df.apply(lambda x: list(x.index[x>thresh]), axis=1)

Output:

What is a better exercise?        [word5]
how many days hv to take syrup    [syrup]
Can I take this solution ?        [word5]
dtype: object
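
If apply is too slow or too heavy on a frame of this size, one possible alternative is to build the mask once with NumPy and slice the column names row by row. This is only a sketch: the helper name words_above and the toy data are made up for illustration, and whether it avoids the MemoryError depends on the dtypes and the RAM available (10K x 5K values stored as float64 take roughly 400 MB on their own). Note also that x.astype(bool) in the question's code marks every nonzero value as True, so it keeps any word with a nonzero score rather than only those above the cutoff.

```python
import pandas as pd


def words_above(frame, cutoff):
    """For each row, list the column names whose value exceeds cutoff."""
    values = frame.to_numpy()          # 2-D array of tf-idf scores
    names = frame.columns.to_numpy()   # 1-D array of column labels
    mask = values > cutoff             # one boolean row per sentence
    return pd.Series([names[row].tolist() for row in mask], index=frame.index)


# Hypothetical toy data with the sentences as the index.
toy = pd.DataFrame(
    [[0.0, 0.0, 0.34], [0.0, 0.95, 0.0], [0.0, 0.0, 0.23]],
    columns=["heel", "syrup", "word5"],
    index=[
        "What is a better exercise?",
        "how many days hv to take syrup",
        "Can I take this solution ?",
    ],
)

lists = words_above(toy, 0.6)
print(lists)
```

The result can be assigned to a new column, for example toy["keywords"] = words_above(toy, 0.6), as long as the helper is called before that column is added.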
