
Choose column with the highest count of a certain value

I am having trouble returning the name of the column that contains the highest count of the value "GPE". In this case I want my output to be just "text", because that column has two rows of 'GPE', while column text2 has one and column text3 has none.

Code:

import spacy
import pandas as pd
import en_core_web_sm

nlp = en_core_web_sm.load()
text = [["Canada", 'University of California has great research', "non-location"],["China", 'MIT is at Boston', "non-location"]]
df = pd.DataFrame(text, columns = ['text', 'text2', 'text3'])

col_list = df.columns # the column names of the dataframe

for col in col_list:
    # Replace each column in place with the named-entity labels found in its text,
    # stored as a list of single-label lists, e.g. [['GPE']].
    df[col] = df[col].apply(lambda x: [[w.label_] for w in nlp(x).ents])
df

Desired output:

text

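Once the dataframe df has been built by the script above, you can run the lines below to get the column in which GPE entities appear the most times. Each cell holds a list of single-label lists, so df[cols].sum() concatenates them into one list per column, and .count(['GPE']) counts the GPE entries in it.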

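col_count_dict = {}
for cols in df.columns:
    # sum() concatenates the per-row lists; count(['GPE']) counts the GPE labels
    col_count_dict[cols] = df[cols].sum().count(['GPE'])
# the column name whose count is largest
print(max(col_count_dict, key=col_count_dict.get))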
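
The same counting idea can be sketched without spaCy or pandas, which makes it easy to check by hand. This is only an illustration under stated assumptions: the helper names count_label_per_column and best_column are hypothetical, and the sample labels are hand-written to match the counts described in the question (two GPE in text, one in text2, none in text3), not actual model output.

```python
from typing import Dict, List, Optional

# Hand-written stand-in for the processed dataframe (assumption, not spaCy output):
# each column maps to its rows, and each row is a list of single-label lists.
sample: Dict[str, List[List[List[str]]]] = {
    "text": [[["GPE"]], [["GPE"]]],
    "text2": [[["ORG"]], [["ORG"], ["GPE"]]],
    "text3": [[], []],
}


def count_label_per_column(columns, label="GPE") -> Dict[str, int]:
    """Count how many times [label] occurs in each column's rows."""
    return {
        name: sum(cell.count([label]) for cell in cells)
        for name, cells in columns.items()
    }


def best_column(columns, label="GPE") -> Optional[str]:
    """Return the column with the most occurrences of label, or None if there are none."""
    counts = count_label_per_column(columns, label)
    if not counts:
        return None
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(count_label_per_column(sample))
print(best_column(sample))
```

With the dataframe from the question, the same helpers could be fed df.to_dict("list"). Unlike the max() call on its own, this sketch returns None when no column contains the label at all, rather than the first column with a count of zero.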

