I have a pandas dataframe like this:
import pandas as pd

data = {
    'col1': ['New Zealand', 'Gym', 'United States'],
    'col2': ['Republic of South Africa', 'Park', 'United States of America'],
}
df = pd.DataFrame(data)
print(df)
            col1                      col2
0    New Zealand  Republic of South Africa
1            Gym                      Park
2  United States  United States of America
I also have a sentence that may contain values from any column of the dataframe. I want to find which values appear in the sentence and which column each one belongs to. I have seen some similar solutions, but they match the sentence against the column values rather than the other way around. Currently, I am doing it like this:
def find_match(df, sentence):
    """Returns True/False, the matching values, and the column names where the values exist."""
    arr = []
    cols = []
    flag = False
    for i, row in df.iterrows():
        if row['col1'].lower() in sentence.lower():
            arr.append(row['col1'])
            cols.append('col1')
            flag = True
        elif row['col2'].lower() in sentence.lower():
            arr.append(row['col2'])
            cols.append('col2')
            flag = True
    return flag, arr, cols

sentence = "I live in the United States"
find_match(df, sentence)  # returns (True, ['United States'], ['col1'])
I want a more Pythonic way to do this, because it takes a lot of time on a fairly large dataframe.
I cannot use `.isin()` because it expects a list of strings and matches each column value against the whole sentence. I have also tried the following, but it throws an error:
df.loc[df['col1'].str.lower() in sentence] # throws error that df['col1'] should be a string
Any help will be highly appreciated. Thanks!
I would do something like this:
def find_match(df, sentence):
    ids = [(i, j) for j in df.columns for i, v in enumerate(df[j]) if v.lower() in sentence.lower()]
    return len(ids) > 0, [df[j][i] for i, j in ids], [j for i, j in ids]
Which gives:
find_match(df, sentence = 'I regularly go to the gym in the United States of America')
(True,
['Gym', 'United States', 'United States of America'],
['col1', 'col1', 'col2'])
To my mind this is quite Pythonic, although there might be more elegant ways that make more use of pandas functions.
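If you want to lean more on pandas, one possibility (a sketch, not benchmarked) is to `stack()` the frame into a single Series indexed by `(row, column)` and filter it once. This reproduces the same `(flag, values, columns)` tuple:

```python
import pandas as pd

def find_match(df, sentence):
    """Return (found, matching values, column names) for every cell that is a substring of the sentence."""
    sentence_lower = sentence.lower()
    # stack() flattens the frame into a Series indexed by (row, column), row by row
    stacked = df.stack()
    hits = stacked[stacked.apply(lambda v: v.lower() in sentence_lower)]
    return len(hits) > 0, hits.tolist(), [col for _, col in hits.index]

df = pd.DataFrame({
    'col1': ['New Zealand', 'Gym', 'United States'],
    'col2': ['Republic of South Africa', 'Park', 'United States of America']})
print(find_match(df, 'I regularly go to the gym in the United States of America'))
```

Because `stack()` walks the frame row by row, the result order matches the list-comprehension version above only up to row/column ordering; a row that matches in both columns is reported twice, once per column.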
Evidently you would like to check whether each value in col1 is a substring of the sentence. Is this correct? If so, here is one way:
df = pd.DataFrame(
    {'col1': ['New Zealand', 'Gym', 'United States'],
     'col2': ['Republic of South Africa', 'Park', 'United States of America']})

sentence = 'I live in the United States'

mask = df['col1'].apply(lambda x: x in sentence)  # `mask` is a boolean Series
if mask.any():
    matches = df.loc[mask, 'col1']
    print(mask.any(), end=', ')
    print(df.loc[mask, 'col1'].values, end=', ')
    print('col1')

# the print statements produce the following line:
# True, ['United States'], col1
If this is the right logic for one column, then you could put the mask statement and the `if` clause in a loop: `for col in df.columns:`.
Update: we can modify the lambda expression to perform case-insensitive comparison. (The original data frame is not changed.)
mask = df['col1'].apply(lambda x: x.lower() in sentence.lower())
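Putting the column loop and the case-insensitive lambda together, a minimal sketch might look like this (assuming every column holds strings; note that the output is ordered column by column rather than row by row):

```python
import pandas as pd

def find_match(df, sentence):
    """Return (found, matching values, column names), checking every column case-insensitively."""
    sentence_lower = sentence.lower()
    flag, values, cols = False, [], []
    for col in df.columns:
        # boolean mask: True where the lower-cased cell is a substring of the sentence
        mask = df[col].apply(lambda x: x.lower() in sentence_lower)
        if mask.any():
            flag = True
            values.extend(df.loc[mask, col])
            cols.extend([col] * int(mask.sum()))
    return flag, values, cols

df = pd.DataFrame({
    'col1': ['New Zealand', 'Gym', 'United States'],
    'col2': ['Republic of South Africa', 'Park', 'United States of America']})
print(find_match(df, 'I live in the United States'))
```

The inner `apply` still runs Python code per cell, so this mainly tidies the logic; for very large frames, precomputing `df[col].str.lower()` once outside the loop body would avoid repeated lower-casing of the sentence and cells.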