I have a dataframe of publications with the following columns:
publication_ID, title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704
I ask the user for a string and use that string to search through the titles.
The goal: Search through the dataframe to see if the title contains the word the user provides and return the rows in a new dataframe with just the title and publication_ID.
This is my code so far:
import re
import pandas as pd

publications = pd.read_csv(filepath, sep="|")
search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, regex):
    # Collect every row whose title matches the regex
    newdf = pd.DataFrame()
    for idx, title in publications['title'].items():
        if re.search(regex, title):
            newdf = pd.concat([newdf, publications.loc[[idx]]], ignore_index=True)
    return newdf

print(stringDataFrame(publications, search_term))
Use a combination of .str.contains and .loc:
publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]
Just be careful, because if your title is 'nightlife' and someone searches for 'night', this will return a match. If that's not your desired behavior, then you may need .str.split instead.
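One way to get whole-word matching without splitting is a regex with word boundaries. This is a minimal sketch rather than the approach suggested above: the sample titles and the fixed search_term are made up for illustration, and re.escape is used so that the user's text is treated literally instead of as a pattern.

```python
import re
import pandas as pd

# Illustrative data, not taken from the question
publications = pd.DataFrame({
    'title': ['Nightlife in Paris', 'A Night at the Opera', 'Design style'],
    'publication_ID': [1, 2, 3],
})
search_term = 'night'  # stands in for the input() call

# \b marks a word boundary; re.escape neutralizes regex metacharacters
pattern = r'\b' + re.escape(search_term) + r'\b'
whole_word_mask = publications['title'].str.contains(pattern, case=False, regex=True)
whole_word_matches = publications.loc[whole_word_mask, ['title', 'publication_ID']]
print(whole_word_matches)
```

Here 'Nightlife in Paris' is skipped because 'night' is only part of a longer word, while 'A Night at the Opera' is kept.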
As jpp points out, str.contains is case sensitive. One simple fix is to just ensure everything is lowercase:
title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]
Now Lord, LoRD, lord and all other capitalizations will return a valid match, and your original DataFrame has the capitalization unchanged.
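The same effect can probably be had with the case parameter of str.contains, which avoids lowercasing both sides by hand. A minimal sketch, assuming made-up sample data and a fixed search_term in place of input(); regex=False is added so the term is matched as plain text.

```python
import pandas as pd

# Illustrative data, not taken from the question
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'Lord of The Flies', 'Inferno'],
    'publication_ID': [4, 5, 3],
})
search_term = 'LoRD'  # stands in for the input() call

# case=False ignores capitalization; regex=False treats the term literally
title_mask = publications['title'].str.contains(search_term, case=False, regex=False)
matches = publications.loc[title_mask, ['title', 'publication_ID']]
print(matches)
```

The titles in the result keep their original capitalization, since only the comparison ignores case.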
Full example, but you should accept the answer above by @ALollz:
import pandas as pd
# your publications dataframe
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})
search_term = input('Enter the term you are looking for: ')
publications[['title','publication_ID']][publications['title'].str.contains(search_term)]
Enter the term you are looking for: Lord
                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
Per your error, you can filter out all np.nan values as part of the logic using the new code below:
import pandas as pd
import numpy as np
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies',np.nan],'publication_ID':[1,2,3,4,5,6]})
search_term = input('Enter the term you are looking for: ')
publications[['title','publication_ID']][publications['title'].str.contains(search_term) & ~publications['title'].isna()]
Enter the term you are looking for: Lord
                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
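An alternative to the explicit isna() check is the na parameter of str.contains, which sets the mask value used for missing titles. A minimal sketch, with a shortened version of the sample data and a fixed search_term in place of input():

```python
import numpy as np
import pandas as pd

publications = pd.DataFrame({
    'title': ['The Odyssey', 'The Lord of The Rings', 'Lord of The Flies', np.nan],
    'publication_ID': [1, 4, 5, 6],
})
search_term = 'Lord'  # stands in for the input() call

# na=False makes rows with a missing title count as non-matches
title_mask = publications['title'].str.contains(search_term, na=False)
result = publications.loc[title_mask, ['title', 'publication_ID']]
print(result)
```

Using .loc with a single boolean mask also avoids the chained indexing of the form df[cols][mask].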