
Search through a dataframe for a partial string match and put the rows into a new dataframe with only their IDs

I have a dataframe of publications with the following columns and sample rows:

publication_ID, title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704

I ask the user for a string and use that string to search through the titles.

The goal: Search through the dataframe to see if the title contains the word the user provides and return the rows in a new dataframe with just the title and publication_ID.

This is my code so far:

import re
import pandas as pd

publications = pd.read_csv(filepath, sep="|")

search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, title, regex):
    # Collect every row whose value in the `title` column matches the regex
    newdf = pd.DataFrame()
    for idx, value in publications[title].items():
        if re.search(regex, value):
            newdf = pd.concat([newdf, publications.loc[[idx]]], ignore_index=True)
    return newdf

print(stringDataFrame(publications, 'title', search_term))

Use a combination of .str.contains and .loc:

publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]
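One thing to keep in mind with user input: str.contains treats its pattern as a regular expression by default, so a term such as 'C++' would be read as regex syntax rather than plain text. A minimal sketch, assuming you want a plain substring match; the sample titles and the hard-coded search_term are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data standing in for the real publications file
publications = pd.DataFrame({
    'publication_ID': [12344, 12334, 12355],
    'title': ['Design style', 'Power of Why', 'Learning C++ (2nd ed.)'],
})

search_term = 'C++'  # stands in for the value returned by input()

# regex=False makes pandas treat the term as a literal substring
mask = publications['title'].str.contains(search_term, regex=False)
result = publications.loc[mask, ['title', 'publication_ID']]
print(result)
```

If you do want users to be able to type regular expressions, leave regex at its default instead.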

Just be careful, because if your title is 'nightlife' and someone searches for 'night' this will return a match. If that's not your desired behavior then you may need .str.split instead.
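If whole-word matching is what you want, one alternative to .str.split is a regex with word boundaries. A rough sketch, assuming that "word" means whatever the regex \b boundary treats as a word; the sample titles are invented:

```python
import re
import pandas as pd

# Hypothetical sample data for illustration
publications = pd.DataFrame({
    'publication_ID': [1, 2, 3],
    'title': ['Nightlife in Berlin', 'A Night at the Opera', 'Design style'],
})

search_term = 'night'  # stands in for the value returned by input()

# re.escape keeps any regex metacharacters in the user's term literal;
# \b on both sides requires the term to appear as a whole word
pattern = r'\b' + re.escape(search_term) + r'\b'
mask = publications['title'].str.contains(pattern, case=False)
whole_word = publications.loc[mask, ['title', 'publication_ID']]
print(whole_word)
```

Here 'Nightlife in Berlin' is skipped because 'night' is only part of a longer word, while 'A Night at the Opera' is kept.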


As jpp points out, str.contains is case sensitive. One simple fix is to just ensure everything is lowercase.

title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]

Now Lord, LoRD, lord and every other capitalization will return a valid match, and the capitalization in your original DataFrame is left unchanged.
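An equivalent option, if you would rather not lowercase both sides yourself, is the case parameter of str.contains. A short sketch with made-up titles:

```python
import pandas as pd

# Hypothetical sample data for illustration
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'LoRD of The Flies', 'Inferno'],
    'publication_ID': [4, 5, 3],
})

search_term = 'lord'  # stands in for the value returned by input()

# case=False makes the match case-insensitive without modifying the titles
title_mask = publications['title'].str.contains(search_term, case=False)
pmids = publications.loc[title_mask, ['title', 'publication_ID']]
print(pmids)
```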

Here is a full example, but you should accept the answer above by @ALollz.

import pandas as pd
# your publications dataframe
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term)]


Enter the term you are looking for: Lord

       title               publication_ID
3   The Lord of The Rings      4
4   Lord of The Flies          5

Regarding your error: you can filter out the np.nan values as part of the selection logic, as in the updated code below:

import pandas as pd
import numpy as np

publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies',np.nan],'publication_ID':[1,2,3,4,5,6]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term) & ~publications['title'].isna()]

Enter the term you are looking for: Lord

    title                 publication_ID
3   The Lord of The Rings       4
4   Lord of The Flies           5
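Another way to handle missing titles, shown here only as a sketch, is the na parameter of str.contains. It sets the result for missing values directly, so the separate isna() check is not needed:

```python
import numpy as np
import pandas as pd

# Shortened version of the sample data above, including one missing title
publications = pd.DataFrame({
    'title': ['The Odyssey', 'The Lord of The Rings', 'Lord of The Flies', np.nan],
    'publication_ID': [1, 4, 5, 6],
})

search_term = 'Lord'  # stands in for the value returned by input()

# na=False treats a missing title as "no match" instead of leaving NaN in the mask
mask = publications['title'].str.contains(search_term, na=False)
matches = publications.loc[mask, ['title', 'publication_ID']]
print(matches)
```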
