Search through a dataframe for a partial string match and put the rows into a new dataframe with only their IDs
I have a dataframe of publications that has the following columns:
publication_ID, title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704
I ask the user for a string and use that string to search through the titles.
The goal: search through the dataframe to see if the title contains the word the user provides, and return the matching rows in a new dataframe with just the title and publication_ID.
This is my code so far:
import re
import pandas as pd

publications = pd.read_csv(filepath, sep="|")
search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, regex):
    # Collect every row whose title matches the regex
    newdf = pd.DataFrame()
    for idx, title in publications['title'].items():
        if re.search(regex, title):
            newdf = pd.concat([publications[publications['title'] == title], newdf], ignore_index=True)
    return newdf

print(stringDataFrame(publications, search_term))
Use a combination of .str.contains and .loc:
publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]
Just be careful: if your title is 'nightlife' and someone searches for 'night', this will return a match. If that's not your desired behavior, then you may need .str.split instead.
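To illustrate the whole-word caveat, here is a minimal sketch of two ways to match only complete words. The sample titles and the names split_mask, regex_mask and whole_word are made up for illustration and are not from the original post. The regex variant is an alternative to .str.split, not something the answer above proposed.

```python
import re
import pandas as pd

# Hypothetical sample data for illustration only
publications = pd.DataFrame({
    'title': ['Nightlife', 'A Night Out', 'Design style'],
    'publication_ID': [1, 2, 3],
})
search_term = 'night'

# Option 1: split each title on whitespace and test for an exact word
words = publications['title'].str.lower().str.split()
split_mask = words.apply(lambda ws: search_term.lower() in ws)

# Option 2: regex word boundaries; re.escape keeps user input literal
pattern = r'\b' + re.escape(search_term) + r'\b'
regex_mask = publications['title'].str.contains(pattern, case=False, regex=True)

whole_word = publications.loc[split_mask, ['title', 'publication_ID']]
print(whole_word)
```

Note that splitting on whitespace leaves punctuation attached to words (for example 'Night,'), so the word-boundary regex is usually the more forgiving of the two.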
As jpp points out, str.contains is case sensitive. One simple fix is to just ensure everything is lowercase.
title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]
Now Lord, LoRD, lord and all other permutations will return a valid match, and your original DataFrame has the capitalization unchanged.
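As an alternative to lowercasing both sides, str.contains also accepts a case argument. The sketch below assumes made-up sample titles and additionally passes regex=False so the user's input is treated as a literal substring rather than a regular expression; that second choice is an assumption about what is wanted here, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data for illustration only
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'LoRD of The Flies', 'Inferno'],
    'publication_ID': [4, 5, 3],
})
search_term = 'lord'

# case=False ignores capitalization; regex=False matches the text literally
title_mask = publications['title'].str.contains(search_term, case=False, regex=False)
matches = publications.loc[title_mask, ['title', 'publication_ID']]
print(matches)
```

The titles in the result keep their original capitalization, because only the comparison is case-insensitive.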
Full example, but you should accept the answer above by @ALollz:
import pandas as pd
# your publications dataframe
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})
search_term = input('Enter the term you are looking for: ')
publications[['title','publication_ID']][publications['title'].str.contains(search_term)]
Enter the term you are looking for: Lord
                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
Per your error, you can filter out all np.nan values as part of the logic, using the new code below:
import pandas as pd
import numpy as np
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies',np.nan],'publication_ID':[1,2,3,4,5,6]})
search_term = input('Enter the term you are looking for: ')
publications[['title','publication_ID']][publications['title'].str.contains(search_term) & ~publications['title'].isna()]
Enter the term you are looking for: Lord
                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
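A shorter way to handle missing titles is the na argument of str.contains, which sets the mask value used for missing entries. This is a minimal sketch with a fixed search term instead of input(), and the trimmed sample data is for illustration only.

```python
import numpy as np
import pandas as pd

# Sample data with one missing title
publications = pd.DataFrame({
    'title': ['The Odyssey', 'The Lord of The Rings', 'Lord of The Flies', np.nan],
    'publication_ID': [1, 4, 5, 6],
})
search_term = 'Lord'

# na=False makes rows with a missing title count as "no match"
title_mask = publications['title'].str.contains(search_term, na=False)
result = publications.loc[title_mask, ['title', 'publication_ID']]
print(result)
```

With na=False the mask contains only True and False, so it can be passed straight to .loc without a separate isna() check.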