
Search through a dataframe for a partial string match and put the rows into a new dataframe with only their IDs

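I have a dataframe of publications that contains rows like these:

publication_ID, title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704

I ask the user to enter a string, and I use that string to search through the titles.

Goal: search the dataframe to see whether a title contains the word the user provided, and return the matching rows in a new dataframe that holds only the title and publication_ID.

Here is my code so far: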

import re
import pandas as pd

# filepath is the path to the pipe-delimited publications file
publications = pd.read_csv(filepath, sep="|")

search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, title, regex):
    # Collect every row whose value in the `title` column matches the regex
    newdf = pd.DataFrame()
    for idx, value in publications[title].items():
        if re.search(regex, value):
            newdf = pd.concat([newdf, publications.loc[[idx]]], ignore_index=True)
    return newdf

print(stringDataFrame(publications, 'title', search_term))

Use a combination of .str.contains and .loc:

publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]

Be aware that this returns a match if your title is 'nightlife' and someone searches for 'night'. If that is not the behavior you want, you may need .str.split instead.
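As a rough sketch of the whole-word alternative mentioned above, the following compares a regex word-boundary match with a .str.split membership test. The sample titles and the variable names (pattern, word_mask, split_mask) are made up for illustration and are not from the question.

```python
import re
import pandas as pd

# Hypothetical sample data, not the asker's real file
publications = pd.DataFrame({
    'title': ['nightlife guide', 'A night at the opera', 'Day and night'],
    'publication_ID': [1, 2, 3],
})
search_term = 'night'

# Option 1: regex word boundaries; re.escape keeps the user input literal
pattern = rf'\b{re.escape(search_term)}\b'
word_mask = publications['title'].str.contains(pattern, regex=True)
whole_word = publications.loc[word_mask, ['title', 'publication_ID']]

# Option 2: split each title on whitespace and test list membership
split_mask = publications['title'].str.split().apply(
    lambda words: search_term in words
)
whole_word_split = publications.loc[split_mask, ['title', 'publication_ID']]

print(whole_word)
print(whole_word_split)
```

Both options skip 'nightlife guide' here. The split version only separates on whitespace, so a title such as 'night, and day' would not match; the word-boundary version handles punctuation but assumes the search term starts and ends with a word character.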


As jpp points out, str.contains is case sensitive. One simple fix is to make sure everything is lowercase.

title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]

Now Lord, LoRD, lord and every other permutation will return a valid match, and the capitalization in the original DataFrame is left unchanged.
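Another way to get the same effect, sketched here with made-up sample data, is the case parameter of str.contains, which avoids lowercasing both sides by hand:

```python
import pandas as pd

# Hypothetical sample data for illustration only
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'LORD OF THE FLIES', 'Inferno'],
    'publication_ID': [4, 5, 3],
})
search_term = 'lord'

# case=False makes the match case-insensitive; the titles themselves are not modified
title_mask = publications['title'].str.contains(search_term, case=False)
pmids = publications.loc[title_mask, ['title', 'publication_ID']]
print(pmids)
```

The returned rows keep their original capitalization, just as with the .str.lower() approach.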

Here is a full example, but you should accept @ALollz's answer:

import pandas as pd
# your publications dataframe
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term)]


Enter the term you are looking for: Lord

                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
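Because the search term in this example comes straight from input(), it is worth knowing that str.contains treats the pattern as a regular expression by default. The sketch below, which uses invented titles, passes regex=False so that characters such as + or ( are matched literally:

```python
import pandas as pd

# Hypothetical sample data for illustration only
publications = pd.DataFrame({
    'title': ['Learning C++', 'C for Beginners', 'Why (Not)?'],
    'publication_ID': [1, 2, 3],
})

# As a regex, '+' is a quantifier, so this pattern would not mean the literal text 'C++'
search_term = 'C++'

# regex=False switches to plain substring matching
literal_mask = publications['title'].str.contains(search_term, regex=False)
literal_hits = publications.loc[literal_mask, ['title', 'publication_ID']]
print(literal_hits)
```

If regex features are wanted for some searches, escaping the input with re.escape is an alternative to regex=False.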

Based on your error, you can filter out all np.nan values as part of the logic, using the new code below:

import pandas as pd
import numpy as np

publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies',np.nan],'publication_ID':[1,2,3,4,5,6]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term) & ~publications['title'].isna()]

Enter the term you are looking for: Lord

                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
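A shorter variant of the same idea, shown as a sketch with the sample data trimmed down, is the na parameter of str.contains. With na=False, missing titles count as non-matches, so no separate isna() check is needed:

```python
import numpy as np
import pandas as pd

# Sample data modeled on the example above, with one missing title
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'Lord of The Flies', 'Inferno', np.nan],
    'publication_ID': [4, 5, 3, 6],
})
search_term = 'Lord'

# na=False fills the result for missing titles with False instead of NaN
mask = publications['title'].str.contains(search_term, na=False)
result = publications.loc[mask, ['title', 'publication_ID']]
print(result)
```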
