Search a dataframe for partial string matches and put the matching rows, with only their ID, into a new dataframe

Question

I have a dataframe of publications that has the following rows:

publication_ID, title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704

I ask the user for a string and use that string to search through the titles.

The goal: search through the dataframe to see whether the title contains the word the user provides, and return the matching rows in a new dataframe with just the title and publication_ID.

This is my code so far:

import re

import pandas as pd

# filepath is the path to the pipe-delimited publications file
publications = pd.read_csv(filepath, sep="|")

search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, regex):
    # Collect every row whose title matches the regex
    newdf = pd.DataFrame()
    for idx, title in publications['title'].items():
        if re.search(regex, title):
            newdf = pd.concat([newdf, publications.loc[[idx]]], ignore_index=True)
    return newdf

print(stringDataFrame(publications, search_term))

Answer 1

Use a combination of .str.contains and .loc:

publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]
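
One detail worth knowing: by default .str.contains interprets the search term as a regular expression, so user input containing characters such as + or ( can raise an error or match unexpectedly. A minimal sketch of a literal substring search using regex=False follows; the sample titles and IDs are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real publications file
publications = pd.DataFrame({
    'title': ['Learning C++', 'Design style', 'Power of Why'],
    'publication_ID': [1, 12344, 12334],
})

# 'C++' is not a valid regular expression, so it has to be matched literally
search_term = 'C++'

# regex=False makes str.contains perform a plain substring test
mask = publications['title'].str.contains(search_term, regex=False)
literal_matches = publications.loc[mask, ['title', 'publication_ID']]
print(literal_matches)
```

If regex features are wanted for some searches, re.escape(search_term) is another way to neutralize special characters in user input.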

Just be careful, because if your title is 'nightlife' and someone searches for 'night', this will return a match. If that's not your desired behavior, then you may need .str.split instead.
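
As one possible way to get whole-word matching, here is a sketch that uses a regex word boundary (\b) rather than .str.split; this is a swapped-in technique, not the approach named above, and the sample data is invented for the example.

```python
import re

import pandas as pd

# Hypothetical sample data, not taken from the question
publications = pd.DataFrame({
    'title': ['Nightlife guide', 'A night in Paris', 'Design style'],
    'publication_ID': [1, 2, 3],
})

search_term = 'night'

# \b anchors the term at word boundaries; re.escape keeps any
# special characters in the user's input from being read as regex syntax
pattern = r'\b' + re.escape(search_term) + r'\b'

mask = publications['title'].str.contains(pattern, case=False)
whole_word_matches = publications.loc[mask, ['title', 'publication_ID']]
print(whole_word_matches)
```

Here 'Nightlife guide' is excluded because 'night' is followed directly by another letter, while 'A night in Paris' is kept.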

As jpp points out, str.contains is case sensitive. One simple fix is to make sure everything is lowercase before comparing.

title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]

Now Lord, LoRD, lord and all other case variations will return a valid match, and your original DataFrame keeps its capitalization unchanged.
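
An alternative to lowercasing both sides is the case parameter of .str.contains, which should give the same result in one call. A short sketch, with sample titles made up for illustration:

```python
import pandas as pd

# Hypothetical sample data with mixed capitalization
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'LORD OF THE FLIES', 'Inferno'],
    'publication_ID': [4, 5, 3],
})

search_term = 'lord'

# case=False asks str.contains to ignore case while matching
title_mask = publications['title'].str.contains(search_term, case=False)
case_insensitive = publications.loc[title_mask, ['title', 'publication_ID']]
print(case_insensitive)
```

The titles in the result keep their original capitalization, since only the comparison ignores case.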

Answer 2

Here is a full example, but you should accept the answer above by @ALollz.

import pandas as pd
# your publications dataframe
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term)]


Enter the term you are looking for: Lord

    title                  publication_ID
3   The Lord of The Rings  4
4   Lord of The Flies      5

Per your error, you can filter out all np.nan values as part of the logic using the new code below:

import pandas as pd
import numpy as np

publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies',np.nan],'publication_ID':[1,2,3,4,5,6]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term) & ~publications['title'].isna()]

Enter the term you are looking for: Lord

    title                  publication_ID
3   The Lord of The Rings  4
4   Lord of The Flies      5
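
A slightly shorter variant, sketched here as an alternative rather than a replacement, is to let .str.contains handle missing titles itself through its na parameter. The data mirrors the example above.

```python
import numpy as np
import pandas as pd

publications = pd.DataFrame({
    'title': ['The Odyssey', 'The Canterbury Tales', 'Inferno',
              'The Lord of The Rings', 'Lord of The Flies', np.nan],
    'publication_ID': [1, 2, 3, 4, 5, 6],
})

search_term = 'Lord'

# na=False treats missing titles as non-matches, so the mask is purely boolean
mask = publications['title'].str.contains(search_term, na=False)
result = publications.loc[mask, ['title', 'publication_ID']]
print(result)
```

With na=False there is no need for the separate isna() check, because rows with a missing title are simply never selected.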

2 answers

Answer 1 (score 1, accepted): 2018-08-15 14:02:26
Answer 2 (score 1): 2018-08-15 14:18:47
