Search through a dataframe for a partial string match and put the rows into a new dataframe with only their IDs

I have a dataframe of publications with the following columns and sample rows:

publication_ID, title, author_name, date
12344, Design style, Jake Kreath, 20071208
12334, Power of Why, Samantha Finn, 20150704

I ask the user for a string and use that string to search through the titles.

The goal: search through the dataframe to see whether each title contains the word the user provides, and return the matching rows in a new dataframe with just the title and publication_ID.

This is my code so far:

import re
import pandas as pd

# filepath is the path to the pipe-delimited publications file
publications = pd.read_csv(filepath, sep="|")

search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, regex):
    # Collect the rows whose title matches the regex, one title at a time
    newdf = pd.DataFrame()
    for idx, title in publications['title'].items():
        if re.search(regex, title):
            newdf = pd.concat([newdf, publications.loc[[idx]]], ignore_index=True)
    return newdf

print(stringDataFrame(publications, search_term))

Use a combination of .str.contains and .loc:

publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]

Just be careful: if a title is 'nightlife' and someone searches for 'night', this will return a match. If that's not the desired behavior, you may need .str.split instead.
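A minimal sketch of the whole-word variant mentioned above, assuming words in a title are separated by plain whitespace and that no title is missing. The sample titles and the names words and word_mask are made up for illustration; punctuation attached to a word (e.g. 'night,') would not match with this approach.

```python
import pandas as pd

# Hypothetical sample data, not the asker's file
publications = pd.DataFrame({
    'title': ['Nightlife', 'A Night Out', 'Design style'],
    'publication_ID': [1, 2, 3],
})
search_term = 'night'

# Lowercase each title and split it on whitespace into a list of words
words = publications['title'].str.lower().str.split()

# Keep a row only if the search term equals one of the words exactly
word_mask = words.apply(lambda ws: search_term.lower() in ws)

whole_word = publications.loc[word_mask, ['title', 'publication_ID']]
print(whole_word)
```

Here 'Nightlife' is no longer returned for 'night', while 'A Night Out' still is.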


As jpp points out, str.contains is case sensitive. One simple fix is to lowercase both the titles and the search term before comparing.

title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]

Now Lord, LoRD, lord and all other case variations will return a valid match, and your original DataFrame keeps its capitalization unchanged.
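As an alternative sketch, str.contains itself takes case, regex and na parameters, which can cover case-insensitivity, literal (non-regex) matching of user input, and missing titles in one call. The helper name search_titles and the sample data are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including a missing title
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'LoRD of The Flies', 'Inferno', 'C++ Primer', np.nan],
    'publication_ID': [1, 2, 3, 4, 5],
})

def search_titles(df, term):
    """Return title and publication_ID for rows whose title contains term."""
    mask = df['title'].str.contains(
        term,
        case=False,   # ignore capitalization
        regex=False,  # treat the user's input as literal text, not a pattern
        na=False,     # count missing titles as non-matches
    )
    return df.loc[mask, ['title', 'publication_ID']]

print(search_titles(publications, 'lord'))
print(search_titles(publications, 'c++'))
```

Passing regex=False matters for input such as 'c++', which is not a valid regular expression and would otherwise raise an error.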

Here is a full example, but you should accept the answer above by @ALollz.

import pandas as pd

# your publications dataframe
publications = pd.DataFrame({
    'title': ['The Odyssey', 'The Canterbury Tales', 'Inferno',
              'The Lord of The Rings', 'Lord of The Flies'],
    'publication_ID': [1, 2, 3, 4, 5],
})

search_term = input('Enter the term you are looking for: ')

# boolean mask of the titles that contain the search term
mask = publications['title'].str.contains(search_term)
print(publications.loc[mask, ['title', 'publication_ID']])


Enter the term you are looking for: Lord

                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5

Regarding your error: you can filter out all np.nan values as part of the selection logic, using the updated code below:

import pandas as pd
import numpy as np

# same data as before, plus one row with a missing title
publications = pd.DataFrame({
    'title': ['The Odyssey', 'The Canterbury Tales', 'Inferno',
              'The Lord of The Rings', 'Lord of The Flies', np.nan],
    'publication_ID': [1, 2, 3, 4, 5, 6],
})

search_term = input('Enter the term you are looking for: ')

# match on the search term and exclude rows whose title is missing
mask = publications['title'].str.contains(search_term) & publications['title'].notna()
print(publications.loc[mask, ['title', 'publication_ID']])

Enter the term you are looking for: Lord

                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
