
Search through a dataframe for a partial string match and put the rows into a new dataframe with only their IDs

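I have a DataFrame of publications with rows like the following:

publication_ID,title,author_name,date
12344,Design Style,Jake Kreath,20071208
12334,Power of Why,Samantha Finn,20150704

I ask the user to enter a string and then use that string to search the titles.

Goal: search the DataFrame to see whether a title contains the word the user provided, and return the matching rows in a new DataFrame that contains only the title and publication_ID.

Here is my code so far: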

import re
import pandas as pd

# filepath must already point to the pipe-delimited publications file
publications = pd.read_csv(filepath, sep="|")

search_term = input('Enter the term you are looking for: ')

def stringDataFrame(publications, title, regex):
    """Collect the rows whose `title` column matches `regex`."""
    newdf = pd.DataFrame()
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for idx, value in publications[title].items():
        if re.search(regex, value):
            newdf = pd.concat([publications[publications[title] == value], newdf], ignore_index=True)
    return newdf

print(stringDataFrame(publications, 'title', search_term))

Use .str.contains together with .loc:

publications.loc[publications.title.str.contains(search_term), ['title', 'publication_ID']]

Be aware that this returns a match if your title is 'nightlife' and someone searches for 'night'. If that is not the behavior you want, you may need .str.split instead.
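To illustrate the difference, here is a minimal sketch that compares the substring match with two whole-word variants. The sample titles are made up for the example, and the word-boundary regex is an alternative to .str.split that the answer above does not mention.

```python
import re
import pandas as pd

# Hypothetical sample data, used only for this illustration
publications = pd.DataFrame({
    'title': ['nightlife in Paris', 'a night out', 'Design Style'],
    'publication_ID': [1, 2, 3],
})
search_term = 'night'

# Substring match: 'night' also hits 'nightlife'
substring_mask = publications['title'].str.contains(search_term, regex=False)

# Whole-word match via .str.split: split each title on whitespace and
# check whether the term is one of the resulting words
words = publications['title'].str.split()
word_mask = words.apply(lambda ws: search_term in ws)

# Alternative: regex word boundaries, with the user input escaped so that
# characters such as '+' or '(' are treated literally
pattern = r'\b' + re.escape(search_term) + r'\b'
boundary_mask = publications['title'].str.contains(pattern, regex=True)

substring_ids = publications.loc[substring_mask, 'publication_ID'].tolist()
word_ids = publications.loc[word_mask, 'publication_ID'].tolist()
boundary_ids = publications.loc[boundary_mask, 'publication_ID'].tolist()
```

The split-based version only separates on whitespace, so a title such as 'night, out' would not match 'night'; the word-boundary pattern handles punctuation but still depends on what the regex engine counts as a word character.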


As jpp pointed out, str.contains is case-sensitive. A simple fix is to make sure everything is lowercase.

title_mask = publications.title.str.lower().str.contains(search_term.lower())
pmids = publications.loc[title_mask, ['title', 'publication_ID']]

Now Lord, LoRD, lord and all other permutations return a valid match, and the capitalization in the original DataFrame is left unchanged.
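As an alternative to lowercasing both sides, str.contains also accepts a case parameter. The sketch below assumes made-up sample titles and additionally passes regex=False so that the user's input is treated as literal text rather than a regular expression; whether that is desirable depends on what users are expected to type.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
publications = pd.DataFrame({
    'title': ['The Lord of The Rings', 'LORD OF THE FLIES', 'Inferno', 'C++ Primer'],
    'publication_ID': [4, 5, 3, 7],
})

search_term = 'lord'

# case=False ignores capitalization; regex=False treats the term literally
title_mask = publications['title'].str.contains(search_term, case=False, regex=False)
pmids = publications.loc[title_mask, ['title', 'publication_ID']]

# With regex=False, a term such as 'c++' does not raise a regex error
literal_mask = publications['title'].str.contains('c++', case=False, regex=False)
literal_ids = publications.loc[literal_mask, 'publication_ID'].tolist()
```

Without regex=False, a search term like 'c++' would be parsed as a regular expression and fail, which matters when the term comes straight from input().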

A complete example, but you should accept @ALollz's answer:

import pandas as pd
# your publications DataFrame
publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies'],'publication_ID':[1,2,3,4,5]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term)]


Enter the term you are looking for: Lord

                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5

Based on your error, you can filter out all np.nan values as part of the logic with the following new code:

import pandas as pd
import numpy as np

publications = pd.DataFrame({'title':['The Odyssey','The Canterbury Tales','Inferno','The Lord of The Rings', 'Lord of The Flies',np.nan],'publication_ID':[1,2,3,4,5,6]})

search_term = input('Enter the term you are looking for: ')

publications[['title','publication_ID']][publications['title'].str.contains(search_term) & ~publications['title'].isna()]

Enter the term you are looking for: Lord

                   title  publication_ID
3  The Lord of The Rings               4
4      Lord of The Flies               5
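A shorter way to handle missing titles is the na parameter of str.contains, which sets the result for missing values directly. This is a minimal sketch using the same sample data as above, not a replacement for the answer's approach; the names mask and result are chosen for the example.

```python
import numpy as np
import pandas as pd

publications = pd.DataFrame({
    'title': ['The Odyssey', 'The Canterbury Tales', 'Inferno',
              'The Lord of The Rings', 'Lord of The Flies', np.nan],
    'publication_ID': [1, 2, 3, 4, 5, 6],
})

search_term = 'Lord'

# na=False makes rows with a missing title count as "no match",
# so the mask contains only True/False and no NaN
mask = publications['title'].str.contains(search_term, na=False)
result = publications.loc[mask, ['title', 'publication_ID']]
```

Because the mask has no missing values, it can be passed to .loc without a separate isna() check.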
