How to extract year (or datetime) from a column in a pandas dataframe that contains text
Suppose I have a pandas DataFrame:
Id  Book
1   Harry Potter (1997)
2   Of Mice and Men (1937)
3   Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story
How can I extract the year from the column?
The output should be:
Id  Book Title            Year
1   Harry Potter          1997
2   Of Mice and Men       1937
3   Babe Ruth Story, The  1948
So far I have tried:
movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()
and
books['year'] = books['title'].str[-5:-1]
I have fiddled with a few other things but haven't gotten it to work yet. Any suggestions?
How about a simple regular expression:
import re
text = 'Harry Potter (1997)'
re.findall(r'\((\d{4})\)', text)
# ['1997']  Note that findall returns a list of all the occurrences.
With a DataFrame, it can be done like this:
import pandas as pd
text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = r'\((\d{4})\)'
df['year'] = df['Book'].str.extract(pattern, expand=False)  # expand=False returns a Series
df
#                   Book  year
# 1  Harry Potter (1997)  1997
Finally, if you actually want to separate the title and the year (taking the DataFrame reconstruction from Philip's answer):
df = pd.DataFrame(columns=['Book'],
                  data=[['Harry Potter (1997)'],
                        ['Of Mice and Men (1937)'],
                        ['Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
sep = df['Book'].str.extract(r'(.*)\((\d{4})\)', expand=False)
sep  # A new DataFrame, separated into title (column 0) and year (column 1)
#                        0     1
# 0          Harry Potter   1997
# 1       Of Mice and Men   1937
# 2  Babe Ruth Story, The   1948
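To get the exact layout asked for in the question, the same idea can be sketched with named capture groups, which `str.extract` turns into column names. This is only one possible variant: the group names `Title` and `Year`, the variable names `parts` and `result`, and the lazy `.*?` followed by `\s*` (used here to keep the trailing space out of the title) are choices made for this illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Book": [
        "Harry Potter (1997)",
        "Of Mice and Men (1937)",
        "Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story",
    ],
})

# Named groups become the column names of the extracted DataFrame.
# The lazy .*? stops at the first "(dddd)", and \s* swallows the space
# before the parenthesis so the title has no trailing whitespace.
parts = df["Book"].str.extract(r"^(?P<Title>.*?)\s*\((?P<Year>\d{4})\)")

result = pd.DataFrame({
    "Id": df["Id"],
    "Book Title": parts["Title"],
    # astype(int) assumes every row matched; rows without a year would be NaN.
    "Year": parts["Year"].astype(int),
})
print(result)
```

If some titles might lack a year, the `astype(int)` step would fail on the missing values, so it would need to be replaced with something that tolerates them.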
You can do the following.
import pandas as pd
df = pd.DataFrame(columns=['id', 'Book'],
                  data=[[1, 'Harry Potter (1997)'],
                        [2, 'Of Mice and Men (1937)'],
                        [3, 'Babe Ruth Story, The (1948) Drama 948) Babe Ruth Story']])
df['Year'] = df['Book'].str.extract(r'(?!\()\b(\d+){1}', expand=False)  # expand=False returns a Series
This uses a regular expression to find the digits. I used https://regex101.com/r/Bid0qA/1, which helps a lot in understanding how the regex works.
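One thing worth checking with this pattern is what happens when a title itself contains a number. The small comparison below is a sketch, assuming a made-up extra row ("2001: A Space Odyssey (1968)" is not in the question's data) and hypothetical variable names; it contrasts the first-run-of-digits pattern with the parenthesised four-digit pattern from the earlier answer.

```python
import pandas as pd

# Hypothetical titles for illustration; the first one has digits in the title.
titles = pd.Series(["2001: A Space Odyssey (1968)", "Harry Potter (1997)"])

# Pattern from the answer above: grabs the first run of digits it finds.
first_digits = titles.str.extract(r"(?!\()\b(\d+){1}", expand=False)

# Pattern anchored on the parentheses: only four digits inside "(...)".
in_parens = titles.str.extract(r"\((\d{4})\)", expand=False)

print(pd.DataFrame({"title": titles,
                    "first_digits": first_digits,
                    "in_parens": in_parens}))
```

For the question's three rows both patterns agree, so the difference only matters if the real data has numbers before the year.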
The answer for the whole Series is actually this:
books['title'].str.findall(r'\((\d{4})\)').str.get(0)
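Since the title of the question also mentions datetime, here is a minimal sketch of converting the extracted text into a numeric or datetime column. The row "No Year Here" is an invented example of a title without a year, and the column names `year` and `year_dt` are assumptions made for this illustration.

```python
import pandas as pd

# "No Year Here" is a made-up row showing how a missing year is handled.
books = pd.DataFrame({"title": ["Harry Potter (1997)",
                                "Of Mice and Men (1937)",
                                "No Year Here"]})

# The extracted years are strings; rows without a match are missing.
year_text = books["title"].str.extract(r"\((\d{4})\)", expand=False)

# Nullable integer dtype keeps missing years as <NA> instead of forcing floats.
books["year"] = pd.to_numeric(year_text).astype("Int64")

# Datetime set to January 1 of each year; missing years become NaT.
books["year_dt"] = pd.to_datetime(year_text, format="%Y")

print(books)
```

Whether an integer or a datetime column is more useful depends on what is done with it next: integers are convenient for filtering and grouping by year, while datetimes fit in with pandas' date-based indexing.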