How to extract the year (or datetime) from a column containing text in a pandas DataFrame

Question

Suppose I have a pandas DataFrame:

Id    Book                      
1     Harry Potter (1997)
2     Of Mice and Men (1937)
3     Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story

How can I extract the year from the column?

The output should be:

Id    Book Title               Year
1     Harry Potter             1997
2     Of Mice and Men          1937
3     Babe Ruth Story, The     1948

So far I have tried:

movies['year'] = movies['title'].str.extract('([0-9(0-9)]+)', expand=False).str.strip()

and

books['year'] = books['title'].str[-5:-1]

I have tried a few other things as well and have not gotten it to work. Any suggestions?

Answer 1

How about a simple regular expression:

import re

text = 'Harry Potter (1997)'
re.findall(r'\((\d{4})\)', text)
# ['1997']  Note that this is a list of all the occurrences.

With a DataFrame, it can be done like this:

import pandas as pd

text = 'Harry Potter (1997)'
df = pd.DataFrame({'Book': text}, index=[1])
pattern = r'\((\d{4})\)'
df['year'] = df['Book'].str.extract(pattern, expand=False)  # expand=False returns a Series

df
#                  Book   year
# 1  Harry Potter (1997)  1997

Finally, if you really want to separate the title and the year (taking the DataFrame reconstruction from Philip's answer):

df = pd.DataFrame(columns=['Book'], data=[['Harry Potter (1997)'],['Of Mice and Men (1937)'],['Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

sep = df['Book'].str.extract(r'(.*)\((\d{4})\)', expand=False)

sep # A new df, separated into title and year
#                       0      1                           
# 0          Harry Potter   1997 
# 1       Of Mice and Men   1937
# 2  Babe Ruth Story, The   1948
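
To get closer to the exact layout asked for in the question, the same idea can be written with named groups so that the extracted columns are labelled. This is only a minimal sketch: the column names `Title` and `Year`, the non-greedy title group, and the conversion to a nullable `Int64` are my own choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': [1, 2, 3],
    'Book': [
        'Harry Potter (1997)',
        'Of Mice and Men (1937)',
        'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story',
    ],
})

# Named groups become the column names of the extracted DataFrame.
# The non-greedy title group followed by \s* keeps trailing spaces out of the title.
pattern = r'^(?P<Title>.*?)\s*\((?P<Year>\d{4})\)'
parts = df['Book'].str.extract(pattern)

result = pd.concat([df[['Id']], parts], axis=1)
# Convert the year from a string to a nullable integer
# (rows without a match would become <NA>).
result['Year'] = pd.to_numeric(result['Year']).astype('Int64')
print(result)
```

Anchoring on the literal parentheses means the stray `948)` in the third row is ignored, since it has no opening parenthesis and only three digits.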

Answer 2

You can do the following.

import pandas as pd
df = pd.DataFrame(columns=['id','Book'], data=[[1,'Harry Potter (1997)'],[2,'Of Mice and Men (1937)'],[3,'Babe Ruth Story, The (1948)   Drama   948)    Babe Ruth Story']])

df['Year'] = df['Book'].str.extract(r'(?!\()\b(\d+){1}')

Line 1: import pandas.
Line 2: create the DataFrame for illustration.
Line 3: create a new column 'Year' by extracting a substring from the Book column.

The regular expression finds the digits. I use https://regex101.com/r/Bid0qA/1, which helps a lot in understanding how the regular expression works.
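
One caveat worth checking: as far as I can tell, the lookahead `(?!\()` in this pattern never rejects anything, because the next character has to be a digit anyway, so the pattern effectively returns the first run of digits in the string. The sketch below compares it with a pattern anchored on the parentheses. The second title is a hypothetical row that is not in the question's data.

```python
import pandas as pd

titles = pd.Series([
    'Harry Potter (1997)',
    '2001: A Space Odyssey (1968)',  # hypothetical row whose title contains a number
])

# Pattern from this answer: picks up the first run of digits it finds.
first_digits = titles.str.extract(r'(?!\()\b(\d+){1}', expand=False)
# Pattern anchored on parentheses: only four digits inside "(...)".
in_parens = titles.str.extract(r'\((\d{4})\)', expand=False)

print(first_digits.tolist())  # ['1997', '2001']
print(in_parens.tolist())     # ['1997', '1968']
```

For the three rows in the question both patterns give the same years, so this only matters if titles can contain other numbers.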

Answer 3

For a whole Series, the answer is actually this:

books['title'].str.findall(r'\((\d{4})\)').str.get(0)
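
Since the title also asks about datetimes, here is a hedged sketch of how the extracted strings could be converted. The `books` frame is rebuilt from the question's data, the column names `year_dt` and `year` are my own, and the row without a year is a hypothetical addition to show that `.str.get(0)` yields a missing value when `findall` returns an empty list.

```python
import pandas as pd

books = pd.DataFrame({'title': [
    'Harry Potter (1997)',
    'Of Mice and Men (1937)',
    'No Year Here',  # hypothetical row with no year in parentheses
]})

# findall returns a list per row; .str.get(0) takes the first hit or NaN.
years = books['title'].str.findall(r'\((\d{4})\)').str.get(0)

# Parse four-digit years as datetimes (1 January of that year); missing values become NaT.
books['year_dt'] = pd.to_datetime(years, format='%Y')
# A nullable integer year is often more convenient than a full timestamp.
books['year'] = books['year_dt'].dt.year.astype('Int64')
print(books)
```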


Answer 1: score 5, accepted, 2018-11-15 17:50:29
Answer 2: score 1, 2018-11-15 17:59:37
Answer 3: score 0, 2018-11-15 18:00:18
