Get unique years from a date column in a pandas DataFrame

Question

I have a date column in my DataFrame, say df_dob, and it looks like this:

id	DOB
23312	31-12-9999
1482	31-12-9999
807	#VALUE!
2201	06-12-1925
653	01/01/1855
108	01/01/1855
768	1967-02-20

What I want to print is a list of unique years, like ['9999', '1925', '1855', '1967'].

Basically, through this list I just want to check whether some unwanted year is present. I have tried (my code is pasted below), but I get "ValueError: time data 01/01/1855 doesn't match format specified" and could not resolve it.

df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))

PS: when I print df_dob['DOB'], I get values like 1967-02-20 00:00:00

Answer 1

Use pandas' unique for this, and apply it to the year only.

So try:

print(df_dob['DOB'].dt.year.unique())

Also, you don't need to stringify your time, and you don't need to replace anything; pandas is smart enough to do it for you. So your overall code becomes:

df_dob['DOB'] = pd.to_datetime(df_dob.DOB)    # No need to pass format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())
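
To make the idea concrete, here is a minimal sketch of the conversion followed by .dt.year.unique(). The sample frame is an assumption: it holds only ISO-formatted dates that fall inside the range pandas can store, so values from the question such as 31-12-9999 or #VALUE! are deliberately left out because they may not parse this way.

```python
import pandas as pd

# Hypothetical sample: ISO-formatted, in-range dates only (an assumption,
# not the asker's real column, which also holds 31-12-9999 and #VALUE!).
df_dob = pd.DataFrame(
    {"DOB": ["1967-02-20", "1925-12-06", "1855-01-01", "1855-01-01"]}
)

# Convert the strings to datetimes, then take the distinct years.
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
unique_years = df_dob["DOB"].dt.year.unique().tolist()

print(unique_years)  # [1967, 1925, 1855]
```

unique() keeps the order of first appearance, and tolist() turns the NumPy array into a plain Python list.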

Edit:

Another method: since you have an out-of-bounds problem, you can skip the datetime conversion altogether and instead use a regex to find the four-digit number in each value of the column. So,

df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()

The [0] is there because unique() is a method of pd.Series, not of a DataFrame, and str.extract returns a DataFrame. So we take the first column of that DataFrame.
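
As a rough sketch of the regex approach, the snippet below rebuilds the sample table from the question (the column values are copied from there; the real column may contain other shapes of data) and pulls the first run of four digits out of each value. Rows with no four-digit run, such as #VALUE!, come back as missing and are dropped before taking the unique values.

```python
import pandas as pd

# Sample data copied from the question's table.
df_dob = pd.DataFrame({
    "id": [23312, 1482, 807, 2201, 653, 108, 768],
    "DOB": ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
            "01/01/1855", "01/01/1855", "1967-02-20"],
})

# extract() returns a DataFrame with one column per capture group;
# column 0 holds the first four-digit match of each row (NaN if none).
years = df_dob["DOB"].astype(str).str.extract(r"(\d{4})")[0]

# Drop the rows without a match and keep distinct values in order of appearance.
unique_years = list(years.dropna().unique())

print(unique_years)  # ['9999', '1925', '1855', '1967']
```

Note that the years stay strings here, which happens to match the output format asked for in the question.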

Answer 2

Can you try this?

df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])

df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')

Answer 3

The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. Checking is as simple as calling df_dob.info()

If the result shows something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it to a datetime. You have a couple of different formats, so that might be part of your problem. Also, because there are several ways of doing this and it's a separate question, I'm not addressing it here.
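
For readers who do want to deal with the mixed formats, one possible sketch (not part of the original answer) is to try each expected format with the standard library's datetime.strptime, which accepts years up to 9999. The format list and the helper name year_of are assumptions based only on the sample rows shown in the question; real data with a time component would need extra formats.

```python
from datetime import datetime

import pandas as pd

# Sample data copied from the question's table.
df_dob = pd.DataFrame({
    "id": [23312, 1482, 807, 2201, 653, 108, 768],
    "DOB": ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
            "01/01/1855", "01/01/1855", "1967-02-20"],
})

# Assumed formats, guessed from the sample rows only.
KNOWN_FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d")


def year_of(text):
    """Hypothetical helper: return the year, or None if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).year
        except ValueError:
            continue
    return None


years = [year_of(value) for value in df_dob["DOB"]]

# Distinct years in order of first appearance, skipping unparseable rows.
unique_years = list(dict.fromkeys(y for y in years if y is not None))

# Values that matched none of the formats, worth inspecting by hand.
bad_values = [v for v, y in zip(df_dob["DOB"], years) if y is None]

print(unique_years)  # [9999, 1925, 1855, 1967]
print(bad_values)    # ['#VALUE!']
```

Collecting the unparseable values separately fits the stated goal of spotting unwanted entries, since both the suspicious year 9999 and the #VALUE! cell become visible.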

We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, since that's the final form you wanted.

years = list({i for i in df_dob['DOB'].dt.year})

And just a side note: you can't use [] instead of list(), as you'll end up with a list containing a single element, which is the set itself.

That's a list, as you indicated. If you want it as a column, you won't get unique values.

Nitish's answer will also work, but it will give you something like: array([9999, 1925, 1855, 1967])
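
The differences between these options can be sketched on a plain Series of years. The values below are made up to mirror the question, and they are stored as integers directly so that no date parsing is involved.

```python
import pandas as pd

# Hypothetical years, stored as plain integers to sidestep date parsing.
years = pd.Series([9999, 9999, 1925, 1855, 1855, 1967])

# unique() returns a NumPy array in order of first appearance;
# tolist() converts it to a regular Python list.
via_unique = years.unique().tolist()

# A set has no guaranteed order, so sort it if the order matters.
via_set = sorted(set(years))

# Wrapping the set in [] gives a one-element list holding the set itself.
wrapped = [set(years)]

print(via_unique)    # [9999, 1925, 1855, 1967]
print(via_set)       # [1855, 1925, 1967, 9999]
print(len(wrapped))  # 1
```

So list(...) or tolist() is needed whenever a flat list of years is the goal.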
