
Get unique years from a date column in pandas DataFrame

I have a date column in my DataFrame, say df_dob, and it looks like this:

id     DOB
23312  31-12-9999
1482   31-12-9999
807    #VALUE!
2201   06-12-1925
653    01/01/1855
108    01/01/1855
768    1967-02-20

What I want to print is a list of unique years, like ['9999', '1925', '1855', '1967'].

Basically, I want to use this list to check whether any unwanted year is present. I have tried the code pasted below, but I get ValueError: time data 01/01/1855 doesn't match format specified, and I could not resolve it.

df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))

PS: when I print df_dob['DOB'], I get values like 1967-02-20 00:00:00

Use pandas' unique for this, and apply it to the year only.

So try:

print(df_dob['DOB'].dt.year.unique())

Also, you don't need to convert the dates back to strings, and you don't need to replace anything; pandas is usually smart enough to do the parsing for you. So your overall code becomes:

df_dob['DOB'] = pd.to_datetime(df_dob.DOB)    # No need to pass format unless there is some specific anomaly
print(df_dob['DOB'].dt.year.unique())
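The sample in the question mixes several date formats and contains a non-date value (#VALUE!), so a single to_datetime call may still raise on it. Here is a minimal sketch, not a definitive fix, that tries each format in turn with errors="coerce" and collects the years that parse. The three format strings are my assumption, guessed from the sample rows, and whether 31-12-9999 parses or is coerced to NaT depends on the pandas version and its datetime resolution.

```python
import pandas as pd

# Sample rows copied from the question.
df_dob = pd.DataFrame({
    "id": [23312, 1482, 807, 2201, 653, 108, 768],
    "DOB": ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
            "01/01/1855", "01/01/1855", "1967-02-20"],
})

# Assumed formats, guessed from the sample; adjust them to the real data.
formats = ["%d-%m-%Y", "%d/%m/%Y", "%Y-%m-%d"]

# Start with "no year known" for every row, then fill in from each format.
years = pd.Series(float("nan"), index=df_dob.index)
for fmt in formats:
    # Values that do not match this format (or cannot be represented) become NaT.
    attempt = pd.to_datetime(df_dob["DOB"], format=fmt, errors="coerce")
    years = years.fillna(attempt.dt.year)

unique_years = sorted(years.dropna().astype(int).unique().tolist())

# Rows that no format could parse, worth inspecting by hand.
unparsed = df_dob.loc[years.isna(), "DOB"].unique().tolist()

print(unique_years)
print(unparsed)
```

The unparsed list is the useful part if the goal is to spot unwanted values: anything that ends up there did not match any of the assumed formats or fell outside the representable date range.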

Edit:

Another method: since you have an out-of-bounds problem, you can skip converting the values to datetime and instead find the four-digit number in each value of the column using a regex. So,

df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()

The [0] is there because unique() is a method of pd.Series, not of a DataFrame, and str.extract returns a DataFrame here. So we take the first (and only) column of that DataFrame.
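To make the regex route concrete, here is a small sketch on the sample values from the question. It uses expand=False so that str.extract returns a Series directly, and it drops the NaN produced by the row with no four-digit run. Note that the pattern picks the first run of four digits, which is only a year if the data really looks like the sample.

```python
import pandas as pd

# DOB values copied from the question, kept as plain strings.
dob = pd.Series(["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
                 "01/01/1855", "01/01/1855", "1967-02-20"])

# expand=False returns a Series for a single capture group.
years = dob.str.extract(r"(\d{4})", expand=False)

# "#VALUE!" has no four-digit run, so it yields NaN; drop it before listing.
unique_years = years.dropna().unique().tolist()

print(unique_years)  # ['9999', '1925', '1855', '1967']
```

This gives a list of strings in order of first appearance, which matches the output format asked for in the question.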

Can you try this?你能试试这个吗?

df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])

df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')

The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. Checking is as simple as calling df_dob.info().

If the result shows something like datetime64[ns] for the DOB column, you're good. If not, you'll need to cast it to a datetime. You have a couple of different formats, so that might be part of your problem. Because there are several ways of doing that and it's a separate question, I'm not addressing it here.
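If you would rather check the dtype in code than read the info() output, something like the following should work. It is only a sketch on a tiny made-up frame (the raw name and its two rows are illustrative, not from the question), using is_datetime64_any_dtype from pandas.api.types.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical two-row frame with clean ISO dates, for illustration only.
raw = pd.DataFrame({"DOB": ["1967-02-20", "1925-12-06"]})

# Before conversion the column holds strings, not datetimes.
is_dt_before = is_datetime64_any_dtype(raw["DOB"])

raw["DOB"] = pd.to_datetime(raw["DOB"], format="%Y-%m-%d")

# After conversion the column has a datetime64 dtype, so .dt works on it.
is_dt_after = is_datetime64_any_dtype(raw["DOB"])

print(is_dt_before, is_dt_after)
```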

We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, since that's the final form you wanted.

years = list({i for i in df_dob['DOB'].dt.year})

And just a side note: you can't use [] instead of list(), as you'll end up with a list containing a single element, which is the set itself.
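A quick sketch of that difference, on a few already-clean dates (the dob Series here is made up for illustration). It also shows .unique().tolist(), which gives a plain list while keeping the order of first appearance, something a set does not guarantee.

```python
import pandas as pd

# Illustrative, already-parseable dates; not the full sample from the question.
dob = pd.Series(pd.to_datetime(
    ["1925-12-06", "1855-01-01", "1855-01-01", "1967-02-20"],
    format="%Y-%m-%d",
))

year_set = {int(y) for y in dob.dt.year}

as_list = list(year_set)   # flat list of years, order not guaranteed
wrapped = [year_set]       # list with ONE element: the whole set

# Alternative that keeps first-appearance order.
via_unique = dob.dt.year.unique().tolist()

print(sorted(as_list))  # [1855, 1925, 1967]
print(len(wrapped))     # 1
print(via_unique)       # [1925, 1855, 1967]
```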

That gives you a list, as you asked for. If you want the years as a column instead, you won't get unique values.

Nitish's answer will also work, but it gives you something like: array([9999, 1925, 1855, 1967])
