Get unique years from a date column in pandas DataFrame
I have a date column in my DataFrame, say `df_dob`, and it looks like this:
id | DOB
---|---
23312 | 31-12-9999
1482 | 31-12-9999
807 | #VALUE!
2201 | 06-12-1925
653 | 01/01/1855
108 | 01/01/1855
768 | 1967-02-20
What I want to print is a list of unique years, like `['9999', '1925', '1855', '1967']`.
Basically, I just want to use this list to check whether any unwanted year is present. I have tried (my code is pasted below), but I get
ValueError: time data 01/01/1855 doesn't match format specified
and could not resolve it.
df_dob['DOB'] = df_dob['DOB'].replace('01/01/1855 00:00:00', '1855-01-01')
df_dob['DOB'] = pd.to_datetime(df_dob.DOB, format='%Y-%m-%d')
df_dob['DOB'] = df_dob['DOB'].dt.strftime('%Y-%m-%d')
print(np.unique(df_dob['DOB']))
# print(list(df_dob['DOB'].year.unique()))
PS: when I print `df_dob['DOB']`, I get values like 1967-02-20 00:00:00
Use pandas' `unique` for this, applied to the year only. So try:
print(df_dob['DOB'].dt.year.unique())
Also, you don't need to stringify your dates, and you don't need to replace anything; pandas can usually infer the format for you. So your overall code becomes:
df_dob['DOB'] = pd.to_datetime(df_dob.DOB)  # No need to pass format unless there is a specific anomaly
print(df_dob['DOB'].dt.year.unique())
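If some cells cannot be parsed at all (such as `#VALUE!` in the question), one possible variation is to pass `errors="coerce"`, so that those cells become `NaT` and can be dropped before taking the years. This is only a minimal sketch: the sample values and the explicit `format` are assumptions chosen so the example behaves the same way every time, and they are not part of the original answer.

```python
import pandas as pd

# Hypothetical sample: one consistent day-month-year format plus one unparseable cell.
df_dob = pd.DataFrame(
    {"DOB": ["06-12-1925", "#VALUE!", "20-02-1967", "01-01-1855", "01-01-1855"]}
)

# errors="coerce" turns anything that does not match the format into NaT.
parsed = pd.to_datetime(df_dob["DOB"], format="%d-%m-%Y", errors="coerce")

# Drop the NaT rows, then take the unique years in order of first appearance.
years = parsed.dropna().dt.year.unique()
print(years)
```

Columns that really mix several date formats, as in the question, would need extra handling before this step.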
Edit:
Another method: since you have an out-of-bounds problem, you can skip the conversion to datetime and instead find the four-digit number in each value using a regex. So,
df_dob['DOB'].str.extract(r'(\d{4})')[0].unique()
The `[0]` is needed because `unique()` is a method of `pd.Series`, not of a DataFrame, and `str.extract` returns a DataFrame here. So we take the first column of that DataFrame, which is a Series.
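Applied to the sample values from the question, the regex approach could look like the sketch below. Here `expand=False` is used as an alternative to `[0]`: with a single capture group it makes `str.extract` return a Series directly. Rows with no four-digit run (such as `#VALUE!`) come back as missing values, so they are dropped. Note the assumption that the first four-digit run in each value is the year.

```python
import pandas as pd

# Sample data copied from the table in the question.
df_dob = pd.DataFrame({
    "id": [23312, 1482, 807, 2201, 653, 108, 768],
    "DOB": ["31-12-9999", "31-12-9999", "#VALUE!", "06-12-1925",
            "01/01/1855", "01/01/1855", "1967-02-20"],
})

# Extract the first run of four digits from each value as a Series.
extracted = df_dob["DOB"].str.extract(r"(\d{4})", expand=False)

# Drop rows without a match, then keep unique years in order of appearance.
years = list(extracted.dropna().unique())
print(years)  # ['9999', '1925', '1855', '1967']
```

Because nothing is converted to datetime, the year 9999 survives and can be spotted as an unwanted value.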
Can you try this?
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob['YOB'] = df_dob['DOB'].dt.strftime('%Y')
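As a rough illustration of how the `YOB` column can then be reduced to unique year strings, here is a small sketch. It assumes the dates parse cleanly; the sample values are illustrative ISO-format dates rather than the mixed data from the question.

```python
import pandas as pd

# Hypothetical, cleanly formatted sample values.
df_dob = pd.DataFrame({"DOB": ["1967-02-20", "1925-12-06", "1855-01-01", "1855-01-01"]})

df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
df_dob["YOB"] = df_dob["DOB"].dt.strftime("%Y")  # year as a string, e.g. '1967'

# Unique year strings in order of first appearance.
unique_years = list(df_dob["YOB"].unique())
print(unique_years)
```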
The first thing you need to know is whether the resulting values (which you said look like 1967-02-20 00:00:00) are datetimes or not. That's as simple as `df_dob.info()`.
If the result shows something like `datetime64[ns]` for the DOB column, you're good. If not, you'll need to cast it to a datetime. You have a couple of different formats, so that might be part of your problem. Also, because there are several ways of doing this and it's a separate question, I'm not addressing it here.
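If you would rather check the dtype programmatically than read the `info()` output, a minimal sketch using `pandas.api.types.is_datetime64_any_dtype` might look like this. The two sample values are made up for illustration.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical sample: dates stored as plain strings.
df_dob = pd.DataFrame({"DOB": ["1967-02-20", "1925-12-06"]})

# Strings are not a datetime dtype yet.
before = is_datetime64_any_dtype(df_dob["DOB"])

# After conversion the column has a datetime64 dtype.
df_dob["DOB"] = pd.to_datetime(df_dob["DOB"])
after = is_datetime64_any_dtype(df_dob["DOB"])

print(before, after)
```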
We're going to leverage the speed of sets, plus a bit of pandas, and then convert the result back to a list, since that is the final form you wanted.
years = list({i for i in df_dob['DOB'].dt.year})
And just a side note: you can't use `[]` instead of `list()`, as you'll end up with a list containing a single element that is a set.
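The difference between `list()` and `[]` can be seen in a short sketch. The sample dates are hypothetical and assumed to parse cleanly.

```python
import pandas as pd

# Hypothetical, cleanly formatted sample values with one duplicate year.
dob = pd.to_datetime(pd.Series(["1967-02-20", "1925-12-06", "1855-01-01", "1855-01-01"]))

# list(...) unpacks the set into a flat list of unique years.
years = list({int(y) for y in dob.dt.year})

# [...] wraps the whole set as a single list element instead.
wrapped = [{int(y) for y in dob.dt.year}]

print(sorted(years))  # [1855, 1925, 1967]
print(len(wrapped))   # 1
```

Sets are unordered, so sort the list if you need the years in a stable order.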
That gives you a list, as you asked for. If you want the years as a column instead, you won't get unique values.
Nitish's answer will also work, but it will give you something like: array([9999, 1925, 1855, 1967])