Pandas: converting columns to different dtypes
I am using pandas to create a DataFrame with some columns of type bool, int64 and datetime. For smaller datasets the dtypes are preserved, but for larger datasets pandas converts all of these to object. Does anyone know why it is doing this, and how I can explicitly set the types?
Reading the CSV:
twitterDataFrame = pandas.read_csv(DataSetLocation)
twitterDataFrame['CreatedAt'] = twitterDataFrame['CreatedAt'].map(lambda x: pandas.to_datetime(x,dayfirst=True))
twitterDataFrame['CreatedAtForCalculations'] = twitterDataFrame['CreatedAt']
twitterDataFrame['InReplyToStatusID'] = twitterDataFrame['InReplyToStatusID'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['InReplyToUserID'] = twitterDataFrame['InReplyToUserID'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['RetweetCount'] = twitterDataFrame['RetweetCount'].map(lambda x: x if pandas.notnull(x) else 0)
twitterDataFrame['FavouriteCount'] = twitterDataFrame['FavouriteCount'].map(lambda x: x if pandas.notnull(x) else 0)
twitterDataFrame['Hashtags'] = twitterDataFrame['Hashtags'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['URL'] = twitterDataFrame['URL'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['MediaURL'] = twitterDataFrame['MediaURL'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['MediaType'] = twitterDataFrame['MediaType'].map(lambda x: x if pandas.notnull(x) else False)
twitterDataFrame['UserMentionID'] = twitterDataFrame['UserMentionID'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['PossiblySensitive'] = twitterDataFrame['PossiblySensitive'].map(lambda x: x if pandas.notnull(x) else 'NoData')
When I print info(), this is what I get:
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21836 entries, 0 to 21835
Data columns (total 17 columns):
CreatedAt 21836 non-null object
ActualTweet 21836 non-null object
InReplyToStatusID 21836 non-null bool
InReplyToUserID 21836 non-null bool
UserID 21836 non-null object
RetweetCount 21836 non-null object
FavouriteCount 21836 non-null object
Hashtags 21836 non-null bool
URL 21836 non-null bool
MediaURL 21836 non-null bool
MediaType 21836 non-null object
UserMentionID 21836 non-null bool
PossiblySensitive 21836 non-null object
Language 21836 non-null object
Classifier 21836 non-null object
TweetLength 21836 non-null object
CreatedAtForCalculations 21836 non-null object
dtypes: bool(6), object(11)None
For smaller datasets, however, this works as it should and we get:
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8978 entries, 0 to 8977
Data columns (total 17 columns):
CreatedAt 8978 non-null datetime64[ns]
ActualTweet 8978 non-null object
InReplyToStatusID 8978 non-null bool
InReplyToUserID 8978 non-null bool
UserID 8978 non-null int64
RetweetCount 8978 non-null int64
FavouriteCount 8978 non-null int64
Hashtags 8978 non-null bool
URL 8978 non-null bool
MediaURL 8978 non-null bool
MediaType 8978 non-null object
UserMentionID 8978 non-null bool
PossiblySensitive 8978 non-null object
Language 8978 non-null object
Trustworthy 8978 non-null int64
TweetLength 8978 non-null int64
CreatedAtForCalculations 8978 non-null datetime64[ns]
dtypes: bool(6), datetime64[ns](2), int64(5), object(4)None
Does anyone know why this happens and what I can do to fix it?
Here's a nice way to convert an already existing frame's columns from object to something more useful. Normally you don't need to do this, as something like read_csv will do the conversions for you. But if you have mixed values, these conversions can fail.
In [13]: data = """21-01-2014,1
   ....: 31x01x2014,foo
   ....: 01-01-2014,2
   ....: hello,3"""

In [14]: df = pd.DataFrame.from_csv( StringIO(data), index_col=None, header=None )

In [15]: df
Out[15]:
            0    1
0  21-01-2014    1
1  31x01x2014  foo
2  01-01-2014    2
3       hello    3

In [16]: df.dtypes
Out[16]:
0    object
1    object
dtype: object

In [17]: df.convert_objects(convert_dates='coerce',convert_numeric=True)
Out[17]:
           0   1
0 2014-01-21   1
1        NaT NaN
2 2014-01-01   2
3        NaT   3

In [18]: df.convert_objects(convert_dates='coerce',convert_numeric=True).dtypes
Out[18]:
0    datetime64[ns]
1           float64
dtype: object
This will convert columns that 'look' like datetimes and numbers. It's possible that you want to limit this to certain columns and be a bit more selective. It will only attempt to convert object-dtype columns. Furthermore, this is implemented in Cython, so it will be quite fast.
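As far as I know, convert_objects was deprecated and later removed from pandas, so on a recent version the same idea would have to be written differently. The following is only a minimal sketch, assuming a recent pandas and the same sample data as above. It uses pd.to_datetime and pd.to_numeric with errors="coerce", applied to individually chosen columns, which also covers the "be a bit more selective" case. The name converted and the explicit format string are my own choices for illustration, not part of the original answer.

```python
import pandas as pd
from io import StringIO

data = """21-01-2014,1
31x01x2014,foo
01-01-2014,2
hello,3"""

# header=None: the sample has no header row, so columns are named 0 and 1
df = pd.read_csv(StringIO(data), header=None)

# Convert only the columns we choose; values that cannot be
# parsed become NaT (dates) or NaN (numbers) instead of raising.
converted = df.copy()
converted[0] = pd.to_datetime(df[0], format="%d-%m-%Y", errors="coerce")
converted[1] = pd.to_numeric(df[1], errors="coerce")

print(converted)
print(converted.dtypes)
```

Because the coerced numeric column contains NaN, it ends up as a float column, just like in the session above. If a column has no missing values after cleaning, something like astype("int64") could then be used to set the integer type explicitly.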
Maybe there is a better solution for finding values which can't be converted. Here is my solution using apply().
My test data:
import pandas as pd
from io import StringIO

data = '''21-01-2014
31x01x2014
01-01-2014
"Hello World"'''

# header=None: the sample has no header row, so the single column is named 0
df = pd.read_csv(StringIO(data), header=None)
print(df)
'''
             0
0   21-01-2014
1   31x01x2014
2   01-01-2014
3  Hello World
'''
I create a function which uses datetime.datetime.strptime() and try/except to catch (and print) incorrect dates.
from datetime import datetime

def test_datetime(x):
    try:
        # %m is the month; %M would be parsed as minutes
        datetime.strptime(x, "%d-%m-%Y")
    except (ValueError, TypeError):
        print('incorrect:', x)

Then I can use apply() to test all values in the column:

df[0].apply(test_datetime)
'''
incorrect: 31x01x2014
incorrect: Hello World
'''
But I can also add return True/False to the previous function:
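from datetime import datetime

def test_datetime(x):
    try:
        # %m is the month; %M would be parsed as minutes
        datetime.strptime(x, "%d-%m-%Y")
        return False  # value is a valid date
    except (ValueError, TypeError):
        return True   # value could not be parsed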
to use it this way and get the rows that failed to parse, together with their index:
print(df[ df[0].apply(test_datetime) ])
'''
             0
1   31x01x2014
3  Hello World
'''
and run other operations on these rows:
df[ df[0].apply(test_datetime) ] = '01-01-2000'
print(df)
'''
            0
0  21-01-2014
1  01-01-2000
2  01-01-2014
3  01-01-2000
'''
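On the question of whether there is a better way to find the values that can't be converted: one possible alternative, which is not part of the answer above, is to let pandas do the parsing for the whole column at once and then look at what came back as NaT. This is only a sketch under the assumption that all valid dates use the day-month-year format from the test data. The names s, parsed and bad are made up for the example.

```python
import pandas as pd

# Same sample values as in the test data above
s = pd.Series(["21-01-2014", "31x01x2014", "01-01-2014", "Hello World"])

# errors="coerce" turns values that do not match the format into NaT
parsed = pd.to_datetime(s, format="%d-%m-%Y", errors="coerce")

# True where the original value was present but could not be parsed
bad = parsed.isna() & s.notna()

# The offending values, together with their index
print(s[bad])
```

The mask bad can be used in the same way as the result of apply(test_datetime), for example to overwrite or drop the offending rows.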