I am using pandas to create a DataFrame with some columns of type bool, int64 and datetime. For smaller datasets the dtypes are preserved, but for larger datasets pandas converts all of these to object. Would anyone know why it is doing this, and how I can explicitly set the types when that happens?
Reading the CSV:
twitterDataFrame = pandas.read_csv(DataSetLocation)
twitterDataFrame['CreatedAt'] = twitterDataFrame['CreatedAt'].map(lambda x: pandas.to_datetime(x,dayfirst=True))
twitterDataFrame['CreatedAtForCalculations'] = twitterDataFrame['CreatedAt']
twitterDataFrame['InReplyToStatusID'] = twitterDataFrame['InReplyToStatusID'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['InReplyToUserID'] = twitterDataFrame['InReplyToUserID'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['RetweetCount'] = twitterDataFrame['RetweetCount'].map(lambda x: x if pandas.notnull(x) else 0)
twitterDataFrame['FavouriteCount'] = twitterDataFrame['FavouriteCount'].map(lambda x: x if pandas.notnull(x) else 0)
twitterDataFrame['Hashtags'] = twitterDataFrame['Hashtags'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['URL'] = twitterDataFrame['URL'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['MediaURL'] = twitterDataFrame['MediaURL'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['MediaType'] = twitterDataFrame['MediaType'].map(lambda x: x if pandas.notnull(x) else False)
twitterDataFrame['UserMentionID'] = twitterDataFrame['UserMentionID'].map(lambda x: True if pandas.notnull(x) else False)
twitterDataFrame['PossiblySensitive'] = twitterDataFrame['PossiblySensitive'].map(lambda x: x if pandas.notnull(x) else 'NoData')
When I print the info, this is what I get:
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21836 entries, 0 to 21835
Data columns (total 17 columns):
CreatedAt 21836 non-null object
ActualTweet 21836 non-null object
InReplyToStatusID 21836 non-null bool
InReplyToUserID 21836 non-null bool
UserID 21836 non-null object
RetweetCount 21836 non-null object
FavouriteCount 21836 non-null object
Hashtags 21836 non-null bool
URL 21836 non-null bool
MediaURL 21836 non-null bool
MediaType 21836 non-null object
UserMentionID 21836 non-null bool
PossiblySensitive 21836 non-null object
Language 21836 non-null object
Classifier 21836 non-null object
TweetLength 21836 non-null object
CreatedAtForCalculations 21836 non-null object
dtypes: bool(6), object(11)None
For smaller datasets, however, this works as it should and we get:
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8978 entries, 0 to 8977
Data columns (total 17 columns):
CreatedAt 8978 non-null datetime64[ns]
ActualTweet 8978 non-null object
InReplyToStatusID 8978 non-null bool
InReplyToUserID 8978 non-null bool
UserID 8978 non-null int64
RetweetCount 8978 non-null int64
FavouriteCount 8978 non-null int64
Hashtags 8978 non-null bool
URL 8978 non-null bool
MediaURL 8978 non-null bool
MediaType 8978 non-null object
UserMentionID 8978 non-null bool
PossiblySensitive 8978 non-null object
Language 8978 non-null object
Trustworthy 8978 non-null int64
TweetLength 8978 non-null int64
CreatedAtForCalculations 8978 non-null datetime64[ns]
dtypes: bool(6), datetime64[ns](2), int64(5), object(4)None
Would anyone know why this is and what I can do to fix it?
Here's a nice way to convert an already existing frame's columns from object to something more useful. Normally you don't need to do this, as something like read_csv will do the conversions for you. But if you have mixed values, these conversions can fail.
See docs here
In [13]: data = """21-01-2014,1
....: 31x01x2014,foo
....: 01-01-2014,2
....: hello,3"""
In [14]: df = pd.DataFrame.from_csv( StringIO(data), index_col=None, header=None )
In [15]: df
Out[15]:
0 1
0 21-01-2014 1
1 31x01x2014 foo
2 01-01-2014 2
3 hello 3
In [16]: df.dtypes
Out[16]:
0 object
1 object
dtype: object
In [17]: df.convert_objects(convert_dates='coerce',convert_numeric=True)
Out[17]:
0 1
0 2014-01-21 1
1 NaT NaN
2 2014-01-01 2
3 NaT 3
In [18]: df.convert_objects(convert_dates='coerce',convert_numeric=True).dtypes
Out[18]:
0 datetime64[ns]
1 float64
dtype: object
This will convert columns that 'look' like datetimes and numbers. It's possible that you want to limit this to certain columns and be a bit more selective. It will only attempt the conversion on object-type columns. Furthermore, this is implemented in Cython, so it will be quite fast.
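As far as I know, convert_objects and DataFrame.from_csv are no longer available in recent pandas releases. A minimal sketch of the same idea for Python 3 and a current pandas, assuming the sample data from above and the day-month-year format "%d-%m-%Y" (the format string is my assumption about the data): convert only the columns you pick, with pd.to_datetime and pd.to_numeric, and let errors="coerce" turn unparseable values into NaT / NaN.

```python
import io

import pandas as pd

data = """21-01-2014,1
31x01x2014,foo
01-01-2014,2
hello,3"""

# header=None because the sample has no header row; columns are labelled 0 and 1.
df = pd.read_csv(io.StringIO(data), header=None)

# Both columns contain mixed values, so neither is parsed as a date or a number yet.
print(df.dtypes)

# Convert selected columns only; values that cannot be parsed become NaT / NaN
# instead of raising an error.
df[0] = pd.to_datetime(df[0], format="%d-%m-%Y", errors="coerce")
df[1] = pd.to_numeric(df[1], errors="coerce")

print(df)
print(df.dtypes)
```

After the conversion, column 0 is a datetime column and column 1 is a float column, because NaN cannot be stored in a plain int64 column. Once the bad values have been dealt with, astype can be used to set a stricter type explicitly.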
Maybe there is a better solution for finding the values which can't be converted.
This is my solution using apply().
My test data:
import pandas as pd
from StringIO import StringIO
data = '''21-01-2014
31x01x2014
01-01-2014
"Hello World"'''
df = pd.DataFrame.from_csv( StringIO(data), index_col=None, header=None )
print df
'''
0
0 21-01-2014
1 31x01x2014
2 01-01-2014
3 Hello World
'''
I create a function which uses datetime.datetime.strptime() and try/except to catch (and print) incorrect dates.
from datetime import datetime
def test_datetime(x):
    try:
        # %m is the month; %M would parse the value as minutes
        datetime.strptime(x, "%d-%m-%Y")
    except (ValueError, TypeError):
        print 'incorrect:', x
Then I can use apply() to test all values in the column:
df[0].apply(test_datetime)
'''
incorrect: 31x01x2014
incorrect: Hello World
'''
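But I can add return True/False to the previous function, so that it returns True for values that are not valid dates:
from datetime import datetime
def test_datetime(x):
    try:
        # %m is the month; %M would parse the value as minutes
        datetime.strptime(x, "%d-%m-%Y")
        return False
    except (ValueError, TypeError):
        return True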
and use it this way to get the invalid rows together with their index:
print df[ df[0].apply(test_datetime) ]
'''
0
1 31x01x2014
3 Hello World
'''
and run other operations on these rows:
df[ df[0].apply(test_datetime) ] = '01-01-2000'
print df
'''
0
0 21-01-2014
1 01-01-2000
2 01-01-2014
3 01-01-2000
'''
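The same check can probably be done without a Python-level function per row. A minimal sketch for Python 3 and a current pandas, assuming the test data from above and the "%d-%m-%Y" format: parse the whole column with pd.to_datetime and errors="coerce", then use the resulting NaT values as a mask for the rows that failed. The placeholder date 2000-01-01 is an arbitrary choice for illustration.

```python
import io

import pandas as pd

data = '''21-01-2014
31x01x2014
01-01-2014
"Hello World"'''

df = pd.read_csv(io.StringIO(data), header=None)

# Parse the whole column at once; anything that does not match the format becomes NaT.
parsed = pd.to_datetime(df[0], format="%d-%m-%Y", errors="coerce")

# True for rows whose original value was present but could not be parsed.
bad = parsed.isna() & df[0].notna()

# Show the offending rows together with their index.
print(df[bad])

# Replace every NaT with the placeholder date and keep the column as datetimes.
df[0] = parsed.fillna(pd.Timestamp("2000-01-01"))
print(df)
```

Unlike the apply() version, this leaves the column with a datetime dtype rather than strings, which is usually what is wanted for later calculations.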