Pandas: Converting columns to different dtypes

I am using Pandas to create a data frame with columns of types bool, int64 and datetime. For smaller datasets the dtypes are preserved, but for larger datasets pandas converts all of these to object. Would anyone know why it is doing this, and how I can explicitly set the types?

Reading the CSV:

  twitterDataFrame = pandas.read_csv(DataSetLocation)

  twitterDataFrame['CreatedAt'] = twitterDataFrame['CreatedAt'].map(lambda x: pandas.to_datetime(x,dayfirst=True))
  twitterDataFrame['CreatedAtForCalculations'] = twitterDataFrame['CreatedAt']
  twitterDataFrame['InReplyToStatusID'] = twitterDataFrame['InReplyToStatusID'].map(lambda x: True if pandas.notnull(x) else False)
  twitterDataFrame['InReplyToUserID'] = twitterDataFrame['InReplyToUserID'].map(lambda x: True if pandas.notnull(x) else False)
  twitterDataFrame['RetweetCount'] = twitterDataFrame['RetweetCount'].map(lambda x: x if pandas.notnull(x) else 0)
  twitterDataFrame['FavouriteCount'] = twitterDataFrame['FavouriteCount'].map(lambda x: x if pandas.notnull(x) else 0)
  twitterDataFrame['Hashtags'] = twitterDataFrame['Hashtags'].map(lambda x: True if pandas.notnull(x) else False)
  twitterDataFrame['URL'] = twitterDataFrame['URL'].map(lambda x: True if pandas.notnull(x) else False)
  twitterDataFrame['MediaURL'] = twitterDataFrame['MediaURL'].map(lambda x: True if pandas.notnull(x) else False)
  twitterDataFrame['MediaType'] = twitterDataFrame['MediaType'].map(lambda x: x if pandas.notnull(x) else False)
  twitterDataFrame['UserMentionID'] = twitterDataFrame['UserMentionID'].map(lambda x: True if pandas.notnull(x) else False)
  twitterDataFrame['PossiblySensitive'] = twitterDataFrame['PossiblySensitive'].map(lambda x: x if pandas.notnull(x) else 'NoData')

When I print the info, this is what I get:

None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21836 entries, 0 to 21835
Data columns (total 17 columns):
CreatedAt                   21836 non-null object
ActualTweet                 21836 non-null object
InReplyToStatusID           21836 non-null bool
InReplyToUserID             21836 non-null bool
UserID                      21836 non-null object
RetweetCount                21836 non-null object
FavouriteCount              21836 non-null object
Hashtags                    21836 non-null bool
URL                         21836 non-null bool
MediaURL                    21836 non-null bool
MediaType                   21836 non-null object
UserMentionID               21836 non-null bool
PossiblySensitive           21836 non-null object
Language                    21836 non-null object
Classifier                  21836 non-null object
TweetLength                 21836 non-null object
CreatedAtForCalculations    21836 non-null object
dtypes: bool(6), object(11)None

For smaller datasets, however, this works as it should and we get:

None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8978 entries, 0 to 8977
Data columns (total 17 columns):
CreatedAt                   8978 non-null datetime64[ns]
ActualTweet                 8978 non-null object
InReplyToStatusID           8978 non-null bool
InReplyToUserID             8978 non-null bool
UserID                      8978 non-null int64
RetweetCount                8978 non-null int64
FavouriteCount              8978 non-null int64
Hashtags                    8978 non-null bool
URL                         8978 non-null bool
MediaURL                    8978 non-null bool
MediaType                   8978 non-null object
UserMentionID               8978 non-null bool
PossiblySensitive           8978 non-null object
Language                    8978 non-null object
Trustworthy                 8978 non-null int64
TweetLength                 8978 non-null int64
CreatedAtForCalculations    8978 non-null datetime64[ns]
dtypes: bool(6), datetime64[ns](2), int64(5), object(4)None

Would anyone know why this is and what I can do to fix it?

Here's a nice way to convert an already existing frame's columns from object to something more useful. Normally you don't need to do this, as something like read_csv will do the conversions for you. But if a column has mixed values, these conversions can fail.

See docs here

In [13]: data = """21-01-2014,1
   ....: 31x01x2014,foo
   ....: 01-01-2014,2
   ....: hello,3"""

In [14]: df = pd.DataFrame.from_csv( StringIO(data), index_col=None, header=None )

In [15]: df
Out[15]: 
            0    1
0  21-01-2014    1
1  31x01x2014  foo
2  01-01-2014    2
3       hello    3

In [16]: df.dtypes
Out[16]: 
0    object
1    object
dtype: object

In [17]: df.convert_objects(convert_dates='coerce',convert_numeric=True)
Out[17]: 
           0   1
0 2014-01-21   1
1        NaT NaN
2 2014-01-01   2
3        NaT   3

In [18]: df.convert_objects(convert_dates='coerce',convert_numeric=True).dtypes
Out[18]: 
0    datetime64[ns]
1           float64
dtype: object

This will convert columns that 'look' like datetimes and numbers. It's possible that you want to limit this to certain columns and be a bit more selective. It will only attempt object type columns. Furthermore, this is implemented in Cython, so it will be quite fast.
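Note that convert_objects and DataFrame.from_csv, as used in the session above, are no longer available in recent pandas releases. A rough equivalent, offered only as a sketch, is to apply pd.to_datetime and pd.to_numeric with errors='coerce' to the columns you pick. The column names `date` and `value` below are made up for illustration, and the date format is assumed to be day-month-year as in the sample data.

```python
from io import StringIO

import pandas as pd

data = """21-01-2014,1
31x01x2014,foo
01-01-2014,2
hello,3"""

# Read everything as strings first so nothing is guessed at load time.
# "date" and "value" are hypothetical column names chosen for this example.
df = pd.read_csv(StringIO(data), header=None, names=["date", "value"], dtype=str)

# Convert only the selected columns; values that cannot be parsed
# become NaT (dates) or NaN (numbers) instead of raising an error.
converted = df.copy()
converted["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y", errors="coerce")
converted["value"] = pd.to_numeric(df["value"], errors="coerce")

print(converted)
print(converted.dtypes)
```

Because each column is converted explicitly, this also covers the "be a bit more selective" case: columns you do not touch keep their original dtype.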

Maybe there is a better solution for finding values which can't be converted.

Here is my solution using apply().

My test data:

import pandas as pd
from io import StringIO  # Python 3; Python 2 used "from StringIO import StringIO"

data = '''21-01-2014
31x01x2014
01-01-2014
"Hello World"'''

# DataFrame.from_csv is gone from recent pandas; read_csv replaces it
df = pd.read_csv(StringIO(data), header=None)

print(df)

'''
             0
0   21-01-2014
1   31x01x2014
2   01-01-2014
3  Hello World
'''

I create a function that uses datetime.datetime.strptime() with try/except to catch (and print) incorrect dates.

from datetime import datetime

def test_datetime(x):
    try:
        # %m is the month; %M would parse the middle field as minutes
        datetime.strptime(x, "%d-%m-%Y")
    except (ValueError, TypeError):
        print('incorrect:', x)

Then I can use apply() to test all values in the column:

df[0].apply(test_datetime)

'''
incorrect: 31x01x2014
incorrect: Hello World
'''

The function can also return True/False instead of printing:

from datetime import datetime

def test_datetime(x):
    # Returns True when x is NOT a valid dd-mm-yyyy date
    try:
        datetime.strptime(x, "%d-%m-%Y")
        return False
    except (ValueError, TypeError):
        return True

Used this way, the result works as a boolean mask that selects the offending rows together with their index:

print(df[df[0].apply(test_datetime)])

'''
             0
1   31x01x2014
3  Hello World
'''

and run other operations on these rows:

# Overwrite only the invalid entries in column 0
df.loc[df[0].apply(test_datetime), 0] = '01-01-2000'

print(df)

'''
            0
0  21-01-2014
1  01-01-2000
2  01-01-2014
3  01-01-2000
'''
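As a possible alternative to calling strptime row by row, the unconvertible values can be located with a vectorized mask. This is only a sketch and not part of the answer above: it assumes a pandas version in which pd.to_datetime accepts errors='coerce', and the names `parsed` and `bad_mask` are made up here.

```python
from io import StringIO

import pandas as pd

data = '''21-01-2014
31x01x2014
01-01-2014
"Hello World"'''

df = pd.read_csv(StringIO(data), header=None)

# Unparseable strings become NaT instead of raising an error.
parsed = pd.to_datetime(df[0], format="%d-%m-%Y", errors="coerce")

# True where a value was present but could not be parsed as a date.
bad_mask = parsed.isna() & df[0].notna()

# Show the rows that failed to convert, with their index.
print(df[bad_mask])

# Store the parsed values; the column now has a datetime dtype (NaT for bad rows).
df[0] = parsed
print(df.dtypes)
```

The mask keeps the original strings available for inspection before the column is overwritten, so you can decide whether to repair, replace or drop those rows.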
