
How to specify dtype for pd.read_csv when there are no column headers?

I am currently writing code to analyse a large subset of data. I have used pandas to read the text files, and I am printing them with data.head(). I need to specify the dtype for 9 columns (the ninth one being null) because the process would be too memory-intensive otherwise, but I have no clue how to specify the dtype for columns lacking column headers. Would it be the same as specifying the dtype for named columns? For reference, my columns' data types would probably be as follows:

Column 1: Mixed as it contains alphanumeric characters

Column 2: Date in the format YY/MM/DD

Column 3: Time in Hours/Minutes/Seconds/Milliseconds

Column 4: Str

Column 5: Time

Column 6: Str

Column 7: Time

Column 8: Time

Column 9: Null

Here is an excerpt of the text file

Here is also an excerpt of my code

    import sys
    import os
    import glob
    import pandas as pd
    import numpy as np

    path = '/Users/MysteriousHo-Oh1231/Downloads/Datapoints1/*.txt'
    dataframes = []
    for filename in glob.iglob(path):
      data = pd.read_csv(filename, header=None, delimiter='\t',
                         dtype={0: object, 1: int, 2: int, 3: object,
                                4: int, 5: object, 6: int, 7: int, 8: None})
      print(data.head())

I tried the above code and it returned this error:

Please help me with this!

Define the following three conversion functions:

    def strToDate(tt):
        # Parse a YY/MM/DD string (year first) into a Timestamp
        return pd.to_datetime(tt, yearfirst=True)

    def strToTime(tt):
        # Parse an hours:minutes:seconds.milliseconds string
        # (12-hour clock, %I) into a datetime.time
        return pd.to_datetime(tt, format='%I:%M:%S.%f').time()

    def strToTime2(tt):
        # Interpret a number of seconds as a time of day
        return pd.Timestamp(float(tt), unit='s').time()

Then read your DataFrame, passing them as converters for the columns requiring "specialized" conversion:

    df = pd.read_csv('Input.csv', header=None,
                     converters={1: strToDate, 2: strToTime,
                                 4: strToTime2, 6: strToTime2,
                                 7: strToTime2})
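
Applied to the tab-separated files from the question, the same converters can be reused per file. A minimal sketch, assuming the converter functions above and the path from the question:

    import glob
    import pandas as pd

    path = '/Users/MysteriousHo-Oh1231/Downloads/Datapoints1/*.txt'
    dataframes = []
    for filename in glob.iglob(path):
        # header=None keeps integer column labels 0..8, so the
        # converter keys address columns by position
        data = pd.read_csv(filename, header=None, delimiter='\t',
                           converters={1: strToDate, 2: strToTime,
                                       4: strToTime2, 6: strToTime2,
                                       7: strToTime2})
        dataframes.append(data)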

When you print df.info(), then:

  • column 1 (date) is of datetime64[ns] type,
  • column 8 (NaNs) is of float64 type,
  • all other columns are of object type.

But don't be misled: in Pandas, the object dtype actually means "something other than a number or datetime".

When you retrieve individual values, e.g. df.iloc[0, 2], you will get datetime.time(11, 24, 31, 758000), and similarly for any cell from column 4, 6 or 7, so they are of just the required type.
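
A quick way to verify this, assuming the df read above:

    # The column dtype is reported as object, but each cell is a
    # plain datetime.time instance
    print(type(df.iloc[0, 2]))   # <class 'datetime.time'>
    print(df.iloc[0, 2])         # 11:24:31.758000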

Another solution based on Timedelta

Define converter functions as:

    def strToDate(tt):
        # Parse a YY/MM/DD string (year first) into a Timestamp
        return pd.to_datetime(tt, yearfirst=True)

    def strToTimeDelta(tt):
        # Interpret a number of seconds as a Timedelta
        return pd.Timedelta(float(tt), unit='s')

Read your DataFrame:

    df = pd.read_csv('Input.csv', header=None,
                     converters={1: strToDate, 2: pd.Timedelta,
                                 4: strToTimeDelta, 6: strToTimeDelta,
                                 7: strToTimeDelta})

(To convert column 2, you can use the native pandas function pd.Timedelta directly, since it accepts time strings.)
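
For instance, a sketch using the time format shown above:

    import pandas as pd

    # pd.Timedelta parses an hours:minutes:seconds.milliseconds string directly
    print(pd.Timedelta('11:24:31.758'))   # 0 days 11:24:31.758000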

Then, if you need to convert some Timedelta column (e.g. column 7) to the total number of seconds, including the fractional part, run:

    df[7].dt.seconds + df[7].dt.microseconds / 1e6
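
Equivalently, for values shorter than one day, the .dt accessor offers total_seconds(), which already includes the fractional part:

    # Total number of seconds, fractional part included
    df[7].dt.total_seconds()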

But the default result of reading columns 4, 6 and 7 is just float, i.e. the number of seconds.

They are conceptually times, but actually:

  • in the input file they are kept as text,
  • after read_csv they are floats keeping the number of seconds.

So why do you need any conversion of these columns?
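
In other words, if the raw numbers of seconds are sufficient, no converters are needed at all. A minimal sketch, assuming the tab-separated layout from the question and the same placeholder filename as above:

    import pandas as pd

    # Columns 4, 6 and 7 parse as float64 (numbers of seconds) by default;
    # the dtype hints only pin the purely textual columns to str
    data = pd.read_csv('Input.csv', header=None, delimiter='\t',
                       dtype={0: str, 3: str, 5: str})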
