I am currently writing code to analyse a large subset of data. I am using pandas to read the text files and printing the result with data.head(). I need to specify the dtype for 9 columns (the ninth one being null) because the process would otherwise be too memory intensive, but I have no clue how to specify the dtype for columns that lack column headers. Is it the same as specifying the dtype by column header? For reference, my columns' data types would probably be as follows:
Column 1: Mixed as it contains alphanumeric characters
Column 2: Date in the format YY/MM/DD
Column 3: Time in Hours/Minutes/Seconds/Milliseconds
Column 4: Str
Column 5: Time
Column 6: Str
Column 7: Time
Column 8: Time
Column 9: Null
Here is an excerpt of the text file
Here is also an excerpt of my code
import sys
import os
import glob
import pandas as pd
import numpy as np
path = '/Users/MysteriousHo-Oh1231/Downloads/Datapoints1/*.txt'
dataframes = []
for filename in glob.iglob(path):
    data = pd.read_csv(filename, header=None, delimiter='\t', dtype={0: object, 1: int, 2: int, 3: object, 4: int, 5: object, 6: int, 7: int, 8: None})
    print(data.head())
I tried the above code and it returned this error:
Please help me with this!
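Define the following three conversion functions:
def strToDate(tt):
    # Column 1: date string with the year first
    return pd.to_datetime(tt, yearfirst=True)
def strToTime(tt):
    # Column 2: clock time with a fractional part; %H is the 24-hour clock
    return pd.to_datetime(tt, format='%H:%M:%S.%f').time()
def strToTime2(tt):
    # Columns 4, 6 and 7: a number of seconds, converted to a time of day
    return pd.Timestamp(float(tt), unit='s').time()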
Then read your DataFrame, passing them as converters for the columns requiring "specialized" conversion:
df = pd.read_csv('Input.csv', header=None,
                 converters={1: strToDate, 2: strToTime,
                             4: strToTime2, 6: strToTime2, 7: strToTime2})
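The converter approach above can be tried without the original file. The following is a minimal sketch, assuming comma-separated input; the two sample rows, the SAMPLE name, the snake_case function names and the explicit %y/%m/%d date format are illustrative assumptions, since the real file contents are not shown in the question.

```python
import io

import pandas as pd

# Hypothetical sample rows (the real file is not shown); the trailing comma
# produces the empty ninth column mentioned in the question.
SAMPLE = (
    "A1B2,19/07/15,11:24:31.758,abc,12.5,def,3.25,100.0,\n"
    "C3D4,19/07/16,23:05:07.120,ghi,0.5,jkl,61.0,7.75,\n"
)


def str_to_date(tt):
    # Assumed layout: two-digit year, month, day
    return pd.to_datetime(tt, format="%y/%m/%d")


def str_to_time(tt):
    # 24-hour clock time with fractional seconds
    return pd.to_datetime(tt, format="%H:%M:%S.%f").time()


def seconds_to_time(tt):
    # A number of seconds, interpreted as a time of day
    return pd.Timestamp(float(tt), unit="s").time()


# With header=None the columns are labelled 0..8, so the converters
# dictionary is keyed by integer position.
df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    converters={
        1: str_to_date,
        2: str_to_time,
        4: seconds_to_time,
        6: seconds_to_time,
        7: seconds_to_time,
    },
)

print(df.head())
print(df.iloc[0, 2])
```

Converters are called once per cell, so on very large files this is likely to be slower than reading plain values and converting whole columns afterwards.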
When you print df.info(), the converted time columns are listed with dtype object.
But don't be misled. In pandas, the object dtype actually means "something other than a number or datetime".
When you retrieve an individual value, e.g. df.iloc[0,2], you will get:
datetime.time(11, 24, 31, 758000)
The same holds for any cell from column 4, 6 or 7, so they are of exactly the required type.
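The point about the object dtype can be checked in isolation. This is a small sketch with made-up values (the times Series is hypothetical, not taken from the question's data): a column of datetime.time values is reported as object, while each element keeps its real Python type.

```python
import datetime

import pandas as pd

# Hypothetical values standing in for a converted time column
times = pd.Series([
    datetime.time(11, 24, 31, 758000),
    datetime.time(9, 0, 0),
])

print(times.dtype)          # object
print(type(times.iloc[0]))  # <class 'datetime.time'>

# Object columns hold references to Python objects, so deep memory usage
# is worth checking when memory is a concern.
print(times.memory_usage(deep=True))
```

Because object columns store full Python objects, they usually take more memory than native numeric or timedelta columns, which matters given the memory concern in the question.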
Define converter functions as:
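def strToDate(tt):
    # Column 1: date string with the year first
    return pd.to_datetime(tt, yearfirst=True)
def strToTimeDelta(tt):
    # Columns 4, 6 and 7: a number of seconds as a duration
    return pd.Timedelta(float(tt), unit='s')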
Read your dataframe:
df = pd.read_csv('Input.csv', header=None, converters={ 1: strToDate,
2: pd.Timedelta, 4: strToTimeDelta, 6: strToTimeDelta, 7: strToTimeDelta })
(To convert column 2, use the native pandas function pd.Timedelta.)
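The Timedelta variant can be sketched the same way. The sample rows and the snake_case names below are assumptions for illustration; the explicit pd.to_timedelta step is an extra precaution added here, in case the converted columns come back with object dtype rather than timedelta64.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real file contents are not shown.
SAMPLE = (
    "A1B2,19/07/15,11:24:31.758,abc,12.5,def,3.25,100.0,\n"
    "C3D4,19/07/16,23:05:07.120,ghi,0.5,jkl,61.0,7.75,\n"
)


def str_to_date(tt):
    # Assumed layout: two-digit year, month, day
    return pd.to_datetime(tt, format="%y/%m/%d")


def seconds_to_timedelta(tt):
    # A number of seconds as a duration
    return pd.Timedelta(float(tt), unit="s")


df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    converters={
        1: str_to_date,
        2: pd.Timedelta,  # parses strings such as "11:24:31.758"
        4: seconds_to_timedelta,
        6: seconds_to_timedelta,
        7: seconds_to_timedelta,
    },
)

# Make sure the duration columns really are timedelta64, so that the
# .dt accessor is available regardless of how they were inferred.
for col in (2, 4, 6, 7):
    df[col] = pd.to_timedelta(df[col])

print(df.dtypes)
```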
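Then, if you need to convert a Timedelta column (e.g. column 7) to the total number of seconds, including the fractional part, run:
df[7].dt.total_seconds()
(The expression df[7].dt.seconds + df[7].dt.microseconds / 1e6 gives the same result only for durations shorter than one day, because dt.seconds does not include whole days.)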
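As a quick check of how the seconds are extracted, here is a sketch with two made-up durations (the durations Series is hypothetical), one of them longer than a day. It compares the component-based expression with dt.total_seconds().

```python
import pandas as pd

# Hypothetical durations: 100.25 s, and 90000.5 s (one day plus 3600.5 s)
durations = pd.Series(pd.to_timedelta([100.25, 90000.5], unit="s"))

# Component-based: .dt.seconds is only the seconds part within a day
partial = durations.dt.seconds + durations.dt.microseconds / 1e6

# Whole duration expressed in seconds, days included
total = durations.dt.total_seconds()

print(partial.tolist())
print(total.tolist())
```

The two results agree for the first value and differ for the second, where the component-based expression leaves out the whole day.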
But the default result of reading columns 4, 6 and 7 is just float, i.e. the number of seconds.
They are conceptually times, but actually:
So why do you need any conversion of these columns?
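To tie this back to the original question: with header=None the columns are labelled by integer position, so dtype can be keyed the same way as with named headers. The sketch below uses hypothetical comma-separated sample rows; the float32 choice and the usecols filter that drops the empty ninth column are assumptions about what might help with memory, not something the question specifies.

```python
import io

import pandas as pd

# Hypothetical sample rows; the real file contents are not shown.
SAMPLE = (
    "A1B2,19/07/15,11:24:31.758,abc,12.5,def,3.25,100.0,\n"
    "C3D4,19/07/16,23:05:07.120,ghi,0.5,jkl,61.0,7.75,\n"
)

# Default read: the seconds columns are inferred as float64.
default_df = pd.read_csv(io.StringIO(SAMPLE), header=None)
print(default_df.dtypes)

# Explicit dtypes keyed by column position (0-based), skipping the
# empty ninth column entirely via usecols.
typed_df = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    usecols=list(range(8)),
    dtype={0: str, 3: str, 5: str, 4: "float32", 6: "float32", 7: "float32"},
)
print(typed_df.dtypes)
```

Columns 1 and 2 are left as strings here; they could be converted afterwards with pd.to_datetime or pd.to_timedelta on the whole column if needed.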