Set Timestamp Column in CSV as Index and Parse Dates Using Python and Pandas

I have a Python script using pandas that reads web-scraped COVID-19 data from CSV files compressed in ZIP archives. The original source of the web-scraped data is https://github.com/statistikat/coronaDAT

I am having trouble with the Timestamp column that I load from the CSV files. The data appears to load properly into the DataFrame with all five columns from the original CSV files. The fifth column is the Timestamp of the data. When I use print(df_master.columns) I get the correct five columns, including Timestamp.

Here is what I get from the following print statements:

print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Bezirk           903 non-null    object
 1   Anzahl           903 non-null    int64
 2   Anzahl_Inzidenz  903 non-null    object
 3   GKZ              859 non-null    float64
 4   Timestamp        859 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
         Bezirk  Anzahl   Anzahl_Inzidenz    GKZ            Timestamp
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T06:00:00
87  Wien(Stadt)    2264  119,315453933642  900.0  2020-04-22T19:00:00
87  Wien(Stadt)    2243  118,208729316766  900.0  2020-04-22T12:00:00
87  Wien(Stadt)    2254   118,78844221132  900.0  2020-04-22T16:00:00
87  Wien(Stadt)    2242  118,156028144534  900.0  2020-04-22T09:00:00
87  Wien(Stadt)    2266  119,420856278106  900.0  2020-04-22T23:00:00
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T02:00:00
87  Wien(Stadt)    2256  118,893844555784  900.0  2020-04-22T18:00:00
87  Wien(Stadt)    2237  117,892522283373  900.0  2020-04-22T07:00:00
87  Wien(Stadt)    2244  118,261430488998  900.0  2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful

However, when I try to set the DataFrame index to the Timestamp column ( index_col=['Timestamp'] ) or parse the dates of the Timestamp column ( parse_dates=['Timestamp'] ), I get the following error message:

ValueError: Index Timestamp invalid

I tried specifying the exact columns in the CSV, but that didn't make a difference. Some of the CSV files being read may have missing values or blank strings in the Timestamp column. I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN, which would remove all rows with no value in the Timestamp column. I also tried setting the data type of the Timestamp column to datetime.

Set empty strings in Timestamp column to NaN and drop rows:

#replace empty strings in Timestamp column with NaN values
df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
df['Timestamp'].replace('  ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
df.dropna(subset=['Timestamp'], inplace=True)

Set data type to datetime:

pd.to_datetime(df['Timestamp'],errors='ignore')

When I do either of these two things, I get the error message:

KeyError: 'Timestamp'

Any ideas why I can't do anything with the Timestamp column, such as setting it as the index, parsing dates, or changing values in that column?

Here is the full code:

import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile


#set root path
rootPath = r"/Users/matt/test/"

#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'

#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'

#initialize variables
df_master = pd.DataFrame()
flag = False


#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern_ext):
        #create complete file name of ZIP file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            #if the filename starts with variable pattern_filename
            if text_file.filename.startswith(pattern_filename):
                df = pd.read_csv(zip_file.open(text_file.filename), 
                    delimiter = ';', 
                    header = 0, 
                    #index_col = 'Timestamp', 
                    #parse_dates = 'Timestamp'
                    )

                #set data type of Timestamp column to datetime
                #pd.to_datetime(df['Timestamp'],errors='ignore') 

                #replace empty strings in Timestamp column with NaN values
                #df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                #df['Timestamp'].replace('  ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values 
                #df.dropna(subset=['Timestamp'], inplace=True)


                #filter for Bezirk values that equal variable set_bezirk
                df_vienna = df[df['Bezirk'] == set_bezirk]

                ##filter for Timestamp values that equal variable set_time
                #df_vienna = df[df['Timestamp'] != 0]

                #insert filtered values for variable set_bezirk to dataframe df
                df = df_vienna
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])

#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')

#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)


#prepare data to export to csv
frame = df_master

#export to csv
try:
    frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")


#verify if the dataset is present
    #if not present, download data set from GitHub
    #if present, verify with GitHub if dataset is updated
        #update dataset

Use

df2 = pd.to_datetime(df_master['Timestamp'], format="%Y-%m-%dT%H:%M:%S")

to convert the Timestamp column to datetime values, then do your processing.
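
A possible explanation, judging only from the info() output in the question: df_master has 903 rows but only 859 non-null GKZ and Timestamp values, which is what pd.concat produces when some of the CSV files lack those columns altogether. If that is the case, index_col='Timestamp' and df['Timestamp'] would fail on exactly those files. Separately, df_master.set_index('Timestamp').sort_index(inplace=True, ...) sorts a temporary copy, because set_index returns a new DataFrame, so df_master itself is left unchanged. The sketch below skips files without a Timestamp column, parses the rest, and assigns the indexed, sorted result back. The helper name read_bezirke_csv and the two in-memory sample files are made up for illustration; they are not taken from the coronaDAT repository.

```python
import io

import pandas as pd


def read_bezirke_csv(handle):
    """Read one semicolon-delimited CSV and parse its Timestamp column.

    Returns None when the file has no Timestamp column at all
    (assumed here to be the cause of the ValueError / KeyError).
    """
    df = pd.read_csv(handle, delimiter=';', header=0)
    if 'Timestamp' not in df.columns:
        return None
    # Unparseable or missing values become NaT instead of raising
    df['Timestamp'] = pd.to_datetime(
        df['Timestamp'], format='%Y-%m-%dT%H:%M:%S', errors='coerce'
    )
    # Drop rows whose timestamp could not be parsed
    return df.dropna(subset=['Timestamp'])


# Hypothetical sample files: one with a Timestamp column, one without
with_timestamp = io.StringIO(
    "Bezirk;Anzahl;Anzahl_Inzidenz;GKZ;Timestamp\n"
    "Wien(Stadt);2264;119,315453933642;900;2020-04-22T19:00:00\n"
    "Wien(Stadt);2231;117,57631524998;900;2020-04-22T06:00:00\n"
)
without_timestamp = io.StringIO(
    "Bezirk;Anzahl;Anzahl_Inzidenz\n"
    "Wien(Stadt);100;5,27\n"
)

# Keep only the files that could be read with a Timestamp column
frames = [
    df
    for df in (read_bezirke_csv(with_timestamp), read_bezirke_csv(without_timestamp))
    if df is not None
]

df_master = pd.concat(frames)
# Assign the result back; set_index and sort_index both return new objects
df_master = df_master.set_index('Timestamp').sort_index()

print(df_master)
```

In the loop from the question, zip_file.open(text_file.filename) would be passed to the helper in place of the StringIO objects.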
