Set Timestamp Column in CSV as Index and Parse Dates Using Python and Pandas
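I have a Python script that uses pandas to read web-scraped COVID-19 data from CSV files compressed inside ZIP archives. The original source of the scraped data is: https://github.com/statistikat/coronaDAT
I am having trouble with the Timestamp column that I load from the CSV files. The data appears to load into the DataFrame correctly, with all five columns from the original CSV file. The fifth column is the timestamp of the data. When I run print(df_master.columns), I get the correct five columns, including Timestamp.
This is the output I get from: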
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Bezirk           903 non-null    object
 1   Anzahl           903 non-null    int64
 2   Anzahl_Inzidenz  903 non-null    object
 3   GKZ              859 non-null    float64
 4   Timestamp        859 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
         Bezirk  Anzahl   Anzahl_Inzidenz    GKZ            Timestamp
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T06:00:00
87  Wien(Stadt)    2264  119,315453933642  900.0  2020-04-22T19:00:00
87  Wien(Stadt)    2243  118,208729316766  900.0  2020-04-22T12:00:00
87  Wien(Stadt)    2254   118,78844221132  900.0  2020-04-22T16:00:00
87  Wien(Stadt)    2242  118,156028144534  900.0  2020-04-22T09:00:00
87  Wien(Stadt)    2266  119,420856278106  900.0  2020-04-22T23:00:00
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T02:00:00
87  Wien(Stadt)    2256  118,893844555784  900.0  2020-04-22T18:00:00
87  Wien(Stadt)    2237  117,892522283373  900.0  2020-04-22T07:00:00
87  Wien(Stadt)    2244  118,261430488998  900.0  2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful
However, when I try to set the DataFrame index to the Timestamp column (index_col=['Timestamp']) or parse the dates in the Timestamp column (parse_dates=['Timestamp']), I get the following error message:
ValueError: Index Timestamp invalid
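I tried specifying the exact columns in the CSV, but that made no difference. Some of the CSV files being read may have no value, or only an empty string, in the Timestamp column. I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN values, which should remove every row that has no value in the Timestamp column. I also tried setting the data type of the Timestamp column to datetime.
Setting empty strings in the Timestamp column to NaN and dropping those rows: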
#replace empty strings in Timestamp column with NaN values
df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
df['Timestamp'].replace(' ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
df.dropna(subset=['Timestamp'], inplace=True)
Setting the data type to datetime:
pd.to_datetime(df['Timestamp'],errors='ignore')
When I do either of these two things, I get the error message:
KeyError: 'Timestamp'
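Any ideas why I cannot do anything with the Timestamp column, such as setting it as the index, parsing its dates, or operating on its values?
Here is the complete code: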
import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile
#set root path
rootPath = r"/Users/matt/test/"
#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'
#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'
#initialize variables
df_master = pd.DataFrame()
flag = False
#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern_ext):
        #create complete file name of ZIP file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            #if the filename starts with variable file_name
            if text_file.filename.startswith(pattern_filename):
                df = pd.read_csv(zip_file.open(text_file.filename),
                                 delimiter = ';',
                                 header = 0,
                                 #index_col = 'Timestamp',
                                 #parse_dates = 'Timestamp'
                                 )
                #set data type of Timestamp column to datetime
                #pd.to_datetime(df['Timestamp'],errors='ignore')
                #replace empty strings in Timestamp column with NaN values
                #df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                #df['Timestamp'].replace(' ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values
                #df.dropna(subset=['Timestamp'], inplace=True)
                #filter for Bezirk values that equal variable set_bezirk
                df_vienna = df[df['Bezirk'] == set_bezirk]
                ##filter for Timestamp values that equal variable set_time
                #df_vienna = df[df['Timestamp'] != 0]
                #insert filtered values for variable set_bezirk to dataframe df
                df = df_vienna
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])
#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')
#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
#prepare date to export to csv
frame = df_master
#export to csv
try:
    frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")
#verify if the dataset is present
#if not present, download dataset from GitHub
#if present, verify with GitHub if dataset is updated
#update dataset
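One possible explanation for the ValueError and KeyError described above, offered as an assumption rather than a confirmed diagnosis: the info() output shows 903 rows but only 859 non-null Timestamp values, which would be consistent with some of the CSV files not containing a Timestamp column at all. The sketch below reproduces that situation with two small in-memory CSV strings. The strings, the second Anzahl value, and the helper read_bezirke are hypothetical stand-ins for the real Bezirke.csv files.

```python
import io
import pandas as pd

# Hypothetical stand-ins for two Bezirke.csv files: one with a Timestamp
# column and one (assumed older) file without it.
with_ts = "Bezirk;Anzahl;Timestamp\nWien(Stadt);2231;2020-04-22T06:00:00\n"
without_ts = "Bezirk;Anzahl\nWien(Stadt);2100\n"

def read_bezirke(text):
    """Read one semicolon-delimited CSV string (hypothetical helper)."""
    return pd.read_csv(io.StringIO(text), delimiter=";", header=0)

df_new = read_bezirke(with_ts)
df_old = read_bezirke(without_ts)

# A cheap per-file check before touching the column.
has_column = "Timestamp" in df_old.columns

# Accessing the missing column raises KeyError.
try:
    df_old["Timestamp"]
    key_error = False
except KeyError:
    key_error = True

# Asking read_csv to index on the missing column raises ValueError.
try:
    pd.read_csv(io.StringIO(without_ts), delimiter=";", index_col="Timestamp")
    value_error = False
except ValueError:
    value_error = True

# After concat, rows from the file without the column get NaN in Timestamp,
# so the combined frame still lists Timestamp among its columns.
combined = pd.concat([df_new, df_old])
missing_after_concat = int(combined["Timestamp"].isna().sum())

print(has_column, key_error, value_error, missing_after_concat)
```

If this is what happens, per-file operations on Timestamp fail for the files that lack the column, while the concatenated df_master shows the column because pd.concat fills the missing values with NaN.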
Use
df2 = pd.to_datetime(df_master['Timestamp'], format="%Y-%m-%dT%H:%M:%S")
to convert the Timestamp column, and then work with the result.
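A minimal sketch of how that conversion could be followed by setting and sorting the index is shown here. It uses a small hand-built df_master with three rows taken from the output in the question plus one hypothetical row without a timestamp. Dropping the rows without a timestamp and passing errors="coerce" are assumptions about the desired behavior, not part of the original answer. Note that pd.to_datetime, set_index, and sort_index all return new objects, so their results have to be assigned.

```python
import pandas as pd

# Small stand-in for df_master; the last row (no timestamp, Anzahl 2100)
# is hypothetical and mimics rows from files without a Timestamp value.
df_master = pd.DataFrame(
    {
        "Bezirk": ["Wien(Stadt)"] * 4,
        "Anzahl": [2231, 2264, 2243, 2100],
        "Timestamp": [
            "2020-04-22T06:00:00",
            "2020-04-22T19:00:00",
            "2020-04-22T12:00:00",
            None,
        ],
    }
)

# Convert the strings to datetime64; unparseable or missing values become NaT.
df_master["Timestamp"] = pd.to_datetime(
    df_master["Timestamp"], format="%Y-%m-%dT%H:%M:%S", errors="coerce"
)

# Drop rows without a timestamp, then index and sort by it.
df_master = (
    df_master.dropna(subset=["Timestamp"])
    .set_index("Timestamp")
    .sort_index()
)

print(df_master)
```

Doing the conversion once on the concatenated frame, rather than per file inside the loop, avoids touching files that may not have the column.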