[英]Set Timestamp Column in CSV as Index and Parse Dates Using Python and Pandas
I have a Python script using pandas that takes web-scraped data on COVID-19 from CSVs compressed in ZIP files.我有一个使用 pandas 的 Python 脚本,该脚本从 ZIP 文件中压缩的 CSV 中获取有关 COVID-19 的网络抓取数据。 This is original data source of web-scraped data: https://github.com/statistikat/coronaDAT
这是网络抓取数据的原始数据源: https://github.com/statistikat/coronaDAT
I am having trouble with the Timestamp column that I load from the CSV files.我从 CSV 文件加载的时间戳列遇到问题。 The data appears to load properly into the DataFrame with all five columns from the original CSV files.
数据似乎已正确加载到 DataFrame 中,其中所有五列来自原始 CSV 文件。 The fifth column is the Timestamp of the data.
第五列是数据的时间戳。 When I use
print(df_master.columns)
I get the correct five columns, including the Timestamp.当我使用
print(df_master.columns)
时,我得到了正确的五列,包括时间戳。
Here is what I get from这是我从中得到的
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bezirk 903 non-null object
1 Anzahl 903 non-null int64
2 Anzahl_Inzidenz 903 non-null object
3 GKZ 859 non-null float64
4 Timestamp 859 non-null object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
Bezirk Anzahl Anzahl_Inzidenz GKZ Timestamp
87 Wien(Stadt) 2231 117,57631524998 900.0 2020-04-22T06:00:00
87 Wien(Stadt) 2264 119,315453933642 900.0 2020-04-22T19:00:00
87 Wien(Stadt) 2243 118,208729316766 900.0 2020-04-22T12:00:00
87 Wien(Stadt) 2254 118,78844221132 900.0 2020-04-22T16:00:00
87 Wien(Stadt) 2242 118,156028144534 900.0 2020-04-22T09:00:00
87 Wien(Stadt) 2266 119,420856278106 900.0 2020-04-22T23:00:00
87 Wien(Stadt) 2231 117,57631524998 900.0 2020-04-22T02:00:00
87 Wien(Stadt) 2256 118,893844555784 900.0 2020-04-22T18:00:00
87 Wien(Stadt) 2237 117,892522283373 900.0 2020-04-22T07:00:00
87 Wien(Stadt) 2244 118,261430488998 900.0 2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful
However, when I try to set the DataFrame index to the Timestamp column ( index_col=['Timestamp']
), or parse the dates of the Timestamp column ( parse_dates=['Timestamp']
), I the following error message:但是,当我尝试将 DataFrame 索引设置为时间戳列 (
index_col=['Timestamp']
),或解析时间戳列的日期 ( parse_dates=['Timestamp']
) 时,出现以下错误消息:
ValueError: Index Timestamp invalid
I tried specifying the exact columns in the CSV, but that didn't make a difference.我尝试在 CSV 中指定确切的列,但这并没有什么不同。 Some of the CSV files being read may have no value or strings with no value in the Timestamp column.
正在读取的某些 CSV 文件可能没有值或时间戳列中没有值的字符串。 I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN, which would remove all rows with no value in the Timestamp column.
我尝试用 NaN 替换 Timestamp 列中的任何空字符串,然后删除所有 NaN,这将删除 Timestamp 列中没有值的所有行。 I also tried setting the data type for the Timestamp column to datetime.
我还尝试将 Timestamp 列的数据类型设置为 datetime。
Set empty strings in TimeStamp column to NaN and drop rows:将 TimeStamp 列中的空字符串设置为 NaN 并删除行:
#replace empty strings in Timestamp column with NaN values
df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
df['Timestamp'].replace(' ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
df.dropna(subset=['Timestamp'], inplace=True)
Set data type to datetime:将数据类型设置为日期时间:
pd.to_datetime(df['Timestamp'],errors='ignore')
When I do either of these two things, I get the error message:当我做这两件事中的任何一件时,我都会收到错误消息:
KeyError: 'Timestamp'
Any ideas why I can't do anything to the Timestamp column, like set as index, parse dates, or do anything to values in that column?任何想法为什么我不能对 Timestamp 列做任何事情,比如设置为索引、解析日期或对该列中的值做任何事情?
Here is the full code:这是完整的代码:
import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile
#set root path
rootPath = r"/Users/matt/test/"
#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'
#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'
#initialize variables
df_master = pd.DataFrame()
flag = False
#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
#filter files that match pattern of .zip
for filename in fnmatch.filter(files, pattern_ext):
#create complete file name of ZIP file
zip_file = ZipFile(os.path.join(root, filename))
for text_file in zip_file.infolist():
#if the filename starts with variable file_name
if text_file.filename.startswith(pattern_filename):
df = pd.read_csv(zip_file.open(text_file.filename),
delimiter = ';',
header = 0,
#index_col = 'Timestamp',
#parse_dates = 'Timestamp'
)
#set data type of Timestamp column to datetime
#pd.to_datetime(df['Timestamp'],errors='ignore')
#replace empty strings in Timestamp column with NaN values
#df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
#df['Timestamp'].replace(' ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
#df.dropna(subset=['Timestamp'], inplace=True)
#filter for Bezirk values that equal variable set_bezirk
df_vienna = df[df['Bezirk'] == set_bezirk]
##filter for Timestamp values that equal variable set_time
#df_vienna = df[df['Timestamp'] != 0]
#insert filtered values for variable set_bezirk to dataframe df
df = df_vienna
if not flag:
df_master = df
flag = True
else:
df_master = pd.concat([df_master, df])
#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')
#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
#prepare date to export to csv
frame = df_master
#export to csv
try:
frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
print("Export to CSV Successful")
except:
print("Export to CSV Failed")
#verify if the dataset is present
#if not present, download data set from GitHub
#if present, verfify with GitHUb if dataset is updated
#update dataset
Use利用
df2 = pd.to_datetime(df_master['Timestamp'], format="%Y-%m-%dT%H:%M:%S")
to convert to a timestamp column, then do your processing转换为时间戳列,然后进行处理
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.