Set Timestamp Column in CSV as Index and Parse Dates Using Python and Pandas
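I have a Python script that uses pandas to read web-scraped COVID-19 data from CSV files compressed inside ZIP archives. This is the original source of the scraped data: https://github.com/statistikat/coronaDAT
I am having trouble with the Timestamp column that I load from the CSV files. The data appears to load correctly into the DataFrame, with all five columns from the original CSV file. The fifth column is the timestamp of the data. When I call print(df_master.columns), I get the correct five columns, including Timestamp.
This is the output I get from: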
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 903 entries, 87 to 87
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Bezirk           903 non-null    object
 1   Anzahl           903 non-null    int64
 2   Anzahl_Inzidenz  903 non-null    object
 3   GKZ              859 non-null    float64
 4   Timestamp        859 non-null    object
dtypes: float64(1), int64(1), object(3)
memory usage: 42.3+ KB
None
         Bezirk  Anzahl   Anzahl_Inzidenz    GKZ            Timestamp
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T06:00:00
87  Wien(Stadt)    2264  119,315453933642  900.0  2020-04-22T19:00:00
87  Wien(Stadt)    2243  118,208729316766  900.0  2020-04-22T12:00:00
87  Wien(Stadt)    2254   118,78844221132  900.0  2020-04-22T16:00:00
87  Wien(Stadt)    2242  118,156028144534  900.0  2020-04-22T09:00:00
87  Wien(Stadt)    2266  119,420856278106  900.0  2020-04-22T23:00:00
87  Wien(Stadt)    2231   117,57631524998  900.0  2020-04-22T02:00:00
87  Wien(Stadt)    2256  118,893844555784  900.0  2020-04-22T18:00:00
87  Wien(Stadt)    2237  117,892522283373  900.0  2020-04-22T07:00:00
87  Wien(Stadt)    2244  118,261430488998  900.0  2020-04-22T13:00:00
Index(['Bezirk', 'Anzahl', 'Anzahl_Inzidenz', 'GKZ', 'Timestamp'], dtype='object')
Export to CSV Successful
However, when I try to set the DataFrame index to the Timestamp column (index_col=['Timestamp']) or parse the dates in the Timestamp column (parse_dates=['Timestamp']), I get the following error message:
ValueError: Index Timestamp invalid
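I tried specifying the exact columns in the CSV, but that made no difference. Some of the CSV files being read may have no value, or only an empty string, in the Timestamp column. I tried replacing any empty strings in the Timestamp column with NaN and then dropping all NaN values, which removes every row that has no value in the Timestamp column. I also tried setting the data type of the Timestamp column to datetime.
Setting empty strings in the Timestamp column to NaN and dropping the rows: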
#replace empty strings in Timestamp column with NaN values
df['Timestamp'].replace('', np.nan, inplace=True)
#replace whitespace in Timestamp column with NaN values
df['Timestamp'].replace(' ', np.nan, inplace=True)
#drop rows where Timestamp column has NaN values
df.dropna(subset=['Timestamp'], inplace=True)
Setting the data type to datetime:
pd.to_datetime(df['Timestamp'],errors='ignore')
When I do either of these, I get the error message:
KeyError: 'Timestamp'
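Any idea why I cannot do anything with the Timestamp column, such as setting it as the index, parsing its dates, or working with its values?
Here is the full code: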
import fnmatch
import os
import pandas as pd
import numpy as np
from zipfile import ZipFile
#set root path
rootPath = r"/Users/matt/test/"
#set file extension pattern - get all ZIPs with data from 10:00 AM
pattern_ext = '*00_orig_csv.zip'
#set file name - get all CSVs with data from Bezirke
pattern_filename = 'Bezirke.csv'
#set Bezirk to export to CSV
set_bezirk = 'Wien(Stadt)'
#initialize variables
df_master = pd.DataFrame()
flag = False
#crawl entire directory in root folder
for root, dirs, files in os.walk(rootPath):
    #filter files that match pattern of .zip
    for filename in fnmatch.filter(files, pattern_ext):
        #create complete file name of ZIP file
        zip_file = ZipFile(os.path.join(root, filename))
        for text_file in zip_file.infolist():
            #if the filename starts with variable pattern_filename
            if text_file.filename.startswith(pattern_filename):
                df = pd.read_csv(zip_file.open(text_file.filename),
                                 delimiter = ';',
                                 header = 0,
                                 #index_col = 'Timestamp',
                                 #parse_dates = 'Timestamp'
                                 )
                #set data type of Timestamp column to datetime
                #pd.to_datetime(df['Timestamp'],errors='ignore')
                #replace empty strings in Timestamp column with NaN values
                #df['Timestamp'].replace('', np.nan, inplace=True)
                #replace whitespace in Timestamp column with NaN values
                #df['Timestamp'].replace(' ', np.nan, inplace=True)
                #drop rows where Timestamp column has NaN values
                #df.dropna(subset=['Timestamp'], inplace=True)
                #filter for Bezirk values that equal variable set_bezirk
                df_vienna = df[df['Bezirk'] == set_bezirk]
                ##filter for Timestamp values that equal variable set_time
                #df_vienna = df[df['Timestamp'] != 0]
                #insert filtered values for variable set_bezirk to dataframe df
                df = df_vienna
                if not flag:
                    df_master = df
                    flag = True
                else:
                    df_master = pd.concat([df_master, df])
#sort index field Timestamp
df_master.set_index('Timestamp').sort_index(inplace=True, na_position='first')
#print master dataframe info
print(df_master.info())
print(df_master.head(10))
print(df_master.columns)
#prepare data to export to csv
frame = df_master
#export to csv
try:
    frame.to_csv( "combined_zip_Bezirk_Wien.csv", encoding='utf-8-sig')
    print("Export to CSV Successful")
except:
    print("Export to CSV Failed")
#verify if the dataset is present
#if not present, download data set from GitHub
#if present, verify with GitHub if dataset is updated
#update dataset
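A note on the output above: info() reports 903 rows but only 859 non-null values for GKZ and Timestamp. One possible explanation, which the post does not confirm, is that some of the Bezirke.csv files contain no Timestamp column at all, so pd.concat fills those rows with NaN. If that is the case, reading such a file with index_col='Timestamp' or accessing df['Timestamp'] on it would fail, which would be consistent with the ValueError and KeyError shown. The sketch below checks for the column before using it. The helper read_bezirke and the two in-memory sample files are hypothetical and stand in for the files inside the ZIP archives.

```python
import io
import pandas as pd

def read_bezirke(buffer):
    """Read one Bezirke CSV; return None if it has no Timestamp column.

    Hypothetical helper for illustration, not part of the original script.
    """
    df = pd.read_csv(buffer, delimiter=";", header=0)
    if "Timestamp" not in df.columns:
        # Skip files from before the Timestamp column existed (assumption).
        return None
    # Blank or unparseable values become NaT instead of raising.
    df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
    return df.dropna(subset=["Timestamp"])

# Made-up sample files: one with a Timestamp column, one without.
with_ts = io.StringIO(
    "Bezirk;Anzahl;Timestamp\nWien(Stadt);2231;2020-04-22T06:00:00\n"
)
without_ts = io.StringIO("Bezirk;Anzahl\nWien(Stadt);2100\n")

frames = [
    df for df in (read_bezirke(with_ts), read_bezirke(without_ts))
    if df is not None
]
df_master = pd.concat(frames).set_index("Timestamp").sort_index()
print(df_master)
```

Skipping files without the column is only one option. Keeping them and deriving a timestamp from the ZIP file name would be another, but the post does not describe the naming scheme in enough detail to sketch that here.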
Use
df2 = pd.to_datetime(df_master['Timestamp'], format="%Y-%m-%dT%H:%M:%S")
to convert the Timestamp column to datetime values, and then continue processing from there.
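As a minimal sketch of how that conversion could feed into the index, the example below assigns the converted values back to the column, drops rows without a timestamp, and then sets and sorts the index. The sample DataFrame is made up to mimic the shape of df_master, and errors="coerce" is an addition here, not part of the answer. Note that pd.to_datetime, set_index and sort_index all return new objects, so their results need to be assigned.

```python
import pandas as pd

# Hypothetical sample that mimics the shape of df_master in the question.
df_master = pd.DataFrame({
    "Bezirk": ["Wien(Stadt)", "Wien(Stadt)", "Wien(Stadt)"],
    "Anzahl": [2264, 2231, 2240],
    "Timestamp": ["2020-04-22T19:00:00", "2020-04-22T06:00:00", None],
})

# Convert the strings to datetime64 and store the result in the column.
# errors="coerce" turns missing or unparseable values into NaT.
df_master["Timestamp"] = pd.to_datetime(
    df_master["Timestamp"], format="%Y-%m-%dT%H:%M:%S", errors="coerce"
)

# Drop rows without a timestamp, then index and sort by Timestamp.
df_indexed = (
    df_master.dropna(subset=["Timestamp"])
    .set_index("Timestamp")
    .sort_index()
)
print(df_indexed)
```

With a DatetimeIndex in place, time-based selection such as df_indexed.loc["2020-04-22"] becomes available.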