CSV to List of Dictionaries - Better Way?
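I'm working on a function that takes the filename of a CSV, converts each line into a dictionary, and returns a list of the dictionaries created (so that later functions can iterate over and organize them). The code below does what I want, but I feel there has to be a better way.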
import re

def import_incidents(filename):
    """Imports CSV and returns list of dictionaries for each incident"""
    with open(filename, 'r') as file:
        data = file.read()
    data = data.split('\n')
    list_of_data = []
    headers = True
    for line in data:
        line = line.split('","')
        if headers == True:
            #Skip header and set to false
            headers = False
        elif len(line) == 1 or line[3] == '':
            #File always has a 1 length final line, skip it.
            #Events can leave blank policies, skip those too.
            pass
        else:
            temp_dict = {}
            temp_dict['id'] = re.sub('"', '', line[0])
            temp_dict['time'] = re.sub('GMT-0600','',line[1])
            temp_dict['source'] = line[2]
            temp_dict['policy'] = line[3]
            temp_dict['destination'] = line[5]
            temp_dict['status'] = line[10]
            list_of_data.append(temp_dict)
    return list_of_data

print(import_incidents('Incidents (Yesterday Only).csv'))
Sample CSV contents:
"ID","Incident Time","Source","Policies","Channel","Destination","Severity","Action","Maximum Matches","Transaction Size","Status",
"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Doe, John","Encrypted files","HTTPS","blah.blah.com","Medium","Permitted","0","47.7 KB","Closed - Authorized",
"1848446","29 Dec. 2015, 08:23:36 AM GMT-0600","Smith, Joe","","HTTP","google.com","Low","Permitted","0","775 B","Closed"
I'm afraid you have reinvented the csv.DictReader() class:
import csv
import re

def import_incidents(filename):
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file)
        for row in reader:
            # Skip empty rows and incidents with a blank policy.
            if not row or not row['Policies']:
                continue
            row['Incident Time'] = re.sub('GMT-0600', '', row['Incident Time'])
            yield row
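To see what DictReader() produces for the sample rows in the question, here is a minimal sketch that parses them from an in-memory string. The `io.StringIO` object stands in for the file, and the names `sample`, `rows`, and `kept` are only illustrative. Note that the trailing comma on the header line creates an extra column whose key is the empty string.

```python
import csv
import io

# The three sample lines from the question, as one in-memory "file".
sample = (
    '"ID","Incident Time","Source","Policies","Channel","Destination",'
    '"Severity","Action","Maximum Matches","Transaction Size","Status",\n'
    '"9511564","29 Dec. 2015, 08:33:59 AM GMT-0600","Doe, John",'
    '"Encrypted files","HTTPS","blah.blah.com","Medium","Permitted","0",'
    '"47.7 KB","Closed - Authorized",\n'
    '"1848446","29 Dec. 2015, 08:23:36 AM GMT-0600","Smith, Joe","","HTTP",'
    '"google.com","Low","Permitted","0","775 B","Closed"\n'
)

# DictReader takes its keys from the first line and handles the quoting,
# so the comma inside "Doe, John" does not split the field.
rows = list(csv.DictReader(io.StringIO(sample)))

# Keep only the incidents that have a non-empty policy.
kept = [row for row in rows if row['Policies']]
```

Because the function above is a generator, wrapping the call in `list(...)` in the same way gives back the list of dictionaries that the original function returned.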
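DictReader() relies on the header row for the dictionary keys. You can define your own dictionary keys with the fieldnames parameter to DictReader() (the fieldnames are matched to the columns in the file in order), but then the first line of the file is still read like any other row. You can use the next() function to skip that row (see "Skip the headers when editing a csv file using Python").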
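A minimal sketch of that variant follows, assuming the column layout from the sample in the question. The key names in `FIELDNAMES` and the function name `import_incidents_renamed` are hypothetical, chosen only for illustration. Since the sample lines end with a trailing comma, each full row has one more field than there are names; DictReader() stores such extras in a list under the key `None` by default, so the sketch drops that entry.

```python
import csv

# Hypothetical key names, matched to the file's columns in order.
FIELDNAMES = ['id', 'time', 'source', 'policy', 'channel', 'destination',
              'severity', 'action', 'max_matches', 'size', 'status']

def import_incidents_renamed(filename):
    """Return a list of dicts keyed by FIELDNAMES, skipping the header row."""
    with open(filename, 'r', newline='') as file:
        reader = csv.DictReader(file, fieldnames=FIELDNAMES)
        next(reader, None)  # the header line is read as data, so discard it
        incidents = []
        for row in reader:
            row.pop(None, None)  # drop the extra field from the trailing comma
            if not row['policy']:
                continue  # skip incidents with a blank policy
            row['time'] = row['time'].replace('GMT-0600', '').strip()
            incidents.append(row)
        return incidents
```

Calling `import_incidents_renamed('Incidents (Yesterday Only).csv')` would then return a list directly, as the original function did, rather than a generator.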
You can use pandas. It is fast and does the job in a few lines:
import pandas as pd
df = pd.read_csv('incidents.csv')
df['Incident Time'] = df['Incident Time'].str.replace('GMT-0600', '')
list_of_data = df.dropna(subset=['Policies']).to_dict(orient='records')
Now list_of_data contains:
[{'Action': 'Permitted',
'Channel': 'HTTPS',
'Destination': 'blah.blah.com',
'ID': 9511564,
'Incident Time': '29 Dec. 2015, 08:33:59 AM ',
'Maximum Matches': 0,
'Policies': 'Encrypted files',
'Severity': 'Medium',
'Source': 'Doe, John',
'Status': 'Closed - Authorized',
'Transaction Size': '47.7 KB',
'Unnamed: 11': nan}]
The .dropna(subset=['Policies']) call removes the rows that have NaNs in the Policies column, i.e. missing values.
If you don't need a list of dictionaries, keep the dataframe:
df = pd.read_csv('incidents.csv', parse_dates=[1]).dropna(subset=['Policies'])
This reads Incident Time as datetime64[ns] objects, which are very convenient to work with. The dataframe looks like this: