Parsing and Reorganizing CSV Files with Python
Python experts,
In the past I used Perl to process very large text files for data mining. Recently I decided to switch, because I believe Python makes it easier for me to read through my code and figure out what is going on. The unfortunate (or maybe fortunate?) thing about Python is that, compared to Perl, storing and organizing data is rather awkward: I can't create hashes through autovivification, and I can't easily sum the contents of a dictionary of dictionaries.
Perhaps there is an elegant solution to my problem.
I have hundreds of files, each containing a few hundred rows of data (all of it fits in memory). The goal is to combine these two files, subject to certain conditions:
For each level (only one level is shown below), I need to create a row for every defect class found across all the files. Not all files contain the same defect classes.
For each level and defect class, sum all the BEC and GEC values found across all the files.
The final output should look something like this (updated sample output; the earlier version had a typo):
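For what it's worth, Python's closest built-in analogue to Perl's autovivification is `collections.defaultdict`, which creates missing values on first access. A minimal sketch (the variable names and sample values are my own, taken from the data below):

```python
from collections import defaultdict

# A two-level "autovivifying" mapping: counts[level][defectClass] springs
# into existence as [0, 0] (BEC total, GEC total) on first access.
counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))

counts["1415PA"]["0"][0] += 262   # add a BEC value
counts["1415PA"]["0"][1] += 663   # add a GEC value
print(counts["1415PA"]["0"])      # [262, 663]
```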
Level, defectClass, BEC Total, GEC Total
1415PA,0,643,1991
1415PA,1,1994,6470
...and so on...
File one:
Level, defectClass, BEC, GEC
1415PA, 0, 262, 663
1415PA, 1, 1138, 4104
1415PA, 107, 2, 0
1415PA, 14, 3, 4
1415PA, 15, 1, 0
1415PA, 2, 446, 382
1415PA, 21, 5, 0
1415PA, 23, 10, 5
1415PA, 4, 3, 16
1415PA, 6, 52, 105
File two:
level, defectClass, BEC, GEC
1415PA, 0, 381, 1328
1415PA, 1, 856, 2366
1415PA, 107, 7, 11
1415PA, 14, 4, 1
1415PA, 2, 315, 202
1415PA, 23, 4, 7
1415PA, 4, 0, 2
1415PA, 6, 46, 42
1415PA, 7, 1, 7
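The combination rules above (a row for every defect class found in any file, with BEC and GEC summed across files) can be sketched with just the standard library. The inline strings here stand in for the two files, truncated to their first rows for brevity:

```python
import csv
from collections import defaultdict
from io import StringIO

# Stand-ins for file one and file two (first two rows of each).
fileone = "Level, defectClass, BEC, GEC\n1415PA, 0, 262, 663\n1415PA, 1, 1138, 4104\n"
filetwo = "level, defectClass, BEC, GEC\n1415PA, 0, 381, 1328\n1415PA, 1, 856, 2366\n"

totals = defaultdict(lambda: [0, 0])  # (level, defectClass) -> [BEC sum, GEC sum]
for text in (fileone, filetwo):
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row (its capitalization differs between files)
    for level, defect, bec, gec in reader:
        key = (level, int(defect))
        totals[key][0] += int(bec)
        totals[key][1] += int(gec)

for (level, defect), (bec, gec) in sorted(totals.items()):
    print("%s,%s,%s,%s" % (level, defect, bec, gec))
# 1415PA,0,643,1991
# 1415PA,1,1994,6470
```

A defect class that appears in only one file simply contributes its own values, since every key starts at `[0, 0]`.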
My biggest problem is that I can't sum across the dictionaries. Here is the code I have so far (which doesn't work):
import os
import sys

class AutoVivification(dict):
    """Implementation of perl's autovivification feature. Has features from both dicts and lists,
    dynamically generates new subitems as needed, and allows for working (somewhat) as a basic type.
    """
    def __getitem__(self, item):
        if isinstance(item, slice):
            d = AutoVivification()
            items = sorted(self.iteritems(), reverse=True)
            k, v = items.pop(0)
            while 1:
                if (item.start < k < item.stop):
                    d[k] = v
                elif k > item.stop:
                    break
                if item.step:
                    for x in range(item.step):
                        k, v = items.pop(0)
                else:
                    k, v = items.pop(0)
            return d
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

    def __add__(self, other):
        """If attempting addition, use our length as the 'value'."""
        return len(self) + other

    def __radd__(self, other):
        """If the other type does not support addition with us, this addition method will be tried."""
        return len(self) + other

    def append(self, item):
        """Add the item to the dict, giving it a higher integer key than any currently in use."""
        largestKey = sorted(self.keys())[-1]
        if isinstance(largestKey, str):
            self.__setitem__(0, item)
        elif isinstance(largestKey, int):
            self.__setitem__(largestKey + 1, item)

    def count(self, item):
        """Count the number of keys with the specified item."""
        return sum([1 for x in self.items() if x == item])

    def __eq__(self, other):
        """od.__eq__(y) <==> od==y. Comparison to another AV is order-sensitive
        while comparison to a regular mapping is order-insensitive. """
        if isinstance(other, AutoVivification):
            return len(self) == len(other) and self.items() == other.items()
        return dict.__eq__(self, other)

    def __ne__(self, other):
        """od.__ne__(y) <==> od!=y"""
        return not self == other

for filename in os.listdir('/Users/aleksarias/Desktop/DefectMatchingDatabase/'):
    if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
        continue
    path = '/Users/aleksarias/Desktop/DefectMatchingDatabase/' + filename
    for filename2 in os.listdir(path):
        if filename2[0] == '.':
            continue
        path2 = path + "/" + filename2
        techData = AutoVivification()
        for file in os.listdir(path2):
            if file[0:13] == 'SummaryRearr_':
                dataFile = path2 + '/' + file
                print('Location of file to read: ', dataFile, '\n')
                fh = open(dataFile, 'r')
                for lines in fh:
                    if lines[0:5] == 'level':
                        continue
                    lines = lines.strip()
                    elements = lines.split(',')
                    if techData[elements[0]][elements[1]]['BEC']:
                        techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                    else:
                        techData[elements[0]][elements[1]]['BEC'] = elements[2]
                    if techData[elements[0]][elements[1]]['GEC']:
                        techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                    else:
                        techData[elements[0]][elements[1]]['GEC'] = elements[3]
                    print(elements[0], elements[1], techData[elements[0]][elements[1]]['BEC'], techData[elements[0]][elements[1]]['GEC'])
        techSumPath = path + '/Summary_' + filename + '.csv'
        fh2 = open(techSumPath, 'w')
        for key1 in sorted(techData):
            for key2 in sorted(techData[key1]):
                BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
        print('Created file at:', techSumPath)
        input('Go check the file!!!!')
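The immediate failure mode in the code above is that the first BEC/GEC value for a key is stored as a plain string, so the later `.append` call has nothing list-like to work with. A minimal sketch of the fix, keeping values as lists from the start (a `collections.defaultdict` stands in for the `AutoVivification` class; the sample rows are the defect-class-0 lines from the two files):

```python
from collections import defaultdict

# techData[level][defectClass] always starts as {"BEC": [], "GEC": []},
# so .append() works on the first value as well as the later ones.
techData = defaultdict(lambda: defaultdict(lambda: {"BEC": [], "GEC": []}))

for line in ["1415PA, 0, 262, 663", "1415PA, 0, 381, 1328"]:
    level, defect, bec, gec = [e.strip() for e in line.split(",")]
    techData[level][defect]["BEC"].append(bec)
    techData[level][defect]["GEC"].append(gec)

print(sum(map(int, techData["1415PA"]["0"]["BEC"])))  # 643
```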
Thanks for taking a look at this!!!!
Alex
I'm going to suggest a different approach: if you're working with tabular data, you should look at the pandas
library. Your code becomes something like
import pandas as pd
filenames = "fileone.txt", "filetwo.txt" # or whatever
dfs = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    dfs.append(df)
df_comb = pd.concat(dfs)
df_totals = df_comb.groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)
which produces
dsm@notebook:~/coding/pand$ cat combined.csv
Level,defectClass,BEC,GEC
1415PA,0,643,1991
1415PA,1,1994,6470
1415PA,2,761,584
1415PA,4,3,18
1415PA,6,98,147
1415PA,7,1,7
1415PA,14,7,5
1415PA,15,1,0
1415PA,21,5,0
1415PA,23,14,12
1415PA,107,9,11
Here I've read every file into memory at the same time and combined them into one big DataFrame
(like an Excel sheet), but we could just as easily have done the groupby
operation file by file, so that we only ever need one file in memory at a time, if we liked.
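That file-by-file variant can be sketched as a running sum with `DataFrame.add(fill_value=0)`. The `StringIO` objects below stand in for the real files, using just the defect-class-0 rows plus one class unique to each file:

```python
import pandas as pd
from io import StringIO

# Stand-ins for file one and file two (a few rows of each).
sources = [
    StringIO("Level, defectClass, BEC, GEC\n1415PA, 0, 262, 663\n1415PA, 15, 1, 0\n"),
    StringIO("level, defectClass, BEC, GEC\n1415PA, 0, 381, 1328\n1415PA, 7, 1, 7\n"),
]

running = None
for src in sources:
    df = pd.read_csv(src, skipinitialspace=True).rename(columns={"level": "Level"})
    part = df.groupby(["Level", "defectClass"]).sum()
    # fill_value=0 keeps defect classes that appear in only one file.
    running = part if running is None else running.add(part, fill_value=0)

running = running.astype(int)  # .add with fill_value promotes to float
print(running.reset_index())
```

Only one file's DataFrame is alive at any moment; the running total grows to at most one row per (Level, defectClass) pair.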