Parsing and Reorganizing CSV Files with Python
Python experts,
In the past I used Perl to process very large text files for data mining. Recently I decided to switch, because I believe Python makes it easier for me to read through my code and figure out what is going on. The unfortunate (or maybe fortunate?) thing about Python is that, compared to Perl, storing and organizing data is rather awkward: I can't create hashes through autovivification, and I can't easily sum the contents of a dictionary of dictionaries.
Perhaps there is an elegant solution to my problem.
I have hundreds of files, each containing a few hundred rows of data (all of it fits in memory). The goal is to combine these two files, subject to certain conditions:
For each level (only one level is shown below), I need to create a row for every defect class found across all the files. Not all files contain the same defect classes.
For each level and defect class, sum all the BEC and GEC values found across all the files.
The final output should look something like this (updated sample output; the earlier version had a typo):
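For what it's worth, Python's closest built-in analogue to Perl's autovivification is `collections.defaultdict`, which creates missing values on first access. A minimal sketch (the variable names and sample values are my own, taken from the data below):

```python
from collections import defaultdict

# A two-level "autovivifying" mapping: counts[level][defectClass] springs
# into existence as [0, 0] (BEC total, GEC total) on first access.
counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))

counts["1415PA"]["0"][0] += 262   # add a BEC value
counts["1415PA"]["0"][1] += 663   # add a GEC value
print(counts["1415PA"]["0"])      # [262, 663]
```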
Level, defectClass, BEC Total, GEC Total
1415PA,0,643,1991
1415PA,1,1994,6470
...and so on...
File one:
Level, defectClass, BEC, GEC
1415PA, 0, 262, 663
1415PA, 1, 1138, 4104
1415PA, 107, 2, 0
1415PA, 14, 3, 4
1415PA, 15, 1, 0
1415PA, 2, 446, 382
1415PA, 21, 5, 0
1415PA, 23, 10, 5
1415PA, 4, 3, 16
1415PA, 6, 52, 105
File two:
level, defectClass, BEC, GEC
1415PA, 0, 381, 1328
1415PA, 1, 856, 2366
1415PA, 107, 7, 11
1415PA, 14, 4, 1
1415PA, 2, 315, 202
1415PA, 23, 4, 7
1415PA, 4, 0, 2
1415PA, 6, 46, 42
1415PA, 7, 1, 7
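The combination rules above (a row for every defect class found in any file, with BEC and GEC summed across files) can be sketched with just the standard library. The inline strings here stand in for the two files, truncated to their first rows for brevity:

```python
import csv
from collections import defaultdict
from io import StringIO

# Stand-ins for file one and file two (first two rows of each).
fileone = "Level, defectClass, BEC, GEC\n1415PA, 0, 262, 663\n1415PA, 1, 1138, 4104\n"
filetwo = "level, defectClass, BEC, GEC\n1415PA, 0, 381, 1328\n1415PA, 1, 856, 2366\n"

totals = defaultdict(lambda: [0, 0])  # (level, defectClass) -> [BEC sum, GEC sum]
for text in (fileone, filetwo):
    reader = csv.reader(StringIO(text), skipinitialspace=True)
    next(reader)  # skip the header row (its capitalization differs between files)
    for level, defect, bec, gec in reader:
        key = (level, int(defect))
        totals[key][0] += int(bec)
        totals[key][1] += int(gec)

for (level, defect), (bec, gec) in sorted(totals.items()):
    print("%s,%s,%s,%s" % (level, defect, bec, gec))
# 1415PA,0,643,1991
# 1415PA,1,1994,6470
```

A defect class that appears in only one file simply contributes its own values, since every key starts at `[0, 0]`.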
My biggest problem is that I can't sum across the dictionaries. Here is the code I have so far (which doesn't work):
import os
import sys

class AutoVivification(dict):
    """Implementation of perl's autovivification feature. Has features from both dicts and lists,
    dynamically generates new subitems as needed, and allows for working (somewhat) as a basic type.
    """
    def __getitem__(self, item):
        if isinstance(item, slice):
            d = AutoVivification()
            items = sorted(self.iteritems(), reverse=True)
            k, v = items.pop(0)
            while 1:
                if (item.start < k < item.stop):
                    d[k] = v
                elif k > item.stop:
                    break
                if item.step:
                    for x in range(item.step):
                        k, v = items.pop(0)
                else:
                    k, v = items.pop(0)
            return d
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

    def __add__(self, other):
        """If attempting addition, use our length as the 'value'."""
        return len(self) + other

    def __radd__(self, other):
        """If the other type does not support addition with us, this addition method will be tried."""
        return len(self) + other

    def append(self, item):
        """Add the item to the dict, giving it a higher integer key than any currently in use."""
        largestKey = sorted(self.keys())[-1]
        if isinstance(largestKey, str):
            self.__setitem__(0, item)
        elif isinstance(largestKey, int):
            self.__setitem__(largestKey + 1, item)

    def count(self, item):
        """Count the number of keys with the specified item."""
        return sum([1 for x in self.items() if x == item])

    def __eq__(self, other):
        """od.__eq__(y) <==> od==y. Comparison to another AV is order-sensitive
        while comparison to a regular mapping is order-insensitive. """
        if isinstance(other, AutoVivification):
            return len(self) == len(other) and self.items() == other.items()
        return dict.__eq__(self, other)

    def __ne__(self, other):
        """od.__ne__(y) <==> od!=y"""
        return not self == other

for filename in os.listdir('/Users/aleksarias/Desktop/DefectMatchingDatabase/'):
    if filename[0] == '.' or filename == 'YieldToDefectDatabaseJan2014Continued.csv':
        continue
    path = '/Users/aleksarias/Desktop/DefectMatchingDatabase/' + filename
    for filename2 in os.listdir(path):
        if filename2[0] == '.':
            continue
        path2 = path + "/" + filename2
        techData = AutoVivification()
        for file in os.listdir(path2):
            if file[0:13] == 'SummaryRearr_':
                dataFile = path2 + '/' + file
                print('Location of file to read: ', dataFile, '\n')
                fh = open(dataFile, 'r')
                for lines in fh:
                    if lines[0:5] == 'level':
                        continue
                    lines = lines.strip()
                    elements = lines.split(',')
                    if techData[elements[0]][elements[1]]['BEC']:
                        techData[elements[0]][elements[1]]['BEC'].append(elements[2])
                    else:
                        techData[elements[0]][elements[1]]['BEC'] = elements[2]
                    if techData[elements[0]][elements[1]]['GEC']:
                        techData[elements[0]][elements[1]]['GEC'].append(elements[3])
                    else:
                        techData[elements[0]][elements[1]]['GEC'] = elements[3]
                    print(elements[0], elements[1], techData[elements[0]][elements[1]]['BEC'], techData[elements[0]][elements[1]]['GEC'])
        techSumPath = path + '/Summary_' + filename + '.csv'
        fh2 = open(techSumPath, 'w')
        for key1 in sorted(techData):
            for key2 in sorted(techData[key1]):
                BECtotal = sum(map(int, techData[key1][key2]['BEC']))
                GECtotal = sum(map(int, techData[key1][key2]['GEC']))
                fh2.write('%s,%s,%s,%s\n' % (key1, key2, BECtotal, GECtotal))
        print('Created file at:', techSumPath)
        input('Go check the file!!!!')
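The immediate failure mode in the code above is that the first BEC/GEC value for a key is stored as a plain string, so the later `.append` call has nothing list-like to work with. A minimal sketch of the fix, keeping values as lists from the start (a `collections.defaultdict` stands in for the `AutoVivification` class; the sample rows are the defect-class-0 lines from the two files):

```python
from collections import defaultdict

# techData[level][defectClass] always starts as {"BEC": [], "GEC": []},
# so .append() works on the first value as well as the later ones.
techData = defaultdict(lambda: defaultdict(lambda: {"BEC": [], "GEC": []}))

for line in ["1415PA, 0, 262, 663", "1415PA, 0, 381, 1328"]:
    level, defect, bec, gec = [e.strip() for e in line.split(",")]
    techData[level][defect]["BEC"].append(bec)
    techData[level][defect]["GEC"].append(gec)

print(sum(map(int, techData["1415PA"]["0"]["BEC"])))  # 643
```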
Thanks for taking a look at this!!!!
Alex
I'm going to suggest a different approach: if you're working with tabular data, you should look at the pandas
library. Your code becomes something like
import pandas as pd
filenames = "fileone.txt", "filetwo.txt" # or whatever
dfs = []
for filename in filenames:
    df = pd.read_csv(filename, skipinitialspace=True)
    df = df.rename(columns={"level": "Level"})
    dfs.append(df)
df_comb = pd.concat(dfs)
df_totals = df_comb.groupby(["Level", "defectClass"], as_index=False).sum()
df_totals.to_csv("combined.csv", index=False)
which produces
dsm@notebook:~/coding/pand$ cat combined.csv
Level,defectClass,BEC,GEC
1415PA,0,643,1991
1415PA,1,1994,6470
1415PA,2,761,584
1415PA,4,3,18
1415PA,6,98,147
1415PA,7,1,7
1415PA,14,7,5
1415PA,15,1,0
1415PA,21,5,0
1415PA,23,14,12
1415PA,107,9,11
Here I've read every file into memory at the same time and combined them into one big DataFrame
(like an Excel sheet), but we could just as easily have done the groupby
operation file by file, so that we only ever need one file in memory at a time, if we liked.
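That file-by-file variant can be sketched as a running sum with `DataFrame.add(fill_value=0)`. The `StringIO` objects below stand in for the real files, using just the defect-class-0 rows plus one class unique to each file:

```python
import pandas as pd
from io import StringIO

# Stand-ins for file one and file two (a few rows of each).
sources = [
    StringIO("Level, defectClass, BEC, GEC\n1415PA, 0, 262, 663\n1415PA, 15, 1, 0\n"),
    StringIO("level, defectClass, BEC, GEC\n1415PA, 0, 381, 1328\n1415PA, 7, 1, 7\n"),
]

running = None
for src in sources:
    df = pd.read_csv(src, skipinitialspace=True).rename(columns={"level": "Level"})
    part = df.groupby(["Level", "defectClass"]).sum()
    # fill_value=0 keeps defect classes that appear in only one file.
    running = part if running is None else running.add(part, fill_value=0)

running = running.astype(int)  # .add with fill_value promotes to float
print(running.reset_index())
```

Only one file's DataFrame is alive at any moment; the running total grows to at most one row per (Level, defectClass) pair.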