How do I read a file that contains multiple record types?

Question

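I have a .csv file that contains 3 types of records, each with a different number of columns.

I know the structure of each record type, and the rows always come in order: type 1 first, then type 2, then type 3. What I don't know is how many rows there are of each record type.

The first 4 characters of each row define that row's record type.

CSV example: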

typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004

How can I read it with Pandas? It doesn't matter if I have to read one record type at a time.

Thanks!!

Answer 1

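Here is a pandas solution.

First, we have to read the csv file in such a way that pandas keeps each whole line in a single cell. We do that simply by using a wrong separator, such as the 'at' sign '@'. It can be anything we like, as long as we are sure it never appears in our data file.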

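import pandas as pd

wrong_sep = '@'
right_sep = ','

# header=None keeps the first record as data instead of using it as a header row
df = pd.read_csv('my_file.csv', sep=wrong_sep, header=None).iloc[:, 0]

.iloc[:, 0] is used as a quick way to convert the DataFrame into a Series. Without header=None, pandas would treat the first line of the file as column names and the first typ1 record would be lost.

Then we use a loop to select the rows that belong to each data structure based on their starting characters. Now we use the "right separator" (probably a comma ',') to split the selected data into real DataFrames.

starters = ['typ1', 'typ2', 'typ3']
detected_dfs = dict()

for start in starters:
    # keep only the lines of this record type, then split them into columns
    _df = df[df.str.startswith(start)].str.split(right_sep, expand=True)

    detected_dfs[start] = _df

There you go. If we print the resulting DataFrames, we get:

      0      1       2   3  4        5
0  typ1   John   Smith  40  M   Single
1  typ1  Harry  Potter  22  M  Married
2  typ1    Eva   Adams  35  F   Single

      0     1   2   3  4
3  typ2  2020  08  16  A
4  typ2  2020  09  02  A

      0          1        2     3
5  typ3  Chevrolet  FC101TT  2017
6  typ3     Toyota  CE972SY  2004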
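
Since the question says the structure of each record type is known, the split frames can also be given column names. The following is only a minimal sketch: the column names in SCHEMAS are hypothetical (the question never names the fields), and io.StringIO stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd

# Sample data copied from the question; StringIO stands in for the real file.
SAMPLE = """typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
"""

# Hypothetical column names: the question does not say what the fields mean.
SCHEMAS = {
    'typ1': ['record_type', 'first_name', 'last_name', 'age', 'sex', 'marital_status'],
    'typ2': ['record_type', 'year', 'month', 'day', 'code'],
    'typ3': ['record_type', 'make', 'plate', 'year'],
}

# '@' never occurs in the sample, so every line lands in a single cell.
lines = pd.read_csv(io.StringIO(SAMPLE), sep='@', header=None).iloc[:, 0]

named_dfs = {}
for rec_type, columns in SCHEMAS.items():
    part = lines[lines.str.startswith(rec_type)].str.split(',', expand=True)
    part.columns = columns
    named_dfs[rec_type] = part.reset_index(drop=True)

# str.split leaves every value as a string; convert columns where needed.
named_dfs['typ1']['age'] = named_dfs['typ1']['age'].astype(int)
```

Keeping the values as strings until an explicit conversion means fields such as '08' are not silently turned into 8.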

Let me know if this helps!

Answer 2

Not pandas:

from collections import defaultdict

filename2 = 'Types.txt'

# Group the raw lines by the record type they start with.
defDList = defaultdict(list)
subs = ['typ1', 'typ2', 'typ3']

with open(filename2) as dataLines:
    for line in dataLines.read().splitlines():
        for sub in subs:
            if line.startswith(sub):
                defDList[sub].append(line)

print(defDList)

Output:

defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})
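
The grouped values above are still raw text lines. One possible next step, sketched here under the assumption that the fields contain nothing more exotic than the sample shows, is to hand each group to csv.reader, which accepts any iterable of strings. The names grouped and rows_by_type are introduced only for this illustration, and the lines are hard-coded from the question instead of being read from a file.

```python
import csv
from collections import defaultdict

# Lines copied from the question's sample, standing in for the file contents.
lines = [
    'typ1,John,Smith,40,M,Single',
    'typ1,Harry,Potter,22,M,Married',
    'typ1,Eva,Adams,35,F,Single',
    'typ2,2020,08,16,A',
    'typ2,2020,09,02,A',
    'typ3,Chevrolet,FC101TT,2017',
    'typ3,Toyota,CE972SY,2004',
]

# The question states that the first 4 characters identify the record type.
grouped = defaultdict(list)
for line in lines:
    grouped[line[:4]].append(line)

# Parse each group of raw lines into lists of fields.
rows_by_type = {typ: list(csv.reader(group)) for typ, group in grouped.items()}
```

Each value in rows_by_type is then a list of field lists that can be processed further or passed to a DataFrame constructor.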

Answer 3

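You can use the skiprows parameter of pandas' read_csv method to skip the rows of the record types you are not interested in. The following gives you a dictionary dfs with one DataFrame per type. One advantage is that records of the same type do not have to be adjacent to each other in the csv file.

For larger files you may want to adapt the code so that the file is read only once, instead of once to collect the line numbers and then again for each record type.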

import pandas as pd
from collections import defaultdict

indices = defaultdict(set)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'

# First pass: record the line numbers that belong to each record type.
with open(filename) as infile:
    for idx, line in enumerate(infile):
        for typ in types:
            if line.startswith(typ):
                indices[typ].add(idx)

# One read_csv call per type, skipping every line that is not of that type.
dfs = {typ: pd.read_csv(filename, header=None,
                        skiprows=lambda x: x not in indices[typ])
       for typ in types}
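
A single-pass variant could look like the sketch below. It is not part of the original answer: it collects the lines of each type in memory and then parses each bucket with read_csv, so it assumes the file fits in memory. io.StringIO stands in for the real file, buckets is a name made up for this example, and dtype=str is an optional choice that keeps values such as '08' as text.

```python
import io
from collections import defaultdict

import pandas as pd

# Sample data copied from the question; StringIO stands in for open(filename).
SAMPLE = """typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
"""

types = ['typ1', 'typ2', 'typ3']

# Single pass over the source: append each line to the bucket of its type.
buckets = defaultdict(list)
with io.StringIO(SAMPLE) as source:
    for line in source:
        for typ in types:
            if line.startswith(typ):
                buckets[typ].append(line)
                break

# Parse each bucket separately, so every frame gets its own column count.
# Note: read_csv raises EmptyDataError for a type with no lines at all.
dfs = {typ: pd.read_csv(io.StringIO(''.join(buckets[typ])), header=None, dtype=str)
       for typ in types}
```

As with the skiprows version, records of the same type do not need to be adjacent in the file.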

Answer 4

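Read the file as a CSV file with the csv reader. Luckily, the reader does not care about the row format:

import csv
# newline='' is what the csv module recommends when opening files for it
with open("yourfile.csv", newline='') as infile:
    data = list(csv.reader(infile))

Collect the rows that share the same first element and build DataFrames from them:

import pandas as pd
from itertools import groupby
# each group is materialized with list() before it is handed to pandas
dfs = [pd.DataFrame(list(v)) for _, v in groupby(data, lambda x: x[0])]

You now have a list of three DataFrames (or as many as there are record types).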

dfs[1]
#      0     1   2   3  4
#0  typ2  2020  08  16  A
#1  typ2  2020  09  02  A
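
One thing to keep in mind: itertools.groupby only groups consecutive items, which is fine here because the question guarantees that rows of one type are contiguous. If the types could interleave, a possible adjustment is to sort by the first field before grouping. The sketch below illustrates that with the question's rows deliberately reordered; the reordering, the record_type helper, and the dfs_by_type name are all made up for this example.

```python
import csv
import io
from itertools import groupby

import pandas as pd

# Rows from the question, deliberately reordered so the types interleave.
SAMPLE = """typ1,John,Smith,40,M,Single
typ3,Chevrolet,FC101TT,2017
typ2,2020,08,16,A
typ1,Harry,Potter,22,M,Married
typ2,2020,09,02,A
typ1,Eva,Adams,35,F,Single
typ3,Toyota,CE972SY,2004
"""

data = list(csv.reader(io.StringIO(SAMPLE)))


def record_type(row):
    """Return the first field, which identifies the record type."""
    return row[0]


# list.sort() is stable, so rows keep their file order within each type.
data.sort(key=record_type)

# After sorting, each type forms one consecutive run that groupby can collect.
dfs_by_type = {key: pd.DataFrame(list(rows))
               for key, rows in groupby(data, key=record_type)}
```

Using a dict keyed by record type also avoids having to remember which list position belongs to which type.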

4 solutions

Solution 1
1 2020-10-06 02:29:02

Solution 2
0 2020-10-06 01:57:40

Solution 3
0 2020-10-06 02:41:23

Solution 4
0 2020-10-06 04:36:48
