Using a custom object in pandas.read_csv()

Question

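I'm interested in streaming a custom object into a pandas DataFrame. According to the documentation, any object with a read() method can be used. However, even after implementing this method I still get this error:

ValueError: Invalid file path or buffer object type: <class '__main__.DataFile'>

Here is a simple version of the object, and how I am calling it: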

class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'r') as file:
                for line in file:
                    yield line

import pandas as pd
hours = ['file1.csv', 'file2.csv', 'file3.csv']

data = DataFile(hours)
df = pd.read_csv(data)

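Am I missing something, or is it just not possible to use a custom generator in pandas? When I call the read() method myself, it works fine.

EDIT: The reason I want to use a custom object rather than concatenating DataFrames is to see whether it is possible to reduce memory usage. I have used the gensim library in the past, and it makes custom data objects really easy to use, so I was hoping to find a similar approach.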

Answer 1

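The documentation mentions the read method, but what pandas actually checks is whether the argument passes is_file_like (that is where the exception is raised). That function is really very simple: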

def is_file_like(obj):
    if not (hasattr(obj, 'read') or hasattr(obj, 'write')):
        return False
    if not hasattr(obj, "__iter__"):
        return False
    return True

So it also needs an __iter__ method.
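To make the check concrete, here is a minimal sketch that copies the is_file_like function quoted above (instead of importing it from pandas) and applies it to two hypothetical classes. ReadOnly mirrors the question's object and ReadAndIter adds __iter__; both names are made up for illustration.

```python
def is_file_like(obj):
    # Copied from the snippet above: needs read or write, plus __iter__.
    if not (hasattr(obj, 'read') or hasattr(obj, 'write')):
        return False
    if not hasattr(obj, '__iter__'):
        return False
    return True


class ReadOnly(object):
    # Hypothetical: like the question's DataFile, it only defines read().
    def read(self):
        yield 'a,b\n'


class ReadAndIter(ReadOnly):
    # Hypothetical: same read(), plus the __iter__ the check looks for.
    def __iter__(self):
        return self.read()


print(is_file_like(ReadOnly()))     # False
print(is_file_like(ReadAndIter()))  # True
```

Passing this check only avoids the ValueError. It says nothing about whether read() behaves the way a parser expects.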

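But that is not the only problem. pandas requires the object to actually behave like a file. So the read method should accept an additional argument for the number of bytes to read. That means read cannot be a generator: it has to be callable with 2 arguments (self and the size), and it should return a string.

So for example: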

class DataFile(object):
    def __init__(self, files):
        self.data = """a b
1 2
2 3
"""
        self.pos = 0

    def read(self, x):
        nxt = self.pos + x
        ret = self.data[self.pos:nxt]
        self.pos = nxt
        return ret

    def __iter__(self):
        yield from self.data.split('\n')

will be recognized as valid input.

However, it is harder for multiple files. I hoped that fileinput would have some appropriate routines, but it doesn't seem so:

import fileinput

pd.read_csv(fileinput.input([...]))
# ValueError: Invalid file path or buffer object type: <class 'fileinput.FileInput'>
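For what it's worth, a file-like wrapper over several files can be written by hand. The following is only a sketch, not something pandas provides: the class name ChainedFiles and the size=-1 default are my own choices, and it assumes text files that each end with a newline and have no header row (otherwise the headers would be repeated in the data). I have not checked it against every pandas version or parser engine, so the demo below only exercises the object directly.

```python
import os
import tempfile


class ChainedFiles(object):
    """Hypothetical read-only, file-like view over several text files."""

    def __init__(self, files):
        self.files = list(files)
        self.index = 0        # position of the next file to open
        self.current = None   # currently open file object, if any

    def _advance(self):
        # Open the next file; return False when none remain.
        if self.index >= len(self.files):
            return False
        self.current = open(self.files[self.index], 'r')
        self.index += 1
        return True

    def _finish_current(self):
        # Close the exhausted file so the next call moves on.
        self.current.close()
        self.current = None

    def read(self, size=-1):
        # Return up to `size` characters as one string ('' signals EOF).
        parts = []
        remaining = size
        while size < 0 or remaining > 0:
            if self.current is None and not self._advance():
                break
            chunk = self.current.read(remaining if size >= 0 else -1)
            if not chunk:
                self._finish_current()
                continue
            parts.append(chunk)
            if size >= 0:
                remaining -= len(chunk)
        return ''.join(parts)

    def __iter__(self):
        # Yield lines one at a time, continuing across file boundaries.
        while True:
            if self.current is None and not self._advance():
                return
            line = self.current.readline()
            if line:
                yield line
            else:
                self._finish_current()


# Demo on two small temporary files.
paths = []
for text in ('1,2,3\n4,5,6\n', '7,8,9\n'):
    with tempfile.NamedTemporaryFile(mode='w', suffix='.csv', delete=False) as f:
        f.write(text)
        paths.append(f.name)

src = ChainedFiles(paths)
first = src.read(14)   # 12 characters from the first file, 2 from the second
rest = src.read()      # everything that is left
lines = list(ChainedFiles(paths))

print(repr(first))     # '1,2,3\n4,5,6\n7,'
print(repr(rest))      # '8,9\n'
print(lines)           # ['1,2,3\n', '4,5,6\n', '7,8,9\n']

for p in paths:
    os.remove(p)
```

If pandas accepts the object, the call would presumably look like pd.read_csv(ChainedFiles(hours), header=None), with hours being the list from the question.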

Answer 2

One way to make a file-like object in Python 3 is by subclassing io.RawIOBase. Using Mechanical snail's iterstream, you can convert any iterable of bytes into a file-like object:

import tempfile
import io
import pandas as pd

def iterstream(iterable, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """
    http://stackoverflow.com/a/20260030/190597 (Mechanical snail)
    Lets you use an iterable (e.g. a generator) that yields bytestrings as a
    read-only input stream.

    The stream implements Python 3's newer I/O API (available in Python 2's io
    module).

    For efficiency, the stream is buffered.
    """
    class IterStream(io.RawIOBase):
        def __init__(self):
            self.leftover = None
        def readable(self):
            return True
        def readinto(self, b):
            try:
                l = len(b)  # We're supposed to return at most this much
                chunk = self.leftover or next(iterable)
                output, self.leftover = chunk[:l], chunk[l:]
                b[:len(output)] = output
                return len(output)
            except StopIteration:
                return 0    # indicate EOF
    return io.BufferedReader(IterStream(), buffer_size=buffer_size)


class DataFile(object):
    def __init__(self, files):
        self.files = files

    def read(self):
        for file_name in self.files:
            with open(file_name, 'rb') as f:
                for line in f:
                    yield line

def make_files(num):
    filenames = []
    for i in range(num):
        with tempfile.NamedTemporaryFile(mode='wb', delete=False) as f:
            f.write(b'''1,2,3\n4,5,6\n''')
            filenames.append(f.name)
    return filenames

# hours = ['file1.csv', 'file2.csv', 'file3.csv']
hours = make_files(3)
print(hours)
data = DataFile(hours)
df = pd.read_csv(iterstream(data.read()), header=None)

print(df)

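This prints the list of temporary file names, followed by the combined DataFrame: six rows and three columns, with the rows 1,2,3 and 4,5,6 repeated once per file.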

Answer 3

How about this alternative approach:

def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)

df = get_merged_csv(hours)

3 solutions

Solution 1
3 2017-09-19 21:54:27

Solution 2
3 2017-09-19 22:07:13

Solution 3
0 2017-09-19 21:45:54
