Read a small random sample from a big CSV file into a Python data frame

Question

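The CSV file I want to read does not fit into main memory. How can I read a few (~10K) random lines of it and compute some simple statistics on the selected data frame?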

Answer 1

Assuming there is no header in the CSV file:

import pandas
import random

n = 1000000 #number of records in file
s = 10000 #desired sample size
filename = "data.txt"
skip = sorted(random.sample(range(n),n-s))
df = pandas.read_csv(filename, skiprows=skip, header=None) #no header row, so the first kept row must not become the column names

It would be better if read_csv had a keeprows option, or if skiprows accepted a callback function instead of a list.

With a header and unknown file length:

import pandas
import random

filename = "data.txt"
n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
s = 10000 #desired sample size
skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
df = pandas.read_csv(filename, skiprows=skip)

Answer 2

@dlm's answer is great, but since v0.20.0, skiprows does accept a callable. The callable receives the row number as its argument.
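As a rough sketch of what the callable form makes possible, the helper below builds a predicate that keeps a fixed set of randomly chosen rows, which is effectively the keeprows behaviour wished for in the first answer. The name make_skiprows and its parameters are hypothetical, not part of pandas, and the number of data rows still has to be known in advance.

```python
import random

def make_skiprows(n_rows, sample_size, has_header=True, seed=None):
    """Build a predicate for a skiprows-style callback (True means "skip this row").

    Hypothetical helper: n_rows is the number of data rows, assumed known.
    """
    rng = random.Random(seed)
    first = 1 if has_header else 0           # row 0 is the header, if there is one
    keep = set(rng.sample(range(first, first + n_rows), sample_size))
    if has_header:
        keep.add(0)                          # never skip the header
    return lambda i: i not in keep

skip = make_skiprows(n_rows=1_000_000, sample_size=10_000, seed=42)
# Count the rows that survive: the header plus the 10,000 chosen data rows
kept = sum(1 for i in range(1_000_001) if not skip(i))
```

With pandas installed, the predicate would be passed as pd.read_csv(filename, skiprows=skip). A set lookup per row keeps the callback cheap even for large files.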

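Note also that their answer for an unknown file length relies on iterating through the file twice: once to get the length, and then again to read the CSV. I have three solutions here that rely on iterating through the file only once, though they all have tradeoffs.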

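Solution 1: approximate percentage

If you can specify the percentage of lines you want rather than the number of lines, you don't even need to know the file size; you just read through the file once. Assuming a header on the first row: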

import pandas as pd
import random
p = 0.01  # 1% of the lines
# keep the header, then take only 1% of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
df = pd.read_csv(
         filename,
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

As pointed out in the comments, this only gives approximately the right number of lines, but I think it satisfies the desired use case.
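To put a number on "approximately": each data row is kept independently with probability p, so the number of rows kept follows a binomial distribution. A small sketch, where the total of 1,000,000 rows is only an assumed figure for illustration:

```python
from math import sqrt

def expected_sample_size(n_rows, p):
    """Mean and standard deviation of the number of rows kept when each
    of n_rows rows is kept independently with probability p."""
    mean = n_rows * p
    std = sqrt(n_rows * p * (1 - p))
    return mean, std

mean, std = expected_sample_size(1_000_000, 0.01)
print(f"expect about {mean:.0f} rows, give or take {std:.0f}")
```

Under that assumption the sample size would typically land within a few hundred rows of the target, which is usually close enough for summary statistics.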

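Solution 2: every Nth line

This is far less random than the first solution: it takes a fixed, evenly spaced subset of rows, so the number of rows returned is predictable. Depending on how your file is sorted, this might meet your use case.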

n = 100  # every 100th line = 1% of the lines
df = pd.read_csv(filename, header=0, skiprows=lambda i: i % n != 0)
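One small variation, offered as a sketch rather than as part of the original answer: picking the starting row at random turns this into a systematic sample with a random start, so the same rows are not selected on every run. The helper name systematic_skiprows is made up for illustration.

```python
import random

def systematic_skiprows(step, seed=None):
    """Predicate for a skiprows-style callback: keep the header (row 0) and
    every step-th data row, starting from a randomly chosen offset."""
    offset = random.Random(seed).randrange(step)   # 0 <= offset < step
    def skip(i):
        # data rows are numbered from 1, so (i - 1) is the 0-based data index
        return i > 0 and (i - 1) % step != offset
    return skip

skip = systematic_skiprows(100, seed=0)
kept = [i for i in range(1001) if not skip(i)]     # scan the header + 1,000 data rows
```

The predicate would be passed as skiprows=skip in the read_csv call above. The rows are still evenly spaced, so any periodic pattern in the file can bias the result.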

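Solution 3: reservoir sampling

(Added July 2021)

Reservoir sampling is an elegant algorithm for randomly selecting k items from a stream whose length is unknown and that you only see once.

The big advantage is that you can use it without having the full dataset on disk, and that it gives you an exactly sized sample without knowing the full dataset size. The disadvantage is that I don't see a way to implement it in pure pandas; I think you need to drop into Python to read the file and then construct the dataframe afterwards. So you may lose some read_csv functionality or need to reimplement it, since pandas is not doing the actual file reading.

Taking an implementation of the algorithm from Oscar Benjamin here: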

from math import exp, log, floor
from random import random, randrange
from itertools import islice
from io import StringIO

import pandas as pd

def reservoir_sample(iterable, k=1):
    """Select k items uniformly from iterable.

    Returns the whole population if there are k or fewer items

    from https://bugs.python.org/issue41311#msg373733
    """
    iterator = iter(iterable)
    values = list(islice(iterator, k))

    W = exp(log(random())/k)
    while True:
        # skip is geometrically distributed
        skip = floor( log(random())/log(1-W) )
        selection = list(islice(iterator, skip, skip+1))
        if selection:
            values[randrange(k)] = selection[0]
            W *= exp(log(random())/k)
        else:
            return values

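def sample_file(filepath, k):
    with open(filepath, 'r') as f:
        header = next(f)  # keep the single header line out of the sample
        result = [header] + reservoir_sample(f, k)
    # parse the header plus the k sampled lines as a small in-memory CSV
    df = pd.read_csv(StringIO(''.join(result)))
    return df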

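The reservoir_sample function returns a list of strings, each of which is a single line, so we just need to turn it into a dataframe at the end. This assumes there is exactly one header line; I haven't thought about how to extend it to other situations.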
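One possible way to handle several header lines, sketched under the assumption that the header is simply the first n_header physical lines of the file. For brevity it uses the simpler replace-with-probability-k/i form of reservoir sampling rather than the skip-based version above, and the function name is hypothetical.

```python
import random
from itertools import islice

def sample_lines_with_header(lines, k, n_header=1, rng=random):
    """Return the first n_header lines followed by k lines sampled uniformly
    from the rest (all remaining lines if there are k or fewer)."""
    iterator = iter(lines)
    header = list(islice(iterator, n_header))
    reservoir = list(islice(iterator, k))
    # `seen` counts data lines read so far, including the current one
    for seen, line in enumerate(iterator, start=k + 1):
        j = rng.randrange(seen)        # uniform in [0, seen)
        if j < k:                      # happens with probability k / seen
            reservoir[j] = line
    return header + reservoir

# Made-up data: a names row, a types row, then 1,000 data rows
lines = ["id,value\n", "int,int\n"] + [f"{i},{i * i}\n" for i in range(1000)]
sample = sample_lines_with_header(lines, k=5, n_header=2)
```

The returned list can be joined and handed to a parser through io.StringIO, in the same way sample_file does.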

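I tested this locally and it is much faster than the other two solutions. Using a 550 MB CSV (the January 2020 "Yellow Taxi Trip Records" from the NYC TLC), solution 3 runs in about 1 second, while the other two take about 3-4 seconds.

In my test this is even a little faster (~10-20%) than the answer using shuf, which surprised me.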

Answer 3

This is not in pandas, but it achieves the same result much faster through bash, without reading the entire file into memory:

shuf -n 100000 data/original.tsv > data/sample.tsv

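The shuf command shuffles the input, and the -n argument indicates how many lines we want in the output.

Related question: https://unix.stackexchange.com/q/108581

Benchmark on a 7-million-line CSV, available here (2008):

Top answer: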

def pd_read():
    filename = "2008.csv"
    n = sum(1 for line in open(filename)) - 1 #number of records in file (excludes header)
    s = 100000 #desired sample size
    skip = sorted(random.sample(range(1,n+1),n-s)) #the 0-indexed header will not be included in the skip list
    df = pandas.read_csv(filename, skiprows=skip)
    df.to_csv("temp.csv")

pandas timing:

%time pd_read()
CPU times: user 18.4 s, sys: 448 ms, total: 18.9 s
Wall time: 18.9 s

Using shuf:

time shuf -n 100000 2008.csv > temp.csv

real    0m1.583s
user    0m1.445s
sys     0m0.136s

So shuf is about 12x faster and, importantly, does not read the whole file into memory.
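The ratio follows directly from the two wall-clock times reported above:

```python
pandas_wall_seconds = 18.9    # "Wall time" from %time pd_read()
shuf_wall_seconds = 1.583     # "real" from time shuf

speedup = pandas_wall_seconds / shuf_wall_seconds
print(f"shuf is about {speedup:.1f}x faster")
```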

Answer 4

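Here is an algorithm that doesn't require counting the number of lines in the file beforehand, so you only need to read the file once.

Say you want m samples. First, the algorithm keeps the first m samples. When it sees the i-th sample (i > m), with probability m/i it uses that sample to replace a randomly chosen one of the samples already selected.

By doing so, for any i > m, we always have a subset of m samples randomly selected from the first i samples.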
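A short induction shows why this keeps every line equally likely. Suppose that after i - 1 lines each of them is in the sample with probability m/(i-1), which is trivially true when i - 1 = m. Line i is accepted with probability m/i, and an earlier line j is evicted only if line i is accepted and happens to pick its slot:

```latex
\begin{align*}
P(\text{line } i \text{ in sample after step } i) &= \frac{m}{i},\\
P(\text{line } j \text{ in sample after step } i)
  &= \frac{m}{i-1}\left(1 - \frac{m}{i}\cdot\frac{1}{m}\right)
   = \frac{m}{i-1}\cdot\frac{i-1}{i}
   = \frac{m}{i}, \qquad j < i.
\end{align*}
```

So after i lines, every line seen so far is in the sample with probability m/i.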

See the code below:

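import random

n_samples = 10
samples = []

with open("data.csv") as f:  # placeholder path; f was not defined in the original snippet
    for i, line in enumerate(f):
        if i < n_samples:
            # fill the reservoir with the first n_samples lines
            samples.append(line)
        elif random.random() < n_samples / (i + 1):
            # keep this line with probability n_samples / (i + 1), evicting a random entry
            samples[random.randint(0, n_samples - 1)] = line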
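A quick empirical check of the claim that every line is equally likely to end up in the sample, as a sketch: the same loop wrapped in a function (the names are illustrative) and run many times over a small population, so that each item's inclusion frequency can be compared with m/n.

```python
import random
from collections import Counter

def reservoir_sample(iterable, m, rng=random):
    """Keep the first m items, then replace a random slot with probability m / (i + 1)."""
    samples = []
    for i, item in enumerate(iterable):
        if i < m:
            samples.append(item)
        elif rng.random() < m / (i + 1):
            samples[rng.randrange(m)] = item
    return samples

def inclusion_frequencies(n_items, m, trials, seed=0):
    """Fraction of trials in which each of range(n_items) ends up in the sample."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        counts.update(reservoir_sample(range(n_items), m, rng))
    return {item: counts[item] / trials for item in range(n_items)}

freqs = inclusion_frequencies(n_items=10, m=3, trials=20_000)
```

If the algorithm is uniform, every value in freqs should be close to 3/10.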

Answer 5

The following code first reads the header, and then a random sample of the other lines:

import pandas as pd
import numpy as np

filename = 'hugedatafile.csv'
nlinesfile = 10000000
nlinesrandomsample = 10000
lines2skip = np.random.choice(np.arange(1,nlinesfile+1), (nlinesfile-nlinesrandomsample), replace=False)
df = pd.read_csv(filename, skiprows=lines2skip)

Answer 6

from random import randint

class magic_checker:
    """Sentinel for iter(): compares equal once it has been tested target_count times."""
    def __init__(self, target_count):
        self.target = target_count
        self.count = 0
    def __eq__(self, x):
        self.count += 1
        return self.count >= self.target

min_target = 100000
max_target = min_target * 2
nlines = randint(100, 1000)
seek_target = randint(min_target, max_target)
# binary mode, so seek() can jump to an arbitrary byte offset
with open("big.csv", "rb") as f:
    f.seek(seek_target)
    f.readline()  # discard this (probably partial) line
    raw_lines = list(iter(f.readline, magic_checker(nlines)))

rand_lines = [raw.decode() for raw in raw_lines]

# do something to process the lines you got returned .. perhaps just a split
print(rand_lines)
print(rand_lines[0].split(","))

I think something like this should work.
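A variation on the same seek-based idea, as a hedged sketch: instead of reading one consecutive run of lines, it seeks to a fresh random byte offset for every sampled line. The function name and parameters are illustrative. Note the trade-offs: the first line is never returned (convenient if it is a header), the same line can be drawn more than once, and lines that follow long lines are more likely to be picked, so this is not a uniform sample.

```python
import os
import random

def sample_lines_by_seek(path, k, rng=random, max_tries=None):
    """Draw up to k lines by jumping to random byte offsets (with replacement)."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    if max_tries is None:
        max_tries = 100 * k                # give up eventually on tiny files
    lines = []
    with open(path, "rb") as f:            # binary mode: offsets are plain bytes
        for _ in range(max_tries):
            if len(lines) == k:
                break
            f.seek(rng.randrange(size))
            f.readline()                   # discard the partial line we landed in
            line = f.readline()
            if line:                       # empty bytes means we ran off the end
                lines.append(line.decode())
    return lines
```

Because no full pass over the file is needed, the cost depends on k rather than on the file size, which is the main appeal of this family of approaches.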

Answer 7

No pandas!

import random
from os import fstat
from sys import exit

# Binary mode: Python 3 only allows relative seeks on binary files
f = open('/usr/share/dict/words', 'rb')

# Number of lines to be read
lines_to_read = 100

# Minimum and maximum bytes that will be randomly skipped
min_bytes_to_skip = 10000
max_bytes_to_skip = 1000000

def is_EOF():
    return f.tell() >= fstat(f.fileno()).st_size

# To accumulate the read lines
sampled_lines = []

for n in range(lines_to_read):
    bytes_to_skip = random.randint(min_bytes_to_skip, max_bytes_to_skip)
    f.seek(bytes_to_skip, 1)
    # After skipping "bytes_to_skip" bytes, we can stop in the middle of a line
    # Skip current entire line
    f.readline()
    if not is_EOF():
        sampled_lines.append(f.readline().decode())
    else:
        # Go to the beginning of the file ...
        f.seek(0, 0)
        # ... and skip lines again
        f.seek(bytes_to_skip, 1)
        # If it has reached the EOF again
        if is_EOF():
            print("You have skipped more lines than your file has")
            print("Reduce the values of:")
            print("   min_bytes_to_skip")
            print("   max_bytes_to_skip")
            exit(1)
        else:
            f.readline()
            sampled_lines.append(f.readline().decode())

f.close()
print(sampled_lines)

You'll end up with a sampled_lines list. What kind of statistics do you mean?
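As one possible answer to that last question, a minimal sketch of simple statistics computed directly on a list of sampled lines with the standard library. The column position and the assumption that the column is numeric are illustrative only.

```python
import csv
import statistics

def column_stats(sampled_lines, column_index):
    """Count, mean and sample standard deviation of one numeric CSV column."""
    rows = csv.reader(sampled_lines)       # accepts any iterable of CSV lines
    values = [float(row[column_index]) for row in rows if row]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

stats = column_stats(["a,2.0\n", "b,4.0\n", "c,6.0\n"], column_index=1)
```

For anything beyond a couple of columns, joining the lines and parsing them into a dataframe is probably more convenient.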

Answer 8

Using subsample

pip install subsample
subsample -n 1000 file.csv > file_1000_sample.csv

Answer 9

You can also create a sample with 10000 records before bringing it into the Python environment.

Using Git Bash (Windows 10), I just ran the following command to produce the sample:

shuf -n 10000 BIGFILE.csv > SAMPLEFILE.csv

Note: if your CSV has headers, this is not the best solution.
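A hedged sketch of one workaround in Python: since shuf treats the header like any other line, the header may be missing from the sample or sit somewhere in the middle of it. The helper below (a hypothetical name) rewrites the sample with the header first, assuming that no data line is identical to the header line.

```python
def restore_header(original_path, sample_path, out_path):
    """Write the original file's first line, then every sampled line that is
    not itself a copy of that header."""
    with open(original_path, newline="") as src:
        header = src.readline()
    with open(sample_path, newline="") as sample, \
            open(out_path, "w", newline="") as out:
        out.write(header)
        for line in sample:
            if line != header:         # drop the header if shuf happened to pick it
                out.write(line)
```

Only the first line of the big file is read, so this stays cheap regardless of the original file's size. If the header was drawn, the cleaned sample ends up one data row short of the requested count.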

Answer 10

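TL;DR

If you know the size of the sample you want, but not the size of the input file, you can efficiently load a random sample out of it with the following pandas code: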

import pandas as pd
import numpy as np

filename = "data.csv"
sample_size = 10000
batch_size = 200

rng = np.random.default_rng()

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

sample = sample_reader.get_chunk(sample_size)

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))
    sample.loc[chunk.index] = chunk

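Explanation

It's not always trivial to know the size of the input CSV file.

If there are embedded line breaks, tools like wc or shuf will either give you the wrong answer or just make a mess out of your data.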
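A small illustration of the problem, using only the standard library and a made-up CSV: a quoted field containing a line break makes the physical line count disagree with the number of records a CSV parser sees.

```python
import csv
import io

# Made-up data: the second record has a line break inside a quoted field
text = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

physical_lines = text.count("\n")                  # what a line-based tool counts
records = list(csv.reader(io.StringIO(text, newline="")))

print(physical_lines, "physical lines,", len(records), "CSV records (header included)")
```

A line-based sampler could also pick only one half of the multi-line record, which is how the data ends up mangled.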

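So, based on desktable's answer, we can treat the first sample_size lines of the file as the initial sample and then, for each subsequent line in the file, randomly replace a line in the initial sample.

To do that efficiently, we load the CSV file using a TextFileReader by passing the chunksize= parameter: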

sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

First, we get the initial sample:

sample = sample_reader.get_chunk(sample_size)

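Then, we iterate over the remaining chunks of the file, replacing the index of each chunk with a sequence of random integers as long as the chunk, where each integer is in the range of the index of the initial sample (which happens to be the same as range(sample_size)):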

for chunk in sample_reader:
    chunk.index = rng.integers(sample_size, size=len(chunk))

And use this reindexed chunk to replace (some of the) lines in the sample:

sample.loc[chunk.index] = chunk

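After the for loop, you'll have a dataframe at most sample_size rows long, but with random lines selected from the big CSV file.

To make the loop more efficient, you can make batch_size as large as your memory allows (and yes, even larger than sample_size if you can).

Notice that, while creating the new chunk index with np.random.default_rng().integers(), we use len(chunk) as the new chunk index size instead of simply batch_size, because the last chunk in the loop could be smaller.

On the other hand, we use sample_size instead of len(sample) as the "range" of the random integers, even though there could be fewer lines in the file than sample_size. This is because there won't be any chunks left to loop over in that case, so that is never a problem.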

Answer 11

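Read the data file

import pandas as pd
df = pd.read_csv('data.csv')

First check the shape of df

df.shape  # shape is an attribute, not a method

Create a small sample of 1000 rows from df

sample_data = df.sample(n=1000, replace=False)  # the boolean False, not the string 'False'

# Check the shape of sample_data

sample_data.shape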

Answer 12

For example, if you have loan.csv, you can use this script to easily load a specified number of random items.

data = pd.read_csv('loan.csv').sample(10000, random_state=44)

Answer 13

Say you want to load a 20% sample of the dataset:

    import pandas as pd
    df = pd.read_csv(filepath).sample(frac = 0.20)

13 answers (score and time posted)

Answer 1: 91, 2014-03-07 19:29:08
Answer 2: 52, 2018-02-02 19:38:30
Answer 3: 31, 2016-06-10 17:50:05
Answer 4: 11, 2016-03-18 18:05:44
Answer 5: 4, 2015-01-10 14:50:45
Answer 6: 3, 2014-03-07 19:29:28
Answer 7: 3, 2014-03-07 23:08:05
Answer 8: 2, 2018-04-17 21:23:48
Answer 9: 2, 2019-11-03 21:32:07
Answer 10: 1, 2020-05-06 09:27:59
Answer 11: 1, 2020-07-12 23:24:46
Answer 12: 0, 2020-03-20 01:00:34
Answer 13: -2, 2020-04-09 15:23:57
