Is pandas read_csv really slow compared to python open?

Question

My requirement is to remove duplicate rows from a csv file, but the file is 11.3 GB. So I benchmarked pandas against a Python file generator.

Python file generator:

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        for i, line in enumerate(fp):
            if i == 0:
                continue
            cols = line.split(',')
            if cols[0] in dups:
                continue
            dups.add(cols[0])
            mg.write(line)
            mg.write('\n')

Using pandas read_csv:

import pandas as pd
df = pd.read_csv(r'D:\my-file.csv', sep=',', iterator=True, chunksize=1024*128)
def fileInPandas():
    for d in df:
        d_clean = d.drop_duplicates('NPI')
        d_clean.to_csv(r'D:\mining1.csv', mode='a')

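Details: Size: 11.3 GB; Rows: 100 million, of which 50 million are duplicates; Python version: 3.5.2; Pandas version: 0.19.0; RAM: 8GB; CPU: Core-i5 2.60GHz

What I observed is that the Python file generator took 643 seconds, while pandas took 1756 seconds.

With the Python file generator my system did not even hang, but with pandas it did.

Am I using pandas the right way? And if I want to sort the 11.3 GB file, how should I do that?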

Answer 1

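pandas isn't a good choice for this task. It reads the entire 11.3G file into memory and does string-to-int conversions on all of the columns. I'm not surprised that your machine bogged down!

The line-by-line version is much leaner. It doesn't do any conversions, doesn't bother looking at unimportant columns, and doesn't keep a large dataset in memory. It is the better tool for the job.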

def fileTestInPy():
    with open(r'D:\my-file.csv') as fp, open(r'D:\mining.csv', 'w') as mg:
        dups = set()
        next(fp) # <-- advance fp so you don't need to check each line
                 # or use enumerate
        for line in fp:
            col = line.split(',', 1)[0]  # <-- only split what you need
            if col in dups:
                continue
            dups.add(col)
            mg.write(line)
            # mg.write('\n')  # <-- line still has its \n, did you
                              # want another?

Also, if this is python 3.x and you know your file is ascii or UTF-8, you could open both files in binary mode and save the decoding and encoding conversion.
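A minimal sketch of that binary-mode variant, assuming the key is the first comma-separated column and the header row should be skipped as in the code above. The function name `dedupe_by_first_column` and its path parameters are made up for illustration. Splitting on the byte `b','` is safe for ASCII and UTF-8 because the comma byte never occurs inside a multi-byte UTF-8 sequence.

```python
def dedupe_by_first_column(src_path, dst_path):
    """Copy src_path to dst_path, keeping only the first row seen for each key.

    Hypothetical helper: the key is assumed to be the first comma-separated
    column, and the first line is assumed to be a header that is dropped.
    """
    seen = set()
    # Binary mode: lines arrive as bytes, so nothing is decoded or re-encoded.
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        next(src, None)  # skip the header; the default avoids an error on an empty file
        for line in src:
            key = line.split(b',', 1)[0]  # only split off the first column
            if key in seen:
                continue
            seen.add(key)
            dst.write(line)  # line still carries its own line ending
    return len(seen)  # number of distinct keys written
```

For the files in the question this would be called as `dedupe_by_first_column(r'D:\my-file.csv', r'D:\mining.csv')`. The set of keys still has to fit in memory, just as in the text-mode version.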

Answer 1
2 accepted 2016-10-08 16:50:21
