How to read a chunk from the middle of a long CSV file using Python (200 GB+)

I have a large CSV file that I am reading in chunks. Partway through the process the memory filled up, so I want to restart from where it left off. I know which chunk that was, but I don't know how to jump to it directly.

This is what I tried.

import os
import pandas as pd

# data is the path to the tab-separated txt file
reader = pd.read_csv(data,
                     delimiter="\t",
                     chunksize=1000)

# When my last process broke, i was 154, so I think it should
# start from line 154000. This time I don't plan to read the
# whole file at once, so I have an end point at line 160000.
first = 154 * 1000
last = 160 * 1000

output_path = 'usa_hotspot_data_' + str(first) + '_' + str(last) + '.csv'
print("Output file: ", output_path)

try:
    os.remove(output_path)
except OSError:
    pass

# Read chunks and save to a new csv
for i, chunk in enumerate(reader):
    start_row = i * 1000  # i counts chunks, so convert it to a row number
    if start_row >= last:
        break  # past the end point, stop reading
    if start_row >= first:
        # < -- here I do something -- >
        # Progress bar to keep track
        print("#", end='')

However, this takes a long time to reach the line I want. How can I skip the chunks before it and go there directly?

pandas.read_csv

skiprows : Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

You can pass skiprows to read_csv; it acts like an offset into the file.
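
As a rough sketch of how that might look for the question above: passing a range that starts at 1 keeps line 0 (the header) and skips only data rows, whereas a plain integer would skip the header as well. The helper name read_middle, the column names, and the tiny in-memory file are made up for illustration; nrows is used here as an assumption about how to stop at the end point. Note that pandas presumably still has to scan past the skipped lines to find the line breaks, so this should be much cheaper than building a DataFrame for every earlier chunk, but it is not a true seek.

```python
import io
import pandas as pd


def read_middle(source, first, last, chunksize=1000, sep="\t"):
    """Return a chunk iterator over data rows first..last-1 (0-based, header not counted).

    Hypothetical helper, not part of pandas.
    """
    return pd.read_csv(
        source,
        sep=sep,
        skiprows=range(1, first + 1),  # skip file lines 1..first, keep line 0 (the header)
        nrows=last - first,            # stop after this many data rows
        chunksize=chunksize,
    )


# Tiny in-memory stand-in for the real file (made-up columns "id" and "value").
demo = io.StringIO("id\tvalue\n" + "".join(f"{n}\t{n * 10}\n" for n in range(20)))

chunks = list(read_middle(demo, first=8, last=14, chunksize=4))
ids = [int(v) for chunk in chunks for v in chunk["id"]]
print(ids)
```

For the file in the question, the call would be something like read_middle(data, first=154 * 1000, last=160 * 1000), with the per-chunk processing done inside a for loop over the returned reader.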
