How to read chunks from the middle of a long csv file (200 GB+) using Python

Question

I have a large csv file that I am reading in chunks. In the middle of the process memory filled up, so I want to restart from where it left off. I know which chunk that was, but I don't know how to jump to it directly.

This is what I tried.

import os
import pandas as pd

chunksize = 1000

# data is the path to the tab-separated text file
reader = pd.read_csv(data,
                     delimiter="\t",
                     chunksize=chunksize)

# My last run broke at chunk 154, so I think it should resume from
# line 154000. This time I don't plan to read the whole file at once,
# so I stop at line 160000.
first = 154 * chunksize
last = 160 * chunksize

output_path = 'usa_hotspot_data_' + str(first) + '_' + str(last) + '.csv'
print("Output file: ", output_path)

try:
    os.remove(output_path)
except OSError:
    pass

# Read chunks and save to a new csv
for i, chunk in enumerate(reader):
    # i counts chunks, not lines, so compare it with chunk indices
    if i < first // chunksize:
        continue
    if i >= last // chunksize:
        break
    # < -- here I do something -- >
    # Progress bar to keep track
    print("#", end='')

However, it takes a long time to reach the ith chunk I want. How can I skip reading the chunks before it and go there directly?

Answer 1

pandas.read_csv

skiprows : Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

You can pass skiprows to read_csv; it acts like an offset.
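A minimal sketch of how this might look for the case in the question. The function name read_row_range and its parameters are made up for illustration and do not come from the original post. One detail to watch: a plain integer skiprows also skips the header line, so this sketch passes a range that starts at line 1 to keep the header. It also assumes the file has a single header line, and it uses nrows to stop at the end point instead of reading on to the end of the file.

```python
import pandas as pd


def read_row_range(path, first_row, last_row, chunksize=1000, sep="\t"):
    """Yield DataFrame chunks covering data rows [first_row, last_row).

    Rows are counted from 0 after the header line. The name and the
    parameters are hypothetical, chosen only for this example.
    """
    reader = pd.read_csv(
        path,
        sep=sep,
        # File line 0 is the header, so data row k sits on file line k + 1.
        # Skipping lines 1..first_row keeps the header and drops the rows
        # that were already processed.
        skiprows=range(1, first_row + 1),
        # Stop after the requested number of rows.
        nrows=last_row - first_row,
        chunksize=chunksize,
    )
    for chunk in reader:
        yield chunk
```

For the question's numbers this would be called as read_row_range(data, 154 * 1000, 160 * 1000). pandas still has to scan past the skipped lines to find where they end, so the start-up cost does not vanish entirely, but no DataFrames are built for the skipped part.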

Solution 1
0 Accepted 2021-04-19 03:02:38
