
How can I read large text files line by line, without loading it into memory?

I need to read a large file, line by line. Let's say the file is more than 5 GB and I need to read each line, but obviously I do not want to use readlines() because it will create a very large list in memory.

How will the code below work for this case? Does xreadlines itself read the file into memory one line at a time? Is the generator expression needed?

f = (line for line in open("log.txt").xreadlines())  # how much is loaded in memory?

f.next()  

Plus, what can I do to read this in reverse order, just like the Linux tail command?

I found:

http://code.google.com/p/pytailer/

and

" python head, tail and backward read by lines of a text file " python 头、尾和反向读取文本文件的行

Both worked very well!
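
For the tail-like part of the question, here is a minimal sketch of the underlying idea - reading fixed-size blocks backwards from the end of the file and yielding complete lines in reverse order. This is not the code from either of those projects; it assumes UTF-8 text, and the 8192-byte block size is an arbitrary choice:

import os

def reverse_lines(path, block_size=8192, encoding='utf-8'):
    """Yield the lines of a file from last to first by reading blocks from the end."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b''
        first_block = True
        while position > 0:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            block = f.read(read_size)
            if first_block:
                # Ignore a single trailing newline so the last line is not reported as empty
                if block.endswith(b'\n'):
                    block = block[:-1]
                first_block = False
            buffer = block + buffer
            lines = buffer.split(b'\n')
            buffer = lines.pop(0)  # possibly incomplete line; complete it with the next block
            for line in reversed(lines):
                yield line.decode(encoding)
        yield buffer.decode(encoding)

# For example, print the last 10 lines, similar to `tail -n 10 log.txt`:
# from itertools import islice
# for line in islice(reverse_lines('log.txt'), 10):
#     print(line)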

I provided this answer because Keith's, while succinct, doesn't close the file explicitly:

with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)

All you need to do is use the file object as an iterator.

for line in open("log.txt"):
    do_something_with(line)

Even better, use a context manager in recent Python versions.

with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)

This will automatically close the file as well.

Please try this:

with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line)

You are better off using an iterator instead.
Relevant: fileinput — Iterate over lines from multiple input streams.

From the docs:

import fileinput
for line in fileinput.input("filename", encoding="utf-8"):
    process(line)

This will avoid copying the whole file into memory at once.

An old school approach:

fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()

Here's what you can do if you don't have newlines in the file:

with open('large_text.txt') as f:
  while True:
    c = f.read(1024)
    if not c:
      break
    print(c,end='')

I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So, I recreated the cp command using line-by-line reading and writing. It's CRAZY FAST.

#!/usr/bin/env python3.6

import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)

The blaze project has come a long way over the last 6 years. It has a simple API covering a useful subset of pandas features.

dask.dataframe takes care of chunking internally, supports many parallelisable operations, and allows you to export slices back to pandas easily for in-memory operations.

import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field=='XYZ'].compute()

Here's code for loading text files of any size without causing memory issues. It supports gigabyte-sized files:

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

Download the file data_loading_utils.py and import it into your code.

Usage:

import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000


def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; data is one single line of the file
        pass
    else:
        # end of file reached
        pass


data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

The process_lines method is the callback function. It will be called for every line, with the data parameter representing one single line of the file at a time.

You can configure the CHUNK_SIZE variable depending on your machine's hardware configuration.

How about this? Divide your file into chunks and then read it line by line, because when you read a file, your operating system will cache the next line. If you read the file line by line, you are not making efficient use of the cached information.

Instead, divide the file into chunks, load the whole chunk into memory, and then do your processing.

import os

def chunks(fh, size=1024):
    while True:
        startat = fh.tell()
        print(startat)  # file object's current position from the start
        fh.seek(size, 1)  # offset from the current position --> whence=1
        data = fh.readline()  # read on to the end of the current line
        yield startat, fh.tell() - startat  # doesn't store the whole list in memory
        if not data:
            break

# fname is assumed to hold the path of the file to read
if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e:  # file --> permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:  # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
    for ele in chunks(fh):
        fh.seek(ele[0])  # startat
        data = fh.read(ele[1])  # length of the chunk
        print(data)

Thank you! I have recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess means it was in binary format. Using decode('utf-8') changed it to ASCII.

Then I had to remove a "=\n" in the middle of each line.

Then I split the lines at the newlines.

import binascii

b_data = fh.read(ele[1])  # endat; this is one chunk of ascii data in binary format
a_data = (binascii.b2a_qp(b_data)).decode('utf-8')  # data chunk in 'split' ascii format
data_chunk = a_data.replace('=\n', '').strip()  # splitting characters removed
data_list = data_chunk.split('\n')  # list containing the lines in the chunk
# print(data_list, '\n')
# time.sleep(1)
for j in range(len(data_list)):  # iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)

Here is the code, starting just above print(data) in Arohi's code.

I demonstrated a parallel byte-level random access approach in this other question:

Getting number of lines in a text file without readlines

Some of the answers already provided are nice and concise. I like some of them. But it really depends on what you want to do with the data in the file. In my case I just wanted to count lines, as fast as possible, on big text files. My code can of course be modified to do other things too, like any code.
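
As an illustration of that idea - not the parallel code from the linked question, just a minimal single-threaded sketch - counting lines by scanning the file in large binary chunks keeps memory use bounded:

def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters by reading the file in fixed-size binary chunks."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count

# Note: a final line without a trailing newline is not counted.
# print(count_lines('log.txt'))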

This is the best solution I found regarding this, and I tried it on a 330 MB file.

lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')

Where line_length is the number of characters in a single line. For example, "abcd" has a line length of 4.

I have added 2 to the line length to skip the '\n' character and move to the next character.

I realise this was answered quite some time ago, but here is a way of doing it in parallel without killing your memory overhead (which would be the case if you tried to fire each line into the pool). Obviously swap the readJSON_line2 function out for something sensible - it's just here to illustrate the point!

Speedup will depend on file size and on what you are doing with each line - but in the worst-case scenario of a small file that is just read with the JSON reader, I'm seeing performance similar to the ST with the settings below.

Hopefully this is useful to someone out there:

def readJSON_line2(linesIn):
    # Function for reading a chunk of json lines
    '''
    Note, this function is nonsensical. A user would never use the approach suggested
    for reading in a JSON file;
    its role is to evaluate the MT approach for full line-by-line processing to both
    increase speed and reduce memory overhead
    '''
    import json

    linesRtn = []
    for lineIn in linesIn:

        if lineIn.strip():
            lineRtn = json.loads(lineIn)
        else:
            lineRtn = ""

        linesRtn.append(lineRtn)

    return linesRtn


# -------------------------------------------------------------------
if __name__ == "__main__":
    import multiprocessing as mp

    path1 = "C:\\user\\Documents\\"
    file1 = "someBigJson.json"

    nCPUs = mp.cpu_count()
    pool = mp.Pool(nCPUs)  # Worker pool that the chunks are submitted to

    nBuffer = 20*nCPUs  # How many chunks are queued up (so cpus aren't waiting on processes spawning)
    nChunk = 1000  # How many lines are in each chunk
    # Both of the above will require balancing speed against memory overhead

    iJob = 0  # Tracker for SMP jobs submitted into pool
    iiJob = 0  # Tracker for SMP jobs extracted back out of pool

    jobs = []  # SMP job holder
    MTres3 = []  # Final result holder
    chunk = []
    iBuffer = 0  # Buffer line count
    with open(path1+file1) as f:
        for line in f:

            # Send to the chunk
            if len(chunk) < nChunk:
                chunk.append(line)
            else:
                # Chunk full
                # Don't forget to add the current line to chunk
                chunk.append(line)

                # Then add the chunk to the buffer (submit to SMP pool)
                jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
                iJob += 1
                iBuffer += 1
                # Clear the chunk for the next batch of entries
                chunk = []

            # Buffer is full, any more chunks submitted would cause undue memory overhead
            # (Partially) empty the buffer
            if iBuffer >= nBuffer:
                temp1 = jobs[iiJob].get()
                for rtnLine1 in temp1:
                    MTres3.append(rtnLine1)
                iBuffer -= 1
                iiJob += 1

        # Submit the last chunk if it exists (as it would not have been submitted to the SMP buffer)
        if chunk:
            jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
            iJob += 1
            iBuffer += 1

        # And gather up the last of the buffer, including the final chunk
        while iiJob < iJob:
            temp1 = jobs[iiJob].get()
            for rtnLine1 in temp1:
                MTres3.append(rtnLine1)
            iiJob += 1

    # Cleanup
    del chunk, jobs, temp1
    pool.close()

This might be useful when you want to work in parallel and read only chunks of data, but keep the chunks clean by ending them on newlines.

def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # Keep reading until the chunk ends on a newline (or the file ends)
        while not data.endswith('\n'):
            extra = fileObj.read(1)
            if not extra:
                break
            data += extra
        yield data
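
A possible usage sketch for the generator above (the file name and do_something_with are placeholders):

with open('filename', 'r') as fileObj:
    for chunk in readInChunks(fileObj):
        for line in chunk.splitlines():
            do_something_with(line)
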
f = open('filename', 'r').read()
f1 = f.split('\n')
for i in range(len(f1)):
    do_something_with(f1[i])

Hope this helps.
