How can I read large text files line by line, without loading them into memory?
I need to read a large file, line by line. Let's say the file is more than 5 GB and I need to read each line, but obviously I do not want to use readlines() because it will create a very large list in memory.
How will the code below work for this case? Is xreadlines itself reading one line at a time into memory? Is the generator expression needed?
f = (line for line in open("log.txt").xreadlines()) # how much is loaded in memory?
f.next()
Plus, what can I do to read this in reverse order, just as the Linux tail command does?
I found:
http://code.google.com/p/pytailer/
and
"python head, tail and backward read by lines of a text file"
Both worked very well!
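For the tail-style reverse reading, here is a minimal sketch of the usual blocks-from-the-end technique. This is not pytailer's actual code; the helper name read_backwards and the block size are my own.

```python
import os

def read_backwards(path, block_size=4096):
    """Yield the lines of a text file last-to-first, reading the file
    from the end in fixed-size blocks so the whole file is never held
    in memory at once."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        position = f.tell()
        buffer = b''
        while position > 0:
            read_size = min(block_size, position)
            position -= read_size
            f.seek(position)
            buffer = f.read(read_size) + buffer
            lines = buffer.split(b'\n')
            # The first piece may be an incomplete line; keep it buffered
            buffer = lines.pop(0)
            for line in reversed(lines):
                yield line.decode('utf-8')
        if buffer:
            yield buffer.decode('utf-8')
```

Setting block_size lets you trade memory for fewer seeks; the decode assumes a UTF-8 file.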
I provided this answer because Keith's, while succinct, doesn't close the file explicitly:
with open("log.txt") as infile:
    for line in infile:
        do_something_with(line)
All you need to do is use the file object as an iterator.
for line in open("log.txt"):
    do_something_with(line)
Even better is using a context manager, in recent Python versions.
with open("log.txt") as fileobject:
    for line in fileobject:
        do_something_with(line)
This will automatically close the file as well.
Please try this:
with open('filename', 'r', buffering=100000) as f:
    for line in f:
        print(line)
You are better off using an iterator instead.
Relevant: fileinput — Iterate over lines from multiple input streams.
From the docs:
import fileinput

for line in fileinput.input("filename", encoding="utf-8"):
    process(line)
This will avoid copying the whole file into memory at once.
An old-school approach:
fh = open(file_name, 'rt')
line = fh.readline()
while line:
    # do stuff with line
    line = fh.readline()
fh.close()
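On Python 3.8+, the same read-and-test loop can be written without duplicating the readline() call, using the walrus operator. A small sketch; the function wrapper and the line counting are mine, just to make the loop body concrete:

```python
def process_file(path):
    """Read a file line by line without loading it all, using an
    assignment expression (Python 3.8+) in place of the duplicated
    readline() calls of the old-school loop."""
    lines_seen = 0
    with open(path, 'rt') as fh:
        while line := fh.readline():
            lines_seen += 1  # do stuff with line here
    return lines_seen
```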
Here's what you do if you don't have newlines in the file:
with open('large_text.txt') as f:
    while True:
        c = f.read(1024)
        if not c:
            break
        print(c, end='')
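If the file does have record separators, just not newlines (say records delimited by ';'), the same chunked read can be extended to stitch records back together across chunk boundaries. A sketch; iter_records is my own name, not a standard function:

```python
def iter_records(f, sep=';', chunk_size=1024):
    """Yield sep-delimited records from an open text file, carrying
    any partial record across chunk boundaries so nothing is split."""
    pending = ''
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        parts = (pending + chunk).split(sep)
        pending = parts.pop()  # last piece may be incomplete
        for part in parts:
            yield part
    if pending:
        yield pending
```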
I couldn't believe that it could be as easy as @john-la-rooy's answer made it seem. So, I recreated the cp command using line-by-line reading and writing. It's CRAZY FAST.
#!/usr/bin/env python3.6
import sys

with open(sys.argv[2], 'w') as outfile:
    with open(sys.argv[1]) as infile:
        for line in infile:
            outfile.write(line)
The blaze project has come a long way over the last six years. It has a simple API covering a useful subset of pandas features.
dask.dataframe takes care of chunking internally, supports many parallelisable operations, and allows you to export slices back to pandas easily for in-memory operations.
import dask.dataframe as dd

df = dd.read_csv('filename.csv')
df.head(10)  # return first 10 rows
df.tail(10)  # return last 10 rows

# iterate rows
for idx, row in df.iterrows():
    ...

# group by my_field and return mean
df.groupby(df.my_field).value.mean().compute()

# slice by column
df[df.my_field == 'XYZ'].compute()
Here's code for loading text files of any size without causing memory issues. It supports gigabyte-sized files:
https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d
Download the file data_loading_utils.py and import it into your code.
Usage:
import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000

def process_lines(data, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; data is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
The process_lines method is the callback function. It will be called for all the lines, with the data parameter representing one single line of the file at a time.
You can configure the CHUNK_SIZE variable depending on your machine's hardware configuration.
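In case the gist is unavailable, here is my own rough guess at what such a read_lines_from_file_as_data_chunks helper might look like internally (this is not the gist's actual code): read chunk_size characters at a time, split on newlines, invoke the callback once per complete line, and finally once with eof=True.

```python
def read_lines_from_file_as_data_chunks(file_name, chunk_size, callback):
    """Sketch of the helper's shape: stream the file in chunk_size
    pieces, call callback(line, eof, file_name) once per line, and
    finish with a single eof=True call."""
    leftover = ''
    with open(file_name) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()  # possibly incomplete last line
            for line in lines:
                callback(line, False, file_name)
    if leftover:
        callback(leftover, False, file_name)
    callback('', True, file_name)
```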
How about this? Divide your file into chunks and then read it chunk by chunk, because when you read a file, your operating system will cache the next part. If you read the file line by line, you are not making efficient use of that cached information.
Instead, divide the file into chunks, load a whole chunk into memory, and then do your processing.
import os

def chunks(fh, size=1024):
    while True:
        startat = fh.tell()
        print(startat)  # file object's current position from the start
        fh.seek(size, 1)  # offset from current position --> 1
        data = fh.readline()
        yield startat, fh.tell() - startat  # doesn't store the whole list in memory
        if not data:
            break

# fname is the path of the file to read
if os.path.isfile(fname):
    try:
        fh = open(fname, 'rb')
    except IOError as e:  # file --> permission denied
        print("I/O error({0}): {1}".format(e.errno, e.strerror))
    except Exception as e1:  # handle other exceptions such as attribute errors
        print("Unexpected error: {0}".format(e1))
    for ele in chunks(fh):
        fh.seek(ele[0])  # startat
        data = fh.read(ele[1])  # endat
        print(data)
Thank you! I have recently converted to Python 3 and have been frustrated by using readlines(0) to read large files. This solved the problem. But to get each line, I had to do a couple of extra steps. Each line was preceded by a "b'", which I guess meant it was in binary format. Using decode('utf-8') changed it to ASCII.
Then I had to remove an "=\n" in the middle of each line.
Then I split the lines at the newline character.
b_data = fh.read(ele[1])  # endat; this is one chunk of ASCII data in binary format
a_data = (binascii.b2a_qp(b_data)).decode('utf-8')  # data chunk in 'split' ascii format
data_chunk = a_data.replace('=\n', '').strip()  # splitting characters removed
data_list = data_chunk.split('\n')  # list containing lines in chunk
#print(data_list, '\n')
#time.sleep(1)
for j in range(len(data_list)):  # iterate through data_list to get each item
    i += 1
    line_of_data = data_list[j]
    print(line_of_data)
This code starts just above the "print data" line in Arohi's code.
I demonstrated a parallel byte-level random access approach in this other question:
Getting number of lines in a text file without readlines
Some of the answers already provided are nice and concise. I like some of them. But it really depends on what you want to do with the data that's in the file. In my case, I just wanted to count lines, as fast as possible, on big text files. My code can of course be modified to do other things too, like any code.
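For the line-counting case specifically, a common fast pattern (a sketch of the general technique, not the answerer's actual code) is to count b'\n' occurrences over large binary chunks, so memory use stays bounded by one chunk:

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count newline characters by scanning the file in large binary
    chunks; only one chunk is ever held in memory."""
    count = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
    return count
```

Reading in binary avoids newline translation and decoding overhead, which is what makes this approach fast.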
This is the best solution I found regarding this, and I tried it on a 330 MB file.
lineno = 500
line_length = 8
with open('catfour.txt', 'r') as file:
    file.seek(lineno * (line_length + 2))
    print(file.readline(), end='')
Where line_length is the number of characters in a single line. For example, "abcd" has line length 4.
I added 2 to the line length to skip the '\n' character and move to the next character.
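Note this seek arithmetic only works when every line has exactly the same length. A small sketch to check the arithmetic on a fixed-width file; the function, file name, and newline_len parameter are mine (use newline_len=1 for '\n' endings, 2 for '\r\n', which is what the +2 above assumes):

```python
def read_line_at(path, lineno, line_length, newline_len=1):
    """Jump straight to 0-based line `lineno` in a file whose lines
    all contain exactly line_length characters, without reading any
    earlier lines. Opened in binary to avoid newline translation."""
    with open(path, 'rb') as f:
        f.seek(lineno * (line_length + newline_len))
        return f.readline().decode().rstrip('\r\n')
```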
I realise this was answered quite some time ago, but here is a way of doing it in parallel without killing your memory overhead (which would be the case if you tried to fire each line into the pool). Obviously swap the readJSON_line2 function out for something sensible - it's just to illustrate the point here!
The speedup will depend on filesize and what you are doing with each line - but for the worst-case scenario of a small file just read with the JSON reader, I'm seeing performance similar to single-threaded with the settings below.
Hopefully useful to someone out there:
def readJSON_line2(linesIn):
    # Function for reading a chunk of json lines
    '''
    Note, this function is nonsensical. A user would never use the approach suggested
    for reading in a JSON file,
    its role is to evaluate the MT approach for full line by line processing to both
    increase speed and reduce memory overhead
    '''
    import json
    linesRtn = []
    for lineIn in linesIn:
        if lineIn.strip() != '':
            lineRtn = json.loads(lineIn)
        else:
            lineRtn = ""
        linesRtn.append(lineRtn)
    return linesRtn

# -------------------------------------------------------------------
if __name__ == "__main__":
    import multiprocessing as mp

    path1 = "C:\\user\\Documents\\"
    file1 = "someBigJson.json"

    nCPUs = mp.cpu_count()
    pool = mp.Pool(nCPUs)  # Worker pool sized to the CPU count
    nBuffer = 20*nCPUs  # How many chunks are queued up (so cpus aren't waiting on processes spawning)
    nChunk = 1000  # How many lines are in each chunk
    # Both of the above will require balancing speed against memory overhead

    iJob = 0  # Tracker for SMP jobs submitted into pool
    iiJob = 0  # Tracker for SMP jobs extracted back out of pool

    jobs = []  # SMP job holder
    MTres3 = []  # Final result holder
    chunk = []
    iBuffer = 0  # Buffer line count
    with open(path1 + file1) as f:
        for line in f:
            # Send to the chunk
            if len(chunk) < nChunk:
                chunk.append(line)
            else:
                # Chunk full
                # Don't forget to add the current line to chunk
                chunk.append(line)
                # Then add the chunk to the buffer (submit to SMP pool)
                jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
                iJob += 1
                iBuffer += 1
                # Clear the chunk for the next batch of entries
                chunk = []
            # Buffer is full, any more chunks submitted would cause undue memory overhead
            # (Partially) empty the buffer
            if iBuffer >= nBuffer:
                temp1 = jobs[iiJob].get()
                for rtnLine1 in temp1:
                    MTres3.append(rtnLine1)
                iBuffer -= 1
                iiJob += 1

    # Submit the last chunk if it exists (as it would not have been submitted to SMP buffer)
    if chunk:
        jobs.append(pool.apply_async(readJSON_line2, args=(chunk,)))
        iJob += 1
        iBuffer += 1

    # And gather up the last of the buffer, including the final chunk
    while iiJob < iJob:
        temp1 = jobs[iiJob].get()
        for rtnLine1 in temp1:
            MTres3.append(rtnLine1)
        iiJob += 1

    # Cleanup
    del chunk, jobs, temp1
    pool.close()
This might be useful when you want to work in parallel and read only chunks of data, while keeping the chunk boundaries clean on newlines.
def readInChunks(fileObj, chunkSize=1024):
    while True:
        data = fileObj.read(chunkSize)
        if not data:
            break
        # Extend the chunk to the next newline so no line is split
        while data[-1:] != '\n':
            extra = fileObj.read(1)
            if not extra:  # EOF with no trailing newline
                break
            data += extra
        yield data
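A self-contained sketch of driving such a chunk generator; the generator is restated here (with an EOF guard on the one-byte extension reads) so this snippet runs on its own, and the sample stream is made up:

```python
import io

def read_in_chunks(file_obj, chunk_size=1024):
    """Yield chunks of roughly chunk_size characters, each extended
    so it ends exactly on a newline."""
    while True:
        data = file_obj.read(chunk_size)
        if not data:
            break
        while data[-1:] != '\n':
            extra = file_obj.read(1)
            if not extra:  # EOF without a trailing newline
                break
            data += extra
        yield data

stream = io.StringIO('alpha\nbeta\ngamma\n')
for chunk in read_in_chunks(stream, chunk_size=4):
    print(repr(chunk))  # every chunk ends with '\n'
```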
f = open('filename', 'r').read()
f1 = f.split('\n')
for i in range(len(f1)):
    do_something_with(f1[i])
Hope this helps.