Reading Very Large One Liner Text File

I have a 30MB .txt file containing a single line of data (a 30-million-digit number).
Unfortunately, every method I've tried (mmap.read(), readline(), allocating 1GB of RAM, for loops) takes 45+ minutes to completely read the file. Every method I found on the internet seems to rely on each line being small, so that memory consumption is only as big as the biggest line in the file. Here's the code I've been using.

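import mmap
import time

start = time.perf_counter()  # time.clock() was removed in Python 3.8
z = open('Number.txt','r+')
m = mmap.mmap(z.fileno(), 0)  # map the whole file into memory
global a
a = int(m.read())  # the slow step: decimal text -> int
z.close()
end = time.perf_counter()
secs = (end - start)
# f is a log file opened elsewhere in the program
print("Number read in","%s" % (secs),"seconds.", file=f)
print("Number read in","%s" % (secs),"seconds.")
f.flush()
del end,start,secs,z,m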

Other than splitting the number from one line into several lines, which I'd rather not do, is there a cleaner method that won't take the better part of an hour?

By the way, I don't necessarily have to use text files.

I have: Windows 8.1 64-Bit, 16GB RAM, Python 3.5.1

The file read is quick (<1s):

with open('number.txt') as f:
    data = f.read()

Converting a 30-million-digit string to an integer is the slow part:

z=int(data) # still waiting...
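To see that the conversion rather than the read dominates, here is a minimal sketch that times int() on digit strings of growing length. The helper name time_int_conversion and the sizes are made up for illustration, and the digit strings are stand-ins for the file contents. Newer Python versions also cap the length of int/str conversions by default, so the sketch lifts that cap when the function for it exists.

```python
import sys
import time

# Newer Pythons (3.11+, plus some earlier security releases) refuse to
# convert very long digit strings by default; a limit of 0 disables the
# check. Older versions, such as 3.5, have no such function.
if hasattr(sys, "set_int_max_str_digits"):
    sys.set_int_max_str_digits(0)

def time_int_conversion(n_digits):
    """Return (value, seconds) for converting an n_digits-long string to int."""
    digits = "9" * n_digits              # stand-in for the file contents
    start = time.perf_counter()
    value = int(digits)                  # the step being measured
    return value, time.perf_counter() - start

# Small sizes so the sketch finishes quickly; doubling the digit count
# lets you see how the conversion time scales on your interpreter.
for n in (10000, 20000, 40000):
    _, secs = time_int_conversion(n)
    print(n, "digits:", round(secs, 4), "seconds")
```

Raising the sizes toward the real 30 million digits should be done gradually, since the time can grow much faster than the input length.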

If you store the number as raw big- or little-endian binary data, then int.from_bytes(data,'big') is much quicker.

If I did my math right (note that _ means "last line's answer" in Python's interactive interpreter):

>>> import math
>>> math.log(10)/math.log(2)  # Number of bits to represent a base 10 digit.
3.3219280948873626
>>> 30000000*_                # Number of bits to represent 30M-digit #.
99657842.84662087
>>> _/8                       # Number of bytes to represent 30M-digit #.
12457230.35582761             # Only ~12MB so file will be smaller :^)
>>> import os
>>> data=os.urandom(12457231) # Generate some random bytes
>>> z=int.from_bytes(data,'big')  # Convert to integer (<1s)
>>> z.bit_length()                # Number of bits in the integer.
99657848
>>> math.log10(z)   # number of base-10 digits in number.
30000001.50818886
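As a rough sketch of the binary-file idea, the number can be written once with int.to_bytes and read back with int.from_bytes. The helper names save_int and load_int and the file name number.bin are hypothetical, chosen only for this example, and the sketch assumes a non-negative integer.

```python
import os
import tempfile

def save_int(path, value):
    """Write a non-negative int to path as raw big-endian bytes."""
    n_bytes = max(1, (value.bit_length() + 7) // 8)  # round bits up to whole bytes
    with open(path, "wb") as handle:
        handle.write(value.to_bytes(n_bytes, "big"))

def load_int(path):
    """Read raw big-endian bytes from path and return them as an int."""
    with open(path, "rb") as handle:
        return int.from_bytes(handle.read(), "big")

# Round trip with a random 1000-byte number in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "number.bin")
original = int.from_bytes(os.urandom(1000), "big")
save_int(path, original)
restored = load_int(path)
print(restored == original)
```

The slow text-to-integer conversion still has to happen once to produce the binary file; after that, later runs only need load_int.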

EDIT: FYI, my math wasn't right, but I fixed it. Thanks for 10 upvotes without noticing :^)

A 30MB text file should not take very long to read; modern hard drives should be able to do this in less than a second (not counting access time).

Using standard Python file I/O should work fine in this case:

with open('my_file', 'r') as handle:
    content = handle.read()

Using this on my laptop yields times much less than a second.

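However, converting those 30 MB to an integer is your bottleneck. Python's int is arbitrary-precision, so it can hold the value, but in CPython versions such as the 3.5 used here, converting a decimal string to an int takes time that grows roughly with the square of the number of digits.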

You can try the decimal module; however, it is mainly designed for decimal floating-point arithmetic.
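A minimal sketch of the decimal route, under the assumption that exact integer arithmetic is wanted afterwards. The helper names parse_digits and add_exact are hypothetical. The Decimal constructor keeps every digit of the input string, but arithmetic is rounded to the context precision, so the context has to be widened before operating on the value.

```python
from decimal import Decimal, localcontext, MAX_EMAX

def parse_digits(digits):
    """Build a Decimal from a digit string; the constructor applies no rounding."""
    return Decimal(digits)

def add_exact(value, other, n_digits):
    """Add without rounding by widening the context for this one operation."""
    with localcontext() as ctx:
        ctx.prec = n_digits + 1   # room for a carry into one more digit
        ctx.Emax = MAX_EMAX       # the default Emax is too small for multi-million-digit values
        return value + other

# A short stand-in for the contents of the file.
digits = "9" * 50
number = parse_digits(digits)
print(add_exact(number, 1, len(digits)))
```

Whether this is fast enough for a 30-million-digit input, and whether the later arithmetic suits Decimal at all, is worth measuring before committing to it.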

Besides that, there is of course NumPy, which might be faster (and since you probably want to do some work with the number later on, it would make sense to use such a library).

I used the gmpy2 module to convert the string to a number.

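import time
import gmpy2  # third-party module, installed separately

start = time.perf_counter()  # time.clock() was removed in Python 3.8
z=open('Number.txt','r+')
data=z.read()  # reading the text itself is fast
global a
a=gmpy2.mpz(data)  # parse the decimal string with GMP
end = time.perf_counter()
secs = (end - start)
# f is a log file opened elsewhere in the program
print("Number read in","%s" % (secs),"seconds.", file=f)
print("Number read in","%s" % (secs),"seconds.")
f.flush()
del end,secs,start,z,data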

It worked in 3 seconds, much slower, but at least it gave me an integer value.

Thank you all for your invaluable answers; however, I'm going to mark this one as soon as possible.
