Efficient way of reading large txt file in Python
I'm trying to open a txt file with 4605227 rows (305 MB).
The way I have done this before is:
import numpy as np
import pandas as pd

data = np.loadtxt('file.txt', delimiter='\t', dtype=str, skiprows=1)
df = pd.DataFrame(data, columns=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
df = df.astype(dtype={"a": "int64", "h": "int64", "i": "int64"})
But it's using up most of the available RAM (~10 GB) and not finishing. Is there a faster way of reading in this txt file and creating a pandas DataFrame?
Thanks!
Edit: Solved now, thank you. Why is np.loadtxt() so slow?
Rather than reading it in with NumPy, you could read it directly in as a pandas DataFrame, e.g. using the pandas.read_csv function, with something like:
df = pd.read_csv('file.txt', delimiter='\t', header=0, names=["a", "b", "c", "d", "e", "f", "g", "h", "i"], dtype={"a": "int64", "h": "int64", "i": "int64"})
Since your file's first row is skipped anyway, names assigns your own column names, and passing dtype up front avoids the separate astype conversion pass.
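If memory is still tight, read_csv can also stream the file in pieces via its chunksize parameter. A minimal sketch, using an in-memory StringIO with made-up contents to stand in for file.txt:

```python
import io

import pandas as pd

# Hypothetical tab-separated contents standing in for file.txt.
data = io.StringIO("a\tb\tc\n1\tx\ty\n2\tp\tq\n")

# chunksize makes read_csv return an iterator of DataFrames, so peak
# memory is bounded by the chunk size rather than the whole file.
chunks = pd.read_csv(data, delimiter='\t', dtype={"a": "int64"}, chunksize=1)
df = pd.concat(chunks, ignore_index=True)
print(df.shape)  # (2, 3)
```

For the real 4.6M-row file you would pass the path instead of the StringIO and a much larger chunksize (e.g. 100_000).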
Method 1:
You can read the file in chunks. readlines accepts an optional size hint, so each call returns roughly that many bytes' worth of complete lines:
BUFFER_SIZE = 1024 * 1024  # ~1 MB of lines per batch
with open('inputTextFile', 'r') as input_file:
    buffer_lines = input_file.readlines(BUFFER_SIZE)
    while buffer_lines:
        # logic goes here
        buffer_lines = input_file.readlines(BUFFER_SIZE)
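To see how the size hint behaves, here is a small self-contained sketch using a StringIO with made-up lines in place of a real file:

```python
import io

# readlines(hint) stops collecting once roughly `hint` bytes of complete
# lines have been read, so the whole file never sits in memory at once.
buf = io.StringIO("line1\nline2\nline3\n")

total = 0
batch = buf.readlines(6)  # each line here is exactly 6 characters
while batch:
    total += len(batch)
    batch = buf.readlines(6)
print(total)  # 3
```

With a 6-byte hint each batch holds a single line, but every line is still delivered exactly once.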
Method 2:
You can also use the mmap module. The link below explains its usage.
import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file; size 0 means the whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have the same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello world!\n"
    # close the map
    mm.close()
https://docs.python.org/3/library/mmap.html
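As a sketch of why mmap helps with a file this size: you can, for example, count the rows without reading the file into RAM by scanning the mapped bytes. The file name and contents below are made up for illustration:

```python
import mmap
import os
import tempfile

# Create a small hypothetical file to map.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"row1\nrow2\nrow3\n")
    path = tf.name

# ACCESS_READ maps the file read-only; the OS pages bytes in on demand,
# so memory use stays low even for a multi-GB file.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    lines = 0
    pos = mm.find(b"\n")
    while pos != -1:
        lines += 1
        pos = mm.find(b"\n", pos + 1)
    mm.close()

os.unlink(path)
print(lines)  # 3
```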
You can read it directly in as a pandas DataFrame, e.g.
import pandas as pd
pd.read_csv(path)
If you want to read it faster, you can use modin:
import modin.pandas as pd
pd.read_csv(path)
https://github.com/modin-project/modin
The code below reads the file line by line: it iterates over each line of the file object in a for loop, so you can process each line however you want without loading the whole file at once.
with open("file.txt") as fobj:
    for line in fobj:
        print(line)  # do your processing here
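Building on that loop, here is one way to assemble a DataFrame while reading line by line. A sketch with a StringIO and made-up contents standing in for the real tab-separated file:

```python
import io

import pandas as pd

# Hypothetical contents; the real file is tab-separated with a header row.
fobj = io.StringIO("a\tb\n1\tfoo\n2\tbar\n")

# Take the first line as column names, then split each data line on tabs.
header = fobj.readline().rstrip("\n").split("\t")
rows = [line.rstrip("\n").split("\t") for line in fobj]

df = pd.DataFrame(rows, columns=header)
print(df.shape)  # (2, 2)
```

Note that this keeps every parsed row in a Python list before building the DataFrame, so for 4.6M rows pd.read_csv (optionally with chunksize) will still be faster and leaner.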