I have to parse through a really big file, modify its contents, and write that to another file. The file that I have right now is not that big in comparison to what it could be, but it's big nonetheless.
The file is 1.3 GB and contains about 7 million lines of this format:
8823192\t/home/pcastr/...
Where \t is a tab character. The number at the beginning is the apparent size of the path that follows.
I want an output file with lines looking like this (in csv format):
True,8823192,/home/pcastr/...
Where the first value is whether the path is a directory.
Currently, my code looks something like this:
with open(filepath, "r") as open_file:
    while True:
        line = open_file.readline()
        if line == "":  # Checks for the end of the file
            break
        size = line.split("\t")[0]
        path = line.strip().split("\t")[1]
        is_dir = os.path.isdir(path)
        streamed_file.write(unicode("{isdir},{size},{path}\n".format(isdir=is_dir, size=size, path=path)))
A caveat is that files like this WILL get tremendously big, so I need a solution that is not only fast but memory efficient as well. I know that there is usually a trade-off between these two qualities.
The biggest gain is likely to come from calling split only once per line:
size, path = line.strip().split("\t")
# or ...split("\t", 3)[0:2] if there are extra fields to ignore
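A minimal, self-contained sketch of that single-split unpacking (the sample lines here are made up for illustration):

```python
# Hypothetical input lines in the "<size>\t<path>" format described above.
lines = ["8823192\t/home/pcastr/docs\n", "512\t/tmp\n"]

for line in lines:
    # One split per line; tuple unpacking avoids splitting twice.
    size, path = line.strip().split("\t")
    print(size, path)
```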
You can at least simplify your code by treating the input file as an iterator and using the csv module. This might give you a speed-up as well, as it eliminates the need for an explicit call to split:
import csv
import os

with open(filepath, "r") as open_file:
    reader = csv.reader(open_file, delimiter="\t")
    writer = csv.writer(streamed_file)
    for size, path in reader:
        is_dir = os.path.isdir(path)
        writer.writerow([is_dir, size, path])
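A side benefit worth noting (standard csv behaviour, not claimed in the answer above): csv.writer quotes any field that contains the delimiter, so paths with embedded commas stay parseable. A small sketch with a made-up path:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# A path containing a comma gets quoted automatically by csv.writer.
writer.writerow([True, "8823192", "/home/pcastr/a,b"])
print(buf.getvalue().strip())  # True,8823192,"/home/pcastr/a,b"
```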
Compressing the file before copying it through the network could speed up processing, because your data will reach your script faster.
Can you keep the input text file compressed on the remote target system? If yes, you could compress it to a format using an algorithm that is supported in Python (the zlib, gzip, bz2, lzma, or zipfile modules).
If not, you could at least run a script on the remote storage system to compress the file. You would then read the file, decompress it in memory using one of those modules, and process each line.
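To illustrate the "read compressed, decompress in memory" idea, here is a hedged sketch using the gzip module (the file name and contents are hypothetical; it writes its own sample file so it can run anywhere):

```python
import gzip
import os
import tempfile

# Create a small sample .gz file to stand in for the compressed listing.
sample = os.path.join(tempfile.mkdtemp(), "listing.gz")
with gzip.open(sample, "wt", encoding="utf-8") as f:
    f.write("8823192\t/tmp\n")

# Iterating a gzip file in text mode decompresses it transparently,
# line by line, so the whole file never sits in memory at once.
with gzip.open(sample, "rt", encoding="utf-8") as f:
    for line in f:
        size, path = line.rstrip("\n").split("\t")
        print(os.path.isdir(path), size, path)
```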