In my ML projects I've started encountering 10 GB+ CSV files, so I am trying to implement an efficient way to grab specific lines from them. This led me to discover itertools (which can supposedly skip over a csv.reader's lines efficiently, whereas looping over it instead would load every row it went over into memory), and following this answer I tried the following:
import collections
import csv
import itertools

with open(csv_name, newline='') as f:
    ## Efficiently find total number of lines in csv
    lines = sum(1 for line in f)
    ## Proceed only if my csv has more than just its header
    if lines < 2:
        return None
    else:
        ## Read csv file
        reader = csv.reader(f, delimiter=',')
        ## Skip to last line
        consume(reader, lines)
        ## Output last row
        last_row = list(itertools.islice(reader, None, None))
with consume() defined as:
def consume(iterator, n):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    # Use functions that consume iterators at C speed.
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)
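As a sanity check, the recipe does behave as its docstring says on a fresh iterator (a minimal in-memory example, not from the original post):

```python
import collections
import itertools

def consume(iterator, n):
    "Advance the iterator n steps ahead. If n is None, consume entirely."
    if n is None:
        # feed the entire iterator into a zero-length deque
        collections.deque(iterator, maxlen=0)
    else:
        # advance to the empty slice starting at position n
        next(itertools.islice(iterator, n, n), None)

it = iter(range(10))
consume(it, 3)       # skips 0, 1, 2
print(next(it))      # -> 3

it2 = iter(range(10))
consume(it2, None)   # exhausts the iterator completely
print(next(it2, 'empty'))  # -> empty
```

So the recipe itself is fine; the problem must be in how the file iterator is used.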
However, I only get an empty list from last_row, meaning something went wrong.
The short csv which I am testing this code out on:
Author,Date,Text,Length,Favorites,Retweets
Random_account,2019-03-02 19:14:51,twenty-two,10,0,0
Where am I going wrong?
What's going wrong is that you are iterating over the file to get its length, which exhausts the file iterator:
lines = sum(1 for line in f)
You need to either re-open the file, or use f.seek(0).
So either:
def get_last_line(csv_name):
    with open(csv_name, newline='') as f:
        ## Efficiently find total number of lines in csv
        lines = sum(1 for line in f)  # the iterator is now exhausted
        if lines < 2:  # lines is an int, so no len() here
            return
    with open(csv_name, newline='') as f:  # open file again
        # Keep going with your function
        ...
Alternatively,
def get_last_line(csv_name):
    with open(csv_name, newline='') as f:
        ## Efficiently find total number of lines in csv
        lines = sum(1 for line in f)  # the iterator is now exhausted
        if lines < 2:  # lines is an int, so no len() here
            return
        # we can "cheat" the iterator protocol and
        # move the iterator back to the beginning
        f.seek(0)
        ...  # continue with the function
However, if you want the last line, you can simply do:
for line in f:
    pass
print(line)
Perhaps using a collections.deque would be faster (it's what the consume recipe itself uses):
collections.deque(f, maxlen=1)
Here are two different ways to approach the problem; let me just create a file real quick:
Juans-MacBook-Pro:tempdata juan$ history > history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ history >> history.txt
Juans-MacBook-Pro:tempdata juan$ cat history.txt | wc -l
2000
OK, in an IPython repl:
In [1]: def get_last_line_fl(filename):
   ...:     with open(filename) as f:
   ...:         prev = None
   ...:         for line in f:
   ...:             prev = line
   ...:         if prev is None:
   ...:             return None
   ...:         else:
   ...:             return line
   ...:
In [2]: import collections
   ...: def get_last_line_dq(filename):
   ...:     with open(filename) as f:
   ...:         last_two = collections.deque(f, maxlen=2)
   ...:         if len(last_two) < 2:
   ...:             return
   ...:         else:
   ...:             return last_two[-1]
   ...:
In [3]: %timeit get_last_line_fl('history.txt')
1000 loops, best of 3: 337 µs per loop
In [4]: %timeit get_last_line_dq('history.txt')
1000 loops, best of 3: 339 µs per loop
In [5]: get_last_line_fl('history.txt')
Out[5]: ' 588 history >> history.txt\n'
In [6]: get_last_line_dq('history.txt')
Out[6]: ' 588 history >> history.txt\n'