I would like to use Python to delete the header and the first row of a huge CSV file (3 GB) with good performance.
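import csv
import pandas as pd

def remove2rows(csv_file):
    # read_csv consumes the first line as the header
    data = pd.read_csv(csv_file)
    # drop the first data row
    data = data.iloc[1:]
    # write the remaining rows back to the same file, without a header line
    data.to_csv(csv_file, header=None, index=False)

if __name__ == "__main__":
    remove2rows(filename)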
This script works but takes some time, probably because it reads the whole file and then writes every row from row 3 to the end of the file to a new CSV file.
Is there any way to improve the performance?
Note that the only way to "remove lines from a file" IS to read the whole file (though not necessarily all at once xD) and write back selected lines to a new file. That's how files work.
But you'd certainly save time by not using pandas here: pandas is a tool for doing computations on tabular data, not a file utility. Using the stdlib's csv module, or even more simply plain file operations (if you are 101% sure your CSV doesn't contain embedded newlines), would probably be more efficient, at least with respect to memory use, and probably with respect to raw performance as well.
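A minimal sketch of the stdlib approach suggested above, not a definitive implementation. It streams records through the csv module one at a time, so memory use stays small and quoted fields with embedded newlines are handled. The function name `drop_first_rows`, the `n` and `encoding` parameters, and the demo file names are assumptions made for illustration. Note that csv.writer re-serializes each row, so quoting and line endings in the output may differ from the input.

```python
import csv
import os
import tempfile


def drop_first_rows(src_path, dst_path, n=2, encoding="utf-8"):
    """Stream src_path to dst_path, skipping the first n CSV records.

    Hypothetical helper: the name and parameters are illustrative.
    """
    # newline="" lets the csv module handle line endings itself
    with open(src_path, newline="", encoding=encoding) as src, \
            open(dst_path, "w", newline="", encoding=encoding) as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for _ in range(n):
            next(reader, None)  # skip one record; no error if the file is short
        writer.writerows(reader)  # rows are pulled lazily from the reader


if __name__ == "__main__":
    # Small demo on a temporary file (the file names are made up)
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.csv")
        dst = os.path.join(tmp, "out.csv")
        with open(src, "w", newline="", encoding="utf-8") as fh:
            fh.write("col1,col2\nskip,me\nkeep,1\nkeep,2\n")
        drop_first_rows(src, dst)
        with open(dst, newline="", encoding="utf-8") as fh:
            print(fh.read())
```

If the file is known to contain no embedded newlines, plain file operations avoid the parsing cost entirely.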
Question : Delete first two rows of a huge csv file
This example does the following:
Find the offset of the second NewLine, change the file position to it and copy to the end of the file.
Report back if you gain any improved performance!
Reference :
  bytes.find(sub[, start[, end]])
    Return the lowest index in the data where the subsequence sub is found.
  seek(offset[, whence])
    Change the file position to the given byte offset.
  shutil.copyfileobj(fsrc, fdst[, length])
    Contents from the current file position to the end of the file will be copied.
import io, shutil

DATA = b"""First line to be skipped
Second line to be skipped
Data Line 1
Data Line 2
Data Line 3
"""

def main():
    # with open('in_filename', 'rb') as in_fh, open('out_filename', 'wb') as out_fh:
    with io.BytesIO(DATA) as in_fh, io.BytesIO() as out_fh:
        # Find the offset of the second newline
        # Assuming it is within the first 70 bytes
        # Assuming NO embedded newline
        # Adjust it to your needs
        buffer = in_fh.read(70)
        offset = 0
        for n in range(2):
            # Position just after the next newline
            offset = buffer.find(b'\n', offset) + 1
        print('Change the file position to: {}'.format(offset))
        in_fh.seek(offset)

        # Copy to the end of the file
        shutil.copyfileobj(in_fh, out_fh)

        # This is only for demo printing the result
        print(out_fh.getvalue())

if __name__ == "__main__":
    main()
Output :
Change the file position to: 51
b'Data Line 1\nData Line 2\nData Line 3\n'
Tested with Python: 3.5
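The demo above works on in-memory buffers and assumes both newlines fall within the first 70 bytes. A minimal sketch of the same idea for real files follows, using readline() to move past the first two lines so no buffer size has to be guessed. The function names, the `n` and `chunk_size` parameters, and the temporary-file replacement step are assumptions for illustration, not part of the original answer. It still assumes NO embedded newlines in the skipped rows.

```python
import os
import shutil
import tempfile


def skip_lines_and_copy(src_path, dst_path, n=2, chunk_size=1024 * 1024):
    """Copy src_path to dst_path without its first n physical lines."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _ in range(n):
            src.readline()  # advances the file position past one newline
        # Copy the remainder in chunks of chunk_size bytes
        shutil.copyfileobj(src, dst, chunk_size)


def remove_first_lines_in_place(path, n=2):
    """Hypothetical wrapper: write to a temp file, then replace the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory, so os.replace stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        skip_lines_and_copy(path, tmp_path, n)
        os.replace(tmp_path, path)
    except Exception:
        os.unlink(tmp_path)  # clean up the partial copy
        raise


if __name__ == "__main__":
    # Small demo on a temporary file (the file name is made up)
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.csv")
        with open(demo, "wb") as fh:
            fh.write(b"header\nfirst row\nData Line 1\nData Line 2\n")
        remove_first_lines_in_place(demo)
        with open(demo, "rb") as fh:
            print(fh.read())
```

Writing to a temporary file first means the original stays intact if the copy fails partway. The replacement file gets the permissions of the temporary file, which may differ from the original's.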