Parsing a massive CSV file with malformed software version numbers: how do I quickly reformat 5 million rows?
Here's a sample of my data:
from io import StringIO
data = StringIO("""software,version
Visual C++ Minimum Runtime,11.0.61030
Visual C++ Minimum Runtime,11.0.61030
Visual C++ Minimum Runtime,11.0.61030.0.0.0.0""")
Notice that the version number in the last record has an extra .0.0.0.0 appended to it.
How can I keep only the first three components of the version (xx.yy.zz) and discard the rest?
For example, the row Visual C++ Minimum Runtime,11.0.61030.0.0.0.0 should be truncated to:
"Visual C++ Minimum Runtime,11.0.61030"
Is there an efficient way to accomplish this?
You could use a generator to read the file row by row and write the truncated rows to a second file, so the full 5 million rows never have to sit in memory at once. For example:
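import csv

filename = "foo.csv"

def get_row(filename):
    # Yield one parsed row at a time instead of loading the whole file.
    # Python 3: open in text mode with newline="" as the csv module expects.
    with open(filename, newline="") as csvfile:
        for row in csv.reader(csvfile):
            yield row

with open("truncated.csv", "w", newline="") as truncatedcsv:
    writer = csv.writer(truncatedcsv, delimiter=",")
    for row in get_row(filename):
        truncated_row = row  # replace with your truncation logic
        writer.writerow(truncated_row)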
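The truncation step left open above could be filled in along these lines. This is only a minimal sketch, assuming the version is always the last field of a row and that its components are separated by dots; truncate_version and truncate_row are hypothetical helper names, not part of the original answer.

```python
def truncate_version(version, parts=3):
    """Keep only the first `parts` dot-separated components of a version string."""
    return ".".join(version.split(".")[:parts])


def truncate_row(row):
    # Assumption: the version is the last field of each row.
    # Empty rows (e.g. blank lines in the file) are passed through untouched.
    if not row:
        return row
    return row[:-1] + [truncate_version(row[-1])]
```

In the loop, truncated_row = truncate_row(row) would take the place of the truncation placeholder. The header row passes through unchanged because "version" contains no dots.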
Don't forget to rename the new file and delete the old one.
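One way to do that swap in a single step is os.replace, which renames the new file over the old one and overwrites it, so no separate delete is needed. A sketch, assuming both files live on the same filesystem; swap_in is a hypothetical helper name.

```python
import os


def swap_in(new_path, old_path):
    # os.replace renames new_path to old_path, overwriting old_path if it exists.
    os.replace(new_path, old_path)
```

With the file names used above, this would be called as swap_in("truncated.csv", "foo.csv") once the writer has finished and the output file is closed.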