
Parsing a massive CSV file with bad software version numbers: how do I quickly reformat 5 million rows?

Here's a sample of my data:

from io import StringIO

data = StringIO("""software,version
Visual C++ Minimum Runtime,11.0.61030
Visual C++ Minimum Runtime,11.0.61030
Visual C++ Minimum Runtime,11.0.61030.0.0.0.0""")

Notice that in the last record the version number has an extra .0.0.0.0 appended to it.

How can I keep only the first three dot-separated components (xx.yy.zz) and strip everything after them?

As an example, Visual C++ Minimum Runtime,11.0.61030.0.0.0.0 should be truncated to:

"Visual C++ Minimum Runtime,11.0.61030"

Is there an efficient way to accomplish this?

You could use a generator to read the file row by row and write the truncated rows to a new file, so the 5 million rows never have to be held in memory at once. For example:

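import csv

filename = "foo.csv"

def get_row(filename):
    # Yield one parsed row at a time instead of loading the whole file.
    # Python 3: open in text mode with newline="" as the csv module expects.
    with open(filename, newline="") as csvfile:
        for row in csv.reader(csvfile):
            yield row

with open("truncated.csv", "w", newline="") as truncatedcsv:
    writer = csv.writer(truncatedcsv, delimiter=",")
    for row in get_row(filename):
        # Keep only the first three dot-separated parts of the version,
        # assuming the two-column layout (software, version) from the question.
        truncated_row = [row[0], ".".join(row[1].split(".")[:3])]
        writer.writerow(truncated_row)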
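
To try the truncation step on the sample from the question before running it over the real file, something like the following might help. It is only a sketch: truncate_version and truncate_rows are hypothetical helper names, the version is assumed to be in the second column, and in-memory StringIO objects stand in for the real files.

```python
import csv
from io import StringIO

def truncate_version(version, parts=3):
    """Keep at most `parts` leading dot-separated components (hypothetical helper)."""
    return ".".join(version.split(".")[:parts])

def truncate_rows(rows, version_col=1):
    """Lazily yield rows with the version column truncated.

    Assumes the version sits in column `version_col`; shorter rows pass through unchanged.
    """
    for row in rows:
        if len(row) > version_col:
            row[version_col] = truncate_version(row[version_col])
        yield row

# Sample data copied from the question.
sample = StringIO("""software,version
Visual C++ Minimum Runtime,11.0.61030
Visual C++ Minimum Runtime,11.0.61030
Visual C++ Minimum Runtime,11.0.61030.0.0.0.0""")

out = StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerows(truncate_rows(csv.reader(sample)))
result = out.getvalue()
print(result)
```

Because truncate_rows is itself a generator, the same function can be fed csv.reader(open_file) for the full file and rows are still processed one at a time.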

Once you have checked the output, don't forget to delete the old file and rename the new one.
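
One possible way to do the swap from Python is os.replace, which moves the new file over the old one in a single call. The sketch below uses a temporary directory and made-up file contents purely for demonstration; swap_in is a hypothetical helper name, and both paths are assumed to be on the same filesystem.

```python
import os
import tempfile

def swap_in(new_path, old_path):
    """Move new_path over old_path; afterwards only old_path exists (hypothetical helper)."""
    os.replace(new_path, old_path)

# Demonstration with throwaway files; the contents here are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    old_path = os.path.join(tmp, "foo.csv")
    new_path = os.path.join(tmp, "truncated.csv")
    with open(old_path, "w") as f:
        f.write("old\n")
    with open(new_path, "w") as f:
        f.write("new\n")

    swap_in(new_path, old_path)

    with open(old_path) as f:
        swapped = f.read()
    new_still_exists = os.path.exists(new_path)
```

Since this overwrites the original, it may be worth keeping a copy of the old file until the truncated output has been verified.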


 