
Python - CSV: Large file with rows of different lengths

In short, I have a 20,000,000-line CSV file with rows of different lengths. This is due to archaic data loggers and proprietary formats; we get the end result as a CSV file in the following format. My goal is to insert this file into a Postgres database. How can I do the following:

  • Keep the first 8 columns and my last 2 columns, to have a consistent CSV file
  • Add a new column to the CSV file at either the first or last position.

1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50

Read a row with csv, then:

newrow = row[:8] + row[-2:]

then add your new field and write it out (also with csv).
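A minimal end-to-end sketch of that approach (the filenames and the `'onemore'` placeholder field are illustrative assumptions, not part of the question):

```python
import csv

def normalize(infilename, outfilename, extra_field):
    # Read each variable-length row, keep the first 8 and last 2
    # fields, append the new field, and write the consistent row out.
    with open(infilename, newline='') as infile, \
         open(outfilename, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        for row in csv.reader(infile):
            writer.writerow(row[:8] + row[-2:] + [extra_field])
```

Called as `normalize('thebigfile.csv', 'clean.csv', 'onemore')`, this streams the file one row at a time, so the 20-million-line size is not a memory problem.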

You can open the file as a text file and read the lines one at a time. Are there quoted or escaped commas that don't split fields? If not, you can do:

with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        fields = [f.strip() for f in line.split(',')]
        consist = fields[:8] + fields[-2:] + ['onemore']
        ... use the `consist` list as warranted ...

I suspect that where I have `+ ['onemore']` you may want to "add a column", as you say, with some very different content, but of course I can't guess what it might be.

Don't send each line separately to the DB with an insert -- 20 million inserts would take a long time. Rather, group the "made-consistent" lists, appending them to a temporary list -- each time that list's length hits, say, 1000, use an executemany to add all those entries.
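That batching pattern might look like the sketch below. Here sqlite3 stands in for the Postgres driver (psycopg2 exposes the same `executemany` interface), and the table name `raw` and the `'onemore'` placeholder column are assumptions:

```python
import csv
import sqlite3

BATCH = 1000  # flush to the database every 1000 rows

def load(infilename, conn):
    cur = conn.cursor()
    batch = []
    sql = "INSERT INTO raw VALUES (?,?,?,?,?,?,?,?,?,?,?)"  # 8 + 2 + 1 columns
    with open(infilename, newline='') as f:
        for row in csv.reader(f):
            batch.append(row[:8] + row[-2:] + ['onemore'])
            if len(batch) >= BATCH:
                cur.executemany(sql, batch)  # one round trip per 1000 rows
                batch = []
    if batch:
        cur.executemany(sql, batch)  # final partial batch
    conn.commit()
```

With psycopg2 the only changes would be the connection object and `%s`-style placeholders instead of `?`.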

Edit: to clarify, I don't recommend using csv to process a file you know is not in "proper" CSV format: processing it directly gives you more direct control (especially as and when you discover other irregularities beyond the varying number of commas per line).

I would recommend using the csv module. Here's some code based off CSV processing that I've done elsewhere:

from __future__ import with_statement
import csv
import sys

def process( reader, writer ):
    for row in reader:
        data = row[:8] + row[-2:]
        writer.writerow( data )

def main( infilename, outfilename ):
    with open( infilename, 'rU' ) as infile:
        reader = csv.reader( infile )
        with open( outfilename, 'wb' ) as outfile:
            writer = csv.writer( outfile )
            process( reader, writer )

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print "syntax: python process.py filename outname"
        sys.exit(1)
    main( sys.argv[1], sys.argv[2] )

Sorry, you will need to write some code with this one. When you have a huge file like this, it's worth checking all of it to be sure it's consistent with what you expect. If you let unhappy data into your database, you will never get all of it out.

Remember the oddities of CSV: it's a mishmash of a bunch of similar standards with different rules about quoting, escaping, null characters, Unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has 'dialects' and options, and you might find the csv.Sniffer class helpful.
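For instance, Sniffer can guess the dialect from a sample of the file; a minimal sketch with a made-up semicolon-delimited sample:

```python
import csv

sample = "a;b;c\n1;2;3\n"
dialect = csv.Sniffer().sniff(sample)  # infer delimiter/quoting from the sample
rows = list(csv.reader(sample.splitlines(), dialect))
```

In practice you would pass the first few kilobytes of the real file to `sniff()` rather than a literal string.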

I recommend you:

  • Run a 'tail' command to look at the last few lines.
  • If it appears well behaved, run the whole file through the csv reader to see if it breaks. Make a quick histogram of "fields per line".
  • Think about "valid" ranges and character types and rigorously check them as you read. Especially watch for unusual Unicode or characters outside of the printable range.
  • Seriously consider whether you want to keep the extra, odd-ball values in a "rest of the line" text field.
  • Toss any unexpected lines into an exception file.
  • Fix up your code to handle the new pattern in the exception file. Rinse. Repeat.
  • Finally, run the whole thing again, actually dumping data into the database.
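The "fields per line" histogram from the second step is only a few lines with collections.Counter (the filename is a placeholder):

```python
import csv
from collections import Counter

def field_histogram(filename):
    # Map "number of fields" -> "number of lines with that many fields";
    # a quick sanity check before trusting 20 million lines.
    counts = Counter()
    with open(filename, newline='') as f:
        for row in csv.reader(f):
            counts[len(row)] += 1
    return counts
```

Anything other than a small number of expected field counts in the result is a line pattern your loader needs to handle.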

Your development time will be faster if you don't touch a database until you are completely done. Also, be advised that SQLite is blazingly fast on read-only data, so Postgres might not be the best solution.

Your final code will probably look like this, but I can't be sure without knowing your data, especially how 'well behaved' it is:

reader = csv.reader(open(infilename, 'r'))
eof = False
while not eof:
    out = []
    for chunk in range(1000):
        try:
            fields = reader.next()
        except StopIteration:
            eof = True
            break
        except csv.Error:
            print str(reader.line_num) + ", 'failed to parse'"
            continue
        try:
            assert 5 < len(fields) < 12
            assert 0 < int(fields[3]) < 999999
            assert 1 <= int(fields[4]) <= 12             # month
            assert fields[5] == fields[5].strip()        # no extra whitespace
            assert not fields[5].strip(printable_chars)  # no odd chars
            ...
        except AssertionError:
            print str(reader.line_num) + ", 'failed checks'"
            continue
        new_rec = [reader.line_num]   # new first item
        new_rec.extend(fields[:8])    # first eight
        new_rec.extend(fields[-2:])   # last two
        new_rec.append(",".join(fields[8:-2]))  # and the rest
        out.append(new_rec)
    if database:
        cursor.executemany("INSERT INTO raw_table VALUES (%s, ...)", out)

Of course, your mileage may vary with this code. It's a first draft of pseudo-code. Expect writing solid code for the input to take most of a day.
