Replacing certain rows and appending rest in CSV files with Python

I have a bunch of files that I need to append together and put into a new file. The first column contains dates. If the dates overlap from one file to the next, I want the rows from the file I'm appending to replace what I already have. For example, if the first file is something like:

1/5/2010 'hello'
1/6/2010 'goodbye'
1/7/2010 'yes'

and the second file is:

1/7/2010 'No'
1/8/2010 'spam'
1/9/2010 'today'

I want my new file to look like this:

1/5/2010 'hello'
1/6/2010 'goodbye'
1/7/2010 'No'
1/8/2010 'spam'
1/9/2010 'today'

Right now I'm trying something like this, but I'm not getting the right results. (reader2 and reader refer to the second file and the first file respectively; newfile2.csv already has the contents of file 1.)

for row in reader2:
    for row2 in reader:
        if row == row2:
            target = open('newfile2.csv', 'wb')
            writer = csv.writer(target)
            writer.writerow(row)
            target.close()
        else:
            target = open('newfile2.csv', 'ab')
            writer = csv.writer(target)
            writer.writerow(row)
            target.close()

Any ideas would be greatly appreciated. Thanks.

Okay, so I guess I should clarify after reading through some of the comments. The order is important. At the end of this code, I want the data for every single day of the year in order. The good news is that the data is already in order in the files; there are just some duplicates.

There is more than one duplicate. For example, the first file that I'm actually dealing with goes until March 9th, while I want it to stop at the end of February. I want all the March data from my second file.

Also, there are 1500+ rows, because in the real files every single hour of the day is also part of the rows.

I hope that clarifies what I need done.

I think something like the code I posted above, but checking only the first column of each row (since only the dates are going to be duplicates of each other), may work? Right now I'm checking the whole row, and while the dates are duplicates, the rows as a whole are unique.

Oh yeah, one last thing. I want all duplicates eliminated.

Try:

import csv

dictio = {}
for row in reader:
    # csv readers yield each row as a list; the date is the first column
    dictio[row[0]] = row

for row in reader2:
    # rows from the second file replace rows with the same date
    dictio[row[0]] = row

# Python 3: open csv output in text mode with newline=''
with open('newfile2.csv', 'w', newline='') as target:
    writer = csv.writer(target)
    for row in dictio.values():
        writer.writerow(row)

Edit: After the comments, if you want to maintain the order of the items, replace

dictio = {}

with

dictio = collections.OrderedDict()

This needs import collections and works for Python 2.7 and later. From Python 3.7 on, a plain dict also keeps insertion order.
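For reference, here is a self-contained sketch of the same idea run on the two sample files from the question. The io.StringIO objects stand in for real files, and the helper name merge_by_date is made up for this illustration. It assumes the files are space-delimited and that the date is always the first column.

```python
import csv
import io
from collections import OrderedDict


def merge_by_date(readers):
    """Merge rows from several csv readers, keyed on the first column.

    Later readers overwrite earlier ones for the same date, and a date keeps
    the position where it was first seen. (Hypothetical helper name.)
    """
    merged = OrderedDict()
    for reader in readers:
        for row in reader:
            if row:  # skip blank lines
                merged[row[0]] = row
    return list(merged.values())


# Sample data from the question; StringIO stands in for real files.
file1 = io.StringIO("1/5/2010 'hello'\n1/6/2010 'goodbye'\n1/7/2010 'yes'\n")
file2 = io.StringIO("1/7/2010 'No'\n1/8/2010 'spam'\n1/9/2010 'today'\n")

readers = [csv.reader(f, delimiter=' ') for f in (file1, file2)]
rows = merge_by_date(readers)

# Write the merged rows back out in the same space-delimited layout.
out = io.StringIO()
csv.writer(out, delimiter=' ', lineterminator='\n').writerows(rows)
print(out.getvalue(), end='')
```

Since the whole row is stored under its date, this also works when the real files have more than two columns. If the hour is part of what makes a row unique, the key would have to include it as well.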

If the files aren't huge (many thousands of rows), this should work well for any number of input files, maintain line order, and remove only the duplicates where one file ends and the next one begins, as in your example.

input_files = ['a.csv', 'b.csv', 'c.csv', 'd.csv']

# last line of the previous file, held back until the next file is seen
last = None
# plain file objects are enough here, since lines are copied unchanged
with open('newfile2.csv', 'w') as outfile:
    for input_file in input_files:
        with open(input_file) as infile:
            lines = infile.readlines()
        if not lines:
            continue  # skip empty files
        # write the held-back line unless this file starts with the same date
        if last is not None and last.split()[0] != lines[0].split()[0]:
            outfile.write(last)
        # save the last line for later
        last = lines.pop()
        outfile.writelines(lines)
    if last is not None:
        outfile.write(last)

If you want to get rid of all duplicates, use the dict method in one of the other answers, but don't use a plain dict ({}); use a collections.OrderedDict() so the rows stay in order.

The alternative to OrderedDict for Python 2.4-2.6 is http://pypi.python.org/pypi/ordereddict .
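A minimal sketch of how the two could be combined, assuming the PyPI backport exposes its class as ordereddict.OrderedDict (check the package page if in doubt). On Python 2.7 and later the first import succeeds and the backport is never touched.

```python
try:
    from collections import OrderedDict   # Python 2.7 and later
except ImportError:
    from ordereddict import OrderedDict   # assumed import path of the backport

dictio = OrderedDict()
dictio['1/7/2010'] = "'yes'"
dictio['1/8/2010'] = "'spam'"
dictio['1/7/2010'] = "'No'"   # overwriting keeps the key's original position

for date, text in dictio.items():
    print(date, text)
```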

The answers posted so far all rely on reading the data into memory, which is fine for small input files. But since you say your input files are already sorted, it is possible to process the input files row by row, allowing you to handle files with an arbitrary number of rows.

Assume you have a list of csv readers in preference order (if several files contain a row with the same key, the row from the first reader is taken), a csv writer for the output, and a function key that extracts the sort key from a row. You can then always output the row with the minimum sort key value and advance all readers whose current row has that same key value:

def combine(readers, writer, key):
    # read the first row of each input (assumes no input file is empty)
    rows = [next(reader) for reader in readers]
    while rows:
        # select the first input row with the minimum sort key value
        row = min(rows, key=key)
        writer.writerow(row)
        # advance all readers with the minimum sort key value;
        # iterate backwards so that deleting an exhausted reader
        # does not shift the indices still to be visited
        min_key = key(row)
        for i in reversed(range(len(readers))):
            if key(rows[i]) == min_key:
                try:
                    rows[i] = next(readers[i])
                except StopIteration:
                    # reader exhausted, remove it
                    del rows[i]
                    del readers[i]

To get a sortable key from the example files, you have to parse the date, since it is in a somewhat awkward format. Using ISO %Y-%m-%d dates in the files would make life easier, since they sort naturally.

import datetime

def key(row):
    return datetime.datetime.strptime(row[0], '%m/%d/%Y')
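To illustrate why the parsing matters, the following sketch sorts a few made-up rows (not from the question) once as plain strings and once with the key function above, and then converts the dates to the ISO format, where string order and date order agree.

```python
import datetime


def key(row):
    # Same key function as above: parse the first column as month/day/year.
    return datetime.datetime.strptime(row[0], '%m/%d/%Y')


# Made-up sample rows for illustration only.
rows = [['1/10/2010', 'b'], ['1/9/2010', 'a'], ['12/1/2009', 'c']]

by_string = sorted(rows)          # lexicographic: '1/10' sorts before '1/9'
by_date = sorted(rows, key=key)   # chronological

# ISO dates sort the same way as strings and as dates.
iso = [key(row).strftime('%Y-%m-%d') for row in rows]

print([row[0] for row in by_string])
print([row[0] for row in by_date])
print(sorted(iso))
```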

Putting it all together, you can run python combine.py input1.csv input2.csv > output.csv . The order of the input files is reversed so that files specified later will override files specified earlier.

import csv, sys

# combine() and key() are the functions defined above
delimiter = ' '                         # used in the example input files
readers = [csv.reader(open(filename), delimiter=delimiter)
           for filename in reversed(sys.argv[1:])]
writer = csv.writer(sys.stdout, delimiter=delimiter)
combine(readers, writer, key)
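As an alternative sketch, not taken from the answer above, the same row-by-row merge can be assembled from standard library pieces in Python 3.5 or later: heapq.merge lazily merges already sorted inputs, and itertools.groupby collects the rows that share a key. The function name combine_sorted is made up for this example. Here the readers are passed in their natural order and the last row of each group wins, so no reversing is needed. Note that this collapses every run of rows with the same key into one row, so with hourly data the key would have to include the hour.

```python
import csv
import datetime
import heapq
import io
import itertools


def key(row):
    return datetime.datetime.strptime(row[0], '%m/%d/%Y')


def combine_sorted(readers, key):
    """Yield one row per key from readers that are each sorted by key.

    heapq.merge keeps rows with equal keys in the order of the readers they
    came from, so the last row of each group comes from the last reader
    that contains that key. (Hypothetical helper name.)
    """
    merged = heapq.merge(*readers, key=key)
    for _, group in itertools.groupby(merged, key=key):
        row = None
        for row in group:   # walk the group, keeping only its last row
            pass
        yield row


# Sample data from the question; StringIO stands in for real files.
file1 = io.StringIO("1/5/2010 'hello'\n1/6/2010 'goodbye'\n1/7/2010 'yes'\n")
file2 = io.StringIO("1/7/2010 'No'\n1/8/2010 'spam'\n1/9/2010 'today'\n")

readers = [csv.reader(f, delimiter=' ') for f in (file1, file2)]
rows = list(combine_sorted(readers, key))

for row in rows:
    print(' '.join(row))
```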
