Merge 2 csv files based on a common field using python
I generated 2 CSV files from 2 MySQL tables. Now I want to merge the 2 files together.
I manually added this header to the first CSV:
ID,name,sector,sub_sector
And this is the header of the second CSV:
ID,url
My goal is to have 1 file:
ID,name,sector,sub_sector,url
Note: not all records in the first file have a match in the second file.
This is the snippet I was using:
#!/usr/bin/env python
import glob, csv
if __name__ == '__main__':
    infiles = glob.glob('./*.csv')
    out = 'temp.csv'
    data = {}
    fields = []
    for fname in infiles:
        df = open(fname, 'rb')
        reader = csv.DictReader(df)
        for line in reader:
            # assuming the field is called ID
            if line['ID'] not in data:
                data[line['ID']] = line
            else:
                for k,v in line.iteritems():
                    if k not in data[line['ID']]:
                        data[line['ID']][k] = v
            for k in line.iterkeys():
                if k not in fields:
                    fields.append(k)
        del reader
        df.close()
    writer = csv.DictWriter(open(out, "wb"), fields, extrasaction='ignore', dialect='excel')
    # write the header at the top of the file
    writer.writeheader()
    writer.writerows(data)
    del writer
It is taken from another SO thread. This is the error I'm getting:
  File "db_work.py", line 30, in <module>
    writer.writerows(data)
  File "/usr/lib/python2.7/csv.py", line 153, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "/usr/lib/python2.7/csv.py", line 144, in _dict_to_list
    ", ".join(wrong_fields))
ValueError: dict contains fields not in fieldnames: 4, 4, 4, 6
~/Development/python/DB$ python db_work.py
Traceback (most recent call last):
  File "db_work.py", line 30, in <module>
    writer.writerows(data)
  File "/usr/lib/python2.7/csv.py", line 153, in writerows
    rows.append(self._dict_to_list(rowdict))
  File "/usr/lib/python2.7/csv.py", line 145, in _dict_to_list
    return [rowdict.get(key, self.restval) for key in self.fieldnames]
AttributeError: 'str' object has no attribute 'get'
Any ideas how to fix this?
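.writerows() expects a sequence of rows (for a DictWriter, a list of dicts), but you are passing in a dict instead. Iterating over a dict yields its keys, so each "row" the writer sees is an ID string rather than a dict. That is why the second traceback complains about 'str', and why the first one reports single characters such as 4, 4, 4, 6 as unknown fields. I think you wanted to write the values of data only: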
writer = csv.DictWriter(open(out, "wb"), fields, dialect='excel')
# write the header at the top of the file
writer.writeheader()
writer.writerows(data.values())
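As a rough illustration of that fix, here is a minimal sketch. It assumes Python 3 (the thread itself uses Python 2.7), writes to an in-memory buffer instead of temp.csv, and uses made-up sample rows; restval='' is my choice for leaving the url column empty when an ID has no match.

```python
import csv
import io

# Hypothetical merged records keyed by ID, shaped like `data` in the question.
data = {
    '1': {'ID': '1', 'name': 'Acme', 'sector': 'Tech',
          'sub_sector': 'Software', 'url': 'http://example.com/1'},
    '2': {'ID': '2', 'name': 'Globex', 'sector': 'Energy',
          'sub_sector': 'Oil'},  # no matching url record
}
fields = ['ID', 'name', 'sector', 'sub_sector', 'url']

# Iterating over a dict yields its keys (plain strings), not the row dicts.
keys_seen = list(data)

buf = io.StringIO()
# restval fills in columns that a row dict does not contain.
writer = csv.DictWriter(buf, fields, restval='', lineterminator='\n')
writer.writeheader()
writer.writerows(data.values())  # pass the row dicts, not the dict itself

merged_text = buf.getvalue()
print(merged_text)
```

With real files the buffer would be replaced by something like open(out, 'w', newline='') on Python 3.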
Personally, I'd first read the file that has just the id, url rows and add those to a dict, then read the other file and write its rows out one at a time, appending the corresponding url entry to each.
import csv

out = 'temp.csv'  # output file name, as in the question

with open('urls.csv', 'rb') as urls_file:
    reader = csv.reader(urls_file)
    next(reader)  # skip the header, won't need that here
    urls = {id: url for id, url in reader}

with open('other.csv', 'rb') as other:
    with open(out, 'wb') as output:
        reader = csv.reader(other)
        writer = csv.writer(output)
        writer.writerow(next(reader) + ['url'])  # read old header, add url and write out
        for row in reader:
            # write out original row plus url if we can find one
            writer.writerow(row + [urls.get(row[0], '')])
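For readers on Python 3, the same lookup-then-append idea could be sketched roughly as follows. The function name merge_urls and the sample rows are illustrative assumptions, not part of the original answer, and in-memory buffers stand in for the two CSV files.

```python
import csv
import io

def merge_urls(main_file, urls_file, out_file):
    """Copy main_file to out_file, appending a url column matched on the first column (ID)."""
    url_reader = csv.reader(urls_file)
    next(url_reader)  # skip the ID,url header
    urls = {row[0]: row[1] for row in url_reader}

    reader = csv.reader(main_file)
    writer = csv.writer(out_file, lineterminator='\n')
    writer.writerow(next(reader) + ['url'])  # old header plus the new column
    for row in reader:
        # rows without a match in the url file get an empty url
        writer.writerow(row + [urls.get(row[0], '')])

# Made-up sample data standing in for the two CSV files.
main_csv = io.StringIO(
    "ID,name,sector,sub_sector\n"
    "1,Acme,Tech,Software\n"
    "2,Globex,Energy,Oil\n"
)
urls_csv = io.StringIO("ID,url\n1,http://example.com/1\n")
merged = io.StringIO()

merge_urls(main_csv, urls_csv, merged)
print(merged.getvalue())
```

With files on disk, each one would be opened in text mode with newline='' (for example open('urls.csv', newline='')) instead of the 'rb' and 'wb' modes used on Python 2.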