How can I merge two csv files by a common column, in the case of unequal rows?
I have a set of 100 files: 50 files containing census information for each US state, and another fifty containing geographic data that need to be merged with the correct census file for each state.
For each state, the census file and its corresponding geo file are related by a common variable, LOGRECNO, which is the 10th column in the census file and the 7th column in the geo file.
The problem is that the geo file has more rows than the census file; my census data does not cover certain subsets of geographic locations and hence has fewer rows than the geo data file.
How can I merge the census data with the geographic data (keeping only the rows/geo locations where census data exists; I don't care about the rest)?
I am a newbie to Python and only somewhat know how to do basic csv file I/O in Python. Manipulating 2 csv files at the same time is proving confusing.
Example:
sample_state_census.csv
Varname 1 Varname 2 ... Varname 10 (LOGRECNO) ... Varname 16000
xxx xxx ... 1 ... xxx
xxx xxx ... 2 ... xxx
...
...
xxx xxx ... 514 ... xxx
xxx xxx ... 1312 ... xxx
...
...
xxx xxx ... 1500 ... xxx
sample_state_geo.csv
GeoVarname 1 GeoVarname 2 ... GeoVarname 7 (LOGRECNO) ... GeoVarname 65
yyy yyy ... 1 ... yyy
yyy yyy ... 2 ... yyy
...
...
yyy yyy ... 514 ... yyy
yyy yyy ... 515 ... yyy
...
...
yyy yyy ... 1500 ... yyy
Expected output (don't merge rows for values of LOGRECNO that don't exist in sample_state_census.csv):
Varname 1 Varname 2 ... Varname 10 (LOGRECNO) GeoVarname 1 GeoVarname 2 ... GeoVarname 65 Varname 11... Varname 16000
xxx xxx ... 1 yyy yyy ... yyy xxx ... xxx
xxx xxx ... 2 yyy yyy ... yyy xxx ... xxx
...
...
xxx xxx ... 514 yyy yyy ... yyy xxx ... xxx
xxx xxx ... 1312 yyy yyy ... yyy xxx ... xxx
...
...
xxx xxx ... 1500 yyy yyy ... yyy xxx ... xxx
Read the data from the shorter file into memory, into a dictionary keyed on the LOGRECNO column:
import csv

with open('sample_state_census.csv', 'rb') as census_file:
    reader = csv.reader(census_file, delimiter='\t')
    census_header = next(reader, None)  # store header
    census = {row[9]: row for row in reader}
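To see the shape of the lookup this builds, here is a tiny self-contained sketch (Python 3, with made-up tab-delimited data; only the 10th column, index 9, matters):

```python
import csv
import io

# A made-up tab-delimited sample with LOGRECNO as the 10th column (index 9)
sample = (
    "v1\tv2\tv3\tv4\tv5\tv6\tv7\tv8\tv9\tLOGRECNO\tv11\n"
    "a\tb\tc\td\te\tf\tg\th\ti\t1\tz\n"
    "a\tb\tc\td\te\tf\tg\th\ti\t514\tz\n"
)
reader = csv.reader(io.StringIO(sample), delimiter='\t')
census_header = next(reader, None)        # header row
census = {row[9]: row for row in reader}  # LOGRECNO -> full row

print(sorted(census))     # ['1', '514']
print(census['514'][10])  # 'z'
```

Each row is now reachable in constant time by its LOGRECNO value, which is what makes the matching pass over the geo file cheap.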
Then use this dictionary to match against the geo data, writing out the matches:
with open('sample_state_geo.csv', 'rb') as geo_file:
    with open('outputfile.csv', 'wb') as outfile:
        reader = csv.reader(geo_file, delimiter='\t')
        geo_header = next(reader, None)  # grab header
        geo_header.pop(6)  # no need to list LOGRECNO header twice
        writer = csv.writer(outfile, delimiter='\t')
        writer.writerow(census_header + geo_header)
        for row in reader:
            if row[6] not in census:
                # no census data for this LOGRECNO entry
                continue
            # new row is all of the census data plus all of geo minus column 7
            newrow = census[row[6]] + row[:6] + row[7:]
            writer.writerow(newrow)
This all assumes the census file is not so big as to take up too much memory. If that's the case you'll have to use a database instead (read all the data into a SQLite database, then match against the geo data in the same vein).
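That database fallback can be sketched as follows; the rows here are made-up stand-ins (in practice each would come from csv.reader), and the inner join keeps only LOGRECNO values present in both tables:

```python
import sqlite3

# Made-up (logrecno, data) rows standing in for the real csv rows
census_rows = [('1', 'census-a'), ('2', 'census-b'), ('514', 'census-c')]
geo_rows = [('1', 'geo-a'), ('2', 'geo-b'), ('515', 'geo-d')]

conn = sqlite3.connect(':memory:')  # use a file path when data exceeds RAM
cur = conn.cursor()
cur.execute("CREATE TABLE census (logrecno TEXT PRIMARY KEY, data TEXT)")
cur.execute("CREATE TABLE geo (logrecno TEXT PRIMARY KEY, data TEXT)")
cur.executemany("INSERT INTO census VALUES (?, ?)", census_rows)
cur.executemany("INSERT INTO geo VALUES (?, ?)", geo_rows)

# Inner join: rows survive only if the LOGRECNO exists in both tables
merged = cur.execute(
    "SELECT c.logrecno, c.data, g.data "
    "FROM census c JOIN geo g ON c.logrecno = g.logrecno "
    "ORDER BY CAST(c.logrecno AS INTEGER)").fetchall()
print(merged)  # [('1', 'census-a', 'geo-a'), ('2', 'census-b', 'geo-b')]
```

SQLite does the matching on disk (or in memory, as here), so neither file has to fit into a Python dictionary.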
For merging multiple files (even > 2) based on one or more common columns, one of the best and most efficient approaches in Python is to use "brewery". You can even specify which fields need to be considered for merging and which fields need to be saved.
import brewery
from brewery import ds
import sys
sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]

# Collect the union of fields across all sources
all_fields = brewery.FieldList(["file"])
for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)

out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()

for source in sources:
    path = source["file"]

    # Initialize data source: skip reading of headers
    # -- use XLSDataSource for XLS files
    # We ignore the fields in the header, because we have set-up fields
    # previously. We need to skip the header row.
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()

    for record in src.records():
        # Add file reference into output - to know where the row comes from
        record["file"] = path
        out.append(record)

    # Close the source stream
    src.finalize()
cat merged.csv | brewery pipe pretty_printer