
Merge one column from variable number of csv files into one csv file

Novice Python programmer here. I know there are a lot of SO posts relating to this, but none of the solutions I've reviewed seem to fit my problem.

I have a variable number of csv files, all with the same number of columns. The header for the fourth column will change with each csv file (it's a Julian date). Incidentally, this fourth column stores surface temperatures from a satellite sensor. As an example:

UID,Latitude,Longitude,001
1,-151.01,45.20,13121
2,-151.13,45.16,15009
3,-151.02,45.09,10067
4,-151.33,45.03,14010

I would like to keep the first four columns (preferably from the first csv file in my list of files), and then join/merge the fourth column from all the remaining csv files to this first table. The final table will look something like this:

UID,Latitude,Longitude,001,007,015,023,...
1,-151.01,45.20,13121,13129,13340,12995
2,-151.13,45.16,15009,15001,14997,15103
3,-151.02,45.09,10067,11036,10074,10921
4,-151.33,45.03,14010,14005,14102,14339

I know the Pandas package would probably be an easier way to do this, but I'd rather not require third party packages (requiring the user to use easy_install, PIP, etc.) in this tool. I also realize this would be much simpler in an RDBMS, but again, I don't want that to be a requirement. So I'm only using the csv module.

I think I understand how to do this, and I'm assuming I should write the merged rows to a new csv file. I've gotten as far as pulling out the headers from the first csv file, then looping through each of the subsequent csv files to add the new column name to the header row. Where I'm coming up short is how to write only the fourth-column values from the other files alongside the rows from the first csv file. All csv files have UID columns, which should match.

import csv

def build_table(acq_date_list, mosaic_io_array, input_dir, dir_list):
    acq_year = mosaic_io_array[0][0]
    out_dir = '%s\\%s\\' % (input_dir, dir_list[1])
    out_file = '%s%s_%s.%s' % (out_dir, 'LST_final', acq_year, 'csv')
    # get first csv file in the list of files
    first_file = acq_date_list[0][1]
    # open and read the first csv file
    with open(first_file, 'rb') as first_csv:
        r1 = csv.reader(first_csv, delimiter = ',')
        header1 = next(r1)
        allrows1 = []
        row1 = next(r1)
        allrows1.append(row1)
    # open and write to the new csv
    with open(out_file, 'wb') as out_csv:
        w = csv.writer(out_csv, delimiter = ',')
        # loop through the list of remaining csv files
        for acq_date in acq_date_list[1:]: # skip the first csv file
            # open and read other csv files
            with open(acq_date[1], 'rb') as other_csv:
                rX = csv.reader(other_csv, delimiter = ',')
                headerX = next(rX)
                header_row = '%s,%s' % (header1, headerX)

                # write header and subsequent merged rows to new csv file?

Maybe after:

headerX = next(rX)

I can split the header row into a list, and pull out the fourth item? Would this also work for the remaining rows in the "other" csv files? Or is this just generally the wrong approach?

UPDATE 2/26/2016: I only got the solution by Gijs to work partially. The header columns are added iteratively, but not the rest of the values from each row. I'm still unsure how to fill in the empty cells with values from the remaining csv files.

Latitude,001,UID,Longitude,009,017,025,033,041
795670.198,13506,0,-1717516.429,,,,,
795670.198,13173,1,-1716125.286,,,,,
795670.198,13502,2,-1714734.143,,,,,

Loop through the files, keep track of which keys exist, and write all records with csv.DictWriter and csv.DictReader.

import csv

records = list()
all_keys = set()
# read every record and collect the union of all column names
for fn in ["table_1.csv", "table_2.csv"]:
    with open(fn) as f:
        reader = csv.DictReader(f)
        all_keys.update(set(reader.fieldnames))
        for r in reader:
            records.append(r)

# "wb" is the Python 2 idiom; on Python 3 use open("table_merged.csv", "w", newline="")
with open("table_merged.csv", "wb") as f:
    writer = csv.DictWriter(f, fieldnames = all_keys)
    writer.writeheader()
    for r in records:
        writer.writerow(r)

This will write an empty 'cell' for records that didn't have the column.

Using your example file as both the first and the second .csv, with the last column of the second copy renamed from 001 to 002, you would get this:

UID,Longitude,002,001,Latitude
1,45.20,,13121,-151.01
2,45.16,,15009,-151.13
3,45.09,,10067,-151.02
4,45.03,,14010,-151.33
1,45.20,13121,,-151.01
2,45.16,15009,,-151.13
3,45.09,10067,,-151.02
4,45.03,14010,,-151.33
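The blank cells in this output come from how csv.DictWriter handles a record that lacks one of the fieldnames: it writes the value of its restval parameter, which defaults to an empty string. A minimal sketch, assuming Python 3 and using two made-up records and an in-memory buffer instead of real files:

```python
import csv
import io

# Hypothetical records: the first lacks "002", the second lacks "001".
records = [
    {"UID": "1", "001": "13121"},
    {"UID": "1", "002": "13121"},
]

buf = io.StringIO()
# restval is written for any fieldname missing from a record (default "").
writer = csv.DictWriter(buf, fieldnames=["UID", "001", "002"], restval="")
writer.writeheader()
for record in records:
    writer.writerow(record)

merged_text = buf.getvalue()
print(merged_text)
```

Each input record still becomes its own output row, which is why every UID appears twice in the table above.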

If you want to keep the columns in a specific order, you will have to make all_keys a list, and then add only the columns in the new file that are not already in all_keys.

all_keys = list()

... 
        # going through a set would scramble the order, so filter the list directly
        all_keys += [k for k in reader.fieldnames if k not in all_keys]
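Note that this approach writes one output row per input record, so values from later files end up on separate rows rather than next to the matching UID. To get one row per UID, the rows can be keyed on UID while reading. The following is only a sketch under some assumptions: Python 3.7+ (for ordered dicts and newline=""), the UID in the first column, and the temperature in the fourth. The function name merge_fourth_columns and its parameters are hypothetical, not part of the original code.

```python
import csv


def merge_fourth_columns(paths, out_path, key_index=0, value_index=3):
    """Keep every column of the first file, then append the value column
    of each remaining file, matching rows on the key (UID) column."""
    with open(paths[0], newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Map UID -> full row from the first file; skips blank lines.
        # Plain dicts preserve insertion order on Python 3.7+.
        rows = {row[key_index]: list(row) for row in reader if row}

    for path in paths[1:]:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            other_header = next(reader)
            # The new column is named after this file's value-column header.
            header.append(other_header[value_index])
            values = {row[key_index]: row[value_index] for row in reader if row}
        # Append this file's value to each row; blank if the UID is absent.
        for uid, row in rows.items():
            row.append(values.get(uid, ""))

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows.values())
```

A call such as merge_fourth_columns(["001.csv", "007.csv"], "LST_final.csv"), with hypothetical file names, would keep the row order of the first file. UIDs that appear only in later files are dropped in this sketch.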

Try the pandas approach:

import pandas as pd

file_list = ['1.csv','2.csv','3.csv']

df = pd.read_csv(file_list[0])

for f in file_list[1:]:
    # use only 1-st and 4-th columns ...
    tmp = pd.read_csv(f, usecols=[0, 3])
    df = pd.merge(df, tmp, on='UID')

df.to_csv('output.csv', index=False)

print(df)

Output:

   UID  Latitude  Longitude    001    007  015
0    1   -151.01      45.20  13121  11111   11
1    2   -151.13      45.16  15009  22222   12
2    3   -151.02      45.09  10067  33333   13
3    4   -151.33      45.03  14010  44444   14

output.csv

UID,Latitude,Longitude,001,007,015
1,-151.01,45.2,13121,11111,11
2,-151.13,45.16,15009,22222,12
3,-151.02,45.09,10067,33333,13
4,-151.33,45.03,14010,44444,14


 