
Combine multiple csv files and add filename of original file to combined output

I am trying to combine multiple csv files from a directory into a single csv file. All the headers are the same in each file. But when I look at the combined file, it is hard to tell which file the data actually came from. I have also fixed the columns that I want for my purpose. Is there a way to uniquely identify them using the code below?

from pathlib import Path
import csv

p = Path(r'E:\Neurogen\Merging_test_data')

file_list = p.glob("*.csv")

desired_columns = ['Chr', 'Start', 'End', 'Ref', 'Alt', 'Func.refGene', 'Gene.refGene', 'GeneDetail.refGene', 'ExonicFunc.refGene', 'AAChange.refGene', 'Xref.refGene', 'cytoBand', 'cosmic70', 'avsnp147', 'ExAC_ALL', 'ExAC_AFR', 'ExAC_AMR', 'ExAC_EAS', 'ExAC_FIN', 'ExAC_NFE', 'ExAC_OTH', 'ExAC_SAS', 'CLINSIG', 'CLNDBN', 'CLNACC', 'CLNDSDB', 'CLNDSDBID', '1000g2015aug_all', 'SIFT_score', 'SIFT_pred', 'Polyphen2_HDIV_score', 'Polyphen2_HDIV_pred', 'Polyphen2_HVAR_score', 'Polyphen2_HVAR_pred', 'LRT_score', 'LRT_pred', 'MutationTaster_score', 'MutationTaster_pred', 'MutationAssessor_score', 'MutationAssessor_pred', 'FATHMM_score', 'FATHMM_pred', 'PROVEAN_score', 'PROVEAN_pred', 'VEST3_score', 'CADD_raw', 'CADD_phred', 'DANN_score', 'fathmm-MKL_coding_score', 'fathmm-MKL_coding_pred', 'MetaSVM_score', 'MetaSVM_pred', 'MetaLR_score', 'MetaLR_pred', 'integrated_fitCons_score', 'integrated_confidence_value', 'GERP++_RS', 'phyloP7way_vertebrate', 'phyloP20way_mammalian', 'phastCons7way_vertebrate', 'phastCons20way_mammalian', 'SiPhy_29way_logOdds', 'Otherinfo']
desired_rows = []

for csv_file in file_list:
    with open(csv_file, 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            desired_rows.append({c: row[c] for c in desired_columns})

with open('merged.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=desired_columns)
    writer.writeheader()
    writer.writerows(desired_rows)
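
A minimal sketch of one possible adjustment to the loop above is to record csv_file.name next to each row while reading; the extra column name Source_File is only an illustrative assumption, not part of the original code. The answer below demonstrates the same idea in more detail:

from pathlib import Path
import csv

p = Path(r'E:\Neurogen\Merging_test_data')
desired_columns = ['Chr', 'Start', 'End']  # shortened here; use the full list from above

desired_rows = []
for csv_file in p.glob("*.csv"):
    with open(csv_file, 'r') as f:
        for row in csv.DictReader(f):
            filtered = {c: row[c] for c in desired_columns}
            filtered['Source_File'] = csv_file.name  # tag each row with its source file (illustrative column name)
            desired_rows.append(filtered)

with open('merged.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=desired_columns + ['Source_File'])
    writer.writeheader()
    writer.writerows(desired_rows)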

Since you did not provide any sample data, I generated some arbitrary files to show the general concept:

a.txt:

col_1;col_2;col_3
1;2;3
4;5;6
7;8;9

b.txt:

col_1;col_2;col_3
10;20;30
40;50;60
70;80;90

Assuming you want to filter for the columns col_1 and col_3, a very basic approach based on the builtin csv module could look like this:

import csv
from pathlib import Path


DIRECTORY = Path(__file__).parent
FILE_SUFFIX = '*.txt'

DESIRED_COLUMNS = ['col_1', 'col_3']


files = sorted(
    DIRECTORY.glob(FILE_SUFFIX),
    key=lambda x: x.name,
    )

filtered = []


for f in files:
    reader = csv.DictReader(f.open(), delimiter=';')
    for row in reader:
        d = {k: v for k, v in row.items() if k in DESIRED_COLUMNS}
        d['from_file'] = f.name
        filtered.append(d)

print(filtered)
# filtered is a list of dicts and can be written to file with csv.DictWriter

The snippet above prints:

[{'col_1': '1', 'col_3': '3', 'from_file': 'a.txt'}, {'col_1': '4', 'col_3': '6', 'from_file': 'a.txt'}, {'col_1': '7', 'col_3': '9', 'from_file': 'a.txt'}, {'col_1': '10', 'col_3': '30', 'from_file': 'b.txt'}, {'col_1': '40', 'col_3': '60', 'from_file': 'b.txt'}, {'col_1': '70', 'col_3': '90', 'from_file': 'b.txt'}]
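
To complete the write-out step mentioned in the last comment, a short sketch could look as follows; it continues from the snippet above (filtered and DESIRED_COLUMNS are already defined), and the output file name merged.txt is only an example:

import csv

# continuing from the snippet above: `filtered` and DESIRED_COLUMNS are already defined
OUT_COLUMNS = DESIRED_COLUMNS + ['from_file']

with open('merged.txt', 'w', newline='') as out_file:  # output name is an arbitrary example
    writer = csv.DictWriter(out_file, fieldnames=OUT_COLUMNS, delimiter=';')
    writer.writeheader()
    writer.writerows(filtered)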

A more elegant solution could be based on pandas:

import pandas as pd
from pathlib import Path


DIRECTORY = Path(__file__).parent
FILE_SUFFIX = '*.txt'

DESIRED_COLUMNS = ['col_1', 'col_3']


files = sorted(
    DIRECTORY.glob(FILE_SUFFIX),
    key=lambda x: x.name,
    )

filtered = []


for f in files:
    df = pd.read_csv(
        f,
        delimiter=';',
        usecols=DESIRED_COLUMNS,
    )
    df['from_file'] = f.name
    filtered.append(df)

# print(filtered)

concated = pd.concat(filtered, ignore_index=True)
print(concated)
# concated is a pandas.DataFrame. Use `concated.to_csv()` to write it to file

The pandas approach results in:

   col_1  col_3 from_file
0      1      3     a.txt
1      4      6     a.txt
2      7      9     a.txt
3     10     30     b.txt
4     40     60     b.txt
5     70     90     b.txt
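
As the comment in the pandas snippet notes, concated is a regular pandas.DataFrame, so writing the merged result to disk is a single call; the output name merged.txt is only an example:

# continuing from the pandas example above
concated.to_csv('merged.txt', sep=';', index=False)  # index=False drops the 0..5 row index shown above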
