
Split CSV into multiple files based on column value

I have a poorly structured CSV file named file.csv, and I want to split it into multiple CSV files using Python.

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

The new files need to be separated based on everything between the Family rows, so for example:

file1.csv

|A|B|C|
|Continent||1|
|Family|44950|file1|
|Species|44950|12|
|Habitat||4|
|Species|44950|22|
|Condition|Tue Jan 24 00:00:00 UTC 2023|4|

file2.csv

|A|B|C|
|Continent||1|
|Family|Fish|file2|
|Species|Bass|8|
|Species|Trout|2|
|Habitat|River|3|

What's the best way of achieving this when the number of rows between appearances of Species is not consistent?

If your file really looks like that ;) then you could use groupby from the standard library module itertools:

from itertools import groupby

def key(line): return line.startswith("|Family|")

family_line, file_no = None, 0
with open("file.csv", "r") as fin:
    for is_family_line, lines in groupby(fin, key=key):
        if is_family_line:
            family_line = list(lines).pop()
        elif family_line is None:
            header = "".join(lines)
        else:
            file_no += 1
            with open(f"file{file_no}.csv", "w") as fout:
                fout.write(header + family_line)
                for line in lines:
                    fout.write(line)
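
To see why this works, it may help to look at what groupby yields for a few rows. The following is a minimal sketch on an in-memory list copied from the question's table (the names `lines`, `is_family` and `runs` are only illustrative): groupby merges consecutive items that share a key, so the input becomes alternating runs of non-Family and Family lines.

```python
from itertools import groupby

# Sample rows copied from the question's table (shortened).
lines = [
    "|A|B|C|\n",
    "|Continent||1|\n",
    "|Family|44950|file1|\n",
    "|Species|44950|12|\n",
    "|Habitat||4|\n",
    "|Family|Fish|file2|\n",
    "|Species|Bass|8|\n",
]

def is_family(line):
    return line.startswith("|Family|")

# groupby only merges *consecutive* items with the same key, so each run is
# either a stretch of non-Family lines (False) or of Family lines (True).
runs = [(flag, len(list(group))) for flag, group in groupby(lines, key=is_family)]
print(runs)
```

The first False run is the header, each True run is a Family row, and every later False run is the body of one output file.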

A Pandas solution would be:

import pandas as pd

df = pd.read_csv("file.csv", header=None, delimiter="|").fillna("")
blocks = df.iloc[:, 1].eq("Family").cumsum()
header_df = df[blocks.eq(0)]
for no, sdf in df.groupby(blocks):
    if no > 0:
        sdf = pd.concat([header_df, sdf])
        sdf.to_csv(f"file{no}.csv", index=False, header=False, sep="|")
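
The key step above is the cumulative sum over the "is this a Family row" mask, which numbers the blocks. As a rough illustration, here is the same labeling idea in plain Python, using itertools.accumulate as a stand-in for the Pandas cumsum (the list `col_a` is column A of the question's sample, typed in by hand):

```python
from itertools import accumulate

# Column A of the question's sample, one entry per row (header row included).
col_a = ["A", "Continent", "Family", "Species", "Habitat", "Species",
         "Condition", "Family", "Species", "Species", "Habitat"]

# Each Family row contributes 1, every other row 0, so the running total
# goes up by one exactly where a new block starts.
block_ids = list(accumulate(int(value == "Family") for value in col_a))
print(block_ids)
```

Rows labeled 0 come before the first Family row (the shared header), and every other label identifies one output file.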
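
A shorter Pandas variant reads the first row as column names, drops the rows before the first Family row, and names each file after column C of its Family row:

import pandas as pd

df = pd.read_csv('file.csv', delimiter='|', dtype=str)[['A', 'B', 'C']]
# There is no "Family" column, so number the blocks by counting Family rows in column A.
blocks = df['A'].eq('Family').cumsum()
for no, group in df.groupby(blocks):
    if no > 0:  # block 0 holds the rows before the first Family row
        # Name each file after column C of its Family row (file1, file2, ...).
        group.to_csv(group['C'].iloc[0] + '.csv', index=False, sep='|')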

Here is a pure Python working method:

# Read file
with open('file.csv', 'r') as file:
    text = file.read()

# Split using |Family|
splitted_text = text.split("|Family|")

# Remove unwanted content before first |Family|
splitted_text = splitted_text[1:]

# Add |Family| back to each part
splitted_text = ['|Family|' + item for item in splitted_text]

# Write files
for i, content in enumerate(splitted_text, start=1):  # number files from 1: file1.csv, file2.csv, ...
    with open('file{}.csv'.format(i), 'w') as file:
        file.write(content)
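
Note that the method above discards everything before the first |Family|, whereas the expected files in the question repeat those rows. A possible line-based sketch that keeps them is shown below; the function name `split_by_family` and the in-memory `sample` are hypothetical and only serve the illustration.

```python
def split_by_family(text, marker="|Family|"):
    """Return one string per Family block, each prefixed with the preamble."""
    preamble, blocks = [], []
    for line in text.splitlines(keepends=True):
        if line.startswith(marker):
            blocks.append([line])      # a Family row opens a new block
        elif blocks:
            blocks[-1].append(line)    # row belongs to the current block
        else:
            preamble.append(line)      # rows before the first Family row
    return ["".join(preamble + block) for block in blocks]

# Shortened sample based on the question's table.
sample = (
    "|A|B|C|\n"
    "|Continent||1|\n"
    "|Family|44950|file1|\n"
    "|Species|44950|12|\n"
    "|Family|Fish|file2|\n"
    "|Species|Bass|8|\n"
)
parts = split_by_family(sample)
print(parts[1])
```

Each element of `parts` can then be written to its own file, for example by looping over `enumerate(parts, start=1)`. Matching only at the start of a line also avoids splitting on a |Family| that happens to appear in the middle of a row.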


 