How to shuffle and split a large csv with headers?

Question

I am trying to shuffle the lines of a large csv file in Python and then split it into multiple csv files (with a set number of rows in each file), but I can't find a way to shuffle the large dataset and keep the header in each csv. It would help a lot if someone knows how to do this.

Here's the code I found useful for splitting a csv file:

number_of_rows = 100

def write_splitted_csvs(part, lines):
    with open('mycsvhere.csv'+ str(part) +'.csv', 'w') as f_out:
        f_out.write(header)
        f_out.writelines(lines)

with open("mycsvhere.csv", "r") as f:
    count = 0
    header = f.readline()
    lines = []
    for line in f:
        count += 1
        lines.append(line)
        if count % number_of_rows == 0:
            write_splitted_csvs(count // number_of_rows, lines)
            lines = []
    
    if len(lines) > 0:
        write_splitted_csvs((count // number_of_rows) + 1, lines)

If anyone knows how to shuffle the rows across all of these split csv files, that would help a lot! Thank you very much.

Answer 1

I would suggest using Pandas if possible.

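Shuffle the rows and reset the index. Note that sample() returns a new DataFrame rather than working in place, so the result has to be assigned back:

import pandas as pd
# Read the original, unsplit file
df = pd.read_csv('mycsvhere.csv')
# frac=1 samples every row, which amounts to a full shuffle
df = df.sample(frac=1).reset_index(drop=True)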

Then you can split it into a list of smaller dataframes:

number_of_rows = 100
sub_dfs = [df[i:i + number_of_rows] for i in range(0, df.shape[0], number_of_rows)]

Then, if you want to save the csvs locally (to_csv writes the header row to every file by default):

for idx, sub_df in enumerate(sub_dfs):
    sub_df.to_csv(f'csv_{idx}.csv', index=False)
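
Putting the three steps together, a minimal sketch could look like the following. The function name shuffle_and_split_csv, the out_dir and seed parameters, and the csv_{idx}.csv naming are illustrative assumptions, and the whole file is assumed to fit in memory:

```python
import os

import pandas as pd


def shuffle_and_split_csv(src_path, out_dir, rows_per_file=100, seed=None):
    """Shuffle the rows of src_path and write them out in chunks.

    Hypothetical helper: returns the list of files written. Every file
    starts with the header row because to_csv writes column names by default.
    """
    df = pd.read_csv(src_path)
    # sample(frac=1) returns a shuffled copy, so assign it back.
    df = df.sample(frac=1, random_state=seed).reset_index(drop=True)

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for idx, start in enumerate(range(0, len(df), rows_per_file)):
        path = os.path.join(out_dir, f"csv_{idx}.csv")
        # The last chunk may hold fewer than rows_per_file rows.
        df.iloc[start:start + rows_per_file].to_csv(path, index=False)
        paths.append(path)
    return paths
```

Calling shuffle_and_split_csv('mycsvhere.csv', 'parts', rows_per_file=100) would then write parts/csv_0.csv, parts/csv_1.csv, and so on. Passing a seed makes the shuffle reproducible.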

Answer 2

There are 3 needs here:

Shuffle your dataset
Split your dataset
Formatting

For the first 2 steps, there are some nice tools in Sklearn. You can try the stratified shuffle splitter (StratifiedShuffleSplit). You did not mention stratification, but you may need it without knowing it yet ;)
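
As a rough sketch of the stratified idea: StratifiedShuffleSplit draws independent train/test splits, so its test sets can overlap between iterations. To get disjoint pieces, the example below swaps in StratifiedKFold with shuffle=True instead. The helper name stratified_pieces and the label column are assumptions, since the question does not say whether the data has a class column at all:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def stratified_pieces(df, label_col, n_pieces, seed=0):
    """Split df into n_pieces disjoint DataFrames with similar label mix.

    Hypothetical helper; label_col is an assumed class column.
    """
    skf = StratifiedKFold(n_splits=n_pieces, shuffle=True, random_state=seed)
    pieces = []
    # split() yields (train_idx, test_idx); the test folds are disjoint
    # and together cover every row exactly once.
    for _, test_idx in skf.split(df, df[label_col]):
        piece = df.iloc[test_idx]
        # Fold indices come back sorted, so shuffle the rows inside each piece.
        piece = piece.sample(frac=1, random_state=seed).reset_index(drop=True)
        pieces.append(piece)
    return pieces
```

Here the number of pieces is fixed rather than the number of rows per file; something like math.ceil(len(df) / number_of_rows) would give roughly number_of_rows rows per piece. Each piece can then be saved with to_csv(..., index=False).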

The last part, formatting, is all up to you. You can check the pandas to_csv() function, where you can specify your headers; you can (and need to) specify your headers in the data object (DataFrame) as well. Nothing hard here: just spend a bit of time specifying what you want, and it is easy to implement :)

Side comment: depending on what 'big' means for you, you may want to drop pandas, since pandas is not 'good' on big data.
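
If the file is too big for pandas, one possible standard-library approach is to keep only the byte offset of each line in memory, shuffle the offsets, and copy the lines out in that order. This is a sketch, not a tuned implementation: the function name and the part_{idx}.csv naming are assumptions, and it assumes one record per physical line (no newlines inside quoted fields):

```python
import os
import random


def shuffle_split_large_csv(src_path, out_dir, rows_per_file=100, seed=None):
    """Shuffle and split a CSV while holding only line offsets in memory.

    Hypothetical helper; assumes every record sits on a single line.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    with open(src_path, "rb") as src:
        header = src.readline()
        # Record the byte offset at which each data line starts.
        offsets = []
        while True:
            pos = src.tell()
            line = src.readline()
            if not line:
                break
            if line.strip():  # skip blank lines
                offsets.append(pos)

        random.Random(seed).shuffle(offsets)

        for idx, start in enumerate(range(0, len(offsets), rows_per_file)):
            path = os.path.join(out_dir, f"part_{idx}.csv")
            with open(path, "wb") as out:
                out.write(header)
                for pos in offsets[start:start + rows_per_file]:
                    src.seek(pos)
                    line = src.readline()
                    # The last line of the source may lack a newline.
                    if not line.endswith(b"\n"):
                        line += b"\n"
                    out.write(line)
            paths.append(path)
    return paths
```

The trade-off is many random seeks in the source file, which can be slow on spinning disks, in exchange for a small memory footprint (one integer per row).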

Answer 1: score 2, accepted, 2022-02-09 16:58:11
Answer 2: score 1, 2022-02-09 16:58:23