
Pandas: save to multiple small CSV files

I'm reading CSV data from a stream (for example, cat /tmp/rawfile). The demo code below works.

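import io
import subprocess
import pandas

# Read the entire stream into memory, then parse it as CSV.
cmd = ('cat', '/tmp/csvfile')
process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
csv = io.StringIO(process.stdout.read().decode())
data = pandas.read_csv(csv, index_col=0)
csv.close()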
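For what it's worth, the snippet above buffers the whole stream in memory before parsing. Here is a minimal sketch of handing the pipe to pandas directly instead. The command is a hypothetical stand-in for cat /tmp/csvfile (a tiny Python one-liner that prints three CSV lines), and the column names id and value are made up for illustration.

```python
import subprocess
import sys

import pandas

# Hypothetical stand-in for ('cat', '/tmp/csvfile'): emits a tiny CSV on stdout.
cmd = (sys.executable, "-c", "print('id,value'); print('1,a'); print('2,b')")

# read_csv accepts a file-like object, so the pipe can be passed as-is
# without first copying its contents into a StringIO.
with subprocess.Popen(cmd, stdout=subprocess.PIPE) as process:
    data = pandas.read_csv(process.stdout, index_col=0)

print(data)
```

This still builds one DataFrame for the whole input, so it only removes the intermediate string copy; it does not by itself bound memory use.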

This CSV contains 10K rows. I want to read it with Pandas and save the first 1k rows to one CSV file, then the next 1k rows to the next CSV file, and so on. Is this possible?

Some background:

I have some CSV files that are really huge (billions of rows). I used the split command, but a few rows contain \n (newline) characters inside a field, so when I split by line count, the columns after the \n end up in the next row.

Example:

Row 1:
"col1" | "col2" | "This is
my first row"
Row 2:
"col1" | "col2" | "This is my second row"

In row 1, these two lines belong to a single column. But if I split based on lines, it will be split into two different rows.
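To make the problem concrete, here is a small sketch comparing a line-based split with a quote-aware parse. The sample string is an assumption modelled on the example above, with "|" as the delimiter and no spaces around it.

```python
import csv
import io

# Assumed sample data: the third field of row 1 has a newline inside quotes.
raw = (
    '"col1"|"col2"|"This is\nmy first row"\n'
    '"col1"|"col2"|"This is my second row"\n'
)

# Splitting on line breaks treats the embedded newline as a row boundary.
lines = raw.splitlines()

# A CSV parser that honours quoting keeps the field in one piece.
records = list(csv.reader(io.StringIO(raw, newline=""), delimiter="|"))

print(len(lines), "physical lines")
print(len(records), "CSV records")
```

The line-based split cuts the first record in two, while the parser keeps the quoted field intact. That is why any splitting has to be done by a CSV-aware reader rather than by counting lines.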

That's why I want to use Pandas to read the streaming data (stdin) in chunks of 100 rows and write each chunk to a CSV file, then read the next 100 rows and append them to the same CSV, because I want 1k rows per CSV.

Any suggestions or example code for this logic?

Update:

My intention is to put 1k rows in each CSV file. The reason I read 100 rows per DataFrame is to save memory. Read 100 rows into a DataFrame and flush them to a file, then the next 100, repeating until 1000 rows are written (10 times); then repeat the whole process for the next 1000 rows with a different CSV file.
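One possible way to sketch the read-100, flush, rotate-at-1000 loop described above is with the chunksize parameter of pandas.read_csv. The function name split_stream, its parameters, and the part_NNNNN.csv naming scheme are all hypothetical, and the sketch assumes rows_per_file is a multiple of chunk_rows.

```python
import pandas as pd


def split_stream(stream, rows_per_file=1000, chunk_rows=100, prefix="part"):
    """Copy rows from a CSV stream into files of rows_per_file rows each.

    Assumes rows_per_file is a multiple of chunk_rows (hypothetical helper).
    """
    paths = []
    rows_in_file = 0
    # With chunksize set, read_csv yields DataFrames of at most chunk_rows rows,
    # so only one small chunk is held in memory at a time.
    for chunk in pd.read_csv(stream, chunksize=chunk_rows):
        # Start a new output file on the first chunk and whenever one fills up.
        if not paths or rows_in_file >= rows_per_file:
            paths.append(f"{prefix}_{len(paths):05d}.csv")
            rows_in_file = 0
        new_file = rows_in_file == 0
        # Write the header once per file, then append the remaining chunks.
        chunk.to_csv(paths[-1], mode="w" if new_file else "a",
                     header=new_file, index=False)
        rows_in_file += len(chunk)
    return paths
```

Calling split_stream(sys.stdin) after import sys would read from standard input. Because the parser handles quoting, a field with an embedded newline stays in a single row and is quoted again when written out.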

Not sure if this is what you want, but try:

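import pandas as pd

# Sample data: 1000 rows with a single column.
# (DataFrame.append was removed in pandas 2.0, so build the frame directly.)
df = pd.DataFrame(["x"] * 1000)

# Write each block of 1000 rows to its own file.
for i in range(0, len(df), 1000):
    df.iloc[i:i + 1000, :].to_csv("output_{0}_{1}.csv".format(i, i + 1000), index=False)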
