
zipfile and pandas failure mid-loop

I'm writing this on my phone, so a full code example is out of the question at the moment, but I need some help.

I'm working on parsing a set of .csv files from a zipped infile, pulling out specific columns from each file, generating a new .csv with the chosen columns, and then exporting the new dataframes to a zipped outfile.

I am doing this through a series of loops, but can't get beyond 78% success on the parse process, and 73% on the parse combined with the compression process.

Somewhere along the way either zipfile.ZipFile is breaking, or pandas' DataFrame.to_csv is, and I'm not sure why. I've been trying to figure it out for two weeks and I'm finally breaking down to ask for assistance.

Brief code snippets for now:

Export function:

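 import os
 import zipfile

 def export(new_filename):
     # import_file_location and outfile_name are defined elsewhere (omitted here)
     os.chdir(import_file_location)
     try:
         with zipfile.ZipFile(outfile_name, 'a', compression=zipfile.ZIP_DEFLATED, allowZip64=True) as outfile:
             try:
                 outfile.write(new_filename)
                 # random errors at runtime saying the writing handle is still open... Not sure why.
             except Exception as exc:
                 # alert of failure at this step. I have tried NameError
                 # and ValueError exceptions, but they don't help.
                 print(f"Failed to write {new_filename}: {exc}")
     except Exception as exc:
         # alert of failure to open the archive
         print(f"Failed to open {outfile_name}: {exc}")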

Pandas function:

 import pandas as pd

 def infile_parser(filename, new_filename):
     # excluding code beyond making the dataframe and file generation
     # (data and useful_columns are built in the omitted code)
     df = pd.DataFrame(data, columns=useful_columns)
     df.to_csv(new_filename, index=False)

Thank you in advance. I can add more context if requested.

I figured out where it was breaking. Sorry, I forgot to update this question with the solution.

The issue was in the data of some of the files. I added automated bad-file checking based on the length of the dataframe. Basically, the files causing issues only had 1 or 2 rows in column A, whereas the good files had full tables of many rows. Pandas was assigning the string in the first cell to the header and basically breaking from there, since the columns being used in the other files did not exist in the bad files.
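To make the failure mode concrete, here is a small sketch of what such a file looks like to pandas. It assumes the files are read with pd.read_csv; the file contents and column names are made up for illustration and are not from my real data.

```python
import io

import pandas as pd

# Hypothetical file contents, for illustration only.
good_csv = "id,value,note\n1,10,a\n2,20,b\n3,30,c\n"
bad_csv = "Report unavailable\nContact the administrator\n"

good = pd.read_csv(io.StringIO(good_csv))
bad = pd.read_csv(io.StringIO(bad_csv))

# The first line of the bad file becomes its one and only column name.
print(list(good.columns))
print(list(bad.columns), len(bad))

useful_columns = ["id", "value"]  # assumed names of the wanted columns
selection_failed = False
try:
    bad[useful_columns]
except KeyError as exc:
    # None of the wanted columns exist in the bad file.
    selection_failed = True
    print("column selection failed:", exc)
```

Depending on how the dataframe is built, the symptom may differ (an exception versus columns full of NaN), but the root cause is the same: the expected header is simply not there.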

Pre-parse file verification / data checking, which omits the bad files from the process, solved all issues.
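A rough sketch of that kind of check is below. It is not my exact code: the names is_usable and parse_zip, the min_rows threshold, and reading and writing the archives directly (instead of going through os.chdir and temporary files) are all assumptions made for the example.

```python
import io
import zipfile

import pandas as pd


def is_usable(df, required_columns, min_rows=3):
    """Return True if df has every required column and at least min_rows rows."""
    has_columns = set(required_columns).issubset(df.columns)
    return has_columns and len(df) >= min_rows


def parse_zip(in_zip, out_zip, required_columns, min_rows=3):
    """Copy the required columns of each usable CSV from in_zip into out_zip.

    Returns the names of the files that were skipped as bad.
    """
    skipped = []
    with zipfile.ZipFile(in_zip) as src, \
            zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with src.open(name) as handle:
                try:
                    df = pd.read_csv(handle)
                except (pd.errors.EmptyDataError, pd.errors.ParserError):
                    skipped.append(name)
                    continue
            if not is_usable(df, required_columns, min_rows):
                skipped.append(name)
                continue
            # Write the trimmed CSV straight into the output archive.
            dst.writestr(name, df[required_columns].to_csv(index=False))
    return skipped
```

Calling something like parse_zip("infile.zip", "outfile.zip", useful_columns) would then return the list of skipped files, which makes it easy to see which inputs were bad instead of guessing from a success percentage.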
