Converting Multiple .xlsx Files to .csv - Pandas reading only 1 column
Hello everyone, I am working on a deep learning project. The data I will use for the project consists of multiple Excel files. Since I will be using the pd.read_csv command of the Pandas library, I used VBA code that automatically converts all Excel files to CSV format.
Here is the VBA code (xlsx to csv):
Sub WorkbooksSaveAsCsvToFolder()
    'UpdatebyExtendoffice20181031
    Dim xObjWB As Workbook
    Dim xObjWS As Worksheet
    Dim xStrEFPath As String
    Dim xStrEFFile As String
    Dim xObjFD As FileDialog
    Dim xObjSFD As FileDialog
    Dim xStrSPath As String
    Dim xStrCSVFName As String
    Dim xS As String
    Application.ScreenUpdating = False
    Application.EnableEvents = False
    Application.Calculation = xlCalculationManual
    Application.DisplayAlerts = False
    On Error Resume Next
    Set xObjFD = Application.FileDialog(msoFileDialogFolderPicker)
    xObjFD.AllowMultiSelect = False
    xObjFD.Title = "Kutools for Excel - Select a folder which contains Excel files"
    If xObjFD.Show <> -1 Then Exit Sub
    xStrEFPath = xObjFD.SelectedItems(1) & "\"
    Set xObjSFD = Application.FileDialog(msoFileDialogFolderPicker)
    xObjSFD.AllowMultiSelect = False
    xObjSFD.Title = "Kutools for Excel - Select a folder to locate CSV files"
    If xObjSFD.Show <> -1 Then Exit Sub
    xStrSPath = xObjSFD.SelectedItems(1) & "\"
    xStrEFFile = Dir(xStrEFPath & "*.xlsx*")
    Do While xStrEFFile <> ""
        xS = xStrEFPath & xStrEFFile
        Set xObjWB = Application.Workbooks.Open(xS)
        xStrCSVFName = xStrSPath & Left(xStrEFFile, InStr(1, xStrEFFile, ".") - 1) & ".csv"
        xObjWB.SaveAs Filename:=xStrCSVFName, FileFormat:=xlCSV
        xObjWB.Close savechanges:=False
        xStrEFFile = Dir
    Loop
    Application.Calculation = xlCalculationAutomatic
    Application.EnableEvents = True
    Application.ScreenUpdating = True
    Application.DisplayAlerts = True
End Sub
With this code, thousands of .xlsx files become .csv files. The problem is that although the conversion itself succeeds, pd.read_csv reads only 1 column.
This is what the result looks like:
0
0 PlatformData,2,0.020000,43.000000,33.000000,32...
1 PlatformData,1,0.020000,42.730087,33.000000,25...
2 PlatformData,2,0.040000,43.000000,33.000000,32...
3 PlatformData,1,0.040000,42.730141,33.000006,25...
4 PlatformData,2,0.060000,43.000000,33.000000,32...
... ...
9520 PlatformData,1,119.520000,42.931132,33.056849,...
9521 PlatformData,1,119.540000,42.931184,33.056868,...
9522 PlatformData,1,119.560000,42.931184,33.056868,...
9523 PlatformData,1,119.580000,42.931237,33.056887,...
9524 PlatformData,1,119.600000,42.931237,33.056887,...
Because the columns are not separated correctly, all values end up merged into one column, which prevents me from training the model.
Afterwards, to understand what the problem was, I converted a single Excel file to .csv manually using the "Save as" command and read it with pandas; the problem disappeared.
Which looks like this:
0 1 2 3 4 5 6 7 8 9 10 11
0 PlatformData 2 0.02 43.000000 33.000000 3200.0 0.000000 0.0 0.0 0.000000 0.000000 -0.0
1 PlatformData 1 0.02 42.730087 33.000000 3050.0 60.000029 0.0 0.0 74.999931 129.903854 -0.0
2 PlatformData 2 0.04 43.000000 33.000000 3200.0 0.000000 -0.0 0.0 0.000000 0.000000 -0.0
3 PlatformData 1 0.04 42.730114 33.000064 3050.0 60.000029 0.0 0.0 74.999931 129.903854 -0.0
4 PlatformData 2 0.06 43.000000 33.000000 3200.0 0.000000 -0.0 0.0 0.000000 0.000000 -0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
57867 PlatformData 1 119.72 42.891333 33.019166 2550.0 5.000000 0.0 0.0 149.429214 13.073360 -0.0
57868 PlatformData 1 119.74 42.891333 33.019166 2550.0 5.000000 0.0 0.0 149.429214 13.073360 -0.0
57869 PlatformData 1 119.76 42.891387 33.019172 2550.0 5.000000 0.0 0.0 149.429214 13.073360 -0.0
57870 PlatformData 1 119.78 42.891387 33.019172 2550.0 5.000000 0.0 0.0 149.429214 13.073360 -0.0
57871 PlatformData 1 119.80 42.891441 33.019178 2550.0 5.000000 0.0 0.0 149.429214 13.073360 -0.0
As seen here, the data is split at each comma into separate columns.
I need to convert multiple files using VBA or some other conversion technique because I have so many Excel files. But as you can see, even though the files are converted without errors, pandas reads them incorrectly.
I've tried converting with a bunch of different VBA codes so far. Then I tried reading the files with read_excel in Python and converting them with to_csv, but I ran into the same problem again (only 1 column is read).
What do I need to do to get the same result as when I changed the format manually? Is there an error in the VBA code, or do I need to use another method for this operation?
Thank you for your interest. Thanks in advance for any help.
Dealing with CSV is a tricky thing (not only in Excel). "CSV" stands for "comma separated values", and Excel takes this literally: when you use SaveAs FileFormat:=xlCSV, it will put a comma between your values. The exception is when the local settings on your computer define a different separator; then Excel uses that separator (on my computer, for example, a semicolon).
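To check which separator a converted file actually contains before handing it to pandas, something like the following may help. It is only a sketch: the helper name detect_delimiter, the candidate list, and the sample text are made up for illustration, and csv.Sniffer is a heuristic that can guess wrong on unusual files.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter used in a chunk of CSV text.

    Only the characters in `candidates` are considered.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical sample resembling a semicolon-separated export
sample = (
    "PlatformData;2;0.02;43.0;33.0\n"
    "PlatformData;1;0.02;42.73;33.0\n"
)
print(repr(detect_delimiter(sample)))
```

For a real file you would read the first few kilobytes, pass them to detect_delimiter, and give the result to pd.read_csv as sep. pandas can also do the sniffing itself if you call pd.read_csv(path, sep=None, engine="python").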
Your Pandas seems to expect tab characters as the separator. You could try SaveAs FileFormat:=xlText or xlTextWindows - on my computer these generated tab separated files, but I couldn't find any documentation stating that this is always the case. The alternative is to use a small routine that writes the file manually; see for example "VBA code to save Excel sheet as tab-delimited text file".
However, I would be surprised if Pandas could not be made to read comma separated files. According to https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table , you should be able to define the separation character.
I'm not sure how to change your OS separator like @FunThomas suggested; perhaps you could instead specify the delimiter used for read_csv() or when writing out with to_csv().
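A small round trip can show how the separator on the writing side and the reading side interact. This is a sketch with invented data and an in-memory buffer instead of real files; the semicolon is just an example of a non-comma separator.

```python
import io

import pandas as pd

# Invented rows shaped like the data in the question
df = pd.DataFrame(
    [
        ["PlatformData", 2, 0.02, 43.0],
        ["PlatformData", 1, 0.02, 42.730087],
    ]
)

# Write with an explicit separator (here a semicolon), no header or index
buffer = io.StringIO()
df.to_csv(buffer, sep=";", header=False, index=False)

# Reading with the default sep="," finds no commas, so each line is one field
buffer.seek(0)
wrong = pd.read_csv(buffer, header=None)

# Reading with the matching separator restores the columns
buffer.seek(0)
right = pd.read_csv(buffer, sep=";", header=None)

print(wrong.shape, right.shape)
```

The point is only that sep has to match on both sides; which character you pick matters less than using the same one for to_csv and read_csv.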
Have you tried specifying a delimiter? For example:
import pandas as pd
df = pd.read_csv('Book1.csv', sep='\t')
print(df)
See more here: https://www.geeksforgeeks.org/pandas-dataframe-to-csv-file-using-tab-separator/
Note that the link above shows to_csv, but the sep parameter exists for read_csv too. See docs here.
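If changing sep makes no difference, one possible explanation (an assumption, not something confirmed in this thread) is that each row of the workbook already holds the whole comma-joined record in a single cell. Excel and to_csv would then quote that cell, and pandas would correctly read it back as one column, which would also fit the report that read_excel followed by to_csv gave the same result. In that case the column can be split after loading. The helper name split_single_column and the sample frame below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a frame that was read as one column of comma-joined text
single = pd.DataFrame(
    {
        0: [
            "PlatformData,2,0.020000,43.000000,33.000000",
            "PlatformData,1,0.020000,42.730087,33.000000",
        ]
    }
)


def split_single_column(frame: pd.DataFrame, sep: str = ",") -> pd.DataFrame:
    """Split the first column of `frame` on `sep` into separate columns."""
    parts = frame.iloc[:, 0].str.split(sep, expand=True)
    # Convert columns that are fully numeric; leave text columns as they are
    for col in parts.columns:
        converted = pd.to_numeric(parts[col], errors="coerce")
        if converted.notna().all():
            parts[col] = converted
    return parts


wide = split_single_column(single)
print(wide)
```

This only works if the separator never appears inside a value; if it can, a real CSV parser on the raw text is the safer route.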