Converting Multiple .xlsx Files to .csv - Pandas reading only 1 column

Hello everyone, I am working on a deep learning project. The data I will use for the project consists of multiple Excel files. Since I will be using the pd.read_csv command of the Pandas library, I used VBA code that automatically converts all the Excel files to CSV format.

Here is the VBA code (xlsx to csv):

Sub WorkbooksSaveAsCsvToFolder()
    'UpdatebyExtendoffice20181031
    Dim xObjWB As Workbook
    Dim xObjWS As Worksheet
    Dim xStrEFPath As String
    Dim xStrEFFile As String
    Dim xObjFD As FileDialog
    Dim xObjSFD As FileDialog
    Dim xStrSPath As String
    Dim xStrCSVFName As String
    Dim xS As String

    Application.ScreenUpdating = False
    Application.EnableEvents = False
    Application.Calculation = xlCalculationManual
    Application.DisplayAlerts = False
    On Error Resume Next

    Set xObjFD = Application.FileDialog(msoFileDialogFolderPicker)
    xObjFD.AllowMultiSelect = False
    xObjFD.Title = "Kutools for Excel - Select a folder which contains Excel files"
    If xObjFD.Show <> -1 Then Exit Sub
    xStrEFPath = xObjFD.SelectedItems(1) & "\"

    Set xObjSFD = Application.FileDialog(msoFileDialogFolderPicker)
    xObjSFD.AllowMultiSelect = False
    xObjSFD.Title = "Kutools for Excel - Select a folder to locate CSV files"
    If xObjSFD.Show <> -1 Then Exit Sub
    xStrSPath = xObjSFD.SelectedItems(1) & "\"

    xStrEFFile = Dir(xStrEFPath & "*.xlsx*")

    Do While xStrEFFile <> ""
        xS = xStrEFPath & xStrEFFile
        Set xObjWB = Application.Workbooks.Open(xS)
        xStrCSVFName = xStrSPath & Left(xStrEFFile, InStr(1, xStrEFFile, ".") - 1) & ".csv"
        xObjWB.SaveAs Filename:=xStrCSVFName, FileFormat:=xlCSV
        xObjWB.Close savechanges:=False
        xStrEFFile = Dir
    Loop

    Application.Calculation = xlCalculationAutomatic
    Application.EnableEvents = True
    Application.ScreenUpdating = True
    Application.DisplayAlerts = True
End Sub

With this code, thousands of .xlsx files become .csv files. The problem is that although the conversion itself succeeds, pd.read_csv reads only 1 column.

It looks like this:

    0
0   PlatformData,2,0.020000,43.000000,33.000000,32...
1   PlatformData,1,0.020000,42.730087,33.000000,25...
2   PlatformData,2,0.040000,43.000000,33.000000,32...
3   PlatformData,1,0.040000,42.730141,33.000006,25...
4   PlatformData,2,0.060000,43.000000,33.000000,32...
... ...
9520    PlatformData,1,119.520000,42.931132,33.056849,...
9521    PlatformData,1,119.540000,42.931184,33.056868,...
9522    PlatformData,1,119.560000,42.931184,33.056868,...
9523    PlatformData,1,119.580000,42.931237,33.056887,...
9524    PlatformData,1,119.600000,42.931237,33.056887,...

Because the columns are not split correctly, all the data ends up merged into one column, which prevents me from training the model.

Afterwards, to understand what the problem was, I converted a single Excel file to .csv manually using the "Save As" command and read it with the pandas library; the problem disappeared.

Which looks like this:

0   1   2   3   4   5   6   7   8   9   10  11
0   PlatformData    2   0.02    43.000000   33.000000   3200.0  0.000000    0.0 0.0 0.000000    0.000000    -0.0
1   PlatformData    1   0.02    42.730087   33.000000   3050.0  60.000029   0.0 0.0 74.999931   129.903854  -0.0
2   PlatformData    2   0.04    43.000000   33.000000   3200.0  0.000000    -0.0    0.0 0.000000    0.000000    -0.0
3   PlatformData    1   0.04    42.730114   33.000064   3050.0  60.000029   0.0 0.0 74.999931   129.903854  -0.0
4   PlatformData    2   0.06    43.000000   33.000000   3200.0  0.000000    -0.0    0.0 0.000000    0.000000    -0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
57867   PlatformData    1   119.72  42.891333   33.019166   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57868   PlatformData    1   119.74  42.891333   33.019166   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57869   PlatformData    1   119.76  42.891387   33.019172   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57870   PlatformData    1   119.78  42.891387   33.019172   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57871   PlatformData    1   119.80  42.891441   33.019178   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0

As seen here, the values are split at each comma into separate columns.

I need to convert the files in bulk, using VBA or some other conversion technique, because I have so many Excel files. But as you can see, even though the files appear to be converted correctly, pandas reads them incorrectly.

I've tried converting with a bunch of different VBA codes so far. Then I tried to read the files with the read_excel command in Python and convert them with to_csv, but I encountered the same problem again (only 1 column is read).

What do I need to do to make the result look the way it did when I changed the format manually? Is there an error in the VBA code, or do I need to use another method for this operation?

Thank you for your interest. Thanks in advance for any help.

Dealing with CSV is a tricky thing (not only in Excel). "CSV" stands for "comma-separated values", and Excel takes this literally: when you use SaveAs FileFormat:=xlCSV, it puts a comma between your values. The exception is when the regional settings on your computer define a different separator; Excel then uses that separator (on my computer, for example, a semicolon).

Your Pandas seems to expect tab characters as the separator. You could try SaveAs FileFormat:=xlText or xlTextWindows. On my computer these generated tab-separated files, but I couldn't find documentation stating that this is always the case. The alternative is to use a small routine that writes the file manually; see, for example, "VBA code to save Excel sheet as tab-delimited text file".
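Since it is unclear which separator Excel will actually write on a given machine, one way to check is to look at the first line of a generated file. The following is a minimal sketch, not a definitive detector: guess_delimiter is a hypothetical helper, the candidate list is an assumption, and the temporary sample file only stands in for a file exported by the macro.

```python
import os
import tempfile


def guess_delimiter(path, candidates=(",", ";", "\t")):
    """Return the candidate that occurs most often in the file's first line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        first_line = handle.readline()
    # max() keeps the first candidate on ties, so "," wins if nothing matches
    return max(candidates, key=first_line.count)


# Hypothetical sample file standing in for one exported by the macro
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("PlatformData\t2\t0.02\t43.0\n")
    detected = guess_delimiter(sample_path)

print(repr(detected))
```

This is only a counting heuristic: it does not understand quoting, so a line whose fields are wrapped in quotes can still mislead it. Opening one of the generated files in a plain text editor gives the same information.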

However, I doubt that Pandas cannot be made to read comma-separated files. According to https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table , you should be able to define the separator character.
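To illustrate the effect of the separator setting, here is a small sketch that reads the same text twice, once with the default separator and once with an explicit sep. The two-row, semicolon-separated sample is made up for the illustration and is not taken from the asker's files.

```python
import io

import pandas as pd

# Hypothetical sample: two rows shaped like the question's data, semicolon-separated
sample = "PlatformData;2;0.02;43.0\nPlatformData;1;0.02;42.730087\n"

# The default separator is a comma, so each whole line lands in one column
wrong = pd.read_csv(io.StringIO(sample), header=None)

# Passing the separator the file really uses splits the fields
right = pd.read_csv(io.StringIO(sample), header=None, sep=";")

print(wrong.shape)
print(right.shape)
```

The first shape reports a single column and the second reports four, which is the same symptom and fix described above, assuming the separator mismatch is indeed the cause.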

I'm not sure how to change your OS separator as @FunThomas suggested; perhaps you could instead specify the delimiter used by read_csv() or, when writing, by to_csv().

Have you tried specifying a delimiter? For example:

import pandas as pd
df = pd.read_csv('Book1.csv', sep='\t')
print(df)

See more here: https://www.geeksforgeeks.org/pandas-dataframe-to-csv-file-using-tab-separator/

Note that the link above shows to_csv, but the sep parameter exists for read_csv too. See the read_csv docs.
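If you would rather do the bulk conversion in Python than in VBA, the same sep idea can be applied when writing. The sketch below is one possible approach under stated assumptions: convert_folder is a hypothetical helper, only the first sheet of each workbook is written, the sheets are assumed to have no header row (as in the question's output), and pd.read_excel needs an Excel engine such as openpyxl installed.

```python
from pathlib import Path

import pandas as pd


def convert_folder(src_dir, dst_dir, sep=","):
    """Write the first sheet of every .xlsx file in src_dir to a .csv file in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for xlsx_path in sorted(src.glob("*.xlsx")):
        # header=None: assumes the sheets carry no header row
        frame = pd.read_excel(xlsx_path, header=None)
        csv_path = dst / (xlsx_path.stem + ".csv")
        # Write with an explicit separator so read_csv can be given the same one
        frame.to_csv(csv_path, sep=sep, header=False, index=False)
        written.append(csv_path)
    return written
```

A call such as convert_folder("excel_files", "csv_files", sep="\t") (folder names are placeholders) would then be paired with pd.read_csv(path, sep="\t", header=None). The question reports that a read_excel/to_csv round trip still gave one column, so this is only worth trying with the same explicit sep on both the writing and the reading side.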
