Pandas importing column data in incorrect order into data frame

  • "xls" is the variable representing the Excel file
  • "main" is a list of worksheets in the workbook "xls" to concatenate data from
  • All columns are unmerged and borders are just for printing aesthetic reasons

Three sheets are imported correctly, then the format problem occurs. The next three sheets are imported using this improper format. Then the problem occurs again, shifting the data in a similar way. So basically every fourth sheet that is imported appears to pull the data from columns out of order.

Original data: [image]

Output returns as expected: [image]

The problem occurs when it moves to the next sheet, even though its column formatting is the same as the last.

Original data: [image]

Output returned: [image]

It appears to pull M:P correctly, then it jumbles the data by appearing to pull in this order: AC:AD, S:Z while adding five extra blank columns, Q:R, AB:AC.

The only difference between the two worksheets is that the first has data in more columns than the second; however, both have the same number of columns being queried.

df1 = [pd.read_excel(xls, sheet_name=s, skiprows=4, nrows=32, usecols='M:AD') for s in main]
dfconcat = pd.concat(df1, ignore_index=True, sort=False)
dfconcat.dropna(axis=0, how='all', inplace=True)
writer = pd.ExcelWriter(f'{loc}/test.xlsx')
dfconcat.to_excel(writer, 'bananas', index=False, header=False, na_rep='', merge_cells=False)
writer.save()

Since it occurs every fourth sheet, I assume there is something incorrect in my code, or something to add to it to reset something in pandas after every pass. Any guidance would be appreciated.

Add header=None inside the pd.read_excel call. By default, read_excel uses the first row it reads (header=0) as the header. I.e. in your case, given skiprows=4, row 5 of each sheet is interpreted as the header.
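Applied to the list comprehension from the question, the change might look like the sketch below. The function name read_sheets is hypothetical and only wraps the original call; the other arguments are copied from the question unchanged.

```python
import pandas as pd


def read_sheets(xls, sheet_names):
    """Read the same cell block from every sheet without inferring a header.

    `read_sheets` is an illustrative helper, not part of pandas.
    """
    return [
        pd.read_excel(
            xls,
            sheet_name=s,
            skiprows=4,
            nrows=32,
            usecols="M:AD",
            header=None,  # no row is promoted to column labels
        )
        for s in sheet_names
    ]
```

With header=None the row that used to be consumed as the header is returned as an ordinary data row, so nrows may need adjusting by one to cover the same range as before.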

This causes problems when you use pd.concat, because concat aligns columns by label, not by position. E.g. if you have pd.concat([d1, d2]) and d1 has cols A, B, but d2 has cols B, A, then the result will have order A, B, following the first df, and the values of d2 are placed under the matching labels. Labels that appear only in a later frame are appended as extra columns. Hence, the "shift" of the columns.

So, basically, you end up doing something like this:

import pandas as pd

# Same labels, different order: concat aligns on the labels
dfs = [pd.DataFrame({'a':[1],'b':[2]}),
       pd.DataFrame({'b':[1],'a':[2]})]

print(pd.concat(dfs, ignore_index=True, sort=False))

   a  b
0  1  2
1  2  1

While you actually want to do:

dfs = [pd.DataFrame([{0: 'a', 1: 'b'}, {0: 1, 1: 2}]),
       pd.DataFrame([{0: 'b', 1: 'a'}, {0: 1, 1: 2}])]

print(pd.concat(dfs, ignore_index=True, sort=False))

   0  1
0  a  b
1  1  2
2  b  a
3  1  2
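To check whether this is what happens in a given workbook, one option is to compare the column labels pandas inferred for each sheet before concatenating. The sketch below uses small in-memory frames in place of the real sheets; column_orders and same_columns are hypothetical helper names, not pandas functions.

```python
import pandas as pd


def column_orders(frames):
    """Return the column labels of each frame, in order."""
    return [list(f.columns) for f in frames]


def same_columns(frames):
    """True if every frame has the same labels in the same order."""
    orders = column_orders(frames)
    return all(o == orders[0] for o in orders)


# Stand-ins for two sheets whose header rows differ in order
frames = [pd.DataFrame({'a': [1], 'b': [2]}),
          pd.DataFrame({'b': [1], 'a': [2]})]

print(column_orders(frames))
print(same_columns(frames))
```

If same_columns returns False for the list built by read_excel, the frames carry different labels and pd.concat will realign them.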
