
Reading three tables from the same file into different DataFrames in pandas

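I have an .xlsx file that contains 3 different tables, separated by the three key words "Setteled", "Refund" and "Charged". I want to read each table into a separate DataFrame. The file data and the required output are shared below.

File data:-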

   Setteled
   IN.Type     STRA     STRB   STRC
   CRBD        2487     XR     XL0054
   DFRS        3754     MY     XL0684
   CRBD        7356     DF     XL8911
   DFRS        4487     DF     XL58999
   DFRS        7785     MY     XL76568
   CRBD        8235     GL     XL0635
   DFRS        2468     PQ     XL4569
   DFRS        9735     GR     XL7589
   CRBD        6486     TY     XL5566 
   DFRS        1023     PQ     XL27952

   Refund
   IN.Type     STRD     STRE   
   DFRS        5898     RT     
   DFRS        5684     YU     
   CRBD        2564     RT     
   DFRS        1564     OP   
   DFRS        2548     YU   
   CRBD        4478     GL   
   CRBD        4515     OP  
   DFRS        5695     YU   
   DFRS        8665     RT   
   CRBD        1487     LK    

   Charged
   IN.Type     STRF     STRG   
   CRBD        1289     GH     
   CRBD        8546     JK     
   CRBD        6599     LP     
   DFRS        7899     JK   
   DFRS        1456     GH   
   CRBD        6988     JK   
   DFRS        1468     LP  
   DFRS        4697     GH   
   DFRS        7941     LP   
   DFRS        1636     JK

Now, after reading the file, I want the three tables in separate DataFrames, as follows.

df = "rows available under Setteled"

df:-

   IN.Type     STRA     STRB   STRC
   CRBD        2487     XR     XL0054
   DFRS        3754     MY     XL0684
   CRBD        7356     DF     XL8911
   DFRS        4487     DF     XL58999
   DFRS        7785     MY     XL76568
   CRBD        8235     GL     XL0635
   DFRS        2468     PQ     XL4569
   DFRS        9735     GR     XL7589
   CRBD        6486     TY     XL5566 
   DFRS        1023     PQ     XL27952

df2 = "rows available under Refund"

df2:-

   IN.Type     STRD     STRE   
   DFRS        5898     RT     
   DFRS        5684     YU     
   CRBD        2564     RT     
   DFRS        1564     OP   
   DFRS        2548     YU   
   CRBD        4478     GL   
   CRBD        4515     OP  
   DFRS        5695     YU   
   DFRS        8665     RT   
   CRBD        1487     LK  

df3 = "rows available under Charged"

df3:-

   IN.Type     STRF     STRG   
   CRBD        1289     GH     
   CRBD        8546     JK     
   CRBD        6599     LP     
   DFRS        7899     JK   
   DFRS        1456     GH   
   CRBD        6988     JK   
   DFRS        1468     LP  
   DFRS        4697     GH   
   DFRS        7941     LP   
   DFRS        1636     JK

Are your "tables" actual Excel tables? If so, you can use the method explained here.

For example:

import pandas as pd
from openpyxl import load_workbook

filename = "tables.xlsx"

#read file
wb = load_workbook(filename)

#access specific sheet
ws = wb["Sheet1"]

mapping = {}

for entry, data_boundary in ws.tables.items():
    #parse the data within the ref boundary
    data = ws[data_boundary]
    #extract the data 
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] 
               for ent in data
          ]
    
    header = content[0]
    
    #the contents ... excluding the header
    rest = content[1:]
    
    #create dataframe with the column names
    #and pair table name with dataframe
    df = pd.DataFrame(rest, columns = header)
    mapping[entry] = df

This gives you a dictionary containing all the tables in the given worksheet, keyed by table name.
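To make the idea concrete, here is a minimal, self-contained sketch of the same technique. It builds a small workbook in memory instead of loading "tables.xlsx", so the helper name `tables_to_frames`, the table name "Refund" and the two data rows are only assumptions for illustration.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.worksheet.table import Table


def tables_to_frames(ws):
    """Return {table name: DataFrame} for every Excel table on the worksheet."""
    frames = {}
    for name, ref in ws.tables.items():
        # ws[ref] yields the rows of cells inside the table's range
        rows = [[cell.value for cell in row] for row in ws[ref]]
        # first row holds the headers, the rest is data
        frames[name] = pd.DataFrame(rows[1:], columns=rows[0])
    return frames


# Hypothetical in-memory workbook with one real Excel table
wb = Workbook()
ws = wb.active
for row in [["IN.Type", "STRD", "STRE"],
            ["DFRS", 5898, "RT"],
            ["CRBD", 2564, "RT"]]:
    ws.append(row)
ws.add_table(Table(displayName="Refund", ref="A1:C3"))

frames = tables_to_frames(ws)
df2 = frames["Refund"]
```

With a real file you would pass `load_workbook(filename)["Sheet1"]` to the helper and look up each DataFrame by its Excel table name.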


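If your "tables" are not actual Excel tables but just ranges, we have to define the ranges ourselves. The code below should work, provided that all your "tables" are in the same worksheet, all key words are in row 1, and the actual "tables" start in row 2. It does not matter in which column the first table starts, or whether the tables are separated by empty columns.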

import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter

filename = "data_tables.xlsx"

#read file
wb = load_workbook(filename)

#access specific sheet
ws = wb["Sheet1"]

#create dict to store df "tables"
mapping = {}

#get cols for key words
col_numbers = [idx+1 for idx, cell in enumerate(ws[1]) if cell.value is not None]

#set vars to empty strings
first_address = ''
last_address = ''
entry = ''

for col in range(1, ws.max_column + 1):

    #convert int to col letter
    col_letter = get_column_letter(col)  

    #if no value in col
    last_row = 0

    #find last cell in col with value with loop over reversed col entries
    for cell in ws[col_letter][::-1]:
        if cell.value is not None:
            last_row = cell.row
            break
    
    #if col in col_numbers this is where a new "table" starts
    if col in col_numbers:

        #set entry for dict key
        entry = ws.cell(1,col).value
        
        #get first and last address
        first_address = f'{col_letter}{2}'
        last_address = f'{col_letter}{last_row}'
    
    #if col is not empty and last_address is not empty string, then we are
    #still inside one of our "tables", so update last_address
    if last_row != 0 and last_address != '':
        last_address = f'{col_letter}{last_row}'
    
    #create entry if
        # (we are in empty col | the next col starts a new "table" | we're in the last col)
        # AND we haven't yet created this table (e.g. tables separated by multiple empty cols)
        # AND first_address is not empty string (i.e. we have already reached the first table)
    if (last_row == 0 or col+1 in col_numbers or col == ws.max_column) and entry not in mapping \
        and first_address != '':
        
        #create string with table range
        table_range = f'{first_address}:{last_address}'
    
        #extract the data 
        #the inner list comprehension gets the values for each cell in the table
        data = ws[table_range]
        content = [[cell.value for cell in ent] for ent in data]
    
        #the contents ... excluding the header
        header = content[0]
        rest = content[1:]

        #create dataframe with the column names
        #and pair table name with dataframe
        df = pd.DataFrame(rest, columns = header)
        mapping[entry] = df

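I have tested this code on a sheet with data laid out as described above, and it works as expected. If your key words contain duplicates, the current code will only create a df for the first occurrence. If you want the code to handle duplicates, you need to add a check after entry = ws.cell(1,col).value to see whether entry is already used as a key in the dict. If so, assign a different value to entry and continue. Let me know if you run into any difficulties.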
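One possible way to do that duplicate check is sketched below. The helper `unique_key` and its numeric-suffix naming scheme are assumptions, not part of the code above; the idea is to call it right after entry = ws.cell(1,col).value.

```python
def unique_key(entry, mapping):
    """Return entry unchanged if unused, otherwise entry plus a numeric suffix."""
    if entry not in mapping:
        return entry
    n = 2
    # keep counting until we find a suffix that is not taken yet
    while f"{entry}_{n}" in mapping:
        n += 1
    return f"{entry}_{n}"


# Hypothetical key words with duplicates, in the order they appear in row 1
mapping = {}
for keyword in ["Refund", "Charged", "Refund", "Refund"]:
    key = unique_key(keyword, mapping)
    mapping[key] = None  # the DataFrame for this "table" would be stored here
```

In the loop above this would read entry = unique_key(ws.cell(1,col).value, mapping), so a repeated key word gets its own dict key instead of being skipped.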

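I'm not sure whether this is the best way, but you can use

pd.read_excel(file, skiprows=1, skipfooter=n)

So for the first DataFrame, you skip one row at the beginning and skip n rows at the end, where n is the number of rows below the last row of data you want.

You can also read everything into one DataFrame and then slice it with df.loc.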
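A rough sketch of the df.loc idea follows. It assumes the sheet was read without a header (for example with pd.read_excel(file, header=None)), that the key words sit in the first column in file order, and that the index is the default 0, 1, 2, ... The helper `slice_by_keywords` and the shortened sample data are made up for illustration.

```python
import pandas as pd


def slice_by_keywords(raw, keywords):
    """Split a header-less DataFrame into one DataFrame per key word section."""
    first_col = raw.iloc[:, 0]
    # row label where each key word appears (key words given in file order)
    starts = [first_col[first_col == kw].index[0] for kw in keywords]
    ends = starts[1:] + [raw.index[-1] + 1]
    frames = {}
    for kw, start, end in zip(keywords, starts, ends):
        # .loc includes both ends: header row is start + 1, last row is end - 1
        block = raw.loc[start + 1:end - 1]
        block = block.dropna(how="all").dropna(axis=1, how="all")
        body = block.iloc[1:].reset_index(drop=True)
        body.columns = list(block.iloc[0])
        frames[kw] = body
    return frames


# Shortened stand-in for the sheet in the question
raw = pd.DataFrame([
    ["Setteled", None, None, None],
    ["IN.Type", "STRA", "STRB", "STRC"],
    ["CRBD", 2487, "XR", "XL0054"],
    ["DFRS", 3754, "MY", "XL0684"],
    [None, None, None, None],
    ["Refund", None, None, None],
    ["IN.Type", "STRD", "STRE", None],
    ["DFRS", 5898, "RT", None],
    [None, None, None, None],
    ["Charged", None, None, None],
    ["IN.Type", "STRF", "STRG", None],
    ["CRBD", 1289, "GH", None],
])

frames = slice_by_keywords(raw, ["Setteled", "Refund", "Charged"])
df, df2, df3 = frames["Setteled"], frames["Refund"], frames["Charged"]
```

Compared with skiprows/skipfooter, this avoids counting rows by hand, at the cost of having to know the key words up front.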

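With the update on how the data looks in the sheet, I think a different approach is easier. For that reason, I am adding a new answer. In this case we can simply use pandas and numpy: read the file into 1 df, then split it into 3 dfs. (Maybe this is also possible in the other scenario, but that's another matter.)

This should do it: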

import pandas as pd
import numpy as np

filename = "data_tables.xlsx"

# read excel file without treating any row as the header,
# so the key word rows stay in the data
df = pd.read_excel(filename, sheet_name='Sheet1', header=None)

# drop all cols with only NaN
df = df.dropna(axis=1, how="all")

# split dfs on rows with only NaN
df_list = np.split(df, df[df.isnull().all(axis=1)].index)

# dictionary to store dfs
mapping = {}

# loop through list of dfs
for part in df_list:

    # drop all rows and cols with only NaN
    part = part.dropna(how="all").dropna(axis=1, how="all")

    # skip empty chunks (e.g. leading or consecutive blank rows)
    if part.empty:
        continue

    # first cell should now contain your key word
    key = part.iloc[0, 0]

    # second row should now contain your headers
    part.columns = list(part.iloc[1])

    # content starts at third row; reset the index
    part = part.iloc[2:].reset_index(drop=True)

    # add to dictionary
    mapping[key] = part
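As an alternative sketch, the same split can be done with pandas alone by numbering the sections with a cumulative count of blank rows and grouping on that number. Calling np.split on a DataFrame may emit a deprecation warning in recent pandas versions, which is the motivation here. The helper `split_on_blank_rows` and the shortened sample data are assumptions, and the frame is assumed to have been read with header=None.

```python
import pandas as pd


def split_on_blank_rows(raw):
    """Split a header-less DataFrame into sections separated by all-NaN rows."""
    blank = raw.isna().all(axis=1)
    # every blank row bumps the counter, so each section gets its own id
    section_id = blank.cumsum()
    mapping = {}
    for _, part in raw[~blank].groupby(section_id[~blank]):
        part = part.dropna(axis=1, how="all")
        key = part.iloc[0, 0]                   # first cell holds the key word
        body = part.iloc[2:].reset_index(drop=True)
        body.columns = list(part.iloc[1])       # second row holds the headers
        mapping[key] = body
    return mapping


# Shortened stand-in for the sheet in the question
raw = pd.DataFrame([
    ["Setteled", None, None, None],
    ["IN.Type", "STRA", "STRB", "STRC"],
    ["CRBD", 2487, "XR", "XL0054"],
    [None, None, None, None],
    ["Refund", None, None, None],
    ["IN.Type", "STRD", "STRE", None],
    ["DFRS", 5898, "RT", None],
    ["DFRS", 5684, "YU", None],
])

mapping = split_on_blank_rows(raw)
```

Because blank rows are filtered out before grouping, leading or repeated blank rows do not produce empty sections.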
