Reading three tables from the same file into different DataFrames in pandas

I have a .xlsx file that contains 3 different tables, separated by the three key words "Setteled", "Refund" and "Charged". I want to read each table into a separate dataframe. The file data and the desired output are shown below.

File Data:-

   Setteled
   IN.Type     STRA     STRB   STRC
   CRBD        2487     XR     XL0054
   DFRS        3754     MY     XL0684
   CRBD        7356     DF     XL8911
   DFRS        4487     DF     XL58999
   DFRS        7785     MY     XL76568
   CRBD        8235     GL     XL0635
   DFRS        2468     PQ     XL4569
   DFRS        9735     GR     XL7589
   CRBD        6486     TY     XL5566 
   DFRS        1023     PQ     XL27952

   Refund
   IN.Type     STRD     STRE   
   DFRS        5898     RT     
   DFRS        5684     YU     
   CRBD        2564     RT     
   DFRS        1564     OP   
   DFRS        2548     YU   
   CRBD        4478     GL   
   CRBD        4515     OP  
   DFRS        5695     YU   
   DFRS        8665     RT   
   CRBD        1487     LK    

   Charged
   IN.Type     STRF     STRG   
   CRBD        1289     GH     
   CRBD        8546     JK     
   CRBD        6599     LP     
   DFRS        7899     JK   
   DFRS        1456     GH   
   CRBD        6988     JK   
   DFRS        1468     LP  
   DFRS        4697     GH   
   DFRS        7941     LP   
   DFRS        1636     JK

File Image:-

[image of the file]

Now, after reading the file, I want the three tables above in different dataframes, as below.

df = "Rows available below Setteled"

df:-

   IN.Type     STRA     STRB   STRC
   CRBD        2487     XR     XL0054
   DFRS        3754     MY     XL0684
   CRBD        7356     DF     XL8911
   DFRS        4487     DF     XL58999
   DFRS        7785     MY     XL76568
   CRBD        8235     GL     XL0635
   DFRS        2468     PQ     XL4569
   DFRS        9735     GR     XL7589
   CRBD        6486     TY     XL5566 
   DFRS        1023     PQ     XL27952

df2 = "Rows available below Refund"

df2:-

   IN.Type     STRD     STRE   
   DFRS        5898     RT     
   DFRS        5684     YU     
   CRBD        2564     RT     
   DFRS        1564     OP   
   DFRS        2548     YU   
   CRBD        4478     GL   
   CRBD        4515     OP  
   DFRS        5695     YU   
   DFRS        8665     RT   
   CRBD        1487     LK  

df3 = "Rows available below Charged"

df3:-

   IN.Type     STRF     STRG   
   CRBD        1289     GH     
   CRBD        8546     JK     
   CRBD        6599     LP     
   DFRS        7899     JK   
   DFRS        1456     GH   
   CRBD        6988     JK   
   DFRS        1468     LP  
   DFRS        4697     GH   
   DFRS        7941     LP   
   DFRS        1636     JK

Are your "tables" actual Excel tables? If so, you could use the approach explained here.

E.g.:

import pandas as pd
from openpyxl import load_workbook

filename = "tables.xlsx"

#read file
wb = load_workbook(filename)

#access specific sheet
ws = wb["Sheet1"]

mapping = {}

for entry, data_boundary in ws.tables.items():
    #parse the data within the ref boundary
    data = ws[data_boundary]
    #extract the data 
    #the inner list comprehension gets the values for each cell in the table
    content = [[cell.value for cell in ent] 
               for ent in data
          ]
    
    header = content[0]
    
    #the contents ... excluding the header
    rest = content[1:]
    
    #create dataframe with the column names
    #and pair table name with dataframe
    df = pd.DataFrame(rest, columns = header)
    mapping[entry] = df

This will get you a dictionary with all the tables in a specific sheet.
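To try the loop above without an existing file, here is a minimal sketch that builds a tiny in-memory workbook with two real Excel tables and reads them back. The sample rows, the cell ranges and the table names "Refund" and "Charged" are assumptions made up for the illustration; it assumes a recent openpyxl in which ws.tables behaves like a dict.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.worksheet.table import Table

# Hypothetical sample sheet: two small tables side by side (A1:B3 and D1:E3).
wb = Workbook()
ws = wb.active
for row in [
    ["IN.Type", "STRD", None, "IN.Type", "STRF"],
    ["DFRS", 5898, None, "CRBD", 1289],
    ["CRBD", 2564, None, "DFRS", 7899],
]:
    ws.append(row)

# Register both ranges as real Excel tables.
ws.add_table(Table(displayName="Refund", ref="A1:B3"))
ws.add_table(Table(displayName="Charged", ref="D1:E3"))

# Same idea as the loop above: one DataFrame per table, keyed by table name.
mapping = {}
for name, ref in ws.tables.items():
    rows = [[cell.value for cell in row] for row in ws[ref]]
    mapping[name] = pd.DataFrame(rows[1:], columns=rows[0])

print(mapping["Refund"])
print(mapping["Charged"])
```

Each table is then available by name, e.g. mapping["Refund"].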


If your "tables" aren't actual Excel tables, but simply ranges, we'll have to define the ranges ourselves. The code below should work, provided that all your "tables" are in the same worksheet, all key words are in row 1, and the actual "tables" start in row 2. It should not matter in which column the first table starts or whether the tables are separated by empty columns or not.

import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter

filename = "data_tables.xlsx"

#read file
wb = load_workbook(filename)

#access specific sheet
ws = wb["Sheet1"]

#create dict to store df "tables"
mapping = {}

#get cols for key words
col_numbers = [idx+1 for idx, cell in enumerate(ws[1]) if cell.value is not None]

#set vars to empty strings
first_address = ''
last_address = ''
entry = ''

for col in range(1, ws.max_column + 1):

    #convert int to col letter
    col_letter = get_column_letter(col)  

    #if no value in col
    last_row = 0

    #find last cell in col with value with loop over reversed col entries
    for cell in ws[col_letter][::-1]:
        if cell.value is not None:
            last_row = cell.row
            break
    
    #if col in col_numbers this is where a new "table" starts
    if col in col_numbers:

        #set entry for dict key
        entry = ws.cell(1,col).value
        
        #get first and last address
        first_address = f'{col_letter}{2}'
        last_address = f'{col_letter}{last_row}'
    
    #if col is not empty and last_address is not empty string, then we are
    #still inside one of our "tables", so update last_address
    if last_row != 0 and last_address != '':
        last_address = f'{col_letter}{last_row}'
    
    #create entry if
        # (we are in an empty col | the next col starts a new "table" | we're in the last col)
        # AND we haven't yet created this table (e.g. tables separated by multiple empty cols)
        # AND first_address is not an empty string (we have already reached the first table)
    if (last_row == 0 or col+1 in col_numbers or col == ws.max_column) and entry not in mapping.keys() \
        and first_address != '':
        
        #create string with table range
        table_range = f'{first_address}:{last_address}'
    
        #extract the data 
        #the inner list comprehension gets the values for each cell in the table
        data = ws[table_range]
        content = [[cell.value for cell in ent] for ent in data]
    
        #the contents ... excluding the header
        header = content[0]
        rest = content[1:]

        #create dataframe with the column names
        #and pair table name with dataframe
        df = pd.DataFrame(rest, columns = header)
        mapping[entry] = df

I've tested this code on a worksheet that has data like this:

[image of the test worksheet]

Works as expected. If your key words contain duplicates, the present code will only create a df for the first occurrence of a key word. If you'd like the code to handle duplicates, you need to add a check after entry = ws.cell(1,col).value to see if entry is already used as a key in the dict. If so, assign a different value to entry and continue. Let me know if you experience any difficulties.
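One possible way to do that duplicate check is sketched below. The helper name unique_key and the "_2", "_3" suffix scheme are assumptions for the illustration, not part of the code above; the key words in the demo loop are made up.

```python
def unique_key(name, existing):
    """Return name unchanged, or with a numeric suffix if it is already taken."""
    if name not in existing:
        return name
    n = 2
    while f"{name}_{n}" in existing:
        n += 1
    return f"{name}_{n}"


mapping = {}
for key_word in ["Refund", "Charged", "Refund", "Refund"]:
    entry = unique_key(key_word, mapping)
    mapping[entry] = None  # the real loop would store a DataFrame here

print(list(mapping))  # ['Refund', 'Charged', 'Refund_2', 'Refund_3']
```

In the loop above, the call would presumably go right after the key word is read, e.g. entry = unique_key(ws.cell(1,col).value, mapping).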

I am not sure if this is the best approach, but you can use

pd.read_excel(file, skiprows=1, skipfooter=n)  # n = number of rows below the table

So for the first dataframe you need to skip one line at the start (the key-word row) and n lines at the end, where n is the number of lines below the last line of data of that table.
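As a rough sketch of how those two numbers could be worked out for each table: the helper below assumes the layout posted in the question, i.e. the sheet starts in row 1, every table consists of a key-word row, a header row and its data rows, and tables are separated by exactly one blank row. The function name skip_args is made up for the example.

```python
def skip_args(row_counts, i):
    """Return skiprows/skipfooter for table i.

    row_counts holds the number of data rows of each table, top to bottom.
    Assumes: key-word row + header row + data rows per table, and exactly
    one blank row between consecutive tables.
    """
    block_sizes = [2 + n for n in row_counts]              # key word + header + data
    total_rows = sum(block_sizes) + len(block_sizes) - 1   # plus the blank separators
    start = sum(block_sizes[:i]) + i                       # rows above this table's key word
    return {
        "skiprows": start + 1,                             # also skip the key-word row itself
        "skipfooter": total_rows - (start + block_sizes[i]),
    }


for i in range(3):
    print(skip_args([10, 10, 10], i))
# {'skiprows': 1, 'skipfooter': 26}
# {'skiprows': 14, 'skipfooter': 13}
# {'skiprows': 27, 'skipfooter': 0}
```

The result could then be passed on as pd.read_excel(file, **skip_args([10, 10, 10], 1)) to get the second table. If the real sheet has extra rows or different spacing, the numbers need adjusting.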

You can also read the whole sheet into one dataframe and then slice it using df.loc.
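A minimal sketch of the slicing idea, using a small in-memory frame as a stand-in for pd.read_excel(file, header=None). The helper name slice_table and the sample rows are assumptions for the illustration; it relies on the default 0, 1, 2, ... index and on blank rows separating the tables.

```python
import pandas as pd

# Stand-in for pd.read_excel(file, header=None): two small stacked tables.
raw = pd.DataFrame([
    ["Refund", None],
    ["IN.Type", "STRD"],
    ["DFRS", 5898],
    ["CRBD", 2564],
    [None, None],
    ["Charged", None],
    ["IN.Type", "STRF"],
    ["CRBD", 1289],
])


def slice_table(raw, title):
    """Slice out the table below the row whose first cell equals title."""
    start = raw.index[raw[0] == title][0]        # label of the key-word row
    blank = raw.index[raw.isna().all(axis=1)]    # labels of the blank separator rows
    after = blank[blank > start]
    # .loc slices include the end label, so stop one row before the next blank row
    stop = after[0] - 1 if len(after) else raw.index[-1]
    table = raw.loc[start + 2:stop].dropna(axis=1, how="all")
    table.columns = list(raw.loc[start + 1, table.columns])  # header row
    return table.reset_index(drop=True)


refund = slice_table(raw, "Refund")
charged = slice_table(raw, "Charged")
print(refund)
print(charged)
```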

With the update of how the data looks in the worksheet, I think a different approach is easier. For that reason, I'm adding a new answer. In this case, we can simply use pandas and numpy: read the file into 1 df and then split it into 3 dfs. (Maybe with the other scenarios this is also possible, but this is another matter.)

This should do it:

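import pandas as pd
import numpy as np

filename = "data_tables.xlsx"

# read excel file; header=None keeps the first key word as data
# instead of turning it into a column name
df = pd.read_excel(filename, sheet_name='Sheet1', header=None)

# drop all cols with only NaN
df = df.dropna(axis=1, how="all")

# positions of the rows with only NaN (the separators between the tables)
blank_rows = np.flatnonzero(df.isnull().all(axis=1))

# split on those rows: each chunk runs from one blank row up to the next
bounds = [0, *blank_rows, len(df)]
df_list = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

# dictionary to store dfs
mapping = {}

# loop through list of dfs
for part in df_list:
    
    # drop all rows and cols with only NaN
    part = part.dropna(how="all").dropna(axis=1, how="all")
    
    # skip empty chunks (e.g. a blank first row or consecutive blank rows)
    if part.empty:
        continue
    
    # first cell should now contain your key word
    key = part.iloc[0,0]
    
    # second row contains your headers, content starts at third row
    table = part.iloc[2:].copy()
    table.columns = list(part.iloc[1])
    
    # reset the index and add to dictionary
    mapping[key] = table.reset_index(drop=True)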
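To see the idea on data that needs no file, here is a small variant that groups rows by a running count of blank rows (groupby on a cumulative sum) instead of splitting a list of chunks, and then unpacks the result into df, df2 and df3 as asked in the question. The in-memory frame is a shortened stand-in for the sheet read with header=None; only one data row per table is included.

```python
import pandas as pd

# Shortened stand-in for the worksheet (one data row per table).
raw = pd.DataFrame([
    ["Setteled", None, None, None],
    ["IN.Type", "STRA", "STRB", "STRC"],
    ["CRBD", 2487, "XR", "XL0054"],
    [None, None, None, None],
    ["Refund", None, None, None],
    ["IN.Type", "STRD", "STRE", None],
    ["DFRS", 5898, "RT", None],
    [None, None, None, None],
    ["Charged", None, None, None],
    ["IN.Type", "STRF", "STRG", None],
    ["CRBD", 1289, "GH", None],
])

blank = raw.isna().all(axis=1)

mapping = {}
# blank.cumsum() increases by one at every blank row, so all rows of one
# table share the same block number
for _, block in raw[~blank].groupby(blank.cumsum()[~blank]):
    block = block.dropna(axis=1, how="all")   # drop columns this table doesn't use
    key = block.iloc[0, 0]                    # key word
    table = block.iloc[2:].copy()             # data rows
    table.columns = list(block.iloc[1])       # header row
    mapping[key] = table.reset_index(drop=True)

# one dataframe per key word, as in the desired output
df, df2, df3 = (mapping[k] for k in ("Setteled", "Refund", "Charged"))
print(df, df2, df3, sep="\n\n")
```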
