Python: extracting data from a semi-structured .xlsx file

Question

I have an .xlsx file that looks like the attached one. What is the most common way to extract the different data parts from this Excel file in Python?

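Ideally, there would be a method defined as:

pd.read_part_csv(columns=['data1', 'data2', 'data3'], rows=['val1', 'val2', 'val3']) that returns an iterator over multiple pandas DataFrames, each holding the values of one of the given tables.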

Answer 1

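Here is a solution with pylightxl, which may be a good fit for your project if all you are doing is reading. I wrote the solution by rows, but you could do it by columns as well. For more information on pylightxl, see the documentation: https://pylightxl.readthedocs.io/en/latest/quickstart.html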

import pylightxl

db = pylightxl.readxl('Book1.xlsx')
ws = db.ws('Sheet1')

# pull out all the rowIDs (1-based) where data groups start
keyrows = [rowID for rowID, row in enumerate(ws.rows, 1) if 'val1' in row]

# find the columnIDs where data groups start (like in your example, not all data groups start in col A)
keycols = []
for keyrow in keyrows:
    # list.index is 0-based, so add 1 to get the 1-based columnID of the 'val1' label
    keycols.append(ws.row(keyrow).index('val1') + 1)

# define a dict to hold your data groups
datagroups = {}
# populate the data groups
for tableIndex, keyrow in enumerate(keyrows, 1):
    i = 0
    # keys are group IDs starting from 1, values are lists of data rows (ie: val1, val2...)
    datagroups[tableIndex] = []
    while True:
        # pull out the current row and slice off everything up to and including the label column
        datarow = ws.row(keyrow + i)[keycols[tableIndex - 1]:]
        # stop when nothing is left or the first data cell is empty: the group has ended
        if not datarow or datarow[0] == '':
            break
        datagroups[tableIndex].append(datarow)
        i += 1

print(datagroups[1])
print(datagroups[2])

[[1, 2, 3, ''], [4, 5, 6, ''], [7, 8, 9, '']]
[[9, 1, 4], [2, 4, 1], [3, 2, 1]]

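Note that the output for table 1 has extra '' cells. That is because the worksheet's data range is wider than that data group, so each row is padded with empty cells. You can strip them if you like, but keep in mind that list.remove('') only removes the first '' in a list, so it has to be applied per row and repeated when a row has more than one empty cell.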
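As a minimal sketch of that cleanup, assuming the padding only ever appears at the end of a row, the helper below trims trailing '' cells from every row of a data group. The name strip_trailing_blanks is hypothetical and not part of pylightxl; the sample data is the table 1 output shown above.

```python
def strip_trailing_blanks(rows):
    """Return a copy of rows with trailing '' cells removed from each row."""
    cleaned = []
    for row in rows:
        end = len(row)
        # walk back from the end of the row while the cells are empty strings
        while end > 0 and row[end - 1] == '':
            end -= 1
        cleaned.append(row[:end])
    return cleaned


# sample data copied from the printed output of datagroups[1]
table1 = [[1, 2, 3, ''], [4, 5, 6, ''], [7, 8, 9, '']]
print(strip_trailing_blanks(table1))
# prints [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Only trailing cells are dropped, so an empty cell in the middle of a row stays where it is and the remaining values keep their column positions.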
