
Extracting and manipulating data from Excel worksheets with Python

Scenario: I am trying to write Python code that reads all the workbooks in a given folder and loads the data of each into its own data frame (each workbook becomes a dataframe, so I can manipulate them individually before combining them).

Issue1: With this code, even though I am using the proper path and file types, I keep getting the error:

File "<ipython-input-3-2a450c707fbe>", line 14, in <module>
f = open(file,'r')

FileNotFoundError: [Errno 2] No such file or directory: '(1)Copy of Preisanfrage_17112016.xlsx'

Issue2: The reason for me to create different data frames is that each workbook has an individual format (rows are my identifiers and columns are dates). My problem is that some of these workbooks have their data on a sheet named "Closing" or "Opening", while in others the sheet name is not specified. So I will try to configure each data frame individually and then join them afterwards.

Issue3: Once the data frames are unified, my objective is to output the data in a format like:

date 1    identifier 1    value
date 1    identifier 2    value
date 1    identifier 3    value
date 1    identifier 4    value
date 2    identifier 1    value
date 2    identifier 4    value
date 2    identifier 5    value

Obs1: For the output, not all dates have the same array of identifiers.

Question 1: Any ideas why the code is yielding this error? Is there a better way to extract data from excel?

Question 2: Is it possible to create a unique dataframe for each worksheet? Is this a good practice?

Question 3: Can I do this type of output using a loop? Is this a good practice?

Obs2: I don't know how relevant this is, but I am using Python 3.6 with Anaconda.

Code so far:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob, os
import datetime as dt
from datetime import datetime
import matplotlib as mpl


directory = os.path.join("C:\\","Users\\Dgms\\Desktop\\final 2")
for root,dirs,files in os.walk(directory):
    for file in files:
        print(file)
        f = open(file,'r')
        df1 = pd.read_excel(file)

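I think you do not need the open() call at all. The error occurs because os.walk yields bare file names, which are resolved against the current working directory unless you join them with root. I would also store the data frames in a list. You can then either use pd.concat(list_of_dfs) or make some manual changes to each one.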

list_of_dfs = []
for root,dirs,files in os.walk(directory):
    for file in files:
        # join the folder and the bare file name to get a usable path
        f = os.path.join(root, file)
        print(f)
        list_of_dfs.append(pd.read_excel(f))

Or, using glob:

import glob
list_of_dfs = []
# os.path.join adds the path separator that directory + '*.xlsx' would lack
for file in glob.iglob(os.path.join(directory, '*.xlsx')):
    print(file)
    list_of_dfs.append(pd.read_excel(file))

Or, as jackie suggests, you can read specific sheets: list_of_dfs.append(pd.concat([pd.read_excel(file, 'Opening'), pd.read_excel(file, 'Closing')])). If only one of them is available in a given workbook, you could even change this to

try:
    list_of_dfs.append(pd.read_excel(file, 'Opening'))
except Exception:
    # no 'Opening' sheet in this workbook
    pass
try:
    list_of_dfs.append(pd.read_excel(file, 'Closing'))
except Exception:
    # no 'Closing' sheet in this workbook
    pass

(Of course, you should catch the specific exception rather than everything, but I can't test that at the moment.)
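As an alternative to catching exceptions, the sheet names can be inspected first and one of them chosen explicitly. The following is only a sketch: the helper names pick_sheet and read_preferred_sheet and the preference order ("Closing" before "Opening") are assumptions made for illustration, not something from the question. It relies on pd.ExcelFile, whose sheet_names attribute lists the sheets in a workbook.

```python
import pandas as pd

# Hypothetical preference order; adjust to the sheet names in your workbooks.
PREFERRED_SHEETS = ("Closing", "Opening")


def pick_sheet(sheet_names, preferred=PREFERRED_SHEETS):
    """Return the first preferred name that exists, else the first sheet."""
    for name in preferred:
        if name in sheet_names:
            return name
    return sheet_names[0]


def read_preferred_sheet(path):
    """Read a single sheet from the workbook at `path`, chosen by pick_sheet()."""
    with pd.ExcelFile(path) as xls:
        return xls.parse(pick_sheet(xls.sheet_names))
```

In the loops above, list_of_dfs.append(read_preferred_sheet(f)) would then replace the two try/except blocks. Workbooks whose sheet has no special name fall back to their first sheet.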

Issue 1: If you are using an IDE or Jupyter, pass the absolute path to the file. Alternatively, add the project folder to the system path (a workaround, not recommended).
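Issue 3: The date / identifier / value layout can probably be produced without an explicit loop over cells, by reshaping each data frame from wide to long with melt and then concatenating. This is a rough sketch under assumptions: the function name to_long and the column names "date", "identifier" and "value" are made up for illustration, each frame is assumed to have identifiers as its index and dates as its columns, and the small frames below are invented sample data (dates kept as strings for simplicity).

```python
import pandas as pd


def to_long(wide, id_name="identifier"):
    """Reshape a wide frame (index = identifiers, columns = dates) to long form."""
    long_df = (
        wide.rename_axis(id_name)          # name the index so it becomes a column
            .reset_index()
            .melt(id_vars=id_name, var_name="date", value_name="value")
            .dropna(subset=["value"])      # dates need not share the same identifiers
            .sort_values(["date", id_name])
            .reset_index(drop=True)
    )
    return long_df[["date", id_name, "value"]]


# Invented sample data standing in for two workbooks.
wb1 = pd.DataFrame(
    {"2016-11-17": [1.0, 2.0], "2016-11-18": [3.0, None]},
    index=["id1", "id2"],
)
wb2 = pd.DataFrame({"2016-11-18": [5.0]}, index=["id3"])

combined = (
    pd.concat([to_long(wb1), to_long(wb2)], ignore_index=True)
      .sort_values(["date", "identifier"])
      .reset_index(drop=True)
)
print(combined)
```

Missing values are dropped after melting, so a date only lists the identifiers that actually have data for it.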

