简体   繁体   中英

Cleaning Excel Spreadsheet using Python

I have what seems to be a simple task - I am almost done, but have one pesky issue I should be able to get rid of, but it's being elusive.

I have a number of Excel .xls files. The file name is in the format .xls. I created the filenames.txt file to iterate through to get the company names. Each file has garbage data in the first 4 or so rows, so I need to remove those first four rows in all the files. I then need to add a column with the in the first column position.

My code runs with no errors, but the output is not exactly what I need. The only problems I am running into are: 1. I am getting a leading column added that I wasn't expecting with index numbers. 2. The strip command doesn't appear to be stripping the '.xls' - so what ends up being inserted into the column in Excel is .xls instead of just . 3. Because the '.xls' is not being stripped properly, the to_excel command is saving the file with a '.xls.xls' extension.

I read a few similar scenarios, so I have this code being used:

import pandas as pd
import os
path = os.chdir(r"C:\Users\mheitz\Documents\testing")

filenames = [names.strip('\n') for names in \            
    open(r"C:\Users\mheitz\Documents\testing\filenames.txt",'r').readlines()]

for name in filenames:
    vendors = pd.read_excel(name, header = 11, skiprows =0-10)
    vendors.insert(0,'Vendor Name',(name[:-4]))
    vendors.to_excel(r"C:\Users\mheitz\Documents\testing\clean\clean" + name)
import pandas as pd

exhibit_company = [i.strip('\n')[:-4] for names in \
                  open('filenames.txt','r').readlines()]

for company in exhibit_company:
    vendors = pd.read_excel(company, header = 5, skiprows =0-4)
    vendors.insert(0,'Vendor Name',(company))
    vendors.to_excel('/Users/michaelheitz/Desktop/Work 
                     Stuff/Data/clean'+company+'.xls')

Explanation:

open('filenames.txt', 'r').readlines() 
['james.xls\n', 'nancy.xls\n', 'temitope.xls\n', 'bianca.xls\n']

To remove \\n , we use strip('\\n').

for name in names:
        name.strip('\n')
    james.xls
    nancy.xls
    temitope.xls
    bianca.xls

To remove .xls, we use [:-4], because len(.xls) = 4, using negative means slice after 4 characters , counting from back.

    for name in names:
            name[:-4]
        james
        nancy
        temitope
        bianca

For more on readlines(), see https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

For more on generators, see https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions

There is no need to loop the values into the dataframe. Lets go back to the list of names,

list_of_names = [name1,name2,name3]

df = pd.DataFrame(list_of_names,columns={'company_names'})

again, thanks for your help... amazing what a good night's sleep and some coffee will do for your state of mind. I realized this morning that I was doing too much. I only needed ONE list, not two - to iterate through. ;) I'll post my final code above - the only thing I still need to resolve is the leading column it is inserting with the index #'s, but that should be an easy fix - at least I can get through the 86 excel files though!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM