
Cleaning Excel Spreadsheet using Python

I have what seems to be a simple task - I am almost done, but have one pesky issue I should be able to get rid of, but it's being elusive.

I have a number of Excel .xls files. The file name is in the format .xls. I created the filenames.txt file to iterate through to get the company names. Each file has garbage data in the first 4 or so rows, so I need to remove those first four rows in all the files. I then need to add a column with the in the first column position.

My code runs with no errors, but the output is not exactly what I need. The only problems I am running into are: 1. I am getting a leading column added that I wasn't expecting with index numbers. 2. The strip command doesn't appear to be stripping the '.xls' - so what ends up being inserted into the column in Excel is .xls instead of just . 3. Because the '.xls' is not being stripped properly, the to_excel command is saving the file with a '.xls.xls' extension.

I read a few similar scenarios, so I have this code being used:

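import pandas as pd
import os

# os.chdir() returns None, so there is nothing useful to assign.
os.chdir(r"C:\Users\mheitz\Documents\testing")

# One file name per line; strip the trailing newline from each.
filenames = [names.strip('\n') for names in
             open('filenames.txt', 'r').readlines()]

for name in filenames:
    # header=11 uses row 11 (0-based) as the header; rows above it are
    # discarded, so skiprows is not needed.
    vendors = pd.read_excel(name, header=11)
    # name[:-4] drops the trailing '.xls'.
    vendors.insert(0, 'Vendor Name', name[:-4])
    vendors.to_excel(r"C:\Users\mheitz\Documents\testing\clean\clean" + name)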
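
import pandas as pd

# Strip the newline and the '.xls' extension while building the list.
exhibit_company = [names.strip('\n')[:-4] for names in
                   open('filenames.txt', 'r').readlines()]

for company in exhibit_company:
    # The extension is added back so that read_excel can find the file.
    vendors = pd.read_excel(company + '.xls', header=5)
    vendors.insert(0, 'Vendor Name', company)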


open('filenames.txt', 'r').readlines() 
['james.xls\n', 'nancy.xls\n', 'temitope.xls\n', 'bianca.xls\n']
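The listing above shows that every entry returned by readlines() still ends in a newline. A minimal sketch of reading the list without them, assuming filenames.txt holds one file name per line (the helper name read_filenames is hypothetical, not part of the original code):

```python
from pathlib import Path


def read_filenames(list_path):
    """Return the non-empty lines of list_path without line terminators."""
    text = Path(list_path).read_text()
    # splitlines() drops the '\n' at the end of each line;
    # the filter skips any blank lines in the file.
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Calling read_filenames('filenames.txt') would then give a list that can be looped over directly, with no further strip('\n') needed.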

To remove \n, we use strip('\n').

for name in names:

To remove .xls, we use [:-4]: len('.xls') is 4, and a negative stop index counts from the end of the string, so name[:-4] keeps everything except the last 4 characters.

    for name in names:
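One plausible reason the strip call in the question appeared to do nothing is that strip() treats its argument as a set of characters to remove from both ends, not as a suffix, and stops at the first character outside that set. The sketch below uses one entry from the listing above plus a made-up name ('alex.xls') to show the behaviour, and os.path.splitext as one alternative to slicing:

```python
import os

raw = 'james.xls\n'  # one entry as returned by readlines()

# strip('.xls') removes any of the characters '.', 'x', 'l', 's' from both
# ends. The trailing '\n' is not in that set, so nothing is removed here.
unchanged = raw.strip('.xls')           # 'james.xls\n'

# Once the newline is gone, strip('.xls') can remove too much:
# the 'x' before the dot is also in the character set.
over_stripped = 'alex.xls'.strip('.xls')  # 'ale'

# Dropping the newline first and then splitting off the extension
# avoids both problems.
stem, ext = os.path.splitext(raw.strip())  # ('james', '.xls')
```

Using the stem for the inserted column and for building the output file name would also avoid the doubled '.xls.xls' extension described in the question.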

For more on readlines(), see https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

For more on list comprehensions, see https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions

There is no need to loop the values into the dataframe. Let's go back to the list of names:

list_of_names = [name1,name2,name3]

df = pd.DataFrame(list_of_names, columns=['company_names'])

Again, thanks for your help... amazing what a good night's sleep and some coffee will do for your state of mind. I realized this morning that I was doing too much. I only needed ONE list to iterate through, not two. ;) I'll post my final code above - the only thing I still need to resolve is the leading column it is inserting with the index numbers, but that should be an easy fix - at least I can get through the 86 Excel files though!
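The leading column of index numbers is the DataFrame index, which pandas writes out by default; passing index=False to the writer omits it. The sketch below uses made-up data and writes CSV into an in-memory buffer so that it runs without an Excel writer engine installed; to_excel accepts the same index parameter.

```python
import io

import pandas as pd

# Made-up stand-in for one cleaned spreadsheet.
vendors = pd.DataFrame({'Item': ['bolt', 'nut'], 'Qty': [10, 20]})

# A scalar passed to insert() is broadcast to every row, so no loop is needed.
vendors.insert(0, 'Vendor Name', 'james')

# index=False leaves out the 0, 1, 2, ... index column.
# The same keyword works for vendors.to_excel(path, index=False).
buffer = io.StringIO()
vendors.to_csv(buffer, index=False)
header = buffer.getvalue().splitlines()[0]
```

With the default index=True, the first header cell would be empty and each row would start with its index number.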

