I have what seems to be a simple task - I am almost done, but have one pesky issue I should be able to get rid of, but it's being elusive.
I have a number of Excel .xls files. Each file name is in the format [company name].xls. I created the filenames.txt file to iterate through to get the company names. Each file has garbage data in the first four or so rows, so I need to remove those rows in all the files. I then need to add a column with the company name in the first column position.
My code runs with no errors, but the output is not exactly what I need. The only problems I am running into are:
1. I am getting a leading column of index numbers that I wasn't expecting.
2. The strip command doesn't appear to be stripping the '.xls', so what ends up being inserted into the column in Excel is [company name].xls instead of just [company name].
3. Because the '.xls' is not being stripped properly, the to_excel command is saving the file with a '.xls.xls' extension.
I read a few similar questions, and this is the code I am using:
import pandas as pd
import os
path = os.chdir(r"C:\Users\mheitz\Documents\testing")
filenames = [names.strip('\n') for names in \
    open(r"C:\Users\mheitz\Documents\testing\filenames.txt",'r').readlines()]
for name in filenames:
    vendors = pd.read_excel(name, header = 11, skiprows =0-10)
    vendors.insert(0,'Vendor Name',(name[:-4]))
    vendors.to_excel(r"C:\Users\mheitz\Documents\testing\clean\clean" + name)
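import pandas as pd
# Read the file names, dropping the trailing newline and the '.xls' extension.
exhibit_company = [names.strip('\n')[:-4] for names in \
    open('filenames.txt','r').readlines()]
for company in exhibit_company:
    # The files on disk still carry the extension, so add it back when reading.
    vendors = pd.read_excel(company + '.xls', header = 5, skiprows =0-4)
    vendors.insert(0,'Vendor Name',(company))
    vendors.to_excel('/Users/michaelheitz/Desktop/Work Stuff/Data/clean'+company+'.xls')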
Explanation:
open('filenames.txt', 'r').readlines()
['james.xls\n', 'nancy.xls\n', 'temitope.xls\n', 'bianca.xls\n']
To remove the trailing \n, we use strip('\n').
for name in names:
    name.strip('\n')
james.xls
nancy.xls
temitope.xls
bianca.xls
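As a variation on the readlines() approach, here is a minimal sketch that reads the list with a context manager, so the file is closed afterwards, and skips blank lines. The helper name read_filenames and the temporary demo file are made up for illustration; in the real script the path would point at filenames.txt.

```python
import os
import tempfile


def read_filenames(list_path):
    """Return the non-empty lines of list_path with surrounding whitespace removed."""
    with open(list_path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demo with a throwaway file standing in for filenames.txt.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "filenames.txt")
    with open(demo_path, "w") as handle:
        handle.write("james.xls\nnancy.xls\n\ntemitope.xls\nbianca.xls\n")
    filenames = read_filenames(demo_path)

print(filenames)
```

Calling strip() with no argument removes spaces and '\r' as well as '\n', which may help if the list was saved with Windows line endings.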
To remove .xls, we use [:-4], because len('.xls') == 4. A negative stop index counts from the end of the string, so the slice keeps everything except the last 4 characters.
for name in names:
    name[:-4]
james
nancy
temitope
bianca
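A note on why strip is unreliable for removing an extension: str.strip('.xls') does not remove the suffix '.xls'. It removes any of the characters '.', 'x', 'l' and 's' from both ends of the string, and it returns a new string rather than changing the original. The sketch below compares it with slicing and with os.path.splitext. The file names and the helper company_name are hypothetical, picked only to show the edge cases.

```python
import os


def company_name(filename):
    """Return the file name without its extension, whatever its length."""
    stem, _ext = os.path.splitext(filename)
    return stem


# Hypothetical file names chosen to show the edge cases.
names = ["james.xls", "relax.xls", "report.xlsx"]

for name in names:
    stripped = name.strip(".xls")  # removes the characters . x l s from both ends
    sliced = name[:-4]             # always drops exactly four characters
    stem = company_name(name)      # splits at the last dot
    print(name, "|", stripped, "|", sliced, "|", stem)
```

Slicing with [:-4] is fine as long as every file really ends in '.xls'. splitext is the safer choice if a '.xlsx' file might turn up in the list.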
For more on readlines(), see https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
For more on list comprehensions, see https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions
There is no need to loop the values into the DataFrame. Let's go back to the list of names:
list_of_names = ['james', 'nancy', 'temitope']
df = pd.DataFrame(list_of_names, columns=['company_names'])
Again, thanks for your help... amazing what a good night's sleep and some coffee will do for your state of mind. I realized this morning that I was doing too much. I only needed ONE list, not two, to iterate through. ;) I'll post my final code above. The only thing I still need to resolve is the leading column it is inserting with the index numbers, but that should be an easy fix. At least I can get through the 86 Excel files though!
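On the leading column of index numbers: by default pandas writes the DataFrame's row index as an extra first column, and passing index=False to the writer leaves it out. A minimal sketch with made-up data follows. It uses to_csv so it runs without an Excel engine installed, on the assumption that the same index keyword is passed to to_excel in the real script.

```python
import pandas as pd

# Hypothetical sample standing in for one cleaned vendor sheet.
vendors = pd.DataFrame({"Vendor Name": ["james", "james"], "Amount": [10, 20]})

# Default: the row index is written as an unnamed leading column.
with_index = vendors.to_csv()

# index=False: only the real columns are written.
without_index = vendors.to_csv(index=False)

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# The same keyword applies when writing Excel files, for example:
# vendors.to_excel("cleanjames.xls", index=False)
```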