Cleaning Excel Spreadsheet using Python
I have what seems to be a simple task. I am almost done, but one pesky issue that I should be able to get rid of is proving elusive.
I have a number of Excel .xls files. Each file name ends in .xls, and I created the filenames.txt file to iterate through to get the company names. Each file has garbage data in the first 4 or so rows, so I need to remove those first four rows in all the files. I then need to add a column with the company name in the first column position.
My code runs with no errors, but the output is not exactly what I need. The only problems I am running into are:
1. I am getting an unexpected leading column of index numbers.
2. The strip command doesn't appear to be stripping the '.xls', so what ends up being inserted into the column in Excel is the name with .xls still attached instead of just the name.
3. Because the '.xls' is not being stripped properly, the to_excel command is saving the file with a '.xls.xls' extension.
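On problems 2 and 3: the post does not show the exact strip call that was used, so this is only a guess at the cause. It is still worth knowing that str.strip treats its argument as a set of characters to trim from both ends, not as a suffix. A minimal sketch, where the file names and the helper drop_xls_suffix are made up for illustration:

```python
def drop_xls_suffix(filename):
    """Return filename without a trailing '.xls'; other names are unchanged."""
    suffix = ".xls"
    if filename.endswith(suffix):
        return filename[: -len(suffix)]
    return filename


# str.strip removes any of the characters '.', 'x', 'l', 's' from BOTH ends,
# so it can eat part of the name itself.
stripped_wrong = "sales.xls".strip(".xls")

# A newline left on the end stops strip('.xls') from reaching the extension.
with_newline = "acme.xls\n".strip(".xls")

print(repr(stripped_wrong))
print(repr(with_newline))
print(drop_xls_suffix("sales.xls"))
```

If the names still carried their trailing newline when the extension was stripped, the second case would explain why '.xls' appeared untouched.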
I read through a few similar scenarios, and this is the code I am using:
import pandas as pd
import os
path = os.chdir(r"C:\Users\mheitz\Documents\testing")
filenames = [names.strip('\n') for names in \
open(r"C:\Users\mheitz\Documents\testing\filenames.txt",'r').readlines()]
for name in filenames:
    vendors = pd.read_excel(name, header = 11, skiprows =0-10)
    vendors.insert(0,'Vendor Name',(name[:-4]))
    vendors.to_excel(r"C:\Users\mheitz\Documents\testing\clean\clean" + name)
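import pandas as pd
# Strip the newline, then drop the 4-character '.xls' extension.
exhibit_company = [name.strip('\n')[:-4] for name in \
                   open('filenames.txt','r').readlines()]
for company in exhibit_company:
    # The file on disk still has its extension, so add it back when reading.
    # header=5 alone discards the rows above the header; 0-4 evaluated to -4.
    vendors = pd.read_excel(company + '.xls', header = 5)
    vendors.insert(0,'Vendor Name',(company))
    vendors.to_excel('/Users/michaelheitz/Desktop/Work Stuff/Data/clean'+company+'.xls')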
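The output path above is built by plain string concatenation, which makes it easy to lose a separator or double up an extension. As a rough sketch of one alternative, os.path.join can assemble the path. The helper name clean_output_path and the "clean" directory argument are illustrative assumptions, not part of the original code:

```python
import os


def clean_output_path(out_dir, company):
    """Build '<out_dir>/clean<company>.xls' without hand-written separators."""
    return os.path.join(out_dir, "clean" + company + ".xls")


path = clean_output_path("clean", "james")
print(path)
```

Because the extension is added in exactly one place, a name that has already been stripped cannot end up as '.xls.xls'.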
Explanation:
open('filenames.txt', 'r').readlines()
['james.xls\n', 'nancy.xls\n', 'temitope.xls\n', 'bianca.xls\n']
To remove the trailing \n, we use strip('\n').
for name in names:
    print(name.strip('\n'))
james.xls
nancy.xls
temitope.xls
bianca.xls
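A variation on the same idea, offered as a sketch and not as the answer's method: reading the list inside a with block closes the file automatically, and strip() with no argument also removes stray spaces or a Windows '\r'. The function name read_names and the temporary demo file are assumptions for illustration.

```python
import os
import tempfile


def read_names(list_path):
    """Return the non-empty lines of list_path with surrounding whitespace removed."""
    with open(list_path, "r") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a throwaway file standing in for filenames.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "filenames.txt")
    with open(demo_path, "w") as handle:
        handle.write("james.xls\nnancy.xls\n\ntemitope.xls\nbianca.xls\n")
    names = read_names(demo_path)

print(names)
```

The blank line in the demo file is skipped, which avoids passing an empty file name to read_excel later.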
To remove .xls, we use [:-4], because len('.xls') == 4. A negative stop index counts from the end of the string, so [:-4] keeps everything except the last four characters.
for name in names:
    print(name.strip('\n')[:-4])
james
nancy
temitope
bianca
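The [:-4] slice assumes every name ends in a four-character extension. If that might not hold (for example a stray .xlsx file, which is a hypothetical case not mentioned in the question), os.path.splitext from the standard library splits off whatever extension is present. A small sketch, with company_from_filename as an illustrative helper name:

```python
import os


def company_from_filename(filename):
    """Return the part of filename before its last extension."""
    stem, _extension = os.path.splitext(filename)
    return stem


for filename in ["james.xls", "nancy.xlsx", "temitope"]:
    print(company_from_filename(filename))
```

A name with no extension at all is returned unchanged, whereas [:-4] would cut off its last four letters.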
For more on readlines(), see https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
For more on list comprehensions, see https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions
There is no need to loop the values into the DataFrame. Let's go back to the list of names:
list_of_names = ['james', 'nancy', 'temitope']
df = pd.DataFrame(list_of_names, columns=['company_names'])
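The same "no loop needed" point applies to the Vendor Name column in the question: DataFrame.insert accepts a single value and repeats it on every row. A minimal sketch, where the Item and Qty columns are invented stand-ins for one sheet's data:

```python
import pandas as pd

# Stand-in for one sheet's data; the column names and values are invented.
vendors = pd.DataFrame({"Item": ["bolts", "nuts"], "Qty": [10, 20]})

# A single string is repeated on every row, so no per-row loop is needed.
vendors.insert(0, "Vendor Name", "james")

print(vendors)
```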
Again, thanks for your help... amazing what a good night's sleep and some coffee will do for your state of mind. I realized this morning that I was doing too much. I only needed ONE list, not two, to iterate through. ;) I'll post my final code above. The only thing I still need to resolve is the leading column it is inserting with the index numbers, but that should be an easy fix. At least I can get through the 86 Excel files though!
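On the remaining index column: to_excel writes the DataFrame's row index as the first column unless index=False is passed. The sketch below shows the effect with to_csv, which takes the same index parameter, so that it runs without an Excel writer installed. save_clean is a hypothetical helper and the data is invented.

```python
import pandas as pd


def save_clean(frame, out_path):
    """Write frame to an Excel file without the row-index column."""
    # Needs an Excel writer engine installed for the chosen file extension.
    frame.to_excel(out_path, index=False)


# Invented data standing in for one cleaned sheet.
vendors = pd.DataFrame({"Vendor Name": ["james", "james"], "Qty": [10, 20]})

# to_csv takes the same index flag, which makes the difference easy to see.
with_index = vendors.to_csv()
without_index = vendors.to_csv(index=False)

print(with_index)
print(without_index)
```

In the loop from the question, this would amount to adding index=False to the existing vendors.to_excel(...) call.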