
Cleaning Excel Spreadsheet using Python

I have what seems to be a simple task. I am almost done, but one pesky issue that I should be able to get rid of is proving elusive.

I have a number of Excel .xls files. Each file is named after a company, with a .xls extension. I created the filenames.txt file to iterate through to get the company names. Each file has garbage data in the first four or so rows, so I need to remove those rows in all the files. I then need to add a column with the company name in the first column position.

My code runs with no errors, but the output is not exactly what I need. The only problems I am running into are:

1. I am getting an unexpected leading column of index numbers.
2. The strip command doesn't appear to be stripping the '.xls', so what ends up being inserted into the column in Excel is the file name including .xls instead of just the company name.
3. Because the '.xls' is not being stripped properly, the to_excel command is saving the file with a '.xls.xls' extension.

I read about a few similar scenarios and ended up with this code:

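import os
import pandas as pd

# Work from the folder that holds the .xls files. os.chdir returns None,
# so there is nothing useful to assign from it.
os.chdir(r"C:\Users\mheitz\Documents\testing")

# One file name per line; drop the trailing newline from each.
with open(r"C:\Users\mheitz\Documents\testing\filenames.txt", 'r') as f:
    filenames = [line.strip('\n') for line in f]

for name in filenames:
    # Note: 0-10 is plain arithmetic and evaluates to -10, not a range of rows.
    vendors = pd.read_excel(name, header=11, skiprows=0-10)
    # name[:-4] drops the '.xls' extension, leaving the company name.
    vendors.insert(0, 'Vendor Name', name[:-4])
    vendors.to_excel(r"C:\Users\mheitz\Documents\testing\clean\clean" + name)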
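
import pandas as pd

# Company names: each line of filenames.txt minus the newline and the
# four-character '.xls' extension.
with open('filenames.txt', 'r') as f:
    exhibit_company = [line.strip('\n')[:-4] for line in f]

for company in exhibit_company:
    # The extension was removed above, so add it back to open the file.
    # Note: 0-4 is plain arithmetic and evaluates to -4, not a range of rows.
    vendors = pd.read_excel(company + '.xls', header=5, skiprows=0-4)
    vendors.insert(0, 'Vendor Name', company)
    vendors.to_excel('/Users/michaelheitz/Desktop/Work Stuff/Data/clean'
                     + company + '.xls')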

Explanation:

open('filenames.txt', 'r').readlines() 
['james.xls\n', 'nancy.xls\n', 'temitope.xls\n', 'bianca.xls\n']

To remove the trailing \n from each line, we use strip('\n').

for name in names:
    print(name.strip('\n'))
james.xls
nancy.xls
temitope.xls
bianca.xls
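
As a hedged sketch of the same step, the list of names can be read with a `with` block so the file is closed automatically. The helper name `read_filenames` and the throwaway demo file are assumptions made for illustration; they are not part of the original post.

```python
import os
import tempfile


def read_filenames(list_path):
    """Return the non-blank lines of list_path without their trailing newline."""
    with open(list_path, "r") as handle:
        # Skip blank lines so an empty line at the end of the file
        # does not turn into an empty file name.
        return [line.rstrip("\n") for line in handle if line.strip()]


# Demo on a temporary file that stands in for filenames.txt (hypothetical data).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "filenames.txt")
    with open(demo_path, "w") as handle:
        handle.write("james.xls\nnancy.xls\n\ntemitope.xls\nbianca.xls\n")
    names = read_filenames(demo_path)

print(names)
```

Iterating over the file object directly gives the same lines as readlines() without building an intermediate list first.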

To remove .xls, we use [:-4]: len('.xls') is 4, and a negative end index counts from the back, so the slice keeps everything except the last four characters.

for name in names:
    print(name.strip('\n')[:-4])
james
nancy
temitope
bianca
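
The question mentions that a strip call did not remove '.xls'. The original call is not shown, so the sketch below assumes it looked something like `name.strip('.xls')` on a line that still ended in a newline. It shows why that leaves the name unchanged, and compares slicing with `os.path.splitext` from the standard library, which does not depend on the extension being exactly four characters long.

```python
import os

raw_line = "james.xls\n"  # one line as returned by readlines()

# strip(chars) treats its argument as a set of characters to remove from
# both ends, not as a suffix. The line ends in '\n', which is not in the
# set, so nothing is removed at all.
unchanged = raw_line.strip(".xls")
print(repr(unchanged))

# Without the newline, strip() removes too much: the final 's' of 'james'
# is also in the character set.
over_stripped = "james.xls".strip(".xls")
print(repr(over_stripped))

# Remove the newline first, then drop the extension.
name = raw_line.strip("\n")
by_slice = name[:-4]                       # relies on a 4-character extension
stem, extension = os.path.splitext(name)   # works for any extension length
print(by_slice, stem, extension)
```

Slicing is fine when every file ends in '.xls'; `os.path.splitext` is the safer choice if some files might end in '.xlsx'.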

For more on readlines(), see https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

For more on list comprehensions, see https://docs.python.org/3.6/tutorial/datastructures.html#list-comprehensions

There is no need to loop the values into the DataFrame. Let's go back to the list of names:

list_of_names = ['james', 'nancy', 'temitope']

# columns must be an ordered sequence such as a list, not a set.
df = pd.DataFrame(list_of_names, columns=['company_names'])

Again, thanks for your help... amazing what a good night's sleep and some coffee will do for your state of mind. I realized this morning that I was doing too much. I only needed ONE list to iterate through, not two. ;) I'll post my final code above. The only thing I still need to resolve is the leading column of index numbers it is inserting, but that should be an easy fix. At least I can get through the 86 Excel files now!
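
For the leading column of index numbers, a minimal sketch: `DataFrame.to_excel` writes the row index by default, and passing `index=False` leaves it out. The helper names `add_vendor_column` and `save_without_index` and the sample data are hypothetical, chosen only for illustration. The save helper is defined but not called here, because it needs an Excel writer engine to be installed.

```python
import pandas as pd


def add_vendor_column(frame, vendor_name):
    """Return a copy of frame with 'Vendor Name' as the first column."""
    result = frame.copy()
    # A scalar value is repeated for every row.
    result.insert(0, "Vendor Name", vendor_name)
    return result


def save_without_index(frame, out_path):
    """Write frame to an Excel file without the row-index column."""
    # index=False tells pandas not to write the row index as a leading column.
    frame.to_excel(out_path, index=False)


# Hypothetical sample data standing in for one cleaned spreadsheet.
sample = pd.DataFrame({"Item": ["bolts", "nuts"], "Qty": [10, 20]})
cleaned = add_vendor_column(sample, "james")
print(list(cleaned.columns))
```

In the loop from the question, the equivalent change would be adding `index=False` to the existing `vendors.to_excel(...)` call.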
