Python Openpyxl: converting CSV to XLSX and removing "$" and "," from cells containing numbers

I have to convert a CSV file that is generated by a third party into an XLSX file. The CSV contains a mixture of strings, integers and prices (sometimes with $ signs). This is the sample data stored in the CSV file, a_test_f.csv:

ColA,ColB
1,$11.00
2,22
3,"$1,000.56"
4,44
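
For reference, here is a minimal sketch of how the standard csv module reads this sample. The data is inlined through io.StringIO purely for illustration (an assumption, instead of opening a_test_f.csv). The point it shows: the quoted field keeps its embedded comma as part of the value, so the thousands separator has to be removed after parsing, not before.

```python
import csv
import io

# Sample data from above, inlined so the snippet runs without a file on disk.
SAMPLE = 'ColA,ColB\n1,$11.00\n2,22\n3,"$1,000.56"\n4,44\n'

# csv.reader handles the quoting, so "$1,000.56" stays a single field.
rows = list(csv.reader(io.StringIO(SAMPLE)))

for row in rows:
    print(row)
```

Every field comes back as a string, including 22 and 44, which is why the numeric conversion has to happen explicitly.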

Here is the code that I've written. My question: is this the most efficient way of performing this conversion? Is there an alternative method that would use less processing power or memory? This matters because the real CSV file will contain thousands of records and hundreds of columns, and the conversion will have to be performed tens of thousands of times per day.

import csv
import openpyxl

#
# Convert the data in csv file format that contains a mix of
# strings, integers and dollar amounts into xlsx file format
#

csvfile  = 'a_test_f.csv'
xlsxfile = 'new_xlsx_f.xlsx'

wb = openpyxl.Workbook()
ws = wb.active

# remove $ and , from numbers
class Clean:
    def __init__(self, data=''):
        self.__obj = data
    def __repr__(self):
        return f"{self.__obj}"
    def getData(self):
        return self.__obj

    def dollar(self):
        # str.replace always returns a new string, so no error handling is
        # needed; swallowing an exception here would return None and break
        # the method chain.
        return Clean(data=self.__obj.replace('$', ''))

    def comma(self):
        return Clean(data=self.__obj.replace(',', ''))

    def digit(self):
        # True if the wrapped string can be parsed as a float
        try:
            float(self.__obj)
            return True
        except ValueError:
            return False

# newline='' is what the csv module documentation recommends for file objects
with open(csvfile, newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    row_count = 1
    for row in reader:
        for i in range(len(row)):
            if Clean(data=row[i]).dollar().comma().digit():
                content = float(repr(Clean(data=row[i]).dollar().comma()))
            else:
                content = row[i]
            ws.cell(row=row_count, column=i+1).value = content
        row_count += 1

wb.save(xlsxfile)

print('Finished!')

Following Charlie's suggestion, I rewrote the conversion using functions instead of a class, and then processed a million items in a CSV file with both the class and the function versions. Results:

  • The function and class versions used an equivalent amount of CPU and memory
  • The class version was 9.4% slower than the function version

Functions win. Thank you Charlie!
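
A comparison like this can be set up with the standard timeit module. The following is only a sketch under stated assumptions, not the measurement that produced the figures above: the Clean class is a simplified stand-in for the class-based version, the names clean_with_class, clean_with_function and run are hypothetical, and the values are made up and held in memory, so no CSV reading or XLSX writing is timed.

```python
import timeit


class Clean:
    """Chained wrapper, similar in spirit to the class-based version."""

    def __init__(self, data=''):
        self._data = data

    def dollar(self):
        return Clean(self._data.replace('$', ''))

    def comma(self):
        return Clean(self._data.replace(',', ''))

    def value(self):
        # Float if the cleaned text is numeric, otherwise None.
        try:
            return float(self._data)
        except ValueError:
            return None


def clean_with_class(text):
    number = Clean(text).dollar().comma().value()
    return text if number is None else number


def clean_with_function(text):
    try:
        return float(text.replace('$', '').replace(',', ''))
    except ValueError:
        return text


# Hypothetical test values, repeated to get a measurable workload.
VALUES = ['1', '$11.00', '$1,000.56', 'ColA', '44'] * 20_000


def run(cleaner):
    return [cleaner(v) for v in VALUES]


if __name__ == '__main__':
    for cleaner in (clean_with_class, clean_with_function):
        # Best of five runs, to reduce noise from other processes.
        seconds = min(timeit.repeat(lambda: run(cleaner), number=1, repeat=5))
        print(f'{cleaner.__name__}: {seconds:.3f} s')
```

The timings depend on the machine and the Python version, so the ratio printed by this sketch will not necessarily match the 9.4% reported above.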

The function version is below:

import csv
import openpyxl

#
# Convert the data in csv file format that contains a mix of
# strings, integers and dollar amounts into xlsx file format
#

csvfile  = 'large_test_export.csv'
xlsxfile = 'new_xlsx_f.xlsx'

wb = openpyxl.Workbook()
ws = wb.active

# remove $ and , from numbers

def strip_stuff(a_string):
    # Drop thousands separators and dollar signs, then convert to float
    # if possible; otherwise return the stripped string.
    stripped = a_string.replace(',', '').replace('$', '')
    try:
        return float(stripped)
    except ValueError:
        return stripped


def is_number(b_string):
    # True if the string is numeric once ',' and '$' have been removed.
    return isinstance(strip_stuff(b_string), float)

# newline='' is what the csv module documentation recommends for file objects
with open(csvfile, newline='') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    row_count = 1
    for row in reader:
        for i in range(len(row)):
            if is_number(row[i]):
                content = strip_stuff(row[i])
            else:
                content = row[i]
            ws.cell(row=row_count, column=i+1).value = content
        row_count += 1

wb.save(xlsxfile)

print('Finished!')
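
As for memory use, one option worth trying is openpyxl's write-only mode, which appends whole rows instead of addressing cells one by one. The sketch below is an untested suggestion for this workload, not a measured improvement. The names to_number and convert are hypothetical, and each value is cleaned once, where the version above cleans it twice.

```python
import csv


def to_number(text):
    """Return text as a float if it is numeric once '$' and ',' are removed,
    otherwise return the original string unchanged."""
    try:
        return float(text.replace('$', '').replace(',', ''))
    except ValueError:
        return text


def convert(csv_path, xlsx_path):
    # Imported here so to_number() can be used and tested without openpyxl.
    from openpyxl import Workbook

    # Write-only mode: rows are written as they are appended and cannot be
    # read back or modified afterwards.
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()  # a write-only workbook starts with no sheets
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            ws.append([to_number(cell) for cell in row])
    wb.save(xlsx_path)
```

It would be called as convert('a_test_f.csv', 'new_xlsx_f.xlsx'). As in the code above, integers such as 22 end up as floats, and float() also accepts strings such as "nan", "inf" or "1e5", so a text column containing those values would be converted too.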
