Python pd.read_excel with values of mixed decimal comma or point and integers

I haven't found a solution that covers my "combined" case entirely; after reading through the existing answers I am still facing the obstacles below. This is not about a few files that would be faster to clean manually, but about a steady flow of Excel-format files, for which a Python script seems to be the perfect tool.

The Excel-format files I receive have numbers in some columns (like "Unit selling price" or "Sales Amount") stored and displayed by MS Excel in the "General" format. This gives an odd result even in Excel itself: values with a decimal sign are shown as strings/text (left-aligned), while "integer" values, i.e. those without any decimal part or sign, are shown as numbers in the same column (right-aligned). But the look is not what matters. And YES, there is a mixture: rows should normally come with a decimal comma ",", but some rows in the same file may have a decimal point, which is an ERROR for my system, and that is why I am trying to clean it up with a Python script. In the end I could handle EITHER decimal sign (comma or point), as long as it is unified across all the files in the specific column(s).

The decimal comma and/or point is one of the things to be managed, and I have tried some working solutions provided on Stack Overflow (THANKS :), like this one: Comma as decimal separator in read_excel for Pandas

But some of the rows (i.e. cells in the columns mentioned above) contain a value that is actually an integer (no wonder, some prices might be USD 100 without any cents, right?). I am losing those values if they are shown in Excel as, e.g., 100 instead of 100,00 or 100.00.

Issue 1. Python cannot "pd.read_excel" the values and convert them DIRECTLY into float() properly unless I tell it that there might be a decimal point OR a decimal comma (.astype('float') and float() accept only a decimal point by default).

Issue 2. While solving Issue 1, I cannot make the script smart enough to PROPERLY convert into float() those values that are actually integers without any decimal sign or decimal part.

Issue 3. If I read the Excel file directly with pd.read_excel() and get the "integers" read in properly (which avoids Issue 2), then I have no way to tell pd.read_excel() that it should read the comma "," as a decimal sign. That is because pd.read_excel("file.xlsx", decimal=',') throws an error saying that 'decimal' is unknown to pd.read_excel(). I have checked multiple times for misspellings etc.

The "Conversions" function approach works for the comma/point issue, EXCEPT that all cells with "strings" equivalent to INT, meaning pure integer figures without any decimal part or sign, are simply returned as nulls, i.e. they disappear.

I have found issues of this kind on forums dating back some years, but nowhere a firm solution to all of them AT ONCE. Today is JAN 02, 2023, and my pandas version is 1.3.4. I would greatly appreciate "combined" advice on the above. The only way I see now is a more elaborate regex-on-string approach, but I feel I am missing a more proper solution.
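A possible explanation for the 'decimal' error: according to the pandas documentation, the `decimal` keyword of `pd.read_excel` was added in version 1.4.0, so version 1.3.4 would reject it. The sketch below only illustrates that version check; `supports_decimal_kwarg` is a hypothetical helper, not part of pandas.

```python
def supports_decimal_kwarg(pandas_version):
    """Return True if this pandas version should accept read_excel(decimal=...).

    Assumption: the keyword was introduced in pandas 1.4.0.
    """
    major, minor = (int(part) for part in pandas_version.split(".")[:2])
    return (major, minor) >= (1, 4)


print(supports_decimal_kwarg("1.3.4"))  # False
print(supports_decimal_kwarg("1.4.0"))  # True
```

Even on a newer pandas, a single `decimal` setting would presumably not reconcile rows that mix both signs in one column, so some cleaning would still be needed.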

The solutions from the thread linked above handle the decimal sign, but the "integer-like" strings/objects (as Python reports their type) are not properly converted to floats; they are actually lost to null.
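One way to keep the integer cells, sketched under the assumption that decimal values arrive as text while whole numbers arrive as real numbers: a converter that only rewrites strings and then passes every cell through `float()`. `to_float` is a hypothetical helper, and the file and column names in the commented usage are placeholders.

```python
def to_float(cell):
    """Parse one Excel cell that may be text or already numeric."""
    if isinstance(cell, str):
        # Drop stray spaces and unify the decimal sign to a point.
        cell = cell.replace(" ", "").replace(",", ".")
    try:
        return float(cell)
    except (TypeError, ValueError):
        # Empty or unparseable cells become NaN instead of raising.
        return float("nan")


# Hypothetical usage with the converters argument of pd.read_excel:
# df = pd.read_excel("file.xlsx", converters={"Sales Amount": to_float})
```

This treats every comma as a decimal sign, so it assumes the columns contain no thousands separators.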

I have come up with the following solution, but hope something simpler might be proposed:

    import pandas as pd

    # df = pd.read_excel("file.xlsx", decimal=',')  # <<< my pandas does NOT recognize decimal=',' as a valid option/argument

    df = pd.read_excel("file.xlsx")
    for a_column in columns_to_have_fomat_changed:  # <<< skip columns of no importance; my Excel files come with 150+ columns
        # df[a_column] = df[a_column].astype('float')  # <<< errors here, since a comma instead of a point may occur
        # df[a_column] = df[a_column].str.replace(",", ".").astype(float)  # <<< here all "integers" will be lost

        for i in df.index:  # <<< cell by cell, to tell decimal strings from the "integer" cells
            cell = str(df.at[i, a_column]).replace(' ', '')
            if "," in cell or "." in cell:
                # <<< Stack Overflow solution working for mixed decimal signs etc.
                df.at[i, a_column] = pd.to_numeric(cell.replace(',', '.'), errors='coerce')
            else:
                # <<< turn "integer" cells into "decimal" strings, then floats, so they are not lost
                df.at[i, a_column] = float(cell + '.00')

I am not sure I understand what you mean by losing values when they are shown in Excel as 100 instead of 100,00 or 100.00. Maybe you mean that only one decimal is added to the end.
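One plausible reading, assuming `read_excel` returns the decimal cells as text and the whole-number cells as integers: the `.str` accessor only transforms string elements, so the non-string cells come back as NaN. The sample Series below is made up to illustrate that effect.

```python
import pandas as pd

# Hypothetical column as it might look after read_excel:
# decimal values are text, the whole number is a real integer.
prices = pd.Series(["100,00", "92.20", 156], dtype=object)

# .str.replace leaves NaN wherever the element is not a string.
replaced = prices.str.replace(",", ".", regex=False)
as_float = replaced.astype(float)

print(as_float.isna().tolist())
```

If that is what happens, the integer 156 is the value that ends up missing after the conversion.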

Anyway, I tried reproducing your code in a much more efficient way. Looping through the cells of a pandas DataFrame is painfully slow, and it is widely advised against. You can use a function (a lambda function in this answer) and use .apply() to apply it:

    import pandas as pd

    # Create some sample data based on the description
    df = pd.DataFrame(data={
        "unit_selling_price": ['100,00 ', '92.20 ', '90,00 ', '156'],
        "sales_amount": ['89.45 ', '91.23 ', '45,458 ', '5784'],
    })
    columns_to_have_fomat_changed = ["unit_selling_price", "sales_amount"]

    for column in columns_to_have_fomat_changed:
        # Replace commas with points
        df[column] = df[column].replace(',', '.', regex=True)

        # Strip white space from the left and right side of the strings
        df[column] = df[column].str.strip()

        # Convert the strings to floats
        df[column] = df[column].apply(lambda x: float(x) if '.' in x else float(str(x) + '.00'))

Output:

       unit_selling_price  sales_amount
    0               100.0        89.450
    1                92.2        91.230
    2                90.0        45.458
    3               156.0      5784.000
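If the whole-number cells arrive as integers rather than strings (an assumption about what `read_excel` returns for this kind of file), the loop above would need the values cast to text first. Below is a minimal vectorized sketch without `.apply()`; `normalize_decimal_columns` is a hypothetical name, and the sample data assumes no thousands separators.

```python
import pandas as pd


def normalize_decimal_columns(df, columns):
    """Return a copy of df with the given columns parsed as floats."""
    out = df.copy()
    for column in columns:
        as_text = (
            out[column]
            .astype(str)                         # integers become text as well
            .str.replace(" ", "", regex=False)   # drop stray spaces
            .str.replace(",", ".", regex=False)  # unify the decimal sign
        )
        # Anything that still cannot be parsed becomes NaN.
        out[column] = pd.to_numeric(as_text, errors="coerce")
    return out


# Same sample values as above, but with the whole numbers as real integers.
df = pd.DataFrame({
    "unit_selling_price": ["100,00 ", "92.20 ", "90,00 ", 156],
    "sales_amount": ["89.45 ", "91.23 ", "45,458 ", 5784],
})
clean = normalize_decimal_columns(df, ["unit_selling_price", "sales_amount"])
print(clean)
```

Casting to text before replacing is the design choice that matters here: it keeps numeric cells from being turned into NaN by the string methods.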
