
Python pd.read_excel with values of mixed decimal comma or point and integers

I haven't found a solution that covers my "combined" case entirely. Even after reading through the existing answers, I am still facing the obstacles described below. This is not about a few files that would be faster to clean manually, but a steady flow of Excel-format files, for which a Python script seems to be the perfect tool.

The Excel-format files I am getting have numbers in some columns (like "Unit selling price" or "Sales Amount") stored and displayed by MS Excel in the "General" format. This gives an odd result even in Excel itself: values with a decimal sign are shown as strings/text (left-aligned), while "integer" values, i.e. without any decimal part or sign, are shown as numbers in the same column (right-aligned). But the look is not what matters. And yes, there is a mixture: rows should normally come with a decimal comma ",", but some rows in the same file might have a decimal point, which is an error for my system, and that is why I am trying to clean it up with a Python script. In the end I could handle whichever decimal sign (comma or point) is used, as long as it is unified across all the files in the specific column(s).

The decimal comma and/or point is one of the things to be managed, and I have tried some of the working solutions provided on Stack Overflow (thanks!), like this one: Comma as decimal separator in read_excel for Pandas

But some of the rows (i.e. cells in the columns mentioned above) contain a value that is actually an integer (no wonder, some prices might be USD 100 without any cents, right?). I am losing those values if they are shown in Excel as, e.g., 100 instead of 100,00 or 100.00.

Issue 1. pd.read_excel cannot read the values and convert them directly to float unless I tell it that there might be a decimal point or a decimal comma (.astype('float') and float() accept only a decimal point by default).

Issue 2. While solving Issue 1, I cannot make the script smart enough to properly convert to float those values that are actually integers, without any decimal sign or decimal part.

Issue 3. If I read the Excel file directly with pd.read_excel() and get the "integers" read in properly (which avoids Issue 2), then I have no way to tell pd.read_excel() that it should read the comma "," as a decimal sign. That is because pd.read_excel("file.xlsx", decimal=',') throws an error saying that 'decimal' is an unknown argument to pd.read_excel(). I have checked multiple times for misspellings etc.

The "Conversions" function approach works for the comma/point issue, except that all the cells with "strings" equivalent to an int, meaning pure integer figures without any decimal part or sign, are simply returned as nulls, i.e. they disappear.

I found these kinds of issues on forums dating back some years, but nowhere a firm solution to all of them at once. Today is Jan 02, 2023, and my pandas version is 1.3.4. I would greatly appreciate "combined" advice on the above. The only way I see now is a more elaborate regex-on-strings approach, but I have a feeling that I am missing a more proper solution.

With the solutions from "Comma as decimal separator in read_excel for Pandas", the "integer-like" strings/objects (as Python reports their type) are not properly converted to floats; they are actually lost to null.

I have come up with the following solution, but I hope something simpler can be proposed:

    import pandas as pd

    # df = pd.read_excel("file.xlsx", decimal=',')  # <<< my pandas does NOT recognize decimal=',' as a valid option/argument

    df = pd.read_excel("file.xlsx")
    for a_column in columns_to_have_fomat_changed:  # <<< avoids wasting time on columns of no importance. My Excel files come with 150+ columns.

        # df[a_column] = df[a_column].astype('float')  # <<< errors here, since a comma may appear instead of a point
        # df[a_column] = df[a_column].str.replace(",", ".").astype(float)  # <<< here all "integers" will be lost

        for i in df.index:  # <<< this is for distinguishing the "integer" values
            value = str(df.loc[i, a_column])  # cells may arrive as str or as int, so compare them as text
            if "," in value or "." in value:
                # <<< Stack Overflow solution working for mixed decimal signs etc.
                df.loc[i, a_column] = pd.to_numeric(value.replace(',', '.').replace(' ', ''), errors='coerce')
            else:
                # <<< changing "integer" strings into "decimal" strings; now it works without "integers" being lost
                df.loc[i, a_column] = pd.to_numeric(value + '.00', errors='coerce')

I am not sure I understand what you mean by "losing values" when cells are shown in Excel as 100 instead of 100,00 or 100.00. Maybe you mean adding only one decimal to the end.
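One possible reading, sketched here on made-up data rather than on the real files: if Excel stores some cells as text and others as numbers, pd.read_excel may return an object column holding both str and int values. The `.str` accessor returns NaN for every element that is not a string, which would explain the "integers" turning into nulls. The series `prices` below is an assumption that stands in for such a column.

```python
import pandas as pd

# Hypothetical column as pd.read_excel might return it: text cells arrive
# as str, numeric cells arrive as int (values are made up for illustration).
prices = pd.Series(["100,00", "92.20", 156], dtype=object)

# .str methods yield NaN for elements that are not strings,
# so the integer 156 is gone before the float conversion happens.
lost = prices.str.replace(",", ".", regex=False).astype(float)

# Casting every cell to str first keeps the integer value.
kept = prices.astype(str).str.replace(",", ".", regex=False).astype(float)

print(lost.tolist())
print(kept.tolist())
```

If this is what happens in the real files, casting the column with `.astype(str)` before any string operation should be enough to keep the integer cells.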

Anyway, I tried to reproduce your code in a much more efficient way. Looping through the cells of a pandas DataFrame is painfully slow, and everyone advises against it. You can use a function (a lambda function in this answer) and .apply() to apply it:

    import pandas as pd

    # Create some sample data based on the description
    df = pd.DataFrame(data={
        "unit_selling_price": ['100,00 ', '92.20 ', '90,00 ', '156'],
        "sales_amount": ['89.45 ', '91.23 ', '45,458 ', '5784'],
    })
    columns_to_have_fomat_changed = ["unit_selling_price", "sales_amount"]

    for column in columns_to_have_fomat_changed:
        # Replace commas with .
        df[column] = df[column].replace(',', '.', regex=True)

        # Strip white spaces from left and right side of the strings
        df[column] = df[column].str.strip()

        # Convert numbers to numeric
        df[column] = df[column].apply(lambda x: float(x) if '.' in x else float(str(x) + '.00'))

Output:

       unit_selling_price  sales_amount
    0               100.0        89.450
    1                92.2        91.230
    2                90.0        45.458
    3               156.0      5784.000
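The sample data above contains only strings, so it may not cover cells that pandas reads as real integers. As a minimal sketch under that assumption, the same cleanup can be written without a per-cell lambda by casting to str first and letting pd.to_numeric do the conversion. The helper name `to_float_column` is hypothetical, and the integer cells in the sample frame are made up to mimic mixed Excel input.

```python
import pandas as pd

def to_float_column(column: pd.Series) -> pd.Series:
    """Hypothetical helper: parse numbers written with a decimal comma or point."""
    cleaned = (
        column.astype(str)                    # real ints/floats become text too
        .str.strip()                          # drop surrounding whitespace
        .str.replace(",", ".", regex=False)   # unify the decimal sign
    )
    # Anything that still cannot be parsed becomes NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample: the last row holds real integers, not strings.
df = pd.DataFrame({
    "unit_selling_price": ["100,00 ", "92.20 ", "90,00 ", 156],
    "sales_amount": ["89.45 ", "91.23 ", "45,458 ", 5784],
})
columns_to_have_fomat_changed = ["unit_selling_price", "sales_amount"]

# Apply the helper column by column; only the listed columns are touched.
df[columns_to_have_fomat_changed] = df[columns_to_have_fomat_changed].apply(to_float_column)
print(df)
```

This sketch treats every comma as a decimal sign, so a value such as "45,458" becomes 45.458, and thousands separators (for example "1.234,56") are not handled; such values would need an extra rule.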

