How to remove special characters from Pandas DF?

I have a Python bot that queries a database, saves the output to a pandas DataFrame and writes the data to an Excel template.

Yesterday the data was not saved to the Excel template because one of the fields in a record contained the following characters:

",  *,  /,  (, ), :,\n

Pandas failed to save the data to the file.

This is the code that creates the dataframe:

upload_df = sql_df.copy()

This code prepares the template file with a date/time stamp:

src = file_name.format(val="")
date_str = " " + str(datetime.today().strftime("%d%m%Y%H%M%S"))
dst_file = file_name.format(val=date_str)
copyfile(src, os.path.join(save_path, dst_file))
work_book = load_workbook(os.path.join(save_path, dst_file))

And this code saves the dataframe to the Excel file:

writer = pd.ExcelWriter(os.path.join(save_path, dst_file), engine='openpyxl')
writer.book = work_book
writer.sheets = {ws.title: ws for ws in work_book.worksheets}
upload_df.to_excel(writer, sheet_name=sheet_name, startrow = 1, index=False, header = False)
writer.save()

My question is: how can I remove the special characters from a specific column [description] in my dataframe before I write it to the Excel template?

I have tried:

upload_df['Name'] = upload_df['Name'].replace(to_replace= r'\W',value=' ',regex=True)

But this replaces every non-word character, not just the specific special characters listed above. I guess we could iterate through a list of characters and run the replace for each one, but is there a more Pythonic solution?

Adding the data that corrupted the Excel file and prevented pandas from writing the information:

This is an example of the text that caused the problem. I changed a few normal characters for privacy, but it is otherwise the same data that corrupted the file:

"""*** CRQ.: N/A *** DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL PUWERTDTO EL DIA 08-09-2021 A LAS 11:00 HRS. PERA REALIZAR TRWEROS DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.

  • RWERE DE WERDDFF EN SITIO: ING. JWER ERR3WRR ERRSDFF DFFF:RERFD DDDDF: 33 315678905. 1) ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y CXCVDDÓN DE DFFFD EN DFDFFDD 2) EN SDFF DE REQUERIRSE: SDFFDF Y SDFDFF DE EEERRW HJGHJ (ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
  1. RETIRAR JJGHJGHGH
  • CONSIDERACIONES FGFFDGFG: SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.""S: SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO PARA DFGFGFGFG Y SOLDAR."""

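You could pass the characters as a list to replace. Note that regex=True is needed so that the characters are matched inside the text rather than against the whole cell value, and the regex meta-characters have to be escaped:

upload_df['Name'] = upload_df['Name'].replace(
    to_replace=['"', r'\*', '/', r'\(', r'\)', ':', r'\n'],
    value=' ',
    regex=True
)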

As some of the special characters to remove are regex meta-characters, we have to escape them before we can replace them with empty strings using regex.

You can automate the escaping with re.escape, as follows:

import re

# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']

special_char_escaped = list(map(re.escape, special_char))

The resultant list of escaped special characters is as follows:

print(special_char_escaped)

['"', '\\*', '/', '\\(', '\\)', ':', '\\\n'] 
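To illustrate why the escaping step matters, here is a minimal sketch using only the standard re module. The variable names raw_star_is_valid and all_match_literally are made up for this illustration and are not part of the original answer.

```python
import re

special_char = ['"', '*', '/', '(', ')', ':', '\n']

# An unescaped "*" is a quantifier with nothing to repeat,
# so it is not a valid pattern on its own.
try:
    re.compile('*')
    raw_star_is_valid = True
except re.error:
    raw_star_is_valid = False

# After re.escape, every entry compiles and matches its literal character.
escaped = [re.escape(ch) for ch in special_char]
all_match_literally = all(
    re.fullmatch(pattern, ch) is not None
    for pattern, ch in zip(escaped, special_char)
)

print(raw_star_is_valid)    # False
print(all_match_literally)  # True
```

The same failure would show up inside pandas if the unescaped list were passed with regex=True, since pandas hands the patterns to the re module.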

Then, we can remove the special characters with .replace() as follows:

upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)

Demo

Data Setup

upload_df = pd.DataFrame({'Name': ['"abc*/(xyz):\npqr']})

                Name
0  "abc*/(xyz):\npqr

Run the code:

import re

# put the special characters in a list
special_char = ['"', '*', '/', '(', ')', ':', '\n']

special_char_escaped = list(map(re.escape, special_char))

upload_df['Name'] = upload_df['Name'].replace(special_char_escaped, '', regex=True)

Output:

print(upload_df)


        Name
0  abcxyzpqr
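As a possible variation on the demo above, the escaped characters can be joined into a single character class, so that only one pattern is applied to the column. This is a sketch under the assumption that every entry in the list is a single character; demo_df and pattern are illustrative names, and the extra rows are invented test data.

```python
import re
import pandas as pd

special_char = ['"', '*', '/', '(', ')', ':', '\n']

# Build one character class from the list; re.escape takes care of any
# metacharacters, so the list can be edited without touching the pattern.
pattern = '[' + ''.join(re.escape(ch) for ch in special_char) + ']'

# Invented sample rows: one dirty value, one clean value, one missing value.
demo_df = pd.DataFrame({'Name': ['"abc*/(xyz):\npqr', 'no specials here', None]})

# .str.replace works element-wise on strings and leaves missing values alone.
demo_df['Name'] = demo_df['Name'].str.replace(pattern, '', regex=True)

print(demo_df)
```

A character class only works for single characters. If multi-character sequences had to be removed too, joining the escaped entries with "|" would be the closer equivalent.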

Edit

With your edited text sample, here is the result after removing the special characters:

print(upload_df)

                                                                                                                                                                                                                                                                          Name
0                                                                                                           CRQ. NA  DF2100109 SADSFO CADSFVO EN SERWO JL1047 EL PUWERTDTO EL DIA 08-09-2021 A LAS 1100 HRS. PERA REALIZAR TRWEROS DE AWERWRTURA DE SITIO PARA MWERWO PWERRVO.
1  RWERE DE WERDDFF EN SITIO  ING. JWER ERR3WRR ERRSDFF DFFF RERFD DDDDF  33 315678905. 1 ADFDSF SDFDF Y DFDFF DE DFDF Y DFFF XXCVV Y CXCVDDÓN DE DFFFD EN DFDFFDD 2 EN SDFF DE REQUERIRSE SDFFDF Y SDFDFF DE EEERRW HJGHJ ACCESO, GHJHJ, GHJHJ, RRRTTEE Y ACCESO A LA YUYUGGG
2                                                                                                                                                                                                                                                         3. RETIRAR JJGHJGHGH
3                                                                                                                       CONSIDERACIONES FGFFDGFG SE FGGG LLAVE DE FF LLEVAR FFDDF PARA ERTBGFY Y SOLDAR.S SE GDFGDFG LLAVE DE ERTFFFGG, FGGGFF EQUIPO PARA DFGFGFGFG Y SOLDAR.

The special characters listed in your question have all been removed. Please check whether it is ok now.

Use str.replace with a character class:

>>> df
                Name
0  (**Hello\nWorld:)


>>> df['Name'] = df['Name'].str.replace(r'''["*/():\n]''', '', regex=True)
>>> df
         Name
0  HelloWorld

Maybe you want to replace line breaks with spaces instead:

>>> df = df.replace({'Name': {r'["*/():]': '',
                              r'\n': ' '}}, regex=True)
>>> df
          Name
0  Hello World
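If regex is not needed at all, a translation table is another option for single-character replacements. The sketch below uses plain Python strings; clean_text is a hypothetical helper name, and the sample value mirrors the one shown above.

```python
# Characters that should be deleted map to None;
# the line break maps to a space instead.
mapping = {ch: None for ch in '"*/():'}
mapping['\n'] = ' '
table = str.maketrans(mapping)


def clean_text(text):
    """Return text with the listed characters removed (hypothetical helper)."""
    return text.translate(table)


sample = '(**Hello\nWorld:)'
cleaned = clean_text(sample)
print(cleaned)  # Hello World
```

On a dataframe column, the same table could be applied with df['Name'].str.translate(table), which avoids any escaping concerns because no pattern is involved.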
