
Replacing special characters in a string

I have raw input in text format whose strings contain special characters. I want to replace these characters so that, after running the code, no special characters remain.


I tried the code below, but I am not sure whether it is correct.

def avoid(x):
    #print(x)
    #value=[]
    for ele in range(0, len(x)):
        p=invalidcharch(ele)
        #value.append(p)
        #value=''.join(p)
        print(p)
    return p

def invalidcharch(e):
    items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}
    for i, j in items.items():
        e = e.replace(i, j)
    return e

for col in df.columns:
    df[col]=df[col].apply(lambda x:avoid(x))

However, with the code above I am unable to store the whole string in the variable p. I need p to hold the whole string so that it can replace the cell value. The data contains mixed data types, such as strings and integers.

col A
Junto à Estação de Carcavelos;
Bragança
Situado en el núcleo de Es Caló de Sant Agustí frente al Hostal Rafalet.
Cartão MOBI.E R. Conselheiro Emídio Navarro (frente ao ISEL)

After change
Junto a Estacao de Carcavelos;
Braganca
Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
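
In the question's code, range(0, len(x)) yields integer indices, so invalidcharch receives an int rather than the text, and p can never hold the whole string. The following is only a minimal sketch of one possible correction, assuming that non-string cells (such as integers) should be returned unchanged. ITEMS is a shortened, hypothetical stand-in for the full dictionary, and replace_special is a name made up for this example.

```python
# Shortened, hypothetical mapping for illustration; the question's full
# dictionary can be used in exactly the same way.
ITEMS = {"à": "a", "ç": "c", "ã": "a", "í": "i", "ó": "o", "ú": "u"}


def replace_special(value):
    """Return value with mapped characters replaced; non-strings pass through."""
    if not isinstance(value, str):
        return value
    # Work on the whole string rather than on its integer indices.
    for old, new in ITEMS.items():
        value = value.replace(old, new)
    return value


print(replace_special("Junto à Estação de Carcavelos;"))
print(replace_special(12345))
```

With pandas, a helper like this could be applied per column, for example df[col] = df[col].apply(replace_special).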

Adding to Achille Huet's comment, which links to a related question, you can use unidecode on a pandas DataFrame column like this:

import unidecode
df['col A'] = df['col A'].apply(lambda x: unidecode.unidecode(x))

Or, for every column:

import unidecode
for col in df.columns:
    df[col]=df[col].apply(lambda x: unidecode.unidecode(x))

However, since you have already created the special-characters dictionary, you can use it directly:

Just create a dictionary special_chars and replace the values across the entire DataFrame by passing regex=True. This should also be faster than calling apply on every cell. I don't know whether there is a faster solution using unicode. It also depends on what you are doing with the result. If you are writing to a .csv file, for example, I believe to_csv() has a relevant parameter as well, but I am not sure whether it applies here:

special_chars = {"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"",
"ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N",
"Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O",
"ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}

# replace() returns a new DataFrame, so assign the result back
df = df.replace(special_chars, regex=True)
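
With regex=True the dictionary keys are interpreted as regular-expression patterns. None of the keys in special_chars is a regex metacharacter, but a key such as "." would need escaping. The sketch below is not how pandas implements replace; it only illustrates, in plain Python, a single-pass replacement with escaped keys. The shortened mapping (including the "." key) and the name replace_all are assumptions made for this example.

```python
import re

# Shortened, hypothetical mapping; "." is included only to show why
# escaping matters when keys are used as regex patterns.
special_chars = {"ä": "a", "ç": "c", "é": "e", "º": "", ".": ""}

# re.escape keeps keys such as "." literal instead of "match any character".
pattern = re.compile("|".join(re.escape(key) for key in special_chars))


def replace_all(text):
    """Replace every mapped character in a single pass over text."""
    return pattern.sub(lambda match: special_chars[match.group(0)], text)


print(replace_all("façade nº 3.5"))
```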

We can use Series.str.translate, which is equivalent to str.maketrans + str.translate in Python.

converter = str.maketrans(items)  # `items` is the special-characters dict
df['col A'].str.translate(converter)

0                                              Junto a Estacao de Carcavelos;
1                                                                    Braganca
2    Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3                Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
Name: col A, dtype: object
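
For reference, here is a plain-Python sketch of the same str.maketrans + str.translate technique. Since the question mentions mixed data types, and Series.str methods typically produce NaN for non-string elements, the sketch returns non-strings unchanged; that choice is an assumption about the desired behavior. The shortened mapping and the name translate_value are hypothetical.

```python
# Shortened, hypothetical mapping; when a dict is passed to str.maketrans,
# every key must be a single character (values may be any length, or empty).
items = {"à": "a", "ç": "c", "ã": "a", "º": "", "·": "-"}

converter = str.maketrans(items)


def translate_value(value):
    """Translate strings with the table; return any other type unchanged."""
    return value.translate(converter) if isinstance(value, str) else value


print(translate_value("Bragança"))
print(translate_value("1º andar · B"))
print(translate_value(2024))
```

A helper like this could be used with df['col A'].apply(translate_value) when the column contains non-string values.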

I don't fully understand what you are trying to achieve, but you can try something like this:

import pandas as pd

items={"ä":"a","ç":"c","è":"e","º":"","Ã":"A","Í":"I","í":"i","Ü":"U","â":"a","ò":"o","¿":"","ó":"o","á":"a","à":"a","õ":"o","¡":"","Ó":"O","ù":"u","Ú":"U","´":"","Ñ":"N","Ò":"O","ï":"i","Ï":"I","Ç":"C","À":"A","É":"E","ë":"e","Á":"A","ã":"a","Ö":"O","ú":"u","ñ":"n","é":"e","ê":"e","·":"-","ª":"a","°":"","ü":"u","ô":"o"}

df = pd.DataFrame([
    'abcä',
    'Ãbcd12345'
], columns=['colA'])

# Map each non-ASCII character through `items`; unmapped characters become '_'.
# Passing '_' as the default to get() keeps mappings to '' intact (an `or '_'` would not).
df['colA'] = df['colA'].str.replace(r'[^\x00-\x7F]', lambda x: items.get(x.group(0), '_'), regex=True)

df
    colA
0   abca
1   Abcd12345

For the pattern r'[^\x00-\x7F]', see the question "Regular expression that finds and replaces non-ascii characters with Python".
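
The same idea can be sketched without pandas: re.sub accepts a callable that receives each match and returns its replacement. The shortened mapping and the name to_ascii are hypothetical. Note that the default is passed to dict.get; writing items.get(ch) or '_' instead would also turn characters deliberately mapped to an empty string into '_', because '' is falsy.

```python
import re

# Shortened, hypothetical mapping for illustration.
items = {"ä": "a", "Ã": "A", "º": ""}

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def to_ascii(text, fallback="_"):
    """Replace each non-ASCII character via items, or with fallback if unmapped."""
    return NON_ASCII.sub(lambda m: items.get(m.group(0), fallback), text)


print(to_ascii("abcä"))
print(to_ascii("nº 5 €"))
```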

You can do that simply with the following code.

for i in df.columns:

    df[i] = df[i].replace(items, regex=True)

Using the standard unicodedata module:

import unicodedata

df["col A"] = df["col A"].apply(
    lambda x: unicodedata.normalize("NFD", x)
    .encode("ascii", "ignore")
    .decode("utf-8")
)
print(df)

Prints:

                                                                      col A
0                                            Junto a Estacao de Carcavelos;
1                                                                  Braganca
2  Situado en el nucleo de Es Calo de Sant Agusti frente al Hostal Rafalet.
3              Cartao MOBI.E R. Conselheiro Emidio Navarro (frente ao ISEL)
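
A small variation on the approach above, sketched under the assumption that non-string cells should be left untouched (the lambda above would fail on an integer). The name strip_accents is made up for this example. Keep in mind that NFD only splits accented letters into a base letter plus combining marks; characters with no ASCII base, such as "º" or "·", are simply dropped by the "ignore" error handler rather than mapped to a replacement.

```python
import unicodedata


def strip_accents(value):
    """Drop accents from strings; return any other type unchanged."""
    if not isinstance(value, str):
        return value
    # NFD splits e.g. "ç" into "c" followed by a combining cedilla.
    decomposed = unicodedata.normalize("NFD", value)
    # "ignore" discards the combining marks and any other non-ASCII character.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Junto à Estação de Carcavelos;"))
print(strip_accents(42))
```

It could be applied with df["col A"] = df["col A"].apply(strip_accents).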
