
Why is Python regex not matching special characters?

I am wondering why the following isn't working. The expression works on Regex101.com. However, when I add ’ into the spreadsheet, it returns an empty array rather than at least matching that string.

This is the regular expression:

[^A-z0-9\s,.][^-_+=]

This is what I'm looking for:

’
Â

Try it here (it worked for me): https://regex101.com/

Here is the code:

import pandas as pd
import chardet
import csv 
import re

def get_file_encoding(file):
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']
    return encoding

#Type in sanitized_ACAS_FULL_1
data = 'sanitized_ACAS_FULL_1.csv'
my_encoding = get_file_encoding(data)
#print(my_encoding)
my_encoding = 'UTF-8-SIG'
df = pd.read_csv(data, encoding=my_encoding, header=None, low_memory=False)

csv_rows = df.apply(lambda x: x.tolist(), axis=1)

sanitized_rows = []
for row in csv_rows:
    for item in row:
        index = row.index(item) 
        row[index] = str(item).strip()
        if 'nan' in str(item).strip():
            row[index] = "NA"

for row in csv_rows:
    for item in row:
        sanitized_rows.append(item)

match = []
for row in sanitized_rows:
    for entry in row:   
        if re.match(r'[^A-z0-9\s,.][^-_+=]', entry):
            match.append(entry)

print(match)

(\GÂ)|(\Gâ)|(\G€)|(\G™)

This gets the characters that you want separately. If you want them grouped, you can use (\G’) as an example. Remember that \G asserts the position at the end of the previous match, or the start of the string for the first match.

I hope that helps.

