How to use a regular expression to remove repeated characters in Python

Below is a list of the unique values in a column of df, with their counts:

aa                2     
aaa               10    
aaaa              14    
aaaaa             2     
aaaaaa            1     
aableasing        25    
yy                1     
yyy               6        
überimexcars      1     
üüberimexcars     1     
üüüüüüüüü         2     

The aim is to 'clean' the data by grouping on Name.

Thus:

  • aa = aaa = aaaa
  • ü = üüü = üüüüüü
  • ...

The desired output would be as shown below

a                 29      
aableasing        25    
y                 7           
überimexcars      2  
üüüüüüüüü         2   

I was thinking of something like

df['name'] = df['name'].astype(str).str.replace('aaa', 'a')

However, I would have to do this for each letter, and that is not an efficient way of doing it.

Would a regular expression be a better option in this case?

Thanks to anyone who can help!

This should do the trick:

df['name'] = df['name'].replace(r"^(.)\1*$", r"\1", regex=True)

Some explanation:

The pattern has to match the whole cell, from the beginning (^) to the end ($). It matches any single character (.), captured as the first group by the parentheses, followed by zero or more repetitions of that same character (\1*, a backreference to the first group). A cell that matches is replaced with the first group, \1; a cell that does not match is left unchanged.
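To see which values the pattern touches, here is a minimal sketch using only the standard re module. The helper name collapse_if_uniform and the sample list are made up for illustration; the pattern is the one from the answer above.

```python
import re

# Whole-string match: one character, then zero or more copies of that same character.
PATTERN = re.compile(r"^(.)\1*$")

def collapse_if_uniform(name: str) -> str:
    """Return the single character if `name` is one character repeated,
    otherwise return `name` unchanged (hypothetical helper)."""
    return PATTERN.sub(r"\1", name)

# Sample values taken from the question.
samples = ["aa", "aaaaaa", "yyy", "aableasing", "üüberimexcars"]
collapsed = {s: collapse_if_uniform(s) for s in samples}

for original, result in collapsed.items():
    print(f"{original!r} -> {result!r}")
```

Because of the ^ and $ anchors, values such as 'aableasing' and 'üüberimexcars' are left as they are: only cells made up entirely of one repeated character are collapsed.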

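If t contains a string, e.g. 'aaaaa', try the following:

''.join(sorted(set(t), key=t.index))

You'll get 'a'. Note that this keeps only the first occurrence of every character, not just of adjacent repeats, so it would also turn 'aableasing' into 'ablesing'.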

Now apply this to the name column of your DataFrame and group by name.
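The clean-then-group step could look roughly like the sketch below. It uses the regular expression from the earlier answer rather than the set-based one-liner, and it assumes the counts live in a column called count (the question only names the name column, so that column name is a guess).

```python
import pandas as pd

# Sample rows copied from the question; the column name "count" is an assumption.
df = pd.DataFrame({
    "name": ["aa", "aaa", "aaaa", "aaaaa", "aaaaaa", "aableasing", "yy", "yyy"],
    "count": [2, 10, 14, 2, 1, 25, 1, 6],
})

# Collapse cells that consist of a single repeated character.
df["name"] = df["name"].astype(str).replace(r"^(.)\1*$", r"\1", regex=True)

# Sum the counts of the rows that now share the same name.
grouped = df.groupby("name", as_index=False)["count"].sum()
print(grouped)
```

For these rows the grouped totals are a = 29, aableasing = 25 and y = 7, which agrees with the corresponding rows of the desired output in the question.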

