How to use a regular expression to remove repeated characters in Python

Question

Below are the unique values from a column of df, with their counts:

aa                2     
aaa               10    
aaaa              14    
aaaaa             2     
aaaaaa            1     
aableasing        25    
yy                1     
yyy               6        
überimexcars      1     
üüberimexcars     1     
üüüüüüüüü         2

The aim is to 'clean' the data by grouping on Name.

Thus:

aa = aaa = aaaa
ü = üüü = üüüüüü
...

The desired output would be as shown below

a                 29      
aableasing        25    
y                 7           
überimexcars      2  
üüüüüüüüü         2

I was thinking of something like

df['name'] = df['name'].astype(str).str.replace('aaa', 'a')

However, I would have to do that for each letter, and it is not really an efficient way of doing it.

Would a regular expression be a better option in this case?

Thanks to anyone who helps!

Answer 1

This should do the trick:

df['name'] = df['name'].replace(r"^(.)\1*$", r"\1", regex=True)

Some explanation:

It tries to match the whole cell (from the beginning, ^, to the end, $) against any single character (.) followed by zero or more repetitions of that same character (\1*, a backreference to the first group, which is delimited by parentheses). If the cell matches, it is replaced with that first group, \1. Cells that contain any other character are left unchanged.
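To see the pattern at work together with the grouping step, here is a minimal sketch. The column names "name" and "count" and the sample frame are assumptions based on the values listed in the question, not the asker's actual data.

```python
import pandas as pd

# Hypothetical frame mirroring part of the question's data;
# the column names "name" and "count" are assumptions.
df = pd.DataFrame({
    "name": ["aa", "aaa", "aaaa", "aaaaa", "aaaaaa", "aableasing", "yy", "yyy"],
    "count": [2, 10, 14, 2, 1, 25, 1, 6],
})

# Collapse cells that consist of one repeated character down to that character.
# Cells such as "aableasing" do not match the anchored pattern and stay as they are.
df["name"] = df["name"].replace(r"^(.)\1*$", r"\1", regex=True)

# Sum the counts of names that are now identical.
result = df.groupby("name", as_index=False)["count"].sum()
print(result)
```

With this sample the grouped counts are 29 for a, 25 for aableasing and 7 for y, which matches the first three rows of the desired output.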

Answer 2

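If t contains a string, e.g. 'aaaaa', try the following:

''.join(sorted(set(t), key=t.index))

You'll get 'a'. Note that this removes every repeated character, not only consecutive ones, so 'aableasing' would become 'ablesing'.

Now run this on your dataframe and group.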
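A rough sketch of that last step follows. The helper name collapse_uniform, the column names and the sample frame are all hypothetical. The guard that only collapses strings made of one distinct character is an addition to this answer, so that mixed names pass through unchanged.

```python
import pandas as pd

def collapse_uniform(t: str) -> str:
    """Return the single character if t is one character repeated, else t unchanged."""
    # Distinct characters of t, in order of first appearance.
    unique_chars = "".join(sorted(set(t), key=t.index))
    # Guard (an assumption, not in the answer): leave mixed names alone.
    return unique_chars if len(unique_chars) == 1 else t

# Hypothetical sample data; "name" and "count" are assumed column names.
df = pd.DataFrame({
    "name": ["aa", "aaa", "aableasing", "yy", "yyy"],
    "count": [2, 10, 25, 1, 6],
})

# Apply the helper to every name, then sum the counts per cleaned name.
df["name"] = df["name"].astype(str).map(collapse_uniform)
result = df.groupby("name", as_index=False)["count"].sum()
print(result)
```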

2 answers

solution1
1 ACCEPTED 2020-05-05 22:45:58

solution2
0 2020-05-05 22:29:05
