Below is a list of the unique values in a column of df:
aa 2
aaa 10
aaaa 14
aaaaa 2
aaaaaa 1
aableasing 25
yy 1
yyy 6
überimexcars 1
üüberimexcars 1
üüüüüüüüü 2
The aim is to 'clean' the data by grouping on Name.
The desired output would be:
a 29
aableasing 25
y 7
überimexcars 2
üüüüüüüüü 2
I was thinking of something like
df['name'] = df['name'].astype(str).str.replace('aaa', 'a')
However, I would have to do that for each letter, and it is not really an efficient way of doing it.
Would a regular expression be a better option in this case?
Thanks to anyone who is helping!
This should do the trick:
df['name']=df['name'].replace(r"^(.)\1*$", r"\1", regex=True)
Some explanation:
It tries to match the whole cell, from the beginning (^) to the end ($), as any single character (.) followed by that same character repeated zero or more times (\1*, a back-reference to the first group, which is denoted by parentheses). Only if the whole cell matches is it replaced with that first group, \1.
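To see which values the pattern actually touches, here is a minimal sketch using only the standard re module. The helper name collapse and the sample list are made up for illustration; the pattern itself is the one from the answer.

```python
import re

# Same pattern as in the answer: one character, then zero or more repeats
# of that same character, anchored to the whole string.
PATTERN = re.compile(r"^(.)\1*$")

def collapse(value: str) -> str:
    """Return the single repeated character, or the value unchanged."""
    return PATTERN.sub(r"\1", value)

# Hypothetical sample values taken from the question's list.
samples = ["aa", "aaaaaa", "aableasing", "yyy", "üüberimexcars"]
collapsed = {s: collapse(s) for s in samples}
print(collapsed)
```

Values made of one repeated character collapse to that character, while 'aableasing' and 'üüberimexcars' are left unchanged. By the same rule 'üüüüüüüüü' would collapse to 'ü', which differs from the desired output listed in the question, so that case may need separate handling.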
If t contains a string, e.g. 'aaaaa', try the following:
''.join(sorted(set(t), key=t.index))
You'll get 'a'.
Now run this on your dataframe and group.
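Running this on the dataframe and grouping could look roughly like the sketch below. The sample data, the 'count' column and the helper names dedupe_chars and clean are assumptions for illustration. The sketch joins on an empty string so that it also behaves for strings with more than one distinct character, and it adds a guard (also an assumption, not part of the original suggestion) so that only values made of a single repeated character are collapsed.

```python
import pandas as pd

# Hypothetical sample data; the column names "name" and "count" are assumptions.
df = pd.DataFrame({
    "name": ["aa", "aaa", "aaaa", "aableasing", "yy", "yyy"],
    "count": [2, 10, 14, 25, 1, 6],
})

def dedupe_chars(t: str) -> str:
    """Keep the first occurrence of each character, in original order."""
    return "".join(sorted(set(t), key=t.index))

def clean(t: str) -> str:
    # Assumption: only collapse values made of one repeated character,
    # otherwise 'aableasing' would be turned into 'ablesing'.
    return dedupe_chars(t) if len(set(t)) == 1 else t

df["name"] = df["name"].astype(str).apply(clean)

# Sum the counts of all rows that now share the same cleaned name.
grouped = df.groupby("name", as_index=False)["count"].sum()
print(grouped)
```

Without the guard in clean, every name would lose its repeated letters, not just the values that consist of a single letter repeated.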