Repeat pattern using python regex

Question

I'm cleaning a dataset with pandas. I have a column called "Country" in which some rows contain digits or extra information in parentheses that I have to remove, for example: Australia1, Perú (country), 3Costa Rica, etc. To do this, I take the column and run a regex extraction over it.

pattern = r"([a-zA-Z]+[\s]*[a-zA-Z]+)(?:[(]*.*[)]*)"
df['Country'] = df['Country'].str.extract(pattern)

But I have a problem with this regex: I cannot match names such as "United States of America", because it only captures "United ". How can I repeat the pattern of the first group an unlimited number of times so that it matches the whole name?
Thanks!

Answer 1

In this situation, I would clean the data step by step.

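import io
import pandas as pd

df_str = '''
Country
Australia1
Perú (country)
3Costa Rica
United States of America
'''
# one value per line and no commas, so the default separator works
# (recent pandas versions reject sep='\n')
df = pd.read_csv(io.StringIO(df_str.strip()))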

# handle the data and write the result back to the column
df['Country'] = (df['Country']
 .str.replace(r'\d+', '', regex=True)  # remove digits
 .str.split(r'\(').str[0]              # keep the text before `(`
 .str.strip()                          # strip surrounding spaces
)
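
To address the literal question as well (repeating the first group): one option is to wrap "whitespace plus another word" in a non-capturing group and repeat that with `*`. The following is only a sketch using the standard `re` module; the pattern and the helper name `extract_name` are my own assumptions, not something from the question. The same pattern string should also be usable with `str.extract`, but I have only checked it with `re` here.

```python
import re

# Hypothetical pattern: one word, then any number of "whitespace + word" repeats.
# [^\W\d_] matches any Unicode letter (so "Perú" works), unlike [a-zA-Z].
NAME = re.compile(r"([^\W\d_]+(?:\s+[^\W\d_]+)*)")

def extract_name(raw):
    """Return the first run of space-separated words, or None if there is none."""
    match = NAME.search(raw)
    return match.group(1) if match else None

samples = ["Australia1", "Perú (country)", "3Costa Rica", "United States of America"]
cleaned = [extract_name(s) for s in samples]
print(cleaned)
```

Because each repeat requires whitespace followed by at least one letter, the match stops before a trailing space, a digit, or an opening parenthesis.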

Answer 2

Thanks for your answer, it worked. I also found another solution: matching the things I don't want in the column and replacing them with an empty string.

pattern = r"([\s]*[(][\w ]*[)][\s]*)|([\d]*)"  # matches the information I don't want
df['Country'] = df['Country'].replace(pattern, "", regex=True)  # replace those matches with an empty string
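
One thing to watch with this pattern: `[\w ]*` only accepts letters, digits, underscores and spaces inside the parentheses, so a note containing punctuation would not be removed. A possible variant, sketched with the standard `re` module, is shown here. The pattern, the helper name `clean`, and the sample "Korea (Rep. of)" are assumptions made up for illustration.

```python
import re

# Variant of the pattern above: [^)]* accepts anything up to the closing
# parenthesis, and \d+ avoids matching empty strings.
UNWANTED = re.compile(r"\s*\([^)]*\)\s*|\d+")

def clean(raw):
    """Drop parenthesised notes and digits, then trim leftover spaces."""
    return UNWANTED.sub("", raw).strip()

# "Korea (Rep. of)" is a made-up sample with punctuation inside the parentheses.
samples = ["Australia1", "Perú (country)", "3Costa Rica", "Korea (Rep. of)"]
cleaned = [clean(s) for s in samples]
print(cleaned)
```

The final `strip()` is there in case removing digits leaves a stray space at either end of the name.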

2 answers

solution1
1 ACCEPTED 2021-02-04 02:09:02

solution2
1 2021-02-04 02:37:01
