Removing whitespaces using regex python

Question

I am trying to amend each line of a file to remove any part that begins with the character '(' or that consists of a number or word in square brackets, e.g. '[2]':

f = open('/Users/name/Desktop/university_towns.txt',"r")
listed = []
import re 
for i in f.readlines():
    if i.find(r'\(.*?\)\n'): 
       here = re.sub(r'\(.*?\)\[.*?\]\n', "", i)
       listed.append(here)
    elif i.find(r' \(.*?\)\n'):
       here = re.sub(r' \(.*?\)\[.*?\]\n', "", i)
       listed.append(here)
    elif i.find(r' \[.*?\]\n'): 
       here = re.sub(r' \[.*?\]\n', "", i)
       listed.append(here) 
    else:
       here = re.sub(r'\[.*?\]\n', "", i)
       listed.append(here)

A sample of my input data:

Platteville (University of Wisconsin–Platteville)[2]
River Falls (University of Wisconsin–River Falls)[2]
Stevens Point (University of Wisconsin–Stevens Point)[2]
Waukesha (Carroll University)
Whitewater (University of Wisconsin–Whitewater)[2]
Wyoming[edit]
Laramie (University of Wyoming)[5]

A sample of my output data:

Platteville 
River Falls 
Stevens Point 
Waukesha (Carroll University)
Whitewater 
Wyoming[edit]
Laramie

However, I do not want the parts such as '(Carroll University)' or '[edit]'.

How can I amend my regular expression?

I would be so grateful if anyone could give me any advice!

Answer 1

You can do:

import re 

with open(ur_file) as f_in:
    for line in f_in:
        if m:=re.search(r'^([^([]+)', line):  # Python 3.8+
            print(m.group(1))

If your Python version is older than 3.8, which introduced the walrus operator (:=), assign the match first:

with open(ur_file) as f_in:
    for line in f_in:
        m=re.search(r'^([^([]+)', line)
        if m:
            print(m.group(1))

Prints:

Platteville 
River Falls 
Stevens Point 
Waukesha 
Whitewater 
Wyoming
Laramie

The regex explained:

^([^([]+)

^                            start of the line
 ^      ^                    capture group
  ^   ^                      character class
   ^                         class of characters OTHER THAN ( and [
       ^                     + means one or more

Here is the regex on Regex101
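
The loop above prints each name with its trailing space still attached. Here is a minimal sketch of one way to collect the names in a list, as the question does, and trim that whitespace at the same time. The names sample_lines, NAME_PATTERN and extract_names are illustrative assumptions, not part of the original answer, and the sample lines are copied from the question rather than read from the file.

```python
import re

# Sample lines copied from the question; a real script would iterate
# over the open file object instead of this list.
sample_lines = [
    "Platteville (University of Wisconsin–Platteville)[2]\n",
    "Waukesha (Carroll University)\n",
    "Wyoming[edit]\n",
    "Laramie (University of Wyoming)[5]\n",
]

# Same pattern as in the answer: everything from the start of the line
# up to (but not including) the first ( or [.
NAME_PATTERN = re.compile(r'^([^([]+)')


def extract_names(lines):
    """Return the leading name of each line with surrounding whitespace removed."""
    names = []
    for line in lines:
        m = NAME_PATTERN.search(line)
        if m:
            # strip() drops the trailing space or newline left before the bracket.
            names.append(m.group(1).strip())
    return names


cleaned = extract_names(sample_lines)
print(cleaned)
```

With a real file, passing the open file object to extract_names should work the same way, since iterating over a file yields its lines.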

Answer 2

Use this RegEx instead:

\(.*\)|\[.*\]

Like so:

re.sub(r'\(.*\)|\[.*\]', '', i)

This replaces anything in parentheses ( \(.*\) ) or ( | ) anything in square brackets ( \[.*\] ) with an empty string.
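
The if/elif checks in the question are not needed with this approach. str.find looks for literal text rather than a regex, and it returns -1 (which is truthy) when nothing is found, so the first branch always runs. re.sub simply returns the line unchanged when the pattern does not match. Below is a minimal sketch of the whole loop under those observations. It swaps in the non-greedy .*? from the question, so that a line with several bracketed parts keeps the text between them, and it adds strip() to remove the leftover space. The names BRACKETED and strip_bracketed are hypothetical, and the sample lines stand in for the file.

```python
import re

# Non-greedy alternatives: each (...) or [...] part is matched on its own.
# This is a variation on the answer's greedy pattern, not the answer itself.
BRACKETED = re.compile(r'\(.*?\)|\[.*?\]')


def strip_bracketed(line):
    """Remove every (...) and [...] part, then trim leftover whitespace."""
    return BRACKETED.sub('', line).strip()


# Sample lines copied from the question; a real script would read the file.
sample_lines = [
    "Platteville (University of Wisconsin–Platteville)[2]\n",
    "Waukesha (Carroll University)\n",
    "Wyoming[edit]\n",
    "Laramie (University of Wyoming)[5]\n",
]

listed = [strip_bracketed(line) for line in sample_lines]
print(listed)
```

On the sample data the greedy and non-greedy patterns remove the same text. They differ only when one line contains more than one pair of the same kind of bracket.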

Answer 3

If you are after a vectorised solution, which can be faster and more readable than an explicit loop, try this with pandas:

Data

import pandas as pd

df = pd.DataFrame({'text': [
    'Platteville (University of Wisconsin–Platteville)[2]',
    'River Falls (University of Wisconsin–River Falls)[2]',
    'Stevens Point (University of Wisconsin–Stevens Point)[2]',
    'Waukesha (Carroll University)',
    'Whitewater (University of Wisconsin–Whitewater)[2]',
    'Wyoming[edit]',
    'Wyoming[edit]',
]})

Regex extract

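# Raw string avoids the invalid-escape warning for \s; the stray + inside the class is dropped.
df['name'] = df.text.str.extract(r'([A-Za-z\s]+(?=\(|\[))')

Regex Breakdown

([A-Za-z\s]+ ... ) captures a run of one or more uppercase letters, lowercase letters or whitespace characters

(?=\(|\[) is a lookahead: the captured run must be immediately followed by the special character ( or the special character [

A row that contains neither ( nor [ gets NaN, because the lookahead cannot match.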

3 answers

solution1
3 ACCEPTED 2020-05-17 19:36:21

solution2
0 2020-05-17 19:35:19

solution3
0 2020-05-17 20:03:54
