I am trying to amend each line of a file to remove any part that begins with the character '(' or that consists of a number or word in square brackets, e.g. '[2]':
f = open('/Users/name/Desktop/university_towns.txt',"r")
listed = []
import re
for i in f.readlines():
    if i.find(r'\(.*?\)\n'):
        here = re.sub(r'\(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \(.*?\)\n'):
        here = re.sub(r' \(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \[.*?\]\n'):
        here = re.sub(r' \[.*?\]\n', "", i)
        listed.append(here)
    else:
        here = re.sub(r'\[.*?\]\n', "", i)
        listed.append(here)
A sample of my input data:
Platteville (University of Wisconsin–Platteville)[2]
River Falls (University of Wisconsin–River Falls)[2]
Stevens Point (University of Wisconsin–Stevens Point)[2]
Waukesha (Carroll University)
Whitewater (University of Wisconsin–Whitewater)[2]
Wyoming[edit]
Laramie (University of Wyoming)[5]
A sample of my output data:
Platteville
River Falls
Stevens Point
Waukesha (Carroll University)
Whitewater
Wyoming[edit]
Laramie
However, I do not want the parts such as '(Carroll University)' or '[edit]'.
How can I amend my code?
I would be so grateful if anyone could give me any advice!
You can do:
import re
with open(ur_file) as f_in:  # ur_file is the path to your file
    for line in f_in:
        if m := re.search(r'^([^([]+)', line):  # walrus operator, Python 3.8+
            print(m.group(1))
If your Python is older than 3.8 and lacks the walrus operator:
with open(ur_file) as f_in:
    for line in f_in:
        m = re.search(r'^([^([]+)', line)
        if m:
            print(m.group(1))
Prints:
Platteville
River Falls
Stevens Point
Waukesha
Whitewater
Wyoming
Laramie
The regex explained:
^([^([]+)
^             start of the line
 ^      ^     capture group
  ^   ^       character class
   ^          class of characters OTHER THAN ( and [
       ^      + means one or more
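To try the pattern without a file on disk, here is a minimal sketch that runs it over a few of the sample lines from the question. The helper name strip_annotations and the in-memory list are assumptions made for illustration, and the rstrip() call is an addition that drops the space left in front of the bracket.

```python
import re

# Leading run of characters that are neither "(" nor "[".
NAME_RE = re.compile(r'^([^([]+)')

def strip_annotations(line):
    """Return the text before the first "(" or "[", without trailing whitespace."""
    m = NAME_RE.search(line)
    # A line that is empty or starts with a bracket has no leading name.
    return m.group(1).rstrip() if m else ''

sample = [
    'Platteville (University of Wisconsin–Platteville)[2]\n',
    'Waukesha (Carroll University)\n',
    'Wyoming[edit]\n',
]
names = [strip_annotations(line) for line in sample]
print(names)
```

Because the capture stops at the first bracket of either kind, it does not matter whether a line carries both annotations, only one, or none.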
Use this RegEx instead:
\(.*\)|\[.*\]
Like so:
re.sub(r'\(.*\)|\[.*\]', '', i)
This will substitute anything in parentheses ( \(.*\) ) or ( | ) anything in square brackets ( \[.*\] ) with an empty string.
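Applied to the loop from the question, the substitution might look like the sketch below. It drops the i.find(...) checks altogether: str.find searches for a literal substring, not a regex, and returns -1 (which is truthy) when nothing is found, so the first branch was taken for every line and lines with only one kind of bracket were left unchanged. The non-greedy .*?, the optional leading \s*, the final strip() and the in-memory sample list are my additions, not part of the answer above.

```python
import re

# Parenthesised or square-bracketed text, plus any whitespace in front of it.
ANNOTATION_RE = re.compile(r'\s*(?:\(.*?\)|\[.*?\])')

# Stand-in for f.readlines(); each line keeps its trailing newline.
lines = [
    'Whitewater (University of Wisconsin–Whitewater)[2]\n',
    'Waukesha (Carroll University)\n',
    'Wyoming[edit]\n',
    'Laramie (University of Wyoming)[5]\n',
]

# Remove every annotation, then trim the newline and any leftover spaces.
listed = [ANNOTATION_RE.sub('', line).strip() for line in lines]
print(listed)
```

A single alternation handles all four cases that the original if/elif chain tried to tell apart.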
If you are after a vectorised solution, which is often more concise than an explicit loop, then try:
Data
import pandas as pd
df = pd.DataFrame({'text': [
    'Platteville (University of Wisconsin–Platteville)[2]',
    'River Falls (University of Wisconsin–River Falls)[2]',
    'Stevens Point (University of Wisconsin–Stevens Point)[2]',
    'Waukesha (Carroll University)',
    'Whitewater (University of Wisconsin–Whitewater)[2]',
    'Wyoming[edit]',
    'Wyoming[edit]',
]})
Regex extract
df['name'] = df.text.str.extract(r'([A-Za-z\s+]+(?=\(|\[))')
Regex Breakdown
Capture any [A-Za-z\s+]+
One or more characters that are uppercase letters, lowercase letters, whitespace, or a literal +
(?=\(|\[)
A lookahead requiring that the captured text is immediately followed by the special character ( or the special character [
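One thing to watch with the extract approach: since whitespace is part of the character class, the captured name keeps the space before the bracket, and a row with no bracket at all yields NaN because the lookahead cannot match. A possible alternative, sketched here under the assumption that everything from the first bracket onward can be discarded, uses str.replace followed by str.strip. The 'Laramie' row is a hypothetical unannotated entry added to show that case.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'Platteville (University of Wisconsin–Platteville)[2]',
    'Waukesha (Carroll University)',
    'Wyoming[edit]',
    'Laramie',  # hypothetical row with no annotation at all
]})

# Delete everything from the first "(" or "[" to the end, then trim whitespace.
df['name'] = df['text'].str.replace(r'[(\[].*$', '', regex=True).str.strip()
print(df['name'].tolist())
```

Rows without any bracket pass through unchanged instead of becoming NaN.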