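Removing whitespaces using regex python
I am trying to modify each line of a file to remove any part that starts with the character "(" or that contains digits/characters in square brackets, e.g. "[2]":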
f = open('/Users/name/Desktop/university_towns.txt',"r")
listed = []
import re
for i in f.readlines():
    if i.find(r'\(.*?\)\n'):
        here = re.sub(r'\(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \(.*?\)\n'):
        here = re.sub(r' \(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \[.*?\]\n'):
        here = re.sub(r' \[.*?\]\n', "", i)
        listed.append(here)
    else:
        here = re.sub(r'\[.*?\]\n', "", i)
        listed.append(here)
A sample of my input data:
Platteville (University of Wisconsin–Platteville)[2]
River Falls (University of Wisconsin–River Falls)[2]
Stevens Point (University of Wisconsin–Stevens Point)[2]
Waukesha (Carroll University)
Whitewater (University of Wisconsin–Whitewater)[2]
Wyoming[edit]
Laramie (University of Wyoming)[5]
A sample of my output data:
Platteville
River Falls
Stevens Point
Waukesha (Carroll University)
Whitewater
Wyoming[edit]
Laramie
However, I don't want parts such as "(Carroll University)" or "[edit]" to remain.
How can I modify my regex?
I would appreciate any advice!
You can do:
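import re

# ur_file is the path to your text file
with open(ur_file) as f_in:
    for line in f_in:
        if m := re.search(r'^([^([]+)', line):  # Python 3.8+
            # rstrip() drops the space (or newline) left before the bracket
            print(m.group(1).rstrip())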
If your Python is older than 3.8 and has no walrus operator:
import re

with open(ur_file) as f_in:
    for line in f_in:
        m = re.search(r'^([^([]+)', line)
        if m:
            # rstrip() drops the space (or newline) left before the bracket
            print(m.group(1).rstrip())
Prints:
Platteville
River Falls
Stevens Point
Waukesha
Whitewater
Wyoming
Laramie
Regex explanation:
^([^([]+)
^           start of the line
 ^      ^   capture group
  ^   ^     character class
   ^        class of characters OTHER THAN ( and [
       ^    + means one or more
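The explanation above can be tried on the sample lines from the question. The following is only a minimal sketch: the helper name town_name and the in-memory samples list are made up for illustration, and the final strip() is an addition that removes the space or newline the capture group picks up just before the bracket.

```python
import re

# Same pattern as in the explanation: everything before the first "(" or "[".
PREFIX = re.compile(r'^([^([]+)')

def town_name(line):
    """Return the text before the first bracket, without surrounding whitespace."""
    m = PREFIX.search(line)
    # No match means the line starts with a bracket, so there is no name part.
    return m.group(1).strip() if m else ""

# Sample lines taken from the question's input (illustrative subset).
samples = [
    "Platteville (University of Wisconsin–Platteville)[2]\n",
    "Waukesha (Carroll University)\n",
    "Wyoming[edit]\n",
    "Laramie (University of Wyoming)[5]\n",
]
names = [town_name(s) for s in samples]
print(names)
```

Capturing the prefix rather than deleting the bracketed parts means the pattern does not care how many bracketed groups follow the name.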
Use this regex instead:
\(.*\)|\[.*\]
Like this:
re.sub(r'\(.*\)|\[.*\]', '', i)
This replaces anything in parentheses ( \(.*\) ) or ( | ) anything in square brackets ( \[.*\] ) with an empty string.
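Two details are worth checking before relying on this substitution. The greedy .* runs from the first opening bracket to the last closing one on the line, and the space in front of the bracket is left behind. The sketch below is one possible variant, not the answerer's code: the name strip_brackets is hypothetical, and it swaps in non-greedy .*? plus a leading \s* and a final strip() to handle both points.

```python
import re

# Non-greedy variant: each bracketed group is removed on its own,
# together with any whitespace directly in front of it.
BRACKETED = re.compile(r'\s*(?:\(.*?\)|\[.*?\])')

def strip_brackets(line):
    """Remove (...) and [...] groups, then trim leftover whitespace."""
    return BRACKETED.sub('', line).strip()

print(strip_brackets("Platteville (University of Wisconsin–Platteville)[2]\n"))
print(strip_brackets("Wyoming[edit]\n"))

# Greedy vs non-greedy on a made-up line with text between two groups:
greedy = re.sub(r'\(.*\)|\[.*\]', '', "A (b) c (d)")
lazy = strip_brackets("A (b) c (d)")
print(repr(greedy), repr(lazy))
```

On the question's data both versions remove the same text, since every line has at most one parenthesised group; the difference only shows up when ordinary text sits between two groups.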
If you are after a vectorized solution that is faster and more readable than a loop, try this:
Data
import pandas as pd

df = pd.DataFrame({'text': [
    'Platteville (University of Wisconsin–Platteville)[2]',
    'River Falls (University of Wisconsin–River Falls)[2]',
    'Stevens Point (University of Wisconsin–Stevens Point)[2]',
    'Waukesha (Carroll University)',
    'Whitewater (University of Wisconsin–Whitewater)[2]',
    'Wyoming[edit]',
    'Wyoming[edit]',
]})
Regex extraction
df['name'] = df.text.str.extract(r'([A-Za-z\s+]+(?=\(|\[))')  # raw string so \s, \( and \[ reach the regex engine unchanged
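Regex breakdown
[A-Za-z\s+]+
Capture one or more uppercase letters, lowercase letters, or whitespace characters (the + inside the brackets is a literal plus sign, not a quantifier)
(?=\(|\[)
A lookahead: the captured text must be immediately followed by the special character ( or the special character [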
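Because the lookahead requires a bracket, str.extract returns NaN for any row that contains no "(" or "[", and the captured name keeps the space in front of the bracket. As a hedged alternative (a different technique from the extract above, and the bracket-free "Laramie" row is made up for illustration), the sketch below deletes everything from the first bracket onward with str.replace, which leaves bracket-free rows unchanged.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "Platteville (University of Wisconsin–Platteville)[2]",
    "Waukesha (Carroll University)",
    "Wyoming[edit]",
    "Laramie",  # hypothetical row with no brackets at all
]})

# Remove optional whitespace, the first "(" or "[", and the rest of the string.
df["name"] = df["text"].str.replace(r"\s*[(\[].*$", "", regex=True)
print(df["name"].tolist())
```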