
Extract sub-string between multiple certain words using regex in python

regex substring

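I want to extract the phone, fax, and mobile numbers from a string, and get an empty string for any of them that is missing. For any given text string I want the three values phone, fax, and mobile. Example strings are shown below.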

ex1 = "miramar road margie shoop san diego ca 12793 manager  phone 6035550160 fax 6035550161 mobile 6035550178  marsgies travel  wwwmarpiestravelcom"
ex2 = "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford  electrical  engineering  vijay chandrasekhar  electrical engineering 17 comstock circle apt 101  stanford ca 94305  phone 9162210411"

My regex currently looks something like this:

phone_regex  = re.match(".*phone(.*)fax(.*)mobile(.*)",ex1)
phone = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][0]
mobile = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][2]
fax = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][1]

Result for ex1:
phone = 6035550160
fax = 6035550161
mobile = 6035550178

ex2 has no mobile entry, so I get:

Traceback (most recent call last):
    phone = [re.sub("[^0-9]","",x) for x in phone_regex.groups()][0]
AttributeError: 'NoneType' object has no attribute 'groups'


I need either a better regex solution (I am new to regex) or a way to catch the AttributeError and assign an empty string.

You can use a simple re.findall like this:

dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))

The resulting regex looks like

\b(phone|fax|mobile)\s*(\d+)

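See the regex demo.

Pattern details

  • \b - a word boundary
  • (phone|fax|mobile) - Group 1: one of the listed words
  • \s* - zero or more whitespace characters
  • (\d+) - Group 2: one or more digits

See the Python demo: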

import re
exs = ["miramar road margie shoop san diego ca 12793 manager  phone 6035550160 fax 6035550161 mobile 6035550178  marsgies travel  wwwmarpiestravelcom",
   "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu", 
   "stanford  electrical  engineering  vijay chandrasekhar  electrical engineering 17 comstock circle apt 101  stanford ca 94305  phone 9162210411"]
keys = ['phone', 'fax', 'mobile']
for ex in exs:
    res = dict(re.findall(r'\b({})\s*(\d+)'.format("|".join(keys)), ex))
    print(res)

Output:

{'fax': '6035550161', 'phone': '6035550160', 'mobile': '6035550178'}
{'fax': '650', 'phone': '650'}
{'phone': '9162210411'}
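
Note that in the second line of the output, `\d+` stops at the first space, so only `650` is captured for the numbers in the second example, and absent keys are simply missing from the dict. The following is a minimal sketch of one possible variation, not part of the original answer. It assumes that digit groups are separated by single spaces and that missing fields should come back as empty strings. The names `KEYS`, `PATTERN`, and `extract_numbers` are hypothetical.

```python
import re

KEYS = ['phone', 'fax', 'mobile']
# Keyword, optional whitespace, then digit groups separated by single spaces
PATTERN = re.compile(r'\b({})\s*(\d+(?: \d+)*)'.format("|".join(KEYS)))

def extract_numbers(text):
    """Return a dict with every key in KEYS; missing entries map to ''."""
    result = dict.fromkeys(KEYS, "")
    for key, digits in PATTERN.findall(text):
        # Drop the spaces between digit groups, e.g. "650 723 1882"
        result[key] = digits.replace(" ", "")
    return result

ex2 = "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford  electrical  engineering  vijay chandrasekhar  electrical engineering 17 comstock circle apt 101  stanford ca 94305  phone 9162210411"
print(extract_numbers(ex2))
print(extract_numbers(ex3))
```

One caveat of this sketch: a number followed by a single space and another unrelated number would be merged into one result, so it only suits text shaped like the examples in the question.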

I think I understand what you want: you need exactly the first match after each keyword. In that case, what you need is the question mark ?:

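"?" is also a quantifier. It is short for {0,1}. It means "match zero or one of the group preceding this question mark." It can also be read as: the part preceding the question mark is optional.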
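
The quoted definition covers `?` as an optional marker. When `?` directly follows another quantifier such as `*`, it instead makes that quantifier lazy, which is the behaviour that "first match after the keyword" relies on. The short sketch below contrasts the two uses; the sample strings and variable names are illustrative assumptions, not taken from the answer.

```python
import re

# "?" after a token makes it optional: "u" may appear zero or one time
optional_matches = re.findall(r'colou?r', "color colour")
print(optional_matches)

text = "phone 6035550160 fax 6035550161 mobile 6035550178"
# Greedy: ".*" extends to the LAST " <letter>" in the text
greedy = re.search(r'phone(.*) [a-z]', text).group(1)
# Lazy: ".*?" stops at the FIRST " <letter>" after the keyword
lazy = re.search(r'phone(.*?) [a-z]', text).group(1)
print(repr(greedy))
print(repr(lazy))
```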

If the definition isn't enough, here is some code that should work:

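import re
ex1 = "miramar road margie shoop san diego ca 12793 manager  phone 6035550160 fax 6035550161 mobile 6035550178  marsgies travel  wwwmarpiestravelcom"
res_dict = {}
list_keywords = ['phone', 'mobile', 'fax']  # 'mobile' matches the keyword used in the question
for i_key in list_keywords:
    # Lazy .*? captures as little as possible: up to the next " <letter>" or the end of the string
    temp_res = re.findall(i_key + r'(.*?)(?: [a-zA-Z]|$)', ex1)
    res_dict[i_key] = temp_res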

Use re.search

Demo:

import re

ex1 = "miramar road margie shoop san diego ca 12793 manager  phone 6035550160 fax 6035550161 mobile 6035550178  marsgies travel  wwwmarpiestravelcom"
ex2 = "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
ex3 = "stanford  electrical  engineering  vijay chandrasekhar  electrical engineering 17 comstock circle apt 101  stanford ca 94305  phone 9162210411"

for i in [ex1, ex2, ex3]:
    # The look-behind anchors the match right after the keyword;
    # the lazy .*? stops before the next lowercase letter or at the end of the string
    phone = re.search(r"(?P<phone>(?<=\bphone\b).*?(?=([a-z]|$)))", i)
    if phone:
        print("Phone: ", phone.group("phone"))

    fax = re.search(r"(?P<fax>(?<=\bfax\b).*?(?=([a-z]|$)))", i)
    if fax:
        print("Fax: ", fax.group("fax"))

    mob = re.search(r"(?P<mob>(?<=\bmobile\b).*?(?=([a-z]|$)))", i)
    if mob:
        print("mob: ", mob.group("mob"))
    print("-----")

Output:

Phone:   6035550160 
Fax:   6035550161 
mob:   6035550178  
-----
Phone:   650 7259327  
Fax:   650 723 1882 
-----
Phone:   9162210411
-----

I think the following regexes should work fine:

mobile = re.findall('mobile([0-9]*)', ex1.replace(" ",""))[0]
fax = re.findall('fax([0-9]*)', ex1.replace(" ",""))[0]
phone = re.findall('phone([0-9]*)', ex1.replace(" ",""))[0]
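
The `[0]` index in the lines above raises an IndexError when a keyword is absent, as `mobile` is in ex2. Here is a minimal sketch of the same space-stripping idea that returns an empty string instead, as the question asks. The helper name `find_number` is hypothetical and not part of the original answer.

```python
import re

def find_number(keyword, text):
    """Return the digits right after keyword once spaces are removed, or ''."""
    match = re.search(re.escape(keyword) + r'([0-9]+)', text.replace(" ", ""))
    return match.group(1) if match else ""

ex2 = "david packard electrical engineering  350 serra mall room 170 phone 650 7259327  stanford university fax 650 723 1882 stanford california 943059505 ulateecestanfordedu"
phone = find_number('phone', ex2)
fax = find_number('fax', ex2)
mobile = find_number('mobile', ex2)
print(phone, fax, repr(mobile))
```

Because every space is removed first, this approach assumes a number is always followed by a letter or the end of the string. Two adjacent numbers would be merged into one.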
