
Regular expression for multiple occurrences in Python

I need to parse lines that contain multiple language codes, as below:

008800002     Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>
  • 008800002 is the ID
  • Bruxelles-Nord$Br ussel Nord$ is name one
  • deu is language one
  • $Brussel Noord$ is name two
  • nld is language two

So the idea is that a name and language pair can appear any number of times, and I need to collect them all. The language code inside <> is always 3 characters long, and every name ends with a $ sign.

I tried this, but it does not give the expected output:

x = re.compile('(?P<stop_id>\d{9})\s(?P<authority>[[\x00-\x7F]{3}|\s{3}])\s(?P<stop_name>.*)
    (?P<lang_code>(?:[<]\S{0,4}))',flags=re.UNICODE)

I have no idea how to capture the repeated elements. It takes Bruxelles-Nord$Br ussel Nord$<deu>$Brussel Noord$ as stop_name and <nld> as the language.

Do it in two steps. First separate the ID from the name/language pairs; then use re.finditer on the name/language section to iterate over the pairs and store them in a dict keyed by language code.

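import re

line = "008800002     Bruxelles-Nord$Br�ussel Nord$<deu>$Brussel Noord$<nld>"
# Step 1: split the leading numeric ID from the rest of the line.
head = re.search(r"(\d+)\s+(.*)", line, re.UNICODE)
stop_id = head.group(1)
# Step 2: collect every name/<language> pair, keyed by language code.
names = {}
for pair in re.finditer(r"(.*?)<(.*?)>", head.group(2), re.UNICODE):
    names[pair.group(2)] = pair.group(1)
print(stop_id, names)

\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>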

Try this. Just grab the captures. See the demo:

http://regex101.com/r/hS3dT7/4
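
For anyone who wants to use that single pattern from Python rather than in the demo, here is a minimal sketch. It assumes each match fills either group 1 (the ID) or groups 2 and 3 (a name and its language). The helper name parse_line and the ASCII-only sample line are made up for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer above: group 1 is the ID,
# groups 2 and 3 are one name/language pair.
PATTERN = re.compile(r"\b(\d+)\b\s*|(.*?)(?=<)<(.*?)>")

def parse_line(line):
    """Return (stop_id, {language: name}) for one line (hypothetical helper)."""
    stop_id = None
    names = {}
    for match in PATTERN.finditer(line):
        if match.group(1) is not None:
            # First alternative matched: the numeric ID.
            stop_id = match.group(1)
        else:
            # Second alternative matched: a name followed by <lang>.
            names[match.group(3)] = match.group(2)
    return stop_id, names

# Sample line modelled on the question (ASCII only, for illustration).
sample = "008800002     Bruxelles-Nord$Brussel Nord$<deu>$Brussel Noord$<nld>"
stop_id, names = parse_line(sample)
print(stop_id, names)
```

As with the two-step version, the names keep their $ delimiters. Also note that a name beginning with a standalone number would be picked up by the first alternative as if it were an ID, so the two-step approach may be the safer choice for data like that.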
