
Split multiple lines by regex

I'm trying to split the multiple lines of a segment from a TTL document. Here's the relevant code:

entry_obj = str(Entry(*re.findall(r'([;\s]+[^\s+|\s+$])', ''.join(buf))))
yield process_entry_obj(entry_obj)

The code raises an error: because it cannot split the string properly, the number of matched arguments differs every time, so the code does not run.

Below is my file format:

 File input

 ##  http://www.example.com/abc#AAA
                pms:ecCreatedBy rms:type ;
                rmfs:lag "Ersteller"@newyork ,
                "AAA"@wdc .

There are multiple entries like the one above in the file.

From what I understand, you need \s*;\s*

Explanation:

\s* - match a whitespace character zero or more times

; - match ; literally
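As a minimal sketch of how that pattern could be used with re.split, the snippet below splits one entry on its semicolon. The entry string is an assumption modelled on the sample in the question, not the asker's real data.

```python
import re

# Assumed sample entry, modelled on the file format shown in the question.
entry = (
    'pms:ecCreatedBy rms:type ;\n'
    '                rmfs:lag "Ersteller"@newyork ,\n'
    '                "AAA"@wdc .'
)

# Split on a semicolon together with any whitespace on either side of it.
parts = re.split(r'\s*;\s*', entry)

for part in parts:
    print(repr(part))
```

Because the surrounding whitespace is part of the delimiter, the pieces come back without the trailing space before ; or the newline and indentation after it.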


You may use

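import re

filepath = 'input.ttl'  # hypothetical path (assumption); point this at your own file

# Read the whole file into a single string
with open(filepath, 'r') as fr:
    s = fr.read()
# Regex 1: move the @en label in front of the non-English label
s = re.sub(r'(?m)(rmfs:label\s*)("[^"]*"@(?!en)\w*)(\s*,\s*)("[^"]*"@en) \.$', r'\1\4\3\2 .', s)
# Regex 2: replace the name after # in the ### header with the @en label text
s = re.sub(r'(?m)^(\s*###\s*http.*/v\d+#)\w*((?:\n(?!\n).*)*rmfs:label\s*")([^"]*)("@en)', r'\1\3\2\3\4', s)
# Write the result back to the same file
with open(filepath, 'w') as fw:
    fw.write(s)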

See the Python demo.

Here are the Regex 1 and Regex 2 demos.

Regex 1 details

  • (?m) - multiline mode, $ will match the end of a line
  • (rmfs:label\s*) - Group 1 ( \1 ): rmfs:label and then 0+ whitespaces
  • ("[^"]*"@(?!en)\w*) - Group 2 ( \2 ): " , 0+ non- " chars, "@ , a negative lookahead ensuring there is no en immediately to the right of the current position, and then 0+ word chars
  • (\s*,\s*) - Group 3 ( \3 ): a , enclosed with 0+ whitespaces
  • ("[^"]*"@en) - Group 4 ( \4 ): " , 0+ chars other than " , and then "@en
  •  \.$ - a space, a literal . , and the end of a line.
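To illustrate the effect of Regex 1, here is a small sketch run on a made-up label pair. The sample string and the names REGEX_1, sample and swapped are assumptions for illustration only.

```python
import re

REGEX_1 = r'(?m)(rmfs:label\s*)("[^"]*"@(?!en)\w*)(\s*,\s*)("[^"]*"@en) \.$'

# Hypothetical input: the non-English label comes first, the @en label last.
sample = 'rmfs:label "Ersteller"@de ,\n    "Creator"@en .'

# Emit group 4 (the @en label) before group 2 (the other label),
# keeping the original separator (group 3) between them.
swapped = re.sub(REGEX_1, r'\1\4\3\2 .', sample)
print(swapped)

# If the @en label is already first, the lookahead blocks the match
# and the text is returned unchanged.
already_ok = 'rmfs:label "Creator"@en ,\n    "Ersteller"@de .'
untouched = re.sub(REGEX_1, r'\1\4\3\2 .', already_ok)
```

With this input, the two quoted labels trade places while the comma, line break and indentation between them stay where they were.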

Regex 2 details

  • (?m) - multiline mode, ^ matches the start of a line
  • ^ - start of a line
  • (\s*###\s*http.*/v\d+#) - Group 1: 0+ whitespaces, ### , 0+ whitespaces, http , any 0+ chars other than line breaks, /v , 1+ digits and #
  • \w* - 0+ word chars
  • ((?:\n(?!\n).*)*rmfs:label\s*") - Group 2: any number of lines before a double line break ( (?:\n(?!\n).*)* ) and then rmfs:label , 0+ whitespaces and "
  • ([^"]*) - Group 3: any 0+ chars other than "
  • ("@en) - Group 4: a "@en substring.
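A short sketch of Regex 2 in action follows. The entry below is hypothetical: it assumes a header that starts with ### and a URL containing /v1# , which is what the pattern expects (the sample in the question differs slightly), and the names REGEX_2, sample and renamed are illustrative.

```python
import re

REGEX_2 = r'(?m)^(\s*###\s*http.*/v\d+#)\w*((?:\n(?!\n).*)*rmfs:label\s*")([^"]*)("@en)'

# Hypothetical entry whose header matches the shape the pattern expects.
sample = (
    '###  http://www.example.com/abc/v1#AAA\n'
    '    pms:ecCreatedBy rms:type ;\n'
    '    rmfs:label "Creator"@en ,\n'
    '    "Ersteller"@de .'
)

# Group 3 (the @en label text) is written twice: once after the # in the
# header, replacing the old name, and once in its original position.
renamed = re.sub(REGEX_2, r'\1\3\2\3\4', sample)
print(renamed)
```

Here the name after # in the header becomes the English label text, and the remaining lines of the entry are left as they were.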
