Why isn't my regex in Python working properly?

I have a .txt file (containing a kind of XML) that I am trying to restructure. I have two questions about things that are not working the way I want them to. (Both problems have been solved by Wiktor's comments.)

The file looks like this:

<str name="name">John</str>
<date name="year">2021</date>
<arr name="food">
   <str>Pizza</str>
   <str>Meat</str>
</arr>

I want to restructure this text into this correct XML structure:

<name>John</name>
<year>2021</year>
<food>
   Pizza
   Meat
</food>

To achieve this, I already made a regular expression:

<(str|date|arr|int|long).*="(.+)">(.*)</(str|date|arr|int|long)>

You can also find the regular expression on Pythex, with the small sample string.

The first question: As you can see on Pythex, the str and date parts are recognized correctly, but the arr part is not. This is because . does not match \n in a regular expression by default. I can change that with the DOTALL flag, but then the entire file becomes one match. That makes sense, but I want separate matches, as happens with the str and date parts when DOTALL is not active. How can I make sure that the part between <arr...> and </arr> is seen as one match, without the search continuing past the </arr>? I need the captures for each individual match, as shown on the right on Pythex. So the match should start at <arr, include the \n characters, and stop at </arr>.

The second question: I want to use the match captures shown on the right (on Pythex) to assemble the new structure. So I need a method that lets me use those pieces of text from the regular expression to replace the original text. I read that this can be done with the compile function of the re package, but it's not working. This is my code:

from re import compile

file = open("file.txt")
content = file.read()
p = compile('<(str|date|arr|int|long).*="(.+)">(.*)</(str|date|arr|int|long)>')
p.sub('<\\2>\\3</\\2>', content)

print(content)

The replacement string on the p.sub line may not be completely correct, but that's not the problem: if I use p.sub('test', content) and print content at the end of the code, the matches are not replaced by 'test' either. The content is the same as it was at the beginning. So the entire function doesn't seem to work. What am I doing wrong?

You need to make three changes to the pattern: make it match across lines by adding the re.S (re.DOTALL) flag; make each .* non-greedy by using the lazy form, .*?; and make sure each close tag is the same as its open tag by means of an inline backreference. Also, do not forget to assign the result of the sub call to a variable: strings are immutable in Python, so sub returns a new string and leaves content unchanged.

You need to use

import re  # needed for the re.I and re.S flags below

p = re.compile(r'<(str|date|arr|int|long)\b.*?="(.*?)">(.*?)</\1>', re.I | re.S)
content = p.sub(r'<\2>\3</\2>', content)

See the regex demo.
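
The greedy-versus-lazy point and the need to assign the result of sub can be checked with a small sketch. The two-element sample string here is hypothetical (it is not the asker's file), and the variable names are only for illustration:

```python
import re

# Hypothetical sample with two elements of the same tag type.
sample = '<str name="first">John</str>\n<str name="last">Smith</str>\n'

# Same pattern twice: greedy quantifiers versus lazy ones, both with DOTALL.
greedy = re.compile(r'<(str|date|arr|int|long)\b.*="(.*)">(.*)</\1>', re.S)
lazy = re.compile(r'<(str|date|arr|int|long)\b.*?="(.*?)">(.*?)</\1>', re.S)

# The greedy version swallows both elements in a single match.
greedy_count = len(greedy.findall(sample))
# The lazy version stops at the first closing tag, giving one match per element.
lazy_count = len(lazy.findall(sample))

# sub returns a new string; sample itself is never modified.
lazy.sub(r'<\2>\3</\2>', sample)           # result discarded, nothing changes
result = lazy.sub(r'<\2>\3</\2>', sample)  # result kept

print(greedy_count, lazy_count)
print(result)
```

Calling sub without assigning its return value is exactly what happens in the question's code, which is why print(content) there shows the original text.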

Details

  • < - a < char
  • (str|date|arr|int|long) - Capturing group 1: any of the alternative substrings
  • \b - a word boundary
  • .*? - zero or more chars (but as few as possible)
  • =" - a =" substring
  • (.*?) - Group 2: any zero or more chars as few as possible
  • "> - a "> substring
  • (.*?) - Group 3: any zero or more chars as few as possible
  • </\1> - </ , same value as in Group 1, and a > .
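
As a rough end-to-end check, the pattern can be run against the sample from the question. Note that the pattern alone leaves the inner <str>Pizza</str> items wrapped, because they sit inside the arr match and have no name attribute. The second pass below, which unwraps attribute-less items, is an assumption added here to reach the desired output; it is not part of the answer above:

```python
import re

# The sample from the question, inlined instead of read from a file.
content = '''<str name="name">John</str>
<date name="year">2021</date>
<arr name="food">
   <str>Pizza</str>
   <str>Meat</str>
</arr>
'''

# First pass: the pattern from the answer renames each named element.
tag = re.compile(r'<(str|date|arr|int|long)\b.*?="(.*?)">(.*?)</\1>', re.I | re.S)
renamed = tag.sub(r'<\2>\3</\2>', content)

# Second pass (assumption, not from the answer): strip the attribute-less
# wrappers such as <str>Pizza</str> that remain inside the former arr block.
item = re.compile(r'<(str|date|int|long)>(.*?)</\1>', re.I | re.S)
flattened = item.sub(r'\2', renamed)

print(flattened)
```

The second pass would also unwrap any element whose name attribute happened to be str, date, int, or long, so treat it as a sketch for this sample rather than a general solution.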

This site can be helpful with regex: https://www.rexegg.com/regex-conditionals.html

I'm not an expert at regex, but I think matching \n explicitly is necessary, similar to how you checked for zero or more wildcard characters.

Edited: <(str|date|arr|int|long).*="(.+)">\n*(.*)\n*</(str|date|arr|int|long)>

You could try that? Again, I'm no expert on regex. Just trying to lend a helping hand.
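
One way to try this suggestion is to run it on the sample from the question. The sketch below reads the suggested pattern with \n* on either side of the content group, which is an assumption about its intended form. Without the DOTALL flag and without a backreference, the arr match appears to stop at the first inner </str> instead of at </arr>:

```python
import re

# The sample from the question, inlined instead of read from a file.
content = '''<str name="name">John</str>
<date name="year">2021</date>
<arr name="food">
   <str>Pizza</str>
   <str>Meat</str>
</arr>
'''

# Assumed reading of the suggested pattern; no flags are passed.
suggested = re.compile(
    r'<(str|date|arr|int|long).*="(.+)">\n*(.*)\n*</(str|date|arr|int|long)>'
)

matches = list(suggested.finditer(content))
for m in matches:
    # Group 1 is the opening tag name, group 4 the closing tag name.
    print(m.group(1), m.group(4), repr(m.group(0)))
```

The opening arr tag ends up paired with a closing str tag, which is the mismatch that the \1 backreference in the other answer prevents.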
