How to find a substring using a regular expression

Question

After scraping 1000 Reuters articles, I ended up with strings that look like this:

<TEXT>&#2;
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR>    By Yoshiko Mori</AUTHOR>
<DATELINE>    TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;</BODY></TEXT>

I want to extract the title, author, dateline and body from this string. To do that, I have the regex below, but unfortunately it does not work for the body section.

try:
  body=re.search('<BODY>(.)</BODY>',example_txt).group(1)
except:
  body='NA'

This try-except always returns NA for the body, but the same approach works for the title, author and dateline. Any idea why?

Thanks!

Answer 1

Use re.DOTALL so that . matches newlines as well.

re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

https://docs.python.org/3/library/re.html
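To see the effect of the flag in isolation, here is a minimal sketch. The two-line `sample` string is made up for illustration and is not part of the scraped data:

```python
import re

# Hypothetical two-line input, just to isolate the effect of the flag.
sample = "<BODY>line one\nline two</BODY>"

# Without DOTALL, '.' cannot cross the newline, so the pattern never reaches </BODY>.
no_flag = re.search(r"<BODY>(.*)</BODY>", sample)

# With DOTALL, '.' also matches '\n', so the whole body is captured.
with_flag = re.search(r"<BODY>(.*)</BODY>", sample, flags=re.DOTALL)

print(no_flag)              # None
print(with_flag.group(1))   # both lines of the body
```

This is probably also why the title, author and dateline patterns worked in the question: in the sample, those fields each sit on a single line, so the missing flag did not matter there.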

You also need * to match multiple characters (a bare . matches exactly one character), and ? after it to make the match non-greedy.
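The difference between the three quantifier choices can be sketched like this. The `sample` string with two BODY elements is an assumption made for illustration; the article in the question contains only one:

```python
import re

# Hypothetical input with two BODY elements on one line.
sample = "<BODY>first</BODY><BODY>second</BODY>"

# '(.)' requires exactly one character between the tags, so nothing matches.
one_char = re.search(r"<BODY>(.)</BODY>", sample)

# Greedy '.*' runs to the LAST closing tag and swallows the markup in between.
greedy = re.search(r"<BODY>(.*)</BODY>", sample).group(1)

# Non-greedy '.*?' stops at the FIRST closing tag.
lazy = re.search(r"<BODY>(.*?)</BODY>", sample).group(1)

print(one_char)  # None
print(greedy)    # first</BODY><BODY>second
print(lazy)      # first
```

With a single BODY element per article the greedy and non-greedy forms give the same result, but the non-greedy form is the safer default if several articles end up concatenated in one string.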

Finally, I have a hunch that using try here is not really recommended, since a bare except also hides unrelated errors. You can instead check whether the match object returned by re.search is None.

import re

example_txt = '''<TEXT>&#2;
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR>    By Yoshiko Mori</AUTHOR>
<DATELINE>    TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;</BODY></TEXT>'''

m = re.search(r'<BODY>(.*?)</BODY>', example_txt, flags=re.DOTALL)
body = m.group(1) if m else 'NA'

print(body)

Output:

If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;
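Since the question asks for all four fields, the same idea can be wrapped in a small helper. This is only a sketch: `extract_field` and its `default` parameter are hypothetical names introduced here, and stripping surrounding whitespace is an assumption about what is wanted.

```python
import re

def extract_field(tag, text, default="NA"):
    """Return the text between <tag> and </tag>, or `default` if the tag is absent."""
    # re.escape is not strictly needed for these tag names, but it keeps the
    # pattern safe if a tag ever contains a regex metacharacter.
    name = re.escape(tag)
    m = re.search(rf"<{name}>(.*?)</{name}>", text, flags=re.DOTALL)
    # strip() drops the leading spaces seen in AUTHOR and DATELINE (an assumption).
    return m.group(1).strip() if m else default

example_txt = """<TEXT>&#2;
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR>    By Yoshiko Mori</AUTHOR>
<DATELINE>    TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
 REUTER
&#3;</BODY></TEXT>"""

fields = {tag.lower(): extract_field(tag, example_txt)
          for tag in ("TITLE", "AUTHOR", "DATELINE", "BODY")}

for name, value in fields.items():
    print(f"{name}: {value!r}")

# A tag that is not present falls back to the default instead of raising.
print(extract_field("PLACES", example_txt))  # NA
```

For anything beyond this flat, predictable layout, a real SGML/HTML parser would likely be more robust than regular expressions.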
