How to find a substring using regex
The string that I ended up with after scraping 1000 Reuters articles looks like this:
<TEXT>
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR> By Yoshiko Mori</AUTHOR>
<DATELINE> TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
</BODY></TEXT>
I want to extract the title, author, dateline and body from this string. To do that, I have the regex below, but unfortunately it is not working for the body section.
try:
    body=re.search('<BODY>(.)</BODY>',example_txt).group(1)
except:
    body='NA'
This try-except always returns NA for body but works for title, author and dateline. Any idea why?
Thanks!
Use re.DOTALL so that . matches newlines as well.
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
https://docs.python.org/3/library/re.html
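To see the effect of the flag in isolation, here is a minimal sketch. The two-line `sample` string is a made-up stand-in, not part of the Reuters data; the only difference between the two calls is `flags=re.DOTALL`.

```python
import re

# Hypothetical two-line sample (an assumption for illustration only).
sample = "<BODY>first line\nsecond line</BODY>"

# Without re.DOTALL, '.' stops at the newline, so the pattern
# can never reach </BODY> and re.search returns None.
no_flag = re.search(r"<BODY>(.*)</BODY>", sample)

# With re.DOTALL, '.' also matches '\n', so the whole body is captured.
with_flag = re.search(r"<BODY>(.*)</BODY>", sample, flags=re.DOTALL)

print(no_flag)                   # None
print(repr(with_flag.group(1)))  # 'first line\nsecond line'
```

This is likely why the title, author and dateline patterns can work without the flag: those elements sit on a single line, while the body spans several.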
Also, you need * to match multiple characters, and ? to make the match non-greedy.
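The difference between greedy and non-greedy matching only shows up when the closing tag appears more than once. A small sketch, assuming a made-up string with two `<BODY>` blocks (your scraped data may or may not look like this):

```python
import re

# Hypothetical input with two BODY blocks back to back.
sample = "<BODY>one</BODY><BODY>two</BODY>"

# Greedy: '.*' grabs as much as it can and stops at the LAST </BODY>.
greedy = re.search(r"<BODY>(.*)</BODY>", sample, flags=re.DOTALL).group(1)

# Non-greedy: '.*?' grabs as little as it can and stops at the FIRST </BODY>.
lazy = re.search(r"<BODY>(.*?)</BODY>", sample, flags=re.DOTALL).group(1)

# With the non-greedy pattern, re.findall returns each body separately.
all_bodies = re.findall(r"<BODY>(.*?)</BODY>", sample, flags=re.DOTALL)

print(greedy)      # one</BODY><BODY>two
print(lazy)        # one
print(all_bodies)  # ['one', 'two']
```

If all 1000 articles are in one string, the non-greedy form is the one that keeps the articles from bleeding into each other.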
Finally, I have a hunch that using try here is not really recommended. You can instead check whether the match object returned by re.search is None.
import re
example_txt = '''<TEXT>
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR> By Yoshiko Mori</AUTHOR>
<DATELINE> TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
</BODY></TEXT>'''
m = re.search(r'<BODY>(.*?)</BODY>', example_txt, flags=re.DOTALL)
body = m.group(1) if m else 'NA'
print(body)
Output:
If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
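Since the question asks for the title, author, dateline and body, the same pattern can be applied to all four tags in a loop. This is only a sketch: `extract_fields` is a hypothetical helper name, and the `.strip()` call is my own addition to drop the leading and trailing whitespace seen in the sample.

```python
import re

def extract_fields(text, tags=("TITLE", "AUTHOR", "DATELINE", "BODY"), default="NA"):
    """Return a dict mapping each lowercased tag name to its text, or `default` if absent."""
    fields = {}
    for tag in tags:
        # Non-greedy match plus re.DOTALL so multi-line elements work too.
        m = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        # .strip() is an assumption: remove it to keep the raw whitespace.
        fields[tag.lower()] = m.group(1).strip() if m else default
    return fields

example_txt = '''<TEXT>
<TITLE>IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST</TITLE>
<AUTHOR> By Yoshiko Mori</AUTHOR>
<DATELINE> TOKYO, Oct 20 - </DATELINE><BODY>If the dollar goes the way of Wall Street,
Japanese will finally move out of dollar investments in a
serious way, Japan investment managers say.
REUTER
</BODY></TEXT>'''

fields = extract_fields(example_txt)
print(fields["title"])     # IF DOLLAR FOLLOWS WALL STREET JAPANESE WILL DIVEST
print(fields["author"])    # By Yoshiko Mori
print(fields["dateline"])  # TOKYO, Oct 20 -
```

A tag that is missing from the text simply comes back as 'NA', with no try/except needed.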
