
Find a substring using Python

I extracted a raw string from a Q&A forum. The string looks like this:

s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'

I want to extract the substring "<font color="blue"><font face="Times New Roman">" and assign it to a new variable. I can remove it with regex, but I don't know how to assign it to a new variable. I am new to regex.

import re
s1 = re.sub('<.*?>', '', s)

This removes the substring, but I'd like to keep the removed substring for the record, ideally by assigning it to a variable.

How can I do this? I would prefer regular expressions.

Although bs4 is more appropriate for web scraping, if regex is good enough for your case you can do the following:

>>> import re
>>> s = 'Take about 2 + <font color="blue"><font face="Times New Roman">but double check with teacher <font color="green"><font face="Arial">before you do'
>>> regex = re.compile('<.*?>')
>>> regex.findall(s)
['<font color="blue">', '<font face="Times New Roman">', '<font color="green">', '<font face="Arial">']
>>> regex.sub('', s)
'Take about 2 + but double check with teacher before you do'
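Since findall returns each tag separately, here is a minimal sketch of one way to keep the first run of adjacent tags as a single string, which is what the question asks for. The names tag_run, first_run and cleaned are made up for this illustration, and the pattern assumes no tag contains a literal ">" inside an attribute value.

```python
import re

s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">'
     'before you do')

# One or more tags with nothing between them, matched as a single run.
tag_run = re.compile(r'(?:<[^>]*>)+')

# search() returns the first match (or None), so keep its text in a variable.
match = tag_run.search(s)
first_run = match.group(0) if match else ''

# sub() still strips every run of tags from the original string.
cleaned = tag_run.sub('', s)

print(first_run)
print(cleaned)
```

If only the first run should be removed as well, passing count=1 to sub() leaves the later tags in place.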

Regex is not the easiest tool for parsing HTML. You can use BeautifulSoup to parse the components and build your substring.

from bs4 import BeautifulSoup

s = """Take about 2 + <font color="blue">
       <font face="Times New Roman">but double check with teacher <font color="green">
       <font face="Arial">before you do"""


soup = BeautifulSoup(s, "html.parser")

Print the parsed HTML (for example with print(soup.prettify())):

Take about 2 +
<font color="blue">
 <font face="Times New Roman">
  but double check with teacher
  <font color="green">
   <font face="Arial">
    before you do
   </font>
  </font>
 </font>
</font>

Extract components:

soup.font.font['face']
> 'Times New Roman'
soup.font["color"]
> 'blue'

Now build your substring and save it as a variable:

# color comes from the outer <font> tag, face from the nested one
variable = f'<font color="{soup.font["color"]}"><font face="{soup.font.font["face"]}">'

This will give you:

"<font color="blue"><font face="Times New Roman">"

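If installing bs4 is not an option, a similar result can probably be had from the standard library's html.parser module, the same parser the answer passes to BeautifulSoup. The sketch below is only an illustration under that assumption: StartTagCollector is a hypothetical helper class, and it simply records the raw text of every opening tag so the first two can be joined into one variable.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records the raw text of each opening tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the source text of the tag just opened.
        self.start_tags.append(self.get_starttag_text())


s = ('Take about 2 + <font color="blue"><font face="Times New Roman">'
     'but double check with teacher <font color="green"><font face="Arial">'
     'before you do')

collector = StartTagCollector()
collector.feed(s)
collector.close()

# Join the first two opening tags into the substring from the question.
variable = ''.join(collector.start_tags[:2])
print(variable)
```

This keeps the tags exactly as they appear in the source, so no attribute quoting has to be rebuilt by hand.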