Extract text from HTML tags using a regular expression

Question

My HTML text looks like this. I want to extract only the plain text from it using a regex in Python (not using HTML parsers).

&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;

How can I find the exact regex to get the plain text?

Answer 1

You can do this in JavaScript by using a simple selector method and then reading the .innerHTML property.

// Select the elements by the class you want to pull the HTML from
let div = document.getElementsByClassName('text-div');
// Take the first element of the collection returned by the selector method and get its inner HTML
let text = div[0].innerHTML;

This selects the element whose HTML you want to retrieve and then pulls its inner HTML, assuming you only want what is between the element's tags and not the tags themselves. Note that .innerHTML still includes the markup of any nested elements; .textContent returns only the text.

Regex is not necessary for this. You'd have to run the regex in JS or on some back end anyway, and as long as you can insert a JS script into your project, you can get the inner HTML directly.

If you're scraping data, your library, in whatever language, will most likely have selector methods and ways to easily retrieve the HTML text without the need for regex.
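
Since the question is about Python, here is a minimal sketch of the same idea using only the standard library's html.parser module rather than a scraping library. The names TextExtractor and extract_text are made up for this illustration, and the sample string is a shortened version of the markup from the question. It assumes the whole input is entity-escaped, as in the question.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks).strip()


def extract_text(escaped_markup):
    # The markup in the question is entity-escaped, so decode it first.
    parser = TextExtractor()
    parser.feed(html.unescape(escaped_markup))
    parser.close()
    return parser.text()


sample = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span&gt;\n"
    "Irrespective of the kind of small business you own.\n"
    "&lt;/span&gt;&lt;/p&gt;"
)
print(extract_text(sample))
```

This does use a parser, which the question wanted to avoid, but it needs no third-party package and no regex.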

Answer 2

You might be better off using a parser here:

import html
import xml.etree.ElementTree as ET

# decode
string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

# construct the dom
root = ET.fromstring(html.unescape(string))

# search it
for p in root.findall("*"):
    print(p.text)

This yields:

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

Obviously, you might want to change the XPath expression, so have a look at the possibilities.
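
As a rough sketch of what changing the XPath could look like, the snippet below uses ".//span" to match span elements at any depth and itertext() to collect every text node without naming elements at all. The sample string is a shortened version of the one above, and the variable names are only illustrative. Keep in mind that ElementTree is an XML parser, so this only works when the decoded markup is well-formed.

```python
import html
import xml.etree.ElementTree as ET

escaped = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;"
    "&lt;span style=&quot;font-size: small;&quot;&gt;\n"
    "Irrespective of the kind of small business you own.\n"
    "&lt;/span&gt;&lt;/p&gt;"
)

root = ET.fromstring(html.unescape(escaped))

# ".//span" matches <span> elements at any depth below the root.
span_texts = [span.text.strip() for span in root.findall(".//span") if span.text]

# itertext() walks every text node, so no element names are needed.
all_text = "".join(root.itertext()).strip()

print(span_texts)
print(all_text)
```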

Addendum:

It is possible to use a regular expression here, but this approach is really error-prone and not advisable:

import re

string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

The idea is to look for an uppercase letter and then match word characters, whitespace and commas up to a dot. See a demo on regex101.com.
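
A different regex-only approach, shown here as a sketch rather than a recommendation, is to decode the entities and then delete anything that looks like a tag instead of matching the sentence itself. TAG_RE and strip_tags are names invented for this example. The pattern assumes no ">" appears inside attribute values and that there are no comments, script or style blocks, all of which would break it.

```python
import html
import re

# Naive tag pattern: "<", then anything that is not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(escaped_markup):
    # Decode entities such as &lt; and &quot; so the tags become literal.
    markup = html.unescape(escaped_markup)
    # Drop everything that looks like a tag.
    text = TAG_RE.sub("", markup)
    # Collapse the leftover runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


string = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;"
    "&lt;span style=&quot;font-size: small; font-family: lato, arial, "
    "h elvetica, sans-serif;&quot;&gt;\n"
    "Irrespective of the kind of small business you own, using traditional "
    "sales and marketing tactics can prove to be expensive.\n"
    "&lt;/span&gt;&lt;/p&gt;"
)
print(strip_tags(string))
```

Unlike the sentence-matching pattern above, this does not depend on the text starting with an uppercase letter or ending with a dot, but it is still fragile on real-world HTML.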

Answer 1: 1, 2017-11-24 06:02:58
Answer 2: 1, accepted, 2017-11-24 06:51:38
