
Extract text from HTML tags using regex

My HTML text looks like the snippet below. I want to extract only the plain text from it using a regex in Python (not an HTML parser):

<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>

What is the exact regex to get the plain text?

You can do this in JavaScript by selecting the element and then reading its .innerHTML property.

// select the elements with the class you want to pull the HTML from
let div = document.getElementsByClassName('text-div');
// take the first element of the HTMLCollection returned by the selector method and read its inner HTML
let text = div[0].innerHTML;

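This selects the element and reads its inner HTML, that is, everything between the element's opening and closing tags but not those tags themselves. Note that innerHTML still contains the markup of any nested elements (such as the <span> in your example); the textContent property returns only the text.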

Regex is not necessary for this. You would have to run the regex in JS or on some back end anyway, so as long as you can add a JS script to your project, you can read the inner HTML directly.

If you're scraping data, the scraping library for your language will most likely have selector methods that retrieve an element's text without any need for regex.
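Since the question is about Python, here is a minimal sketch of the same idea using only the standard library's html.parser module. The class name TextExtractor, its text() method, and the shortened sample fragment are made up for this illustration; they are not part of any library or of the answers here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that collects the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags
        self.chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace into single spaces
        return " ".join("".join(self.chunks).split())


# Shortened version of the fragment from the question (style attribute trimmed)
sample = (
    '<p style="text-align: justify;"><span style="font-size: small;">\n'
    "Irrespective of the kind of small business you own, using traditional "
    "sales and marketing tactics can prove to be expensive.\n"
    "</span></p>"
)

extractor = TextExtractor()
extractor.feed(sample)
extractor.close()
plain_text = extractor.text()
print(plain_text)
```

Unlike an XML parser, html.parser tolerates markup that is not well-formed. Be aware that handle_data also receives the contents of <script> and <style> elements, so real pages would need extra filtering.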

You might be better off using a parser here:

import html
import xml.etree.ElementTree as ET

# the HTML fragment from the question
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

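# unescape HTML entities, then parse the fragment into an element tree
root = ET.fromstring(html.unescape(string))

# iterate over the direct children of <p> (here: the <span>) and print their text
for child in root.findall("*"):
    print(child.text)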

This yields:

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

You will probably need to adjust the XPath expression for your own markup, so have a look at the XPath syntax that ElementTree supports.
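If the nesting depth is not known in advance, one alternative to tuning the XPath is Element.itertext(), which walks the whole subtree and yields every text piece in document order. The sketch below assumes the input is well-formed XML; the variable names and the shortened fragment are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened version of the fragment from the question (style attribute trimmed)
fragment = """<p style="text-align: justify;"><span style="font-size: small;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

root = ET.fromstring(fragment)

# itertext() yields the text and tail of every element below root, in order
plain_text = "".join(root.itertext()).strip()
print(plain_text)

# It also handles text that is interleaved with nested tags
mixed = ET.fromstring("<p>Hello <b>bold</b> world</p>")
mixed_text = "".join(mixed.itertext())
print(mixed_text)
```

ET.fromstring raises a ParseError on markup that is not well-formed (unclosed <br>, unquoted attributes), so this only works for XHTML-like input.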


Addendum:

It is possible to use a regular expression here, but this approach is error-prone and not advisable:

import re

string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

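The idea is to look for an uppercase letter at a word boundary and then match word characters, whitespace, and commas up to the next dot. Text containing any other character (an apostrophe, a hyphen, a colon) will be cut short or missed, which is why this is fragile. See a demo on regex101.com.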
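A different regex strategy, not taken from the answer above, is to delete the tags instead of trying to describe the text. The sketch below is a naive version of that idea: the names TAG_RE and strip_tags are made up for this example, and the pattern assumes that no attribute value contains a ">" and that there are no comments, <script>, or <style> blocks.

```python
import html
import re

# Naive tag pattern: "<", then anything that is not ">", then ">"
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Hypothetical helper: remove tags, decode entities, normalise whitespace."""
    without_tags = TAG_RE.sub("", markup)
    # Decode entities only after the tags are gone, so &lt;b&gt; stays literal text
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


# Shortened version of the fragment from the question (style attribute trimmed)
sample = """<p style="text-align: justify;"><span style="font-size: small;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

plain_text = strip_tags(sample)
print(plain_text)
```

This works for simple, machine-generated fragments like the one in the question, but it is still a regex applied to HTML, so the warning above applies to it as well.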
