Python: extract HTML webpage content using keywords

Using Python, I would like to extract content from HTML by matching keywords.

Here is my Python script:

import requests
from bs4 import BeautifulSoup
import re
html = """ <pre>
      Companies:
       Telstra VI Huawei
      Countries:
       JPN CHN MLY
   </pre>
   <pre>
   Data center:
    US UK
   </pre>"""
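# The HTML is already a string, so parse it directly; requests.get() expects a URL
soup = BeautifulSoup(html, "html.parser")
# re.compile is case-sensitive by default, so ignore case to match "Companies:"
k = soup.find(string=re.compile("companies:", re.IGNORECASE)).parent.text
print(k)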

Expected output:

Companies:
       Telstra VI Huawei

Try this, using the simplified_scrapy library.

from simplified_scrapy import SimplifiedDoc

html = """ <pre>
      Companies:
       Telstra VI Huawei
      Countries:
       JPN CHN MLY
   </pre>
   <pre>
   Data center:
    US UK
   </pre>"""
doc = SimplifiedDoc(html)
pre = doc.getElementByReg('Companies:')
print(pre.text)
print('-' * 50)
# Raw string so that \s and \S reach the regex engine unchanged
print(pre.replaceReg(r'Countries:[\s\S]*', '').strip())

Result:

Companies: Telstra VI Huawei Countries: JPN CHN MLY
--------------------------------------------------
Companies:
       Telstra VI Huawei
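
If installing a third-party package is not an option, roughly the same result can be had with the standard library alone. The following is only a sketch, not the method used above: `PreTextCollector` and `extract_section` are hypothetical names made up for this example, and it assumes that every heading sits on its own line and ends with a colon, as in the sample HTML.

```python
from html.parser import HTMLParser

html = """ <pre>
      Companies:
       Telstra VI Huawei
      Countries:
       JPN CHN MLY
   </pre>
   <pre>
   Data center:
    US UK
   </pre>"""


class PreTextCollector(HTMLParser):
    """Hypothetical helper: collects the text of every <pre> element."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_pre = False

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._in_pre = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "pre":
            self._in_pre = False

    def handle_data(self, data):
        if self._in_pre:
            self.blocks[-1] += data


def extract_section(html, keyword):
    """Return the heading line matching `keyword` (case-insensitive) plus the
    lines below it, stopping at the next line that ends with a colon.
    Returns None when the keyword is not found."""
    parser = PreTextCollector()
    parser.feed(html)
    parser.close()
    wanted = keyword.lower()
    for block in parser.blocks:
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if line.strip().lower() == wanted:
                section = [line.strip()]
                for following in lines[i + 1:]:
                    # Assumption: a line ending with ":" starts the next section
                    if following.strip().endswith(":"):
                        break
                    section.append(following.rstrip())
                return "\n".join(section).rstrip()
    return None


print(extract_section(html, "companies:"))
```

Because the comparison is case-insensitive, the lowercase keyword from the question still finds the "Companies:" heading. If the real pages put values on the same line as the heading, the colon-based stop condition would need adjusting.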

