
Scrape certain elements from HTML using Python and BeautifulSoup

This is the HTML I'm working with:

<hr>
<b>1914 December 12 - </b>. 
<ul>
    <li>
        <b>Birth of Herbert Hans Guendel</b> - . 
        <i>Nation</i>: 
        <a href="http://www.astronautix.com/g/germany.html">Germany</a>, 
        <a href="http://www.astronautix.com/u/usa.html">USA</a>. 
        <i>Related Persons</i>: 
        <a href="http://www.astronautix.com/g/guendel.html">Guendel</a>.
     
    German-American engineer in WW2, member of the Rocket Team in the United
     States thereafter. German expert in guided missiles during WW2. As of 
    January 1947, working at Fort Bliss, Texas. Died at Boston, New York.. 
    </li>
</ul>

I would like the output to look like this:

Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
     States thereafter. German expert in guided missiles during WW2. As of 
    January 1947, working at Fort Bliss, Texas. Died at Boston, New York.

Here's my code:

from bs4 import BeautifulSoup
import requests
import linkMaker as linkMaker

url = linkMaker.link

page = requests.get(url)

soup = BeautifulSoup(page.content, "html.parser")

with open("test1.txt", "w") as file:
    hrs = soup.find_all('hr')
    for hr in hrs:
        lis = soup.find_all('li')
        for li in lis:
            file.write(str(li.text)+str(hr.text)+"\n"+"\n"+"\n")

Here's what it's returning:

Birth of Herbert Hans Guendel - . 
: Germany, 
USA. 
Related Persons: Guendel. 
German-American engineer in WW2, member of the Rocket Team in the United States thereafter. German expert in guided missiles during WW2. As of January 1947, working at Fort Bliss, Texas. Died at Boston, New York.. 

My ultimate goal is to extract those two parts of the HTML so I can tweet them out.

Looking at the HTML snippet, you can get the title from the first <b> inside the <li> tag. For the text, take the last element of the <li> tag's .contents:

from bs4 import BeautifulSoup


html_doc = """\
<hr>
<b>1914 December 12 - </b>. 
<ul>
    <li>
        <b>Birth of Herbert Hans Guendel</b> - . 
        <i>Nation</i>: 
        <a href="http://www.astronautix.com/g/germany.html">Germany</a>, 
        <a href="http://www.astronautix.com/u/usa.html">USA</a>. 
        <i>Related Persons</i>: 
        <a href="http://www.astronautix.com/g/guendel.html">Guendel</a>.
     
    German-American engineer in WW2, member of the Rocket Team in the United
     States thereafter. German expert in guided missiles during WW2. As of 
    January 1947, working at Fort Bliss, Texas. Died at Boston, New York.. 
    </li>
</ul>"""

soup = BeautifulSoup(html_doc, "html.parser")

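# Title: the first <b> inside the <li>
title = soup.find("li").b.text
# Text: the last child of the <li> is the trailing text node;
# strip the surrounding spaces, dots, and newlines
text = soup.find("li").contents[-1].strip(" .\n")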

print(title)
print(text)

Prints:

Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
     States thereafter. German expert in guided missiles during WW2. As of 
    January 1947, working at Fort Bliss, Texas. Died at Boston, New York
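
The snippet above handles a single <li>. Since the real page presumably contains many entries, the same idea can be applied in a loop. What follows is only a minimal sketch under some assumptions: the function name extract_entries is made up for illustration, the sample HTML is a shortened copy of the entry from the question plus one hypothetical second entry, and collapsing the whitespace and re-adding a single trailing period are choices made here to match the desired output in the question.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the entry from the question, plus one made-up entry
# (the second <li> is hypothetical and only there to show the loop).
SAMPLE = """\
<hr>
<b>1914 December 12 - </b>.
<ul>
    <li>
        <b>Birth of Herbert Hans Guendel</b> - .
        <i>Nation</i>:
        <a href="http://www.astronautix.com/g/germany.html">Germany</a>.

    German-American engineer in WW2, member of the Rocket Team in the United
     States thereafter. Died at Boston, New York..
    </li>
    <li>
        <b>Hypothetical second event</b> - .
        <i>Nation</i>:
        <a href="http://www.astronautix.com/u/usa.html">USA</a>.

    Placeholder description.
    </li>
</ul>"""


def extract_entries(html):
    """Return a list of (title, text) tuples, one per usable <li>."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for li in soup.find_all("li"):
        # Skip list items that have no <b> title or no children at all.
        if li.b is None or not li.contents:
            continue
        last = li.contents[-1]
        # The description is expected to be a bare text node, not a tag.
        if not isinstance(last, NavigableString):
            continue
        title = li.b.get_text(strip=True)
        # Strip surrounding spaces, dots, and newlines, then collapse
        # the internal line breaks and indentation to single spaces.
        text = " ".join(last.strip(" .\n").split())
        if text:
            entries.append((title, text + "."))
    return entries


entries = extract_entries(SAMPLE)
for title, text in entries:
    print(title)
    print(text)
    print()
```

To write the result to a file as in the question, the print calls can be replaced with file.write(title + "\n" + text + "\n\n") inside a with open("test1.txt", "w") block.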
