Web scraping with Python 3.4: extract a paragraph

I use requests and bs4 to scrape data from a webpage. I have a string that contains a few words from a paragraph on the page, and I want to know how to extract the whole paragraph containing it. If anyone knows how, please tell me! Thank you :)

The obvious way is to just iterate over all the paragraphs and find the one that contains your words:

for p in soup.find_all('p'):
    if few_words in p.text:
        # found it, do something
        print(p.text)
        break
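
To make the loop above easier to try out, here is a minimal self-contained sketch. The sample HTML and the helper name find_paragraph are made up for illustration; in practice the HTML would be the response.text you got from requests. It also collapses whitespace before comparing, on the assumption that line breaks inside the page's HTML should not prevent a match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; in practice this would be response.text from requests.
html = """
<html><body>
  <p>First paragraph about something else.</p>
  <p>The quick brown fox
     jumped over the lazy dog.</p>
  <p>Last paragraph.</p>
</body></html>
"""


def find_paragraph(html, few_words):
    """Return the text of the first <p> containing few_words, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Normalize the search words the same way as the paragraph text.
    needle = " ".join(few_words.split())
    for p in soup.find_all("p"):
        # Collapse runs of whitespace so line breaks in the HTML
        # do not block a match.
        text = " ".join(p.get_text().split())
        if needle in text:
            return text
    return None


paragraph = find_paragraph(html, "fox jumped over")
print(paragraph)
```

Returning None when nothing matches lets the caller tell "not found" apart from an empty paragraph.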

Here are some really simple cases that are good to know when web scraping. They only partly answer your question; since you haven't given more information, my data and approach are assumptions at best.

from bs4 import BeautifulSoup as bsoup
import re

html = """
<span>
    <div id="foo">
        The quick brown fox jumped
    </div>
    <p id="bar">
        over the lazy dog.
    </p>
</span>
"""

# Name the parser explicitly so the result does not
# depend on which parsers happen to be installed.
soup = bsoup(html, "html.parser")

# Find the div with id "foo" and get
# its inner text and print it.

foo = soup.find_all(id="foo")
f = foo[0].get_text()
print(f)

print("-" * 50)

# Find the p with id "bar", get its
# inner text, strip leading and trailing
# whitespace, and print it out.

bar = soup.find_all(id="bar")
b = bar[0].get_text().strip()
print(b)

print("-" * 50)

# Find the word "lazy". Get its parent
# tag. If it's a p tag, get that p tag's
# parent, then get all the text inside that
# parent, strip all extra spaces, and print.
lazy = soup.find_all(text=re.compile("lazy"))
lazy_tag = lazy[0].parent

if lazy_tag.name == "p":
    lazy_grandparent = lazy_tag.parent
    all_text = lazy_grandparent.get_text()
    all_text = " ".join(all_text.split())
    print(all_text)

Result:

        The quick brown fox jumped

--------------------------------------------------
over the lazy dog.
--------------------------------------------------
The quick brown fox jumped over the lazy dog.
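
If the words you have sit inside a nested tag such as <b>, the direct parent of the matching text is not the <p> itself. A possible variation on the approach above is to climb to the nearest enclosing <p> with find_parent. This is only a sketch: the sample markup and the helper name enclosing_paragraph are made up, and it uses the string argument, which is the newer name for text in bs4 4.4 and later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup: the search words sit inside a nested <b> tag.
html = "<div><p>over the <b>lazy dog</b> again.</p><p>Unrelated.</p></div>"


def enclosing_paragraph(html, few_words):
    """Return the text of the <p> that encloses few_words, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # re.escape keeps punctuation in few_words from being read as regex syntax.
    node = soup.find(string=re.compile(re.escape(few_words)))
    if node is None:
        return None
    # Walk up the tree to the nearest <p> ancestor, however deep the match is.
    p = node.find_parent("p")
    return p.get_text() if p is not None else None


result = enclosing_paragraph(html, "lazy")
print(result)
```

Note that this only matches words that fall inside a single text node. A phrase like "the lazy", which spans the boundary of the <b> tag here, is not found this way; for that case, comparing against each paragraph's full text is the safer route.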

for para in request_soup.p.find_all(text=True, recursive=True):
    print(para)

You can use this to extract the text of the paragraph, even if there are other tags before the <p> tag.
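
As a rough illustration of what that loop yields, the sketch below runs it on a made-up snippet where a heading comes before the <p> and the paragraph contains nested tags. The sample HTML is an assumption, and string=True is used as the newer spelling of text=True (bs4 4.4 and later).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a heading precedes the paragraph,
# and the paragraph itself contains nested tags.
html = "<h1>Title</h1><p>Some <i>nested</i> text <a href='#'>here</a>.</p>"
request_soup = BeautifulSoup(html, "html.parser")

# request_soup.p is the first <p> in the document (None if there is none).
# find_all(string=True) returns every text node inside it, in document order.
pieces = request_soup.p.find_all(string=True)

# Joining the pieces gives the paragraph's text without the markup.
paragraph = "".join(pieces)
print(paragraph)
```

Keep in mind that request_soup.p only ever refers to the first paragraph, so this fits best when you already know the paragraph you want is the first one.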
