简体   繁体   中英

Python: Get text from all HTML child elements texts with lxml xpath

I am using the lxml xpath of python. I am able to extract text if I give the full path to a HTML tag. However I can't extract all the text from a tag and it's child elements into a list. So for example given this html I would like to get all the texts of the "example" class:

<div class="example">
    "Some text"
    <div>
        "Some text 2"
        <p>"Some text 3"</p>
        <p>"Some text 4"</p>
        <span>"Some text 5"</span>
    </div>
    <p>"Some text 6"</p> 
</div>

I would like to get:

["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]

mzjn-s anwer is correct. After some trial and error I've managed to get it working. This is what the end code looks like. You need to put //text() to the end of the xpath. It is without refactoring for the moment, so there will definitely be some mistakes and bad practices but it works.

    session = requests.Session()
    retry = Retry(connect=3, backoff_factor=0.5)
    adapter = HTTPAdapter(max_retries=retry)
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    page = session.get("The url you are webscraping")
    content = page.content

    htmlsite = urllib.request.urlopen("The url you are webscraping")
    soup = BeautifulSoup(htmlsite, 'lxml')
    htmlsite.close()

    tree = html.fromstring(content)
    scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')

I've tried it out on the team introduction page of keeleyteton.com. It returned the following list which is correct (although needs lots of amending!) because they are in different tags and some are children tags. Thank you for the help!

['\r\n        ', '\r\n        ', 'Nicholas F. Galluccio', '\r\n        ', '\r\n        ', 'Managing Director and Portfolio Manager', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Scott R. Butler', '\r\n        ', '\r\n        ', 'Senior Vice President and Portfolio Manager ', '\r\n        ', 'Teton Small Cap Select Value', '\r\n        ', 'Keeley Teton Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Thomas E. Browne, Jr., CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian P. Leonard, CFA', '\r\n        ', '\r\n
  ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Robert M. Goldsborough', '\r\n        ', '\r\n        ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Brian R. Keeley, CFA', '\r\n        ', '\r\n        ', 'Portfolio Manager', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Edward S. Borland', '\r\n        ', '\r\n
  ', 'Research Analyst', '\r\n        ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n      ', '\r\n        ', '\r\n        ', 'Kevin M. Keeley', '\r\n        ', '\r\n        ', 'President', '\r\n
 ', '\r\n        ', '\r\n        ', 'Deanna B. Marotz', '\r\n        ', '\r\n        ', 'Chief Compliance Officer', '\r\n      ']

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM