How to extract the text of a text node within an HTML DOM through XPath?

I'm trying to access a web database for its categorizations of certain mathematics papers. In the HTML below, "Mathematics" would be the desired result. Categories include "Applied Mathematics" and "Statistics" as well. Specifically, I want to repeat this process for many different math papers on different pages of this online database, and I can't search for a specific XPath because the XPath changes from paper to paper.

HTML Code:

<p class="FR_field">
    <span class="FR_label">Web of Science Categories:</span>Mathematics</p>

For instance, "Mathematics" is located at

//*[@id="records_form"]/div/div/div/div[1]/div/div[8]/p[2]/text()

for that particular paper, but the index of the p tag or one of the div tags might change from paper to paper. The code I wrote to find the category is

Python Code for remote access:

driver.find_element_by_xpath("//*[contains(text(), 'Web of Science Categories:')]").text[26:]

But this does not seem to work: if I print the result, nothing is printed. Could the extra string slicing be causing this? I want just "Mathematics", not "Web of Science Categories: Mathematics", so I slice off the first 26 characters of the result.

EDIT: After some further testing, it seems that I was indeed getting a result, but nothing was printed because my Python code only sees "Web of Science Categories:". Naturally, slicing this string from the 26th character onward leaves nothing. However, this raises a new question: how do I actually get "Mathematics" rather than "Web of Science Categories:"?
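The behaviour described in the edit can be reproduced offline. The sketch below is only an illustration, not the asker's Selenium setup: it parses the posted snippet with Python's standard-library `xml.etree.ElementTree` (which works here only because the snippet happens to be well-formed XML), and the wrapping `div` and the helper name `field_value` are assumptions made for the example. It shows that the `span` element owns only the label, while "Mathematics" is a separate text node that follows the `span` inside the `p`.

```python
import xml.etree.ElementTree as ET

# The snippet from the question, wrapped in a div so there is something to search in.
SNIPPET = """
<div id="records_form">
  <p class="FR_field">
    <span class="FR_label">Web of Science Categories:</span>Mathematics</p>
</div>
"""

root = ET.fromstring(SNIPPET)

# The label text belongs to the <span>, so an XPath that matches on this text
# selects the <span>, not the surrounding <p>.
label_span = root.find(".//span[@class='FR_label']")
span_label = label_span.text

print(repr(span_label))       # 'Web of Science Categories:'
print(len(span_label))        # 26
print(repr(span_label[26:]))  # '' (nothing is left after the label)


def field_value(tree, label):
    """Return the text that directly follows the label span, or None if absent.

    Hypothetical helper: ElementTree exposes the text node after an element
    as that element's .tail attribute.
    """
    for span in tree.iter("span"):
        if span.get("class") == "FR_label" and (span.text or "").strip() == label:
            return (span.tail or "").strip()
    return None


print(field_value(root, "Web of Science Categories:"))  # Mathematics
```

Because the label is exactly 26 characters long, `[26:]` on the span's own text is always empty; the category has to be read from the node after the span, or from the parent `p`.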

Based on the HTML you provided, to extract the text Mathematics you can use the following line of code:

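# Take whatever follows the closing </span> of the label and trim the surrounding whitespace.
text1 = driver.find_element_by_xpath("//p[@class='FR_field']").get_attribute("innerHTML").split("</span>")[-1].strip()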
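If a page has more than one `FR_field` paragraph, matching on the class alone may pick the wrong one. A possible variant, sketched under assumptions rather than tested against the real site, locates the label `span` by its text and then reads the text node right after it through JavaScript. The function name `category_from_page` is hypothetical. The call `find_element("xpath", ...)` is the Selenium 4 spelling (the string `"xpath"` is the value of `By.XPATH`), and `execute_script` passes the element to the script as `arguments[0]`.

```python
LABEL_XPATH = (
    "//span[@class='FR_label']"
    "[contains(text(), 'Web of Science Categories:')]"
)

# JavaScript run in the browser: return the text of the node that follows
# the label span, or an empty string if there is no such node.
NEXT_TEXT_JS = (
    "var n = arguments[0].nextSibling;"
    "return n ? n.textContent : '';"
)


def category_from_page(driver):
    """Return the category text that follows the label span on the current page.

    `driver` is assumed to be a Selenium WebDriver that has already loaded
    the record page.
    """
    label_span = driver.find_element("xpath", LABEL_XPATH)
    raw_text = driver.execute_script(NEXT_TEXT_JS, label_span)
    return raw_text.strip()
```

Because the XPath is anchored on the label rather than on the position of a `div` or `p`, it should keep working when those indices change from paper to paper, as long as the label text and class stay the same.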

