
How to extract the text of a text node within an HTML DOM through XPath?

I'm trying to access a web database for its categorizations of certain mathematics papers. In the HTML below, "Mathematics" is the desired result. Other categories include "Applied Mathematics" and "Statistics". Specifically, I want to repeat this process for many different math papers on different pages of this online database, and I can't search for a specific XPath because the XPath changes from paper to paper.

HTML Code:

<p class="FR_field">
    <span class="FR_label">Web of Science Categories:</span>Mathematics</p>

For instance, "Mathematics" is located at

//*[@id="records_form"]/div/div/div/div[1]/div/div[8]/p[2]/text()

for that particular paper, but the index of the p tag or one of the div tags might change from paper to paper. The code I wrote to find the category is

Python Code for remote access:

driver.find_element_by_xpath("//*[contains(text(), 'Web of Science Categories:')]").text[26:]

But this does not seem to work: if I print the result, it prints nothing. Could it be that I am running into this problem because of the extra string slicing I am attempting? I want simply "Mathematics" and not "Web of Science Categories: Mathematics", so I'm slicing off the first 26 characters of the result.

EDIT: After some further testing, it seems that I was indeed getting a result, but nothing was printed because my Python code only sees "Web of Science Categories:". Naturally, slicing this string at the 26th character leaves nothing to print. However, this presents the new conundrum of how to actually acquire "Mathematics" and not "Web of Science Categories:".
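The behaviour described in the edit follows from the DOM structure: contains(text(), ...) matches the <span>, whose own text is only the label, while "Mathematics" is a separate text node that follows the span inside the <p>. The following is a minimal sketch that uses Python's standard-library xml.etree.ElementTree purely for illustration (it is not the Selenium driver from the question, and it assumes the snippet is well-formed enough to parse as XML). In ElementTree terms, the trailing text node is the span's tail:

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it happens to be well-formed XML,
# so the standard-library parser can read it directly.
html = """<p class="FR_field">
    <span class="FR_label">Web of Science Categories:</span>Mathematics</p>"""

p = ET.fromstring(html)
span = p.find("./span[@class='FR_label']")

label = span.text             # text inside the <span> itself
category = span.tail.strip()  # text node that follows the <span> inside <p>
sliced = label[26:]           # the label is exactly 26 characters long

print(repr(label))
print(repr(category))
print(repr(sliced))
```

Because the label alone is 26 characters long, slicing it with [26:] leaves an empty string, which matches the empty output reported above.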

As per the HTML you have provided, to extract the text Mathematics you can use the following line of code:

text1 = driver.find_element_by_xpath("//p[@class='FR_field']").get_attribute("innerHTML").splitlines()[2]
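The line above depends on how the page's innerHTML happens to be broken into lines, so the index may need adjusting if the markup is laid out differently. A possible alternative, sketched here under assumptions, is to locate the enclosing <p> through its label and then strip the label from the paragraph's visible text. The helper names strip_label and get_category are hypothetical, and the XPath assumes the label span always carries class FR_label as in the snippet from the question:

```python
def strip_label(paragraph_text, label="Web of Science Categories:"):
    """Return the part of paragraph_text that follows label, trimmed."""
    _, found, rest = paragraph_text.partition(label)
    # If the label is missing, fall back to the whole (trimmed) text.
    return rest.strip() if found else paragraph_text.strip()


def get_category(driver):
    """Hypothetical helper: read the category from an open WebDriver session."""
    # Anchor on the label text instead of positional div/p indexes,
    # then step up to the enclosing <p> with "/..".
    xpath = ("//span[@class='FR_label']"
             "[contains(., 'Web of Science Categories:')]/..")
    # "xpath" is the string value of selenium's By.XPATH locator strategy.
    paragraph = driver.find_element("xpath", xpath)
    return strip_label(paragraph.text)
```

With a live session, get_category(driver) would be called once per paper page. Note that WebElement.text returns only the rendered (visible) text, so this sketch assumes the category is actually displayed on the page.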
