Extract Text from BR tags

I have been able to extract text using Selenium before, but I am having trouble extracting just the numbers between <BR> tags. Here is a sample of the HTML code.

<DIV class="pagebodydiv">
    <TABLE  CLASS="datadisplaytable" SUMMARY="This table will display needed information." WIDTH="100%">
<TR>
<TD CLASS="nttitle" scope="colgroup" >Working Title</A></TD>
</TR>
<TR>
<TD CLASS="ntdefault">
 Further information on subject
<BR>
    3.000
<BR>
    2.000  
<BR>
<BR>
<BR>
<BR>
<BR>
More information
<BR>
<BR>
</TABLE>

So far I have tried using:

WebElement creditinfo = driver.findElement(By.xpath("//div[@class='pagebodydiv']/text()[preceding-sibling::br]"));

and

Elements numInfo = doc.select("br");

However, I keep running into a NoSuchElementException, an InvalidSelectorException, or it just doesn't return anything. Any ideas on how I can get the information?

You actually can select the text nodes between <BR> tags. In HTML (not XHTML) they act as self-closing tags (like <br/>). Based on that behaviour, you could select all text nodes that have a <BR> tag immediately before and after them using:

//TABLE[@CLASS='datadisplaytable']/TR/TD[@CLASS="ntdefault"]
/text()[preceding-sibling::node()[1][self::BR] 
        and following-sibling::node()[1][self::BR]]

That would also select the blank lines and the text that is not a number.

You can get rid of the whitespace-only nodes by adding [normalize-space(.) != ''] to the end of the expression (which will now return only three nodes). You can then pick the node you want with a positional predicate at the end of the expression ([1] selects the first node).

The expression below selects the text node containing the value 2.000:

//TABLE[@CLASS='datadisplaytable']/TR/TD[@CLASS="ntdefault"]
/text()[preceding-sibling::node()[1][self::BR] 
        and following-sibling::node()[1][self::BR]][normalize-space(.) != ''][2]

Note: I'm assuming your source actually has tag names in uppercase, since in XPath <TD> is not the same as <td>. I'm not sure how tolerant Selenium is about this when parsing HTML.
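
As a way to try the expression outside the browser, here is a minimal sketch that evaluates it with the JDK's built-in javax.xml.xpath API. It rests on several assumptions: the markup is a hand-made, well-formed stand-in for the sample in the question (with <BR/> written as an empty element and uppercase names kept), and the class name BrTextXPath and method textBetweenBr are made up for illustration. It is not Selenium code: Selenium's findElement only returns elements, so an XPath that ends in text() cannot be handed to it directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BrTextXPath {
    // Assumption: a well-formed stand-in for the question's markup; the real page is not valid XML.
    static final String SAMPLE =
        "<TABLE CLASS=\"datadisplaytable\"><TR><TD CLASS=\"ntdefault\">"
        + " Further information on subject\n<BR/>\n    3.000\n<BR/>\n    2.000  \n"
        + "<BR/>\n<BR/>\n<BR/>\nMore information\n<BR/>\n<BR/>\n"
        + "</TD></TR></TABLE>";

    // The expression from the answer, including the normalize-space filter.
    static final String BETWEEN_BR =
        "//TABLE[@CLASS='datadisplaytable']/TR/TD[@CLASS='ntdefault']"
        + "/text()[preceding-sibling::node()[1][self::BR]"
        + " and following-sibling::node()[1][self::BR]]"
        + "[normalize-space(.) != '']";

    // Returns the trimmed value of every non-blank text node that sits directly between two BR elements.
    static List<String> textBetweenBr(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance()
            .newXPath()
            .evaluate(BETWEEN_BR, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getNodeValue().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textBetweenBr(SAMPLE));
    }
}
```

For this stand-in markup the list holds the three non-blank nodes described above: the two numbers followed by "More information". Appending [2] to the expression would narrow it to the node containing 2.000.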

This may help:

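   import java.util.ArrayList;
   import java.util.List;
   import org.openqa.selenium.By;
   import org.openqa.selenium.WebElement;

   // Assumes `driver` is an initialized WebDriver and that the browser's DOM contains a <tbody>.
   WebElement table = driver.findElement(By.xpath("//table[@class='datadisplaytable']"));
   WebElement tbody = table.findElement(By.tagName("tbody"));
   List<WebElement> rows = tbody.findElements(By.tagName("tr"));
   System.out.println("Row size: " + rows.size());
   List<String> list = new ArrayList<>();

   for (WebElement row : rows)
   {
     // Look up cells relative to the current row ("./td"); an XPath that starts
     // with "//" searches the whole document and returns the same cell every time.
     for (WebElement column : row.findElements(By.xpath("./td")))
     {
        // getText() returns the whole visible text of the cell.
        String text = column.getText().trim();
        if (text.contains("."))
        {
           System.out.println("text : " + text);
           list.add(text);
        }
     }
   }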

I believe that BR elements are not considered enclosing tags, so you will not be able to extract the "enclosing text". You will probably have to extract the text enclosed in your TD CLASS="ntdefault", where all the BRs are going to get translated into newlines. You will then have to perform string manipulation to extract only the parts you are interested in.
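
The string manipulation step could look roughly like the sketch below. It assumes the cell text has already been fetched (for example with getText() on the TD element) and that the values of interest are plain decimals such as 3.000. The class name BrTextLines, the method extractNumbers, and the pattern are illustrative choices, not something taken from the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class BrTextLines {
    // Assumption: the wanted values are plain decimals like "3.000"; adjust the pattern for other formats.
    private static final Pattern DECIMAL = Pattern.compile("\\d+\\.\\d+");

    // Splits the cell text on line breaks and keeps only the lines that are a decimal number.
    static List<String> extractNumbers(String cellText) {
        List<String> numbers = new ArrayList<>();
        for (String line : cellText.split("\\R")) {
            String trimmed = line.trim();
            if (DECIMAL.matcher(trimmed).matches()) {
                numbers.add(trimmed);
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Hand-written stand-in for what the cell text might look like once each BR has become a newline.
        String cellText = "Further information on subject\n3.000\n2.000\n\n\n\n\nMore information";
        System.out.println(extractNumbers(cellText));
    }
}
```

Matching whole lines against a pattern, rather than checking for a "." anywhere in the text, keeps sentences that happen to contain a period out of the result.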


 