Retrieving empty and non-empty nodes with XPath

I am trying to get a good tabular representation of an XML document. To keep it simple, let's say we have the following XML:

<div>
    <em>5</em>
    <em></em>
    <em></em>
    <em>A</em>
</div>

Ideally I would like to convert this into a table with one column:

| em |
------
| "5"| 
| "" |
| "" |
| "A"|

(I used quotes here to show clearly that I want the empty nodes as well.)

I tried several XPath queries. The simplest one is something I tested in R, where I get:

library(xml2)
z = read_xml("<div>
        <em>5</em>
        <em></em>
        <em></em>
        <em>A</em>
</div>")
z

xml_find_all(z,"//*[name() = 'em']/text()")

{xml_nodeset (2)}
[1] 5
[2] A

Most other questions are only about detecting empty or non-empty nodes, or about selecting the first non-empty one, and I don't see how I can use that here.

One idea I had was to use concat() to add some string to all nodes (including the empty ones). However, as far as I know this is an XPath 2.0 solution, so it is not viable here.

The final solution (extracting information from this XML) will be implemented in Hive. I use some SerDe functionality to get the information, which is then stored as arrays. I then want to convert these arrays into a normal table, but that is not possible if the missing values are not retrieved, because the arrays end up with different lengths.

In R you can do:

library(xml2)
library(magrittr)
z = read_xml("<div>
             <em>5</em>
             <em></em>
             <em></em>
             <em>A</em>
        </div>")
z %>% 
    xml_find_all('em') %>% 
    xml_text()

#> [1] "5" ""  ""  "A"

Or, without the pipe:

library(xml2) 
xml_text(xml_find_all(z, 'em'))
#> [1] "5" ""  ""  "A"
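
The difference from the query in the question is what gets selected: an empty <em></em> has no text node child, so a path ending in /text() can only ever return two nodes, while selecting the em elements and then taking their text keeps one value per element. Below is a minimal sketch of the same contrast in Python's standard library (not part of the original answer; the variable names are only illustrative). ElementTree reports the missing text of an empty element as None, which is mapped to "" here.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<div><em>5</em><em></em><em></em><em>A</em></div>")

# Select the <em> elements themselves: one entry per element, empty or not.
ems = root.findall("em")

# Roughly what a .../text() query yields: only elements that have a text node.
text_only = [em.text for em in ems if em.text is not None]

# One value per element; an empty element has text None, mapped to "".
per_element = [em.text or "" for em in ems]

print(text_only)    # ['5', 'A']
print(per_element)  # ['5', '', '', 'A']
```

Because per_element always has one entry per em element, arrays extracted this way stay aligned with each other.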

It is possible to do this in Hive with xpath(). Unfortunately, Hive implements XPath 1.0, so the functions that would handle missing values more elegantly are not available.

The only way I found to deal with it is to use an 'or' (the | union operator) in the XPath expression, so that a default value is output when an element has no text. In your case there is no default value to fall back on, so I add one as an attribute with regexp_replace():

select xplode.*
from (select 0) t
lateral view explode(
    xpath(
        regexp_replace('<div><em>5</em><em></em><em></em><em>A</em></div>', '<em>', '<em dflt = "">'),
        'div/em/text() | div/em[not(./text())]/@dflt'
    )
) xplode as em;
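
To make the logic of that expression easier to follow: for each em it returns the text node when one exists, and otherwise the dflt attribute that regexp_replace() injected. The sketch below emulates that fallback in Python's standard library. It is an illustration under assumptions, not Hive's evaluator: as far as I know, ElementTree's limited XPath subset does not cover the union, so the choice is written as a plain conditional.

```python
import re
import xml.etree.ElementTree as ET

xml = "<div><em>5</em><em></em><em></em><em>A</em></div>"

# Same rewrite as regexp_replace(..., '<em>', '<em dflt = "">') in the Hive query.
patched = re.sub(r"<em>", '<em dflt = "">', xml)
root = ET.fromstring(patched)

# Emulates 'div/em/text() | div/em[not(./text())]/@dflt':
# take the text if the element has any, otherwise the injected default attribute.
values = [
    em.text if em.text is not None else em.get("dflt")
    for em in root.findall("em")
]

print(values)  # ['5', '', '', 'A']
```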
