
XML parsing with Pig Latin

I am very new to both Hadoop and Pig. I have been able to write a number of simple programs, but one that is taxing me is processing XML when part of the XML file is malformed.

I can use XMLLoader('tag') to get all of the tags from an XML file, which is great. However, if one of them is missing a well-formed closing tag, Pig stops at that one. For example:

<tag>
</tag>
<tag>
</tag1>
<tag>
</tag>

This will only pick up the first valid tag. I have experience with JAQL, where I am able to ignore the error record so that the application goes on to pick up the second valid tag.

My question is: is there a way to handle poorly formatted XML using Pig, rather than JAQL?

I've been looking at the Pig XMLLoader code, and what appears to be happening with the malformed tag is that the loader never notices that the tag ends, and has no way of noticing that it has entered a new main tag. There appears to be no way to get around this with XMLLoader as it currently stands.

It might, however, be possible to modify XMLLoader so that it works in the manner you want, probably by changing the conditions in the skipToTag method so that if it runs into another instance of the specified opening tag, it skips ahead to that one and ignores the malformed tag. Keep in mind that this will break if you have nested tags with the same name (e.g. address as the root, but also address as an element lower in the document), so it isn't foolproof.
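The resynchronisation idea can be illustrated outside of Pig. The following is only a sketch in Python of the scanning logic described above, not the actual XMLLoader or skipToTag code; the function name `extract_elements` is hypothetical. It assumes the record tag is never nested inside itself and ignores self-closing tags.

```python
import re


def extract_elements(text, tag):
    """Return complete <tag>...</tag> blocks found in text.

    If a second opening <tag> appears before the closing </tag> of the
    current one, the current element is treated as malformed and dropped,
    and scanning restarts from the newer opening tag.
    """
    # Matches <tag> or <tag attr="..."> but not <tag1>.
    open_re = re.compile(r"<%s(?:\s[^>]*)?>" % re.escape(tag))
    close = "</%s>" % tag
    results = []
    match = open_re.search(text)
    while match:
        close_at = text.find(close, match.end())
        if close_at == -1:
            break  # no closing tag left, so no further complete element
        next_open = open_re.search(text, match.end())
        if next_open and next_open.start() < close_at:
            # Another opening tag came first: abandon the current element.
            match = next_open
            continue
        end = close_at + len(close)
        results.append(text[match.start():end])
        match = open_re.search(text, end)
    return results


sample = "<tag>\n</tag>\n<tag>\n</tag1>\n<tag>\n</tag>\n"
print(extract_elements(sample, "tag"))
```

On the example from the question, this returns the first and third elements and skips the one closed with `</tag1>`. As noted above, an element that legitimately contains a child with the same name would be wrongly discarded.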

It would seem, however, that in most cases validating the XML beforehand might be a better option, or having a pre-processor extract only the valid XML to a file which Pig then runs on.
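A pre-processor along those lines could look roughly like the sketch below. It is an assumption-laden example rather than a tested pipeline step: the function name `well_formed_records` is hypothetical, the input is assumed to be small enough to read into memory, and the text is assumed to consist only of records that each start with the given opening tag. Each candidate chunk is checked with Python's standard `xml.etree.ElementTree` parser and kept only if it parses.

```python
import re
import xml.etree.ElementTree as ET


def well_formed_records(text, tag):
    """Split text at each opening <tag> and keep the chunks that parse as XML."""
    open_re = re.compile(r"<%s[\s>]" % re.escape(tag))
    starts = [m.start() for m in open_re.finditer(text)]
    records = []
    # Each chunk runs from one opening tag up to the next (or end of text).
    for begin, end in zip(starts, starts[1:] + [len(text)]):
        chunk = text[begin:end].strip()
        try:
            ET.fromstring(chunk)  # raises ParseError on malformed XML
        except ET.ParseError:
            continue  # drop the malformed record
        records.append(chunk)
    return records


sample = "<tag>\n</tag>\n<tag>\n</tag1>\n<tag>\n</tag>\n"
for record in well_formed_records(sample, "tag"):
    print(record)
```

The surviving records could then be written to a new file and loaded with XMLLoader as usual. Because the split happens at every opening tag, this shares the nested-same-name limitation mentioned earlier, and any stray text after the last record would cause that record to be rejected.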

Hope this helps.


 