
Melting repeated child nodes XML into a tidy data set using R

I am trying to build a data mash-up from a wide variety of security controls in R. I have had great success with the devices that output CSV, JSON, etc., but XML is really tripping me up. You will quickly see that I am not the boss R developer I wish to be, but I would greatly appreciate any help anyone could provide. Here is a simplified version of the XML I am trying to parse.

 <devices>
    <host id="169274" persistent_id="21741">
      <ip>some_IP_here</ip>
      <hostname>Some_DNS_name_here </hostname>
      <netbiosname>Some_NetBios_Name_here</netbiosname>
      <hscore>663</hscore>
      <howner>4</howner>
      <assetvalue>4</assetvalue>
      <os>Unix Variant</os>
      <nbtshares/>
      <fndvuln id="534" port="80" proto="tcp"/>
      <fndvuln id="1191" port="22" proto="tcp"/>
    </host>
    <host id="169275" persistent_id="21003">
      <ip>some_IP_here</ip>
      <hostname>Some_DNS_name_here </hostname>
      <netbiosname>Some_NetBios_Name_here</netbiosname>
      <hscore>0</hscore>
      <howner>4</howner>
      <assetvalue>4</assetvalue>
      <os>OS Undetermined</os>
      <nbtshares/>
      <fndvuln id="5452" port="ip" proto="ip"/>
      <fndvuln id="5092" port="123" proto="udp"/>
      <fndvuln id="16157" port="123" proto="udp"/>
    </host>
</devices>

The end result that I am hoping to achieve is a tidy R data frame that I can use for analysis. In a perfect world it would look as follows:

    host           ip           hostname            netbiosname VulnID port protocol
1 169274 some_IP_here Some_DNS_name_here Some_NetBios_Name_here    534   80      tcp
2 169274 some_IP_here Some_DNS_name_here Some_NetBios_Name_here   1191   22      tcp
3 169275 some_IP_here Some_DNS_name_here Some_NetBios_Name_here   5452   ip       ip
4 169275 some_IP_here Some_DNS_name_here Some_NetBios_Name_here   5092  123      udp
5 169275 some_IP_here Some_DNS_name_here Some_NetBios_Name_here  16157  123      udp

At the simplest level, I have no problem parsing the XML and extracting the data I need to build the basic data frame. However, I struggle with how to iterate through the parsed XML and create a separate row for each fndvuln element that appears in its parent host node.

My guess is that it is best to load each element individually and then bind them at the end. I am thinking this would allow me to use sapply to run through the various instances of fndvuln and create a separate entry for each. So far, I have this for the basic structure:

library(XML)

setwd("My_file_location_here")

xmlfile <- "vuln.xml"
xmldoc <- xmlParse(xmlfile)
vuln <- getNodeSet(xmldoc, "//host")
x <- lapply(vuln, function(x) data.frame(
  host        = xpathSApply(x, ".", xmlGetAttr, "id"),
  ip          = xpathSApply(x, ".//ip", xmlValue),
  hostname    = xpathSApply(x, ".//hostname", xmlValue),
  netbiosname = xpathSApply(x, ".//netbiosname", xmlValue)))

do.call("rbind", x)

Which basically gives me this:

    host           ip            hostname            netbiosname
1 169274 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here
2 169275 some_IP_here Some_DNS_name_here  Some_NetBios_Name_here

Not sure how I would go about doing the rest. Also, because this device will kick out quite a hefty XML file, knowing how to do this efficiently would be my end goal.

The host, ip, hostname, etc. values will be repeated automatically when you add the fndvuln attributes to your data.frame, because data.frame recycles a length-one column to match the longer ones (try data.frame("a", 1:3)).
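To see that recycling in isolation, here is a minimal base-R sketch. The column names host and VulnID are just illustrative labels, and the values are borrowed from the sample XML.

```r
# A length-one value ("169274") is recycled to match the three-element column
df <- data.frame(host = "169274", VulnID = c("534", "1191", "5452"))
print(df)
```

Each host-level value therefore only needs to be extracted once per host node; the fndvuln columns determine how many rows that host contributes.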

x <- lapply(vuln, function(x) data.frame(
  host     = xpathSApply(x, ".", xmlGetAttr, "id"),
  ip       = xpathSApply(x, ".//ip", xmlValue),
  hostname = xpathSApply(x, ".//hostname", xmlValue),
  VulnID   = xpathSApply(x, ".//fndvuln", xmlGetAttr, "id"),
  port     = xpathSApply(x, ".//fndvuln", xmlGetAttr, "port")))

do.call("rbind", x)
    host           ip            hostname VulnID port
1 169274 some_IP_here Some_DNS_name_here     534   80
2 169274 some_IP_here Some_DNS_name_here    1191   22
3 169275 some_IP_here Some_DNS_name_here    5452   ip
4 169275 some_IP_here Some_DNS_name_here    5092  123
5 169275 some_IP_here Some_DNS_name_here   16157  123
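
The same pattern extends to the netbiosname and protocol columns from the desired output. The sketch below is one possible way to do that, not a definitive implementation. It makes a few assumptions that are not in the original post: the XML is inlined as a trimmed string so the example is self-contained, a third host (id 169276) with no fndvuln children is invented to show an edge case, and the helper name host_to_rows is hypothetical. Hosts without findings are skipped here because recycling a length-one value against a zero-length column makes data.frame fail; whether to drop such hosts or keep them with NA values is a choice you would need to make for your own data.

```r
library(XML)

# Trimmed, inlined copy of the sample XML; host 169276 is an invented edge case
xml_text <- '<devices>
  <host id="169274" persistent_id="21741">
    <ip>some_IP_here</ip>
    <hostname>Some_DNS_name_here</hostname>
    <netbiosname>Some_NetBios_Name_here</netbiosname>
    <fndvuln id="534" port="80" proto="tcp"/>
    <fndvuln id="1191" port="22" proto="tcp"/>
  </host>
  <host id="169275" persistent_id="21003">
    <ip>some_IP_here</ip>
    <hostname>Some_DNS_name_here</hostname>
    <netbiosname>Some_NetBios_Name_here</netbiosname>
    <fndvuln id="5452" port="ip" proto="ip"/>
  </host>
  <host id="169276" persistent_id="21004">
    <ip>some_IP_here</ip>
    <hostname>Some_DNS_name_here</hostname>
    <netbiosname>Some_NetBios_Name_here</netbiosname>
  </host>
</devices>'

xmldoc <- xmlParse(xml_text, asText = TRUE)
hosts  <- getNodeSet(xmldoc, "//host")

# Hypothetical helper: one row per <fndvuln> child, host-level values recycled
host_to_rows <- function(node) {
  vulns <- getNodeSet(node, "./fndvuln")
  if (length(vulns) == 0) return(NULL)  # assumption: skip hosts with no findings
  data.frame(
    host        = xmlGetAttr(node, "id"),
    ip          = xpathSApply(node, "./ip", xmlValue),
    hostname    = xpathSApply(node, "./hostname", xmlValue),
    netbiosname = xpathSApply(node, "./netbiosname", xmlValue),
    VulnID      = sapply(vulns, xmlGetAttr, "id"),
    port        = sapply(vulns, xmlGetAttr, "port"),
    protocol    = sapply(vulns, xmlGetAttr, "proto"),
    stringsAsFactors = FALSE
  )
}

# Drop the NULL entries before binding the per-host data frames together
rows   <- Filter(Negate(is.null), lapply(hosts, host_to_rows))
result <- do.call(rbind, rows)
print(result)
```

Fetching the fndvuln nodes once per host and reading all three attributes from that node set avoids running the same XPath query repeatedly, which may matter for a large file.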
