I am trying to build a data mash-up from a wide variety of security controls in R. I have had great success with the devices that output CSV, JSON, etc., but XML is really tripping me up. You will quickly see that I am not the boss R developer I wish to be, but I would greatly appreciate any help one could provide. Here is a simplified version of the XML I am trying to parse:
<devices>
<host id="169274" persistent_id="21741">
<ip>some_IP_here</ip>
<hostname>Some_DNS_name_here </hostname>
<netbiosname>Some_NetBios_Name_here</netbiosname>
<hscore>663</hscore>
<howner>4</howner>
<assetvalue>4</assetvalue>
<os>Unix Variant</os>
<nbtshares/>
<fndvuln id="534" port="80" proto="tcp"/>
<fndvuln id="1191" port="22" proto="tcp"/>
</host>
<host id="169275" persistent_id="21003">
<ip>some_IP_here</ip>
<hostname>Some_DNS_name_here </hostname>
<netbiosname>Some_NetBios_Name_here</netbiosname>
<hscore>0</hscore>
<howner>4</howner>
<assetvalue>4</assetvalue>
<os>OS Undetermined</os>
<nbtshares/>
<fndvuln id="5452" port="ip" proto="ip"/>
<fndvuln id="5092" port="123" proto="udp"/>
<fndvuln id="16157" port="123" proto="udp"/>
</host>
</devices>
The end result that I am hoping to achieve is a tidy R data frame that I can use for analysis. In a perfect world it would look as follows:
    host           ip           hostname            netbiosname VulnID port protocol
1 169274 some_IP_here Some_DNS_name_here Some_NetBios_Name_here    534   80      tcp
2 169274 some_IP_here Some_DNS_name_here Some_NetBios_Name_here   1191   22      tcp
3 169275 some_IP_here Some_DNS_name_here Some_NetBios_Name_here   5452   ip       ip
4 169275 some_IP_here Some_DNS_name_here Some_NetBios_Name_here   5092  123      udp
5 169275 some_IP_here Some_DNS_name_here Some_NetBios_Name_here  16157  123      udp
On the simplest level, I have no problem parsing the XML and extracting the data I need to build the basic data frame. However, I struggle with how to iterate through the parsed XML and create a separate row for each fndvuln element that appears in the parent host node.
So far, I am guessing it is best to load each element individually and then bind them at the end. I am thinking this would allow me to use sapply to run through the various instances of fndvuln and create a separate entry for each. So far, I have this for the basic structure:
library(XML)
setwd("My_file_location_here")
xmlfile <- "vuln.xml"
xmldoc <- xmlParse(xmlfile)
# One node per <host> element
vuln <- getNodeSet(xmldoc, "//host")
# Build a one-row data frame per host, then stack them
x <- lapply(vuln, function(node) data.frame(
  host = xpathSApply(node, ".", xmlGetAttr, "id"),
  ip = xpathSApply(node, ".//ip", xmlValue),
  hostname = xpathSApply(node, ".//hostname", xmlValue),
  netbiosname = xpathSApply(node, ".//netbiosname", xmlValue)))
do.call("rbind", x)
Which basically gives me this:
    host           ip           hostname            netbiosname
1 169274 some_IP_here Some_DNS_name_here Some_NetBios_Name_here
2 169275 some_IP_here Some_DNS_name_here Some_NetBios_Name_here
Not sure how I would go about doing the rest. Also, because this device will kick out quite a hefty XML file, knowing how to do this efficiently would be my end goal.
The host, ip, hostname, etc. values will be repeated on every row when you add the fndvuln elements to your data.frame, because data.frame() recycles length-one columns to match the longest column (try data.frame("a", 1:3)):
x <- lapply(vuln, function(node) data.frame(
  host = xpathSApply(node, ".", xmlGetAttr, "id"),
  ip = xpathSApply(node, ".//ip", xmlValue),
  hostname = xpathSApply(node, ".//hostname", xmlValue),
  VulnID = xpathSApply(node, ".//fndvuln", xmlGetAttr, "id"),
  port = xpathSApply(node, ".//fndvuln", xmlGetAttr, "port")))
do.call("rbind", x)
    host           ip           hostname VulnID port
1 169274 some_IP_here Some_DNS_name_here    534   80
2 169274 some_IP_here Some_DNS_name_here   1191   22
3 169275 some_IP_here Some_DNS_name_here   5452   ip
4 169275 some_IP_here Some_DNS_name_here   5092  123
5 169275 some_IP_here Some_DNS_name_here  16157  123
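The same recycling approach can be extended to the netbiosname and protocol columns asked for in the question. The following is only a sketch under a few assumptions that are not in the original post: the helper name host_to_rows is hypothetical, the XML is inlined (with an invented third host, id 169276, that has no fndvuln children) so the example runs without vuln.xml, and hosts without any fndvuln are simply dropped. That last choice matters because a length-one column cannot be recycled against zero rows, so the per-host data.frame() call may fail for such hosts. The trailing space in hostname is trimmed with trimws(), which is also a choice made here rather than something the question requested.

```r
library(XML)

# Inline sample so the sketch is self-contained; host 169276 is invented
# here to show a host that has no <fndvuln> children.
xml_text <- '<devices>
  <host id="169274" persistent_id="21741">
    <ip>some_IP_here</ip>
    <hostname>Some_DNS_name_here </hostname>
    <netbiosname>Some_NetBios_Name_here</netbiosname>
    <fndvuln id="534" port="80" proto="tcp"/>
    <fndvuln id="1191" port="22" proto="tcp"/>
  </host>
  <host id="169276" persistent_id="21999">
    <ip>some_IP_here</ip>
    <hostname>Some_DNS_name_here </hostname>
    <netbiosname>Some_NetBios_Name_here</netbiosname>
  </host>
</devices>'

xmldoc <- xmlParse(xml_text, asText = TRUE)
hosts <- getNodeSet(xmldoc, "//host")

# Hypothetical helper: one row per <fndvuln> child of a single <host> node.
host_to_rows <- function(node) {
  vuln_ids <- xpathSApply(node, "./fndvuln", xmlGetAttr, "id")
  # Assumption: hosts with no findings are skipped rather than kept as NA rows.
  if (length(vuln_ids) == 0) return(NULL)
  data.frame(
    host = xmlGetAttr(node, "id"),
    ip = xpathSApply(node, "./ip", xmlValue),
    hostname = trimws(xpathSApply(node, "./hostname", xmlValue)),
    netbiosname = xpathSApply(node, "./netbiosname", xmlValue),
    VulnID = vuln_ids,
    port = xpathSApply(node, "./fndvuln", xmlGetAttr, "port"),
    protocol = xpathSApply(node, "./fndvuln", xmlGetAttr, "proto"),
    stringsAsFactors = FALSE
  )
}

# Drop the NULLs from skipped hosts, then bind everything in one call.
rows <- Filter(Negate(is.null), lapply(hosts, host_to_rows))
result <- do.call(rbind, rows)
print(result)
```

Binding once with do.call(rbind, ...) at the end, rather than growing a data frame inside a loop, avoids repeated copying. If hosts without findings should stay in the result, the helper could return a single row with NA in the VulnID, port and protocol columns instead of NULL.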