简体   繁体   中英

XML to data frame - R

I'm having difficulty getting a XML file into R as a data frame. The data I working with is free to download from The Securities and Exchange Commission and can be found HERE .

The first downloadable option SEC Investment Advisor Report (37 MB) is the one I'll use as an example. After I download the file, I unzip it in my terminal (mac) with the following code: gunzip -d IA_FIRM_SEC_Feed_02_05_2017.xml.gz . This unzips into XML format. And from here, I've been struggling to get this into an R data frame. I've tried different variations of code using the XML package, but can only get one variable or two variables into the data frame. Here is an example of what the data looks like when I open the XML file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDFirmSECReport GenOn="2017-02-05">
<Firms>
<Firm>
<Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>
<MainAddr Strt1="280 S 400 W" Strt2="SUITE 220" City="SALT LAKE CITY" State="UT" Cntry="United States" PostlCd="84101" PhNb="801.746.7171" FaxNb="801.363.5886"/>
<MailingAddr/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="2004-04-07"/>
<NoticeFiled>
<States RgltrCd="AZ" St="FILED" Dt="2009-12-10"/>
<States RgltrCd="UT" St="FILED" Dt="2009-12-10"/>
</NoticeFiled>
<Filing Dt="2016-11-09" FormVrsn="10/2012"/>
<FormInfo>
<Part1A>
<Item1 Q1I="Y" Q1M="N" Q1N="N" Q1O="N">
<WebAddrs>
<WebAddr>HTTP://WWW.JACOBSENCAPITAL.COM</WebAddr>
</WebAddrs>
</Item1>
<Item2A Q2A1="Y" Q2A2="N" Q2A3="N" Q2A4="N" Q2A5="N" Q2A6="N" Q2A7="N" Q2A8="N" Q2A9="N" Q2A10="N" Q2A11="N" Q2A12="N" Q2A13="N"/>
<Item2B/>
<Item3A OrgFormNm="Limited Liability Company"/>
<Item3B Q3B="NOVEMBER"/>
<Item3C StateCD="UT" CntryNm="United States"/>
<Item5A TtlEmp="4"/>
<Item5B Q5B1="2" Q5B2="0" Q5B3="4" Q5B4="0" Q5B5="0" Q5B6="0"/>
<Item5C Q5C1="26-100" Q5C2="0"/>
<Item5D Q5D1A="Up to 10 percent" Q5D1B="76-99 percent" Q5D1C="0 percent" Q5D1D="0 percent" Q5D1E="0 percent" Q5D1F="0 percent" Q5D1G="Up to 10 percent" Q5D1H="Up to 10 percent" Q5D1I="0 percent" Q5D1J="0 percent" Q5D1K="0 percent" Q5D1L="0 percent" Q5D1M="0 percent" Q5D2A="Up to 25 percent" Q5D2B="More than 75 percent" Q5D2C="0 percent" Q5D2D="0 percent" Q5D2E="0 percent" Q5D2F="0 percent" Q5D2G="Up to 25 percent" Q5D2H="Up to 25 percent" Q5D2I="0 percent" Q5D2J="0 percent" Q5D2K="0 percent" Q5D2L="0 percent" Q5D2M="0 percent"/>
<Item5E Q5E1="Y" Q5E2="Y" Q5E3="N" Q5E4="Y" Q5E5="N" Q5E6="N" Q5E7="N"/>
<Item5F Q5F1="Y" Q5F2A="190000000" Q5F2B="0" Q5F2C="190000000" Q5F2D="342" Q5F2E="0" Q5F2F="342"/>
<Item5G Q5G1="Y" Q5G2="Y" Q5G3="N" Q5G4="N" Q5G5="N" Q5G6="Y" Q5G7="Y" Q5G8="N" Q5G9="N" Q5G10="N" Q5G11="N" Q5G12="N"/>
<Item5H Q5H="26-50"/>
<Item5I Q5I1="N" Q5I2="N"/>
<Item5J Q5J="N"/>
<Item6A/>
<Item6B Q6B1="N" Q6B3="N"/>
<Item7A Q7A1="N" Q7A2="N" Q7A3="N" Q7A4="N" Q7A5="N" Q7A6="N" Q7A7="N" Q7A8="N" Q7A9="N" Q7A10="Y" Q7A11="N" Q7A12="N" Q7A13="N" Q7A14="N" Q7A15="N" Q7A16="N"/>
<Item7B Q7B="N"/>
<Item8A Q8A1="N" Q8A2="Y" Q8A3="N"/>
<Item8B Q8B1="N" Q8B2="N" Q8B3="N"/>
<Item8C Q8C1="Y" Q8C2="Y" Q8C3="N" Q8C4="N"/>
<Item8D/>
<Item8E Q8E="N"/>
<Item8F/>
<Item8G Q8G1="N"/>
<Item8H Q8H="N"/>
<Item8I Q8I="N"/>
<Item9A Q9A1A="N" Q9A1B="N"/>
<Item9B Q9B1A="N" Q9B1B="N"/>
<Item9C Q9C1="N" Q9C2="N" Q9C3="N" Q9C4="N"/>
<Item9D Q9D1="N" Q9D2="N"/>
<Item9E/>
<Item9F Q9F="1"/>
<Item10A Q10A="N"/>
<Item11 Q11="N"/>
<Item11A Q11A1="N" Q11A2="N"/>
<Item11B Q11B1="N" Q11B2="N"/>
<Item11C Q11C1="N" Q11C2="N" Q11C3="N" Q11C4="N" Q11C5="N"/>
<Item11D Q11D1="N" Q11D2="N" Q11D3="N" Q11D4="N" Q11D5="N"/>
<Item11E Q11E1="N" Q11E2="N" Q11E3="N" Q11E4="N"/>
<Item11F Q11F="N"/>
<Item11G Q11G="N"/>
<Item11H Q11H1A="N" Q11H1B="N" Q11H1C="N" Q11H2="N"/>
</Part1A>
</FormInfo>
</Firm>
<Firm>

It looks like the variables are listed directly before the equals sign and the content(row item) is listed directly after the equals sign. For example in the line: <Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>

I would want the data frame to look as follows:

SECRgnCD FirmCrdNB  SECNb      BusNm
DRO       116085    801-63010  JACOBSEN CAPITAL MANAGEMENT

Thanks for your help.

The following code should work;

 require(XML)
 data <- xmlParse('./IA_FIRM_SEC_Feed_02_05_2017.xml')
 result <- as.data.frame(t(xmlSApply(data["//Firms/Firm/Info"],xmlAttrs)),
                    stringsAsFactors=FALSE)

Consider even using XML's internal variable, xmlAttrsToDataFrame() using the triple colon operator:

library(XML)

df <- XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//Info'))
# OR df <- XML:::xmlAttrsToDataFrame(doc['//Info'])

df
#   SECRgnCD FirmCrdNb     SECNb                       BusNm                          LegalNm
# 1      DRO    116085 801-63010 JACOBSEN CAPITAL MANAGEMENT JACOBSEN CAPITAL MANAGEMENT, LLC

And per OP's update to capture all attributes under <Firm> , use above method in a lapply loop. Below trycatch is used to capture elements without attributes that raises a warning:

# CAPTURE ALL ELEMENT NAMES UNDER <Firm> AND <Part1A>
firmsnames <- append(xpathSApply(doc, "//Firm/*", xmlName),
                     xpathSApply(doc, "//Part1A/*", xmlName))

# ITERATE THROUGH EACH ELEMENT BINDING ATTRS TO DATAFRAME    
dfList <- lapply(firmsnames, function(n) {
  tryCatch({
    return(XML:::xmlAttrsToDataFrame(doc[paste0('//', n)]))
  }, warning=function(w) return(data.frame(NA)))
})

# FILTER OUT EMPTY DF ELEMENTS IN LIST
dfList <- Filter(function(i) length(i)>0, dfList) 

# COLUMN BIND ALL LIST ELEMENTS
df <- do.call(cbind, dfList)

# RETURNS A DATA FRAME OF 1 ROW AND 182 COLUMNS 
str(df)
# data.frame':  1 obs. of  182 variables:
#  $ SECRgnCD : Factor w/ 1 level "DRO": 1
#   ..- attr(*, "names")= chr "SECRgnCD"
#  $ FirmCrdNb: Factor w/ 1 level "116085": 1
#   ..- attr(*, "names")= chr "FirmCrdNb"
#  $ SECNb    : Factor w/ 1 level "801-63010": 1
#   ..- attr(*, "names")= chr "SECNb"
#  $ BusNm    : Factor w/ 1 level "JACOBSEN CAPITAL MANAGEMENT": 1
#   ..- attr(*, "names")= chr "BusNm"
#  $ LegalNm  : Factor w/ 1 level "JACOBSEN CAPITAL MANAGEMENT, LLC": 1
#   ..- attr(*, "names")= chr "LegalNm"
# ...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM