简体   繁体   English

XML到数据框架 - R.

[英]XML to data frame - R

I'm having difficulty getting a XML file into R as a data frame. 我很难将XML文件作为数据框导入R中。 The data I working with is free to download from The Securities and Exchange Commission and can be found HERE . 我使用的数据可以从证券交易委员会免费下载,可以在这里找到。

The first downloadable option SEC Investment Advisor Report (37 MB) is the one I'll use as an example. 第一个可下载的选项SEC Investment Advisor Report (37 MB)是我将用作示例的那个。 After I download the file, I unzip it in my terminal (mac) with the following code: gunzip -d IA_FIRM_SEC_Feed_02_05_2017.xml.gz . 下载文件后,我使用以下代码在终端(mac)中解压缩它: gunzip -d IA_FIRM_SEC_Feed_02_05_2017.xml.gz This unzips into XML format. 这解压缩为XML格式。 And from here, I've been struggling to get this into an R data frame. 从这里开始,我一直在努力将其纳入R数据框架。 I've tried different variations of code using the XML package, but can only get one variable or two variables into the data frame. 我已经使用XML包尝试了不同的代码变体,但只能在数据框中获得一个变量或两个变量。 Here is an example of what the data looks like when I open the XML file: 以下是打开XML文件时数据的示例:

<?xml version="1.0" encoding="ISO-8859-1"?>
<IAPDFirmSECReport GenOn="2017-02-05">
<Firms>
<Firm>
<Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>
<MainAddr Strt1="280 S 400 W" Strt2="SUITE 220" City="SALT LAKE CITY" State="UT" Cntry="United States" PostlCd="84101" PhNb="801.746.7171" FaxNb="801.363.5886"/>
<MailingAddr/>
<Rgstn FirmType="Registered" St="APPROVED" Dt="2004-04-07"/>
<NoticeFiled>
<States RgltrCd="AZ" St="FILED" Dt="2009-12-10"/>
<States RgltrCd="UT" St="FILED" Dt="2009-12-10"/>
</NoticeFiled>
<Filing Dt="2016-11-09" FormVrsn="10/2012"/>
<FormInfo>
<Part1A>
<Item1 Q1I="Y" Q1M="N" Q1N="N" Q1O="N">
<WebAddrs>
<WebAddr>HTTP://WWW.JACOBSENCAPITAL.COM</WebAddr>
</WebAddrs>
</Item1>
<Item2A Q2A1="Y" Q2A2="N" Q2A3="N" Q2A4="N" Q2A5="N" Q2A6="N" Q2A7="N" Q2A8="N" Q2A9="N" Q2A10="N" Q2A11="N" Q2A12="N" Q2A13="N"/>
<Item2B/>
<Item3A OrgFormNm="Limited Liability Company"/>
<Item3B Q3B="NOVEMBER"/>
<Item3C StateCD="UT" CntryNm="United States"/>
<Item5A TtlEmp="4"/>
<Item5B Q5B1="2" Q5B2="0" Q5B3="4" Q5B4="0" Q5B5="0" Q5B6="0"/>
<Item5C Q5C1="26-100" Q5C2="0"/>
<Item5D Q5D1A="Up to 10 percent" Q5D1B="76-99 percent" Q5D1C="0 percent" Q5D1D="0 percent" Q5D1E="0 percent" Q5D1F="0 percent" Q5D1G="Up to 10 percent" Q5D1H="Up to 10 percent" Q5D1I="0 percent" Q5D1J="0 percent" Q5D1K="0 percent" Q5D1L="0 percent" Q5D1M="0 percent" Q5D2A="Up to 25 percent" Q5D2B="More than 75 percent" Q5D2C="0 percent" Q5D2D="0 percent" Q5D2E="0 percent" Q5D2F="0 percent" Q5D2G="Up to 25 percent" Q5D2H="Up to 25 percent" Q5D2I="0 percent" Q5D2J="0 percent" Q5D2K="0 percent" Q5D2L="0 percent" Q5D2M="0 percent"/>
<Item5E Q5E1="Y" Q5E2="Y" Q5E3="N" Q5E4="Y" Q5E5="N" Q5E6="N" Q5E7="N"/>
<Item5F Q5F1="Y" Q5F2A="190000000" Q5F2B="0" Q5F2C="190000000" Q5F2D="342" Q5F2E="0" Q5F2F="342"/>
<Item5G Q5G1="Y" Q5G2="Y" Q5G3="N" Q5G4="N" Q5G5="N" Q5G6="Y" Q5G7="Y" Q5G8="N" Q5G9="N" Q5G10="N" Q5G11="N" Q5G12="N"/>
<Item5H Q5H="26-50"/>
<Item5I Q5I1="N" Q5I2="N"/>
<Item5J Q5J="N"/>
<Item6A/>
<Item6B Q6B1="N" Q6B3="N"/>
<Item7A Q7A1="N" Q7A2="N" Q7A3="N" Q7A4="N" Q7A5="N" Q7A6="N" Q7A7="N" Q7A8="N" Q7A9="N" Q7A10="Y" Q7A11="N" Q7A12="N" Q7A13="N" Q7A14="N" Q7A15="N" Q7A16="N"/>
<Item7B Q7B="N"/>
<Item8A Q8A1="N" Q8A2="Y" Q8A3="N"/>
<Item8B Q8B1="N" Q8B2="N" Q8B3="N"/>
<Item8C Q8C1="Y" Q8C2="Y" Q8C3="N" Q8C4="N"/>
<Item8D/>
<Item8E Q8E="N"/>
<Item8F/>
<Item8G Q8G1="N"/>
<Item8H Q8H="N"/>
<Item8I Q8I="N"/>
<Item9A Q9A1A="N" Q9A1B="N"/>
<Item9B Q9B1A="N" Q9B1B="N"/>
<Item9C Q9C1="N" Q9C2="N" Q9C3="N" Q9C4="N"/>
<Item9D Q9D1="N" Q9D2="N"/>
<Item9E/>
<Item9F Q9F="1"/>
<Item10A Q10A="N"/>
<Item11 Q11="N"/>
<Item11A Q11A1="N" Q11A2="N"/>
<Item11B Q11B1="N" Q11B2="N"/>
<Item11C Q11C1="N" Q11C2="N" Q11C3="N" Q11C4="N" Q11C5="N"/>
<Item11D Q11D1="N" Q11D2="N" Q11D3="N" Q11D4="N" Q11D5="N"/>
<Item11E Q11E1="N" Q11E2="N" Q11E3="N" Q11E4="N"/>
<Item11F Q11F="N"/>
<Item11G Q11G="N"/>
<Item11H Q11H1A="N" Q11H1B="N" Q11H1C="N" Q11H2="N"/>
</Part1A>
</FormInfo>
</Firm>
<Firm>

It looks like the variables are listed directly before the equals sign and the content(row item) is listed directly after the equals sign. 看起来变量直接列在等号之前,内容(行项)直接列在等号后面。 For example in the line: <Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/> 例如,在行中: <Info SECRgnCD="DRO" FirmCrdNb="116085" SECNb="801-63010" BusNm="JACOBSEN CAPITAL MANAGEMENT" LegalNm="JACOBSEN CAPITAL MANAGEMENT, LLC"/>

I would want the data frame to look as follows: 我希望数据框看起来如下:

SECRgnCD FirmCrdNB  SECNb      BusNm
DRO       116085    801-63010  JACOBSEN CAPITAL MANAGEMENT

Thanks for your help. 谢谢你的帮助。

The following code should work; 以下代码应该有效;

 require(XML)
 data <- xmlParse('./IA_FIRM_SEC_Feed_02_05_2017.xml')
 result <- as.data.frame(t(xmlSApply(data["//Firms/Firm/Info"],xmlAttrs)),
                    stringsAsFactors=FALSE)

Consider even using XML's internal variable, xmlAttrsToDataFrame() using the triple colon operator: 考虑使用XML的内部变量xmlAttrsToDataFrame()使用三重冒号运算符:

library(XML)

df <- XML:::xmlAttrsToDataFrame(getNodeSet(doc, path='//Info'))
# OR df <- XML:::xmlAttrsToDataFrame(doc['//Info'])

df
#   SECRgnCD FirmCrdNb     SECNb                       BusNm                          LegalNm
# 1      DRO    116085 801-63010 JACOBSEN CAPITAL MANAGEMENT JACOBSEN CAPITAL MANAGEMENT, LLC

And per OP's update to capture all attributes under <Firm> , use above method in a lapply loop. 并且根据OP的更新来捕获<Firm>下的所有属性,在lapply循环中使用上面的方法。 Below trycatch is used to capture elements without attributes that raises a warning: trycatch下面用于捕获没有引发警告的属性的元素:

# CAPTURE ALL ELEMENT NAMES UNDER <Firm> AND <Part1A>
firmsnames <- append(xpathSApply(doc, "//Firm/*", xmlName),
                     xpathSApply(doc, "//Part1A/*", xmlName))

# ITERATE THROUGH EACH ELEMENT BINDING ATTRS TO DATAFRAME    
dfList <- lapply(firmsnames, function(n) {
  tryCatch({
    return(XML:::xmlAttrsToDataFrame(doc[paste0('//', n)]))
  }, warning=function(w) return(data.frame(NA)))
})

# FILTER OUT EMPTY DF ELEMENTS IN LIST
dfList <- Filter(function(i) length(i)>0, dfList) 

# COLUMN BIND ALL LIST ELEMENTS
df <- do.call(cbind, dfList)

# RETURNS A DATA FRAME OF 1 ROW AND 182 COLUMNS 
str(df)
# data.frame':  1 obs. of  182 variables:
#  $ SECRgnCD : Factor w/ 1 level "DRO": 1
#   ..- attr(*, "names")= chr "SECRgnCD"
#  $ FirmCrdNb: Factor w/ 1 level "116085": 1
#   ..- attr(*, "names")= chr "FirmCrdNb"
#  $ SECNb    : Factor w/ 1 level "801-63010": 1
#   ..- attr(*, "names")= chr "SECNb"
#  $ BusNm    : Factor w/ 1 level "JACOBSEN CAPITAL MANAGEMENT": 1
#   ..- attr(*, "names")= chr "BusNm"
#  $ LegalNm  : Factor w/ 1 level "JACOBSEN CAPITAL MANAGEMENT, LLC": 1
#   ..- attr(*, "names")= chr "LegalNm"
# ...

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM