Extract XML data in R from a table on PostgreSQL

I have a table in PostgreSQL that has an xml column plus varchar/numeric columns. When I retrieve the data into a data frame, the xml column is converted to character. Let's recreate the dataset:

my_dataset <- data.frame(id = c(1,1,1,1,2,2,2,2,2),
                         http_action = c("REQUEST","RESPONSE","REQUEST","RESPONSE","REQUEST","RESPONSE","REQUEST","RESPONSE","RESPONSE"),
                         http_data = c('"<?xml version="1.0" standalone="yes"?> <questions> <candidate> <lastname>GOMEZ</lastname> <name>BARNEY</name> </candidate> </questions>)"',
                                       '"<validating> <opnum>123</opnum> <q1>Daily activity?</q1> <a1>Drinking at Moes</a1></validating>"',
                                       '"<?xml version="1.0" standalone="yes"?> <questions> <option>1</option> </questions>"', 
                                       '"<validating> <code>XY936701</code> <date>12/03/2020</date> <time>19:07</time> <result>NONAUTHORIZED</result> <explanation>NON SUITABLE</explanation> </validating>"',
                                       '"<?xml version="1.0" standalone="yes"?> <questions> <candidate> <lastname>LEONARD</lastname> <name>LEN</name> </candidate> </questions>)"' ,
                                       '"<validating> <opnum>124</opnum> <q1>Daily activity?</q1> <a1>Work at Nuclear Power</a1></validating>"',
                                       '"<?xml version="1.0" standalone="yes"?> <questions> <option>1</option> </questions>"', 
                                       '"<validating> <code>XY936702</code> <date>15/03/2020</date> <time>16:12</time> <result>NONAUTHORIZED</result> <explanation>NON SUITABLE</explanation> </validating>"',
                                       '"<validating> <code>XY936702</code> <date>15/03/2020</date> <time>19:24</time> <result>AUTHORIZED</result> <explanation>SUITABLE</explanation> </validating>"'),
                         http_status = c(200,200,200,200,200,200,200,200,200),
                         stringsAsFactors = FALSE)

When I run the query, I receive the following warning:

In postgresqlExecStatement(conn, statement, ...) :
  RS-DBI driver warning: (unrecognized PostgreSQL field type xml (id:142) in column 4)

I can extract the information using string comparisons on rows containing the <result> node. I tried the following:

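library(dplyr)
library(stringr)

# Attempt: NA when there is no <result> node, 0 for NONAUTHORIZED, 1 otherwise
my_dataset <- my_dataset %>%
  mutate(authorized = ifelse(str_extract(http_data, "<result>[w+]</result>") == "", NA,
                             ifelse(str_extract(http_data, "<result>[w+]</result>") == "NONAUTHORIZED", 0, 1)))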

As a result I get a column full of NA values, which is not what I expect. Could you help me with this? Perhaps my regex is not well written. Also, do you know whether it is possible to extract that information directly in the query? Thank you in advance for any help you can provide.

Regards

Your regex is the problem: [w+] is a character class that matches a single literal w or +, not a run of word characters. Written as an R string, the pattern should be something like "<result>(\\w+)</result>". Also, str_extract is not enough to get the group matches, because it returns the whole match including the tags. Use str_match for groups; see the stringr documentation for str_match.
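A minimal sketch of that fix, assuming the stringr package is installed. The vector below is a shortened, illustrative stand-in for the http_data column, not the real table. str_match returns a matrix whose second column holds the first capture group, so the node text comes back without the tags.

```r
library(stringr)

# Shortened stand-in for the http_data column (illustrative values only)
http_data <- c(
  "<validating> <opnum>123</opnum> <q1>Daily activity?</q1> </validating>",
  "<validating> <code>XY936701</code> <result>NONAUTHORIZED</result> </validating>",
  "<validating> <code>XY936702</code> <result>AUTHORIZED</result> </validating>"
)

# Column 1 of the matrix is the full match, column 2 is the first capture group
result <- str_match(http_data, "<result>(\\w+)</result>")[, 2]

# NA where there is no <result> node, 0 for NONAUTHORIZED, 1 otherwise
authorized <- ifelse(is.na(result), NA, ifelse(result == "NONAUTHORIZED", 0, 1))
```

The same two expressions can be placed inside a mutate call on the data frame. Testing with is.na rather than comparing to "" matters here, because a non-match yields NA and NA == "" is itself NA.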

As an alternative solution, you can use an XML parser.
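One way the parser approach might look, assuming the xml2 package (the answer does not name a specific parser). The values are illustrative and mimic the sample data, including the wrapping double quotes and the stray ")" that have to be stripped before the text is valid XML. extract_node is a hypothetical helper written for this sketch.

```r
library(xml2)

# Illustrative values in the same shape as the http_data column
http_data <- c(
  '"<?xml version="1.0" standalone="yes"?> <questions> <option>1</option> </questions>)"',
  '"<validating> <code>XY936701</code> <result>NONAUTHORIZED</result> </validating>"',
  '"<validating> <code>XY936702</code> <result>AUTHORIZED</result> </validating>"'
)

# Hypothetical helper: strip the wrapper, parse, and return the text of one node
extract_node <- function(x, node) {
  # Drop the leading quote, then a trailing quote with an optional ")" before it
  cleaned <- sub('\\)?"$', "", sub('^"', "", x))
  doc <- read_xml(cleaned)
  # xml_find_first yields a missing node when nothing matches; xml_text then gives NA
  xml_text(xml_find_first(doc, paste0("//", node)))
}

result <- vapply(http_data, extract_node, character(1),
                 node = "result", USE.NAMES = FALSE)

# NA where there is no <result> node, 0 for NONAUTHORIZED, 1 otherwise
authorized <- ifelse(is.na(result), NA, ifelse(result == "NONAUTHORIZED", 0, 1))
```

Compared with a regex, this tolerates attributes and whitespace inside the tags, at the cost of failing loudly on any row that is not well-formed XML.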
