
Extract data from large XML using R

I have a large XML file that I can't parse completely in R because I run out of memory. I only want to extract a few specific columns. I found similar questions asked by others:

How to read large (~20 GB) xml file in R?
Storing specific XML node values with R's xmlEventParse

I can't get those solutions to work with my data, though: the code runs, but no data is returned. I tried adapting the suggested solutions to my XML, but it still doesn't work. It may be down to my lack of XML knowledge. Below is an example of my XML data, where cl, clssc, clp, clpssc and primclp are the columns. How can I extract only cl and clssc without parsing the whole document first?

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<abc:abc xmlns:abc="http://abc/abc" xsi:schemaLocation="http://abc/abc lala_20Q2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>2000000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
</abc:abc>

The disk.frame package is made to handle medium-sized data. It enables batch conversion of data into .fst files with speedy I/O through the fst package, and fast data manipulation using data.table. Here, the dtplyr interface to the data.table package is used for the final wrangling.

Step 1: preparing the input file*

1.1 Create a folder and place your .xml file in there.

1.2 Remove the first two lines and the last line, so that you are left with a .xml file that has this structure:

  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>200000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>
  <chcp>
    <cl>2000000</cl>
    <clssc>10934</clssc>
    <clp>2000000</clp>
    <clpssc>10934</clpssc>
    <primclp>Y</primclp>
  </chcp>

Step 2: set up disk.frame

library(tidyverse)
library(disk.frame)

path <- dirname(file.choose()) # pick your .xml file; path is the folder that contains it
setup_disk.frame(workers = 10) # adjust this to your machine
options(future.globals.maxSize = Inf)
old <- getOption("scipen"); options(scipen = 100) # prevent scientific notation later

Step 3: convert .xml to .fst

l <- csv_to_disk.frame( # works for .xml
  file.path(path, "b.xml"), # replace b.xml with your filename (in folder)
  outdir = file.path(path, "combined.df"), # the .fst chunks are written here
  in_chunk_size = 7,
  backend = "data.table", header = FALSE)

This gives you an object l of class "disk.frame" "disk.frame.folder" in your R environment. You should now have a subfolder "combined.df" in your specified directory, containing a bunch of .fst files.

Step 4: read in the required columns & tidy data

cbind(
  get_chunk(l, 1) %>%
    `[`(seq(2, nrow(.), 2)),  
  get_chunk(l, 2) %>%
    `[`(seq(1, nrow(.), 2))
) %>%
  rename("cl" = 1, "clssc" = 2) %>%
  mutate(across(everything(), parse_number)) %>% # strip the tags, keep the numbers
  as_tibble() # omit this to keep data.table

# A tibble: 3 x 2
       cl clssc
    <dbl> <dbl>
1 2000000 10934
2 2000000 10934
3 2000000 10934

Finally, don't forget to restore the scientific notation option with options(scipen = old).

* Note: Step 1 can likely be avoided by playing with the chunk sizes or by manipulating the .xml from within R. I don't know how to do this (yet).

Note 2: I recommend reading the disk.frame documentation carefully for tips on how to set it up properly for your machine.
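
One possible way to skip the manual editing in Step 1 is to stream the file in blocks of lines and keep only the lines for the wanted tags, so the XML header and root element are simply ignored. The following is a minimal base-R sketch, not part of disk.frame: the helper name extract_tags and its block_size parameter are made up for illustration, and it assumes that every element sits on its own line and that each record contains each requested tag exactly once.

```r
# Hypothetical helper: stream an XML file block by block and collect the
# text content of the requested tags without parsing the document.
extract_tags <- function(file, tags = c("cl", "clssc"), block_size = 100000L) {
  # One pattern per tag, e.g. "<cl>(.*)</cl>" ("<cl>" does not match "<clp>")
  patterns <- sprintf("<%s>(.*)</%s>", tags, tags)
  found <- setNames(vector("list", length(tags)), tags)

  con <- file(file, open = "r")
  on.exit(close(con))

  repeat {
    # Only block_size lines are held in memory at any time
    lines <- readLines(con, n = block_size, warn = FALSE)
    if (length(lines) == 0L) break
    for (i in seq_along(tags)) {
      hit <- grepl(patterns[i], lines)
      values <- sub(paste0("^.*", patterns[i], ".*$"), "\\1", lines[hit])
      found[[tags[i]]] <- c(found[[tags[i]]], values)
    }
  }

  # Assumption: equal counts per tag, otherwise the columns cannot be aligned
  as.data.frame(found, stringsAsFactors = FALSE)
}

# Small demonstration on a temporary file shaped like the question's data
sample_xml <- c(
  '<?xml version="1.0" encoding="UTF-8" standalone="no"?>',
  '<abc:abc xmlns:abc="http://abc/abc">',
  "  <chcp>",
  "    <cl>2000000</cl>",
  "    <clssc>10934</clssc>",
  "    <clp>200000</clp>",
  "  </chcp>",
  "  <chcp>",
  "    <cl>3000000</cl>",
  "    <clssc>555</clssc>",
  "    <clp>200000</clp>",
  "  </chcp>",
  "</abc:abc>"
)
tmp <- tempfile(fileext = ".xml")
writeLines(sample_xml, tmp)

res <- extract_tags(tmp, block_size = 3L) # tiny blocks to exercise the loop
print(res)
```

The values come back as character; wrap the columns in as.numeric() if numbers are needed. If a record can lack one of the tags, the final as.data.frame() call fails because the columns differ in length, which at least makes the misalignment visible.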

On a Windows machine

Here is an approach that uses the output of the Windows findstr command to import data with data.table::fread(). It filters the data with the Windows counterpart of 'grep' before it is loaded into R, so you are far less likely to run into memory problems.

Location of the xml: e:/testdata.xml

Further explanation is in the code comments below.

library(data.table)
# Import output from windows findstr-command
#  assumes location of data is e:/testdata.xml
#  !! use \\ in path, else findstr does not understand !!
DT <- data.table::fread(cmd = 'findstr "<clssc> <cl>" e:\\testdata.xml', 
                        sep = "\n", col.names = "line", header = FALSE )
#                    line
# 1:     <cl>2000000</cl>
# 2: <clssc>10934</clssc>
# 3:     <cl>2000000</cl>
# 4: <clssc>10934</clssc>
# 5:     <cl>2000000</cl>
# 6: <clssc>10934</clssc>

# Extract data from raw line
DT[, name  := gsub("^<(.+?)>.*$", "\\1", line)]
DT[, value := gsub("^.*>([0-9]+?)<.*$", "\\1", line)]
#                    line  name   value
# 1:     <cl>2000000</cl>    cl 2000000
# 2: <clssc>10934</clssc> clssc   10934
# 3:     <cl>2000000</cl>    cl 2000000
# 4: <clssc>10934</clssc> clssc   10934
# 5:     <cl>2000000</cl>    cl 2000000
# 6: <clssc>10934</clssc> clssc   10934

# Build ids: the n-th <cl> is paired with the n-th <clssc>
DT[, id := rowid(name)]

# Cast to wide format
dcast(DT, id ~ name, value.var = "value")

#    id      cl clssc
# 1:  1 2000000 10934
# 2:  2 2000000 10934
# 3:  3 2000000 10934
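
Pairing values with rowid(name) relies on every record containing both tags. If that cannot be guaranteed, one option is to also match the opening <chcp> lines and derive the id from the record boundaries instead. The sketch below illustrates the idea on made-up lines, where the second record deliberately lacks <clssc>. In practice the lines would come from the findstr call with "<chcp>" added to the search strings, which is an assumption and was not tested on a real file.

```r
library(data.table)

# Made-up lines, shaped like filtered output that also keeps "<chcp>".
# The second record has no <clssc> on purpose.
lines <- c("<chcp>", "<cl>2000000</cl>", "<clssc>10934</clssc>",
           "<chcp>", "<cl>3000000</cl>",
           "<chcp>", "<cl>4000000</cl>", "<clssc>555</clssc>")
DT <- data.table(line = trimws(lines))

# Tag name of each line
DT[, name := sub("^<([^>/]+)>.*$", "\\1", line)]

# Each opening <chcp> starts a new record, so a running count is the record id
DT[, id := cumsum(name == "chcp")]

# Drop the marker rows, then pull the numeric payload out of the rest
DT <- DT[name != "chcp"]
DT[, value := as.numeric(sub("^<[^>]+>([0-9]+)<.*$", "\\1", line))]

# Missing fields become NA instead of shifting values into the wrong record
wide <- dcast(DT, id ~ name, value.var = "value")
print(wide)
```

With rowid(name), the 555 from the third record would have ended up next to the cl value of the second record; with boundary-based ids it stays in record 3 and record 2 gets NA.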

On a Unix machine
I cannot test this, since I only have Windows available here.
Replace the findstr command in the cmd = '...' argument (and the regex) with your system's grep command.
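
As a rough idea of what the grep variant might look like, here is a sketch that assumes a grep supporting the -E option is on the PATH. The temporary file stands in for the real XML, and the variables xml_file and cmd are illustrative names only.

```r
library(data.table)

# Stand-in for the real file; replace xml_file with the path to your XML
xml_file <- tempfile(fileext = ".xml")
writeLines(c(
  "<root>",
  "  <chcp>",
  "    <cl>2000000</cl>",
  "    <clssc>10934</clssc>",
  "    <clp>200000</clp>",
  "  </chcp>",
  "  <chcp>",
  "    <cl>2000000</cl>",
  "    <clssc>10934</clssc>",
  "    <clp>200000</clp>",
  "  </chcp>",
  "</root>"
), xml_file)

# -E enables the "|" alternation; shQuote() protects the pattern and the
# path from the shell. "<cl>" does not match "<clp>" or "<clssc>".
cmd <- paste("grep -E", shQuote("<cl>|<clssc>"), shQuote(xml_file))

# Same import as in the Windows version, only the command differs
DT <- fread(cmd = cmd, sep = "\n", header = FALSE, col.names = "line")
print(DT)
```

The remaining steps (extracting name and value, building ids, dcast) should carry over unchanged from the Windows version.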
