Extract data from large XML using R
I have a large XML file that I can't parse completely in R because I run out of memory. I would just like to extract some specific columns. I found similar questions asked by others:
How to read large (~20 GB) xml file in R?
Storing specific XML node values with R's xmlEventParse
I can't get them to work with my data, though: the code runs, but no data is returned. I did try to adapt the suggested solutions to my XML, but it still does not work. It might be my lack of XML knowledge.
Below is an example of my XML data, where cl, clssc, clp, clpssc, primclp are the columns. How can I extract only cl and clssc without parsing the whole document first?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<abc:abc xmlns:abc="http://abc/abc" xsi:schemaLocation="http://abc/abc lala_20Q2.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<chcp>
<cl>2000000</cl>
<clssc>10934</clssc>
<clp>200000</clp>
<clpssc>10934</clpssc>
<primclp>Y</primclp>
</chcp>
<chcp>
<cl>2000000</cl>
<clssc>10934</clssc>
<clp>200000</clp>
<clpssc>10934</clpssc>
<primclp>Y</primclp>
</chcp>
<chcp>
<cl>2000000</cl>
<clssc>10934</clssc>
<clp>2000000</clp>
<clpssc>10934</clpssc>
<primclp>Y</primclp>
</chcp>
</abc:abc>
The disk.frame package is made to handle medium-sized data. It enables batch conversion of data into .fst files and speedy I/O through the fst package, and fast data manipulation using data.table. Here, the dtplyr interface to the data.table package is used for the final wrangling.
1.1 Create a folder and place your .xml file in there.
1.2 Remove the first two lines and the last line, so that you are left with a .xml file that has this structure:
<chcp>
<cl>2000000</cl>
<clssc>10934</clssc>
<clp>200000</clp>
<clpssc>10934</clpssc>
<primclp>Y</primclp>
</chcp>
<chcp>
<cl>2000000</cl>
<clssc>10934</clssc>
<clp>200000</clp>
<clpssc>10934</clpssc>
<primclp>Y</primclp>
</chcp>
<chcp>
<cl>2000000</cl>
<clssc>10934</clssc>
<clp>2000000</clp>
<clpssc>10934</clpssc>
<primclp>Y</primclp>
</chcp>
library(tidyverse)
library(disk.frame)
path <- dirname(file.choose()) # pick your .xml file; path becomes the folder that contains it
setup_disk.frame(workers = 10) # adjust this to your machine
options(future.globals.maxSize = Inf)
old <- getOption("scipen"); options(scipen = 100) # prevent scientific notation later
l <- csv_to_disk.frame( # works for .xml
  file.path(path, "b.xml"), # replace b.xml with your filename (in folder)
  outdir = file.path(path, "combined.df"),
  in_chunk_size = 7,
  backend = "data.table", header = FALSE)
This gives object l of class "disk.frame" "disk.frame.folder" in your R environment. You should now have a subfolder "combined.df", containing a bunch of .fst files, in your specified directory.
cbind(
  get_chunk(l, 1) %>%
    `[`(seq(2, nrow(.), 2)),
  get_chunk(l, 2) %>%
    `[`(seq(1, nrow(.), 2))
) %>%
  rename("cl" = 1, "clssc" = 2) %>%
  mutate(across(.fns = parse_number)) %>%
  as_tibble() # omit this to keep data.table
# A tibble: 3 x 2
cl clssc
<dbl> <dbl>
1 2000000 10934
2 2000000 10934
3 2000000 10934
Finally, don't forget to revert the scientific notation option with options(scipen = old).
* Note: Step 1 can likely be avoided by playing with the chunk sizes or through some manipulation of the .xml from within R. This I don't know how to do (yet).
Note 2: I recommend carefully reading the disk.frame documentation for tips on how to set it up properly for your machine.
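One possible way to do the line removal from Step 1 inside R is sketched below. It is only a sketch, not part of disk.frame: strip_wrapper_lines() and its chunk_lines argument are hypothetical names, and it assumes the wrapper is exactly the first two lines and the last line of the file, as in the example above. The file is copied in chunks, so it never has to fit in memory as a whole.

```r
# Hypothetical helper (not part of disk.frame): copy a file while dropping
# its first two lines and its last line, reading chunk_lines lines at a time.
strip_wrapper_lines <- function(infile, outfile, chunk_lines = 100000L) {
  con_in  <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })

  readLines(con_in, n = 2L)  # skip the XML declaration and the root opening tag
  held <- character(0)       # newest line seen; only written if more lines follow

  repeat {
    chunk <- readLines(con_in, n = chunk_lines)
    if (length(chunk) == 0L) break
    lines <- c(held, chunk)
    n <- length(lines)
    writeLines(lines[-n], con_out)  # write everything except the newest line
    held <- lines[n]
  }
  # The line still held here is the last line of the file; it is never written.
  invisible(outfile)
}

# Small demonstration on a temporary file
xml <- c('<?xml version="1.0" encoding="UTF-8" standalone="no"?>',
         '<abc:abc xmlns:abc="http://abc/abc">',
         '<chcp>', '<cl>2000000</cl>', '<clssc>10934</clssc>', '</chcp>',
         '</abc:abc>')
infile  <- tempfile(fileext = ".xml")
outfile <- tempfile(fileext = ".xml")
writeLines(xml, infile)
strip_wrapper_lines(infile, outfile, chunk_lines = 2L)
stripped <- readLines(outfile)
```

The output file can then be passed to csv_to_disk.frame() in place of the hand-edited one.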
On a Windows machine
Here is an approach where you use the output from the Windows findstr command to import data using data.table::fread(). It filters the data using the Windows version of 'grep' before it is loaded into R. This way you will not run into memory problems as quickly.
Location of the xml: e:/testdata.xml
Further explanation is in the code comments below.
library(data.table)
# Import output from windows findstr-command
# assumes location of data is e:/testdata.xml
# !! use \\ in path, else findstr does not understand !!
DT <- data.table::fread(cmd = 'findstr "<clssc> <cl>" e:\\testdata.xml',
sep = "\n", col.names = "line", header = FALSE )
# line
# 1: <cl>2000000</cl>
# 2: <clssc>10934</clssc>
# 3: <cl>2000000</cl>
# 4: <clssc>10934</clssc>
# 5: <cl>2000000</cl>
# 6: <clssc>10934</clssc>
# Extract data from raw line
DT[, name := gsub("^<(.+?)>.*$", "\\1", line)]
DT[, value := gsub("^.*>([0-9]+?)<.*$", "\\1", line)]
# line name value
# 1: <cl>2000000</cl> cl 2000000
# 2: <clssc>10934</clssc> clssc 10934
# 3: <cl>2000000</cl> cl 2000000
# 4: <clssc>10934</clssc> clssc 10934
# 5: <cl>2000000</cl> cl 2000000
# 6: <clssc>10934</clssc> clssc 10934
# Build some id's
DT[, id := rowid(name)]
# Cast to wide format
dcast(DT, id ~ name, value.var = "value")
# id cl clssc
# 1: 1 2000000 10934
# 2: 2 2000000 10934
# 3: 3 2000000 10934
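One caveat: rowid(name) pairs the k-th cl with the k-th clssc, so the rows shift if some record lacks one of the two tags. A possible variant, sketched below, also keeps the <chcp> opening lines and uses them to number the records. It assumes the filter command was extended to something like 'findstr "<chcp> <clssc> <cl>" ...' (an assumption, not tested here); the sample lines are made up for illustration.

```r
library(data.table)

# Made-up filtered lines: the second record has no <cl> tag
DT <- data.table(line = c(
  "<chcp>", "<cl>1</cl>", "<clssc>10</clssc>",
  "<chcp>", "<clssc>20</clssc>",
  "<chcp>", "<cl>3</cl>", "<clssc>30</clssc>"
))

# Every <chcp> line starts a new record, so a running count gives the record id
DT[, id := cumsum(grepl("<chcp>", line, fixed = TRUE))]
DT <- DT[!grepl("<chcp>", line, fixed = TRUE)]

# Tag name and value, using patterns that do not rely on lazy quantifiers
DT[, name  := sub("^<([^>]+)>.*$", "\\1", line)]
DT[, value := as.numeric(sub("^<[^>]+>([^<]*)<.*$", "\\1", line))]

# Cast to wide format; a missing tag becomes NA instead of shifting the rows
wide <- dcast(DT, id ~ name, value.var = "value")
```

With this id, the record without a cl keeps its own row, and its clssc stays aligned with the right record.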
On a Unix machine
I cannot test this, since I only use Windows at this location.
Replace the findstr command (and the regex) in the cmd='...' part with the grep command of your system.
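As a rough sketch of what that replacement could look like, assuming a grep that supports -E (extended regular expressions): the sample file, the variable names xml_file and cmd, and the simplified extraction patterns are illustrative choices, not taken from the answer above.

```r
library(data.table)

# Small sample file so the sketch is self-contained; point xml_file at your own data
xml_file <- tempfile(fileext = ".xml")
writeLines(c(
  '<?xml version="1.0" encoding="UTF-8" standalone="no"?>',
  '<abc:abc xmlns:abc="http://abc/abc">',
  '<chcp>', '<cl>2000000</cl>', '<clssc>10934</clssc>', '<clp>200000</clp>', '</chcp>',
  '<chcp>', '<cl>2000000</cl>', '<clssc>10934</clssc>', '<clp>200000</clp>', '</chcp>',
  '</abc:abc>'
), xml_file)

# grep -E keeps only lines containing <cl> or <clssc>;
# shQuote() protects the pattern and the path from the shell
cmd <- paste("grep -E", shQuote("<(cl|clssc)>"), shQuote(xml_file))
DT <- fread(cmd = cmd, sep = "\n", col.names = "line", header = FALSE)

# Same steps as in the Windows version: extract name and value, build ids, cast wide
DT[, name  := sub("^<([^>]+)>.*$", "\\1", line)]
DT[, value := sub("^<[^>]+>([^<]*)<.*$", "\\1", line)]
DT[, id := rowid(name)]
wide <- dcast(DT, id ~ name, value.var = "value")
```

The pattern requires the closing > right after the tag name, so lines such as <clp> and <clpssc> are not picked up.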