
Opening .bcp files in R

I have been trying to convert UK Charity Commission data, which is in .bcp file format, into .csv format so that it can be read into R. The data I am referring to is available here: http://data.charitycommission.gov.uk/. What I am trying to do is turn these .bcp files into usable data frames that I can clean and run analyses on in R.

There are suggestions on how to do this with Python on this GitHub page, https://github.com/ncvo/charity-commission-extract, but unfortunately I haven't been able to get these options to work.

I am wondering whether there is any syntax or package that will allow me to open these data in R directly. I haven't been able to find any.

Another option would be to simply read the files into R as a single character vector using readLines. I have done this, and the files are delimited with @**@ for columns and *@@* for rows (see http://data.charitycommission.gov.uk/data-definition.aspx). Is there an R command that would allow me to create a data frame from a long character string, defining delimiters for both rows and columns?

R solution

Edited version

I am not sure whether all .bcp files are in the same format. I downloaded the dataset you mentioned and tried a solution on the smallest file, extract_aoo_ref.bcp:

library(data.table)

# read the whole file as-is into a single string
text <- readChar("./extract_aoo_ref.bcp",
                 nchars = file.info("./extract_aoo_ref.bcp")$size,
                 useBytes = TRUE)
# free up ";" in the data so it can serve as the column separator
text <- gsub(";", ":", text)
# replace the column separator @**@ with ";" and the row separator *@@* with a newline
text <- gsub("@\\*\\*@", ";", text)
text <- gsub("\\*@@\\*", "\n", text, perl = TRUE)
# parse the result; fill = TRUE pads rows that have fewer fields
result <- data.table::fread(text = text,
                            header = FALSE,
                            sep = ";",
                            fill = TRUE,
                            quote = "",
                            strip.white = TRUE)

head(result,10)

#    V1 V2                           V3                                           V4 V5 V6
# 1:  A  1 THROUGHOUT ENGLAND AND WALES At least 10 authorities in England and Wales  N NA
# 2:  B  1             BRACKNELL FOREST                             BRACKNELL FOREST  N NA
# 3:  D  1                  AFGHANISTAN                                  AFGHANISTAN  N  2
# 4:  E  1                       AFRICA                                       AFRICA  N NA
# 5:  A  2           THROUGHOUT ENGLAND      At least 10 authorities in England only  N NA
# 6:  B  2               WEST BERKSHIRE                               WEST BERKSHIRE  N NA
# 7:  D  2                      ALBANIA                                      ALBANIA  N  3
# 8:  E  2                         ASIA                                         ASIA  N NA
# 9:  A  3             THROUGHOUT WALES        At least 10 authorities in Wales only  Y NA
# 10:  B  3                      READING                                      READING  N NA
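The question also asks whether a long string can be split on both delimiters directly. A minimal base R sketch of that idea follows, splitting on the literal separators with strsplit(fixed = TRUE) so that no characters in the data need to be swapped first. The helper name parse_bcp_string and the sample string are made up for illustration; real files may need extra care for encoding and for line breaks inside fields.

```r
# Hypothetical helper: split a .bcp-style string into a data frame.
# Assumes "@**@" separates columns and "*@@*" terminates rows.
parse_bcp_string <- function(x, col_sep = "@**@", row_sep = "*@@*") {
  # fixed = TRUE treats the separators as literal text, not as regex
  rows <- strsplit(x, row_sep, fixed = TRUE)[[1]]
  # drop whitespace-only fragments, e.g. a newline after the last row
  rows <- rows[nzchar(trimws(rows))]
  fields <- strsplit(rows, col_sep, fixed = TRUE)
  # pad short rows with NA so every row has the same width
  width <- max(lengths(fields))
  fields <- lapply(fields, function(f) {
    c(f, rep(NA_character_, width - length(f)))
  })
  as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
}

# Made-up sample that mimics the layout of extract_aoo_ref.bcp
sample_text <- "A@**@1@**@THROUGHOUT ENGLAND AND WALES@**@N*@@*B@**@1@**@BRACKNELL FOREST@**@N*@@*"
df <- parse_bcp_string(sample_text)
print(df)
```

All columns come back as character here; type conversion (for example with type.convert) would be a separate step.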

The same data.table approach also works for the trickier file, extract_charity.bcp:

head(result[,1:3],10)
#       V1 V2                                                                                 V3
# 1: 200000  0                                                          HOMEBOUND CRAFTSMEN TRUST
# 2: 200001  0                                                          PAINTERS' COMPANY CHARITY
# 3: 200002  0                                              THE ROYAL OPERA HOUSE BENEVOLENT FUND
# 4: 200003  0                                                          HERGA WORLD DISTRESS FUND
# 5: 200004  0 THE WILLIAM GOLDSTEIN LAY STAFF BENEVOLENT FUND (ROYAL HOSPITAL OF ST BARTHOLOMEW)
# 6: 200005  0                              DEVON AND CORNWALL ROMAN CATHOLIC DEVELOPMENT SOCIETY
# 7: 200006  0                                                    THE HORLEY SICK CHILDREN'S FUND
# 8: 200007  0                                            THE HOLDENHURST OLD PEOPLE'S HOME TRUST
# 9: 200008  0                                                         LORNA GASCOIGNE TRUST FUND
# 10: 200009  0                                          THE RALPH LEVY CHARITABLE COMPANY LIMITED

So it looks like it is working :)
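Because fill = TRUE silently pads short rows, one way to back up that impression is to count the row terminators in the raw text and tabulate the number of fields per row. The sketch below does this on a made-up string whose second record is deliberately one field short; count_fixed is a hypothetical helper, not part of any package.

```r
# Made-up sample; the second record is deliberately one field short
raw <- "A@**@1@**@AFRICA@**@N*@@*B@**@2@**@ASIA*@@*"

# Count literal (non-regex) occurrences of a separator in one string
count_fixed <- function(x, pattern) {
  hits <- gregexpr(pattern, x, fixed = TRUE)[[1]]
  if (hits[1] == -1L) 0L else length(hits)
}

# number of row terminators in the raw text
n_rows <- count_fixed(raw, "*@@*")

# fields per row = number of column separators + 1
rows <- strsplit(raw, "*@@*", fixed = TRUE)[[1]]
n_fields <- vapply(rows,
                   function(r) count_fixed(r, "@**@") + 1L,
                   integer(1),
                   USE.NAMES = FALSE)

print(n_rows)
print(table(n_fields))
```

On a real file, n_rows can be compared with nrow(result), and more than one distinct value in n_fields would point to ragged rows worth inspecting.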
