Read txt file selectively in R

I'm looking for an easy fix to read a txt file that looks like this when opened in Excel:

IDmaster    By_uspto    App_date    Grant_date  Applicant   Cited   
2   1   19671106    19700707    Motorola Inc    1052446 
2   1   19740909    19751028    Gen Motors Corp 1062884 
2   1   19800331    19820817    Amp Incorporated    1082369 
2   1   19910515    19940719    Dell Usa L.P.   389546  
2   1   19940210    19950912    Schueman Transfer    Inc.   1164239
2   1   19940217    19950912    Spacelabs Medical    Inc.   1164336

EDIT: Opened in Notepad, the txt file looks like this (comma-separated). The last two rows exhibit the problem.

IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited
2,1,19671106,19700707,Motorola Inc,1052446
2,1,19740909,19751028,Gen Motors Corp,1062884
2,1,19800331,19820817,Amp Incorporated,1082369
2,1,19910515,19940719,Dell Usa L.P.,389546
2,1,19940210,19950912,Schueman Transfer, Inc.,1164239
2,1,19940217,19950912,Spacelabs Medical, Inc.,1164336

The problem is that some of the Applicant names contain commas, so part of the name is read as if it belonged in a separate column, which it doesn't.

Is there a simple way to (a) "teach" R to keep string variables together regardless of the commas inside them, or (b) read in the first 4 columns and then add an extra column for everything after the last comma?

Given the length of the data, I can't open it entirely in Excel, which would otherwise be a simple alternative.

If your example is saved as "Test.csv", try the following. It relies on the commas inside the names being followed by a space, while the separating commas are not:

read.csv(text=gsub(', ', ' ', paste0(readLines("Test.csv"),collapse="\n")),
         quote="'",
         stringsAsFactors=FALSE)

It returns:

#   IDmaster By_uspto App_date Grant_date              Applicant   Cited
# 1        2        1 19671106   19700707           Motorola Inc 1052446
# 2        2        1 19740909   19751028        Gen Motors Corp 1062884
# 3        2        1 19800331   19820817       Amp Incorporated 1082369
# 4        2        1 19910515   19940719          Dell Usa L.P.  389546
# 5        2        1 19940210   19950912 Schueman Transfer Inc. 1164239
# 6        2        1 19940217   19950912 Spacelabs Medical Inc. 1164336
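
The gsub() call above removes the comma from the name. If the comma should be kept, option (b) from the question can be sketched with a regular expression. This is only a sketch under the assumption that the first four fields and the last field never contain commas; the sample lines are copied from the question, and the names `pattern`, `fields` and `parsed` are made up for illustration.

```r
# Sample lines copied from the question; in practice use readLines("Test.csv")
lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19910515,19940719,Dell Usa L.P.,389546",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)

# Four comma-free fields, a greedy Applicant field, then a final comma-free field
pattern <- "^([^,]*),([^,]*),([^,]*),([^,]*),(.*),([^,]*)$"
header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
fields <- regmatches(lines[-1], regexec(pattern, lines[-1]))

# Each match is c(full_match, group1, ..., group6); drop the full match
parsed <- as.data.frame(do.call(rbind, lapply(fields, `[`, -1)),
                        stringsAsFactors = FALSE)
names(parsed) <- header

# Every column except Applicant is numeric in the sample data
is_num <- header != "Applicant"
parsed[is_num] <- lapply(parsed[is_num], as.numeric)
```

Any line that does not match the pattern (fewer than six fields, for example) would break the rbind() step, so it is worth checking lengths(fields) before building the data frame.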

Here is a very silly workaround that does the trick for me (because I don't really care about the Applicant names at the moment). However, I'm hoping for a better solution.

Step 1: Open the .txt file in Notepad and add five extra column names, V1, V2, V3, V4, V5, to the header (to be sure to capture names with multiple commas).

library(data.table)

# fill = TRUE lets rows that were split by extra commas spill into V1..V5
bc <- read.table("data.txt", header = TRUE, fill = TRUE, sep = ",",
                 stringsAsFactors = FALSE)

sapply(bc, class)
unique(bc$V5) # only NA, so the column can be dropped
setDT(bc)
bc <- bc[, 1:10, with = FALSE]

# Convert Cited and the overflow columns to numeric.
# Name fragments such as " Inc." become NA and are then set to 0.
for (col in c("Cited", "V1", "V2", "V3", "V4")) {
  x <- suppressWarnings(as.numeric(bc[[col]]))
  x[is.na(x)] <- 0
  set(bc, j = col, value = x)
}

head(bc, 10)
# In each row only one of these columns holds the citation number,
# so the sum recovers it
bc$Cited <- with(bc, Cited + V1 + V2 + V3 + V4)

It's a silly patch, but it does the trick in this particular context.
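
The manual Notepad step could probably be avoided by letting R work out how many overflow columns are needed and passing the names through col.names. A rough sketch, assuming the same file layout as in the question; the sample lines stand in for data.txt, and the names `n_fields` and `extra` are made up for illustration.

```r
# Sample lines standing in for readLines("data.txt")
lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19910515,19940719,Dell Usa L.P.,389546",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)

# Count the comma-separated fields on every line
con <- textConnection(lines)
n_fields <- count.fields(con, sep = ",", quote = "")
close(con)

# One overflow column (V1, V2, ...) per field beyond the header width
header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
extra <- paste0("V", seq_len(max(n_fields) - length(header)))

# Skip the original header and supply the widened set of column names
bc <- read.table(text = lines, sep = ",", skip = 1, header = FALSE,
                 fill = TRUE, quote = "", stringsAsFactors = FALSE,
                 col.names = c(header, extra))
```

For a real file, count.fields("data.txt", sep = ",", quote = "") can be called on the path directly. The resulting bc has the same shape as after Step 1, so the numeric clean-up above applies unchanged.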
