Read txt file selectively in R
I'm looking for an easy fix to read a txt file that looks like this when opened in Excel:
IDmaster By_uspto App_date Grant_date Applicant Cited
2 1 19671106 19700707 Motorola Inc 1052446
2 1 19740909 19751028 Gen Motors Corp 1062884
2 1 19800331 19820817 Amp Incorporated 1082369
2 1 19910515 19940719 Dell Usa L.P. 389546
2 1 19940210 19950912 Schueman Transfer Inc. 1164239
2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
EDIT: Opening the txt file in Notepad looks like this (with commas). The last two rows exhibit the problem.
IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited
2,1,19671106,19700707,Motorola Inc,1052446
2,1,19740909,19751028,Gen Motors Corp,1062884
2,1,19800331,19820817,Amp Incorporated,1082369
2,1,19910515,19940719,Dell Usa L.P.,389546
2,1,19940210,19950912,Schueman Transfer, Inc.,1164239
2,1,19940217,19950912,Spacelabs Medical, Inc.,1164336
The problem is that some of the Applicant names contain commas, so they are read as if they belonged in a different column, which they don't.
Is there a simple way to a) "teach" R to keep string variables together, regardless of commas in between, or b) read in the first 4 columns, and then add an extra column for everything behind the last comma?
Given the length of the data, I can't open it entirely in Excel, which would otherwise be a simple alternative.
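To see which rows are affected before deciding how to parse the file, one option is to count the comma-separated fields on each line. The sketch below is only an illustration: the sample lines and the temporary file are assumptions standing in for the real data, and `quote = ""` assumes the file contains no quoted fields.

```r
# Hypothetical sample standing in for the real file
sample_lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19671106,19700707,Motorola Inc,1052446",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Number of comma-separated fields on every line, header included
n_fields <- count.fields(path, sep = ",", quote = "", comment.char = "")

# Lines whose field count differs from the header's contain extra commas
bad_lines <- which(n_fields != n_fields[1])
bad_lines
table(n_fields)
```

The largest value in `n_fields` also tells you how many spill-over columns a `fill = TRUE` read would need.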
If your example is written in a "Test.csv" file, try with:
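# Replace every ", " (comma followed by a space) with a single space before parsing;
# this assumes that pattern occurs only inside Applicant names
read.csv(text = gsub(", ", " ", paste0(readLines("Test.csv"), collapse = "\n")),
         quote = "'",
         stringsAsFactors = FALSE)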
It returns:
# IDmaster By_uspto App_date Grant_date Applicant Cited
# 1 2 1 19671106 19700707 Motorola Inc 1052446
# 2 2 1 19740909 19751028 Gen Motors Corp 1062884
# 3 2 1 19800331 19820817 Amp Incorporated 1082369
# 4 2 1 19910515 19940719 Dell Usa L.P. 389546
# 5 2 1 19940210 19950912 Schueman Transfer Inc. 1164239
# 6 2 1 19940217 19950912 Spacelabs Medical Inc. 1164336
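If the commas inside the names should be kept rather than dropped, option b) from the question can be sketched with a regular expression: four comma-free fields at the start, one comma-free field at the end, and everything in between treated as the Applicant. This is a minimal sketch, not a tested solution for the full file; the sample lines are assumptions, and it presumes every data row has at least six fields and that only Applicant can contain commas.

```r
# Hypothetical sample; in practice use: lines <- readLines("Test.csv")
lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19671106,19700707,Motorola Inc,1052446",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)

# Four comma-free fields, a greedy middle field, and a final comma-free field
pattern <- "^([^,]*),([^,]*),([^,]*),([^,]*),(.*),([^,]*)$"
parts <- regmatches(lines[-1], regexec(pattern, lines[-1]))

# Drop the full match (first element) and keep the six captured groups per row
fields <- do.call(rbind, lapply(parts, function(p) p[-1]))
df <- as.data.frame(fields, stringsAsFactors = FALSE)
names(df) <- strsplit(lines[1], ",", fixed = TRUE)[[1]]

# Every column except Applicant is numeric in the sample
num_cols <- setdiff(names(df), "Applicant")
df[num_cols] <- lapply(df[num_cols], as.numeric)
df
```

Rows that do not match the pattern come back empty from `regmatches`, so on real data it is worth checking `lengths(parts)` before binding the rows.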
This provides a very silly workaround, but it does the trick for me (because I don't really care about the Applicant names atm). However, I'm hoping for a better solution.
Step 1: Open the .txt file in Notepad, and add five column names V1, V2, V3, V4, V5 to the header line (to be sure to capture names with multiple commas).
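library(data.table)

# fill = TRUE pads short rows, so the pieces of a comma-containing Applicant
# name shift the remaining fields to the right instead of breaking the read
bc <- read.table("data.txt", header = TRUE, fill = TRUE, sep = ",", stringsAsFactors = FALSE)
sapply(bc, class)
unique(bc$V5) # only NA, so the column can be dropped
setDT(bc)
bc <- bc[, 1:10, with = FALSE]

# Where a name was split, Cited holds text: it becomes NA (with a coercion warning), then 0
bc$Cited <- as.numeric(bc$Cited)
bc$Cited[is.na(bc$Cited)] <- 0
bc$V1 <- as.numeric(bc$V1)
bc$V2 <- as.numeric(bc$V2)
bc$V3 <- as.numeric(bc$V3)
bc$V4 <- as.numeric(bc$V4)
bc$V1[is.na(bc$V1)] <- 0
bc$V2[is.na(bc$V2)] <- 0
bc$V3[is.na(bc$V3)] <- 0
bc$V4[is.na(bc$V4)] <- 0
head(bc, 10)

# Only one of Cited, V1..V4 holds the real number in each row (assuming no
# name fragment is itself numeric), so the sum recovers it
bc$Cited <- with(bc, Cited + V1 + V2 + V3 + V4)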
It's a silly patch, but it does the trick in this particular context.
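If the file is too large to edit comfortably by hand, the extra column names can probably be supplied from R instead. The sketch below assumes the same layout as in the question; the sample lines, the temporary file, and the choice of five spill-over columns are illustrative assumptions, and `quote = ""` presumes the file has no quoted fields.

```r
# Hypothetical sample standing in for data.txt
sample_lines <- c(
  "IDmaster,By_uspto,App_date,Grant_date,Applicant,Cited",
  "2,1,19671106,19700707,Motorola Inc,1052446",
  "2,1,19940210,19950912,Schueman Transfer, Inc.,1164239"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Read the real header, then append spill-over names V1..V5
header <- strsplit(readLines(path, n = 1), ",", fixed = TRUE)[[1]]
extra <- paste0("V", 1:5)

# Skip the header line and pass all eleven names via col.names;
# with fill = TRUE, shorter rows are padded on the right
bc <- read.table(path, header = FALSE, skip = 1, sep = ",", fill = TRUE,
                 col.names = c(header, extra), quote = "", comment.char = "",
                 strip.white = TRUE, stringsAsFactors = FALSE)
str(bc)
```

From here the clean-up steps above apply unchanged, since the resulting columns carry the same names as in the hand-edited file.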