
R text clean up

I have a CSV file with many entries like the following (one example shown):

Customer 1 car purchase
08/22/2016 08:10:00 Agent 1 (Agt1)
Customer 1 car purchase and service purchase.\n
Service indicates tires needed\n
possible oil change as well.\n
Tire quote provided.\n
*Name: Service advisor \n
*Phone: 123-456-7890 \n
Customer 1 called back to schedule appt.\n

I am trying to write R code that produces the following output for each entry:

Customer 1 car purchase and service purchase.
Service indicates tires needed and possible oil change as well.
Tire quote provided.
Customer 1 called back to schedule appt.

I want to strip out the first two lines and any lines containing *Name or *Phone.

One thing I tried was assigning each entry to a temp variable and then running:

library(stringi)                      # provides stri_split_lines
x <- stri_split_lines(temp)           # list with one character vector per entry
y <- x[[1]][3:length(x[[1]])]         # keep line 3 onward

This drops the first two lines. However, I am not sure how to remove the lines with *Name and *Phone, as they could be anywhere in the text. I am also fairly sure there is a better way out there :) Any ideas on how I can achieve this? The lines have \n at the end, so I was hoping to use a regex to split on that, but I was not able to get it to work. Thanks!

You can use readLines or strsplit to read in each entry (with lapply as necessary), and then grep to index:

x <- readLines(textConnection('Customer 1 car purchase
                               08/22/2016 08:10:00 Agent 1 (Agt1)
                               Customer 1 car purchase and service purchase.
                               Service indicates tires needed
                               possible oil change as well.
                               Tire quote provided.
                               *Name: Service advisor 
                               *Phone: 123-456-7890 
                               Customer 1 called back to schedule appt.'))

x <- trimws(x)    # clean up extra white space

x[c(-1, -2, -grep('\\*Name|\\*Phone', x))]
## [1] "Customer 1 car purchase and service purchase."
## [2] "Service indicates tires needed"               
## [3] "possible oil change as well."                 
## [4] "Tire quote provided."                         
## [5] "Customer 1 called back to schedule appt." 
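If an entry is already held as a single string (for example, one CSV cell), the strsplit route mentioned above might look like the following minimal sketch. It assumes the line breaks are real newline characters, and the names `entry`, `lines`, and `keep` are illustrative only. Because grepl returns a logical vector, the filter also behaves sensibly when an entry has no *Name or *Phone line at all.

```r
# One entry as a single string; real newline characters are assumed here.
entry <- "Customer 1 car purchase\n08/22/2016 08:10:00 Agent 1 (Agt1)\nCustomer 1 car purchase and service purchase.\n*Name: Service advisor \n*Phone: 123-456-7890 \nCustomer 1 called back to schedule appt."

# Split on newlines and trim surrounding white space.
lines <- trimws(strsplit(entry, "\n", fixed = TRUE)[[1]])

# Drop the two header lines, then any line starting with *Name or *Phone.
lines <- lines[-(1:2)]
keep <- !grepl("^\\*(Name|Phone)", lines)
lines[keep]
```

If the file instead contains a literal backslash followed by n, splitting on "\\n" with fixed = TRUE should match those two characters rather than a newline.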

Use paste to collapse the result back into a single block if you like.
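To apply the same clean-up to many entries and collapse each one back into a single block, a rough sketch could wrap the steps in a function. The helper name `clean_entry` and the sample `entries` vector are made up for illustration, and each entry is assumed to be one string with real newlines. vapply is used here as a type-checked relative of lapply.

```r
# Hypothetical helper: clean one entry given as a single string.
clean_entry <- function(entry) {
  lines <- trimws(strsplit(entry, "\n", fixed = TRUE)[[1]])
  lines <- lines[-(1:2)]                             # drop the two header lines
  lines <- lines[!grepl("^\\*(Name|Phone)", lines)]  # drop *Name / *Phone lines
  paste(lines, collapse = "\n")                      # back to one block
}

# Made-up sample data standing in for the CSV entries.
entries <- c(
  "Customer 1 car purchase\n08/22/2016 08:10:00 Agent 1 (Agt1)\nTire quote provided.\n*Name: Service advisor \nCustomer 1 called back to schedule appt.",
  "Customer 2 car purchase\n08/23/2016 09:15:00 Agent 2 (Agt2)\nOil change scheduled."
)

# One cleaned string per entry.
cleaned <- vapply(entries, clean_entry, character(1), USE.NAMES = FALSE)
cat(cleaned, sep = "\n\n")
```

Changing collapse to " " would give one running paragraph per entry instead of separate lines.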


 