
R Grep file name for a variable

I am new to R, so I am struggling with what I imagine is a fairly simple question. I am not looking for someone to just give me a solution. I was hoping that someone could explain the answer to me, so that I might learn to do it myself rather than just copy what you have done. That being said, here are my problem and questions.

I am making a histogram with R. A user will submit a file, and data from that file will be used to make a histogram. That much is already done. My problem is that I need to take only part of that file name and use it to help make a title for the histogram. The file name is a bit of a monster and follows this naming convention:

X_Y.doc.Z.x_y_z

The parts of that file name that I need are the Y and the Z. I know that many people use grep, but I am not sure how to use it in this instance. I have already read the ??grep page and am familiar with the basics of grep, but I don't really know where to start.

Eventually I will also need to grep some information from an Excel file, if someone cares to advise me on that as well. If it helps, this is how I am accepting the files:

library(tcltk)  # provides tk_choose.files()
F.n <- tk_choose.files(default = "", caption = "Select a file", multi = TRUE, filters = NULL, index = 1)

Does anyone have any suggestions?

The answer already given using stringr is excellent. That package provides you with some very helpful string munging tools.

If you want to use only base R, you could do this with gsub. Assuming your punctuation stays the same and there will not be any embedded periods or underscores in the X, Y or Z, something like this should work:

f <- 'X_Y.doc.Z.x_y_z'
gsub('^.+_(.+)\\.doc\\.(.+)\\..+_.+$', '\\1 \\2', f)

which returns:

[1] "Y Z"

You could put whatever you want between the two groups in the replacement to make it easier to get at each piece, or you could do this in two lines, returning one piece each. And remember, R almost never changes data in place. You need to assign the output of a function to a variable, as below. Otherwise it will just print to the console and be "lost" (this is true most of the time).

y <- gsub('^.+_(.+)\\.doc\\..+\\..+_.+$', '\\1', f)
z <- gsub('^.+_.+\\.doc\\.(.+)\\..+_.+$', '\\1', f)
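To tie this back to the histogram title, here is a minimal sketch. It assumes the chosen file arrives as a full path (the path below is hypothetical, standing in for one element of F.n), so basename() is used first to strip the directory. The title wording is only an illustration.

```r
# Hypothetical full path, standing in for one element of F.n
path <- "C:/data/X_Y.doc.Z.x_y_z"

# basename() drops the directory part, so the pattern only sees the file name
f <- basename(path)

# Same patterns as above, one capture group each
y <- gsub('^.+_(.+)\\.doc\\..+\\..+_.+$', '\\1', f)
z <- gsub('^.+_.+\\.doc\\.(.+)\\..+_.+$', '\\1', f)

# Build a title string; the wording is only an example
title <- paste("Histogram for", y, "and", z)
title
```

The resulting string could then be passed as the main argument of hist(), for example hist(values, main = title), where values is whatever data you read from the file.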

Let's break it down.

^ specifies the beginning of a line. It's good to be explicit. Similarly, $ identifies the end of a line.

. represents any character, and following it with a + means one or more of any character. If you used .* instead of .+ it would mean zero or more of any character, and that isn't what we want. If I want to write a literal . I need to escape it, since it's a special character. \ is the escape character both for regular expressions and for R strings, so you need two. To write a literal period you need to write \\.

Clear, to be sure. Finally, the parentheses represent a group I want to save. Groups can be referenced later using numbers that indicate the order in which you saved them. In some languages these parentheses need to be escaped as well, but not in R.
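The escaping and grouping rules above can be checked directly at the console. This is a small sketch using only base R functions (nchar, grepl, sub); the variable names are just for illustration.

```r
f <- "X_Y.doc.Z.x_y_z"

# In R source, "\\." is a two-character string: a backslash and a period
n_chars <- nchar("\\.")

# An unescaped . matches any character, so it matches a string with no period
any_char <- grepl(".", "abc")

# An escaped period only matches a literal period
literal_dot <- grepl("\\.", "abc")

# \\1 in the replacement refers to the first parenthesised group,
# here everything before the literal ".doc."
first_group <- sub("^(.+)\\.doc\\..+$", "\\1", f)
```

Here n_chars is 2, any_char is TRUE, literal_dot is FALSE, and first_group is "X_Y".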

Grep uses regular expressions to search for substrings matching a pattern. For your problem of matching certain elements of a file name, you would probably want to use capturing groups to extract the different parts.

An example of a regular expression with a capturing group would be:

"Hello, (\w+)"

This matches strings of the format "Hello, Friend". Here is an explanation of the pattern:

  • \w will match a "word character", while
  • + means that one or more of them will be matched.
  • For the other structural parts of your file name convention, we can include _ as it is, but we have to escape . because it has a special meaning in regular expressions.
  • To define a group that you want to extract (a capturing group), you put the part to be matched in parentheses: (\w+)

Using all that, we get the following pattern:

"(\w+)_(\w+)\.doc\.(\w+)\.(\w+)_(\w+)_(\w+)"

To get the pattern to work in R, we will have to escape all \ characters as \\ :

> pattern = "(\\w+)_(\\w+)\\.doc\\.(\\w+)\\.(\\w+)_(\\w+)_(\\w+)"

While grep and regex are powerful, I personally prefer the stringr package for its simpler interface. In particular, the str_match function can be very helpful, as it returns a matrix in which column 1 gives the full match and all subsequent columns give the matches for the capturing groups:

> library(stringr)
> x = "X_Y.doc.Z.x_y_z"
> str_match(x, pattern)

     [,1]              [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "X_Y.doc.Z.x_y_z" "X"  "Y"  "Z"  "x"  "y"  "z" 
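If installing stringr is not an option, a roughly equivalent result can be obtained in base R with regexec and regmatches. This is a sketch of an alternative, not what str_match does internally; it returns a list of character vectors rather than a matrix.

```r
x <- "X_Y.doc.Z.x_y_z"
pattern <- "(\\w+)_(\\w+)\\.doc\\.(\\w+)\\.(\\w+)_(\\w+)_(\\w+)"

# regexec() finds the match and group positions; regmatches() extracts the text
m <- regmatches(x, regexec(pattern, x))

# m is a list with one character vector per input string:
# element 1 is the full match, elements 2 to 7 are the capture groups
parts <- m[[1]]
y <- parts[3]
z <- parts[4]
```

Because the full match sits in position 1, the Y and Z groups end up at positions 3 and 4, mirroring columns 3 and 4 of the str_match output.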

If you are new to regular expressions, you should be fine with a tutorial written for any language. Syntax will mostly be similar and vary only in details, although not all features are supported by all programming languages. If you want to try out your expressions before putting them into your programs, I highly recommend RegexPal.

In this simple case of just needing a single letter that is in a well-defined place, substr would probably be simpler:

> a <- "X_Y.doc.Z.x_y_z"
> substr(a, 3, 3)
[1] "Y"
> substr(a, 9, 9)
[1] "Z"
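substr depends on Y and Z being single characters at fixed positions. If the real parts can be longer, one alternative (a different technique from the answers above, sketched under the assumption that no part contains an embedded period or underscore) is to split the name on its separators with strsplit:

```r
a <- "X_Y.doc.Z.x_y_z"

# Split on every period or underscore; inside [] the period needs no escaping
pieces <- strsplit(a, "[._]")[[1]]

# pieces is: "X" "Y" "doc" "Z" "x" "y" "z"
y <- pieces[2]
z <- pieces[4]
```

This keeps working when Y or Z grow to several characters, as long as the order of the parts in the file name stays the same.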

