
Removing white space from data frame in R

I have scraped some data and stored it in a data frame. Some rows contain unwanted information within square brackets, for example "[N] Team Name". I want to keep just the part containing the team name, so first I use the code below to remove the brackets and any text contained within them:

gsub( " *\\(.*?\\) *", "", x)

This leaves me with " Team Name" (notice the space before the T). Now I am trying to remove the white space before the T using trimws or the method shown here, but it is not working.

Could someone please help me remove the extra white space?

Note: if I type the string containing the space manually and apply trimws to it, it works. However, when I obtain the string directly from the data frame, it doesn't. Also, when running the code snippet below (where df[1,1] is the same string retrieved from the data frame), I get FALSE. This gives me reason to believe that the string in the data frame is not the same as the manually typed string.

" team name" == df[1,1]
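A minimal sketch of how that suspicion could be checked. It assumes the scraped value starts with a non-breaking space (U+00A0), which often appears in scraped HTML but is only a guess here; `scraped` and `typed` are hypothetical stand-ins for `df[1,1]` and the hand-typed string.

```r
# Hypothetical stand-ins: `scraped` mimics a value taken from the data frame,
# assuming its leading "space" is really a non-breaking space (U+00A0).
scraped <- "\u00a0Team Name"
typed   <- " Team Name"

scraped == typed
# [1] FALSE

# Inspect the code point of the first character of each string.
utf8ToInt(substr(scraped, 1, 1))
# [1] 160
utf8ToInt(substr(typed, 1, 1))
# [1] 32

# trimws() with its default settings only strips space, tab, CR and LF,
# so the non-breaking space survives.
trimws(scraped) == "Team Name"
# [1] FALSE

# Turn non-breaking spaces into ordinary spaces first, then trim.
cleaned <- trimws(gsub("\u00a0", " ", scraped, fixed = TRUE))
cleaned
# [1] "Team Name"
```

If the code point printed for the real data is something other than 160, the same replace-then-trim idea should still apply with that character substituted.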

You can try

gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")

You should be able to remove the bracketed piece as well as any following whitespace with a single regex substitution. Your regex is correct as-is, and should successfully accomplish this. (Note: I've ignored the unexplained discrepancy between your use of parentheses vs. square brackets in your question. I've assumed square brackets for my answer.)

Strangely, this seems to be a case where the default regex engine is failing, but adding perl=T gets it working:

x <- '[N] Team Name';
gsub(' *\\[.*?\\] *','',x);
## [1] " Team Name"
gsub(perl=T,' *\\[.*?\\] *','',x);
## [1] "Team Name"

In the past I have run across cases where the default regex engine flakes out, but I have never encountered this with perl=T, so I suggest you use that. I really think there is something broken in the default regex implementation.

We can use

sub(".*\\]\\s+", "", x)
#[1] "Team Name"

Or just

sub("\\S+\\s+", "", x)
#[1] "Team Name"

data

x <- '[N] Team Name';
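Since the question is about a data frame column rather than a single string, here is a small sketch of applying the first substitution above to a whole column. The data frame `df` and its `team` column are hypothetical, made up for illustration.

```r
# Hypothetical data frame; only some rows carry a bracketed prefix.
df <- data.frame(
  team = c("[N] Team Name", "Other Team", "[X]  Third Team"),
  stringsAsFactors = FALSE
)

# sub() is vectorised, so it can be applied to the column directly.
# Rows without a closing bracket do not match and are left unchanged.
df$team <- sub(".*\\]\\s+", "", df$team)
df$team
# [1] "Team Name"  "Other Team" "Third Team"

# Caveat for the shorter pattern: it drops the first run of non-space
# characters whether or not it is a bracketed tag.
sub("\\S+\\s+", "", "Other Team")
# [1] "Team"
```

The shorter `"\\S+\\s+"` pattern is therefore probably only safe when every row is known to start with a bracketed tag.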
