
Removing white space from data frame in R

I have scraped some data and stored it in a data frame. Some rows contain unwanted information within square brackets, for example "[N] Team Name". I want to keep just the part containing the team name, so first I use the code below to remove the brackets and any text contained within them:

gsub( " *\\(.*?\\) *", "", x)

This leaves me with " Team Name" (notice the space before the T). Now I am trying to remove the white space before the T using trimws or the method shown here, but it is not working.

Could someone please help me remove the extra white space?

Note: if I type the string containing the space manually and apply trimws to it, it works. However, when I take the string directly from the data frame, it doesn't. Also, when I run the code snippet below (where df[1,1] is the same string retrieved from the data frame), I get FALSE. This gives me reason to believe that the string in the data frame is not the same as the manually typed string.

" team name" == df[1,1]
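One way to check that suspicion is to compare the code points of the leading characters. The sketch below assumes, without confirmation from the question, that the scraped string starts with a non-breaking space (U+00A0), which is common in scraped HTML and which trimws does not strip by default. The variable `scraped` is a hypothetical stand-in for df[1,1].

```r
# Hypothetical stand-in for df[1, 1]: the leading character is assumed to be
# U+00A0 (non-breaking space), which looks like a space but is a different character.
scraped <- "\u00a0Team Name"
typed   <- " Team Name"

# Compare the code point of the first character of each string.
first_scraped <- utf8ToInt(substr(scraped, 1, 1))  # 160 for a non-breaking space
first_typed   <- utf8ToInt(substr(typed, 1, 1))    # 32 for an ordinary space

# Turn non-breaking spaces into ordinary spaces, then trim as usual.
cleaned <- trimws(gsub("\u00a0", " ", scraped, fixed = TRUE))
print(cleaned)
```

If the first code point of the real df[1,1] turns out to be something other than 160, the same replace-then-trim approach should still apply with that character substituted.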

You can try:

gsub( "\\[[^]]*\\]\\W*", "", "[N] Team Name")

You should be able to remove the bracketed piece as well as any following whitespace with a single regex substitution. Your regex is correct as-is, and should successfully accomplish this. (Note: I've ignored the unexplained discrepancy between your use of parentheses vs. square brackets in your question. I've assumed square brackets for my answer.)

Strangely, this seems to be a case where the default regex engine is failing, but adding perl=TRUE gets it working:

x <- '[N] Team Name'
gsub(' *\\[.*?\\] *', '', x)
## [1] " Team Name"
gsub(' *\\[.*?\\] *', '', x, perl=TRUE)
## [1] "Team Name"

In the past I have run across cases where the default regex engine misbehaves, but I have never encountered this with perl=TRUE, so I suggest you use that. I really think there is something broken in the default regex implementation.

We can use

sub(".*\\]\\s+", "", x)
#[1] "Team Name"

Or just

sub("\\S+\\s+", "", x)
#[1] "Team Name"

data

x <- '[N] Team Name';
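Since the question is about a data frame rather than a single string, here is a minimal sketch of applying the same kind of substitution to a whole column. The data frame `df` and its column name `team` are made up for illustration, and perl = TRUE is used following the suggestion above.

```r
# Hypothetical data frame; the column name `team` is an assumption.
df <- data.frame(team = c("[N] Team Name", "Other Team"),
                 stringsAsFactors = FALSE)

# Remove a leading "[...]" tag and any whitespace after it.
# Rows without a bracketed prefix are left unchanged.
df$team <- sub("^\\[[^]]*\\]\\s*", "", df$team, perl = TRUE)

print(df$team)
```

sub is vectorised over its third argument, so no loop or apply call is needed.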
