Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the whitespace in a data frame (using R). The data frame is large (>1 GB) and has multiple columns that contain whitespace in every data entry.

Is there a quick way to remove the whitespace from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:

gsub( " ", "", mydata) 

This didn't seem to work, although R returned an output which I have been unable to interpret.

str_replace( " ", "", mydata)

R returned 47 warnings and did not remove the whitespace.

erase_all(mydata, " ")

R returned an error saying 'Error: could not find function "erase_all"'.

I would really appreciate some help with this, as I've spent the last 24 hours trying to tackle this problem.

Thanks!

A lot of the answers are older, so here in 2019 is a simple dplyr solution that operates only on the character columns to remove leading and trailing whitespace.

library(dplyr)
library(stringr)

data %>%
  mutate_if(is.character, str_trim)

## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>% 
  mutate(across(where(is.character), str_trim))

You can swap out the str_trim() function for others if you want a different flavor of whitespace removal.

# for example, remove all spaces
df %>% 
  mutate(across(where(is.character), ~ str_remove_all(.x, pattern = fixed(" "))))

If I understood you correctly, you want to remove all the whitespace from the entire data frame. I guess the code you are using is good for removing spaces in the column names. I think you should try this:

 apply(myData,2,function(x)gsub('\\s+', '',x))

Hope this works.

However, this will return a matrix. If you want to change it to a data frame, then do:

as.data.frame(apply(myData,2,function(x)gsub('\\s+', '',x)))
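To illustrate the caveat about apply, here is a minimal sketch on a hypothetical two-column data frame (`toy` and `cleaned` are made-up names, not from the question). It suggests that the whitespace is removed, but also that the integer column comes back as character, because apply first coerces the whole data frame to a character matrix.

```r
# Hypothetical toy data: one character column, one integer column
toy <- data.frame(txt = c(" a b ", "c  d"), n = 1:2, stringsAsFactors = FALSE)

# Same pattern as above: apply() over columns, then back to a data frame
cleaned <- as.data.frame(apply(toy, 2, function(x) gsub("\\s+", "", x)),
                         stringsAsFactors = FALSE)

cleaned$txt       # whitespace removed at the ends and inside the strings
class(toy$n)      # class of the original numeric column
class(cleaned$n)  # class after the round trip through a matrix
```

If the column types matter, the numeric columns have to be converted back afterwards, or only the character columns should be touched in the first place.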

EDIT in 2020:

Using lapply with the trimws function (which = "both", the default) removes leading and trailing spaces, but not spaces inside the string. Since the OP provided no input data, I am adding a dummy example to produce the results.

DATA:

df <- data.frame(val = c(" abc"," kl m","dfsd "),val1 = c("klm ","gdfs","123"),num=1:3,num1=2:4,stringsAsFactors = FALSE)

# situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends, NOT inside the string values, we can use trimws

cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[,cols_to_be_rectified] <- lapply(df[,cols_to_be_rectified], trimws)
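As a small sketch of what trimws does on its own, the snippet below applies it to a standalone vector with the same values as the `val` column (the vector name `x` is only for illustration). The `which` argument selects the side to trim; `"both"` is the default.

```r
# Same strings as df$val in the dummy data
x <- c(" abc", " kl m", "dfsd ")

both  <- trimws(x)                   # default: which = "both"
left  <- trimws(x, which = "left")   # leading whitespace only
right <- trimws(x, which = "right")  # trailing whitespace only

both   # the space inside "kl m" is kept
left
right
```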

# situation: 2 (Using Base R), when we want to remove spaces at every place in the character columns of the data frame (inside a string as well as at the leading and trailing ends).

(This was the initial solution, proposed using apply. Please note that a solution using apply seems to work but would be very slow. Also, the question is apparently not very clear about whether the OP really wanted to remove leading/trailing blanks or every blank in the data.)

cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[,cols_to_be_rectified] <- lapply(df[,cols_to_be_rectified], function(x)gsub('\\s+','',x))

## situation: 1 (Using data.table, removing only leading and trailing blanks)

library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[,c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]

Output from situation 1:

    val val1 num num1
1:  abc  klm   1    2
2: kl m gdfs   2    3
3: dfsd  123   3    4

## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)

cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[,c(cols_to_be_rectified) := lapply(.SD, function(x)gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]

Output from situation 2:

    val val1 num num1
1:  abc  klm   1    2
2:  klm gdfs   2    3
3: dfsd  123   3    4

Note the difference between the outputs of the two situations. In row 2 you can see that trimws removes only leading and trailing blanks, whereas the regex solution removes every blank.

I hope this helps. Thanks.

Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:

df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)

As others have noted, this changes all types to character. In my work, I first determine the types available in the original and the conversions required. After trimming, I re-apply the types needed.
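One possible way to re-apply the types is sketched below. It is only an illustration under stated assumptions: the data frame `toy`, the lookup list `converters`, and the choice of supported classes are all hypothetical, and columns of other classes (dates, factors) would need their own converters.

```r
# Hypothetical data with character, double and integer columns
toy <- data.frame(txt = c(" a ", "b "), n = c(1.5, 2), k = 1:2,
                  stringsAsFactors = FALSE)

# 1. Record the original class of every column
orig_classes <- vapply(toy, function(col) class(col)[1], character(1))

# 2. Trim everything (all columns become character)
trimmed <- data.frame(lapply(toy, trimws), stringsAsFactors = FALSE)

# 3. Convert each column back using a converter chosen by its original class
converters <- list(character = as.character, numeric = as.numeric,
                   integer = as.integer, logical = as.logical)
for (nm in names(trimmed)) {
  trimmed[[nm]] <- converters[[orig_classes[[nm]]]](trimmed[[nm]])
}

str(trimmed)
```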

If your original types are OK, apply the solution from MarkusN below: https://stackoverflow.com/a/37815274/2200542

Those working with Excel files may wish to explore the readxl package, which defaults to trim_ws = TRUE when reading.

Picking up on Fremzy and Mielniczuk, I came to the following solution:

data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors = FALSE)

It works for mixed numeric/character data frames and manipulates only the character columns.

One possibility involving just dplyr could be:

data %>%
 mutate_if(is.character, trimws)

Or, considering that all variables are of class character:

data %>%
 mutate_all(trimws)

R is simply not the right tool for such a file size. However, you have two options:

Use ffdfdply and ffbase

Use the ff and ffbase packages:

library(ff)
library(ffbase)
# 'splits' (the number of chunks) must be defined before running this
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
                 first.rows=1e4, next.rows=5e4)
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)        
             apply(myData,2,function(x)gsub('\\s+', '',x)))

Use sed (my preference)

sed -i -r "s/(\S)\s+(\S)/\1\2/g;s/^\s+//;s/\s+$//" your_file

You could use the trimws function (available since R 3.2) on all the columns.

myData[,c(1)]=trimws(myData[,c(1)])

You can loop this over all the columns in your dataset. It has good performance with large datasets as well.
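A loop over all columns could look like the sketch below. It assumes a hypothetical small `myData` and trims only the character columns, so numeric columns keep their type; assigning into `myData[]` keeps the object a data frame.

```r
# Hypothetical stand-in for the real data
myData <- data.frame(a = c(" x", "y "), b = c(10, 20), stringsAsFactors = FALSE)

# Trim every character column; leave all other columns untouched
myData[] <- lapply(myData, function(col) {
  if (is.character(col)) trimws(col) else col
})

str(myData)
```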

If you're dealing with large data sets like this, you could really benefit from the speed of data.table.

library(data.table)

setDT(df)

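# Trim the values (not the column names) of every character column, by reference
for (j in names(df)[vapply(df, is.character, logical(1))]) set(df, j = j, value = trimws(df[[j]]))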

I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.

If you want to maintain the variable classes in your data.frame, you should know that using apply will clobber them, because it outputs a matrix in which all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk, you can loop through the columns of your data.frame and trim the whitespace off only the columns of class factor or character (and maintain your data classes):

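for (i in names(mydata)) {
  # is.factor()/is.character() also cope with columns that have more than one class (e.g. ordered factors)
  if (is.factor(mydata[[i]]) || is.character(mydata[[i]])) {
    # note: trimws() returns a character vector, so factor columns become character
    mydata[[i]] <- trimws(mydata[[i]])
  }
}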
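Note that trimws returns a character vector, so a factor column passed through it becomes character. If the factor class should be kept, one option (an assumption on my part, not part of the answer above) is to trim the levels instead of the values. The factor `f` below is a made-up example.

```r
# Hypothetical factor whose levels differ only by surrounding whitespace
f <- factor(c("a ", "a", " b"))

# Trimming the levels keeps the factor class; levels that become
# identical after trimming are merged into one
levels(f) <- trimws(levels(f))

class(f)
levels(f)
as.character(f)
```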

I think that a simple approach with sapply also works, given a df like:

dat<-data.frame(S=LETTERS[1:10],
            M=LETTERS[11:20],
            X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
            Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
            Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
            N=c(1:3,'4 ','5 ',6:10),
            stringsAsFactors = FALSE)

You will notice that dat$N becomes class character because of '4 ' and '5 ' (you can check with class(dat$N)).

To get rid of the spaces in the numeric column, simply convert it to numeric with as.numeric or as.integer.

dat$N<-as.numeric(dat$N)

If you want to trim the spaces from every column, do:

dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)

And again use as.numeric on column N (because sapply will convert it to character):

dat.b$N<-as.numeric(dat.b$N)
