
Avoiding redundancy when selecting rows in a data frame

My code is littered with statements like the following:

selected <- long_data_frame_name[long_data_frame_name$col1 == "condition1" &
                                 long_data_frame_name$col2 == "condition2" &
                                 !is.na(long_data_frame_name$col3),
                                 selected_columns]

The repetition of the data frame name is tedious and error-prone. Is there a way to avoid it?

You can use with(). For instance:

sel.ID <- with(long_data_frame_name, col1==2 & col2<0.5 & col3>0.2)
selected <- long_data_frame_name[sel.ID, selected_columns]
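
One caveat with this approach, shown here as a small sketch: a logical index built with with() keeps NA entries, so rows where the condition evaluates to NA come back as all-NA rows unless the index is wrapped in which(). The toy data frame `dat` and its values are made up for illustration and are not from the question.

```r
# Toy data, purely illustrative (not from the question)
dat <- data.frame(col1 = c(2, 2, 1, 2),
                  col2 = c(0.1, NA, 0.3, 0.9),
                  col3 = c(0.5, 0.6, 0.7, 0.8))

# Build the logical index once, without repeating the data frame name
sel <- with(dat, col1 == 2 & col2 < 0.5 & col3 > 0.2)

# Row 2 has col2 = NA, so sel[2] is NA and plain indexing returns an all-NA row
with_na <- dat[sel, ]

# which() drops the NA entries and keeps only rows where the condition is TRUE
clean <- dat[which(sel), ]
```

If the columns can contain missing values, dat[which(sel), ] is probably the safer form of the two.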

Several ways come to mind.

If you think about it, you are subsetting your data. Hence use the subset function (base package):

your_subset <- subset(long_data_frame_name,
                      col1 == "cond1" & col2 == "cond2" & !is.na(col3),
                      select = selected_columns)

This is, in my opinion, the most expressive ("self-documenting") way to accomplish your task.
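
To make the behaviour concrete, here is a minimal sketch comparing subset() with plain bracket indexing on a made-up data frame (`dat`, its values, and `selected_columns` are assumptions for illustration). Note that subset() silently drops rows where the condition is NA, and that its help page recommends it for interactive use rather than inside functions.

```r
# Toy data, purely illustrative
dat <- data.frame(col1 = c("cond1", "cond1", "other", NA),
                  col2 = c("cond2", "cond2", "cond2", "cond2"),
                  col3 = c(1.5, NA, 3.0, 4.0),
                  col4 = 1:4,
                  stringsAsFactors = FALSE)
selected_columns <- c("col1", "col4")

# subset(): column names are looked up inside dat, NA conditions count as FALSE
via_subset <- subset(dat,
                     col1 == "cond1" & col2 == "cond2" & !is.na(col3),
                     select = selected_columns)

# Equivalent bracket form; which() is needed to discard the NA condition in row 4
cond <- dat$col1 == "cond1" & dat$col2 == "cond2" & !is.na(dat$col3)
via_bracket <- dat[which(cond), selected_columns]
```

Both forms keep only the first row here; the subset() call simply says so with less repetition.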

Use data tables.

library(data.table)
long_data_table_name = data.table(long_data_frame_name, key="col1,col2,col3")
selected <- long_data_table_name[col1 == "condition1" & 
                                 col2 == "condition2" & 
                                 !is.na(col3),
                                 list(col4,col5,col6,col7)]

You don't have to set the key in the data.table(...) call, but on a large dataset a keyed table can make lookups much faster. Even without a key, data.table is typically faster than the equivalent data frame operation on large data. Finally, using J(...), as below, does require a keyed data.table, but is even faster.

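# Join on the first two key columns, then drop rows where col3 is NA.
# (Passing NA as a third J() argument would match rows where col3 IS NA.)
selected <- long_data_table_name[J("condition1", "condition2"), nomatch = 0L][
                                 !is.na(col3), list(col4, col5, col6, col7)]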

You have several possibilities:

  • attach, which adds the variables of the data.frame to the search path just below the global environment. This is very useful for code demonstrations, but I would warn you not to use it programmatically.

  • with, which temporarily creates a new environment from the data and evaluates your expression in it.
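
A short sketch of why attach is risky in scripts: it places a copy of the columns on the search path, so later changes to the data frame are not reflected in the attached names. The names `df_demo` and `score` are hypothetical, chosen only for this illustration.

```r
df_demo <- data.frame(score = 1:3)

attach(df_demo)                       # copies the columns onto the search path
df_demo$score <- df_demo$score * 10L  # modify the data frame afterwards

stale <- score                 # resolved via the attached copy: the old values
fresh <- with(df_demo, score)  # evaluated against the current data frame

detach(df_demo)                # always undo attach() when you are done
```

with() avoids the problem because it looks the names up in the data frame at the moment the expression is evaluated.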

In a few limited cases you may want other options, such as within.

df  <- data.frame(random = runif(100))
df1 <- with(df, log(random))                 # returns a numeric vector
df2 <- within(df, logRandom <- log(random))  # returns df plus a logRandom column

within examines the created environment after evaluation and adds the modifications to a copy of the data, which it returns. with just evaluates your expression and returns its value. Check the help page of with (?with) for more examples.


 