简体   繁体   中英

The matching columns from a data frame based on value in a column from other data Frame

I have two data frames The first one is df1 has 485513 columns and 100 rows,

head(df1)

sample  cg1 cg2 cg3 cg4 cg5 cg6 cg7 cg8 cg9 cg10    cg11
AAD_1   33435   33436   33437   33438   33439   33440   33441   33442   33443   33444   33445
AAD_2   0.33    1.33    2.33    3.33    4.33    5.33    6.33    7.33    8.33    9.33    10.33
AAD_3   0.56    1.56    2.56    3.56    4.56    5.56    6.56    7.56    8.56    9.56    10.56
AAD_4   45.9    46.9    47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9
AAD_5   46.9    47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9
AAD_6   47.9    48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9
AAD_7   48.9    49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9
AAD_8   49.9    50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9
AAD_9   50.9    51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9    60.9
AAD_10  51.9    52.9    53.9    54.9    55.9    56.9    57.9    58.9    59.9    60.9    61.9

and the second one has df2 84 rows and single column. I am aiming to get a subset of df1 using the values in the column from the df2 data frame.

head(df2)
    ID
    cg1
    cg2
    cg3
    cg4
    cg5

The values of df2 are the columns names of my interest from df1 and so I have tried the following one-liner in R.

> UP=(df1 %>% as.data.frame)[,df2$ID]

The Up data frame returns me with unmatched columns from my query df2

And it resulted in a data frame UP with 84 columns and 100 rows but none of the columns the above command line returned is matching with the input query data frame df2.

It would be great if someone suggests me an alternative solution

Assuming df2 is a Series:

>>> df[df2.tolist()]

        cg1       cg2       cg3       cg4       cg5
0  33435.00  33436.00  33437.00  33438.00  33439.00
1      0.33      1.33      2.33      3.33      4.33
2      0.56      1.56      2.56      3.56      4.56
3     45.90     46.90     47.90     48.90     49.90
4     46.90     47.90     48.90     49.90     50.90
5     47.90     48.90     49.90     50.90     51.90
6     48.90     49.90     50.90     51.90     52.90
7     49.90     50.90     51.90     52.90     53.90
8     50.90     51.90     52.90     53.90     54.90
9     51.90     52.90     53.90     54.90     55.90

If it is a dataframe, then this should work:

df[df2.ID.tolist()]

In R , we can just do

df[as.character(df2$ID)]

assuming that 'ID' column is factor . In case it is character class, it is more easier

df[df2$ID]

But if there are elements in 'ID' that are not in the column names of 'df', it may be better to use intersect

df[intersect(colnames(df), df2$ID)]

If the 'df' is a data.table , the usual way to subset columns will be to include the with =FALSE . It is mentioned in ?data.table

with

By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables.

When with=FALSE j is a character vector of column names, a numeric vector of column positions to select or of the form startcol:endcol, and the value returned is always a data.table. with=FALSE is often useful in data.table to select columns dynamically. Note that x[, cols, with=FALSE] is equivalent to x[, .SD, .SDcols=cols].

Therefore, the above commands would be

 df[, as.character(df2$ID), with = FALSE]

or

 df[, df2$ID, with = FALSE] #if 'ID' is already character class.

Or

 df[, intersect(colnames(df), df2$ID), with = FALSE]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM