
python - New column based on comparison with another dataframe

I have two dataframes. I'd like to create a new column in one based on comparing it to another.

I'm new to Python, but here's my solution in R, which works but is horrific and slow. I'd like to find a faster method, and I've been trying desperately to learn to use pandas since it seems like a good method.

Mapfile (has ~800,000 rows)

Name    Chr   Position
S1      1      3000
S2      1      6000
S3      1      1000

Armsfile (has 39 rows)

Chr    Arm    Start   End
1      p      0       5000
1      q      5001    10000

R Script:

for (line in 1:nrow(mapfile)){
      mapfile$Arm[line] <- Armsfile$Arm[mapfile$Chr[line] == Armsfile$Chr &  mapfile$Position[line] > Armsfile$Start &  mapfile$Position[line] < Armsfile$End]
      }

Output Table:

Name   Chr   Position   Arm
S1      1     3000      p
S2      1     6000      q
S3      1     1000      p

In words: for each row of the mapfile, I want to (1) find the rows of the arms file with the same Chr, (2) among those, find the row where Start < Position < End, and then copy that row's Arm into a new column.

I tried just reformatting my R script for Python, but couldn't get the syntax right. I also tried using pandas merge, but couldn't find a way to join on inequality comparisons.

For completeness, here are my bad attempts just mentioned:

for line in 1:mapfile.shape[0]:
      mapfile$Arm[line] = Armsfile$Arm[   mapfile$Chr[line] == Armsfile$Chr &&  mapfile$Position[line] > Armsfile$Start &&  mapfile$Position[line] < Armsfile$End]

and

df = pd.merge(mapfile, Armsfile, on=['Chr', mapfile.Position > Armsfile.Start, mapfile.Position < Armsfile.End])

Edit: Another possible way to think about it

I've been working on another possibility in R that perhaps could translate somehow to Python. Here's the R code:

mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End = c(5000, 10000), key = "Chr")
mapfile$Arm <- c("N")
> mapfile
   Name Chr Position Arm
1:   S1   1     3000   N
2:   S2   1     6000   N
3:   S3   1     1000   N

for(i in 1:nrow(Chr.Arms)){
   cur.row <- Chr.Arms[i,]
   # assign the current row's arm only, not the whole Arm column
   mapfile$Arm[mapfile$Chr == cur.row$Chr & mapfile$Position >= cur.row$Start & mapfile$Position <= cur.row$End] <- as.character(cur.row$Arm)
   }

> mapfile
   Name Chr Position Arm
1:   S1   1     3000   p
2:   S2   1     6000   q
3:   S3   1     1000   p

But again, with such large data, I'd like to be able to do something similar in Python. I haven't yet found the solution.
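The arm-by-arm loop above translates fairly directly to pandas. The following is a minimal sketch, not a tuned solution: it rebuilds the sample tables from the question (the lowercase names `mapfile` and `armsfile` are just illustrative), loops over the small arms table, and labels all matching rows of the large table in one vectorized step per arm. It assumes inclusive bounds (Start <= Position <= End), as in the R snippet.

```python
import pandas as pd

# Sample tables copied from the question.
mapfile = pd.DataFrame({
    "Name": ["S1", "S2", "S3"],
    "Chr": [1, 1, 1],
    "Position": [3000, 6000, 1000],
})
armsfile = pd.DataFrame({
    "Chr": [1, 1],
    "Arm": ["p", "q"],
    "Start": [0, 5001],
    "End": [5000, 10000],
})

# Placeholder for rows that fall in no arm (mirrors the "N" in the R code).
mapfile["Arm"] = "N"

# Loop over the small table (39 rows in the real data); each pass labels
# every matching mapfile row at once through a boolean mask.
for arm in armsfile.itertuples(index=False):
    in_arm = (
        (mapfile["Chr"] == arm.Chr)
        & (mapfile["Position"] >= arm.Start)
        & (mapfile["Position"] <= arm.End)
    )
    mapfile.loc[in_arm, "Arm"] = arm.Arm

print(mapfile)
```

Because the Python-level loop runs once per arm rather than once per mapfile row, the per-row work stays inside pandas' vectorized comparisons.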

Since you have about 800K rows of data, I don't know how optimal this is, but could you:

  1. merge the two dataframes on Chr
  2. use loc to filter the merged dataframe down to the rows where Position falls inside the arm's range?

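# pair every Mapfile row with every arm on the same chromosome
df = Mapfile.merge(Armsfile, on='Chr')
# keep only the pairing whose range contains Position, then drop the helper columns
df = df.loc[(df['Position'] > df['Start']) & (df['Position'] <= df['End'])].drop(['Start', 'End'], axis=1)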
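For anyone who wants to run the merge-then-filter idea end to end, here is a small self-contained sketch using the sample tables from the question. The names `merged`, `inside`, and `result` are illustrative, and the bounds are kept exactly as in the two lines above (Start exclusive, End inclusive).

```python
import pandas as pd

# Sample tables copied from the question.
Mapfile = pd.DataFrame({
    "Name": ["S1", "S2", "S3"],
    "Chr": [1, 1, 1],
    "Position": [3000, 6000, 1000],
})
Armsfile = pd.DataFrame({
    "Chr": [1, 1],
    "Arm": ["p", "q"],
    "Start": [0, 5001],
    "End": [5000, 10000],
})

# Step 1: pair each Mapfile row with every arm on the same chromosome.
merged = Mapfile.merge(Armsfile, on="Chr")

# Step 2: keep the pairing whose range contains the position.
inside = (merged["Position"] > merged["Start"]) & (merged["Position"] <= merged["End"])
result = merged.loc[inside].drop(columns=["Start", "End"]).reset_index(drop=True)

print(result)
```

The intermediate `merged` frame holds one row per (position, arm on that chromosome) pair, so on the real data it will be larger than the mapfile before the filter trims it back down.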

Note: I wasn't sure how to handle the Armsfile Chr because your Mapfile and Armsfile both had Chr of 1 in your example.
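On the Chr question, one option that matches within each chromosome and avoids the large intermediate frame is `pd.merge_asof` with `by="Chr"`. This is only a sketch under some assumptions: the arms on a chromosome do not overlap, bounds are inclusive (Start <= Position <= End), and the Chr 2 rows and sample S4 below are made up purely to show a second chromosome.

```python
import pandas as pd

mapfile = pd.DataFrame({
    "Name": ["S1", "S2", "S3", "S4"],  # S4 is a made-up row on a second chromosome
    "Chr": [1, 1, 1, 2],
    "Position": [3000, 6000, 1000, 7000],
})
armsfile = pd.DataFrame({
    "Chr": [1, 1, 2, 2],               # the Chr 2 arms are made up for illustration
    "Arm": ["p", "q", "p", "q"],
    "Start": [0, 5001, 0, 4001],
    "End": [5000, 10000, 4000, 9000],
})

# merge_asof requires both frames to be sorted by their matching keys.
left = mapfile.sort_values("Position")
right = armsfile.sort_values("Start")

# For each position, take the last arm on the same chromosome
# whose Start is <= Position (the default "backward" direction).
matched = pd.merge_asof(left, right, left_on="Position", right_on="Start", by="Chr")

# Blank out matches where the position lies past the arm's End.
matched["Arm"] = matched["Arm"].where(matched["Position"] <= matched["End"])

result = (
    matched.drop(columns=["Start", "End"])
    .sort_values("Name")
    .reset_index(drop=True)
)
print(result)
```

Positions that fall in no arm end up with a missing value in Arm rather than being dropped, which makes gaps in the arms table easy to spot.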

