python - New column based on comparison with another dataframe

Question

I have two dataframes. I'd like to create a new column in one based on comparing it to another.

I'm new to Python. Here's my solution in R, which works but is horribly slow. I'd like a faster method, and I've been trying hard to learn pandas since it seems well suited to this.

Mapfile (has ~800,000 rows)

Name    Chr   Position
S1      1      3000
S2      1      6000
S3      1      1000

Armsfile (has 39 rows)

Chr    Arm    Start   End
1      p      0       5000
1      q      5001    10000

R Script:

for (line in 1:nrow(mapfile)){
      mapfile$Arm[line] <- Armsfile$Arm[mapfile$Chr[line] == Armsfile$Chr &  mapfile$Position[line] > Armsfile$Start &  mapfile$Position[line] < Armsfile$End]
      }

Output Table:

Name   Chr   Position   Arm
S1      1     3000      p
S2      1     6000      q
S3      1     1000      p

In words: for each row of Mapfile, I want to look up its location in Armsfile by 1) finding the rows with the matching Chr and 2) finding the row where Start < Position < End, then take that row's Arm value and place it in a new column.

I tried just reformatting my R script for Python, but couldn't get the syntax right. I also tried using pandas merge, but couldn't find a way to include inequality comparisons in the join.

For completeness, here are my bad attempts just mentioned:

for line in 1:mapfile.shape[0]:
      mapfile$Arm[line] = Armsfile$Arm[   mapfile$Chr[line] == Armsfile$Chr &&  mapfile$Position[line] > Armsfile$Start &&  mapfile$Position[line] < Armsfile$End]

and

df = pd.merge(mapfile, Armsfile, on=['Chr', mapfile.Position > Armsfile.Start, mapfile.Position < Armsfile.End])

Edit: Another possible way to think about it

I've been working on another possibility in R that perhaps could translate somehow to Python. Here's the R code:

mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End = c(5000, 10000), key = "Chr")
mapfile$Arm <- c("N")
> mapfile
   Name Chr Position Arm
1:   S1   1     3000   N
2:   S2   1     6000   N
3:   S3   1     1000   N

for(i in 1:nrow(Chr.Arms)){
   cur.row <- Chr.Arms[i,]
   mapfile$Arm[mapfile$Chr == cur.row$Chr & mapfile$Position >= cur.row$Start & mapfile$Position <= cur.row$End] <- cur.row$Arm
   }

> mapfile
   Name Chr Position Arm
1:   S1   1     3000   p
2:   S2   1     6000   q
3:   S3   1     1000   p

But again, with such large data, I'd like to be able to do something similar in Python. I haven't yet found the solution.

Answer 1

Since you have ~800K rows of data, I don't know how efficient this is, but could you:

1. merge the two dataframes
2. use loc to filter down the merged dataframe?

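# Pair every Mapfile row with each Armsfile row that shares its Chr.
df = Mapfile.merge(Armsfile, on='Chr')
# Keep rows whose Position lies within [Start, End] (inclusive), then drop the helper columns.
df = df.loc[(df['Position'] >= df['Start']) & (df['Position'] <= df['End'])].drop(['Start', 'End'], axis=1)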
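
As a self-contained sketch of those two steps, here is the same idea run on the sample rows from the question. The lowercase frame names `mapfile` and `armsfile` and the choice of inclusive bounds are assumptions made for illustration; adjust the comparison if your intervals should be open at either end.

```python
import pandas as pd

# Sample data copied from the question.
mapfile = pd.DataFrame({
    "Name": ["S1", "S2", "S3"],
    "Chr": [1, 1, 1],
    "Position": [3000, 6000, 1000],
})
armsfile = pd.DataFrame({
    "Chr": [1, 1],
    "Arm": ["p", "q"],
    "Start": [0, 5001],
    "End": [5000, 10000],
})

# Step 1: pair every map row with every arm on the same chromosome.
merged = mapfile.merge(armsfile, on="Chr")

# Step 2: keep only the pairing whose interval contains the position.
# Inclusive bounds are an assumption, not something the question fixes.
inside = merged["Position"].between(merged["Start"], merged["End"])
result = (
    merged.loc[inside]
    .drop(columns=["Start", "End"])
    .reset_index(drop=True)
)
print(result)
```

The intermediate `merged` frame holds one row per (map row, arm on that chromosome) pair, so with 39 arm rows it stays a small multiple of the 800K map rows before filtering.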

Note: I wasn't sure how to handle the Armsfile Chr because your Mapfile and Armsfile both had Chr of 1 in your example.
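
On the Chr point: merging on the shared `Chr` column only pairs rows within the same chromosome, so several chromosomes should work without extra handling. A rough sketch follows, in which the chromosome 2 intervals and the markers S4 and S5 are invented for illustration and `Name` is assumed to be unique. It also merges the matches back with `how="left"` so that a position falling in no arm is kept with a missing Arm rather than dropped.

```python
import pandas as pd

mapfile = pd.DataFrame({
    "Name": ["S1", "S2", "S3", "S4", "S5"],   # S4 and S5 are hypothetical
    "Chr": [1, 1, 1, 2, 2],
    "Position": [3000, 6000, 1000, 2500, 99999],
})
armsfile = pd.DataFrame({
    "Chr": [1, 1, 2, 2],
    "Arm": ["p", "q", "p", "q"],
    "Start": [0, 5001, 0, 4001],              # chromosome 2 bounds are made up
    "End": [5000, 10000, 4000, 8000],
})

# Pair rows per chromosome, then keep pairings whose interval contains the position.
merged = mapfile.merge(armsfile, on="Chr")
inside = merged["Position"].between(merged["Start"], merged["End"])
hits = merged.loc[inside, ["Name", "Arm"]]

# Left merge back onto the original rows; unmatched positions get NaN in Arm.
result = mapfile.merge(hits, on="Name", how="left")
print(result)
```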
