I have two dataframes. I'd like to create a new column in one based on comparing it to another.
I'm new to Python, but here's my solution in R, which works but is horrific and slow. I'd like to find a faster method, and I've been trying desperately to learn pandas, since it seems like a good fit.
Mapfile (has ~800,000 rows)
Name Chr Position
S1 1 3000
S2 1 6000
S3 1 1000
Armsfile (has 39 rows)
Chr Arm Start End
1 p 0 5000
1 q 5001 10000
R Script:
for (line in 1:nrow(mapfile)){
mapfile$Arm[line] <- Armsfile$Arm[mapfile$Chr[line] == Armsfile$Chr & mapfile$Position[line] > Armsfile$Start & mapfile$Position[line] < Armsfile$End]
}
Output Table:
Name Chr Position Arm
S1 1 3000 p
S2 1 6000 q
S3 1 1000 p
In words: I want each line to look up its location (1) find the right Chr, 2) find the line where START < POSITION < END), then get the ARM information and place it in a new column.
I tried just reformatting my R script for Python, but couldn't get the syntax right. I also tried using merge for pandas, but couldn't find a way to do mathematical comparisons.
For completeness, here are my bad attempts just mentioned:
for line in 1:mapfile.shape[0]:
mapfile$Arm[line] = Armsfile$Arm[ mapfile$Chr[line] == Armsfile$Chr && mapfile$Position[line] > Armsfile$Start && mapfile$Position[line] < Armsfile$End]
and
df = pd.merge(mapfile, Armsfile, on=['Chr', mapfile.Position > Armsfile.Start, mapfile.Position < Armsfile.End])
Edit: Another possible way to think about it
I've been working on another possibility in R that perhaps could translate somehow to Python. Here's the R code:
mapfile <- data.frame(Name = c("S1", "S2", "S3"), Chr = 1, Position = c(3000, 6000, 1000), key = "Chr")
Chr.Arms <- data.frame(Chr = 1, Arm = c("p", "q"), Start = c(0, 5001), End = c(5000, 10000), key = "Chr")
mapfile$Arm <- c("N")
> mapfile
Name Chr Position Arm
1: S1 1 3000 N
2: S2 1 6000 N
3: S3 1 1000 N
for(i in 1:nrow(Chr.Arms)){
cur.row <- Chr.Arms[i,]
mapfile$Arm[mapfile$Chr == cur.row$Chr & mapfile$Position >= cur.row$Start & mapfile$Position <= cur.row$End] <- Chr.Arms$Arm
}
> mapfile
Name Chr Position Arm
1: S1 1 3000 p
2: S2 1 6000 p
3: S3 1 1000 q
But again, with such large data, I'd like to be able to do something similar in Python. I haven't yet found the solution.
Since you have ~800K rows of data, I don't know how optimal this is, but could you merge the two dataframes and then use loc to filter down the merged dataframe?
df = Mapfile.merge(Armsfile)
df.loc[(df['Position'] > df['Start']) & (df['Position'] <= df['End'])].drop(['Start', 'End'], axis=1)
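For anyone who wants to run the merge-then-filter idea end to end, here is a minimal sketch built from the sample tables in the question. The lowercase variable names and the explicit on="Chr" are my own choices for illustration (merge would pick the shared Chr column by default anyway), and I have not timed this on 800K rows.

```python
import pandas as pd

# Sample data copied from the question's tables.
mapfile = pd.DataFrame({
    "Name": ["S1", "S2", "S3"],
    "Chr": [1, 1, 1],
    "Position": [3000, 6000, 1000],
})
armsfile = pd.DataFrame({
    "Chr": [1, 1],
    "Arm": ["p", "q"],
    "Start": [0, 5001],
    "End": [5000, 10000],
})

# Pair every map row with every arm on the same chromosome.
merged = mapfile.merge(armsfile, on="Chr")

# Keep only the pairing whose interval contains the position
# (same bounds as the snippet above: Start < Position <= End).
in_arm = (merged["Position"] > merged["Start"]) & (merged["Position"] <= merged["End"])
result = merged.loc[in_arm].drop(columns=["Start", "End"]).reset_index(drop=True)
print(result)
```

The intermediate merged frame has one row per (map row, arm on that chromosome) pair, so it is larger than Mapfile before the filter trims it back down. With strict > on Start, a Position exactly equal to a Start value would not match; switch to >= if that matters for your data.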
Note: I wasn't sure how to handle the Armsfile Chr because your Mapfile and Armsfile both had Chr of 1 in your example.
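If the merged frame turns out to be too big, another option (a sketch, not something I have benchmarked) is to loop over the small Armsfile instead of the large Mapfile and fill the Arm column with a boolean mask per arm. This is close in spirit to the R loop in your edit. The chromosome 2 rows below are made-up values, added only to show that the Chr comparison keeps chromosomes apart.

```python
import pandas as pd

# Chr 1 rows come from the question; the Chr 2 rows are hypothetical.
mapfile = pd.DataFrame({
    "Name": ["S1", "S2", "S3", "S4"],
    "Chr": [1, 1, 1, 2],
    "Position": [3000, 6000, 1000, 3000],
})
armsfile = pd.DataFrame({
    "Chr": [1, 1, 2, 2],
    "Arm": ["p", "q", "p", "q"],
    "Start": [0, 5001, 0, 2001],
    "End": [5000, 10000, 2000, 8000],
})

# Rows that fall in no interval keep None.
mapfile["Arm"] = None

# One vectorized comparison per arm row (39 iterations for the real Armsfile).
for arm in armsfile.itertuples(index=False):
    mask = (
        (mapfile["Chr"] == arm.Chr)
        & (mapfile["Position"] > arm.Start)
        & (mapfile["Position"] <= arm.End)
    )
    mapfile.loc[mask, "Arm"] = arm.Arm

print(mapfile)
```

Each pass assigns the single arm label of the current row to the matching map rows, so nothing larger than Mapfile itself is ever built.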