
Efficient way to find a pair of values in adjacent columns (Python/R/SQL)

I have a dataframe (df below) in pandas with several million rows and 20 columns.

Given a pair of values, I'm trying to find out whether they exist in adjacent columns of df.

For example, df looks like this:

[image of df]

Given the pair (a3, b2), the two values exist in adjacent columns (they do not have to be in the same row).

The pair (b2, a3) does not meet the condition: the second value must be in the column immediately to the right of the first.

This can be done with loops on a small dataset, but I have millions of records with 20 columns and many pairs to check. Is there a way to compute this efficiently? Thanks!

Here is a base R solution that defines a function isAdjacent:

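isAdjacent <- function(df,p) {
    # matrix of column numbers with the same shape as df
    colnum <- col(df)
    # column number of each value in p (assumes each value occurs exactly once in df);
    # TRUE when the second value sits one column to the right of the first
    diff(sapply(p,function(x) colnum[df==x],USE.NAMES = FALSE))==1
}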

where df is the data.frame and p is the pair of values.

Example

p1 <- c("b1","c2")
p2 <- rev(p1)
p3 <- c("a1","c3")

> isAdjacent(df,p1)
[1] TRUE

> isAdjacent(df,p2)
[1] FALSE

> isAdjacent(df,p3)
[1] FALSE

Data

> dput(df)
structure(list(A = c("a1", "a2", "a3", "a4"), B = c("b1", "b2", 
"b3", "b4"), C = c("c1", "c2", "c3", "c4"), D = c("d1", "d2", 
"d3", "d4"), E = c("e1", "e2", "e3", "e4"), F = c("f1", "f2",
"f3", "f4"), G = c("g1", "g2", "g3", "g4")), class = "data.frame", row.names = c(NA, 
-4L))
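
Since the question is about a pandas dataframe, the same check might be sketched in Python roughly as follows. This is an illustration and not part of the original answer: the function name is_adjacent is hypothetical, and the toy data is rebuilt to match the dput() output above. Unlike the R version, this sketch also tolerates values that occur more than once or not at all.

```python
import pandas as pd

# Toy data matching the dput() output above: columns A..G, values "a1".."g4".
df = pd.DataFrame({c.upper(): [f"{c}{i}" for i in range(1, 5)] for c in "abcdefg"})

def is_adjacent(df, left, right):
    """Hypothetical helper: True if `left` occurs in some column j
    and `right` occurs in column j + 1 (any rows)."""
    # One boolean flag per column: does the value appear anywhere in it?
    has_left = (df == left).any(axis=0).to_numpy()
    has_right = (df == right).any(axis=0).to_numpy()
    # Pair column j (left) with column j + 1 (right).
    return bool((has_left[:-1] & has_right[1:]).any())

print(is_adjacent(df, "b1", "c2"))
print(is_adjacent(df, "c2", "b1"))
print(is_adjacent(df, "a1", "c3"))
```

Each call scans the whole dataframe twice, so this form is only reasonable when there are few pairs to check.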

Large Data Example (Benchmarking)

df <- setNames(as.data.frame(sapply(letters[1:20], paste0, 1:1e6)), LETTERS[1:20])

p <- c("a1", "c3")
system.time({
    isAdjacent <- function(df, p) {
        colnum <- col(df)
        diff(sapply(p, function(x) colnum[df == x], USE.NAMES = FALSE)) == 1
    }
    isAdjacent(df, p)
})
#   user  system elapsed 
#   1.03    0.07    1.11

library(data.table)
system.time({
    DT <- data.table(VAL = unlist(df), COL = rep(1L:ncol(df), each = nrow(df)), key = "VAL")
    isadj <- function(left, right) {
        DT[.(left), .(COL = COL + 1L)][DT[.(right)], on = .(COL), nomatch = 0L, .N > 0L]
    }
    isadj(p[1], p[2])
})

#   user  system elapsed
#  35.79    1.91   36.24

Using df from ThomasIsCoding's post, here is an option using data.table in R:

library(data.table)
DT <- data.table(VAL=unlist(df), COL=rep(1L:ncol(df), each=nrow(df)), key="VAL")
isadj <- function(left, right) {
    DT[.(left), .(COL=COL+1L)][DT[.(right)], on=.(COL), nomatch=0L, .N > 0L]
}

isadj("a3", "b2")    
#[1] TRUE

isadj("b2", "a3")    
#[1] FALSE
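
The idea of building a lookup structure once and then querying it per pair can also be sketched in Python. This is an assumption-laden illustration rather than a translation of the data.table code: build_column_index and is_adjacent_indexed are hypothetical names, and a plain dict of sets stands in for the keyed table.

```python
from collections import defaultdict

import pandas as pd

def build_column_index(df):
    """Hypothetical helper: map each distinct value to the set of
    column positions in which it occurs. Built once, reused per pair."""
    index = defaultdict(set)
    for pos, name in enumerate(df.columns):
        for value in pd.unique(df[name]):
            index[value].add(pos)
    return index

def is_adjacent_indexed(index, left, right):
    """True if some column holding `left` is directly left of one holding `right`."""
    right_cols = index.get(right, ())
    return any(col + 1 in right_cols for col in index.get(left, ()))

# Toy data with the same shape as the example df (columns A..G, values "a1".."g4").
df = pd.DataFrame({c.upper(): [f"{c}{i}" for i in range(1, 5)] for c in "abcdefg"})
idx = build_column_index(df)

print(is_adjacent_indexed(idx, "a3", "b2"))
print(is_adjacent_indexed(idx, "b2", "a3"))
```

Building the index costs one pass over the data; after that, each pair is checked with a couple of dictionary lookups, which fits the case of many pairs against one large dataframe.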

