With an inner join, the number of rows cannot be higher than in the 1st data frame. But why does R output a higher number of rows?
I have measured values in the 1st data frame and would like to supplement these data frames with unique information from a second data frame. For example: the 1st data frame has 368000 lines, the second data frame has 19870 unique values.
Data <- tidyft::left_join(data_measurement_document, measurement_points, by = "T1")
But this returns 968497 values! That cannot be. I only have 368000 measured values!
From my point of view, there must not be more lines than in the first data frame.
Data <- tidyft::inner_join(data_measurement_document, measurement_points, by = "T1")
How can I get the various join functions in R to generate only the maximum number of lines of the first data frame?
Note:
Check if there are duplicate values of variable "T1" in your data.
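One way to run that check is sketched below in base R. The `lookup` data frame and its `info` column are made up for illustration and stand in for the second data frame; only the key column name `T1` comes from the question.

```r
# Hypothetical lookup table standing in for the second data frame;
# the key 115 appears twice on purpose.
lookup <- data.frame(T1 = c(115, 116, 115, 150),
                     info = c("X1", "X2", "X1b", "X3"),
                     stringsAsFactors = FALSE)

# anyDuplicated() returns 0 when every key is unique, otherwise the
# index of the first duplicated entry.
has_dupes <- anyDuplicated(lookup$T1) > 0

# Count how often each key occurs and keep the keys seen more than once.
key_counts <- table(lookup$T1)
dupe_keys <- names(key_counts[key_counts > 1])

print(has_dupes)
print(dupe_keys)
```

If `has_dupes` is `TRUE`, a join on `T1` can return more rows than the first data frame has.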
If there are duplicate values in the second table in the column you are joining by, then this behaviour is expected. `left_join` joins each occurrence in the first table with every matching occurrence in the second table. Consider the following example. (I'm using `left_join` from `dplyr` instead of `tidyft`, but I assume the functions behave similarly.)
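library(dplyr)  # provides left_join()
# data1: three rows, one per id
data1 <- data.frame(id = 1:3,
                    value1 = paste0('data1_value', 1:3),
                    stringsAsFactors = FALSE)
# data2: two rows that share the same id
data2 <- data.frame(id = c(1, 1),
                    value2 = paste0('data2_value', 1:2),
                    stringsAsFactors = FALSE)
# Left join on the shared key column
data3 <- left_join(data1, data2, by = "id")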
Then `data1` has 3 rows:
id | value1 |
---|---|
1 | data1_value1 |
2 | data1_value2 |
3 | data1_value3 |
`data2` has 2 rows, but the `id` value is duplicated:
id | value2 |
---|---|
1 | data2_value1 |
1 | data2_value2 |
And `data3`, the left-joined data, has 4 rows:
id | value1 | value2 |
---|---|---|
1 | data1_value1 | data2_value1 |
1 | data1_value1 | data2_value2 |
2 | data1_value2 | NA |
3 | data1_value3 | NA |
There are 4 rows because the row with id=1 in data1 is joined with both rows in data2 that have id=1.
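If the goal is to never get more rows than the first table has, one option is to reduce the second table to a single row per key before joining. The sketch below reuses the example data and keeps the first row for each `id`. That choice is an assumption made for illustration; with real data you would need to decide which of the duplicated rows is the right one.

```r
library(dplyr)

data1 <- data.frame(id = 1:3,
                    value1 = paste0("data1_value", 1:3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id = c(1, 1),
                    value2 = paste0("data2_value", 1:2),
                    stringsAsFactors = FALSE)

# Keep only the first row for each id (an assumption; pick whichever
# row is correct for your data).
data2_unique <- distinct(data2, id, .keep_all = TRUE)

# Every row of data1 now matches at most one row of the lookup table.
data3 <- left_join(data1, data2_unique, by = "id")

nrow(data3)
```

After deduplication, the joined result has the same number of rows as `data1`.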
I have used your sample data and `inner_join` in the code below, but it does not produce more rows than there were in the first table:
data1 <- structure(list(T1 = c(115, 160, 150, 115, 116, 150),
Value.1 = c("A", "B", "C", "D", "E", "F")),
class = "data.frame", row.names = c(NA, -6L))
data2 <- structure(list(T1 = c(115, 116, 150, 160),
Value.2 = c("X1", "X2", "X3", "X4")),
class = "data.frame", row.names = c(NA, -4L))
data3 <- inner_join(data1,data2,by="T1")
The result (data3) is below:
T1 | Value.1 | Value.2 |
---|---|---|
115 | A | X1 |
160 | B | X4 |
150 | C | X3 |
115 | D | X1 |
116 | E | X2 |
150 | F | X3 |
This is the same number of rows as the left data frame.
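To make the join fail loudly instead of silently adding rows, recent versions of `dplyr` let you state the expected relationship between the tables. The sketch below assumes dplyr 1.1.0 or later, where the join functions accept a `relationship` argument. I have not checked whether `tidyft` offers an equivalent. The `data2_dup` table and the fallback message are made up for illustration.

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the `relationship` argument

data1 <- data.frame(T1 = c(115, 160, 150, 115, 116, 150),
                    Value.1 = c("A", "B", "C", "D", "E", "F"))
data2 <- data.frame(T1 = c(115, 116, 150, 160),
                    Value.2 = c("X1", "X2", "X3", "X4"))

# Succeeds: each T1 in data1 matches at most one row in data2.
data3 <- inner_join(data1, data2, by = "T1", relationship = "many-to-one")

# Hypothetical lookup table with a duplicated key (115 appears twice).
data2_dup <- rbind(data2, data.frame(T1 = 115, Value.2 = "X5"))

# The same call now raises an error, which is caught here for demonstration.
result <- tryCatch(
  inner_join(data1, data2_dup, by = "T1", relationship = "many-to-one"),
  error = function(e) "join refused: duplicate keys in second table"
)
```

With unique keys in the second table the join goes through unchanged. With a duplicated key it stops with an error, so the unexpected row growth is caught at the join itself.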