
R inner join returns more rows than the first data frame

With an inner join, the number of rows should not be higher than in the first data frame. So why does R return more rows?

I have measured values in the first data frame and would like to supplement this data frame with unique information from a second data frame. For example: the first data frame has 368000 rows, the second data frame has 19870 unique values.

Data <- tidyft::left_join(data_measurement_document, `measurement points`, by = "T1")

But this returns 968497 rows! That cannot be right. I only have 368000 measured values!

From my point of view, there must not be more rows than in the first data frame. The same happens with an inner join:

Data <- tidyft::inner_join(data_measurement_document, `measurement points`, by = "T1")

How can I get the various join functions in R to return at most as many rows as the first data frame has?

Note:

  • measurement points: the values in T1 are unique; no duplicates occur.
  • data_measurement_document: the values in T1 are not unique.
  • The result must not have more rows than data_measurement_document.

Example data set

Expected result

Check if there are duplicate values of variable "T1" in your data.
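One way to run that check is sketched below in base R. The `lookup` data frame is a hypothetical stand-in for the second table, with one key repeated on purpose; only `anyDuplicated()` and `duplicated()` are used, both of which are base R functions.

```r
# Hypothetical lookup table; T1 = 115 appears twice on purpose.
lookup <- data.frame(T1 = c(115, 116, 150, 115),
                     Value.2 = c("X1", "X2", "X3", "X5"),
                     stringsAsFactors = FALSE)

# anyDuplicated() returns 0 if every key is unique,
# otherwise the index of the first repeated key.
first_dup <- anyDuplicated(lookup$T1)

# Collect every row whose key occurs more than once,
# so the offending entries can be inspected side by side.
dup_keys <- unique(lookup$T1[duplicated(lookup$T1)])
dup_rows <- lookup[lookup$T1 %in% dup_keys, ]

print(first_dup)
print(dup_rows)
```

If `first_dup` is 0 for the real second table, the extra rows must come from somewhere else, for example from joining against a different object than intended.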

If there are duplicate values in the second table in the column you are joining by, then this behaviour is expected: left_join pairs each occurrence of a key in the first table with every matching occurrence in the second table. Consider the following example. (I'm using left_join from dplyr instead of tidyft, but I assume the functions behave similarly.)

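library(dplyr)

# First table: three rows, each id occurs once
data1 <- data.frame(id = 1:3,
                    value1 = paste0('data1_value', 1:3),
                    stringsAsFactors = FALSE)

# Second table: two rows that share the same id
data2 <- data.frame(id = c(1, 1),
                    value2 = paste0('data2_value', 1:2),
                    stringsAsFactors = FALSE)

# Name the join column explicitly instead of relying on the default
data3 <- left_join(data1, data2, by = "id")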

Then data1 has 3 rows

id value1
1 data1_value1
2 data1_value2
3 data1_value3

data2 has 2 rows, but the id value is duplicated

id value2
1 data2_value1
1 data2_value2

And data3, the left-joined data, has 4 rows

id value1 value2
1 data1_value1 data2_value1
1 data1_value1 data2_value2
2 data1_value2 NA
3 data1_value3 NA

because the row with id=1 in data1 gets joined with both rows in data2 that have id=1.
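The row count can also be predicted before joining. The sketch below, in base R, rebuilds the two small tables from the example and counts the matches for each row of the first table. The names `matches_per_row`, `left_rows` and `inner_rows` are introduced here for illustration only.

```r
data1 <- data.frame(id = 1:3,
                    value1 = paste0("data1_value", 1:3),
                    stringsAsFactors = FALSE)
data2 <- data.frame(id = c(1, 1),
                    value2 = paste0("data2_value", 1:2),
                    stringsAsFactors = FALSE)

# For every row of data1, count how many rows of data2 share its id.
matches_per_row <- vapply(data1$id,
                          function(k) sum(data2$id == k),
                          integer(1))

# A left join keeps unmatched rows once and repeats matched rows
# once per match; an inner join drops the unmatched rows.
left_rows  <- sum(pmax(matches_per_row, 1L))
inner_rows <- sum(matches_per_row)

# Cross-check against base R's merge(), which follows the same rule.
stopifnot(nrow(merge(data1, data2, by = "id", all.x = TRUE)) == left_rows)
stopifnot(nrow(merge(data1, data2, by = "id")) == inner_rows)
```

A join can therefore only exceed the row count of the first table when at least one entry of `matches_per_row` is greater than 1, that is, when a key is repeated in the second table.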

Edit

I have used your sample data and inner_join in the code below, but it does not produce more rows than there were in the first table.

data1 <- structure(list(T1 = c(115, 160, 150, 115, 116, 150), 
                        Value.1 = c("A",  "B", "C", "D", "E", "F")), 
                   class = "data.frame", row.names = c(NA, -6L))

data2 <- structure(list(T1 = c(115, 116, 150, 160), 
                        Value.2 = c("X1", "X2", "X3", "X4")), 
                   class = "data.frame", row.names = c(NA, -4L))

data3 <- inner_join(data1,data2,by="T1")

The result (data3) is below:

T1 Value.1 Value.2
115 A X1
160 B X4
150 C X3
115 D X1
116 E X2
150 F X3

This is the same number of rows as the left data frame.
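If the real second table does turn out to contain repeated T1 values, one possible way to cap the result at the size of the first table is to reduce the second table to one row per key before joining. The sketch below assumes dplyr and uses hypothetical data (`measurements`, and a `lookup` table in which 115 occurs twice). Keeping the first row per key is an arbitrary choice made for illustration; which row should survive depends on the data.

```r
library(dplyr)

measurements <- data.frame(T1 = c(115, 160, 150, 115, 116, 150),
                           Value.1 = c("A", "B", "C", "D", "E", "F"),
                           stringsAsFactors = FALSE)

# Hypothetical lookup table in which T1 = 115 occurs twice.
lookup <- data.frame(T1 = c(115, 116, 150, 160, 115),
                     Value.2 = c("X1", "X2", "X3", "X4", "X5"),
                     stringsAsFactors = FALSE)

# Keep only the first row for each T1 (assumption: any one row per key will do).
lookup_unique <- distinct(lookup, T1, .keep_all = TRUE)

joined <- left_join(measurements, lookup_unique, by = "T1")

# Guard: fail loudly if the join ever changes the number of rows.
stopifnot(nrow(joined) == nrow(measurements))
```

Recent dplyr versions (1.1.0 and later) also accept a `relationship = "many-to-one"` argument in the join functions, which raises an error when a row of the first table matches more than one row of the second. Whether tidyft offers an equivalent is not something this sketch assumes.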


 