R inner join has more rows than in the first frame

Question

With an inner join, the number of rows should not be higher than in the first data frame. So why does R return more rows?

I have measured values in the first data frame and would like to supplement them with unique information from a second data frame. For example, the first data frame has 368000 rows and the second data frame has 19870 unique values.

Data <- tidyft::left_join(data_measurement_document, measurement_points, by = "T1")

But this returns 968497 rows! That cannot be right: I only have 368000 measured values.

As I understand it, the result must not have more rows than the first data frame.

Data <- tidyft::inner_join(data_measurement_document, measurement_points, by = "T1")

How can I get the join functions in R to return at most as many rows as the first data frame?

Note:

measurement_points: T1 contains only unique values; no duplicates occur.
data_measurement_document: the values of T1 are not unique.
The result must not contain more rows than data_measurement_document.

Example data set

Expected result

Answer 1

Check if there are duplicate values of variable "T1" in your data.
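One quick way to run that check is sketched below in base R. The data frames `measurements` and `lookup` are made-up stand-ins for your two tables, and the repeated key 160 in `lookup` is added purely for illustration.

```r
# Placeholder data; `measurements` and `lookup` are hypothetical names.
measurements <- data.frame(T1 = c(115, 160, 150, 115, 116, 150),
                           Value.1 = c("A", "B", "C", "D", "E", "F"))
lookup <- data.frame(T1 = c(115, 116, 150, 160, 160),
                     Value.2 = c("X1", "X2", "X3", "X4", "X5"))

# anyDuplicated() returns 0 when every key is unique,
# otherwise the position of the first repeated key.
first_repeat <- anyDuplicated(lookup$T1)

# List the key values that occur more than once in the lookup table.
repeated_keys <- unique(lookup$T1[duplicated(lookup$T1)])

print(first_repeat)
print(repeated_keys)
```

If `first_repeat` is 0 for the second table, duplicates in the key are not the cause and it is worth looking at the key's type or formatting instead.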

If the column you are joining by contains duplicate values in the second table, this behaviour is expected: left_join pairs each row of the first table with every matching row of the second table. Consider the following example. (I'm using left_join from dplyr instead of tidyft, but I assume the functions behave similarly.)

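library(dplyr)  # provides left_join()

# Left table: three rows with unique ids
data1 <- data.frame(id = 1:3,
                    value1 = paste0('data1_value', 1:3),
                    stringsAsFactors = FALSE)

# Right table: the id 1 appears twice
data2 <- data.frame(id = c(1, 1),
                    value2 = paste0('data2_value', 1:2),
                    stringsAsFactors = FALSE)

# Name the join key explicitly instead of relying on the default
data3 <- left_join(data1, data2, by = "id")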

Then data1 has 3 rows

id	value1
1	data1_value1
2	data1_value2
3	data1_value3

data2 has 2 rows, but the id value is duplicated

id	value2
1	data2_value1
1	data2_value2

And data3, the left-joined data, has 4 rows

id	value1	value2
1	data1_value1	data2_value1
1	data1_value1	data2_value2
2	data1_value2	NA
3	data1_value3	NA

because the row with id=1 in data1 is joined with both rows in data2 that have id=1.
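The row count can also be predicted before joining. The sketch below reuses the data1/data2 example in base R; the helper names `matches_per_row`, `expected_inner_rows` and `expected_left_rows` are my own and not part of any package.

```r
data1 <- data.frame(id = 1:3, value1 = paste0("data1_value", 1:3))
data2 <- data.frame(id = c(1, 1), value2 = paste0("data2_value", 1:2))

# For each row of data1, count how many rows of data2 share its id.
matches_per_row <- vapply(data1$id,
                          function(k) sum(data2$id == k),
                          integer(1))

# An inner join produces one output row per match.
expected_inner_rows <- sum(matches_per_row)

# A left join does the same, but keeps each unmatched row once (filled with NA).
expected_left_rows <- sum(pmax(matches_per_row, 1L))

print(expected_inner_rows)
print(expected_left_rows)
```

So the result only stays at or below the size of the first table when no row of the first table has more than one match in the second.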

Edit

I used your sample data with inner_join in the code below, but it does not produce more rows than there were in the first table.

library(dplyr)  # provides inner_join()

data1 <- structure(list(T1 = c(115, 160, 150, 115, 116, 150), 
                        Value.1 = c("A",  "B", "C", "D", "E", "F")), 
                   class = "data.frame", row.names = c(NA, -6L))

data2 <- structure(list(T1 = c(115, 116, 150, 160), 
                        Value.2 = c("X1", "X2", "X3", "X4")), 
                   class = "data.frame", row.names = c(NA, -4L))

data3 <- inner_join(data1,data2,by="T1")

The result (data3) is below:

T1	Value.1	Value.2
115	A	X1
160	B	X4
150	C	X3
115	D	X1
116	E	X2
150	F	X3

This is the same number of rows as the left data frame.
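If the second table in your real data does turn out to contain repeated keys, one possible way to keep the result at the size of the first table is to reduce the second table to one row per key before joining. This is only a sketch with dplyr: the names `measurements`, `lookup` and `lookup_unique` are hypothetical, the repeated key 160 is invented for illustration, and keeping the first row per key is an assumption that you would need to check against your data.

```r
library(dplyr)

measurements <- data.frame(T1 = c(115, 160, 150, 115, 116, 150),
                           Value.1 = c("A", "B", "C", "D", "E", "F"))
# Hypothetical lookup table in which the key 160 appears twice.
lookup <- data.frame(T1 = c(115, 116, 150, 160, 160),
                     Value.2 = c("X1", "X2", "X3", "X4", "X5"))

# Keep only the first row for each T1 (assumption: the first row is the one you want).
lookup_unique <- distinct(lookup, T1, .keep_all = TRUE)

joined <- left_join(measurements, lookup_unique, by = "T1")

# Stop with an error if the join ever produces extra rows.
stopifnot(nrow(joined) == nrow(measurements))
```

In dplyr 1.1.0 and later, the join functions also accept relationship = "many-to-one", which raises an error when a row of the first table matches more than one row of the second. I have not checked whether tidyft offers an equivalent.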

solution1
2 2022-05-28 20:54:57
