Join CSVs with no headers without losing any columns/rows in R

I would like to join two tables in R that have no headers. The only thing they have in common is the first column, which always holds the IDs. The tables do not have the same number of columns or rows.

I want to join this table with no header

+-------+--------+--------+
| 80938 | James  | Nov-00 |
+-------+--------+--------+
| 78397 | Tom    | Jul-20 |
+-------+--------+--------+
| 73820 | Pan    | Sep-10 |
+-------+--------+--------+
| 64920 | Kim    | Nov-01 |
+-------+--------+--------+
| 83915 | Amanda | Jan-03 |
+-------+--------+--------+
| 83649 | Linda  | Jul-07 |
+-------+--------+--------+

and this table with no header

+-------+---+--------+--------+--------+--------+
| 80938 | 1 | 500000 | 600000 | 700000 | 800000 |
+-------+---+--------+--------+--------+--------+
| 80938 | 2 | 333    | 456    | 567    | 467    |
+-------+---+--------+--------+--------+--------+
| 80938 | 3 | 444    | 456    | 399    | 799    |
+-------+---+--------+--------+--------+--------+
| 80938 | 4 | 20000  | 4000   | 3222   | 3456   |
+-------+---+--------+--------+--------+--------+
| 80938 | 5 | 21305  | 23456  | 3567   | 8533   |
+-------+---+--------+--------+--------+--------+
| 80938 | 6 | 345067 | 2455   | 23356  | 244567 |
+-------+---+--------+--------+--------+--------+

to the final combined table below.

+-------+--------+--------+---+--------+--------+--------+--------+
| 80938 | James  | Nov-00 | 1 | 500000 | 600000 | 700000 | 800000 |
+-------+--------+--------+---+--------+--------+--------+--------+
| 80938 | James  | Dec-00 | 2 | 333    | 456    | 567    | 467    |
+-------+--------+--------+---+--------+--------+--------+--------+
| 80938 | James  | Jan-01 | 3 | 444    | 456    | 399    | 799    |
+-------+--------+--------+---+--------+--------+--------+--------+
| 80938 | James  | Feb-01 | 4 | 20000  | 4000   | 3222   | 3456   |
+-------+--------+--------+---+--------+--------+--------+--------+
| 80938 | James  | Mar-01 | 5 | 21305  | 23456  | 3567   | 8533   |
+-------+--------+--------+---+--------+--------+--------+--------+
| 80938 | James  | Apr-01 | 6 | 345067 | 2455   | 23356  | 244567 |
+-------+--------+--------+---+--------+--------+--------+--------+
| 78397 | Tom    | 20-Jul | 1 | 4728   | 82920  | 39     | 323992 |
+-------+--------+--------+---+--------+--------+--------+--------+
| 78397 | Tom    | 21-Jul | 2 | 38120  | 3820   | 38292  | 2920   |
+-------+--------+--------+---+--------+--------+--------+--------+
| 78397 | Tom    | 22-Jul | 3 | 39302  | 238202 | 23920  | 2822   |
+-------+--------+--------+---+--------+--------+--------+--------+
| 78397 | Tom    | 23-Jul | 4 | 3920   | 28202  | 293    | 83920  |
+-------+--------+--------+---+--------+--------+--------+--------+
| 78397 | Tom    | 24-Jul | 5 | 3830   | 820230 | 9292   | 2929   |
+-------+--------+--------+---+--------+--------+--------+--------+
| 78397 | Tom    | 25-Jul | 6 | 12380  | 29202  | 2929   | 8292   |
+-------+--------+--------+---+--------+--------+--------+--------+
| 73820 | Pan    | 10-Sep |   |        |        |        |        |
+-------+--------+--------+---+--------+--------+--------+--------+
| 64920 | Kim    | 1-Nov  |   |        |        |        |        |
+-------+--------+--------+---+--------+--------+--------+--------+
| 83915 | Amanda | 3-Jan  |   |        |        |        |        |
+-------+--------+--------+---+--------+--------+--------+--------+
| 83649 | Linda  | 7-Jul  |   |        |        |        |        |
+-------+--------+--------+---+--------+--------+--------+--------+

I tried full_join and merge, but I keep getting an error message. (I read the files with read.csv, then converted them with data.frame so that I could join by the positional column name V1, and that did not work.)

The example in your question cannot produce the expected output: the second table only has rows that match James' ID and none that match Tom's ID. I'm therefore going to assume that your second table is incomplete relative to the expected output, and that your input data looks like this:

csv1 <- structure(list(V1 = c(80938L, 78397L, 73820L, 64920L, 83915L, 
83649L), V2 = c("James", "Tom", "Pan", "Kim", "Amanda", "Linda"
), V3 = c("Nov-00", "Jul-20", "Sep-10", "Nov-01", "Jan-03", "Jul-07"
)), class = "data.frame", row.names = c(NA, -6L))

csv1
#>      V1     V2     V3
#> 1 80938  James Nov-00
#> 2 78397    Tom Jul-20
#> 3 73820    Pan Sep-10
#> 4 64920    Kim Nov-01
#> 5 83915 Amanda Jan-03
#> 6 83649  Linda Jul-07

and

csv2 <- structure(list(V1 = c(80938L, 80938L, 80938L, 80938L, 80938L, 
80938L, 78397L, 78397L, 78397L, 78397L, 78397L, 78397L), V2 = c(1L, 
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), V3 = c(500000L, 
333L, 444L, 20000L, 21305L, 345067L, 4728L, 38120L, 39302L, 3920L, 
3830L, 12380L), V4 = c(600000L, 456L, 456L, 4000L, 23456L, 2455L, 
82920L, 3820L, 238202L, 28202L, 820230L, 29202L), V5 = c(700000L, 
567L, 399L, 3222L, 3567L, 23356L, 39L, 38292L, 23920L, 293L, 
9292L, 2929L), V6 = c(800000L, 467L, 799L, 3456L, 8533L, 244567L, 
323992L, 2920L, 2822L, 83920L, 2929L, 8292L)), row.names = c(NA, 
-12L), class = "data.frame")

csv2
#>       V1 V2     V3     V4     V5     V6
#> 1  80938  1 500000 600000 700000 800000
#> 2  80938  2    333    456    567    467
#> 3  80938  3    444    456    399    799
#> 4  80938  4  20000   4000   3222   3456
#> 5  80938  5  21305  23456   3567   8533
#> 6  80938  6 345067   2455  23356 244567
#> 7  78397  1   4728  82920     39 323992
#> 8  78397  2  38120   3820  38292   2920
#> 9  78397  3  39302 238202  23920   2822
#> 10 78397  4   3920  28202    293  83920
#> 11 78397  5   3830 820230   9292   2929
#> 12 78397  6  12380  29202   2929   8292
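
If the data live in CSV files rather than in `structure()` calls, a minimal sketch of reading them without headers could look like the following. The inline text and the object names `people` and `values` are assumptions made for illustration; with real files you would pass the file path instead of `text =`.

```r
# Inline text stands in for the two files. With real files, pass the path,
# e.g. read.csv("people.csv", header = FALSE) (a hypothetical file name).
people_txt <- "80938,James,Nov-00
78397,Tom,Jul-20
73820,Pan,Sep-10"
values_txt <- "80938,1,500000,600000
80938,2,333,456
78397,1,4728,82920"

# header = FALSE stops the first data row being used as column names;
# R then names the columns V1, V2, ... automatically.
people <- read.csv(text = people_txt, header = FALSE)
values <- read.csv(text = values_txt, header = FALSE)

names(people)  # "V1" "V2" "V3"
names(values)  # "V1" "V2" "V3" "V4"
```

Because both data frames get a column called V1, that name can be used directly as the join key.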

Creating the join is very straightforward: you want to left join csv2 onto csv1 like this:

library(dplyr)

csv1 %>% 
  left_join(csv2, by = "V1") 
#>       V1   V2.x   V3.x V2.y   V3.y     V4     V5     V6
#> 1  80938  James Nov-00    1 500000 600000 700000 800000
#> 2  80938  James Nov-00    2    333    456    567    467
#> 3  80938  James Nov-00    3    444    456    399    799
#> 4  80938  James Nov-00    4  20000   4000   3222   3456
#> 5  80938  James Nov-00    5  21305  23456   3567   8533
#> 6  80938  James Nov-00    6 345067   2455  23356 244567
#> 7  78397    Tom Jul-20    1   4728  82920     39 323992
#> 8  78397    Tom Jul-20    2  38120   3820  38292   2920
#> 9  78397    Tom Jul-20    3  39302 238202  23920   2822
#> 10 78397    Tom Jul-20    4   3920  28202    293  83920
#> 11 78397    Tom Jul-20    5   3830 820230   9292   2929
#> 12 78397    Tom Jul-20    6  12380  29202   2929   8292
#> 13 73820    Pan Sep-10   NA     NA     NA     NA     NA
#> 14 64920    Kim Nov-01   NA     NA     NA     NA     NA
#> 15 83915 Amanda Jan-03   NA     NA     NA     NA     NA
#> 16 83649  Linda Jul-07   NA     NA     NA     NA     NA
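
Since the question also mentions merge, here is a rough base R equivalent of the left join above. The small data frames `csv1_small` and `csv2_small` and the suffixes `.info` and `.value` are made up for this sketch; they are cut-down stand-ins for csv1 and csv2.

```r
# Cut-down stand-ins for csv1 and csv2 (hypothetical names and subset of rows)
csv1_small <- data.frame(V1 = c(80938L, 78397L, 73820L),
                         V2 = c("James", "Tom", "Pan"),
                         V3 = c("Nov-00", "Jul-20", "Sep-10"))
csv2_small <- data.frame(V1 = c(80938L, 80938L, 78397L),
                         V2 = c(1L, 2L, 1L),
                         V3 = c(500000L, 333L, 4728L))

# all.x = TRUE keeps every row of the first table, filling unmatched
# columns from the second table with NA (a left join).
# suffixes replaces the default .x/.y on the clashing column names.
merged <- merge(csv1_small, csv2_small, by = "V1", all.x = TRUE,
                suffixes = c(".info", ".value"))

names(merged)  # "V1" "V2.info" "V3.info" "V2.value" "V3.value"
nrow(merged)   # 4: two rows for 80938, one for 78397, one for 73820
```

Note that merge sorts the result by the key column by default, so the row order can differ from the left_join output; passing sort = FALSE turns the sorting off.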

However, it seems you would rather have blank cells than NA, in which case you need to convert the numeric columns to character and replace the NA values with empty strings:

csv1 %>% 
  left_join(csv2, by = "V1") %>%
  mutate_all(function(x) replace(as.character(x), is.na(x), ""))
#>       V1   V2.x   V3.x V2.y   V3.y     V4     V5     V6
#> 1  80938  James Nov-00    1 500000 600000 700000 800000
#> 2  80938  James Nov-00    2    333    456    567    467
#> 3  80938  James Nov-00    3    444    456    399    799
#> 4  80938  James Nov-00    4  20000   4000   3222   3456
#> 5  80938  James Nov-00    5  21305  23456   3567   8533
#> 6  80938  James Nov-00    6 345067   2455  23356 244567
#> 7  78397    Tom Jul-20    1   4728  82920     39 323992
#> 8  78397    Tom Jul-20    2  38120   3820  38292   2920
#> 9  78397    Tom Jul-20    3  39302 238202  23920   2822
#> 10 78397    Tom Jul-20    4   3920  28202    293  83920
#> 11 78397    Tom Jul-20    5   3830 820230   9292   2929
#> 12 78397    Tom Jul-20    6  12380  29202   2929   8292
#> 13 73820    Pan Sep-10                                 
#> 14 64920    Kim Nov-01                                 
#> 15 83915 Amanda Jan-03                                 
#> 16 83649  Linda Jul-07
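
If the blanks are only needed in the file you write out, one alternative is to leave the NA values in the data frame and blank them at export time, which keeps the numeric columns numeric in R. This is a sketch that assumes the result should again be a headerless CSV; `joined` is a hypothetical two-row stand-in for the joined table.

```r
# Hypothetical stand-in for the joined table, with one unmatched row
joined <- data.frame(V1 = c(80938L, 73820L),
                     V2 = c("James", "Pan"),
                     V4 = c(500000L, NA))

out_file <- tempfile(fileext = ".csv")

# na = "" writes missing values as empty fields;
# col.names = FALSE keeps the output headerless like the inputs.
write.table(joined, out_file, sep = ",", na = "",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

readLines(out_file)
# "80938,James,500000" "73820,Pan,"
```

write.table is used rather than write.csv because write.csv does not allow col.names to be changed.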

Note also that in your expected output the dates increment for the IDs with repeated entries. However, one appears to increment by months and the other by days, with no indication of how this pattern was decided or how it should be produced. I have therefore left the dates as they are, pending your advice.

Created on 2020-08-02 by the reprex package (v0.3.0)
