
Tidying up my data frame: moving columns to headers and data

I'm using a web scraper to pull some data from FinViz. The problem is that the resulting data frame is messy: the first column of each pair holds what I would ideally want as the headers, and the second column holds the corresponding data. Here's the output:

           data1   data2         data3  data4         data5      data6         data7   data8        data9          data10
1       Index S&P 500           P/E  36.13     EPS (ttm)       4.60   Insider Own   0.10% Shs Outstand           2.93B
2  Market Cap 487.15B   Forward P/E  25.65    EPS next Y       6.48 Insider Trans -86.95%    Shs Float           2.33B
3      Income  13.58B           PEG   1.36    EPS next Q       1.27      Inst Own  72.50%  Short Float           0.87%
4       Sales  33.17B           P/S  14.69    EPS this Y    170.20%    Inst Trans  -0.22%  Short Ratio            1.13
5     Book/sh   22.92           P/B   7.26    EPS next Y     21.63%           ROA  20.30% Target Price          192.62
6     Cash/sh   12.10           P/C  13.74   EPS next 5Y     26.57%           ROE  22.50%    52W Range 113.55 - 175.49
7    Dividend       -         P/FCF  34.05   EPS past 5Y     62.10%           ROI  17.10%     52W High          -5.23%
8  Dividend %       -   Quick Ratio  12.30 Sales past 5Y     49.40%  Gross Margin  86.60%      52W Low          46.47%
9   Employees   20658 Current Ratio  12.30     Sales Q/Q     44.80%  Oper. Margin  46.40%     RSI (14)           49.05
10 Optionable     Yes       Debt/Eq   0.00       EPS Q/Q     68.80% Profit Margin  40.90%   Rel Volume            0.70
11  Shortable     Yes    LT Debt/Eq   0.00      Earnings Jul 26 AMC        Payout   0.00%   Avg Volume          17.87M
12      Recom    1.70         SMA20 -1.84%         SMA50      2.85%        SMA200  17.52%       Volume      12,583,873

As you can see, data1 contains the categories and data2 contains the corresponding values.

Ideally I'd want it in this structure:

Index | Market Cap | Income | Sales | Book sh | ...
------------------------------------------------
S&P500 | 487.15B   | 13.58B | 33.17B | 22.92  |

So data1, data3, data5, data7, ... would all become headers, and data2, data4, data6, data8, ... would all sit in one row.

Could anyone provide any input? I'm trying to avoid compiling them into 2 different vectors then rbinding the frame together.

Cheerio!

You can try:

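library(data.table); library(dplyr)

# Cast each header/value pair of columns into a one-row wide table
table1 <- df[, 1:2] %>% as.data.table() %>% dcast(. ~ data1, value.var = "data2")
table2 <- df[, 3:4] %>% as.data.table() %>% dcast(. ~ data3, value.var = "data4")

# dcast adds a placeholder "." column on the left; drop it before binding
cbind(table1[, -1], table2[, -1])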

and so on for the rest
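
Repeating the cast by hand for every pair gets tedious with ten or more columns. Below is a minimal sketch of the same idea written as a loop over the header columns. It swaps data.table for base R so that it is self-contained, and the tiny `df` is a made-up sample using a few values from the question, so treat it as an illustration rather than a drop-in replacement.

```r
# Hypothetical sample: two header/value column pairs taken from the question
df <- data.frame(
  data1 = c("Index", "Market Cap"), data2 = c("S&P 500", "487.15B"),
  data3 = c("P/E", "Forward P/E"),  data4 = c("36.13", "25.65"),
  stringsAsFactors = FALSE
)

# Positions of the header columns (1, 3, 5, ...); assumes an even column count
header_cols <- seq(1, ncol(df), by = 2)

# For each pair, turn the value column into a one-row data frame
# and name its columns after the matching header column
pieces <- lapply(header_cols, function(i) {
  setNames(as.data.frame(t(df[[i + 1]]), stringsAsFactors = FALSE), df[[i]])
})

# Bind all one-row pieces side by side
wide <- do.call(cbind, pieces)
wide
```

Note that a header appearing in more than one pair would produce duplicate column names here, so duplicates need to be renamed first.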

Would this work?

data <- data.frame(data1 = letters[1:10], data2 = LETTERS[1:10],
                   data3 = letters[11:20], data4 = LETTERS[11:20],
                   stringsAsFactors = FALSE)
#    data1 data2 data3 data4
# 1      a     A     k     K
# 2      b     B     l     L
# 3      c     C     m     M
# 4      d     D     n     N
# 5      e     E     o     O
# 6      f     F     p     P
# 7      g     G     q     Q
# 8      h     H     r     R
# 9      i     I     s     S
# 10     j     J     t     T

# Odd columns hold the headers, even columns hold the values
is_odd <- as.logical(seq_along(data) %% 2)
output <- setNames(data.frame(t(unlist(data[!is_odd]))),
                   unlist(data[is_odd]))
#   a b c d e f g h i j k l m n o p q r s t
# 1 A B C D E F G H I J K L M N O P Q R S T

Here is a solution using some tidyverse packages and your dataset.

library(rvest) # for scraping the data
#> Loading required package: xml2
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(purrr, warn.conflicts = FALSE)

First, we get your data directly from your example URL.

tab <- read_html("http://finviz.com/quote.ashx?t=BA") %>%
  html_node("table.snapshot-table2") %>%
  html_table(header = FALSE) %>%
  as_tibble() # as_data_frame() is deprecated in favour of as_tibble()

tab
#> # A tibble: 12 x 12
#>            X1          X2            X3     X4            X5         X6
#>         <chr>       <chr>         <chr>  <chr>         <chr>      <chr>
#>  1      Index DJIA S&P500           P/E  20.77     EPS (ttm)      11.42
#>  2 Market Cap     141.89B   Forward P/E  22.14    EPS next Y      10.71
#>  3     Income       7.12B           PEG   1.13    EPS next Q       2.62
#>  4      Sales      90.90B           P/S   1.56    EPS this Y      2.30%
#>  5    Book/sh       -3.34           P/B      -    EPS next Y      7.28%
#>  6    Cash/sh       17.26           P/C  13.74   EPS next 5Y     18.36%
#>  7   Dividend        5.68         P/FCF  17.94   EPS past 5Y      7.40%
#>  8 Dividend %       2.39%   Quick Ratio   0.40 Sales past 5Y      6.60%
#>  9  Employees      150500 Current Ratio   1.20     Sales Q/Q     -8.10%
#> 10 Optionable         Yes       Debt/Eq      -       EPS Q/Q    885.50%
#> 11  Shortable         Yes    LT Debt/Eq      -      Earnings Jul 26 BMO
#> 12      Recom        2.20         SMA20 -0.16%         SMA50      8.14%
#> # ... with 6 more variables: X7 <chr>, X8 <chr>, X9 <chr>, X10 <chr>,
#> #   X11 <chr>, X12 <chr>

Since the headers are in every odd column and the data in every even column, we can build a tidy two-column tibble by row-binding each header/value pair. To do that, we generate the odd and even column indices. Then purrr::map2_dfr iterates over those two index vectors in parallel, applies a function to each pair and row-binds the results. The function selects the two columns of the table with [ ] and renames them with set_names.

col_num <- seq_len(ncol(tab))
even <- col_num[col_num %% 2 == 0]
odd <- setdiff(col_num, even)

tab2 <- map2_dfr(odd, even, ~ set_names(tab[, c(.x, .y)], c("header", "value")))
tab2
#> # A tibble: 72 x 2
#>        header       value
#>         <chr>       <chr>
#>  1      Index DJIA S&P500
#>  2 Market Cap     141.89B
#>  3     Income       7.12B
#>  4      Sales      90.90B
#>  5    Book/sh       -3.34
#>  6    Cash/sh       17.26
#>  7   Dividend        5.68
#>  8 Dividend %       2.39%
#>  9  Employees      150500
#> 10 Optionable         Yes
#> # ... with 62 more rows
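
For readers without purrr, the same stacking step can be sketched in base R. This is only an illustration on a made-up two-pair sample (`tab` below is a hypothetical stand-in for the scraped table), and it assumes the table has an even number of columns.

```r
# Hypothetical stand-in for the scraped table: two header/value column pairs
tab <- data.frame(
  X1 = c("Index", "Market Cap"), X2 = c("DJIA S&P500", "141.89B"),
  X3 = c("P/E", "Forward P/E"),  X4 = c("20.77", "22.14"),
  stringsAsFactors = FALSE
)

odd  <- seq(1, ncol(tab), by = 2)  # header columns
even <- odd + 1                    # value columns

# unlist() concatenates the selected columns in order, so the i-th header
# stays aligned with the i-th value
long <- data.frame(
  header = unlist(tab[odd],  use.names = FALSE),
  value  = unlist(tab[even], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```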

You now have a nice long two-column table with all your data. If you want the table in wide format instead of long format, you have to transpose it. But first we have to deal with duplicated names in the header column, because the wide table cannot have duplicate column names.

tab2 %>%
  filter(header %in% header[duplicated(header)]) # %in% also works with several duplicated names
#> # A tibble: 2 x 2
#>       header value
#>        <chr> <chr>
#> 1 EPS next Y 10.71
#> 2 EPS next Y 7.28%

We just rename the second occurrence by appending _2.

tab3 <- tab2 %>%
  mutate(header = case_when(
    duplicated(header) ~ paste(header, 2, sep =  "_"),
    TRUE ~ header)
  )
# No more duplicates
any(duplicated(tab3$header))
#> [1] FALSE
tab3 %>% filter(stringr::str_detect(header, "EPS next Y"))
#> # A tibble: 2 x 2
#>         header value
#>          <chr> <chr>
#> 1   EPS next Y 10.71
#> 2 EPS next Y_2 7.28%
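
The `_2` suffix is enough here because each duplicated header occurs exactly twice. If a header could occur three or more times, every repeat would get the same `_2` suffix and the names would still clash. A possible generalisation, sketched in base R on a made-up `header` vector, numbers each occurrence within its group:

```r
# Hypothetical header vector with one name occurring three times
header <- c("EPS next Y", "Sales", "EPS next Y", "EPS next Y")

# Running count of each header: 1 for the first occurrence, 2 for the second, ...
occurrence <- ave(seq_along(header), header, FUN = seq_along)

# Keep the first occurrence unchanged, suffix the later ones with their count
unique_header <- ifelse(occurrence > 1,
                        paste(header, occurrence, sep = "_"),
                        header)
unique_header
```

Base R's `make.unique()` does something similar, but it numbers the repeats starting from 1 rather than 2.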

You can now switch to wide format and get 72 columns instead of 72 rows.

tab3 %>%
  spread(header, value)
#> # A tibble: 1 x 72
#>   `52W High` `52W Low`     `52W Range`   ATR `Avg Volume`  Beta `Book/sh`
#> *      <chr>     <chr>           <chr> <chr>        <chr> <chr>     <chr>
#> 1     -3.78%    87.78% 126.31 - 246.49  3.77        3.46M  1.18     -3.34
#> # ... with 65 more variables: `Cash/sh` <chr>, Change <chr>, `Current
#> #   Ratio` <chr>, `Debt/Eq` <chr>, Dividend <chr>, `Dividend %` <chr>,
#> #   Earnings <chr>, Employees <chr>, `EPS (ttm)` <chr>, `EPS next
#> #   5Y` <chr>, `EPS next Q` <chr>, `EPS next Y` <chr>, `EPS next
#> #   Y_2` <chr>, `EPS past 5Y` <chr>, `EPS Q/Q` <chr>, `EPS this Y` <chr>,
#> #   `Forward P/E` <chr>, `Gross Margin` <chr>, Income <chr>, Index <chr>,
#> #   `Insider Own` <chr>, `Insider Trans` <chr>, `Inst Own` <chr>, `Inst
#> #   Trans` <chr>, `LT Debt/Eq` <chr>, `Market Cap` <chr>, `Oper.
#> #   Margin` <chr>, Optionable <chr>, `P/B` <chr>, `P/C` <chr>,
#> #   `P/E` <chr>, `P/FCF` <chr>, `P/S` <chr>, Payout <chr>, PEG <chr>,
#> #   `Perf Half Y` <chr>, `Perf Month` <chr>, `Perf Quarter` <chr>, `Perf
#> #   Week` <chr>, `Perf Year` <chr>, `Perf YTD` <chr>, `Prev Close` <chr>,
#> #   Price <chr>, `Profit Margin` <chr>, `Quick Ratio` <chr>, Recom <chr>,
#> #   `Rel Volume` <chr>, ROA <chr>, ROE <chr>, ROI <chr>, `RSI (14)` <chr>,
#> #   Sales <chr>, `Sales past 5Y` <chr>, `Sales Q/Q` <chr>, `Short
#> #   Float` <chr>, `Short Ratio` <chr>, Shortable <chr>, `Shs Float` <chr>,
#> #   `Shs Outstand` <chr>, SMA20 <chr>, SMA200 <chr>, SMA50 <chr>, `Target
#> #   Price` <chr>, Volatility <chr>, Volume <chr>

Idea: you can also replace all the spaces with _ in the header column to get column names without spaces. They are often simpler to handle.
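
A minimal sketch of that renaming, using base R's `gsub()` on a few header names from the table (the `header` vector here is just a sample):

```r
# Sample of header names from the scraped table
header <- c("Market Cap", "EPS next Y", "52W High", "P/E")

# fixed = TRUE treats the pattern as a literal space, not a regular expression
clean_header <- gsub(" ", "_", header, fixed = TRUE)
clean_header
```

Names such as `52W_High` and `P/E` are still not syntactic R names, so they will keep needing backticks unless you clean them further.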
