简体   繁体   中英

How to import multi-file .xlsx data into single R dataframe without levels?

I'm trying to extract each company's data from some free NASDAQ 100 Twitter data sets available here . The ultimate goal, once constructed and curated, is to run some modeling experiments with the dataframe. The basic dataframe form I'm aiming for is:

ATVI  49.02   0.44   0.91   7193022   .3 
ADBE  119.91  0.31   0.26   1984225   .1 
AKAM  64.2    0.65   1.02   1336622   .1 
ALXN  126.55  0.86   0.67   2182253   .2
GOOG  838.68  3.31   0.4    1261517  1.0 
AMZN  853     2.5    0.29   2048187  1.0

For each company, there are six .xlsx files (unzipped into separate directories), each excel file with multiple worksheets inside. For right now, I'm just trying to extract the first worksheet from each of the six excels spreadsheets for each company. All of those worksheets have two columns with varying numbers of rows and the data labels are on different rows, eg file 1, company 1:

Keyword             $AAPL -
Total tweets        166631
Total audience      221363515
Contributors        42738
Original tweets     91614
Replies             4964
RTs                 70053
Images and links    43361

file 2, company 1:

Keyword                        $AAPL -
Total audience                 221363515
Contributors                   42738
Total tweets                   166631
Total potential impressions    1.250.920.501
Measured data from             2016-04-02 18:06
Measured data to               2016-06-15 12:23
Tweets per contributor         3,90
Impressions / Audience         5,65
Measured time in seconds       6373058
Measured time in minutes       106218
Measured time in hours         1770
Measured time in days          74
Tweets per second              0.026146161
Tweets per minute              1.568769655
Tweets per hour                94.1261793
Tweets per day                 2259.028303

I'm trying to implement readxl as suggested in this post and then put each companies data into a row of a dataframe [below]. Right now, I'm setting the first path as my directory and then running the code, then setting the second path and running it again to add the new row (I know this isn't optimal, see below).

library(readxl)

#create empty dataframe to assemble all the rows
cdf <- data.frame()

#setwd('...\\NASDAQ_100\\aal_2016_06_15_12_01_41')
#setwd('...\\NASDAQ_100\\aapl_2016_06_15_14_30_09')

#constructing list of all .xlsx files in current directory
file.list <- list.files(pattern='*.xlsx')

#using read_excel function to read each file in list and put in a dataframe of lists 
df.list <- lapply(file.list, read_excel)

#converting the dataframe of lists to a 77x2 dataframe
df <- as.data.frame(do.call(rbind, df.list),stringsAsFactors=FALSE)

#transposing the dataframe to prepare to stack multiple companies data in single dataframe
df <- t(df)

#making sure that the dataframe entry values are numeric
df <- transform(df,as.numeric)

#appending the 2nd row with the actual data into the dataframe that will have all companies' data
cdf <- rbind(cdf,df[2,])

Sample output:

> cdf[,1:8]
            X1        X2    X3    X4   X5    X6    X7        X8
$AAL      6507  14432722  1645  5211  459   837   938  14432722
$AAPL - 166631 221363515 42738 91614 4964 70053 43361 221363515

Upon inspection, I'm finding that there are levels in my columns which I've gathered from various other posts is because of how I imported the data and is why I tried adding stringsAsFactors=FALSE to the as.data.frame , but clearly that's not the solution:

> cdf[,2]
     $AAL   $AAPL - 
 14432722 221363515 
Levels: 14432722 Total audience 221363515

According to the documentation, that's not an argument for read_excel . Is there a way to still use it, but avoid these levels?

Once I've sorted this out, I'm hoping to get this in a basic for loop to go through all the unzipped sub-directories:

dir.list <- list.dirs(recursive = F)

for (subdir in dir.list) {

  file.list <- list.files(pattern='*.xlsx')

  df.list <- lapply(file.list, read_excel)

  df <- as.data.frame(do.call(rbind, df.list),stringsAsFactors=FALSE)

  df <- t(df)

  df <- transform(df,as.numeric)

  cdf <- rbind(cdf,df[2,])

}

But this yields > cdf data frame with 0 columns and 0 rows ? I know that none of the code is elegant or compact (& that rbind is ill-advised in for loops), but it's what I've been able to piece together. I'm very receptive to style corrections and alternate methods, but it'd be very much appreciated if their context were explained within the overall problem/solution described here (ie: not just, "use package xyz" or "read ldply()'s documentation").

Thanks,

I imagine that your df.list contains data.frames with factors rather than strings, and that may be what causes problems in the subsequent rbind. Can you try:

df.list <- lapply(file.list, function(x) {
             as.data.frame(read_excel(x), stringsAsFactors=FALSE)
           })

That way the data.frames in df.list should not contain factors.

The data in the .xlsx files appear to be stored in a key (column 1) and value (column 2) structure. I would use readxl and data.table to read the data and initially store it in the long key/value format (with a third column indicating company). Then I would cast ( dcast ) the long format to a wide format so each key gets it's own column:

library(readxl)
library(data.table)

# Get list of files
file.list <- list.files(path = ".", pattern = "*.xlsx")

# Iterate over files
dt_list <- lapply(seq_along(file.list), function(x) {
  # Read sheet 1 as data.table
  dt <- data.table(read_excel(file.list[x], sheet = 1))
  # Get company based on name of second column
  company <- gsub(colnames(dt)[2], pattern = "[^A-Z]*", replacement = "")
  # Set company and file_name (optional for debugging)
  dt[, ":="(company = company, file_name = file.list[x])]
  setnames(dt, c("key", "value", "company", "file_name"))
  dt
})
dt <- rbindlist(dt_list, use.names = TRUE)

# Get rid of file_name and remove duplicates
dt[, file_name := NULL]
dt <- unique(dt)

# Optional filtering on key
# dt <- dt[key %in% c("Total tweets", "Total audience")]

# Use dcast to make wide format table with one row per company
dt_wide <- dcast(dt, formula = company~key)

The contents of dt_wide (with AAPL and ATVI):

    company Average contributor followers Average contributor following Contributor followers median ...
 1:    AAPL                       5197,58                        832,06                       141,00 ...
 2:    ATVI                       9769,01                       1389,17                       562,00 ...

You can convert dt_wide to a standard data.frame using df <- as.data.frame(dt_wide)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM