
Creating a corpus from multiple HTML text files

I have a list of HTML files: I took some texts from the web and read them with read_html.

My file names are like:

a1 <- read_html(link of the text) 
a2 <- read_html(link of the text) 
.
.
. ## until:
a100 <- read_html(link of the text)

I am trying to create a corpus with these.

Any ideas how I can do it?

Thanks.

You could preallocate a list beforehand (read_html() returns an xml_document, so an atomic vector can't hold the results):

text <- vector("list", 100)
text[[1]] <- read_html(link1)
...
text[[100]] <- read_html(link100)

Even better, organize your links as a vector. Then, as suggested in the comments, you can use lapply:

text <- lapply(links, read_html)

(where links is a character vector of the URLs).
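
For instance, the vector could be built like this (the URLs here are placeholders):

# hypothetical URLs; replace these with your actual links
links <- c(
  "https://example.com/text1",
  "https://example.com/text2"
  # ..., up to the 100th link
)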

It would be rather bad coding style to use assign :

# not a good idea
for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))

since this is slow and leaves you with 100 separate objects that are hard to process further.
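
Once text holds the parsed documents, you still need plain text before you can build a corpus. Here is a minimal sketch assuming the tm package; extracting the <body> node with html_text2() is just one illustrative way to pull out the text:

library(rvest)
library(tm)

# extract plain text from each parsed page (the "body" selector is an assumption)
plain <- vapply(text, function(doc) html_text2(html_element(doc, "body")),
                character(1))

# build a tm corpus from the character vector
corpus <- VCorpus(VectorSource(plain))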

I would suggest using purrr for this:

library(tidyverse)
library(purrr)
library(rvest)

files <- list.files("path/to/html_links", full.names = TRUE)

all_html <- tibble(file_path = files) %>% 
  mutate(filenames = basename(files)) %>% 
  mutate(text = map(file_path, read_html))

This is a nice way to keep track of which piece of text belongs to which file. It also makes sentiment analysis, or any other kind of document-level analysis, easy.
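
From there, a document-level corpus is one step away. Here is a sketch assuming the quanteda package; as above, html_text2() on the <body> node is an illustrative stand-in for your actual extraction step:

library(quanteda)

corp <- all_html %>% 
  mutate(text = map_chr(text, ~ html_text2(html_element(.x, "body")))) %>% 
  corpus(docid_field = "filenames", text_field = "text")  # assumes unique file names

summary(corp)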
