
Creating a corpus from multiple HTML text files

I have a list of HTML files: I took some texts from the web and read them with read_html.

My file names are like:

a1 <- read_html(link of the text) 
a2 <- read_html(link of the text) 
.
.
. ## until:
a100 <- read_html(link of the text)

I am trying to create a corpus with these.

Any ideas how I can do it?

Thanks.

You could preallocate a list beforehand (read_html() returns an xml_document, so an atomic vector can't hold the results):

text <- vector("list", 100)
text[[1]] <- read_html(link1)
...
text[[100]] <- read_html(link100)

Even better, organize your links as a vector. Then, as suggested in the comments, you can use lapply:

text <- lapply(links, read_html)

(where links is a character vector of the URLs).
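
For instance, the vector could be built like this (the URLs here are placeholders):

# hypothetical URLs; replace these with your actual links
links <- c(
  "https://example.com/text1",
  "https://example.com/text2"
  # ..., up to the 100th link
)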

It would be rather bad coding style to use assign :

# not a good idea
for (i in 1:100) assign(paste0("text", i), read_html(get(paste0("link", i))))

since this is slow and leaves you with 100 separate objects that are hard to process further.
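
Once text holds the parsed documents, you still need plain text before you can build a corpus. Here is a minimal sketch assuming the tm package; extracting the <body> node with html_text2() is just one illustrative way to pull out the text:

library(rvest)
library(tm)

# extract plain text from each parsed page (the "body" selector is an assumption)
plain <- vapply(text, function(doc) html_text2(html_element(doc, "body")),
                character(1))

# build a tm corpus from the character vector
corpus <- VCorpus(VectorSource(plain))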

I would suggest using purrr for this:

library(tidyverse)
library(purrr)
library(rvest)

files <- list.files("path/to/html_links", full.names = TRUE)

all_html <- tibble(file_path = files) %>% 
  mutate(filenames = basename(files)) %>% 
  mutate(text = map(file_path, read_html))

This is a nice way to keep track of which piece of text belongs to which file. It also makes sentiment analysis, or any other kind of document-level analysis, easy.
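
From there, a document-level corpus is one step away. Here is a sketch assuming the quanteda package; as above, html_text2() on the <body> node is an illustrative stand-in for your actual extraction step:

library(quanteda)

corp <- all_html %>% 
  mutate(text = map_chr(text, ~ html_text2(html_element(.x, "body")))) %>% 
  corpus(docid_field = "filenames", text_field = "text")  # assumes unique file names

summary(corp)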
