
xmlTreeParse and HTML content

I can't get (web scrape) the HTML tree content of an ordinary product page with the R function xmlTreeParse.

I have loaded the RCurl and XML libraries.

myurln3<-"www.amazon.com/s?k=router+hand+plane+cheap&i=arts-crafts-intl-ship&ref=nb_sb_noss"
html_page<-xmlTreeParse(myurln3, useInternalNodes = TRUE)

Error: XML content does not seem to be XML: 'www.amazon.com/s?k=router+hand+plane+cheap&i=arts-crafts-intl-ship&ref=nb_sb_noss'

I expect to scrape the page and get the full HTML structure.
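The error happens because xmlTreeParse does not download HTTPS pages itself; it treats the string as literal XML text (and the URL above is also missing its https:// scheme). A minimal sketch, assuming the page must be fetched first with RCurl and then parsed with the forgiving HTML parser rather than the strict XML one:

```r
library(RCurl)
library(XML)

# Full URL, including the https:// scheme
my_url <- "https://www.amazon.com/s?k=router+hand+plane+cheap&i=arts-crafts-intl-ship&ref=nb_sb_noss"

# Download the raw HTML first; xmlTreeParse/htmlTreeParse cannot
# fetch https:// URLs on their own
raw_html <- getURL(my_url, followlocation = TRUE)

# Parse the downloaded text as HTML (asText = TRUE means "this is
# the document itself, not a filename or URL")
html_page <- htmlParse(raw_html, asText = TRUE)
```

Note that Amazon actively blocks scrapers, so the response may be a CAPTCHA/robot-check page rather than the product listing; this sketch only addresses the parsing error itself.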

After some other projects I'm back to web scraping with R, and still having problems.

> library(XML)

Warning message:
package 'XML' was built under R version 3.5.3 

> my_url99 <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"
> html_page99 <- htmlTreeParse(my_url99, useInternalNode=TRUE)

Warning message:
XML content does not seem to be XML: 'https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2' 

> head(html_page99)

Error in `[.XMLInternalDocument`(x, seq_len(n)) : 
  No method for subsetting an XMLInternalDocument with integer

> html_page99

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>https://www.amazon.com/s?k=Dell+laptop+windows+10&amp;ref=nb_sb_noss_2</p></body></html>

But I need to scrape the above page with its full content - I mean the content with the $ signs on the left (the prices; maybe that's not the best description) and all the tags.
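Once the page is actually downloaded and parsed, specific pieces such as prices can be pulled out with XPath. A hedged sketch: the span class `a-offscreen` used below is an assumption based on Amazon's current search-result markup and may change at any time, and Amazon may serve a robot-check page instead of results:

```r
library(RCurl)
library(XML)

my_url <- "https://www.amazon.com/s?k=Dell+laptop+windows+10&ref=nb_sb_noss_2"

# A browser-like User-Agent makes a robot-check response less likely
ua <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
raw_html <- getURL(my_url,
                   httpheader = c("User-Agent" = ua),
                   followlocation = TRUE)

# Parse the fetched text as HTML
doc <- htmlParse(raw_html, asText = TRUE)

# Extract price text via XPath; the class name is an assumption
prices <- xpathSApply(doc, "//span[@class='a-offscreen']", xmlValue)
head(prices)
```

If nothing is returned, inspect `raw_html` first - an empty result usually means Amazon served a different page (robot check) or the markup uses different class names than assumed here.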
