简体   繁体   中英

scrape data from website and get its html in plain text

Please check the code below. I am trying to scrape website using proxy and it's working now. The problem is in print_r data displaying in non-readable format. I need to make it "normal" html source code. How can I do it?

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://www.amazon.com');
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, 1);
curl_setopt($ch, CURLOPT_PROXY, '142.234.203.59:12345');
curl_setopt($ch, CURLOPT_PROXYUSERPWD, 'haris20202:veryfastplease123');
$data = curl_exec($ch);
curl_close($ch);

print_r($data);

Using a slightly more fully featured curl function that the one above the response looks good BUT it includes a Robot Check

* Rebuilt URL to: https://www.amazon.com/
*   Trying 142.234.203.59...
* TCP_NODELAY set
* Connected to 142.234.203.59 (142.234.203.59) port 12345 (#0)
* allocate connect buffer!
* Establish HTTP proxy tunnel to www.amazon.com:443
* Proxy auth using Basic with user 'haris20202'
> CONNECT www.amazon.com:443 HTTP/1.1
Host: www.amazon.com:443
Proxy-Authorization: Basic aGFyaXMyMDIwMjp2ZXJ5ZmFzdHBsZWFzZTEyMw==
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Proxy-Connection: Keep-Alive

< HTTP/1.1 200 Connection established
< 
* Proxy replied 200 to CONNECT request
* CONNECT phase completed!
* ALPN, offering http/1.1
* successfully set certificate verify locations:
  CAfile: c:/wwwroot/cacert.pem
  CApath: none
* CONNECT phase completed!
* CONNECT phase completed!
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: C=US; ST=Washington; L=Seattle; O=Amazon.com, Inc.; CN=www.amazon.com
*  start date: Sep 18 00:00:00 2019 GMT
*  expire date: Aug 23 12:00:00 2020 GMT
*  subjectAltName: host "www.amazon.com" matched cert's "www.amazon.com"
*  issuer: C=US; O=DigiCert Inc; CN=DigiCert Global CA G2
*  SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.amazon.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Accept: */*
Accept-Encoding: deflate, gzip

< HTTP/1.1 200 OK
< Content-Type: text/html
< Content-Length: 2097
< Connection: keep-alive
< Server: Server
< Date: Tue, 26 Nov 2019 10:14:10 GMT
< Vary: Content-Type,Cookie,Referer,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent
< Content-Encoding: gzip
< x-amz-rid: DTAY61T1CN3HGSADJG16
< Edge-Control: no-store
< X-Cache: Miss from cloudfront
< Via: 1.1 274469ea4a9ada6e05630e17982ca5de.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: PHL50
< X-Amz-Cf-Id: R3hAZb_0qdQYB25p3WwZ5D-wK_1ujzleVSOS7EZo_zsTyMx9oYU6CA==
< 
* Connection #0 to host 142.234.203.59 left intact

亚马逊机器人检查-我猜他们不喜欢刮...

Amazon have an API - have you considered using that? Amazon for Developers

Include header('Content-Type: application/json'); in your file to get response in string type

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM