
How can I extract web app content from html code?

I'm trying to gather data from CS:GO gambling sites in order to analyze it. I wrote a very short program that downloads the HTML of this website, but the output does not include the content of the web app, and that content is exactly what I need. I can see it in Chrome, so I assume there is a way to get it. Maybe the pictures help to show what I'm looking for:

HTML code; marked the line I want

import java.io.IOException;
import org.jsoup.Jsoup;

public class Main {

    public static void main(String[] args) {
        try {
            // Fetch the page and print the HTML exactly as the server returned it.
            String html = Jsoup.connect("https://www.wtfskins.com/crash").get().html();
            System.out.println(html);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This is what I get. I need the content that should appear inside <app-root>:

<body> <app-root> 
  loading... // That's the problem
 </app-root> 
 <script src="https://code.jquery.com/jquery-3.1.1.min.js" integrity="sha256-hVVnYaiADRTO2PzUGmuLJr8BLUSjGIZsDYGmIJLv2b8=" crossorigin="anonymous"></script> 
 <script src="https://cdnjs.cloudflare.com/ajax/libs/tether/1.4.0/js/tether.min.js" integrity="sha384-DztdAPBWPRXSA/3eYEEUWrWCy7G5KFbe8fFjk5JAIxUYHKkDx6Qin1DkWx51bBrb" crossorigin="anonymous"></script> 
 <script src="/assets/js/jquery-ui.min.js"></script> 
 <script src="/assets/js/bootstrap.js"></script> 
 <script src="/assets/js/sha3.js"></script> 
 <script src="/assets/js/sha256.js"></script> 
 <script type="text/javascript" src="inline.318b50c57b4eba3d437b.bundle.js"></script> 
 <script type="text/javascript" src="polyfills.2b75d68d2d6cb678fc8d.bundle.js"></script> 
 <script type="text/javascript" src="main.7932c68952979c366236.bundle.js"></script>  
</body>

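The data is loaded into the page after the initial DOM has been built. When you fetch a page with Jsoup, you only get the response to the initial HTML request; Jsoup does not execute JavaScript, so anything the scripts add later is missing.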

This image shows that the initial HTML request really returns an almost empty HTML structure:

(image: enter image description here)

If you check the Network tab of the browser's dev tools, you will see that after the initial load there are additional XHR requests that fetch the data. The ngcontent attributes on the tags indicate that the page is rendered with Angular, which is a JavaScript framework.
Rendering on the client like this keeps the initial page load small and, as a side effect, makes simple scraping somewhat harder.

AFTER CHECKING

The Network tab shows several requests after the page load that return JSON responses. Look at those and work out which request headers are required in order to send the same requests yourself. As the image shows, one of the interesting ones is: https://www.wtfskins.com/api/v1/p2ptrading/usertrades/

(image: enter image description here)
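
As a rough sketch of what requesting such an endpoint directly could look like, the example below uses the JDK's built-in java.net.http.HttpClient (Java 11+) instead of Jsoup, since the response is JSON rather than HTML. The class name ApiFetch and the helper buildRequest are hypothetical, and the Accept and User-Agent values are guesses; the headers the server really requires, and whether it answers without a login or cookies at all, have to be read from the request in the Network tab.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class ApiFetch {

    // Endpoint named in the answer. Whether it responds to anonymous
    // requests is an assumption, not something verified here.
    static final String TRADES_URL =
            "https://www.wtfskins.com/api/v1/p2ptrading/usertrades/";

    // Builds a GET request that asks for JSON. The header values are
    // placeholders: copy the real ones from the browser's Network tab.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        try {
            // Send the request and read the whole body as a string.
            HttpResponse<String> response = client.send(
                    buildRequest(TRADES_URL),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("HTTP " + response.statusCode());
            System.out.println(response.body()); // raw JSON, if the call succeeded
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still see it.
            Thread.currentThread().interrupt();
        }
    }
}
```

If the status code is not 200, compare the request with the one the browser sends and add the missing headers or cookies one at a time. Once the body comes back as JSON, it can be handed to any JSON parser.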

You can start by reading How the Web works, along with its sections on asynchronous JavaScript requests and REST API basics. If you are not familiar with web development, this research will take a bit of time.


 