How can I extract web app content from HTML code?

I'm currently trying to gather data from CS:GO gambling sites in order to analyze it. I wrote a very short program that fetches the HTML code of this website, but it doesn't capture the content of the web app, and that content is exactly what I need. I can view it in Chrome, so I assume there is a way to get it. Maybe the pictures help to show what I'm looking for:

HTML code (I marked the line I want)

import java.io.IOException;
import org.jsoup.Jsoup;

public class Main {

    public static void main(String[] args) {
        try {
            // Fetch the page and print the HTML exactly as the server returned it.
            String html = Jsoup.connect("https://www.wtfskins.com/crash").get().html();
            System.out.println(html);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

This is what I get. I need the content of <app-root>:

<body> <app-root> 
  loading... // That's the problem
 </app-root> 
 <script src="https://code.jquery.com/jquery-3.1.1.min.js" integrity="sha256-hVVnYaiADRTO2PzUGmuLJr8BLUSjGIZsDYGmIJLv2b8=" crossorigin="anonymous"></script> 
 <script src="https://cdnjs.cloudflare.com/ajax/libs/tether/1.4.0/js/tether.min.js" integrity="sha384-DztdAPBWPRXSA/3eYEEUWrWCy7G5KFbe8fFjk5JAIxUYHKkDx6Qin1DkWx51bBrb" crossorigin="anonymous"></script> 
 <script src="/assets/js/jquery-ui.min.js"></script> 
 <script src="/assets/js/bootstrap.js"></script> 
 <script src="/assets/js/sha3.js"></script> 
 <script src="/assets/js/sha256.js"></script> 
 <script type="text/javascript" src="inline.318b50c57b4eba3d437b.bundle.js"></script> 
 <script type="text/javascript" src="polyfills.2b75d68d2d6cb678fc8d.bundle.js"></script> 
 <script type="text/javascript" src="main.7932c68952979c366236.bundle.js"></script>  
</body>

The data is loaded into the page after the initial DOM has been built. When you fetch the page with Jsoup, you only get the response to the initial HTML request; Jsoup does not run the page's JavaScript, so the content that the scripts would render never appears.

This image shows that the HTML request really does return a nearly empty HTML structure.
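A quick way to confirm the same thing from code is to look at what sits inside <app-root> in the HTML you fetched. The following is only a minimal sketch using the JDK alone: the class and method names (AppRootCheck, appRootContent, isPlaceholderOnly) are made up for illustration, and the regular expression is a rough check for this one element, not a general HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppRootCheck {

    // Captures whatever sits between <app-root ...> and </app-root>.
    // Rough check for this single element only, not a real HTML parser.
    private static final Pattern APP_ROOT =
            Pattern.compile("<app-root[^>]*>(.*?)</app-root>", Pattern.DOTALL);

    /** Returns the trimmed content of the first app-root element, if there is one. */
    static Optional<String> appRootContent(String html) {
        Matcher m = APP_ROOT.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    /** True when app-root exists but holds no child tags, i.e. only placeholder text. */
    static boolean isPlaceholderOnly(String html) {
        return appRootContent(html).map(c -> !c.contains("<")).orElse(false);
    }

    public static void main(String[] args) {
        // Shortened copy of the markup from the question.
        String html = "<body> <app-root>\n  loading...\n </app-root>\n"
                + " <script src=\"main.7932c68952979c366236.bundle.js\"></script>\n</body>";
        System.out.println(appRootContent(html).orElse("(no app-root)"));
        System.out.println(isPlaceholderOnly(html));
    }
}
```

For the markup from the question this prints loading... followed by true, which means the server sent only the empty shell and the real content has to come from somewhere else.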

If you check the Network tab in the browser's dev tools, you will see that after the initial load there are additional XHR requests that fetch the data. The ngcontent attributes on the tags indicate that the page is rendered with Angular, which is a JavaScript framework.
This is done to make page loads more efficient and to make scraping a bit harder.

AFTER CHECKING

The Network tab shows multiple requests after the page load that have JSON responses. You need to look at those and find out which request headers are mandatory to call them. As the image shows, one of the interesting ones is: https://www.wtfskins.com/api/v1/p2ptrading/usertrades/
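Once you have picked a request, you can try to replay it directly instead of parsing the HTML. Here is a rough sketch with the JDK's built-in java.net.http.HttpClient (Java 11+). The URL is the one from the Network tab above; the class name ApiFetch, the method buildRequest, and the header values are my own assumptions. I have not verified which headers, cookies, or tokens this endpoint actually requires, so copy the real ones from the dev tools if the call is rejected.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class ApiFetch {

    // Endpoint seen in the Network tab.
    static final String TRADES_URL = "https://www.wtfskins.com/api/v1/p2ptrading/usertrades/";

    /** Builds a GET request with headers similar to what a browser XHR sends. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                // Assumption: placeholder value; copy the real header from dev tools if needed.
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        try {
            // Send the request and read the whole response body as a string.
            HttpResponse<String> response =
                    client.send(buildRequest(TRADES_URL), HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still see it.
            Thread.currentThread().interrupt();
        }
    }
}
```

If the status is 200 and the body is JSON, you can feed it to any JSON library. If you get a 401 or 403, compare your request with the one in the Network tab and add the missing headers one by one until it works.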


You can start by reading How the Web works, including the subtopics on asynchronous JavaScript requests and REST API basics. If you are not familiar with web development, the research will take a bit of time.
