How can I extract web app content from HTML code?
I'm trying to gather data from CS:GO gambling sites in order to analyze them. I wrote a very short program that fetches the HTML of this website, but it does not return the content of the web app. My problem is that I need the information inside this web app. I can view it in Chrome, so I assume there is a way to get at it. Maybe the pictures help to show what I'm looking for:
HTML code; I marked the line I want
import java.io.IOException;
import org.jsoup.Jsoup;

public class Main {
    public static void main(String[] args) {
        try {
            // Fetch the page and print the HTML exactly as the server returned it
            String html = Jsoup.connect("https://www.wtfskins.com/crash").get().html();
            System.out.println(html);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
This is what I get. I need the content of the app-root element:
<body> <app-root>
loading... // That's the problem
</app-root>
<script src="https://code.jquery.com/jquery-3.1.1.min.js" integrity="sha256-hVVnYaiADRTO2PzUGmuLJr8BLUSjGIZsDYGmIJLv2b8=" crossorigin="anonymous"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/tether/1.4.0/js/tether.min.js" integrity="sha384-DztdAPBWPRXSA/3eYEEUWrWCy7G5KFbe8fFjk5JAIxUYHKkDx6Qin1DkWx51bBrb" crossorigin="anonymous"></script>
<script src="/assets/js/jquery-ui.min.js"></script>
<script src="/assets/js/bootstrap.js"></script>
<script src="/assets/js/sha3.js"></script>
<script src="/assets/js/sha256.js"></script>
<script type="text/javascript" src="inline.318b50c57b4eba3d437b.bundle.js"></script>
<script type="text/javascript" src="polyfills.2b75d68d2d6cb678fc8d.bundle.js"></script>
<script type="text/javascript" src="main.7932c68952979c366236.bundle.js"></script>
</body>
The data is loaded into the page after the initial DOM. When you fetch the page with Jsoup, you only get the response to the initial HTML request; Jsoup does not execute JavaScript.
If you check the Network tab in the browser's dev tools, you will see that after the initial load there are extra XHR requests that fetch the data. The ngcontent attributes on the tags indicate that the page is rendered with Angular, which is a JavaScript framework.
Sites do this to make page loads more efficient and to make scraping a bit harder.
The Network tab shows multiple requests after the page load that have JSON responses. You need to look at those and see which request headers are mandatory when requesting them. As the image shows, one of the interesting ones is: https://www.wtfskins.com/api/v1/p2ptrading/usertrades/
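As a rough illustration of requesting such a JSON endpoint directly, here is a minimal sketch that uses the JDK's built-in java.net.http client (Java 11+) instead of Jsoup. The class name ApiFetch and the helpers buildRequest and fetch are hypothetical, and the Accept and User-Agent headers are only assumptions; the headers the site actually requires have to be read off the Network tab.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ApiFetch {
    // Builds a GET request for a JSON endpoint.
    // The two headers are assumptions; copy the real ones from the Network tab.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))
                .header("Accept", "application/json")
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    // Sends the request and returns the raw response body (expected to be JSON).
    static String fetch(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        try {
            // Endpoint taken from the answer above; it may require extra headers or a session.
            System.out.println(fetch("https://www.wtfskins.com/api/v1/p2ptrading/usertrades/"));
        } catch (IOException | InterruptedException e) {
            System.err.println("Request failed: " + e);
        }
    }
}
```

If you would rather stay with Jsoup, Jsoup.connect(url).ignoreContentType(true).execute().body() returns the raw response body in the same way. Either way, the result is a JSON string that still has to be parsed with a JSON library of your choice.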
You can start by reading "How the Web works", along with the subtopics on asynchronous JavaScript requests and REST API basics. If you are not familiar with web development, the research will take a bit of time.