
Java - Jsoup, scrape HTML

I am using Jsoup with Java to parse an HTML file. How can I extract just the line that says "Hourly Rate: 23,016 orders"? I am parsing a lot of files, so the number next to "Hourly Rate" will change.

<html>
<head>
<title>Testing</title>
</head>
<body>
<p class=MsoNormal align=center style='background:#DEDEDF'>
<span style='font-size:18.0pt'><b>Testing</b></span></p>
Hourly Rate: 23,016 orders<br>
<table border=0 cellpadding=0>
<tr valign=top>
<td>

Thanks

I just added this code:

String hourlyRate = doc.body().ownText();
// String text = doc.body().text();

System.out.println(hourlyRate);

This printed out: Hourly Rate: 23,016 orders

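Grab the element with the MsoNormal class, then use a regular expression to look for a number, i.e. the code below. Note that in the sample HTML the "Hourly Rate" text comes after the closing </p>, so it belongs to the body rather than to the MsoNormal paragraph; in that case the same pattern can be applied to doc.body().ownText() instead of msoNormal.text().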

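import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

Document doc = Jsoup.parse(htmlString);
Element msoNormal = doc.getElementsByClass("MsoNormal").first();
if (msoNormal != null) {
  // Digits, a comma, then more digits, e.g. "23,016"
  Pattern p = Pattern.compile("[0-9]+,[0-9]+");
  Matcher m = p.matcher(msoNormal.text());
  if (m.find())
    System.out.println(m.group());
}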
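
Since the number changes from file to file, it may help to anchor the pattern on the "Hourly Rate:" label and convert the match to an int. The following is only a rough sketch under a few assumptions: the class name HourlyRateParser and the method parse are made up for illustration, the input string is assumed to be whatever doc.body().ownText() returned, and the count is assumed to use commas as thousands separators. It uses only java.util.regex, so the Jsoup step stays as shown above.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup.
public class HourlyRateParser {
    // The label, optional whitespace, then digits with optional commas.
    private static final Pattern HOURLY_RATE =
            Pattern.compile("Hourly Rate:\\s*([0-9][0-9,]*)");

    /** Returns the order count, or an empty OptionalInt if the label is absent. */
    public static OptionalInt parse(String text) {
        Matcher m = HOURLY_RATE.matcher(text);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        // Drop the separators before converting, e.g. "23,016" becomes 23016.
        String digits = m.group(1).replace(",", "");
        return OptionalInt.of(Integer.parseInt(digits));
    }

    public static void main(String[] args) {
        // Assumption: in the real program this comes from doc.body().ownText().
        String text = "Hourly Rate: 23,016 orders";
        parse(text).ifPresent(System.out::println);
    }
}
```

Unlike "[0-9]+,[0-9]+", this pattern also accepts counts below 1,000 (no comma) and counts with more than one comma, and it ignores any other numbers that may appear in the text.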
