Extracting data from a webpage using Java
I'm working on a project that consists of collecting job offers from the web. As a first step, I want to extract data (job offer data) from a specific webpage. Is there an API or existing code that can help me?
The best project I have found is jsoup ( http://jsoup.org/ ).
For example, you can make the request with Apache HttpClient and parse the response with jsoup like this:
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
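public class ... {
    // The method name `extract` is a placeholder; `url` and `params`
    // are assumed to be Strings supplied by the caller.
    void extract(String url, String params) throws Exception {
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet requestGet = new HttpGet(url + params);
        HttpResponse response = client.execute(requestGet);
        HttpEntity entity = response.getEntity();
        String responseString = EntityUtils.toString(entity, "UTF-8");
        /*
         * Here you can retrieve the information with the jsoup library.
         * This example extracts data from a table element.
         */
        // Jsoup.parse expects the HTML as a String, so pass the response body.
        Document doc = Jsoup.parse(responseString);
        // get(1) selects the second <table> on the page (indexes start at 0).
        Element table = doc.getElementsByTag("table").get(1);
        Elements rows = table.getElementsByTag("tr");
        for (Element row : rows) {
            // TODO: process each row, e.g. read its cells with row.getElementsByTag("td")
        }
    }
}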
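The request above is built by concatenating `url + params`, which only works if `params` is already a valid, encoded query string. As a minimal sketch of one way to build it, using only the Java standard library: the class name `QueryStringBuilder`, the parameter names `q` and `location`, and the `example.com` URL are all assumptions for illustration and do not come from the original answer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original answer.
final class QueryStringBuilder {

    // Builds "?k1=v1&k2=v2" with keys and values form-encoded as UTF-8.
    // Returns an empty string when there are no parameters.
    static String build(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(sb.length() == 0 ? '?' : '&');
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be supported by every Java platform.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "java developer");   // assumed parameter name
        params.put("location", "Paris");     // assumed parameter name
        String url = "http://example.com/jobs"; // placeholder URL
        // prints http://example.com/jobs?q=java+developer&location=Paris
        System.out.println(url + build(params));
    }
}
```

The resulting string could then be passed as `params` to `new HttpGet(url + params)`. Note that `URLEncoder` applies form encoding, so spaces become `+`; whether the target site accepts that is something to check against the actual page.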