JAVA parsing table data

I would like to extract some HTML data from a page's source. The reference page is view-source: http://www.4icu.org/reviews/index2.htm . How can I extract only the university name and the country name with Java? I know how to extract just the university names, since they sit between tags, but how could I make the program faster by scanning only the table cells with class="i", and also extract the country (e.g. United States) from the <... alt="United States" /> tag?

<tr>
<td><a name="UNIVERSITIES-BY-NAME"></a><h2>A-Z list of world Universities and Colleges</h2>
</tr>

<tr>
<td class="i"><a href="/reviews/9107.htm"> A.T. Still University</a></td>
<td width="50" align="right" nowrap>us <img src="/i/bg.gif" class="fl flag-us" alt="United States" /></td>
</tr>

Thanks in advance.

EDIT Following what @11thdimension said, here is my .java file:

import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class University {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        URL url = new URL ("http://www.4icu.org/reviews/index2.htm");

        URLConnection spoof = url.openConnection();
        // Spoof the connection so we look like a web browser
        spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

        String connect = url.toString();
        Document doc = Jsoup.connect(connect).get();

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}

However, when I run it, it gives me the following error.

Started
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=http://www.4icu.org/reviews/index2.htm

EDIT2 I have created the following program to get the response headers of the site.

import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;

public class Get_Header {
  public static void main(String[] args) throws Exception {
    URL url = new URL("http://www.4icu.org/reviews/index2.htm");
    URLConnection connection = url.openConnection();

    // Each header name maps to one or more values; the status line is stored under a null key
    Map<String, List<String>> responseMap = connection.getHeaderFields();
    for (Map.Entry<String, List<String>> header : responseMap.entrySet()) {
      System.out.println(header.getKey() + " = ");

      for (String value : header.getValue()) {
        System.out.println(value + ", ");
      }
    }
  }
}

It returns the following result.

X-Frame-Options = 
SAMEORIGIN, 
Transfer-Encoding = 
chunked, 
null = 
HTTP/1.1 403 Forbidden, 
CF-RAY = 
2ca61c7a769b1980-HKG, 
Server = 
cloudflare-nginx, 
Cache-Control = 
max-age=10, 
Connection = 
keep-alive, 
Set-Cookie = 
__cfduid=d4f8d740e0ae0dd551be15e031359844d1469853403; expires=Sun, 30-Jul-17 04:36:43 GMT; path=/; domain=.4icu.org; HttpOnly, 
Expires = 
Sat, 30 Jul 2016 04:36:53 GMT, 
Date = 
Sat, 30 Jul 2016 04:36:43 GMT, 
Content-Type = 
text/html; charset=UTF-8, 

I can get the headers, but how should I combine the code in EDIT and EDIT2 into one complete program? Thanks.

If it's going to be a one-time task, then you should probably use JavaScript for it.

The following code will log the required names to the console. You'll have to run it in the browser console.

(function () {
    var a = [];
    document.querySelectorAll("td.i a").forEach(function (anchor) { a.push(anchor.textContent.trim());});

    console.log(a.join("\n"));
})();

The following is a Java example with Jsoup selectors.

Maven Dependency

<dependencies>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.8.3</version>
    </dependency>
</dependencies>

Java Code

import java.io.File;
import java.util.Iterator;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestJsoup {
    public static void main(String[] args) throws Exception {
        System.out.println("Started");

        File file = new File("A-Z list of 11930 World Colleges & Universities.html");
        Document doc = Jsoup.parse(file, "UTF-8");

        Elements cells = doc.select("td.i");

        Iterator<Element> iterator = cells.iterator();

        while (iterator.hasNext()) {
            Element cell = iterator.next();
            String university = cell.select("a").text();
            String country = cell.nextElementSibling().select("img").attr("alt");

            System.out.printf("country : %s, university : %s %n", country, university);
        }
    }
}
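
Regarding the 403 in the question's EDIT: the User-Agent there is set on a separate URLConnection (spoof) that is never used, because Jsoup.connect opens its own connection. One plausible cause of the 403 is therefore that the request still goes out with a default Java User-Agent, though the site may reject requests for other reasons too. The following is a minimal sketch, using only the standard library, that sends the header on the request that actually fetches the page. The class name FetchWithUserAgent, the helper names open and fetch, and the User-Agent string are assumptions made for illustration; readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FetchWithUserAgent {

    // Hypothetical helper: prepares (but does not yet send) a request
    // that carries the given User-Agent header.
    public static HttpURLConnection open(String address, String userAgent) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    // Sends the request and returns the body as text. Throws if the server
    // answers with anything other than 200 OK (e.g. the 403 from the question).
    public static String fetch(String address, String userAgent) throws IOException {
        HttpURLConnection connection = open(address, userAgent);
        int status = connection.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status " + status + " for " + address);
        }
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // The User-Agent value is only an example of a browser-like string.
        String html = fetch("http://www.4icu.org/reviews/index2.htm",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
        System.out.println(html.length() + " characters fetched");
    }
}
```

The returned string can then be handed to Jsoup.parse(html, "http://www.4icu.org/") and queried with the same td.i selection as above. Alternatively, Jsoup's own connection accepts the header directly via Jsoup.connect(url).userAgent(...).get(), which avoids the separate URLConnection altogether. Whether the site accepts either request is not something this sketch can guarantee.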
