How to scrape data from one table when other tables have the same class

Question

I have to scrape data from a website that has many tables and save it in a .csv file. I only want the data from one table with the class marketData, but two other tables have the same class. Currently my code retrieves the data from every table with the class marketData. How can I scrape data from one table and skip the others? My code is as follows.

public class ComMarket_summary {

boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ",";
public static String line = "";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();

public static void createConnection() throws IOException {
    System.setProperty("http.proxyHost", "191.1.1.202");
    System.setProperty("http.proxyPort", "8080");
    String tempUrl = "http://www.psx.com.pk/phps/mktSummary.php";
    doc = Jsoup.parse(new URL(tempUrl), 1000);        
    System.out.println("Successfully Connected");
}

public static void parsingHTML() throws Exception {

    for (Element table : doc.getElementsByTag("table")) {
        for (Element trElement : table.getElementsByTag("tr")) {
            File fold = new File("C:\\market_smry.csv");
            fold.delete();
            File fnew = new File("C:\\market_smry.csv");
            trElement2 = trElement.getElementsByTag("tr");
            tdElements = trElement.getElementsByTag("td");
            FileWriter sb = new FileWriter(fnew, true);

            if (table.hasClass("marketData")) {

                for (Iterator<Element> it = trElement2.iterator(); it.hasNext();) {
                    if (it.hasNext()) {
                        sb.append("\r\n");

                    }

                    for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                        Element tdElement2 = it.next();
                        final String content = tdElement2.text();
                        if (it2.hasNext()) {

                            sb.append(formatData(content));
                            sb.append("   ,   ");

                        }

                    }

                    System.out.println(sb.toString());
                    sb.flush();
                    sb.close();
                }
            }
            System.out.println(sampleList.add(tdElements));

        }
    }
}
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
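// Lowercase yyyy is the calendar year; uppercase YYYY is the week year and can be off near year boundaries.
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-yyyy", Locale.US);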

public static String formatData(String text) {
    String tmp = null;

    try {
        Date d = FORMATTER_MMM_d_yyyy.parse(text);
        tmp = FORMATTER_dd_MMM_yyyy.format(d);
    } catch (ParseException pe) {
        tmp = text;
    }

    return tmp;
}

public static void main(String[] args) throws Exception {
    createConnection();
    parsingHTML();
}
}

PS: I am using JDK 1.8, JRE 1.8, and jsoup 1.8.

Answer 1

You can simplify your code by using a more specific selector, so that only tables with the class marketData are visited.

for (Element table : doc.select("table.marketData")) {
//Process table
}

If you want to process just one specific table on the page, you can access it by its index in the selection result.

Elements tables = doc.select("table.marketData");
Element table = tables.get(1); // zero-based: index 1 is the second matching table
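
As a minimal sketch of the index-based approach, the snippet below parses an inline HTML string instead of the live page and collects the rows of one chosen table. The class name TableByIndex, the helper rowsOfTable, and the sample markup are assumptions made for illustration; the real page layout may differ.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class TableByIndex {

    // Hypothetical markup standing in for the real page: three tables share one class.
    static final String SAMPLE_HTML =
        "<table class='marketData'><tr><td>A1</td><td>A2</td></tr></table>"
      + "<table class='marketData'><tr><td>B1</td><td>B2</td></tr></table>"
      + "<table class='marketData'><tr><td>C1</td><td>C2</td></tr></table>";

    // Returns the rows of the index-th table.marketData as comma-joined lines.
    static List<String> rowsOfTable(Document doc, int index) {
        Elements tables = doc.select("table.marketData");
        // Zero-based; throws IndexOutOfBoundsException if the index is out of range.
        Element table = tables.get(index);
        List<String> lines = new ArrayList<>();
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cells.add(cell.text());
            }
            lines.add(String.join(",", cells));
        }
        return lines;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        // Index 1 picks the second of the three tables.
        for (String line : rowsOfTable(doc, 1)) {
            System.out.println(line);
        }
    }
}
```

Selecting by index is simple but depends on the order of the tables in the page, so it may need adjusting if the site changes its layout.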

Answer 2

Since there are three tables with the class "marketData", you will need to find some other identifying feature of the table you want (does it have an id? are its header columns different? etc.). Without seeing the HTML, I can't give more guidance than that, though.
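
One way this idea could look in practice is to pick the table by the text of a header cell. The sketch below is only an illustration under assumptions: the class name TableByHeader, the helper findByHeader, and the header texts in the sample markup are invented, since the real page's HTML is not shown here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableByHeader {

    // Hypothetical markup: every table has the same class, but the header cells differ.
    static final String SAMPLE_HTML =
        "<table class='marketData'><tr><th>Index</th></tr><tr><td>KSE100</td></tr></table>"
      + "<table class='marketData'><tr><th>Symbol</th></tr><tr><td>ABC</td></tr></table>";

    // Returns the first table.marketData that has a <th> whose text contains headerText,
    // or null if no such table exists.
    static Element findByHeader(Document doc, String headerText) {
        for (Element table : doc.select("table.marketData")) {
            for (Element th : table.select("th")) {
                if (th.text().contains(headerText)) {
                    return table;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        Element table = findByHeader(doc, "Symbol");
        if (table != null) {
            // Prints the text of the data cells of the matching table.
            System.out.println(table.select("td").text());
        }
    }
}
```

If the wanted table has an id attribute, selecting it directly with an id selector would be simpler than matching on header text.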

Answer 3

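Let's suppose you want to extract data from the first table.
You would use this CSS selector: table.marketData:nth-of-type(1). Note that :nth-of-type(1) counts sibling table elements regardless of their class, so this selector matches only if the table you want is also the first table element inside its parent.

Your code then becomes: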

// select() accepts a CSS selector; getElementsByTag() only matches plain tag names.
for (Element table : doc.select("table.marketData:nth-of-type(1)")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        File fold = new File("C:\\market_smry.csv");
        fold.delete();
        File fnew = new File("C:\\market_smry.csv");
        trElement2 = trElement.getElementsByTag("tr");
        tdElements = trElement.getElementsByTag("td");
        FileWriter sb = new FileWriter(fnew, true);

        // /////////
        // You can safely remove the if block below.  
        // Jsoup has already performed the filtering for you.
        // /////////
        //if (table.hasClass("marketData")) {

            for (Iterator<Element> it = trElement2.iterator(); it.hasNext();) {
                if (it.hasNext()) {
                    sb.append("\r\n");

                }

                for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                    Element tdElement2 = it.next();
                    final String content = tdElement2.text();
                    if (it2.hasNext()) {

                        sb.append(formatData(content));
                        sb.append("   ,   ");

                    }

                }

                System.out.println(sb.toString());
                sb.flush();
                sb.close();
            }
        //}
        System.out.println(sampleList.add(tdElements));
    }
}
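
One thing the snippet above inherits from the question's code is that it deletes and reopens the output file for every row. As a hedged sketch of an alternative, the helper below writes all rows through a single writer that the caller opens once. The class name CsvRows and its methods are hypothetical, and the quoting rule (wrap fields that contain commas, quotes, or line breaks) is an assumption about the desired CSV style rather than something the original code does.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

public class CsvRows {

    // Quotes a field if it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Writes every row to the given writer; the caller opens and closes it once.
    static void writeRows(Writer out, List<List<String>> rows) throws IOException {
        for (List<String> row : rows) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < row.size(); i++) {
                if (i > 0) {
                    line.append(',');
                }
                line.append(escape(row.get(i)));
            }
            out.write(line.append("\r\n").toString());
        }
    }

    public static void main(String[] args) throws IOException {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("Symbol", "Volume"),
            Arrays.asList("ABC", "1,200"));
        // A StringWriter keeps the demo self-contained; for a real file,
        // a FileWriter in the same try-with-resources position could be used.
        try (Writer out = new StringWriter()) {
            writeRows(out, rows);
            System.out.print(out);
        }
    }
}
```

The cell texts collected from the selected table's rows could be passed to writeRows in place of the hard-coded sample rows.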

