How to scrape data from one table when other tables have the same class

I need to scrape data from a website that contains many tables and save it to a .csv file. I only want the data from one table with the class marketData, but two other tables have the same class. Currently my code pulls in the data from every table with class marketData. How can I scrape just one table and skip the others? My code is as follows.

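import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ComMarket_summary {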

boolean writeCSVToConsole = true;
boolean writeCSVToFile = true;
boolean sortTheList = true;
boolean writeToConsole;
boolean writeToFile;
public static Document doc = null;
public static Elements tbodyElements = null;
public static Elements elements = null;
public static Elements tdElements = null;
public static Elements trElement2 = null;
public static String Dcomma = ",";
public static String line = "";
public static ArrayList<Elements> sampleList = new ArrayList<Elements>();

public static void createConnection() throws IOException {
    System.setProperty("http.proxyHost", "191.1.1.202");
    System.setProperty("http.proxyPort", "8080");
    String tempUrl = "http://www.psx.com.pk/phps/mktSummary.php";
    doc = Jsoup.parse(new URL(tempUrl), 1000);        
    System.out.println("Successfully Connected");
}

public static void parsingHTML() throws Exception {

    for (Element table : doc.getElementsByTag("table")) {
        for (Element trElement : table.getElementsByTag("tr")) {
            File fold = new File("C:\\market_smry.csv");
            fold.delete();
            File fnew = new File("C:\\market_smry.csv");
            trElement2 = trElement.getElementsByTag("tr");
            tdElements = trElement.getElementsByTag("td");
            FileWriter sb = new FileWriter(fnew, true);

            if (table.hasClass("marketData")) {

                for (Iterator<Element> it = trElement2.iterator(); it.hasNext();) {
                    if (it.hasNext()) {
                        sb.append("\r\n");

                    }

                    for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                        Element tdElement2 = it.next();
                        final String content = tdElement2.text();
                        if (it2.hasNext()) {

                            sb.append(formatData(content));
                            sb.append("   ,   ");

                        }

                    }

                    System.out.println(sb.toString());
                    sb.flush();
                    sb.close();
                }
            }
            System.out.println(sampleList.add(tdElements));

        }
    }
}
private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
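// "yyyy" is the calendar year; "YYYY" is the week-based year and can be off by one around New Year.
private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-yyyy", Locale.US);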

public static String formatData(String text) {
    String tmp = null;

    try {
        Date d = FORMATTER_MMM_d_yyyy.parse(text);
        tmp = FORMATTER_dd_MMM_yyyy.format(d);
    } catch (ParseException pe) {
        tmp = text;
    }

    return tmp;
}

public static void main(String[] args) throws IOException, Exception {
    createConnection();
    parsingHTML();

}
}

PS: I am using JDK 1.8, JRE 1.8, and jsoup 1.8.

You can optimize your code by using a more specific selector.

for (Element table : doc.select("table.marketData")) {
//Process table
}

If you want to process just one specific table on the page, you can access it by its index in the selection. The index is zero-based, so get(1) below returns the second matching table.

Elements tables = doc.select("table.marketData");
Element table = tables.get(1);
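
As a quick illustration of the two snippets above, here is a minimal sketch that runs against a small inline document instead of the real page. The HTML string, the class name TableByIndex, and the helper marketTable are made up for the example; only select and get are actual jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableByIndex {
    // Hypothetical markup standing in for the real page: three tables share one class.
    static final String HTML =
        "<table class='marketData'><tr><td>first</td></tr></table>"
      + "<table class='other'><tr><td>ignored</td></tr></table>"
      + "<table class='marketData'><tr><td>second</td></tr></table>"
      + "<table class='marketData'><tr><td>third</td></tr></table>";

    /** Returns the table at the given zero-based position among table.marketData, or null. */
    static Element marketTable(Document doc, int index) {
        Elements tables = doc.select("table.marketData"); // tables without the class are skipped
        return index >= 0 && index < tables.size() ? tables.get(index) : null;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        System.out.println(doc.select("table.marketData").size()); // number of matching tables
        System.out.println(marketTable(doc, 1).text());            // text of the second match
    }
}
```

The bounds check is there because get throws when the index is out of range, which is what would happen if the page layout changed and fewer tables matched.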

Since there are three tables with class "marketData", a fixed index will break if the page layout changes, so it is safer to find some other identifying feature of the table you want. Does it have an id? Are its header columns different from the others? Without seeing the HTML, I can't give more guidance than that.
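
If the header columns do differ, one possible approach is to select the table by the text of one of its header cells. The sketch below assumes the headers are th elements; the markup, the header texts, and the names TableByHeader and tableWithHeader are invented for illustration, since I haven't seen the real page. It relies on jsoup's :has(...) and :containsOwn(...) pseudo-selectors.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableByHeader {
    // Hypothetical markup; the real header texts are not known.
    static final String HTML =
        "<table class='marketData'><tr><th>SYMBOL</th><th>VOLUME</th></tr>"
      + "<tr><td>ABC</td><td>100</td></tr></table>"
      + "<table class='marketData'><tr><th>INDEX</th><th>CHANGE</th></tr>"
      + "<tr><td>KSE100</td><td>1.5</td></tr></table>";

    /** Returns the first marketData table with a th containing headerText, or null. */
    static Element tableWithHeader(Document doc, String headerText) {
        // :has(...) keeps tables that contain a matching descendant;
        // :containsOwn(...) matches the th's own text, ignoring case.
        // Assumption: headerText has no parentheses, which would need escaping.
        String query = "table.marketData:has(th:containsOwn(" + headerText + "))";
        return doc.select(query).first();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        Element table = tableWithHeader(doc, "VOLUME");
        System.out.println(table.select("td").first().text());
    }
}
```

If the table has an id instead, doc.getElementById or a "table#someId" selector would be simpler than matching on header text.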

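Let's suppose you want to extract data from the first table.
You could use this CSS selector: table.marketData:nth-of-type(1). Note that :nth-of-type(1) matches a table that is the first table element among its siblings, so this only singles out one table if the three marketData tables share the same parent and yours comes first. If they sit in different containers, doc.select("table.marketData").first() is the safer choice.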

Your code then becomes:

// Use select(...) here: getElementsByTag only accepts a plain tag name, not a CSS selector.
for (Element table : doc.select("table.marketData:nth-of-type(1)")) {
    for (Element trElement : table.getElementsByTag("tr")) {
        File fold = new File("C:\\market_smry.csv");
        fold.delete();
        File fnew = new File("C:\\market_smry.csv");
        trElement2 = trElement.getElementsByTag("tr");
        tdElements = trElement.getElementsByTag("td");
        FileWriter sb = new FileWriter(fnew, true);

        // /////////
        // You can safely remove the if block below.  
        // Jsoup has already performed the filtering for you.
        // /////////
        //if (table.hasClass("marketData")) {

            for (Iterator<Element> it = trElement2.iterator(); it.hasNext();) {
                if (it.hasNext()) {
                    sb.append("\r\n");

                }

                for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                    Element tdElement2 = it.next();
                    final String content = tdElement2.text();
                    if (it2.hasNext()) {

                        sb.append(formatData(content));
                        sb.append("   ,   ");

                    }

                }

                System.out.println(sb.toString());
                sb.flush();
                sb.close();
            }
        //}
        System.out.println(sampleList.add(tdElements));
    }
}
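
The loop carried over from the question deletes and reopens the CSV file for every row and closes the writer inside the iteration, so it may be worth restructuring once the table selection works. Here is a minimal sketch of one way to do that, not a drop-in replacement: the class and method names are hypothetical, the proxy settings and the formatData date conversion from the question are left out, and it assumes the chosen table contains no nested tables.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MarketTableCsv {

    /** Converts each row of the table into one comma-separated line. */
    static List<String> toCsvLines(Element table) {
        List<String> lines = new ArrayList<>();
        for (Element row : table.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("th, td")) {
                cells.add(quote(cell.text()));
            }
            if (!cells.isEmpty()) {
                lines.add(String.join(",", cells));
            }
        }
        return lines;
    }

    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    static String quote(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Writes all rows in one call; an existing file at target is overwritten. */
    static void writeCsv(Element table, Path target) throws IOException {
        Files.write(target, toCsvLines(table), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect("http://www.psx.com.pk/phps/mktSummary.php").get();
        Element table = doc.select("table.marketData").first(); // pick the table you need here
        if (table != null) {
            writeCsv(table, Paths.get("market_smry.csv"));
        }
    }
}
```

Quoting matters for this kind of data because numeric cells such as 1,200.50 contain commas and would otherwise be split into two columns.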
