Web scraper not creating CSV file

I have created a web scraper that fetches share-price market data from the stock exchange website, www.psx.com.pk. That site has a Market Summary hyperlink, and that is the page I need to scrape. My program is as follows.

package com.market_summary;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.Iterator;
import java.util.Locale;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ComMarket_summary {

    boolean writeCSVToConsole = true;
    boolean writeCSVToFile = true;
    boolean sortTheList = true;
    boolean writeToConsole;
    boolean writeToFile;
    public static Document doc = null;
    public static Elements tbodyElements = null;
    public static Elements elements = null;
    public static Elements tdElements = null;
    public static Elements trElement2 = null;
    public static String Dcomma = ",";
    public static String line = "";
    public static ArrayList<Elements> sampleList = new ArrayList<Elements>();

    public static void createConnection() throws IOException {
        System.setProperty("http.proxyHost", "191.1.1.202");
        System.setProperty("http.proxyPort", "8080");
        String tempUrl = "http://www.psx.com.pk/index.php";
        doc = Jsoup.connect(tempUrl).get();
        System.out.println("Successfully Connected");
    }

    public static void parsingHTML() throws Exception {

        File fold = new File("C:\\market_smry.csv");
        fold.delete();
        File fnew = new File("C:\\market_smry.csv");
        for (Element table : doc.getElementsByTag("table")) {
            for (Element trElement : table.getElementsByTag("tr")) {
                trElement2 = trElement.getElementsByTag("td");
                tdElements = trElement.getElementsByTag("td");
                FileWriter sb = new FileWriter(fnew, true);

                if (trElement.hasClass("marketData")) {
                    for (Iterator<Element> it = tdElements.iterator(); it.hasNext();) {
                        if (it.hasNext()) {
                            sb.append("\r\n");

                        }

                        for (Iterator<Element> it2 = trElement2.iterator(); it.hasNext();) {
                            Element tdElement2 = it.next();
                            final String content = tdElement2.text();
                            if (it2.hasNext()) {

                                sb.append(formatData(content));
                                sb.append("   |   ");

                            }

                        }

                        System.out.println(sb.toString());
                        sb.flush();
                        sb.close();
                    }
                }
                System.out.println(sampleList.add(tdElements));

            }
        }
    }
    private static final SimpleDateFormat FORMATTER_MMM_d_yyyy = new SimpleDateFormat("MMM d, yyyy", Locale.US);
    private static final SimpleDateFormat FORMATTER_dd_MMM_yyyy = new SimpleDateFormat("dd-MMM-yyyy", Locale.US);

    public static String formatData(String text) {
        String tmp = null;

        try {
            Date d = FORMATTER_MMM_d_yyyy.parse(text);
            tmp = FORMATTER_dd_MMM_yyyy.format(d);
        } catch (ParseException pe) {
            tmp = text;
        }

        return tmp;
    }

    public static void main(String[] args) throws IOException, Exception {
        createConnection();
        parsingHTML();

    }
}

Now, the problem is that when I execute this program it should create a .csv file, but no file is created. When I debugged the code I found that the program never enters the loop, and I don't understand why. When I run the same program against another website with a slightly different page structure, it runs fine.
From what I understand, the data sits inside the #document node, which is a virtual element, and that is why the program can't read it; the other website has no such node. Please help me read the data inside the #document element.

Long Story Short

Change your temp URL to http://www.psx.com.pk/phps/index1.php
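
Once the loop has rows to work with, the write step can also be kept simpler than opening a new FileWriter for every row. The following is only a sketch, assuming the text of each row's cells has already been collected into a list of strings. The class CsvSketch and its methods are hypothetical names, a comma is used as the separator instead of the "   |   " in the question, and no escaping is done, so cell values that themselves contain commas would need extra handling.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of the original program.
class CsvSketch {

    // Joins one row of cell text into a single comma-separated line.
    // Assumption: cells contain no commas or quotes, so nothing is escaped.
    static String toCsvLine(List<String> cells) {
        return String.join(",", cells);
    }

    // Opens the file once, writes every row, and closes it via try-with-resources.
    // Files.newBufferedWriter creates the file or truncates an existing one.
    static void writeRows(Path target, List<List<String>> rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (List<String> row : rows) {
                out.write(toCsvLine(row));
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder rows standing in for the text of each marketData <td>.
        List<List<String>> rows = List.of(
                List.of("SYMBOL", "OPEN", "CLOSE"),
                List.of("ABC", "10.5", "11.0"));
        Path target = Files.createTempFile("market_smry", ".csv");
        writeRows(target, rows);
        System.out.println("Wrote " + target);
    }
}
```

Because the writer truncates an existing file by default, the separate delete call from the question is not needed in this variant.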

Explanation

There is no table in the document at http://www.psx.com.pk/index.php. jsoup fetches only the document at the URL it is given and does not load the content of frames, so the loop over table elements finds nothing.

Instead, the page shows its content in a frameset with two frames. The #document node seen in the browser's inspector is the separate document loaded into a frame.

One is a dummy frame with the URL http://www.psx.com.pk/phps/blank.php. The other is the real page that shows the actual data, and its URL is http://www.psx.com.pk/phps/index1.php
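
To make the relationship between the outer page and its frames concrete, here is a small sketch of how a frame's src attribute maps to the URL that has to be fetched. It uses only java.net.URI. The class FrameUrlSketch is a hypothetical name, and the src values are assumptions chosen to match the two URLs above; the real attributes on the page may be written differently.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original program.
class FrameUrlSketch {

    // Resolves each frame's src attribute against the URL of the page that declares it.
    static List<String> resolveFrameUrls(String pageUrl, List<String> frameSrcs) {
        URI base = URI.create(pageUrl);
        List<String> resolved = new ArrayList<>();
        for (String src : frameSrcs) {
            resolved.add(base.resolve(src).toString());
        }
        return resolved;
    }

    public static void main(String[] args) {
        // Assumed src values; the real attributes on the page may differ.
        List<String> srcs = List.of("phps/blank.php", "phps/index1.php");
        for (String url : resolveFrameUrls("http://www.psx.com.pk/index.php", srcs)) {
            System.out.println(url);
        }
    }
}
```

In the scraper itself jsoup can supply these values: doc.select("frame") returns the frame elements of the fetched page, and absUrl("src") on each one yields the resolved URL, which can then be passed to Jsoup.connect(...) in place of the hard-coded address.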
