
Pulling all urls from any webpage, having trouble with indexOf [homework]

No matter what I enter, indexOf always returns negative 7. I am using the site http://www.columbusstate.edu

import java.io.IOException;
import java.io.PrintWriter;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;
public class WebCrawler
{
    private static int linkCount = 0;
    public static void main(String[] args) throws IOException
    {

Instance variables:

        ArrayList<String> links = new ArrayList<String>();
        System.out.println("Enter the website you would like to web crawl");
        Scanner input = new Scanner(System.in);
        String address=input.next();

Getting into the website:

        URL locator = new URL(address);
        Scanner in=new Scanner(locator.openStream());

        String str="";
        PrintWriter out=new PrintWriter("links.txt");

Searches the webpage and pulls out the links, or at least it should:

        while(in.hasNextLine())
        {
            str=in.next();
            if(str.contains("href=\"http://"))
            {   
                linkCount++;
                int start = str.indexOf("ht");
                int end = str.indexOf("/\"");
                if(links.contains(str.substring(start, end))){

                }
                else{
                     links.add("Line Number "+linkCount+""+str.substring(start, end));
                }
            }
            else if(str.contains("href=\"https://")){
                linkCount++;
                int start = str.indexOf("ht");
                int end = str.indexOf("://")+15;
                if(links.contains(str.substring(start, end))){

                }
                else{
                    links.add("Line Number "+linkCount+""+str.substring(start, end));
                }
            }
        }
        int num = links.size();
        System.out.println(num);
        out.println("Number of links on this webpage is "+linkCount);
        out.println("Links are:");
        for(int i = links.size()-1; i>0; i--){
           out.println(links.get(i)); 
        }
        out.close();
    }
}
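The negative seven is almost certainly coming from the substring call rather than from indexOf itself. in.next() returns whitespace-delimited tokens, so the opening http:// and the closing /" of an href usually land in different tokens; indexOf("/\"") then returns -1 for the token that matched, and substring(start, end) with end < start throws a StringIndexOutOfBoundsException whose message on older JDKs reports the negative length of the requested range. Here is a minimal sketch of that failure mode, using a hypothetical token of the kind a whitespace-splitting Scanner might produce:

public class IndexOfDemo {
    public static void main(String[] args) {
        // A whitespace-delimited token the way in.next() might return it:
        // the closing /" sits in a later token, so it is absent here.
        String token = "href=\"http://www.columbusstate.edu";

        int start = token.indexOf("ht");  // 6: the first "ht" is inside "http"
        int end = token.indexOf("/\"");   // -1: there is no /" in this token
        System.out.println("start=" + start + ", end=" + end);

        // end < start, so the requested range has length -1 - 6 = -7;
        // substring throws, and older JDKs put that -7 in the message.
        try {
            token.substring(start, end);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println(e.getMessage());
        }
    }
}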

If you are really looking for a way to extract the links from a web page, you are better off using a proper HTML parser than trying to do it by hand. Here is an example with JSoup:

import java.io.IOException;
import java.util.List;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractLinks(String url) throws IOException {
    final ArrayList<String> result = new ArrayList<String>();

    Document doc = Jsoup.connect(url).get();

    Elements links = doc.select("a[href]");

    // Collect each anchor's absolute URL; "abs:href" resolves relative
    // links against the page's base URI.
    for (Element link : links) {
      result.add(link.attr("abs:href"));
      // result.add(link.text()); // use this instead to collect the link text
    }
    }
    return result;
  }


  public final static void main(String[] args) throws Exception{
    String site = "http://www.columbusstate.edu";
    List<String> links = HTMLUtils.extractLinks(site);
    for (String link : links) {
      System.out.println(link);
    }
  }
}
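Note that attr("abs:href") asks jsoup to resolve each href against the document's base URI, so relative links such as a hypothetical /admissions come back as absolute URLs; plain attr("href") would return them exactly as they appear in the page source. You will need jsoup on the classpath to compile this (the org.jsoup:jsoup artifact).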
