
JSoup select from HTML in unix

I have a program that extracts certain elements (article author names) from many articles on the PubMed site. While the program works correctly on my PC (Windows), when I try to run it on unix it returns an empty list. I suspect this is because the syntax should be somewhat different on a unix system. The problem is that the JSoup documentation does not mention anything about this. Does anyone know anything about it? My code is something like this:

Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString)
        .timeout(60000).userAgent("Mozilla/25.0").get();
System.out.println("connected");
Elements authors = doc.select("div.auths >*");
System.out.println("number of elements is " + authors.size());

The final System.out.println always says the size is 0, so the program cannot do anything more.

Thanks in advance

Complete Example:

protected static void searchLink(HashMap<String, HashSet<String>> authorsMap,
        HashMap<String, HashSet<String>> reverseAuthorsMap,
        String fileLine) throws IOException, ParseException, InterruptedException
{
    JSONParser parser = new JSONParser();
    JSONObject jsonObj = (JSONObject) parser.parse(fileLine.substring(0, fileLine.length() - 1));
    String pmidString = (String) jsonObj.get("pmid");
    System.out.println(pmidString);

    Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString)
            .timeout(60000).userAgent("Mozilla/25.0").get();
    System.out.println("connected");
    Elements authors = doc.select("div.auths >*");
    System.out.println("found the element");

    HashSet<String> authorsList = new HashSet<>();
    System.out.println("authors list hashSet created");
    System.out.println("number of elements is " + authors.size());
    for (int i = 0; i < authors.size(); i++)
    {
        // add the current name to the names list
        authorsList.add(authors.get(i).text());

        // pmidList variable
        HashSet<String> pmidList;
        System.out.println("stage 1");
        // if the author name is new, create the list, add the current pmid and put it in the map
        if (!authorsMap.containsKey(authors.get(i).text()))
        {
            pmidList = new HashSet<>();
            pmidList.add(pmidString);
            System.out.println("made it to searchLink");
            authorsMap.put(authors.get(i).text(), pmidList);
        }
        // if the author name has been found before, get the list of articles and add the current pmid
        else
        {
            System.out.println("Author exists in map");
            pmidList = authorsMap.get(authors.get(i).text());
            pmidList.add(pmidString);
            authorsMap.put(authors.get(i).text(), pmidList);
        }

        // finally, add the pmid-authorsList pair to the map
        reverseAuthorsMap.put(pmidString, authorsList);
        System.out.println("reverseauthors populated");
    }
}

I have a thread pool, and each thread uses this method to populate two maps. The fileLine argument is a single line that I parse as JSON, keeping the "pmid" field. Using this string I access the URL of the article and parse the HTML for the names of the authors. The rest should work (it does work on my PC), but because authors.size() is always 0, the for loop directly below the "number of elements" System.out.println never executes.
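One unix-specific pitfall worth ruling out: if the input file was produced on Windows and the lines are read on unix in a way that keeps the carriage return, each line ends in an invisible \r. In that case the fileLine.substring(0, fileLine.length() - 1) call strips only the \r instead of the character it was meant to remove, which can break both the JSON parse and the URL. A minimal sketch of the effect (the class and helper names are my own, chosen to mirror the substring call in searchLink):

```java
public class LineEndingCheck {
    // Hypothetical helper mirroring the substring(0, length - 1) trim in searchLink
    static String dropLastChar(String line) {
        return line.substring(0, line.length() - 1);
    }

    public static void main(String[] args) {
        // Assumed input shape: a JSON object per line with a trailing comma
        String unixLine = "{\"pmid\":\"24312906\"},";
        String windowsLine = unixLine + "\r"; // the same line read with its \r preserved

        System.out.println(dropLastChar(unixLine));    // strips the comma, as intended
        System.out.println(dropLastChar(windowsLine)); // strips only the \r; the comma survives
    }
}
```

If this is the cause, replacing the substring call with a trim of trailing whitespace (or normalizing the input file's line endings) would make the behavior identical on both systems.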

I've tried an example doing exactly what you're trying:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class Test {
  public static void main (String[] args) throws IOException {
    String docId = "24312906";
    if (args.length > 0) {
      docId = args[0];
    }

    String url = "http://www.ncbi.nlm.nih.gov/pubmed/" + docId;
    Document doc = Jsoup.connect(url).timeout(60000).userAgent("Mozilla/25.0").get();
    Elements authors = doc.select("div.auths >*");

    System.out.println("os.name=" + System.getProperty("os.name"));
    System.out.println("os.arch=" + System.getProperty("os.arch"));

    // System.out.println("doc=" + doc);
    System.out.println("authors=" + authors);
    System.out.println("authors.length=" + authors.size());

    for (Element a : authors) {
      System.out.println("  author: " + a);
    }
  }
}

My OS is Linux:

# uname -a
Linux graphene 3.11.0-13-generic #20-Ubuntu SMP Wed Oct 23 07:38:26 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 13.10
Release:        13.10
Codename:       saucy

Running that program produces:

os.name=Linux
os.arch=amd64
authors=<a href="/pubmed?term=Liu%20W%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Liu W</a>
<a href="/pubmed?term=Chen%20D%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Chen D</a>
authors.length=2
  author: <a href="/pubmed?term=Liu%20W%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Liu W</a>
  author: <a href="/pubmed?term=Chen%20D%5BAuthor%5D&amp;cauthor=true&amp;cauthor_uid=24312906">Chen D</a>

That seems to work, so perhaps the issue is with fileLine. Can you print out the value of the url:

System.out.println("url='" + "http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString+ "'");

Since you're not getting an exception from your code, I suspect you're getting a document, just not one your code is anticipating. Printing out the document so you can see what you've gotten back will probably help as well.
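Note that a plain println can still hide the problem: a stray control character (such as a carriage return) in pmidString is invisible in terminal output but changes the URL the server sees. A small stdlib-only helper that makes such characters explicit (the class and method names are my own):

```java
public class UrlDebug {
    // Render any non-printable character as <0xNN> so it shows up in the log
    static String reveal(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("<0x%02x>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pmidString = "24312906\r"; // simulated pmid carrying an invisible carriage return
        System.out.println(reveal("http://www.ncbi.nlm.nih.gov/pubmed/" + pmidString));
    }
}
```

Running reveal over the URL built in searchLink would immediately show whether the unix run is sending extra characters to the server.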
