
Web Scraping JavaScript Sites with HtmlAgilityPack

Let me start out by saying that I am no pro at web scraping. I can do the basics on most platforms, but that's about it.

I am trying to create the foundation for a web application that helps users reinforce their language learning by generating additional data and metrics, as well as new tools for self-testing. The Duolingo website does not offer any sort of API, so my next thought for now is just to scrape https://www.duome.eu/ . I wrote a quick little scraper but didn't realize that the site relies on JavaScript. In the following example, I want to collect all of the words from the Words tab that contain anchors:

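using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

namespace DuolingoUpdate
{
    class Program
    {
        // async Main requires C# 7.1 or later
        static async Task Main(string[] args)
        {
            string userName = "Podus";
            // Await the call so Main does not continue before the download finishes
            await UpdateDuolingoUser(userName);
            Console.ReadLine();
        }

        // Returns Task rather than void so the caller can await it and observe exceptions
        private static async Task UpdateDuolingoUser(string userName)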
        {
            string url = "https://www.duome.eu/" + userName + "/progress/";

            // Create the http client connection
            HttpClient httpClient = new HttpClient();
            var html = await httpClient.GetStringAsync(url);

            // Store the html client data in an object
            HtmlDocument htmlDocument = new HtmlDocument();
            htmlDocument.LoadHtml(html);

            //var words = htmlDocument.DocumentNode.Descendants("div")
            //    .Where(node => node.GetAttributeValue("id", "")
            //    .Equals("words")).ToList();

            //var wordList = words[0].Descendants("a")
            //    .Where(node => node.GetAttributeValue("class", "")
            //    .Contains("wA")).ToList();

            Console.WriteLine(html);
        }
    }
}

The html string returned by the code above contains:

<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="google" value="notranslate">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Duolingo · Podus @ duome.eu</title>
<link rel="stylesheet" href="/style.css?1548418871" />
<link href="/favicon.ico" rel="shortcut icon" type="image/x-icon" />
<script src="//code.jquery.com/jquery-3.3.1.min.js"></script>
<script type="text/javascript">
    $(document).ready(function() {
        if("".length==0){
            var visitortime = new Date();
            var visitortimezone = "GMT " + -visitortime.getTimezoneOffset()/60;
            //localStorage.tz = visitortimezone;
            //timezone = Date.parse(localStorage.tz);
            //timezone = localStorage.tz;
            //console.log(timezone);
            $.ajax({
                type: "GET",
                url: "/tz.php",
                data: 'time='+ visitortimezone,
                success: function(){
                    location.reload();
                }
            });
        }
    });
</script>

</head>
<body>
<noscript>Click <a href="https://duome.eu//Podus/progress/">here</a> to adjsut XP charts to your local timezone. </noscript>
<!-- Yandex.Metrika counter --> <script type="text/javascript" > (function (d, w, c) { (w[c] = w[c] || []).push(function() { try { w.yaCounter47765476 = new Ya.Metrika({ id:47765476, clickmap:true, trackLinks:true, accurateTrackBounce:true }); } catch(e) { } }); var n = d.getElementsByTagName("script")[0], s = d.createElement("script"), f = function () { n.parentNode.insertBefore(s, n); }; s.type = "text/javascript"; s.async = true; s.src = "https://mc.yandex.ru/metrika/watch.js"; if (w.opera == "[object Opera]") { d.addEventListener("DOMContentLoaded", f, false); } else { f(); } })(document, window, "yandex_metrika_callbacks"); </script> <noscript><div><img src="https://mc.yandex.ru/watch/47765476" style="position:absolute; left:-9999px;" alt="" /></div></noscript> <!-- /Yandex.Metrika counter -->
</body>
</html>

But if you go to the actual URL https://www.duome.eu/Podus/progress/ , the site contains a ton of script. So upon inspection, the first problem is that I am not getting the HTML that I see in the browser. The second problem is that if you view source, it's nothing like what is in the inspector, and I don't see anything in the source that would let me isolate the data from div id="words" .

Given my lackluster knowledge of JavaScript-built web pages, how do I do this, or is it even possible?

You can access Duolingo profile data in JSON format via https://www.duolingo.com/users/<username>

e.g. https://www.duolingo.com/users/Podus
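
A minimal sketch of reading that endpoint from C#, in case it helps. It only lists the top-level property names of whatever JSON comes back, because I am not assuming anything about the schema. The class and method names (DuolingoProfile, BuildProfileUrl, ListTopLevelKeys) are made up for illustration, and it assumes System.Text.Json is available (built in from .NET Core 3.0, a NuGet package on .NET Framework) and that the endpoint answers without a login.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

namespace DuolingoUpdate
{
    // Hypothetical helper class, not part of any library
    static class DuolingoProfile
    {
        // Builds the profile URL from the answer for a given user name
        public static string BuildProfileUrl(string userName)
        {
            return "https://www.duolingo.com/users/" + Uri.EscapeDataString(userName);
        }

        // Returns the top-level property names of a JSON object,
        // without assuming anything about the profile's actual schema
        public static List<string> ListTopLevelKeys(string json)
        {
            var keys = new List<string>();
            using (JsonDocument document = JsonDocument.Parse(json))
            {
                if (document.RootElement.ValueKind != JsonValueKind.Object)
                {
                    return keys;
                }

                foreach (JsonProperty property in document.RootElement.EnumerateObject())
                {
                    keys.Add(property.Name);
                }
            }
            return keys;
        }

        public static async Task Main(string[] args)
        {
            using (var httpClient = new HttpClient())
            {
                // ASSUMPTION: the endpoint returns JSON to an anonymous request
                string json = await httpClient.GetStringAsync(BuildProfileUrl("Podus"));
                foreach (string key in ListTopLevelKeys(json))
                {
                    Console.WriteLine(key);
                }
            }
        }
    }
}
```

Once you can see which keys are present, you can drill into the ones that hold the word data with JsonElement.GetProperty instead of parsing HTML.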

This should be much easier than trying to scrape the duome profile page manually.
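
If you still want the duome page itself, the HTML you got back hints at what is going on: its script sends a GET to /tz.php with time=GMT <offset> and then reloads the page. Below is a rough sketch that imitates that sequence with a single HttpClient, which keeps cookies between requests by default. It rests on an assumption I have not verified, namely that the server remembers the timezone in a cookie-backed session and serves the full page on the next request. DuomeSession, BuildTimezoneUrl and GetProgressHtmlAsync are hypothetical names, and whole-hour offsets are a simplification.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

namespace DuolingoUpdate
{
    // Hypothetical helper class, not part of any library
    static class DuomeSession
    {
        // Mirrors the page script: 'time=' + "GMT " + (-getTimezoneOffset() / 60)
        public static string BuildTimezoneUrl(int utcOffsetHours)
        {
            string value = "GMT " + utcOffsetHours;
            return "https://www.duome.eu/tz.php?time=" + Uri.EscapeDataString(value);
        }

        public static async Task<string> GetProgressHtmlAsync(HttpClient httpClient, string userName)
        {
            string url = "https://www.duome.eu/" + userName + "/progress/";

            // First request returns the small stub page.
            // ASSUMPTION: it also hands out a session cookie.
            await httpClient.GetStringAsync(url);

            // Replay the AJAX call that the stub page's script would make
            await httpClient.GetStringAsync(BuildTimezoneUrl(0));

            // Equivalent of location.reload().
            // ASSUMPTION: the server now returns the full page.
            return await httpClient.GetStringAsync(url);
        }

        public static async Task Main(string[] args)
        {
            // The default handler stores cookies and sends them on later requests
            using (var httpClient = new HttpClient())
            {
                string html = await GetProgressHtmlAsync(httpClient, "Podus");
                Console.WriteLine(html.Length);
            }
        }
    }
}
```

If the second response is still the stub, the remaining content is probably built in the browser, and HtmlAgilityPack alone will not see it because it does not execute scripts.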

