Extract information from another site

I want to extract the number of followers from https://www.instagram.com/bbcpersian/. I tried the following code, but it does not work.

var url = "https://www.instagram.com/bbcpersian/";
var web = new HtmlWeb();
var htmlDoc = web.Load(url);
var node = htmlDoc.DocumentNode.SelectSingleNode("/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span");
string result = node.WriteContentTo();
Console.WriteLine(result);

This throws an error (the screenshot was not preserved).

Or:

var html = @"https://www.instagram.com/bbcpersian/";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var htmlNodes = htmlDoc.DocumentNode.SelectNodes("/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span");
foreach (var node in htmlNodes)
{
    Console.WriteLine(node.InnerHtml + "  -  " + node.Attributes["title"].Value);
}

This also throws an error (the screenshot was not preserved).

Did you check the HTML structure in "view source"?

The actual HTML under /html/body/div[1] is shown below. The content you see on the page is loaded dynamically, so those elements are not present in the HTML document you are downloading. You will need a different approach.

<div id="react-root">

    <span><svg width="50" height="50" viewBox="0 0 50 50" style="position:absolute;top:50%;left:50%;margin:-25px 0 0 -25px;fill:#c7c7c7"><path d="M25 1c-6.52 0-7.34.03-9.9.14-2.55.12-4.3.53-5.82 1.12a11.76 11.76 0 0 0-4.25 2.77 11.76 11.76 0 0 0-2.77 4.25c-.6 1.52-1 3.27-1.12 5.82C1.03 17.66 1 18.48 1 25c0 6.5.03 7.33.14 9.88.12 2.56.53 4.3 1.12 5.83a11.76 11.76 0 0 0 2.77 4.25 11.76 11.76 0 0 0 4.25 2.77c1.52.59 3.27 1 5.82 1.11 2.56.12 3.38.14 9.9.14 6.5 0 7.33-.02 9.88-.14 2.56-.12 4.3-.52 5.83-1.11a11.76 11.76 0 0 0 4.25-2.77 11.76 11.76 0 0 0 2.77-4.25c.59-1.53 1-3.27 1.11-5.83.12-2.55.14-3.37.14-9.89 0-6.51-.02-7.33-.14-9.89-.12-2.55-.52-4.3-1.11-5.82a11.76 11.76 0 0 0-2.77-4.25 11.76 11.76 0 0 0-4.25-2.77c-1.53-.6-3.27-1-5.83-1.12A170.2 170.2 0 0 0 25 1zm0 4.32c6.4 0 7.16.03 9.69.14 2.34.11 3.6.5 4.45.83 1.12.43 1.92.95 2.76 1.8a7.43 7.43 0 0 1 1.8 2.75c.32.85.72 2.12.82 4.46.12 2.53.14 3.29.14 9.7 0 6.4-.02 7.16-.14 9.69-.1 2.34-.5 3.6-.82 4.45a7.43 7.43 0 0 1-1.8 2.76 7.43 7.43 0 0 1-2.76 1.8c-.84.32-2.11.72-4.45.82-2.53.12-3.3.14-9.7.14-6.4 0-7.16-.02-9.7-.14-2.33-.1-3.6-.5-4.45-.82a7.43 7.43 0 0 1-2.76-1.8 7.43 7.43 0 0 1-1.8-2.76c-.32-.84-.71-2.11-.82-4.45a166.5 166.5 0 0 1-.14-9.7c0-6.4.03-7.16.14-9.7.11-2.33.5-3.6.83-4.45a7.43 7.43 0 0 1 1.8-2.76 7.43 7.43 0 0 1 2.75-1.8c.85-.32 2.12-.71 4.46-.82 2.53-.11 3.29-.14 9.7-.14zm0 7.35a12.32 12.32 0 1 0 0 24.64 12.32 12.32 0 0 0 0-24.64zM25 33a8 8 0 1 1 0-16 8 8 0 0 1 0 16zm15.68-20.8a2.88 2.88 0 1 0-5.76 0 2.88 2.88 0 0 0 5.76 0z"/></svg></span>

</div>

You can use a regular expression to find the span that contains the follower count.

/<a class="-nal3 " href="\/[a-zA-Z0-9_.]+\/followers\/"><span class="g47SY " title="([0-9.,]+)">[^<]*<\/span>/m

I used Selenium to crawl a site and extract images as shown below. It may be useful for you:

using System;
using System.Linq;
using HtmlAgilityPack;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium.Remote;
using OpenQA.Selenium.Support.UI;

// Configuration.Developer.* are settings from my own project; substitute your own values.
var firefoxOptions = new FirefoxOptions
{
    LogLevel = FirefoxDriverLogLevel.Debug,
    BrowserExecutableLocation = Configuration.Developer.SeleniumBrowserExecutableLocation
};
firefoxOptions.AddArguments("no-sandbox");
firefoxOptions.AddArguments("-headless");

IWebDriver _webDriver = new RemoteWebDriver(new Uri($"{Configuration.Developer.SeleniumRemoteUrl}"), firefoxOptions);
_webDriver.Manage().Window.Maximize();
_webDriver.Manage().Cookies.DeleteAllCookies();
_webDriver.Navigate().GoToUrl("https://www.YourSite.com/");

// Wait up to 30 seconds for the JavaScript-rendered element to become visible.
var wait = new WebDriverWait(_webDriver, TimeSpan.FromSeconds(30));
var element = wait.Until(SeleniumExtras.WaitHelpers.ExpectedConditions.ElementIsVisible(By.ClassName("jumbo-hero")));
var imageContent = element.GetAttribute("innerHTML");
_webDriver.Quit();

// Parse the rendered HTML with HtmlAgilityPack and collect the image URLs.
var doc = new HtmlDocument();
doc.LoadHtml(imageContent);
var fromSrc = doc.DocumentNode.Descendants("img")
    .Where(e => e.Attributes.Contains("src") && !string.IsNullOrWhiteSpace(e.Attributes["src"].Value))
    .Select(e => e.Attributes["src"].Value)
    .ToList();
var fromDataSrc = doc.DocumentNode.Descendants("img")
    .Where(e => e.Attributes.Contains("data-src") && !string.IsNullOrWhiteSpace(e.Attributes["data-src"].Value))
    .Select(e => e.Attributes["data-src"].Value)
    .ToList();

Instagram pages are complicated. Your xpath "/html/body/div[1]/section/main/div/header/section/ul/li[2]/a/span" doesn't work because that part of the DOM doesn't exist yet; in a web browser most of the DOM of an Instagram page is built up by a ton of JavaScript.

Note, though, that you do have this in the downloaded web page:

<meta content="6.3m Followers, 11 Following, 17.5k Posts - See Instagram photos and videos from BBC NEWS فارسی (@bbcpersian)" name="description" />

It's pretty easy to scrape this raw HTML with a regular expression:

Match m = Regex.Match(rawHTML, "\"(?<followers>.+?) Followers, (?<following>.+?) Following, (?<posts>.+?) Posts");
string result = m.Groups["followers"].Value;

Here is what your code would look like rewritten using this technique:

var url = "https://www.instagram.com/bbcpersian/";
var web = new HtmlWeb();
var htmlDoc = web.Load(url);
string rawHTML = htmlDoc.Text;
Match m = Regex.Match(rawHTML, "\"(?<followers>.+?) Followers, (?<following>.+?) Following, (?<posts>.+?) Posts");
string result = m.Groups["followers"].Value;
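One caveat with the snippet above: if the meta tag is missing or its wording changes, the match fails and m.Groups["followers"].Value is simply an empty string, so the failure is silent. Below is a minimal sketch of the same regular expression with an explicit Match.Success check. The class name FollowerParser and the method TryExtractFollowers are hypothetical names chosen for illustration, and the sketch assumes the description text still has the "N Followers, N Following, N Posts" shape quoted earlier.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of HtmlAgilityPack or any Instagram API.
public static class FollowerParser
{
    // Same pattern as in the answer above: it relies on the wording of the
    // <meta name="description"> content and will stop matching if that changes.
    private static readonly Regex Counts = new Regex(
        "\"(?<followers>.+?) Followers, (?<following>.+?) Following, (?<posts>.+?) Posts");

    // Returns true and sets 'followers' (e.g. "6.3m") when the pattern is found;
    // returns false and sets 'followers' to an empty string otherwise.
    public static bool TryExtractFollowers(string rawHtml, out string followers)
    {
        Match m = Counts.Match(rawHtml ?? string.Empty);
        followers = m.Success ? m.Groups["followers"].Value : string.Empty;
        return m.Success;
    }
}
```

It could be called as FollowerParser.TryExtractFollowers(htmlDoc.Text, out string result), printing result only when the call returns true. Note that the value is the abbreviated text from the page (such as "6.3m"), not an exact number.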

