简体   繁体   English

用 java jsoup 解析 instagram 不给 Elements 给出源

[英]Parsing instagram with java jsoup not give Elements gives source

I'm trying to get reels video URL with jsoup using java on Android Studio.我正在尝试在 Android Studio 上使用 java 获取带有 jsoup 的卷轴视频 URL。 I want to get Elements in inspect but code returns page source.我想在检查中获取元素,但代码返回页面源。 I use jsoup in other projects on different web pages and never encounter this situation.我在其他项目中使用 jsoup 在不同的 web 页面上,从未遇到过这种情况。 Can you tell me what ı doing wrong and how can ı get the Elements in inspect?你能告诉我我做错了什么以及如何让元素进入检查? Thank you谢谢

  public class fetchData extends AsyncTask<Void, Void, Void> {
        Document doc = null;
        String str;

        @Override
        protected void onPostExecute(Void aVoid) {
            super.onPostExecute(aVoid);
            MainActivity.textView.setText(str);
        }
    
        @Override
        protected Void doInBackground(Void... voids) {
            try {
                doc = Jsoup.connect("https://www.instagram.com/reel/CDok74FJzHp/?igshid=cam8ylb7okl7").get();
            } catch (IOException e) {
                e.printStackTrace();
            }
            str = doc.toString();
            return null;
        }
}

If you check the source of the page (inspect the video element) you'll find:如果您检查页面的来源(检查视频元素),您会发现:

<video class="tWeCl"
  playsinline="" 
  poster="https://instagram.flhr4-2.fna.fbcdn.net/v/t51.2885-15/e35/117157253_120443486171759_7332785595039685871_n.jpg?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&amp;_nc_cat=111&amp;_nc_ohc=aX7rVh9IbGoAX_lj74j&amp;oh=ba74c5c8ad97ba14c35710addd523dfd&amp;oe=5F363C59" 
  preload="none" 
  type="video/mp4" 
  src="https://instagram.flhr4-2.fna.fbcdn.net/v/t50.2886-16/117284962_313567919762486_3343704909021624596_n.mp4?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&amp;_nc_cat=102&amp;_nc_ohc=3wvoN4vNzkUAX_DLFTR&amp;oe=5F3659EF&amp;oh=7a38d593469a99239a7cb07050cc47f2">
</video>

If you then search the html for the mp4 url you'll find it in one of the javascript html tags... it is delivered as a json value. If you then search the html for the mp4 url you'll find it in one of the javascript html tags... it is delivered as a json value. So by breaking up the javascript text on the " = " and taking the latter half, you get the raw json which can then be parsed for the "video_url" using JayWay's JsonPath.read method.因此,通过分解" = "上的 javascript 文本并取后半部分,您将获得原始 json,然后可以使用 JayWay 的JsonPath.read方法将其解析为"video_url"

It would seem the video tag is therefore generated in the html by the javascript as it doesn't appear possible to filter the html for any < video > elements.因此,似乎视频标签是由 javascript 在 html 中生成的,因为似乎无法为任何 < video > 元素过滤 html。

import com.jayway.jsonpath.JsonPath;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Instagram {

    private final String url;

    public Instagram(String url) {
        this.url = url;
    }

    public void start() {
        Document doc = getHtmlPage(url);
        Elements videoElement = getScriptElementContainingVideoUrl(doc);

        List<String> relevantTagWithMp4Url = getSingleScriptElementWithVideoUrl(videoElement);
        String scriptInnerHtml = relevantTagWithMp4Url.get(0);

        System.out.println("Video Url: " + getVideoUrl(scriptInnerHtml));
    }

    private List<String> getSingleScriptElementWithVideoUrl(Elements scriptElements) {
        List<String> relevantTagWithMp4Url = new ArrayList<>();

        for (Element element : scriptElements) {
            if (element.data().contains("mp4")) {
                relevantTagWithMp4Url.add(element.data());
            }
        }

        return relevantTagWithMp4Url;
    }

    private Elements getScriptElementContainingVideoUrl(Document doc) {
        return doc.select("script");
    }

    private String getVideoUrl(String videoElement) {
        String jsonResponse = videoElement.split(" = ")[1];
        // $.. is equivalent to $.[*] - (a wild card matcher) - you may need to play with this
        List<String> videoUrl = JsonPath.read(jsonResponse, "$..video_url");
        return videoUrl.get(0);
    }

    private Document getHtmlPage(String url) {
        try {
            return Jsoup.connect(url).get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }


    public static void main(String[] args) {
        new Instagram("https://www.instagram.com/reel/CDok74FJzHp/?igshid=cam8ylb7okl7").start();
    }
}

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM