Pull data from html javascript nodejs?

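I am creating a CLI for Google search for myself, and since I have been using Node.js for a while with my chat bot, I would like to make it work with Node.js. I can pull the data just fine, and I end up with a string containing all of the HTML from the page. It is even easy enough to pick out the part of the HTML that holds the results I want. Here is the important bit: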

<div class="jd"><a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" ><b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b></a> </div> <div class="kd">3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. </div> <div class="qdlmxn"><a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> </div><span class="c">leagueoflegends.com/</span> -  <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');"> <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu"><a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> <br/><a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld"> <div class="jd"><a class="p" 
href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" ><b>LOL</b> - Wikipedia, the free encyclopedia</a> </div> <div class="kd"><b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… </div><span class="c">en.wikipedia.org/wiki/LOL</span> -  <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> <div class="wx4xyp" id="web_result_popup_30597472"> <div class="vfc7iu"><a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> <br/><a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld">

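The results are the .jd elements, so I first need to separate them from each other and then separate the URL and the description within each one. I have never done such extreme string manipulation, so I have no idea where to begin.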

Here is a more readable version of the HTML, although what I am actually dealing with is one long string.

<div>
  <div class="r ld">
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" >
        <b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b>
      </a>
    </div>
    <div class="kd">
      3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. 
    </div>
    <div class="qdlmxn">
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> 
    </div>
    <span class="c">leagueoflegends.com/</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');">
      <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu">
        <a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> 
        <br/>
        <a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> 
        <br/>
        <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> 
      </div> 
    </div>
    <a class="s" href="javascript:void(0)" >Options</a> 
    <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div> 
<div> 
  <div class="r ld"> 
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" >
        <b>LOL</b> - Wikipedia, the free encyclopedia
      </a>
    </div>
    <div class="kd">
      <b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… 
    </div>
    <span class="c">en.wikipedia.org/wiki/LOL</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> 
      <div class="wx4xyp" id="web_result_popup_30597472"> 
        <div class="vfc7iu">
          <a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> 
          <br/>
          <a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> 
          <br/>
          <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> 
        </div> 
      </div>
      <a class="s" href="javascript:void(0)" >Options</a> 
      <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div>

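I was in the same situation, but with XML, so I wrote this HTML/XML parser for my own needs.

I tried to make it resemble the browser's DOM manipulation model, so most things work the same way.

First, the Node class. Just copy and paste it into your js file, and don't forget to declare "use strict"; at the beginning of the file.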

class Node {
    constructor(nodeName, nodeType) {
        this.nodeName = nodeName;

        this.nodeType = nodeType;
        this.attributes = {};
        this.childNodes = [];
        this.parentNode = null;


    }

    removeChild(node) {
        if (node.parentNode != null) {
            for (var i = 0; i < this.childNodes.length; i++) {
                if (node == this.childNodes[i]) {
                    this.childNodes.splice(i, 1);
                    node.parentNode = null;
                }
            }
        }
    }

    appendChild(child) {
        if (child.parentNode == null) {
            this.childNodes.push(child);
            child.parentNode = this;

        } else {
            child.parentNode.removeChild(child);
            this.childNodes.push(child);
            child.parentNode = this;

        }
    }

    returnMyChildNodes() {
        return this.childNodes;
    }

    returnElementCollection() {
        var array = [];
        array.push(this);
        for (var i = 0; i < this.childNodes.length; i++) {
            var tmparray = [];
            tmparray = this.childNodes[i].returnElementCollection();
            array = array.concat(tmparray);
        }

        return array;
    }

    getElementsByAttributeValue(attribute, value) {
        // Collect this node and all of its descendants, then keep the matches
        var matchedElements = [];
        var Elements = this.returnElementCollection();
        for (var i = 0; i < Elements.length; i++) {
            if (typeof Elements[i].attributes[attribute] != "undefined") {
                if (Elements[i].attributes[attribute] == value) {
                    matchedElements.push(Elements[i]);
                }
            }
        }

        return matchedElements;
    }

}

This Node object behaves like an HTML node, so here we declare two more classes that extend it.

class Html_Node extends Node {
    constructor(name) {
        super(name, "HTML_ELEMENT");
    }

    toString() {

    }
}

class Xml_Node extends Node {
    constructor(name) {
        super(name, "XML_ELEMENT");

        this.innerText = "";
    }

}
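
Html_Node.toString() is left empty above. One possible way to fill it in is sketched here as a standalone function, so it runs without the classes. It assumes nodes shaped like the ones above (nodeName, attributes, childNodes, optional innerText); the name serializeNode is made up for the example, and special characters in attribute values or text are not escaped.

```javascript
"use strict";

// Hypothetical helper: turn a node-shaped object back into markup.
function serializeNode(node) {
    var attrs = Object.keys(node.attributes).map(function (name) {
        return " " + name + '="' + node.attributes[name] + '"';
    }).join("");
    var children = node.childNodes.map(serializeNode).join("");
    var text = node.innerText || "";
    return "<" + node.nodeName + attrs + ">" + text + children + "</" + node.nodeName + ">";
}

// Hand-built tree standing in for nodes created by the parser
var tree = {
    nodeName: "div",
    attributes: { "class": "jd" },
    childNodes: [{
        nodeName: "a",
        attributes: { href: "http://example.com/" },
        childNodes: [],
        innerText: "Example"
    }]
};
console.log(serializeNode(tree));
// <div class="jd"><a href="http://example.com/">Example</a></div>
```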

Once we have all the classes, we come to the hardest part: reading the document and building the nodes into a single document tree.

class XML_Reader {
    constructor() {
        this.rawContents = "";
        this.Document = null;
    }

    loadXML(documentPath) {
        if (documentPath != null && documentPath != "") {
            var fs = require("fs");
            var fc = fs.readFileSync(documentPath, {
                encoding: "utf-8"
            });
            if (typeof fc != "undefined" && fc != null) {
                this.rawContents = fc;
            } else {
                this.rawContents = null;
            }

            delete require.cache[require.resolve("fs")];
        } else {
            this.rawContents = null;
        }


    }

    processXML() {

        var XML_DOC = new Xml_Node("root");
        var rawElements = [];
        var TagStart_index = 0;
        var TagEnd_index = 0;

        var innerContent_Start = 0;
        var innerContent_End = 0;
        for (var i = 0; i < this.rawContents.length; i++) {
            // get starting tags
            if (this.rawContents[i] == "<") {
                TagStart_index = i;
                innerContent_End = i - 1;

                var innerContent = "";
                if (innerContent_End > innerContent_Start) {
                    for (var n = innerContent_Start; n <= innerContent_End; n++) {
                        innerContent += this.rawContents[n];
                    }

                    if (/\S/.test(innerContent)) {
                        // do smth with innerContent of tag
                        rawElements.push(innerContent);
                    }
                }


            } else if (this.rawContents[i] == ">") {

                TagEnd_index = i;
                innerContent_Start = i + 1;
                var contents = "";

                for (var n = TagStart_index; n <= TagEnd_index; n++) {
                    contents += this.rawContents[n];
                }

                rawElements.push(contents);
            }

        }

        var currentParent = XML_DOC;
        for (var i = 0; i < rawElements.length; i++) {
            if (/>/.test(rawElements[i]) && /</.test(rawElements[i])) {
                if (rawElements[i].indexOf("/") == 1) {
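                    // Ignore stray closing tags so currentParent never becomes null
                    if (currentParent.parentNode != null) currentParent = currentParent.parentNode;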

                } else {
                    var str = rawElements[i];
                    str = str.replace("<", "");
                    str = str.replace(">", "");
                    var IgnoreSpace = false;
                    var tempString = "";
                    var InnerNodeContents = [];
                    var wordIndex = 0;
                    for (var n = 0; n < str.length; n++) {
                        if (!IgnoreSpace) {

                            if (str[n] == "/") {
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            }
                            if (n + 1 == str.length) {
                                tempString +=
                                    str[n];
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            } else if (!/\S/.test(str[n])) {
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            } else {
                                tempString += str[n];
                            }

                            if (str[n] == '"') IgnoreSpace = true;

                        } else {
                            // Inside a quoted attribute value: keep every character,
                            // including "/" and spaces, so URLs stay in one piece
                            if (n + 1 == str.length) {
                                tempString += str[n];
                                InnerNodeContents[wordIndex] = tempString;
                                tempString = "";
                                wordIndex++;
                            } else {
                                tempString += str[n];
                            }
                            if (str[n] == '"') IgnoreSpace = false;

                        }
                    }

                    var node = new Xml_Node(InnerNodeContents[0]);

                    // add attributes
                    var switchParent = false;
                    var lastIndex = InnerNodeContents.length;

                    // A trailing "/" marks a self-closing tag such as <br/>
                    if (InnerNodeContents[InnerNodeContents.length - 1] == "/") {
                        switchParent = true;
                        lastIndex = InnerNodeContents.length - 1;
                    }

                    for (var n = 1; n < lastIndex; n++) {
                        var token = InnerNodeContents[n].trim();
                        var eq = token.indexOf("=");
                        // Skip empty tokens and attributes without a value
                        if (eq < 1) continue;
                        // Split on the first "=" only; values such as URLs may contain more
                        node.attributes[token.slice(0, eq)] = token.slice(eq + 1).replaceAll('"', "");
                    }

                    currentParent.appendChild(node);
                    if (!switchParent) currentParent = node;


                }
            } else {
                currentParent.innerText = rawElements[i];
            }

        }

        this.Document = XML_DOC;

    }

}
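
The first pass of processXML, which cuts the raw string into tags and text runs, can also be sketched with a single regular expression. This is a minimal illustration rather than a replacement for the class above. The function name splitMarkup is made up for the example, and like the original loop it assumes that no < or > appears inside attribute values.

```javascript
"use strict";

// Hypothetical helper: split markup into tag tokens ("<...>") and text tokens,
// dropping text runs that contain only whitespace.
function splitMarkup(markup) {
    var tokens = markup.match(/<[^>]*>|[^<]+/g) || [];
    return tokens.filter(function (token) {
        return /\S/.test(token);
    });
}

var sample = '<div class="jd"> <a class="p" href="/m/url?q=http://example.com/" ><b>LOL</b> - Wikipedia</a> </div>';
var tokens = splitMarkup(sample);
console.log(tokens);
```

Each token is either a complete tag or the text between two tags, which is the same shape as the rawElements array that the second loop of processXML walks through.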

Then all we have to do is:

var xr = new XML_Reader();
xr.loadXML("Path to HTML file");
xr.processXML();

var Elements = xr.Document.getElementsByAttributeValue("class", "jd"); 

Now the Elements variable holds every single element with the class jd.

Then use a for loop over them to read the href attribute and get each URL:

var myUrl = Elements[0].attributes.href;

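I made this script for myself; feel free to use it :)

One more thing: to get to the children of the div with class jd, you will need to take the div, search its child nodes for the node name a, and read that node's href attribute.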
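
A sketch of that last step, assuming each result node has the shape produced by the Node class above (nodeName, attributes, childNodes). The tree below is hand-built so the example runs on its own. The helper names findChildByName and targetUrl are hypothetical, and the base URL passed to the URL constructor is only a placeholder needed to parse a relative href. Because the href values on the results page are redirect links, the sketch also reads the real address from their q parameter.

```javascript
"use strict";

// Return the first direct child whose nodeName matches, or null.
function findChildByName(node, name) {
    for (var i = 0; i < node.childNodes.length; i++) {
        if (node.childNodes[i].nodeName == name) return node.childNodes[i];
    }
    return null;
}

// Turn "/m/url?ei=...&amp;q=http://site/&amp;ved=..." into "http://site/".
function targetUrl(href) {
    var decoded = href.replace(/&amp;/g, "&");           // undo HTML entity encoding
    var parsed = new URL(decoded, "http://localhost/");  // placeholder base for relative hrefs
    return parsed.searchParams.get("q") || decoded;      // fall back to the href itself
}

// Hand-built stand-in for the result of getElementsByAttributeValue("class", "jd")
var Elements = [{
    nodeName: "div",
    attributes: { "class": "jd" },
    childNodes: [{
        nodeName: "a",
        attributes: {
            "class": "p",
            href: "/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB"
        },
        childNodes: []
    }]
}];

var urls = [];
for (var i = 0; i < Elements.length; i++) {
    var link = findChildByName(Elements[i], "a");
    if (link != null && link.attributes.href) {
        urls.push(targetUrl(link.attributes.href));
    }
}
console.log(urls);
```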

魯迅 led me back to the Google API, and I found out how to search the full web rather than only the included sites here: http://support.google.com/customsearch/bin/answer.py?hl=zh_CN&answer=1210656

To create a search engine that searches the entire web:

From the Google Custom Search homepage, click Create a Custom Search Engine.
Type a name and description for your search engine.
Under Define your search engine, in the Sites to Search box, enter at least one valid URL (e.g. www.google.com).
Select the CSE edition you want and accept the Terms of Service, then click Next. Select the layout option you want, and then click Next.
Click any of the links under the Next steps section to navigate to your Control panel.
In the left-hand menu, under Control Panel, click Basics.
In the Search Preferences section, select Search the entire web but emphasize included sites.
Click Save Changes.
In the left-hand menu, under Control Panel, click Sites.
Delete the site you entered during the initial setup process.
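
Once the engine exists, the results can be requested as JSON instead of being scraped from HTML. The sketch below assumes the Custom Search JSON API endpoint https://www.googleapis.com/customsearch/v1 with its key, cx and q parameters, a Node.js version that provides a global fetch, and placeholder credentials that you would replace with your own API key and search engine ID. The function names are made up for the example.

```javascript
"use strict";

var ENDPOINT = "https://www.googleapis.com/customsearch/v1";

// Build the request URL; apiKey and engineId are placeholders for your own values.
function buildSearchUrl(apiKey, engineId, query) {
    var params = new URLSearchParams({ key: apiKey, cx: engineId, q: query });
    return ENDPOINT + "?" + params.toString();
}

// Reduce the JSON response to the fields a CLI would print.
function toResults(body) {
    return (body.items || []).map(function (item) {
        return { title: item.title, url: item.link, description: item.snippet };
    });
}

// Assumes Node.js 18+ where fetch is available globally.
async function search(apiKey, engineId, query) {
    var response = await fetch(buildSearchUrl(apiKey, engineId, query));
    if (!response.ok) throw new Error("Search failed with HTTP " + response.status);
    return toResults(await response.json());
}

// Usage (needs real credentials, so it is left commented out):
// search("YOUR_API_KEY", "YOUR_ENGINE_ID", "lol").then(console.log);
```

Each entry in the returned array carries the title, URL and description that the question was trying to cut out of the page by hand.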
