Pull data from html javascript nodejs?

I'm writing a CLI for Google searches for myself. I've been using Node.js for a while on a chatbot, so I'd like to get this working with Node.js too. I can pull the data just fine and end up with a string containing all of the page's HTML. It's even easy to pick out in the HTML which parts are the results I want:

<div class="jd"><a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" ><b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b></a> </div> <div class="kd">3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. </div> <div class="qdlmxn"><a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> </div><span class="c">leagueoflegends.com/</span> -  <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');"> <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu"><a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> <br/><a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld"> <div class="jd"><a class="p" 
href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" ><b>LOL</b> - Wikipedia, the free encyclopedia</a> </div> <div class="kd"><b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… </div><span class="c">en.wikipedia.org/wiki/LOL</span> -  <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> <div class="wx4xyp" id="web_result_popup_30597472"> <div class="vfc7iu"><a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> <br/><a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> <br/><a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> </div> </div><a class="s" href="javascript:void(0)" >Options</a> <div class="m6u8fq"> </div> </div> </div> </div> <div> <div class="r ld">

Anything that is .jd is a result, so I first need to separate those out, then work on separating out the URLs and the descriptions too. I've never done string manipulation to this extreme, so I have no idea where to start.

Here's the HTML in a more readable format. Keep in mind, though, that I'm dealing with just one long string.

<div>
  <div class="r ld">
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw" >
        <b>League of Legends</b> - Free Online Game | <b>LoL</b> - <b>League of Legends</b>
      </a>
    </div>
    <div class="kd">
      3 days ago&nbsp;… Official website for <b>League of Legends</b>. Join millions of players in an award   winning Multiplayer Online Battle Arena. 
    </div>
    <div class="qdlmxn">
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/board&amp;ved=0CBEQ0gIoADAA&amp;usg=AFQjCNHpmmAdFFbTgm8C_gJvsjVhMzVKUQ" >Community</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://signup.leagueoflegends.com/en/signup/redownload&amp;ved=0CBIQ0gIoATAA&amp;usg=AFQjCNFHGUtn4ItgQIzODgIZRv_237Mq0A" >PVP.NET</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/board/forumdisplay.php?f%3D2&amp;ved=0CBMQ0gIoAjAA&amp;usg=AFQjCNHpycJ8WGh7xvWw1qNu8NjjU1EA0Q" >General Discussion</a> - 
      <a class="gg" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://na.leagueoflegends.com/champions&amp;ved=0CBQQ0gIoAzAA&amp;usg=AFQjCNEpeBzNefwag5xmkFcFhCW27FoAew" >Champions</a> 
    </div>
    <span class="c">leagueoflegends.com/</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_10836585','inline');">
      <div class="wx4xyp" id="web_result_popup_10836585"> <div class="vfc7iu">
        <a class="s" href="/search?q=cache:GCRD1wy5e3QJ:leagueoflegends.com/" >Cached</a> 
        <br/>
        <a class="s" href="/m/?q=related:leagueoflegends.com/&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBYQHzAA" >Similar</a> 
        <br/>
        <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://leagueoflegends.com/" >Mobile formatted</a> 
      </div> 
    </div>
    <a class="s" href="javascript:void(0)" >Options</a> 
    <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div> 
<div> 
  <div class="r ld"> 
    <div class="jd">
      <a class="p" href="/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://en.wikipedia.org/wiki/LOL&amp;ved=0CBgQFjAB&amp;usg=AFQjCNFOhgg5Y2E5SFuS5I-8830OJ9VR9Q" >
        <b>LOL</b> - Wikipedia, the free encyclopedia
      </a>
    </div>
    <div class="kd">
      <b>LOL</b>, an abbreviation for <b>laughing out loud</b>, or <b>laugh out loud</b>, is a common   element of Internet slang. It was used historically on Usenet&nbsp;… 
    </div>
    <span class="c">en.wikipedia.org/wiki/LOL</span> -  
    <div class="txnles" onclick="_popup('web_result_popup_30597472','inline');"> 
      <div class="wx4xyp" id="web_result_popup_30597472"> 
        <div class="vfc7iu">
          <a class="s" href="/search?q=cache:mhIpOeXQp38J:en.wikipedia.org/wiki/LOL" >Cached</a> 
          <br/>
          <a class="s" href="/m/?q=related:en.wikipedia.org/wiki/LOL&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;ved=0CBkQHzAB" >Similar</a> 
          <br/>
          <a class="s" href="/gwt/x?q=lol&amp;ei=ZtZnT8CTOMy48AbDzgE&amp;hl=en&amp;source=m&amp;u=http://en.wikipedia.org/wiki/LOL" >Mobile formatted</a> 
        </div> 
      </div>
      <a class="s" href="javascript:void(0)" >Options</a> 
      <div class="m6u8fq"> </div> 
    </div> 
  </div> 
</div>

I was in much the same situation, but with XML, and I wrote this HTML/XML parser for my own needs.

I tried to make it resemble the browser's model of DOM manipulation, so most things work the same way.

First, the Node class. Just copy and paste it into your JS file. Don't forget to declare "use strict"; at the beginning of the file.

class Node {
    constructor(nodeName, nodeType) {
        this.nodeName = nodeName;

        this.nodeType = nodeType;
        this.attributes = {};
        this.childNodes = [];
        this.parentNode = null;


    }

    removeChild(node) {
        if (node.parentNode != null) {
            for (var i = 0; i < this.childNodes.length; i++) {
                if (node == this.childNodes[i]) {
                    this.childNodes.splice(i, 1);
                    node.parentNode = null;
                }
            }
        }
    }

    appendChild(child) {
        if (child.parentNode == null) {
            this.childNodes.push(child);
            child.parentNode = this;

        } else {
            child.parentNode.removeChild(child);
            this.childNodes.push(child);
            child.parentNode = this;

        }
    }

    returnMyChildNodes() {
        return this.childNodes;
    }

    returnElementCollection() {
        var array = [];
        array.push(this);
        for (var i = 0; i < this.childNodes.length; i++) {
            var tmparray = [];
            tmparray = this.childNodes[i].returnElementCollection();
            array = array.concat(tmparray);
        }

        return array;
    }

    // Returns every node in this subtree (this node included) whose
    // attribute is exactly equal to value.
    getElementsByAttributeValue(attribute, value) {
        var matchedElements = [];
        var Elements = this.returnElementCollection();
        for (var i = 0; i < Elements.length; i++) {
            if (typeof Elements[i].attributes[attribute] != "undefined") {
                if (Elements[i].attributes[attribute] == value) {
                    matchedElements.push(Elements[i]);
                }
            }
        }

        return matchedElements;
    }

}
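The attribute lookup in the Node class compares the whole attribute value, so a search for "jd" finds class="jd" but a search for "r" would not find class="r ld". A minimal sketch of a per-class-name check, assuming the same attributes object shape as above; hasClass is a hypothetical helper and not part of the answer's code:

```javascript
"use strict";

// Hypothetical helper (not part of the Node class above): true when the
// space-separated class attribute contains the given class name.
function hasClass(attributes, className) {
    if (typeof attributes["class"] != "string") return false;
    return attributes["class"].split(/\s+/).indexOf(className) != -1;
}

console.log(hasClass({ "class": "r ld" }, "ld")); // true
console.log(hasClass({ "class": "r ld" }, "jd")); // false
console.log(hasClass({ href: "/x" }, "jd"));      // false
```

A check like this could replace the == comparison inside the lookup if matching on one of several class names is needed.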

This Node object acts like an HTML node. Next we declare two subclasses, one for HTML elements and one for XML elements.

class Html_Node extends Node {
    constructor(name) {
        super(name, "HTML_ELEMENT");
    }

    toString() {

    }
}

class Xml_Node extends Node {
    constructor(name) {
        super(name, "XML_ELEMENT");

        this.innerText = "";
    }

}

Now that we have both classes, we get to the hard part: reading the document and building our nodes into a single document tree.

class XML_Reader {
    constructor() {
        this.rawContents = "";
        this.Document = null;
    }

    loadXML(documentPath) {
        if (documentPath != null && documentPath != "") {
            var fs = require("fs");
            var fc = fs.readFileSync(documentPath, {
                encoding: "utf-8"
            });
            if (typeof fc != "undefined" && fc != null) {
                this.rawContents = fc;
            } else {
                this.rawContents = null;
            }

            delete require.cache[require.resolve("fs")];
        } else {
            this.rawContents = null;
        }


    }

    processXML() {

        var XML_DOC = new Xml_Node("root");
        var rawElements = [];
        var TagStart_index = 0;
        var TagEnd_index = 0;

        var innerContent_Start = 0;
        var innerContent_End = 0;
        for (var i = 0; i < this.rawContents.length; i++) {
            // get starting tags
            if (this.rawContents[i] == "<") {
                TagStart_index = i;
                innerContent_End = i - 1;

                var innerContent = "";
                // >= so that a one-character text run is not dropped
                if (innerContent_End >= innerContent_Start) {
                    for (var n = innerContent_Start; n <= innerContent_End; n++) {
                        innerContent += this.rawContents[n];
                    }

                    if (/\S/.test(innerContent)) {
                        // do smth with innerContent of tag
                        rawElements.push(innerContent);
                    }
                }


            } else if (this.rawContents[i] == ">") {

                TagEnd_index = i;
                innerContent_Start = i + 1;
                var contents = "";

                for (var n = TagStart_index; n <= TagEnd_index; n++) {
                    contents += this.rawContents[n];
                }

                rawElements.push(contents);
            }

        }

        var currentParent = XML_DOC;
        for (var i = 0; i < rawElements.length; i++) {
            if (/>/.test(rawElements[i]) && /</.test(rawElements[i])) {
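                if (rawElements[i].indexOf("/") == 1) {
                    // closing tag: step back up to the parent, but never
                    // above the root (a fragment can carry stray closing tags)
                    if (currentParent.parentNode != null) {
                        currentParent = currentParent.parentNode;
                    }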

                } else {
                    // Strip the angle brackets, then note and strip a
                    // trailing "/" (self-closing tag such as <br/>).
                    var str = rawElements[i].slice(1, -1).trim();
                    var switchParent = false;
                    if (str.endsWith("/")) {
                        switchParent = true;
                        str = str.slice(0, -1).trim();
                    }

                    // Split on whitespace that is outside double quotes, so a
                    // quoted value may contain spaces, "/" and "=" (URLs do).
                    // Single-quoted values are not handled.
                    var InnerNodeContents = [];
                    var tempString = "";
                    var insideQuotes = false;
                    for (var n = 0; n < str.length; n++) {
                        if (str[n] == '"') insideQuotes = !insideQuotes;
                        if (!insideQuotes && !/\S/.test(str[n])) {
                            if (tempString != "") InnerNodeContents.push(tempString);
                            tempString = "";
                        } else {
                            tempString += str[n];
                        }
                    }
                    if (tempString != "") InnerNodeContents.push(tempString);

                    var node = new Xml_Node(InnerNodeContents[0]);

                    // add attributes: split each name="value" pair at the
                    // first "=" only; a bare attribute gets an empty value
                    for (var n = 1; n < InnerNodeContents.length; n++) {
                        var eq = InnerNodeContents[n].indexOf("=");
                        if (eq == -1) {
                            node.attributes[InnerNodeContents[n]] = "";
                        } else {
                            var attrName = InnerNodeContents[n].slice(0, eq);
                            var attrValue = InnerNodeContents[n].slice(eq + 1);
                            node.attributes[attrName] = attrValue.replace(/"/g, "");
                        }
                    }

                    currentParent.appendChild(node);
                    if (!switchParent) currentParent = node;


                }
            } else {
                currentParent.innerText = rawElements[i];
            }

        }

        this.Document = XML_DOC;

    }

}
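The first loop in processXML cuts the raw string into tag chunks and text chunks before any nodes are built. Roughly the same split can be sketched with one regular expression; splitIntoChunks is a hypothetical name, and like the loop above it assumes no ">" appears inside attribute values or text:

```javascript
"use strict";

// Hypothetical helper: split markup into tag chunks ("<...>") and text
// chunks, dropping whitespace-only text much as processXML does.
function splitIntoChunks(markup) {
    var chunks = markup.match(/<[^>]*>|[^<]+/g) || [];
    return chunks.filter(function (chunk) {
        return /\S/.test(chunk);
    });
}

console.log(splitIntoChunks('<div class="jd"> <a class="p" href="/x">LOL</a> </div>'));
// [ '<div class="jd">', '<a class="p" href="/x">', 'LOL', '</a>', '</div>' ]
```

The second loop then only has to decide, chunk by chunk, whether it is an opening tag, a closing tag, or text.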

Then all we need to do is:

var xr = new XML_Reader();
xr.loadXML("Path to HTML file");
xr.processXML();

var Elements = xr.Document.getElementsByAttributeValue("class", "jd"); 

Now the Elements variable holds every element with the class jd.

Then, to get the URL of each one, loop over the elements and fetch the href attribute:

var myUrl = Elements[0].attributes.href;
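In the markup from the question, each href is a redirect link whose real address sits in the q parameter, with "&" escaped as "&amp;". A minimal sketch of recovering that address, assuming Node's built-in URL class; targetFromHref is a hypothetical helper and the base URL is only there so the relative href can be parsed:

```javascript
"use strict";

// Hypothetical helper: recover the target address from a redirect href
// like the ones in the question's markup.
function targetFromHref(href) {
    // Undo the HTML escaping of "&" carried by the attribute value.
    var decoded = href.replace(/&amp;/g, "&");
    // The base is an assumption; it only makes the relative href parseable.
    var parsed = new URL(decoded, "http://www.google.com");
    return parsed.searchParams.get("q");
}

var href = "/m/url?ei=ZtZnT8CTOMy48AbDzgE&amp;q=http://leagueoflegends.com/&amp;ved=0CBAQFjAA&amp;usg=AFQjCNEEnWGHwxNnuwKenqm4ajKfTM6Xxw";
console.log(targetFromHref(href)); // prints "http://leagueoflegends.com/"
```

searchParams.get also percent-decodes the value, so an escaped sequence such as %3D comes back as "=".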

I made this script for myself, so feel free to use it :)

One more thing: the href is not on the div itself. To get the children of a div with class jd, take that div, search its childNodes for the nodeName a, and read their href attributes.
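That child lookup could be sketched as follows. It relies only on the nodeName, attributes and childNodes fields that the Node class defines; childAnchorHrefs is a hypothetical helper, and the jd object is a hand-written stand-in (with a shortened href) so the sketch runs on its own:

```javascript
"use strict";

// Hypothetical helper: collect the href of every direct <a> child of a node.
function childAnchorHrefs(node) {
    var hrefs = [];
    for (var i = 0; i < node.childNodes.length; i++) {
        var child = node.childNodes[i];
        if (child.nodeName == "a" && typeof child.attributes.href != "undefined") {
            hrefs.push(child.attributes.href);
        }
    }
    return hrefs;
}

// Stand-in for one parsed <div class="jd">; the href is shortened for
// illustration and is not the full value from the question.
var jd = {
    nodeName: "div",
    attributes: { "class": "jd" },
    childNodes: [
        {
            nodeName: "a",
            attributes: { "class": "p", href: "/m/url?q=http://en.wikipedia.org/wiki/LOL" },
            childNodes: []
        }
    ]
};

console.log(childAnchorHrefs(jd)); // [ '/m/url?q=http://en.wikipedia.org/wiki/LOL' ]
```

With the real parser, the same function would be called once for each element returned by the class lookup.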

Luxun led me back to the Google API, and I found out how to search full web results instead of just included sites here: http://support.google.com/customsearch/bin/answer.py?hl=en&answer=1210656

To create a search engine that searches the entire web:

1. From the Google Custom Search homepage, click Create a Custom Search Engine.
2. Type a name and description for your search engine.
3. Under Define your search engine, in the Sites to Search box, enter at least one valid URL (e.g. www.google.com).
4. Select the CSE edition you want and accept the Terms of Service, then click Next. Select the layout option you want, and then click Next.
5. Click any of the links under the Next steps section to navigate to your Control panel.
6. In the left-hand menu, under Control Panel, click Basics.
7. In the Search Preferences section, select Search the entire web but emphasize included sites.
8. Click Save Changes.
9. In the left-hand menu, under Control Panel, click Sites.
10. Delete the site you entered during the initial setup process.
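Once such an engine exists, results can be requested as JSON instead of scraping HTML. The sketch below assumes the Custom Search JSON API endpoint and its key, cx and q parameters, and that each entry in the response's items array carries title, link and snippet; check these against the current API documentation. The key, engine ID and sample body are placeholders written for illustration, and no request is actually sent:

```javascript
"use strict";

// Build the request URL. apiKey and engineId are placeholders for the
// values issued for your own search engine.
function buildSearchUrl(apiKey, engineId, query) {
    var url = new URL("https://www.googleapis.com/customsearch/v1");
    url.searchParams.set("key", apiKey);
    url.searchParams.set("cx", engineId);
    url.searchParams.set("q", query);
    return url.toString();
}

// Reduce a parsed response body to the fields a CLI would print.
function summarizeResults(body) {
    return (body.items || []).map(function (item) {
        return { title: item.title, link: item.link, snippet: item.snippet };
    });
}

console.log(buildSearchUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "lol"));

// Hand-written stand-in for a response body, not real API output.
var sampleBody = {
    items: [
        { title: "LOL - Wikipedia", link: "http://en.wikipedia.org/wiki/LOL", snippet: "LOL, an abbreviation for laughing out loud" }
    ]
};
console.log(summarizeResults(sampleBody));
```

Fetching the built URL and passing the parsed JSON body to summarizeResults would give the CLI its titles, links and descriptions without any string manipulation.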
