
How can I select the element in JavaScript source?

I need to get the value of the "html" key in the JavaScript source below, which I extracted with xpath('.//script[34]') from a script embedded in an HTML page.

   <script>
        FM.view({
            "ns": "pl.content.homeFeed.index",
            "domid": "Pl_Official_MyProfileFeed__24",
            "css": ["style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e"],
            "js": "page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334",
            "html": "                <div class=\"WB_feed WB_feed_v3\" pageNum=\"\" node-type='feed_list' module-type=\"feed\">\r\n...."
        })
    </script>

In particular, I don't know how to deal with the "FM.view" text wrapped around the object.

I would use .re() to extract the html key value from the script:

>>> response.xpath("//script[contains(., 'Pl_Official_MyProfileFeed__24')]/text()").re(r'"html": "(.*?)"\n')[0].strip()
u'<div class=\\"WB_feed WB_feed_v3\\" pageNum=\\"\\" node-type=\'feed_list\' module-type=\\"feed\\">\\r\\n..'

Or, you can extract the complete object from the script, load it with json and get the html value:

>>> import json
>>> data = response.xpath("//script[contains(., 'Pl_Official_MyProfileFeed__24')]/text()").re(r'(?ms)FM\.view\((\{.*?\})\)')[0]
>>> obj = json.loads(data)
>>> obj['html'].strip()
u'<div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....'

Note the (?ms) prefix in the regular expression: it sets the multiline and dotall flags inline. Dotall is the one this pattern depends on, because it lets . match the newlines inside the object.
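To see the effect of the flags outside of Scrapy, here is a small stdlib-only sketch. It assumes the argument of FM.view() is valid JSON, as it is in the question; script_text is a shortened stand-in for the script content.

```python
import json
import re

# Shortened stand-in for the text of the <script> element.
script_text = r'''
        FM.view({
            "ns": "pl.content.homeFeed.index",
            "domid": "Pl_Official_MyProfileFeed__24",
            "html": "   <div class=\"WB_feed WB_feed_v3\" node-type='feed_list'>\r\n...."
        })
'''

pattern = r'FM\.view\((\{.*?\})\)'

# Without DOTALL, "." stops at the first newline, so the
# multi-line object literal is not matched at all.
no_flag = re.search(pattern, script_text)
print(no_flag)

# With DOTALL (the "s" in "(?ms)"), "." also matches newlines.
match = re.search(pattern, script_text, re.DOTALL)
obj = json.loads(match.group(1))
print(obj["domid"])
```

The non-greedy .*? stops at the first "})" it finds, so this would need a sturdier approach if the "html" value itself contained that sequence.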

Here's an alternative to regular expressions + json, using the js2xml package.

The first step is to get the JavaScript statements inside <script> from the HTML. You probably have that step already. Here I'm building a Scrapy selector from your input HTML; in your case you are probably working with a response inside a callback:

>>> import scrapy
>>> import js2xml
>>> t = r'''   <script>
...         FM.view({
...             "ns": "pl.content.homeFeed.index",
...             "domid": "Pl_Official_MyProfileFeed__24",
...             "css": ["style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e"],
...             "js": "page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334",
...             "html": "                <div class=\"WB_feed WB_feed_v3\" pageNum=\"\" node-type='feed_list' module-type=\"feed\">\r\n...."
...         })
...     </script>'''
>>> selector = scrapy.Selector(text=t, type='html')

The second step is to build a tree representation of the JavaScript program using js2xml.parse(). You get an lxml tree back:

>>> js = selector.xpath('//script/text()').extract_first()
>>> jstree = js2xml.parse(js)
>>> jstree
<Element program at 0x7ff19ec94ea8>
>>> type(jstree)
<type 'lxml.etree._Element'>

>>> print(js2xml.pretty_print(jstree))
<program>
  <functioncall>
    <function>
      <dotaccessor>
        <object>
          <identifier name="FM"/>
        </object>
        <property>
          <identifier name="view"/>
        </property>
      </dotaccessor>
    </function>
    <arguments>
      <object>
        <property name="ns">
          <string>pl.content.homeFeed.index</string>
        </property>
        <property name="domid">
          <string>Pl_Official_MyProfileFeed__24</string>
        </property>
        <property name="css">
          <array>
            <string>style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e</string>
          </array>
        </property>
        <property name="js">
          <string>page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334</string>
        </property>
        <property name="html">
          <string>                &lt;div class="WB_feed WB_feed_v3" pageNum="" node-type='feed_list' module-type="feed"&gt;&#13;
....</string>
        </property>
      </object>
    </arguments>
  </functioncall>
</program>

The third step is to select the object you want from the tree. Here, it's the first argument of the FM.view() call. Calling .xpath() on the lxml tree gives you a list even if you selected a single node (XPath returns node-sets):

>>> # select the function call for "FM.view"
>>> # and get the first argument
>>> jstree.xpath('''
...     //functioncall[
...         function[.//identifier/@name="FM"]
...                 [.//identifier/@name="view"]]
...         /arguments
...             /*[1]''')
[<Element object at 0x7ff19ec94ef0>]
>>> args = jstree.xpath('//functioncall[function[.//identifier/@name="FM"][.//identifier/@name="view"]]/arguments/*[1]')

Fourth, convert the <object> into a Python dict using js2xml.jsonlike.make_dict():

# use js2xml.jsonlike.make_dict() on that argument
>>> js2xml.jsonlike.make_dict(args[0])
{'ns': 'pl.content.homeFeed.index', 'html': '                <div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....', 'css': ['style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e'], 'domid': 'Pl_Official_MyProfileFeed__24', 'js': 'page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334'}
>>> from pprint import pprint
>>> pprint(js2xml.jsonlike.make_dict(args[0]))
{'css': ['style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e'],
 'domid': 'Pl_Official_MyProfileFeed__24',
 'html': '                <div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....',
 'js': 'page/js/pl/content/homeFeed/index.js?version=dad90e594db2c334',
 'ns': 'pl.content.homeFeed.index'}
>>> 
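To make the conversion less of a black box, here is a rough stdlib-only sketch of the idea: walk the <object>, <array> and <string> nodes shown in the tree above and build the matching dict, list and str values. This is not js2xml's actual implementation; to_python is a hypothetical helper, it uses xml.etree.ElementTree instead of lxml, and it only handles the three node types that appear in this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <object> node from the tree printed earlier.
xml_text = """
<object>
  <property name="domid">
    <string>Pl_Official_MyProfileFeed__24</string>
  </property>
  <property name="css">
    <array>
      <string>style/css/module/list/comb_WB_feed_profile.css?version=73267f08bd52356e</string>
    </array>
  </property>
</object>
"""


def to_python(node):
    """Hypothetical helper: map <object>/<array>/<string> to dict/list/str."""
    if node.tag == "object":
        # Each <property name="..."> wraps exactly one value node.
        return {prop.get("name"): to_python(prop[0])
                for prop in node.findall("property")}
    if node.tag == "array":
        return [to_python(child) for child in node]
    if node.tag == "string":
        return node.text or ""
    raise ValueError("unsupported node: " + node.tag)


data = to_python(ET.fromstring(xml_text))
print(data["css"])
```

In real code, js2xml.jsonlike.make_dict() remains the simpler choice, since it already works on the nodes that js2xml.parse() produces.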

And finally, you simply use the "html" key from that dict:

>>> jsdata = js2xml.jsonlike.make_dict(args[0])
>>> jsdata['html']
'                <div class="WB_feed WB_feed_v3" pageNum="" node-type=\'feed_list\' module-type="feed">\r\n....'
>>> 
