Search for a string in JavaScript using Python

This follows on from my previous question: how to fetch javascript contents in python.

After fetching the webpage contents, I tried to write another script that extracts data from the JavaScript in the page.

However, it does not return the content I want. I want to find "content_id" in the page's JavaScript. This is the page: http://www.hulu.com/watch/815743

Here's what I have right now:

import re
import requests
from bs4 import BeautifulSoup
import os
import fileinput


Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text)
#print soup
subtitles = soup.findAll('script',{'type':'text/javascript'})
pattern = re.compile(r'"content_id":"(.*?)"', re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print pattern.search(script.text).group(1)

I get this error:

AttributeError: 'NoneType' object has no attribute 'text'

Any idea how to solve this?

There are two problems in your regular expression pattern:

  • the quotes are escaped with backslashes in the script contents, so the pattern has to account for them
  • there is whitespace after the colon

Here is the fixed version:

pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)

This works for me and returns 60585710.
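To illustrate both points without fetching the page, here is a minimal sketch that runs the original and the fixed pattern against a made-up fragment. The `sample` string is an assumption shaped like the description above (backslash-escaped quotes, a space after the colon); it is not copied from the actual page.

```python
import re

# Hypothetical fragment: JSON embedded in a JavaScript string literal,
# so each quote is preceded by a backslash and a space follows the colon.
sample = r'var config = "{\"content_id\": \"60585710\", \"title\": \"x\"}";'

# Pattern from the question: expects bare quotes and no whitespace.
original = re.compile(r'"content_id":"(.*?)"')

# Fixed pattern: matches the literal backslashes and optional whitespace.
fixed = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"')

print(original.search(sample))        # None
print(fixed.search(sample).group(1))  # 60585710
```

Because the original pattern matches nothing, `soup.find("script", text=pattern)` has no script to return and gives back `None`.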

FYI, here is the complete code that I'm running:

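import re

import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it
Link = 'http://www.hulu.com/watch/815743'
q = requests.get(Link)
soup = BeautifulSoup(q.text)

# The quotes are backslash-escaped in the script, and whitespace may follow the colon
pattern = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"', re.MULTILINE | re.DOTALL)
# Find the first <script> element whose text matches the pattern
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))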
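The `AttributeError` in the question appears because `soup.find` returns `None` when no script matches. If the page layout may change, one way to avoid the crash is to check for a match before reading the group. The sketch below is only an illustration: `extract_content_id` is a hypothetical helper name, and it searches the raw page text directly instead of going through BeautifulSoup.

```python
import re

# Same fixed pattern as above: escaped quotes, optional whitespace after the colon.
CONTENT_ID = re.compile(r'\\"content_id\\":\s*\\"(.*?)\\"')


def extract_content_id(page_text):
    """Return the first content_id found in page_text, or None if there is none."""
    match = CONTENT_ID.search(page_text)
    return match.group(1) if match else None
```

With the code above it could be called as `extract_content_id(q.text)`, and the caller can then test the result against `None` instead of catching an exception.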
