Get all the javascript file name and its contents in python perfectly

Question

I want to scan some websites and would like to get all the java script files names and content.I tried python requests with BeautifulSoup but wasn't able to get the scripts details and contents.am I missing something ?

I have been trying lot of methods to find but I felt like stumbling in the dark. This is the code I am trying

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content)

Answer 1

You can get all the linked JavaScript code use the below code:

l = [i.get('src') for i in soup.find_all('script') if i.get('src')]

soup.find_all('script') returns a list of all the <script> tags in the page.
A list comprehension is used here to loop over all the elements in the list which returned by soup.find_all('script') .
i is a dict like object, use .get('src') to check if it has src attribute. If not, ignore it. Otherwise, put it into a list (which's called l in the example).

The output, in this case looks like below:

['http://adserver.adtech.de/addyn/3.0/1602/5506153/0/6490/ADTECH;loc=700;target=_blank;grp=[group]',
 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js',
 'http://js.genieessp.com/t/057/794/a1057794.js',
 'http://ib.adnxs.com/ttj?id=5620689&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
 'http://ib.adnxs.com/ttj?id=5531763',
 'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]',
 'http://xp2.zedo.com/jsc/xp2/fo.js',
 'http://www.marunadanmalayali.com/js/mnmads.js',
 'http://www.marunadanmalayali.com/js/jquery-2.1.0.min.js',
 'http://www.marunadanmalayali.com/js/jquery.hoverIntent.minified.js',
 'http://www.marunadanmalayali.com/js/jquery.dcmegamenu.1.3.3.js',
 'http://www.marunadanmalayali.com/js/jquery.cookie.js',
 'http://www.marunadanmalayali.com/js/swanalekha-ml.js',
 'http://www.marunadanmalayali.com/js/marunadan.js?r=1875',
 'http://www.marunadanmalayali.com/js/taboola_home.js',
 'http://d8.zedo.com/jsc/d8/fo.js']

My code missed some links because they're not in the HTML source actually.

You can see them in the console:

But they're not in the source:

Usually, that's because these links were generated by JavaScript. And the requests module doesn't run any JavaScript in the page like a real browser - it only send a request to get the HTML source.

If you also need them, you have to use another module to run the JavaScript in that page, and you can see these links then. For that, I'd suggest use selenium - which runs a real browser so it can runs JavaScript in the page.

For example (make sure that you have already installed selenium and a web driver for it):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # use Chrome driver for example
driver.get('http://www.marunadanmalayali.com/')

soup = BeautifulSoup(driver.page_source, "html.parser")
l = [i.get('src') for i in soup.find_all('script') if i.get('src')]

__import__('pprint').pprint(l)

Answer 2

You can use a select with script[src] which will only find script tags with a src, you don't need to call .get multiple times:

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.marunadanmalayali.com/")
soup = BeautifulSoup(r.content)

src = [sc["src"] for sc in  soup.select("script[src]")]

You can also specify src=True with find_all to do the same:

src = [sc["src"] for sc in soup.find_all("script",src=True)]

Which will both give you the same output:

['http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', 'http://js.genieessp.com/t/052/954/a1052954.js', '//s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', 'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', 'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', 'http://www.marunadanmalayali.com/js/mnmcombined2.min.js']

Also if you use selenium, you can use it with PhantomJs for headless browsing, you don't need beautufulSoup at all if you use selenium, you can use the same css selector directly in selenium:

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://www.marunadanmalayali.com/')

src = [sc.get_attribute("src") for sc in driver.find_elements_by_css_selector("script[src]")]
print(src)

Which gives you all the links:

u'https://pixel.yabidos.com/fltiu.js?qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&nci=&adtg=96331&nai=', u'http://gum.criteo.com/sync?c=72&r=2&j=TRC.getRTUS', u'http://b.scorecardresearch.com/beacon.js', u'http://cdn.taboola.com/libtrc/impl.201-1-RELEASE.js', u'http://p165.atemda.com/JSAdservingMP.ashx?pc=1&pbId=165&clk=&exm=&jsv=1.84&tsv=2.26&cts=1459160775430&arp=0&fl=0&vitp=0&vit=&jscb=&url=&fp=0;400;300;20&oid=&exr=&mraid=&apid=&apbndl=&mpp=0&uid=&cb=54613943&pId0=64056124&rank0=1&gid0=64056124:1c59ac&pp0=&clk0=[External%20click-tracking%20goes%20here%20(NOT%20URL-encoded)]&rpos0=0&ecpm0=&ntv0=&ntl0=&adsid0=', u'http://cdn.taboola.com/libtrc/marunadanaalayali-network/loader.js', u'http://s.atemda.com/Admeta.js', u'http://www.google-analytics.com/analytics.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://tags.expo9.exponential.com/tags/MarunadanMalayalicom/ROS/tags.js', u'http://js.genieessp.com/t/052/954/a1052954.js', u'http://s3-ap-northeast-1.amazonaws.com/tms-t/marunadanmalayali-7219.js', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7219/9/fm.js?c=7219&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1936&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.025054786819964647&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://cas.criteo.com/delivery/ajs.php?zoneid=308686&nodis=1&cb=38688817829&exclude=undefined&charset=UTF-8&loc=http%3A//www.marunadanmalayali.com/', u'http://ads.pubmatic.com/AdServer/js/showad.js', u'http://showads.pubmatic.com/AdServer/AdServerServlet?pubId=135167&siteId=135548&adId=600924&kadwidth=300&kadheight=250&SAVersion=2&js=1&kdntuid=1&pageURL=http%3A%2F%2Fwww.marunadanmalayali.com%2F&inIframe=0&kadpageurl=marunadanmalayali.com&operId=3&kltstamp=2016-3-28%2011%3A26%3A13&timezone=1&screenResolution=1024x768&ranreq=0.8869257988408208&pmUniAdId=0&adVisibility=2&adPosition=999x664', u'http://d8.zedo.com/jsc/d8/fo.js', u'http://z1.zedo.com/asw/fm/1185/7213/9/fm.js?c=7213&a=0&f=&n=1185&r=1&d=9&adm=&q=&$=&s=1948&l=%5BINSERT_CLICK_TRACKER_MACRO%5D&ct=&z=0.08655649935826659&tt=0&tz=0&pu=http%3A%2F%2Fwww.marunadanmalayali.com%2F&ru=&pi=1459160768626&ce=UTF-8&zpu=www.marunadanmalayali.com____1_&tpu=', u'http://advs.adgorithms.com/ttj?id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://ib.adnxs.com/ttj?ttjb=1&bdc=1459160761&bdh=ZllBLkzcj2dGDVPeS0Sw_OTWjgQ.&tpuids=eyJ0cHVpZHMiOlt7InByb3ZpZGVyIjoiY3JpdGVvIiwidXNlcl9pZCI6Il9KRC1PUmhLX3hLczd1cUJhbjlwLU1KQ2VZbDQ2VVUxIn1dfQ==&view_iv=0&view_pos=664,2096&view_ws=400,300&view_vs=3&bdref=http%3A%2F%2Fwww.marunadanmalayali.com%2F&bdtop=true&bdifs=0&bstk=http%3A%2F%2Fwww.marunadanmalayali.com%2F&&id=3279193&cb=[CACHEBUSTER]&pubclick=[INSERT_CLICK_TAG]', u'http://www.marunadanmalayali.com/js/mnmcombined1.min.js', u'http://www.marunadanmalayali.com/js/mnmcombined2.min.js', u'http://pixel.yabidos.com/iftfl.js?ver=1.4.2&qid=836373f5137373f5131353&cid=511&p=165&s=http%3a%2f%2fwww.marunadanmalayali.com%2f&x=admeta&adtg=96331&nci=&nai=&nsi=&cstm1=&cstm2=&cstm3=&kqt=&xc=&test=&od1=&od2=&co=0&tps=34&rnd=3m17uji8ftbf']

Get all the javascript file name and its contents in python perfectly

Question

2 answers

solution1
4 ACCPTED 2016-02-29 06:01:59

solution2
1 2016-03-28 10:20:10

Get all the javascript file name and its contents in python perfectly

Question

2 answers

solution1 4 ACCPTED 2016-02-29 06:01:59

solution2 1 2016-03-28 10:20:10

solution1
4 ACCPTED 2016-02-29 06:01:59

solution2
1 2016-03-28 10:20:10