
Alternative to a server side scripting language for scraping

I have a small website hosted by my university. The policy is that no server-side scripting language (e.g. PHP) is enabled, so websites are either static or limited to client-side scripting (e.g. JavaScript). I also can't touch the server, configure it, or install anything on it.

Anyway, I wanted to add some data from other websites (namely, Google Scholar citations), which I already manage to scrape with Python + lxml, and keep it up to date dynamically. Is there any way I can have this data queried dynamically, on the client side of course?

I tried using IronPython to embed my Python code in my webpage, but it complained that it could not find the imported lxml library. A similar solution would be great, though. Or is there a pure JavaScript library that can open and parse external webpages?

Thanks!

No. The same origin policy prevents it.

Either use a third party proxy that will transcode the data to JSON-P, or use a different host.

Alternatively, have a cron job running on a server you control that periodically generates new static HTML and uploads it to your host.
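
As a rough illustration of the cron-job idea, here is a minimal Python sketch of the "generate static HTML" step. It assumes your existing lxml scraper already gives you a list of (title, citation count) pairs; the function names `render_citations_page` and `write_page`, the page layout, and the output file name `citations.html` are all made up for this example.

```python
import html
from datetime import datetime, timezone
from pathlib import Path


def render_citations_page(citations, generated_at=None):
    """Return a complete static HTML page for a list of (title, count) pairs."""
    generated_at = generated_at or datetime.now(timezone.utc)
    # Escape titles so scraped text cannot inject markup into the page.
    rows = "\n".join(
        f"    <tr><td>{html.escape(title)}</td><td>{int(count)}</td></tr>"
        for title, count in citations
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head><meta charset=\"utf-8\"><title>Citations</title></head>\n"
        "<body>\n"
        "  <table>\n"
        "    <tr><th>Paper</th><th>Citations</th></tr>\n"
        f"{rows}\n"
        "  </table>\n"
        f"  <p>Last updated: {generated_at:%Y-%m-%d %H:%M} UTC</p>\n"
        "</body>\n"
        "</html>\n"
    )


def write_page(citations, out_path="citations.html"):
    """Write the rendered page to disk, ready to be uploaded to the static host."""
    Path(out_path).write_text(render_citations_page(citations), encoding="utf-8")
```

Calling `write_page(scraped_pairs)` from a script that cron runs (say, once a night) and then copying the resulting file to the university host would give you data that is "dynamic enough" without any server-side code on the host itself. How you upload the file depends on what access the university gives you, so that part is left out here.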

The trick is that you'll run afoul of the Same Origin Policy, which largely denies your client-side code access to the content of sites that aren't from the same origin as your page. There are things like CORS which allow cross-origin requests, but the user has to be using a browser that supports it, and the site you're trying to load data from has to support it (and either specifically allow your page access, or allow all pages access) at the server level.
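
To make "same origin" concrete: two URLs share an origin only when scheme, host, and port all match. The Python sketch below models that comparison, plus a deliberately simplified version of the CORS check on the `Access-Control-Allow-Origin` response header (it ignores credentials, preflight requests, and everything else a real browser does). The helper names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that defines a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, parts.hostname or "", port)


def same_origin(url_a, url_b):
    """True when both URLs have the same scheme, host, and port."""
    return origin(url_a) == origin(url_b)


def serialize_origin(url):
    """Render an origin the way it appears in headers, e.g. 'http://example.edu'."""
    scheme, host, port = origin(url)
    if port == DEFAULT_PORTS.get(scheme):
        return f"{scheme}://{host}"
    return f"{scheme}://{host}:{port}"


def cors_allows(page_url, allow_origin_header):
    """Simplified model: the header must be '*' or exactly the page's origin."""
    if allow_origin_header is None:
        return False  # the remote site sent no CORS header at all
    value = allow_origin_header.strip()
    return value == "*" or value == serialize_origin(page_url)
```

With this model, a page at `http://www.example.edu/~me/` and a page on `scholar.google.com` are different origins, so the read is blocked unless the remote server opts in with a suitable header. You can't add that header yourself, which is why the remote site has to cooperate.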

So you're probably going to need a server-side solution, but that doesn't necessarily mean that your server has to provide it. If the sites you're gathering information from support JSON-P, then you can use that without SOP issues. If not, you might be able to use a service like YQL as described in this article on using YQL as a cross-domain proxy.

If none of that works for you (and you can't do a cron job as Quentin cleverly suggested), I will just point out that PHP web hosting is an abundant commodity and thus dead cheap. Like, less-than-the-price-of-a-cup-of-coffee-per-month cheap. I've seen it for under $2/mo (the cheapest I could find just now in 20 seconds of searching was $2.49, but I know I've seen better). So you could either relocate the site or, if having it on a University address is important, provide yourself a JSON-P interface from a cup-of-coffee-esque server that is then consumed by your University site's client-side code.
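
Since the asker already works in Python, here is a minimal sketch of what the "provide yourself a JSON-P interface" part boils down to on the cheap server, written in Python rather than PHP. The function name `to_jsonp` and the callback name `showCitations` are invented for the example; the only real work is wrapping the JSON in a call to the function the client asked for, and refusing callback names that aren't plain identifiers, since the response gets executed as script.

```python
import json
import re

# Allow only dotted JavaScript-style identifiers (e.g. "showCitations" or
# "app.showCitations") so the callback parameter cannot smuggle in code.
_CALLBACK_RE = re.compile(
    r"[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*"
)


def to_jsonp(callback, data):
    """Wrap JSON-serializable data in a call to the named client-side function."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError(f"unsafe JSON-P callback name: {callback!r}")
    return f"{callback}({json.dumps(data)});"
```

The university page would then define a global `showCitations(data)` function and load the endpoint through a script tag with something like `?callback=showCitations` in its URL. Script tags aren't subject to the same read restrictions, which is the whole trick behind JSON-P. The response should be served with a JavaScript content type.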

Given the server restrictions, I would run these scripts locally to generate HTML and push that static output to your server.
