Drupal URL structure for scraping

I am trying to scrape a Drupal site with a Python script to collect music gigs from the past.

When doing this with a WordPress site, I would iterate through URLs like this:

http://wordpressevents.com/?p=1 ... http://wordpressevents.com/?p=10000

...and that would get me forwarded to a page (if there's one there) that I could scrape. The actual URL would be something like:

http://wordpressevents.com/music/some-band-youve-never-heard-of/

My Drupal site also has sections (e.g. /gigs/ or /classical/).

Is there any way I can find out what its URLs might be, so that I can scrape it with Python and BeautifulSoup (other suggestions welcome)?

Ideally, I would find out what the structure is...

http://drupalevents.com/drupost?=1 ... http://drupalevents.com/drupost?=10000

etc.

But maybe it doesn't work like this?

In Drupal, the only guaranteed content URL structure is /node/[some number].

So the best way to do this on an arbitrary Drupal site is to start at /node/1 and go up from there, incrementing by 1 each time. Alternatively, look at the source of the newest page on the site and find its node ID in the class attribute of the body tag; then you know the highest number and can work your way backwards. For example, for node/185324 the body could carry the class node-185324. This class might not be there, as the body classes can be anything depending on how the site was set up.
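The approach above can be sketched roughly as follows, using only the standard library. This is a minimal sketch under assumptions: `drupalevents.com` is the placeholder host from the question, the function names are made up for illustration, and the body class is assumed to look like `node-<id>` or `page-node-<id>`, which depends entirely on the site's theme.

```python
import re
import urllib.error
import urllib.request
from html.parser import HTMLParser

BASE_URL = "http://drupalevents.com"  # placeholder host taken from the question


def node_urls(base_url, first_id, last_id):
    """Yield /node/<id> URLs from first_id to last_id inclusive.

    Counts downwards when last_id is smaller than first_id.
    """
    step = 1 if last_id >= first_id else -1
    for node_id in range(first_id, last_id + step, step):
        yield f"{base_url.rstrip('/')}/node/{node_id}"


class _BodyClassParser(HTMLParser):
    """Collect the classes listed on the <body> tag."""

    def __init__(self):
        super().__init__()
        self.classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.classes = (dict(attrs).get("class") or "").split()


def node_id_from_body_class(html):
    """Return the node ID found in the body classes, or None.

    Assumes a class such as 'node-123' or 'page-node-123'; a theme
    may use something else or omit it altogether.
    """
    parser = _BodyClassParser()
    parser.feed(html)
    for cls in parser.classes:
        match = re.fullmatch(r"(?:page-)?node-(\d+)", cls)
        if match:
            return int(match.group(1))
    return None


def fetch(url, timeout=10):
    """Return (final_url, html), or None on an HTTP error status such as 404 or 403."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            html = response.read().decode(charset, errors="replace")
            return response.geturl(), html
    except urllib.error.HTTPError:
        return None
```

To crawl, loop over `node_urls(BASE_URL, 1, 10000)`, call `fetch` on each URL, skip the `None` results, and pass the HTML on to BeautifulSoup or another parser. Pausing between requests (for example with `time.sleep`) keeps the load on the site reasonable.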

Most sites also use the Pathauto module to give pages something a bit friendlier than node/123.

Pathauto builds these URLs from token patterns that the site builder specifies. One common pattern is /content/[node:title]. I doubt this will really help you, but at least it tells you something about how the Drupal site is set up.
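If the friendly URL of each node is known (for instance because the site redirects /node/123 to its alias, which depends on the site's configuration and is not guaranteed), the first path segment gives a rough idea of the sections in use, such as /gigs/ or /classical/. A small sketch, with hypothetical function names:

```python
from collections import Counter
from urllib.parse import urlparse


def section_of(url):
    """Return the first path segment of a URL, or None for the front page."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    return segments[0] if segments else None


def count_sections(urls):
    """Count how many URLs fall under each first path segment."""
    return Counter(section_of(url) for url in urls)
```

URLs that still come back as /node/123 are counted under `node`, which signals that no alias was exposed for that page.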

