My goal: to extract all of the transcripts at this URL and clean them for my particular use.
I need to recursively extract links that follow a pattern. I'm a newbie and am having trouble putting together the full code that will work.
Here are some examples of how the URLs will look:
http://tvmegasite.net/transcripts/oltl/main/1998transcripts.shtml
http://tvmegasite.net/transcripts/oltl/older/2004/oltl-trans-01-20-04.htm
http://tvmegasite.net/transcripts/amc/main/2003transcripts.shtml
http://tvmegasite.net/transcripts/amc/older/2002/amc-trans-01-08-02.shtml
So all of them begin with http://tvmegasite.net/transcripts, then the show abbreviation, then main or older, and so on.
What I've tried so far: getting the URLs from a single page is easy with BeautifulSoup, but I haven't figured out how to do it recursively. I was thinking of using a crawler like Scrapy to collect all the URLs starting from tvmegasite.net/transcripts, and then using the re package to keep only the ones that match the pattern. I'm still not sure how to turn this into a full program. From what I can guess, these are the kinds of regular expressions that might work:
http://tvmegasite\.net/transcripts/\w+/main/\d+\w+\.shtml
http://tvmegasite\.net/transcripts/\w+/older/\d+/\w+-\w+-\d+-\d+-\d+\.s?html?
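As a quick sanity check, corrected versions of those patterns (my guesses based on the four sample URLs: escaped dots in the host name, a separate /\w+/ segment for the show directory, five hyphen-separated parts in the "older" file names, and an extension that can be .htm as well as .shtml) can be tested with the standard re module:

```python
import re

# Corrected patterns -- assumptions inferred from the sample URLs above.
main_pattern = r"http://tvmegasite\.net/transcripts/\w+/main/\d+\w+\.shtml"
older_pattern = r"http://tvmegasite\.net/transcripts/\w+/older/\d+/\w+-\w+-\d+-\d+-\d+\.s?html?"

samples = [
    "http://tvmegasite.net/transcripts/oltl/main/1998transcripts.shtml",
    "http://tvmegasite.net/transcripts/oltl/older/2004/oltl-trans-01-20-04.htm",
    "http://tvmegasite.net/transcripts/amc/main/2003transcripts.shtml",
    "http://tvmegasite.net/transcripts/amc/older/2002/amc-trans-01-08-02.shtml",
]

for url in samples:
    # Each sample URL should match exactly one of the two patterns.
    matched = bool(re.fullmatch(main_pattern, url) or re.fullmatch(older_pattern, url))
    print(url, "->", matched)  # prints True for all four samples
```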
If you use Scrapy you do not need regular expressions -- or at least you can keep them to a minimum. For example, with LxmlLinkExtractor you can configure which URLs to follow (allow) and in which XPath branch to look for links (restrict_xpaths).
And you can use your regular expressions (which look fine to me at first glance) in the allow restriction -- for this site you do not need an XPath restriction at all.