
Produce a list of URLs in Python using regular expressions

My goal: to extract all of the transcripts linked from this URL, and clean them for my particular use.
I need to recursively extract links that follow a pattern. I am a newbie and am having trouble putting together complete code that works.

Here are some examples of how the URLs will look:

http://tvmegasite.net/transcripts/oltl/main/1998transcripts.shtml
http://tvmegasite.net/transcripts/oltl/older/2004/oltl-trans-01-20-04.htm
http://tvmegasite.net/transcripts/amc/main/2003transcripts.shtml
http://tvmegasite.net/transcripts/amc/older/2002/amc-trans-01-08-02.shtml

So all of them begin with http://tvmegasite.net/transcripts, followed by the show abbreviation, then main or older, and so on.

What I've tried so far: getting URLs from a particular page is easy with BeautifulSoup, but I haven't figured out how to do it recursively. I was thinking of using a scraper like Scrapy to collect all the URLs starting from tvmegasite.net/transcripts, and then using the re package to keep only the ones that match the pattern. I'm still not sure how to turn this into complete code. From what I can guess, these are possibly the kinds of regular expressions that could work:

http://tvmegasite\.net/transcripts/\w+/main/\d+\w+\.shtml
http://tvmegasite\.net/transcripts/\w+/older/\d+/\w+-\w+-\d+-\d+-\d+\.(shtml|htm)
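
For illustration, here is a minimal sketch of that plan, assuming the patterns above are right; the crawl helper, its max_pages cap, and the breadth-first approach are invented for this example, not something from the original question:

import re
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Combined pattern for transcript pages, assumed from the example URLs
# above (both .shtml and .htm endings appear on the site)
TRANSCRIPT_RE = re.compile(
    r"http://tvmegasite\.net/transcripts/\w+/"
    r"(?:main/\d+\w+\.shtml|older/\d+/\w+-\w+-\d+-\d+-\d+\.(?:shtml|htm))"
)

def crawl(start_url, max_pages=50):
    # Breadth-first crawl: follow links under /transcripts and collect
    # every URL that matches the transcript pattern
    seen, queue, matches = set(), [start_url], []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(urlopen(url).read(), "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if TRANSCRIPT_RE.match(link):
                matches.append(link)
            elif link.startswith("http://tvmegasite.net/transcripts"):
                queue.append(link)  # keep crawling within the section
    return matches

Calling crawl("http://tvmegasite.net/transcripts/") would then return the matching transcript URLs found within the page budget.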

If you use Scrapy you do not need regular expressions -- or at least you can keep them to a minimum. For example, with the LxmlLinkExtractor you can specify which URLs to follow (allow) and within which XPath branch (restrict_xpaths).

And you can use your regular expressions (which look fine to me at first glance) in the allow restriction -- and for this site you do not need an XPath restriction.
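
To make that concrete, here is a minimal CrawlSpider sketch along those lines; the spider name, the start URL, and the parse_transcript callback are assumptions for illustration (in current Scrapy versions the default LinkExtractor is the Lxml-based one):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor  # Lxml-based by default

class TranscriptSpider(CrawlSpider):
    # Hypothetical spider set up for the URL patterns from the question
    name = "transcripts"
    allowed_domains = ["tvmegasite.net"]
    start_urls = ["http://tvmegasite.net/transcripts/"]

    rules = (
        # Individual transcript pages, e.g.
        # .../amc/older/2002/amc-trans-01-08-02.shtml, go to the callback
        Rule(
            LinkExtractor(allow=r"/transcripts/\w+/older/\d+/\w+-\w+-\d+-\d+-\d+\.(shtml|htm)"),
            callback="parse_transcript",
        ),
        # Keep following everything else under /transcripts, which covers
        # the yearly index pages like .../oltl/main/1998transcripts.shtml
        Rule(LinkExtractor(allow=r"/transcripts/"), follow=True),
    )

    def parse_transcript(self, response):
        # Yield the page URL and raw text so they can be cleaned later
        yield {"url": response.url, "text": response.text}

You can then run the spider with scrapy runspider and an output feed (for example -o transcripts.json) to collect the pages for cleaning.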
