I'm pretty new to Python and I'm working on a web scraping project using the Scrapy library. I'm not using the built-in domain restriction because I want to check whether any of the links to pages outside the domain are dead. However, I still want to treat pages within the domain differently from those outside it, so I'm trying to manually determine whether a site is within the domain before parsing the response.
Response URL:
http://www.siteSection1.domainName.com
If Statement:
if 'domainName.com' and ('siteSection1' or 'siteSection2' or 'siteSection3') in response.url:
    parsePageInDomain()
The above statement is true (the page is parsed) if 'siteSection1' is the first to appear in the list of or's, but the page is not parsed if the response URL is the same and the if statement is the following:
if 'domainName.com' and ('siteSection2' or 'siteSection1' or 'siteSection3') in response.url:
    parsePageInDomain()
What am I doing wrong here? I haven't been able to think through what is going on with the logical operators very clearly and any guidance would be greatly appreciated. Thanks!
The or operator doesn't work that way. Try any() instead:
if 'domainName.com' in response.url and any(name in response.url for name in ('siteSection1', 'siteSection2', 'siteSection3')):
    parsePageInDomain()
What's going on here is that or does not build a combined condition out of its arguments; it returns one of them. x or y returns x if x evaluates to True (which for a string means it's not empty), or y if x does not evaluate to True. So ('siteSection1' or 'siteSection2' or 'siteSection3') evaluates to 'siteSection1', because 'siteSection1' is True when considered as a boolean. That is why reordering the names changed the result: only the first name is ever tested against the URL.
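As a minimal sketch of that behavior, using the placeholder names and URL from the question (the variable names here are made up for illustration):

```python
# `or` returns one of its operands; it does not create a combined test.
sections = 'siteSection1' or 'siteSection2' or 'siteSection3'
reordered = 'siteSection2' or 'siteSection1' or 'siteSection3'

url = 'http://www.siteSection1.domainName.com'

# Each membership test only ever sees the first string in the chain.
first_check = sections in url     # same as 'siteSection1' in url
second_check = reordered in url   # same as 'siteSection2' in url

print(repr(sections), first_check)
print(repr(reordered), second_check)
```

The first check passes and the second fails for the same URL, which matches what the question describes.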
Moreover, you're also using and to combine your criteria. and returns its first argument if that argument evaluates to False, or its second if the first argument evaluates to True. Therefore, if x and y in z does not test whether both x and y are in z. in has higher precedence than and (I had to look that up), so it tests x and (y in z). Again, 'domainName.com' evaluates as True, so this returns just the result of y in z.
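A short sketch of the precedence point, again with the question's placeholder names; other_url is a hypothetical address added only to show that the domain string is never actually checked:

```python
url = 'http://www.siteSection1.domainName.com'

# `in` binds tighter than `and`, so these two lines are the same expression.
implicit = 'domainName.com' and 'siteSection1' in url
explicit = 'domainName.com' and ('siteSection1' in url)

# The left operand is a non-empty string, so it is always truthy and is
# never compared with the URL. A URL on another domain still passes.
other_url = 'http://www.siteSection1.example.org'
fooled = 'domainName.com' and 'siteSection1' in other_url

# `and` returns its first operand when that operand is falsy.
empty_first = '' and 'siteSection1' in url

print(implicit, explicit, fooled, repr(empty_first))
```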
any, by contrast, is a built-in function that takes an iterable and returns True if any of its elements evaluates to True, and False otherwise. It stops as soon as it hits a True value, so it's efficient. I'm using a generator expression so that it checks each of your three possible strings in turn to see whether any of them is in your response URL.
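One way to package the check, as a rough sketch: is_in_domain and SECTIONS are hypothetical names, and the domain and section strings are the placeholders from the question.

```python
SECTIONS = ('siteSection1', 'siteSection2', 'siteSection3')


def is_in_domain(url, domain='domainName.com', sections=SECTIONS):
    """Return True if url contains the domain and at least one section name."""
    # Each condition is an explicit membership test against the URL.
    # any() stops at the first section name that is found.
    return domain in url and any(name in url for name in sections)


print(is_in_domain('http://www.siteSection1.domainName.com'))
print(is_in_domain('http://www.siteSection2.domainName.com'))
print(is_in_domain('http://www.siteSection1.example.org'))
```

In a spider callback this would be called as is_in_domain(response.url). Note that these are plain substring checks, so they also match URLs that merely contain those strings somewhere in the path or query.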