
Python : Web Scraping Specific Keywords

My question shouldn't be too hard to answer. The problem I'm having is that I'm not sure how to scrape a website for specific keywords. I'm quite new to Python, so I know I need to add some more details. Firstly, I don't want to use Beautiful Soup or any similar library; I'm using lxml and requests. What I do want is to ask the user for a website URL and send a request to that URL. Once the request is made, I want to grab all of the HTML, which I believe I've done using html.fromstring(site.content). All of that is done. The problem is that I want to find any link or text ending in '.swf' and print it below. Does anyone know a way of doing this?

# Python 2
import requests
from lxml import html

def ScrapeSwf():
    flashSite = raw_input('Please Provide Web URL : ')
    print 'Sending Requests...'
    flashReq = requests.get(flashSite)
    print 'Scraping...'
    flashTree = html.fromstring(flashReq.content)
    print ' Now I want to search the HTML for the swf links'
    print ' And display them using print, probably with a while condition'

Something like that. Any help is highly appreciated.

You're using lxml.html to build the HTML into an object model, so you probably want to use flashTree.xpath to search the DOM using XML Path Language (XPath). Find the element you want in the source DOM and then write an XPath expression that extracts it; your web browser's developer tools and w3schools can help you.
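As a rough sketch of the XPath route (written for Python 3; the function name find_swf_urls and the sample markup are made up for illustration and are not from the question): XPath 1.0, which is what lxml implements, has no ends-with(), so this selects every attribute whose value contains ".swf" and leaves the final suffix check to Python.

```python
from lxml import html


def find_swf_urls(page_source):
    """Return attribute values (href, src, value, ...) that point at .swf files."""
    tree = html.fromstring(page_source)
    # contains() is case-sensitive, so this only catches lowercase ".swf".
    candidates = tree.xpath('//@*[contains(., ".swf")]')
    # Drop any query string before checking the suffix.
    return [str(value) for value in candidates
            if value.split('?')[0].endswith('.swf')]


if __name__ == '__main__':
    sample = ('<html><body>'
              '<embed src="http://example.com/game.swf">'
              '<a href="/movies/intro.swf?v=2">intro</a>'
              '<a href="page.html">home</a>'
              '</body></html>')
    for url in find_swf_urls(sample):
        print(url)
```

In the question's code you would pass flashReq.content (or flashReq.text) instead of the sample string.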

I personally wouldn't bother; I'd just extract the text I needed using a regular expression ( re.findall(pattern, flashReq.content) ) because it's quicker. If I didn't know regex, wasn't comfortable with them, or I wanted raw speed, then I'd use a crude string extraction like so:

start = flashReq.content.find(text_before_it) + len(text_before_it)
end = flashReq.content.find(text_after_it, start)
text_you_want = flashReq.content[start:end]
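Both ideas can be sketched a bit more fully. This is only an illustration in Python 3: the pattern, the helper names find_swf_with_regex and extract_between, and the assumption that the page is passed in as text (for example requests' response.text rather than response.content) are mine, not something the question specifies. The slicing helper also guards against str.find returning -1 when a marker is missing, which the three-line version above does not.

```python
import re

# Assumed pattern: a run of characters that cannot be part of a quoted URL,
# followed by ".swf" at a word boundary. Adjust it to the real page.
SWF_PATTERN = re.compile(r'''[^\s"'<>()]+\.swf\b''', re.IGNORECASE)


def find_swf_with_regex(page_text):
    """Return every substring of page_text that looks like a path ending in .swf."""
    return SWF_PATTERN.findall(page_text)


def extract_between(text, text_before_it, text_after_it):
    """Return the text between two markers, or None if either marker is missing."""
    marker_pos = text.find(text_before_it)
    if marker_pos == -1:
        return None
    start = marker_pos + len(text_before_it)
    end = text.find(text_after_it, start)
    if end == -1:
        return None
    return text[start:end]
```

For example, extract_between('<embed src="movie.swf">', 'src="', '"') picks out the value of the src attribute, while the regex version needs no knowledge of the surrounding markup.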

Here goes my attempt:

import requests  # [1]
flashSite = raw_input('Please Provide Web URL : ')  # URL prompt, as in the question
response = requests.get(flashSite)  # [2]
myPage = response.content  # [3]
for line in myPage.splitlines():  # [4]
    if '.swf' in line:  # [5]
        start = line.find('http')  # [6]
        end = line.find('.swf') + 4  # [7]
        print line[start:end]  # [8]

Explanation:

1 : Import the requests module. I couldn't really figure out a way to get what I needed out of lxml, so I just stuck with this.

2 : Send an HTTP GET request to whichever site hosts the Flash file.

3 : Save its contents to a variable.

Yes, I realize you could condense lines 2 and 3; I just did it this way because it makes a bit more sense to me.

4 : Iterate through the page source, line by line.

5 : Check to see if '.swf' is in that line

Lines 6 through 8 demonstrate the string slicing method that @GazDavidson mentioned in his answer. The reason I add 4 in line 7 is because '.swf' is 4 characters long.

You should be able to (roughly) get the result that provides a link to the SWF file.
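One rough edge of this approach is that line.find('http') returns the first "http" on the line, which may belong to a different link, and only the first '.swf' on each line is reported. A possible refinement, sketched here in Python 3 with a made-up function name (swf_links_by_line) and assuming the page is available as text such as response.text, searches backwards from each '.swf' for the nearest "http":

```python
def swf_links_by_line(page_text):
    """Collect absolute .swf links, scanning the page one line at a time."""
    links = []
    for line in page_text.splitlines():
        search_from = 0
        while True:
            dot = line.find('.swf', search_from)
            if dot == -1:
                break  # no further '.swf' on this line
            end = dot + len('.swf')
            # Nearest 'http' that starts before this '.swf'.
            start = line.rfind('http', 0, dot)
            if start != -1:
                links.append(line[start:end])
            search_from = end
    return links
```

This still assumes each .swf link is absolute (starts with "http") and sits on a single line; relative links are skipped, or mis-sliced if an unrelated "http" precedes them on the same line.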


 