Using Regular Expressions to extract specific URLs in Python

I have parsed an HTML document containing JavaScript with BeautifulSoup, isolated the JavaScript within it, and converted it into a string. The JavaScript looks like this:

<script>
    [irrelevant javascript code here]
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    [irrelevant javascript code here]
</script>

I am trying to get a list containing only the URLs in this sources array, which would look like this:

urls = ['http://url.com/folder1/v.html', 
        'http://url.com/folder2/v.html', 
        'http://url.com/folder3/v.html']

The domains are unknown IPs, the folder names are of random length and consist of lowercase letters and numbers, and there are 1-5 URLs in each file (usually 3). The only constants are that the URLs start with http and end with .html.

I decided to use regular expressions (which I am quite new to) to deal with this problem, and my code looks like this: urls = re.findall(r'http://[^t][^\s"]+', document)

The [^t] is there because there are other URLs in the document whose domain names start with t. My problem is that there is another URL, for a jpg on the same domain as the URLs I am extracting, which ends up in the urls list along with the others.

Example:

urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
        'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
        'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
        'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']

How would I go about fetching only the .html URLs?

You can use r'"(http.*?)"' to get the URLs within your text:

>>> import re
>>> s="""<script>
...     [irrelevant javascript code here]
...     sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
...     {file:"http://url.com/folder2/v.html",label:"label2"},
...     {file:"http://url.com/folder3/v.html",label:"label3"}],
...     [irrelevant javascript code here]
... </script>"""

>>> re.findall(r'"(http.*?)"',s,re.MULTILINE|re.DOTALL)
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']

And to keep only the .html URLs from a list of URLs, you can use str.endswith:

>>> urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
...         'http://123.45.67.89/alwefaoewifiasdof224a/v.html',
...         'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html',
...         'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']
>>> 
>>> [i for i in urls if i.endswith('.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html', 
 'http://123.45.67.89/alwefaoewifiasdof224a/v.html', 
 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html']

As another general and flexible approach for such tasks, you can use the fnmatch module:

>>> from fnmatch import fnmatch
>>> [i for i in urls if fnmatch(i,'*.html')]
['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html', 
 'http://123.45.67.89/alwefaoewifiasdof224a/v.html', 
 'http://123.45.67.89/baoisdbfai235oubodsfb45/v.html'] 
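
Both filters above compare against the whole URL string, so they would miss a URL that carries a query string after .html. The question says the URLs end with .html, so this may never come up. The sketch below only shows one way to test the path component instead, using the standard library's urllib.parse. The helper name is_html_url and the ?token=abc URL are made up for illustration.

```python
from urllib.parse import urlparse


def is_html_url(url):
    """Return True if the path part of the URL ends with .html (hypothetical helper)."""
    # urlparse splits off the query string, so only the path is tested.
    return urlparse(url).path.endswith('.html')


urls = ['http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html',
        'http://123.45.67.89/alwefaoewifiasdof224a/v.html?token=abc',  # invented example
        'http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg']

html_urls = [u for u in urls if is_html_url(u)]
print(html_urls)
```

This keeps the first two URLs and drops the jpg. A plain endswith('.html') on the full string would also drop the second one.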

If the format is always the same, with each URL written as {file:"url", capture the substring between the quotes that follow {file: :

import re

s="""<script>
    [irrelevant javascript code here]
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    [irrelevant javascript code here]
</script>"""


print(re.findall(r'\{file:"(.*?)"', s))
['http://url.com/folder1/v.html', 'http://url.com/folder2/v.html', 'http://url.com/folder3/v.html']

You could also limit the string you search by splitting once on sources:[ :

s="""<script>
    [irrelevant javascript code here]
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"},
    {file:"http://url.com/folder3/v.html",label:"label3"}],
    [irrelevant javascript code here]
</script>"""

print(re.findall(r'\{file:"(.*?)"', s.split("sources:[", 1)[1]))

This discards everything before sources:[ , presuming there is no other sources:[ in the document.
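
The split only trims what comes before sources:[ , so a later file: entry outside the array would still match. A minimal sketch of bounding the search on both sides is below. It assumes the array closes at the first ] after sources:[ (so no ] inside a URL). The function name urls_in_sources and the poster and tracks lines in the sample are invented for illustration.

```python
import re

s = """<script>
    poster:"http://url.com/i/thumb.jpg",
    sources:[{file:"http://url.com/folder1/v.html",label:"label1"},
    {file:"http://url.com/folder2/v.html",label:"label2"}],
    tracks:[{file:"http://url.com/subs/en.vtt"}]
</script>"""


def urls_in_sources(text):
    """Return the file URLs found inside the first sources:[...] array."""
    # DOTALL lets . cross newlines, since the array spans several lines.
    match = re.search(r'sources:\[(.*?)\]', text, re.DOTALL)
    if match is None:
        return []
    # Only the text between the brackets is searched for file:"...".
    return re.findall(r'file:"(.*?)"', match.group(1))


print(urls_in_sources(s))
```

With this sample, the two folder URLs are returned and the file: entry under tracks is ignored.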

Something like this?

re.findall(r'http://[^t][^\s"]+\.html', document)
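
As a quick check of that pattern, here is a small sketch run against a made-up document. It assumes the character class is [^\s"] (anything except whitespace or a double quote). The poster and link lines, including the tracker.example domain, are invented to stand in for the jpg URL and the URLs whose domain starts with t.

```python
import re

# Sample text; only the two sources entries should be extracted.
document = '''
poster:"http://123.45.67.89/i/0123/12345/aoief243oinsdf.jpg",
link:"http://tracker.example/page.html",
sources:[{file:"http://123.45.67.89/asodibfo3ribawoifbadsoifasdf3/v.html",label:"label1"},
{file:"http://123.45.67.89/alwefaoewifiasdof224a/v.html",label:"label2"}],
'''

# [^t] rejects hosts starting with t; the trailing \.html rejects the jpg.
urls = re.findall(r'http://[^t][^\s"]+\.html', document)
print(urls)
```

The jpg URL fails because nothing before its closing quote ends in .html, and the tracker.example URL fails on [^t].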
