
Python Regular Expression / Middle word in result

I have a problem with unwanted text in my output. I want to extract only the http/https URLs from a file. My code is:

import sys
import os
import hashlib
import re

if len(sys.argv) < 2 :
    # Polish: "To use, type: python <script> filename"
    sys.exit('Aby uzyc wpisz: python %s filename' % sys.argv[0])

if not os.path.exists(sys.argv[1]):
    # Polish: 'ERROR!: File "<name>" not found!'
    sys.exit('BLAD!: Plik "%s" nie znaleziony!' % sys.argv[1])

with open(sys.argv[1], 'rb') as f:
    plik = f.read()
    print("MD5: %s" % hashlib.md5(plik).hexdigest())
    print("SHA1: %s" % hashlib.sha1(plik).hexdigest())
    print("SHA256: %s" % hashlib.sha256(plik).hexdigest())
    # Polish: "Suspicious links:"
    print("Podejrzane linki: \n")
    pliki = open(sys.argv[1], 'r')
    for line in pliki:
        if re.search("(H|h)ttps:(.*)",line):
            print(line)
        elif re.search("(H|h)ttp:(.*)",line):
            print(line)
    pliki.close()

The result is:

MD5: f16a93fd2d6f2a9f90af9f61a19d28bd
SHA1: 0a9b89624696757e188412da268afb2bf5b600aa
SHA256: 3b365deb0e272146f00f9d723a9fd4dbeacddc10123aec8237a37c10c19fe6df
Podejrzane linki: 

        GrizliPolSurls = "http://xxx.xxx.xxx.xxx" 

        FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False

I want only the strings that are enclosed in "" and start with http or https, e.g. http://xxx.xxx.xxx.xxx

Desired result:

MD5: f16a93fd2d6f2a9f90af9f61a19d28bd
SHA1: 0a9b89624696757e188412da268afb2bf5b600aa
SHA256: 3b365deb0e272146f00f9d723a9fd4dbeacddc10123aec8237a37c10c19fe6df
Podejrzane linki: 
http://xxx.xxx.xxx.xxx
https://xxx.xxx.xxx.xxx

You can use re.findall with the following regex (explained on regex101):

"([Hh]ttps?.*?)"

so:

import re
s = '''MD5: f16a93fd2d6f2a9f90af9f61a19d28bd
SHA1: 0a9b89624696757e188412da268afb2bf5b600aa
SHA256: 3b365deb0e272146f00f9d723a9fd4dbeacddc10123aec8237a37c10c19fe6df
Podejrzane linki: 

        GrizliPolSurls = "http://xxx.xxx.xxx.xxx" 

        FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False'''
# findall returns only the captured group, i.e. the text between the quotes
urls = re.findall('"([Hh]ttps?.*?)"', s)
print(urls)  # ['http://xxx.xxx.xxx.xxx', 'https://xxx.xxx.xxx.xxx']
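To connect this back to the script in the question, here is a minimal sketch that wraps the same pattern in a helper. The names QUOTED_URL and extract_quoted_urls are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer above, compiled once. The capture group
# holds the text between the double quotes.
QUOTED_URL = re.compile(r'"([Hh]ttps?.*?)"')

def extract_quoted_urls(text):
    """Return every double-quoted string in text that starts with http/https."""
    return QUOTED_URL.findall(text)

sample = '''        GrizliPolSurls = "http://xxx.xxx.xxx.xxx" 

        FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False'''

# Print one URL per line, as in the desired output of the question
for url in extract_quoted_urls(sample):
    print(url)
```

In the original script, the line-by-line loop could probably be replaced by a loop over extract_quoted_urls(plik.decode(errors='replace')), assuming the file is text-like enough to decode. That would also avoid opening the file a second time.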

You need this pattern: (?<=")http[^"]+

(?<=") - positive lookbehind; asserts that a " precedes the current position.

http - matches http literally.

[^"]+ - matches everything up to the next " ; this negated character class technique avoids the need for a lazy quantifier :)

Demo
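The lookbehind pattern can be tried out in Python roughly as follows. This is a small sketch using one of the sample lines from the question; the variable names are only illustrative.

```python
import re

line = 'FilnMoviehttpsd.Open "GET", "https://xxx.xxx.xxx.xxx",False'

# (?<=") requires a double quote immediately before "http" without
# including it in the match; [^"]+ then runs up to the closing quote.
lookbehind_urls = re.findall(r'(?<=")http[^"]+', line)
print(lookbehind_urls)  # ['https://xxx.xxx.xxx.xxx']
```

Note that the "https" inside FilnMoviehttpsd is skipped because it is not preceded by a quote. As written, the pattern is case-sensitive, so a URL starting with "Http" would need [Hh]ttp instead, and [^"]+ will run to the end of the input if the closing quote is missing.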

re.search() returns a match object (or None if nothing matches), not the matched text.

You have to extract the matched text from the result with .group():

import re

line = "my text line contains a http://192.168.1.1 magic url"
# raw string, so that \d and \. reach the regex engine unchanged
result = re.search(r"[Hh]ttps?://\d+\.\d+\.\d+\.\d+", line)
print(result.group())  # will print http://192.168.1.1
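Since re.search() returns None for a line without a match, calling .group() directly raises AttributeError on such lines. A minimal sketch of a guard, with first_url as a hypothetical helper name:

```python
import re

def first_url(line):
    """Return the first http/https URL with a dotted-number host, or None."""
    match = re.search(r"[Hh]ttps?://\d+\.\d+\.\d+\.\d+", line)
    # re.search returns None when nothing matches, so check before .group()
    return match.group() if match else None

print(first_url("my text line contains a http://192.168.1.1 magic url"))
print(first_url("no url on this line"))
```

This pattern only matches hosts written as four dot-separated numbers, so it assumes the xxx.xxx.xxx.xxx placeholders in the question stand for IP addresses.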
