Python script to find Indicators of Compromise in a txt file and write the results to a text file

This is my first post to Stack Overflow, so I'm a super noob. I'm working on a script that reads a file (emails related to malware analysis), then uses regex to identify IP addresses, MD5 hashes, and domain names.

Here's my script so far:

import re  # import the regex library

with open('email_with_IOCs.txt', 'r') as fobj:  # open the file to search for IOCs
    text = fobj.read()  # read the IOC file

# raw strings keep the backslashes in the patterns from being treated as string escapes
ip_address = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', text)  # find all the IPs
md5hash = re.findall(r'[a-fA-F0-9]{32}', text)  # find all the MD5
domain = re.findall(r'[a-zA-Z0-9-]{1,61}[a-zA-Z0-9]\.[a-zA-Z]{2,}', text)  # find all the domains

with open('iocs.txt', 'w') as iocs:  # open a file to write to
    iocs.write(str(ip_address) + str(md5hash) + str(domain) + '\n')  # write all the IOCs to a file

# both files are closed automatically when their with-blocks end

Here are the issues I'm trying to resolve:

  1. I want the output to have one IP Address, MD5, or domain per line in the output file.

  2. Some of the indicators of compromise are obfuscated for safety with brackets. Ex-1. [http:]//www.mcafeea[.]cf/tools.zip, Ex-2. 118.99.37[.]190. I need to remove the brackets so I don't miss IPs or domains.

  3. My domain name regex is matching file names as well as domains. Ex-1. stuff.dll, Ex-2. setup.exe. I'd like to read in all the TLDs (Top Level Domains) as a list and use the TLD list to separate domains from file names.

Questions 1 and 3: There is a Python package that parses IOCs (Indicators of Compromise) from text here: https://github.com/fhightower/ioc-finder . The regex for hostnames in this package includes a list of valid TLDs.
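If you'd rather see the idea in plain standard-library Python first, here is a rough sketch. It is not how ioc-finder works internally. The names KNOWN_TLDS, find_domains and write_iocs are made up for illustration, and the TLD set is a tiny placeholder; a real script would load the full TLD list from a file.

```python
import re

# Hypothetical, deliberately tiny allow-list (an assumption for this sketch).
# In practice, read the complete TLD list from a file into this set.
KNOWN_TLDS = {"com", "net", "org", "cf"}

# One or more dot-terminated labels followed by an alphabetic final label.
# Group 1 captures the final label so it can be checked against the TLD set.
DOMAIN_RE = re.compile(r"\b(?:[a-zA-Z0-9-]{1,63}\.)+([a-zA-Z]{2,})\b")


def find_domains(text, tlds=KNOWN_TLDS):
    """Return hostname-like matches whose last label is in the TLD set."""
    return [
        match.group(0)
        for match in DOMAIN_RE.finditer(text)
        if match.group(1).lower() in tlds
    ]


def write_iocs(path, *groups):
    """Write every indicator from every group on its own line."""
    with open(path, "w") as out:
        for group in groups:
            for ioc in group:
                out.write(ioc + "\n")
```

With this, something like write_iocs('iocs.txt', ip_address, md5hash, find_domains(text)) would give one indicator per line. Keep in mind that a TLD list cannot separate everything: a few file extensions, such as .zip and .com, are also valid TLDs.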

Question 2: To remove the obfuscation on indicators of compromise (a process which is called "fanging" or "refanging"), there is a package to do this in a robust and systematic way: https://github.com/ioc-fang/ioc_fanger .
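To show what that means in practice, here is a minimal standard-library sketch that handles only the two bracket styles from the question. It is not the ioc_fanger implementation, and the name refang is just a hypothetical helper; real-world defanging comes in many more variants than this covers.

```python
import re

# Matches "[.]", "[:]", "[http:]" and "[https:]" and captures what is inside
# the brackets. This short list is an assumption based on the examples in the
# question, not an exhaustive set of defanging styles.
BRACKETED = re.compile(r"\[(\.|:|https?:)\]", re.IGNORECASE)


def refang(text):
    """Drop the safety brackets so the usual IP and domain regexes can match."""
    return BRACKETED.sub(r"\1", text)


print(refang("[http:]//www.mcafeea[.]cf/tools.zip"))
print(refang("118.99.37[.]190"))
```

Calling refang on the file contents before running the re.findall searches means the bracketed IPs and domains are no longer missed.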

Full disclosure: I'm the one working on this package; feel free to raise an issue if you have any ideas.
