
How to filter out specific strings from a string

Python beginner here. I'm stumped on part of this code for a bot I'm writing.

I am making a Reddit bot using PRAW to comb through posts and remove a specific set of characters (Steam CD keys).

I made a test post here: https://www.reddit.com/r/pythonforengineers/comments/91m4l0/testing_my_reddit_scraping_bot/

This should have all the formats of keys.

Currently, my bot is able to find the post using a regular expression. I have these variables:

steamKey15 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w')
steamKey25 = (r'\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.\w\w\w\w\w.')
steamKey17 = (r'\w\w\w\w\w\w\w\w\w\w\w\w\w\w\w\s\w\w')

I am finding the text using this:

subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):

    if submission.id not in steamKeyPostID:        
        if re.search(steamKey15, submission.selftext, re.IGNORECASE):
            searchLogic()
            saveSteamKey()

This is just to show that the pieces I should be using in a filter function are a combination of steamKey15/25/17 and submission.selftext.

Here is the part where I am confused. I can't find a function that does what I want. My goal is to remove all the text from submission.selftext (the body of the post) except the keys, which will eventually be saved in a .txt file.

Any advice on a good way to go about this? I've looked into re.sub and .translate, but I don't understand how the parts fit together.

I am using Python 3.7 if it helps.

Can't you just get the regexp results?

m = re.search(steamKey15, submission.selftext, re.IGNORECASE)
if m:
    print(m.group(0))

Also note that a dot . means any character in a regexp (except a newline, by default). If you want to match only literal dots, you should use \. instead. You can probably write your regexp like this:

r'\w{5}[-.]\w{5}[-.]\w{5}' 

This will match the key when its groups are separated by either . or - .
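To make the difference concrete, here is a small sketch comparing the bare-dot pattern with the [-.] version. The sample strings are made up for illustration and are not real keys.

```python
import re

# A bare "." matches any character; the class "[-.]" matches only "-" or ".".
loose = re.compile(r'\w{5}.\w{5}.\w{5}')
strict = re.compile(r'\w{5}[-.]\w{5}[-.]\w{5}')

# Made-up sample strings (assumptions for illustration only).
samples = [
    "ABCDE-FGHIJ-KLMNO",   # dash-separated key
    "ABCDE.FGHIJ.KLMNO",   # dot-separated key
    "hello world again",   # three ordinary 5-letter words
]

loose_hits = [s for s in samples if loose.search(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(loose_hits)   # all three strings, including the plain sentence
print(strict_hits)  # only the two key-like strings
```

The bare-dot pattern accepts the space-separated words because a space is "any character" too, which is why restricting the separator cuts down on false positives.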

Note that this will also match anything that begins or ends with a key, or has a key in the middle. That can cause you problems, as your 15-char key regexp is contained in the 25-char one! To fix that, use negative lookahead/negative lookbehind:

r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])'

That will only find the keys if there is no word character, dot, or dash immediately before or after them.
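A quick sketch of the overlap problem and how the lookarounds avoid it. Both keys below are invented placeholders.

```python
import re

plain = re.compile(r'\w{5}[-.]\w{5}[-.]\w{5}')
guarded = re.compile(r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])')

# Invented placeholder keys (not real Steam keys).
key25 = "AAAAA-BBBBB-CCCCC-DDDDD-EEEEE"
key15 = "FFFFF-GGGGG-HHHHH"
text = f"long key: {key25} and short key: {key15}"

plain_hits = plain.findall(text)
guarded_hits = guarded.findall(text)

print(plain_hits)    # ['AAAAA-BBBBB-CCCCC', 'FFFFF-GGGGG-HHHHH']
print(guarded_hits)  # ['FFFFF-GGGGG-HHHHH']
```

Without the lookarounds, the 15-character pattern grabs the first three groups of the 25-character key. One side effect worth knowing: because . is in the lookahead class, a key immediately followed by a sentence-ending period would be rejected as well, so you may want to adjust the class to suit how keys appear in your posts.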

Another hint is to use re.findall instead of re.search, since some posts contain more than one Steam key! findall will return all matches, while search only returns the first one.
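For example, with a made-up post body containing two placeholder keys, the two functions behave like this:

```python
import re

key15 = re.compile(r'(?<![\w.-])\w{5}[-.]\w{5}[-.]\w{5}(?![\w.-])')

# Made-up post body with two placeholder keys.
body = "First: AAAAA-BBBBB-CCCCC\nSecond: DDDDD.EEEEE.FFFFF\n"

first = key15.search(body)      # a match object for the first hit, or None
all_keys = key15.findall(body)  # a list of every matching string

print(first.group(0))  # AAAAA-BBBBB-CCCCC
print(all_keys)        # ['AAAAA-BBBBB-CCCCC', 'DDDDD.EEEEE.FFFFF']
```

Note that findall returns plain strings here only because the pattern has no capturing groups; if you add parentheses that capture, it returns the captured groups instead.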

A couple of things first: . means any character in regex. I think you know that, but just to be sure. Also, \w\w\w\w\w can be replaced with \w{5}, which specifies 5 word characters (letters, digits, or underscore). I would use re.findall.

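import re

# Assumes `reddit` (an authenticated praw.Reddit instance) and
# `steamKeyPostID` (IDs of posts already handled) are defined as in the question.
steamKey15 = r'(?:\w{5}.){2}\w{5}'
steamKey25 = r'(?:\w{5}.){4}\w{5}'  # no trailing separator required after the last group
steamKey17 = r'\w{15}\s\w\w'

subreddit = reddit.subreddit('pythonforengineers')
for submission in subreddit.new(limit=20):
    if submission.id not in steamKeyPostID:
        # findall returns every match in the post body as a list of strings
        finds_15 = re.findall(steamKey15, submission.selftext)
        finds_25 = re.findall(steamKey25, submission.selftext)
        finds_17 = re.findall(steamKey17, submission.selftext)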
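Putting the pieces together, here is one possible sketch of keeping only the keys from a post body and writing them to a text file. The helper names extract_keys and save_keys, the sample body, and the combined patterns are assumptions for illustration, not part of the original code; in the bot you would pass submission.selftext instead of the sample string.

```python
import re

# Patterns combine the suggestions above: explicit separators plus lookarounds
# so the 15-char pattern does not fire inside a 25-char key.
PATTERNS = [
    re.compile(r'(?<![\w.-])(?:\w{5}[-.]){4}\w{5}(?![\w.-])'),  # 25-char format
    re.compile(r'(?<![\w.-])(?:\w{5}[-.]){2}\w{5}(?![\w.-])'),  # 15-char format
    re.compile(r'(?<!\w)\w{15}\s\w{2}(?!\w)'),                  # 17-char format
]

def extract_keys(text):
    """Return every key-like string found in text, grouped by pattern order."""
    keys = []
    for pattern in PATTERNS:
        keys.extend(pattern.findall(text))
    return keys

def save_keys(keys, path):
    """Append one key per line to the file at path."""
    with open(path, 'a', encoding='utf-8') as f:
        for key in keys:
            f.write(key + '\n')

# Made-up post body with placeholder keys.
body = ("Giving away AAAAA-BBBBB-CCCCC-DDDDD-EEEEE and "
        "FFFFF-GGGGG-HHHHH, enjoy! Also ABCDEFGHIJKLMNO PQ")
keys = extract_keys(body)
print(keys)
```

Calling save_keys(keys, 'steam_keys.txt') (a hypothetical filename) would then append the three keys to that file, one per line. The 17-character pattern is the loosest of the three, so it is the one most likely to need tightening against real posts.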


 