REGEX to find all matches inside a given string

I have a problem that is currently driving me nuts. I have a list with a couple of million entries, and I need to extract the product categories from them. Each entry looks like this:

"[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

A type check confirmed that each entry is a string:

print(type(item))
<class 'str'>

I then searched online for a regex solution (preferably a fast one, because of the millions of entries) to extract all the categories.

I found several questions here. Following "Match single quotes from python re", I tried re.findall(r"'(\\w+)'", item) but only got an empty list []. Then I searched for alternative methods, such as "Python Regex to find a string in double quotes within a string". There someone uses the following:

matches=re.findall(r'\\"(.+?)\\"',item)
print(matches)

This failed in my case as well.
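A minimal sketch of why these attempts may come back empty or incomplete, assuming the pattern was typed exactly as quoted above (with a doubled backslash inside a raw string). The names `word_only`, `doubled` and `categories` are only illustrative and not part of the original question.

```python
import re

item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

# \w matches only letters, digits and underscore, so any name that
# contains a space or "&" cannot match between the two quotes.
word_only = re.findall(r"'(\w+)'", item)

# In a raw string, \\w means a literal backslash followed by "w".
# The entry contains no backslash, so nothing matches.
doubled = re.findall(r"'(\\w+)'", item)

# [^']+ accepts any run of characters up to the next single quote.
categories = re.findall(r"'([^']+)'", item)

print(word_only)
print(doubled)
print(categories)
```

The last pattern assumes that no category name itself contains a single quote.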

After that I tried a crude approach to get at least a workaround, planning to solve the problem properly later: list_cat_split = item.split(',') which gives me

["[['Electronics'", " 'Computers & Accessories'", " 'Cables & Accessories'", " 'Memory Card Adapters']]"]

Then I tried string methods to get rid of the stuff and then apply a regex:

list_categories = []
for item in list_cat_split:
    item.strip('\"')
    item.strip(']')
    item.strip('[')
    item.strip()
    category = re.findall(r"'(\w+)'", item)
    if category not in list_categories:
        list_categories.append(category)

However, even this approach failed: [['Electronics'], []]. I searched further but did not find a proper solution. Sorry if this question is completely stupid; I am new to regex, and this is probably a no-brainer for regular regex users.
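One likely reason the loop above falls short: `str.strip` returns a new string rather than changing `item` in place, so the four strip calls have no effect. In addition, `re.findall` returns a list, so lists rather than names end up in `list_categories`. Here is a sketch of the same split-and-strip idea with the results kept. The variable names `part` and `cleaned` are hypothetical.

```python
item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

list_categories = []
for part in item.split(','):
    # Strings are immutable, so the value returned by strip() must be kept.
    # First drop surrounding whitespace, then any leading or trailing
    # brackets and single quotes.
    cleaned = part.strip().strip("[]'")
    if cleaned and cleaned not in list_categories:
        list_categories.append(cleaned)

print(list_categories)
```

This assumes that no category name contains a comma; a name with a comma would be split in two.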

UPDATE:

Somehow I cannot answer my own question, so here is an update. Thanks for the answers, and sorry for the incomplete information; I very rarely ask here and usually try to find solutions on my own. I do not want to use a database, because this is only a small fraction of the preprocessing work for an ML application that is written entirely in Python. Also, this is for my MSc project, so there is no production environment. I am therefore fine with a slower but working solution, since I only need to run it once. As far as I can see, the solution of @FailSafe worked for me (screenshot of my Jupyter notebook, and the result with list).

But yes, I totally agree with @Wiktor Stribiżew: in a production setup, I would for sure set up a database and let this run overnight. Thanks for all the help anyway, great people here :-)

This may not be your final answer, but it creates a list of categories.

x="[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

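# Drop the outer "[[" and "]]", then split on the commas between entries.
y=x[2:-2]
z=y.split(',')

for item in z:
    # Remove the leading space and the surrounding single quotes.
    print(item.strip().strip("'"))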
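Since each entry is written like a Python literal, another option (not taken from the answer above) may be `ast.literal_eval` from the standard library, which parses the string into a real nested list. Below is a rough sketch. `extract_categories` and `entries` are hypothetical names, and `entries` simply repeats the sample entry to stand in for the real list.

```python
import ast

x = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

def extract_categories(entry):
    """Parse one entry string and return its category names as a flat list."""
    # literal_eval only evaluates literals (lists, strings, numbers, ...);
    # it does not execute arbitrary code.
    nested = ast.literal_eval(entry)
    return [name for path in nested for name in path]

# Hypothetical stand-in for the real list with millions of entries.
entries = [x, x]

# A set keeps each category once, with fast membership checks.
unique_categories = set()
for entry in entries:
    unique_categories.update(extract_categories(entry))

print(extract_categories(x))
print(sorted(unique_categories))
```

`ast.literal_eval` raises `ValueError` or `SyntaxError` on malformed input, so real data may need a try/except around the call.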
