I have a problem that is currently driving me nuts. I have a list with a couple of million entries, and I need to extract the product categories from them. Each entry looks like this: "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
A type check confirms that each entry is a string: print(type(item))
<class 'str'>
I then searched online for a possible (and preferably fast, because of the millions of entries) regex solution to extract all the categories.
I found several questions here, such as "Match single quotes from python re", and tried re.findall(r"'(\\w+)'", item)
but only got an empty list: []
Then I looked for alternative methods such as "Python Regex to find a string in double quotes within a string", where someone tries the following:
matches = re.findall(r'\\"(.+?)\\"', item)
print(matches)
This failed in my case as well.
After that I tried a crude workaround, planning to solve the problem properly later: list_cat_split = item.split(',')
which gives me
["[['Electronics'"," 'Computers & Accessories'"," 'Cables & Accessories'"," 'Memory Card Adapters']]"]
Then I tried string methods to get rid of the stuff and then apply a regex:
list_categories = []
for item in list_cat_split:
    item.strip('\"')
    item.strip(']')
    item.strip('[')
    item.strip()
    category = re.findall(r"'(\w+)'", item)
    if category not in list_categories:
        list_categories.append(category)
However, even this approach failed: [['Electronics'], []]
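For reference, here is a minimal sketch of why the single-quote pattern only finds 'Electronics', together with one possible pattern that accepts spaces and ampersands. The helper name extract_quoted is hypothetical, and the sketch assumes that no category name contains an apostrophe.

```python
import re

item = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"

# \w+ matches only letters, digits and underscores, so every category
# that contains a space or '&' is skipped.
print(re.findall(r"'(\w+)'", item))  # ['Electronics']


def extract_quoted(text):
    """Return everything found between pairs of single quotes.

    Hypothetical helper; assumes category names never contain an apostrophe.
    """
    # [^']* matches any run of characters up to the next single quote.
    return re.findall(r"'([^']*)'", text)


print(extract_quoted(item))
```

Note also that str.strip() returns a new string rather than modifying item in place, so the four strip calls in the loop above have no effect unless their result is assigned back.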
I searched further but did not find a proper solution. Sorry if this question is completely stupid; I am new to regex, and this is probably a no-brainer for regular regex users.
UPDATE:
Somehow I cannot answer my own question, so here is an update. Thanks for the answers, and sorry for the incomplete information; I very rarely ask here and usually try to find solutions on my own. I do not want to use a database, because this is only a small part of the preprocessing for an ML application that is written entirely in Python. This is also for my MSc project, so there is no production environment. I am therefore fine with a slower but working solution, since I only need to run it once. As far as I can see, the solution from @FailSafe worked for me (see the screenshots of my Jupyter notebook and of the resulting list).
But yes, I totally agree with @Wiktor Stribiżew: in a production setup, I would certainly set up a database and let this run overnight. Thanks for all the help anyway, great people here :-)
This may not be your final answer, but it creates a list of categories.
x = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
y = x[2:-2]  # drop the outer "[[" and "]]"
z = y.split(',')  # each piece still has its quotes and any leading space
for item in z:
    print(item)
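Since each entry is the text of a Python list literal, another option is to parse it with ast.literal_eval from the standard library instead of slicing and splitting. The sketch below assumes every entry is a well-formed nested list of strings; the function names parse_categories and unique_categories are hypothetical.

```python
import ast


def parse_categories(item):
    """Parse one entry into a flat list of category names.

    Assumes the entry is the text of a list of lists of strings.
    """
    nested = ast.literal_eval(item)
    return [name for path in nested for name in path]


def unique_categories(items):
    """Collect each category once, in order of first appearance."""
    seen = set()
    ordered = []
    for item in items:
        for name in parse_categories(item):
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered


x = "[['Electronics', 'Computers & Accessories', 'Cables & Accessories', 'Memory Card Adapters']]"
print(parse_categories(x))
```

This returns the names without quotes or leading spaces. It is probably not the fastest approach over millions of entries, which should be acceptable for a one-off preprocessing run.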