Extract Amazon ASIN from URL, RE, python
I have a huge list of URLs linking to Amazon products. Each URL contains a piece of information I need, called the ASIN.
I understand that one of the best ways to extract it is with regular expressions. I found a pattern in the URLs that could help:
1- https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr1
2- https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-1-fkmr2
3- https://www.amazon.com/adidas-Game-Mode-Polo-Multi-Sport/gp/B07R23QGH6/ref=sr_1_fkmr2_2?dchild=1&keywords=Adidas+M%C3%A8lange+Tech+T-Shirt+A372&qid=1579685244&sr=8-2-fkmr2
The respective ASINs are:
1- B07P4LVZNL, located in: dp/B07P4LVZNL/ref=sr_1_f
2- B07DXPN7TK, located in: dp/B07DXPN7TK/ref=sr_1_fkmr2_
3- B07R23QGH6, located in: gp/B07R23QGH6/ref=sr_1_fkmr2_
I tried this code:
asin = re.match("http[s]?://www.amazon.com(\w+)(.*)/(dp|gp/product)/(?P<asin>\w+).*", href, flags=re.IGNORECASE)
href is the variable where I have stored the URLs.
But it doesn't work well; this is the kind of result I get:
<re.Match object; span=(0, 175), match='https://www.amazon.com/adidas-Originals-Solid-Mel>
<re.Match object; span=(0, 171), match='https://www.amazon.com/adidas-Game-Mode-Polo-Mult>
<re.Match object; span=(0, 167), match='https://www.amazon.com/adidas-Tech-Tee-Black-X-La>
Thank you for your help.
I suggest using
/[dg]p/([^/]+)
It matches /dp/ or /gp/ and then captures into Group 1 any one or more characters other than /.
See the regex demo. In Python:
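import re

# href holds one URL; search anywhere in it for /dp/ or /gp/
asin = re.search(r'/[dg]p/([^/]+)', href, flags=re.IGNORECASE)
if asin:
    # Group 1 is the path segment right after /dp/ or /gp/
    print(asin.group(1))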
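To apply this to the whole list, the search can be wrapped in a small helper. The following is only a sketch: the names ASIN_RE, extract_asin and urls are made up for illustration, and the pattern is stricter than the one above on the assumption that an ASIN is exactly 10 alphanumeric characters and that it may also follow /gp/product/ (the form the question's original pattern targeted). The sample URLs are the three from the question with the query string dropped.

```python
import re

# Assumption: an ASIN is exactly 10 alphanumeric characters and is followed
# by "/", "?" or the end of the URL. Loosen this if your data differs.
ASIN_RE = re.compile(r'/(?:dp|gp(?:/product)?)/([A-Za-z0-9]{10})(?=[/?]|$)')


def extract_asin(url):
    """Return the ASIN found in url, or None if the pattern does not match."""
    m = ASIN_RE.search(url)
    return m.group(1) if m else None


# The three URLs from the question, without their query strings.
urls = [
    "https://www.amazon.com/adidas-Melange-Performance-T-Shirt-Charcoal/dp/B07P4LVZNL/ref=sr_1_fkmr1_2",
    "https://www.amazon.com/adidas-Originals-Solid-Melange-Purple/dp/B07DXPN7TK/ref=sr_1_fkmr2_1",
    "https://www.amazon.com/adidas-Game-Mode-Polo-Multi-Sport/gp/B07R23QGH6/ref=sr_1_fkmr2_2",
]

asins = [extract_asin(u) for u in urls]
print(asins)  # ['B07P4LVZNL', 'B07DXPN7TK', 'B07R23QGH6']
```

Returning None for non-matching URLs lets you filter out links that are not product pages instead of raising an error partway through a large list.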