
Splitting strings in a list by the first 10 characters, Python

I am trying to get the ASIN for each product on Amazon, which is the first ten characters after dp/. I have gotten to the point where I have the ASIN, but there is still junk after it. Any help?

product_lst = [
            "https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
            "https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
            "https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
            "https://www.amazon.com/dp/B089RDSML3",
            "https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
        ]
for url in product_lst:
    product_lst = url.split("dp/")
    for url in product_lst:
        del product_lst[::2]
    print(product_lst)

Output:

['B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ']
['B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals']
['B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae']
['B089RDSML3']
['B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2']

For searching in text, the re (regex) module is a good choice:

product_lst = [
"https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
"https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
"https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
"https://www.amazon.com/dp/B089RDSML3",
"https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
]

import re

results = []
for url in product_lst:
    # capture everything after /dp/ up to the next "/" or "?"
    m = re.search(r"/dp/([^/?]+)", url)
    if m:
        results.append(m.group(1))
print(results)

Output:

['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']

I use r"/dp/([^/?]+)" as the pattern, which boils down to: match /dp/ literally, then capture, as a group, everything that follows up to the next / or ?.
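If you want the pattern to enforce the ten-character length as well, a stricter variant is sketched below. It assumes an ASIN is exactly 10 uppercase letters or digits, which holds for the sample URLs but is an assumption, not something confirmed here. The names ASIN_PATTERN and extract_asin, and the last URL in the list, are made up for illustration.

```python
import re

# Assumption: an ASIN is exactly 10 uppercase letters or digits,
# followed by "/", "?" or the end of the URL.
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?=[/?]|$)")

def extract_asin(url):
    """Return the ASIN found after /dp/, or None if the URL has none."""
    m = ASIN_PATTERN.search(url)
    return m.group(1) if m else None

urls = [
    "https://www.amazon.com/dp/B089RDSML3",
    "https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
    "https://www.amazon.com/some-page-without-a-product",  # hypothetical URL
]
print([extract_asin(u) for u in urls])
# ['B089RDSML3', 'B079QHML21', None]
```

Returning None for URLs without a match keeps one bad URL from silently producing a wrong ASIN.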

You can test regexes online. I use http://regex101.com (for complex ones); it can even generate Python code based on what you enter in its fields (I don't use that feature, though ;o) ).


You can change your own code to

for url in product_lst:
    part = url.split("dp/")
    if len(part) > 1:            # "dp/" was found, so there are 2 or more parts
        print(part[1])           # print what is left after dp/

to avoid overwriting your list product_lst, but you will still need to trim off everything from the next / or ? onward.
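One way to do that trimming without changing the split-based approach is sketched here. The helper name asin_from_split is made up for illustration, and the code assumes the ASIN ends at the first / or ? after dp/.

```python
import re

def asin_from_split(url):
    """Split on 'dp/' and trim the remainder at the first '/' or '?'."""
    parts = url.split("dp/")
    if len(parts) < 2:
        return None  # no 'dp/' in this URL
    # maxsplit=1 cuts at the first '/' or '?' only; keep the piece before it
    return re.split(r"[/?]", parts[1], maxsplit=1)[0]

print(asin_from_split("https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ"))
# B07R2CNSTK
```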

After you split() on 'dp/', there is absolutely no reason to loop. You know exactly where the data you want is, so just get it directly:

product_lst = [
            "https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
            "https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
            "https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
            "https://www.amazon.com/dp/B089RDSML3",
            "https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
        ]
for url in product_lst:
    split_lst = url.split("dp/")
    print(split_lst[1][:10])

I assume that the ASIN is always 10 characters. Adjust the slice if the length is different but still fixed. Otherwise you will need to find a different approach.
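If the length cannot be relied on, one possible different approach is to parse the URL and take the path segment that follows dp. This is only a sketch: it assumes the ASIN is always the segment directly after dp, and the helper name asin_from_path is made up for illustration.

```python
from urllib.parse import urlsplit

def asin_from_path(url):
    """Return the path segment that follows 'dp', or None if absent."""
    # urlsplit separates the query string, so only "/" needs handling here
    segments = urlsplit(url).path.split("/")
    if "dp" in segments:
        i = segments.index("dp")
        if i + 1 < len(segments) and segments[i + 1]:
            return segments[i + 1]
    return None

print(asin_from_path("https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA"))
# B08SLYY1WD
```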

You can directly get the ASIN without splitting the data.

product_lst = [
            "https://www.amazon.com/Bentgo-Kids-Prints-Camouflage-5-Compartment/dp/B07R2CNSTK/ref=zg_bs_toys-and-games_home_2?_encoding=UTF8&psc=1&refRID=S3ESVW604M2GF8VYYVAZ",
            "https://www.amazon.com/Hamdol-Inflatable-Swimming-Sprinkler-Full-Sized/dp/B08SLYY1WD/?_encoding=UTF8&smid=AYKJMONAWDIKA&pf_rd_p=287d7433-71c6-4904-99b3-55833d0daaa0&pd_rd_wg=lMKJu&pf_rd_r=CR8F460JV643467SAG8Q&pd_rd_w=KgWnp&pd_rd_r=0e298b4a-6e52-4688-87bb-482fb6c1a56b&ref_=pd_gw_deals",
            "https://www.amazon.com/Fire-TV-Stick-4K-with-Alexa-Voice-Remote/dp/B079QHML21?ref=deals_primeday_deals-grid_slot-5_21f9_dt_dcell_img_0_ca4a9dae",
            "https://www.amazon.com/dp/B089RDSML3",
            "https://www.amazon.com/Lucky-Brand-Burnout-Notch-Shirt/dp/B081J8SGH7/ref=sr_1_2?dchild=1&pf_rd_i=7147441011&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=e6aa97f3-9bc4-42c5-ac38-37844f71b469&pf_rd_r=S2F3A95JN2FDGBQ4V048&pf_rd_s=merchandised-search-9&pf_rd_t=101&qid=1624427428&s=apparel&sr=1-2"
             ]

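ASIN = []
for url in product_lst:
    idx = url.find("/dp/")
    if idx != -1:  # find() returns -1 when "/dp/" is absent
        # skip the 4 characters of "/dp/", then take the next 10
        ASIN.append(url[idx + 4:idx + 14])

print(ASIN)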

Output:

['B07R2CNSTK', 'B08SLYY1WD', 'B079QHML21', 'B089RDSML3', 'B081J8SGH7']

