Trying to find unique subarrays and sub-elements?
I have an array, dirty_pages, whose subarrays each contain [page_name, url, id]. The array contains duplicate subarrays.
I need to parse each subarray in dirty_pages into clean_pages so that:
there are no duplicates (duplicate subarrays)
index 1 of each subarray, i.e. the url, is unique! For example, these two should count as one (url/#review is still the same url):
file:///home/joe/Desktop/my-projects/FashionShop/product.html#review
and
file:///home/joe/Desktop/my-projects/FashionShop/product.html
My current attempt returns clean_pages with 6 subarrays (duplicates!), while the correct answer should be 4.
# clean pages
clean_pages = []
# dirty pages
dirty_pages = [
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['ICONIC EXCLUSIVE - Game Over Drop Crotch Track Pants - Kids by Rock Your Kid Online | THE ICONIC | Australia', 'file:///home/joe/Desktop/my-projects/FashionShop/iconic-product.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/?123', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/shahyan/Desktop/my-projects/FashionShop/index.html/', '1608093980462.042'],
]
# clean data - get unique pages for each session
for j in range(len(dirty_pages)):
    page_name = dirty_pages[j][0]
    page_url = dirty_pages[j][1]
    page_sessionId = dirty_pages[j][2]

    not_seen = False
    if len(clean_pages) == 0:
        clean_pages.append([page_name, page_url, page_sessionId])
    else:
        for i in range(len(clean_pages)):
            next_page_name = clean_pages[i][0]
            next_page_url = clean_pages[i][1]
            next_page_sessionId = clean_pages[i][2]
            if page_url != next_page_url and page_name != next_page_name \
                    and page_sessionId == next_page_sessionId:
                not_seen = True
            else:
                not_seen = False
        if not_seen is True:
            clean_pages.append([page_name, page_url, page_sessionId])

print("$$$ clean...", len(clean_pages))
# correct answer should be 4 - as anything after the url e.g. #review is still a duplicate!
Updated example - apologies if the example was unclear (things like # after the url should be treated as one url):
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html/?123'
'file:///home/joe/Desktop/my-projects/FashionShop/index.html'
You can use furl to normalize the urls:
from furl import furl

# Iterate over each page - subarray
for page in dirty_pages:
    # normalize url
    page[1] = furl(page[1]).remove(args=True, fragment=True).url.strip("/")

    # check if subarray already in clean_pages
    if page not in clean_pages:
        clean_pages.append(page)
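If installing the third-party furl package is not an option, a similar normalization can be sketched with the standard library's urllib.parse (an approximation: it rebuilds scheme://netloc/path and assumes nothing meaningful lives in the query string, the fragment, or a trailing slash):

```python
from urllib.parse import urlsplit

def normalize(url):
    # Parse the url, then rebuild it without the query string,
    # the fragment, or a trailing slash on the path.
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"

print(normalize('file:///home/joe/Desktop/my-projects/FashionShop/product.html#review'))
# file:///home/joe/Desktop/my-projects/FashionShop/product.html
```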
You could do it like this:
for j in dirty_pages:
    page_name = j[0]
    long_url = j[1]
    split_url = long_url.split('#')
    short_url = split_url[0]
    page_sessionID = j[2]
    edited_subarray = [page_name, short_url, page_sessionID]
    if edited_subarray not in clean_pages:
        clean_pages.append(edited_subarray)
Unless you need to keep the "#review" part of the url in the clean_pages list.
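The same idea can be made a little faster on long lists by tracking the urls already seen in a set instead of scanning clean_pages each time (a sketch with made-up sample data):

```python
dirty_pages = [
    ['Index', 'file:///FashionShop/index.html', '1608093980462.042'],
    ['Index', 'file:///FashionShop/index.html#review', '1608093980462.042'],
    ['Product', 'file:///FashionShop/product.html', '1608093980462.042'],
]

clean_pages = []
seen_urls = set()
for page_name, long_url, session_id in dirty_pages:
    short_url = long_url.split('#', 1)[0]   # drop the fragment
    if short_url not in seen_urls:          # O(1) membership test
        seen_urls.add(short_url)
        clean_pages.append([page_name, short_url, session_id])

print(len(clean_pages))  # 2
```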
I think this question essentially boils down to wanting unique urls. If so, your earlier checking approach is a little overcomplicated, and the result you got comes down to adding only the first item for each unique name - hence 6. But it seems you want unique urls.
To get around the # problem, I simply split the url on the hash sign and took the first part. Note that this only takes the part of the string before the first hash, so if there is more than one # it could cause problems.
I also cleaned it up a bit:
# clean pages
clean_pages = []
# dirty pages
dirty_pages = [
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/product.html#review', '1608093980462.042'],
    ['Fashion Shop | Free Bootstrap Themes by 365Bootstrap.com', 'file:///home/joe/Desktop/my-projects/FashionShop/index.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
    ['ICONIC EXCLUSIVE - Game Over Drop Crotch Track Pants - Kids by Rock Your Kid Online | THE ICONIC | Australia', 'file:///home/joe/Desktop/my-projects/FashionShop/iconic-product.html', '1608093980462.042'],
    ['Put a Sock in It Heel Boot | Nasty Gal', 'file:///home/joe/Desktop/my-projects/FashionShop/nastygal-product.html', '1608093980462.042'],
]
used_url = []
used_names = []

# clean data - get unique pages for each session
for page in dirty_pages:
    page_name = page[0]
    page_url = page[1].split('#')[0]
    page_sessionId = page[2]

    if page_url not in used_url:
        used_url.append(page_url)
        clean_pages.append(page)

print(clean_pages)
print("$$$ clean...", len(clean_pages))
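Note that splitting on '#' alone does not cover the updated examples with '?123' and a trailing '/'. One way to handle all three cases, sketched here with plain string operations:

```python
def normalize(url):
    # Drop the fragment, then the query string, then any trailing slash.
    for sep in ('#', '?'):
        url = url.split(sep, 1)[0]
    return url.rstrip('/')

# The three variants from the question's update all collapse to one url
variants = [
    'file:///home/joe/Desktop/my-projects/FashionShop/index.html/',
    'file:///home/joe/Desktop/my-projects/FashionShop/index.html/?123',
    'file:///home/joe/Desktop/my-projects/FashionShop/index.html',
]
print(len({normalize(u) for u in variants}))  # 1
```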