
Find number of unique elements in a list column in DataFrame

I have a DataFrame in which, for the column 'pages', I need to count the number of unique elements that appear before the first element containing the substring 'log_in'. If the same list contains more than one such element, I need to count only up to the first one.

Input example:

site     pages
zoom.us  ['zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344']
zoom.us  ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye']

Output example:

site     pages  unique_pages_before_log_in
zoom.us  ['zoom.us/register', 'zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344']  1
zoom.us  ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye']  3

I thought about using set to count unique values, but I don't know how to count only until the first 'log_in' substring appears. Something like this:

df['unique_pages_before_login'] = df['pages'].apply(lambda l: len(set(l[:l.index('zoom.us/log_in')])))

I will appreciate any help :)

First, let's write a function that finds the first log_in according to your needs. This function should count the unique pages (preserving order) until it finds a log_in instance.

def find_log_in(pages):
    # Duplicate removal while preserving order original idea from: https://stackoverflow.com/a/17016257/3281097
    # Python 3.7+ only
    for i, page in enumerate(dict.fromkeys(pages)):
        if page.startswith("zoom.us/log_in/"):
            return i
    return None  # -1 or any value that you prefer

Now you just need to apply this function to your column:

df["unique_pages_before_log_in"] = df["pages"].apply(find_log_in)
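As a quick check of why the index from enumerate equals the number of unique pages before the first log_in, here is a minimal sketch on a plain list. The sample list is the first row of the question's output example; the variable names deduped and first_login are only illustrative.

```python
# First row of the question's output example (note the duplicated page).
pages = [
    "zoom.us/register",
    "zoom.us/register",
    "zoom.us/log_in/=?sdsd",
    "zoom.us/log_in/=a3344",
]

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dicts since Python 3.7).
deduped = list(dict.fromkeys(pages))
print(deduped)

# Position of the first log_in page in the deduplicated list. Every entry
# before it is a distinct non-log_in page, so the position is the count.
first_login = next(
    i for i, page in enumerate(deduped) if page.startswith("zoom.us/log_in/")
)
print(first_login)  # 1
```

Note that the startswith prefix is specific to zoom.us; for other sites a substring test such as "log_in" in page may be more suitable.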

It looks like you have to use .apply() here. One approach is to add each element to a set until you reach one that contains your search string. When you find it, return the size of the set you've built.

def count_unique_before_login(pages):
    c = set()
    for item in pages:
        if "log_in" in item: return len(c)
        c.add(item)
    return None # No log_in found


import pandas as pd

# Build a real DataFrame; a plain dict has no .apply() on its columns
df = pd.DataFrame({'site': {0: 'zoom.us', 1: 'zoom.us'},
 'pages': {0: ['zoom.us/register',
   'zoom.us/log_in/=?sdsd',
   'zoom.us/log_in/=a3344'],
  1: ['zoom.us/about_us',
   'zoom.us/error',
   'zoom.us/help',
   'zoom.us/log_in/jjjsl',
   'zoom.us/log_in/llaye']}})

df["unique_pages_before_log_in"] = df["pages"].apply(count_unique_before_login)

Which gives:

      site  ... unique_pages_before_log_in
0  zoom.us  ...                          1
1  zoom.us  ...                          3

You can try using re.findall and a for loop to get what you want.

import re

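def find_unique_elements(list_, matchword):
    unique_no = []
    for row in list_:
        count = None  # stays None if no element of the row matches
        for i in range(len(row)):
            if re.findall(matchword, str(row[i])):
                # count distinct elements before the first match,
                # not just the position of the match
                count = len(set(row[:i]))
                break
        # append once per row so the result lines up with the DataFrame
        unique_no.append(count)

    return unique_no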

matchword = "log_in"
list_ = df["pages"]

ddf = find_unique_elements(list_,matchword)
df["unique_pages_before_log_in"] = ddf

Try:

# set() removes duplicates in the slice before the first "log_in" element
df["unique_pages_before_log_in"] = df["pages"].apply(lambda x: len(set(x[:min(i for i, s in enumerate(x) if "log_in" in s)])))

>>> df
      site  ... unique_pages_before_log_in
0  zoom.us  ...                          1
1  zoom.us  ...                          3

[2 rows x 3 columns]
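The answers above differ in how they treat a list that contains no log_in page at all. As one possible way to make that case explicit, here is a minimal sketch using itertools.takewhile on plain lists. The function name unique_before_login, the marker parameter, and the choice to return None when nothing matches are assumptions made for illustration, not part of the original answers.

```python
from itertools import takewhile


def unique_before_login(pages, marker="log_in"):
    """Count distinct pages that occur before the first page containing marker.

    Returns None when no page contains marker (an assumed convention).
    """
    if not any(marker in page for page in pages):
        return None
    # takewhile yields pages until the first one that contains the marker
    return len(set(takewhile(lambda page: marker not in page, pages)))


rows = [
    ["zoom.us/register", "zoom.us/register",
     "zoom.us/log_in/=?sdsd", "zoom.us/log_in/=a3344"],
    ["zoom.us/about_us", "zoom.us/error", "zoom.us/help",
     "zoom.us/log_in/jjjsl", "zoom.us/log_in/llaye"],
    ["zoom.us/about_us", "zoom.us/help"],  # hypothetical row with no log_in page
]
counts = [unique_before_login(row) for row in rows]
print(counts)  # [1, 3, None]
```

With a DataFrame, the same function could be passed to the column in the usual way, for example df["pages"].apply(unique_before_login).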
