Find the number of unique elements in a list column of a DataFrame
I have a DataFrame in which, for the column 'pages', I need to count the number of unique elements that appear before the first element containing the sub-string 'log in'. If the same list contains more than one such element, I need to count only up to the first one.
Input example:
site | pages |
---|---|
zoom.us | ['zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344'] |
zoom.us | ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye'] |
Output example:
site | pages | unique_pages_before_log_in |
---|---|---|
zoom.us | ['zoom.us/register', 'zoom.us/register', 'zoom.us/log_in/=?sdsd', 'zoom.us/log_in/=a3344'] | 1 |
zoom.us | ['zoom.us/about_us', 'zoom.us/error', 'zoom.us/help', 'zoom.us/log_in/jjjsl', 'zoom.us/log_in/llaye'] | 3 |
I thought about using set to count unique values, but I don't know how to count only until the first 'log in' sub-string appears. Something like this:
df['unique_pages_before_login'] = df['pages'].apply(lambda l: len(set(l[:l.index('zoom.us/log_in')])))
I will appreciate any help :)
First, let's write a function that finds the first log_in page. It should count the unique pages (preserving order) that come before the first log-in instance.
def find_log_in(pages):
    # Duplicate removal while preserving order original idea from: https://stackoverflow.com/a/17016257/3281097
    # Python 3.7+ only
    for i, page in enumerate(dict.fromkeys(pages)):
        if page.startswith("zoom.us/log_in/"):
            return i
    return None  # -1 or any value that you prefer
Now, you just need to apply this function to your column:
df["unique_pages_before_log_in"] = df["pages"].apply(find_log_in)
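As a quick sanity check, here is a self-contained sketch that runs the same logic on plain Python lists, so pandas is not needed to try it. The first two rows come from the question's output example; the third row, which has no log-in page, is a made-up case added only to show the None fallback.

```python
def find_log_in(pages):
    # dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+)
    for i, page in enumerate(dict.fromkeys(pages)):
        if page.startswith("zoom.us/log_in/"):
            # i unique pages were seen before this one
            return i
    return None  # no log-in page in this list

rows = [
    ["zoom.us/register", "zoom.us/register",
     "zoom.us/log_in/=?sdsd", "zoom.us/log_in/=a3344"],
    ["zoom.us/about_us", "zoom.us/error", "zoom.us/help",
     "zoom.us/log_in/jjjsl", "zoom.us/log_in/llaye"],
    ["zoom.us/about_us", "zoom.us/help"],  # hypothetical row without a log-in page
]

results = [find_log_in(r) for r in rows]
print(results)  # [1, 3, None]
```

Note that a column containing None will usually end up with a float or object dtype in pandas, so a sentinel such as -1 may be more convenient if you need integers.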
Looks like you have to use .apply() here. One approach is to add each element to a set until you reach one that contains your search string. When you find it, return the size of the set you've built.
import pandas as pd

def count_unique_before_login(pages):
    c = set()
    for item in pages:
        # Stop at the first page containing the search string
        if "log_in" in item:
            return len(c)
        c.add(item)
    return None  # No log_in found

# Build a DataFrame (a plain dict has no .apply method)
df = pd.DataFrame({'site': {0: 'zoom.us', 1: 'zoom.us'},
                   'pages': {0: ['zoom.us/register',
                                 'zoom.us/log_in/=?sdsd',
                                 'zoom.us/log_in/=a3344'],
                             1: ['zoom.us/about_us',
                                 'zoom.us/error',
                                 'zoom.us/help',
                                 'zoom.us/log_in/jjjsl',
                                 'zoom.us/log_in/llaye']}})
df["unique_pages_before_log_in"] = df["pages"].apply(count_unique_before_login)
Which gives:
site ... unique_pages_before_log_in
0 zoom.us ... 1
1 zoom.us ... 3
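The same stop-at-the-first-match idea can also be sketched with itertools.takewhile from the standard library. The function name and the marker parameter below are illustrative choices, not part of the answer above. One difference in behavior: if no element contains the marker, this version counts all unique elements instead of returning None.

```python
from itertools import takewhile

def count_unique_before(pages, marker="log_in"):
    """Count distinct pages seen before the first page containing `marker`."""
    # takewhile yields items until the predicate first fails, then stops
    before = takewhile(lambda page: marker not in page, pages)
    return len(set(before))

rows = [
    ["zoom.us/register", "zoom.us/register",
     "zoom.us/log_in/=?sdsd", "zoom.us/log_in/=a3344"],
    ["zoom.us/about_us", "zoom.us/error", "zoom.us/help",
     "zoom.us/log_in/jjjsl", "zoom.us/log_in/llaye"],
]

counts = [count_unique_before(r) for r in rows]
print(counts)  # [1, 3]
```

With a DataFrame it would be applied the same way as before, for example df["pages"].apply(count_unique_before).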
You can try using re.findall and a for loop to get what you want.
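import re

def find_unique_elements(list_, matchword):
    unique_no = []
    for row in list_:
        for i in range(len(row)):
            if matchword in re.findall(matchword, str(row[i])):
                # Count the distinct pages that precede the first match
                unique_no.append(len(set(row[:i])))
                break
        else:
            # No match in this row; keep the result aligned with the rows
            unique_no.append(None)
    return unique_no

matchword = "log_in"
list_ = df["pages"]
ddf = find_unique_elements(list_, matchword)
df["unique_pages_before_log_in"] = ddf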
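A regular expression mainly pays off when the log-in marker can be spelled in more than one way. The sketch below assumes, purely for illustration, that "log_in", "log-in", "log in" and "login" in any letter case should all count as a match; the pattern and the function name are hypothetical and should be adapted to your real URLs.

```python
import re

# Assumed spelling variants: log_in, log-in, "log in", login (case-insensitive)
LOGIN_RE = re.compile(r"log[ _-]?in", re.IGNORECASE)

def count_unique_before_match(pages, pattern=LOGIN_RE):
    seen = set()
    for page in pages:
        # re.search looks for the pattern anywhere in the string
        if pattern.search(page):
            return len(seen)
        seen.add(page)
    return None  # no page matched the pattern

print(count_unique_before_match(
    ["zoom.us/help", "zoom.us/help", "zoom.us/Log-In?x=1"]))  # 1
print(count_unique_before_match(["zoom.us/help"]))  # None
```

Be aware that a loose pattern like this one also matches unrelated strings that happen to contain "login", so tighten it if your pages include such paths.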
Try:
df["unique_pages_before_log_in"] = df["pages"].apply(lambda x: len(set(x[:min(i for i, s in enumerate(x) if "log_in" in s)])))
>>> df
site ... unique_pages_before_log_in
0 zoom.us ... 1
1 zoom.us ... 3
[2 rows x 3 columns]
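One caveat with the one-liner: min() raises ValueError on an empty sequence, so a row with no log_in element would make the whole apply call fail. A minimal sketch of a safer variant, using next() with a default value, is shown below; the function name and the choice of None as the fallback are assumptions, not part of the answer above.

```python
def unique_before_login(pages, marker="log_in"):
    # Index of the first page containing the marker, or None if there is none
    first = next((i for i, page in enumerate(pages) if marker in page), None)
    if first is None:
        return None
    # Distinct pages strictly before the first match
    return len(set(pages[:first]))

print(unique_before_login(
    ["zoom.us/register", "zoom.us/register", "zoom.us/log_in/=?sdsd"]))  # 1
print(unique_before_login(["zoom.us/about_us", "zoom.us/error"]))  # None
```

It can be used in place of the lambda, for example df["pages"].apply(unique_before_login).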