
Count frequency of itemsets in the given data frame

I have the following data frame:

data = pd.read_csv('sample.csv', sep=',')


I need to count the frequency of each itemset in a given set. For example:

itemsets = {(143, 157), (143, 166), (175, 178), (175, 190)}

The goal is to count how often each tuple occurs in the data frame (I am trying to implement the Apriori algorithm). I am having particular trouble addressing the tuples individually and searching for a whole tuple rather than for individual entries in the data.

Update-1

For example, the data frame looks like this:

39, 120, 124, 205, 401, 581, 704, 814, 825, 834
35, 39,  205, 712, 733, 759, 854, 950
39, 422, 449, 704, 825, 857, 895, 937, 954, 964

Update-2

The function should increment the count for a tuple only if all the values in that tuple are present in a given row. For example, searching for (39, 205) should return a frequency of 2, because two of the rows (the first and second) contain both 39 and 205.

This function returns a dictionary that maps each tuple to the number of rows of the data frame containing all of its values.

from collections import defaultdict
def count(df, sequence):
    dict_data = defaultdict(int)
    shape = df.shape[0]
    for items in sequence:
        for row in range(shape):
            dict_data[items] += all([item in df.iloc[row, :].values for item in items])
    return dict_data

Pass the data frame and the set of tuples to count(), and it returns the number of matching rows for each tuple. For example, with itemsets = {(39, 205)}:

>>> count(data, itemsets)
defaultdict(<class 'int'>, {(39, 205): 2})

You can convert the defaultdict to a plain dictionary with dict():

>>> dict(count(data, itemsets))
{(39, 205): 2}

Both hold the same counts; the only difference is that the defaultdict returns 0 for a missing key instead of raising KeyError.
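The row-by-row loop above calls df.iloc once per row and per itemset, which can be slow on larger frames. As a rough alternative, here is a minimal sketch that lets pandas compare whole columns at once. The helper name count_vectorized and the inline sample data are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd

# Sample rows from the question; pandas pads the shorter row with NaN.
rows = [
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
]
df = pd.DataFrame(rows)


def count_vectorized(df, itemsets):
    """Hypothetical helper: count the rows that contain every item of each itemset."""
    result = {}
    for items in itemsets:
        # Start with every row as a candidate, then keep only the rows
        # in which each item occurs in at least one column.
        mask = pd.Series(True, index=df.index)
        for item in items:
            mask &= df.eq(item).any(axis=1)
        result[items] = int(mask.sum())
    return result


counts = count_vectorized(df, {(39, 205), (39, 205, 401), (143, 157)})
print(counts)
```

Unlike the defaultdict version, this returns a plain dict, so itemsets that never occur appear explicitly with a count of 0.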

import pandas as pd

itemsets = {(39, 205), (39, 205, 401), (143, 157), (143, 166), (175, 178), (175, 190)}

x = [[39,120,124,205,401,581,704,814,825,834],
[35,39,205,712,733,759,854,950],
[39,422,449,704,825,857,895,937,954,964]]

data = pd.DataFrame(x)

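for itemset in itemsets:
    print(itemset)
    count = 0  # number of rows containing every item of this itemset
    for i in range(len(data)):
        # value_counts() is indexed by the distinct values in row i,
        # so `in` tests whether the item occurs in that row
        row_values = data.loc[i].value_counts()
        flag = True
        for item in itemset:
            if item not in row_values:
                flag = False
                break  # one missing item is enough to reject the row
        if flag:
            count += 1
    print(count)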

Edited to take arbitrary itemset lengths into account, as suggested in the comments (many thanks for the useful insights).

First of all, since there is some misunderstanding about what the question asks, this answer addresses the question "How do we count the number of rows in which every item of the item set appears at least once?".


For each row in the data frame, we can decide whether it counts towards the frequency using

all(item in row for item in items)

where items is an item set, for example (39, 205).
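To make the test concrete, here is a small sketch that applies it to the first sample row from the question. The variable names are chosen for illustration only.

```python
# First sample row from the question, stored as a set for fast membership tests.
row = {39, 120, 124, 205, 401, 581, 704, 814, 825, 834}

items = (39, 205)
# True only if every item of the item set occurs in the row.
contains_all = all(item in row for item in items)

# 950 is not in this row, so the same test fails for (39, 950).
contains_other = all(item in row for item in (39, 950))

# An equivalent formulation using set operations.
contains_all_subset = set(items).issubset(row)

print(contains_all, contains_other, contains_all_subset)
```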

We can iterate over all the rows using DataFrame.itertuples, so for every item set items, its frequency is

sum(1 for row in map(set, df.itertuples(index=False, name=None)) if all(item in row for item in items))

(We use map(set, ...) to turn the tuples into sets. This is not required, but it makes each membership test faster. Passing index=False keeps the row label out of the set; by default itertuples puts the index first, so an item equal to a row label would be counted by mistake.)

Finally, we iterate over all the item sets in itemsets and store the result in a dictionary whose keys are the item sets and whose values are the frequencies:

{items: sum(1 for row in map(set, df.itertuples(index=False, name=None)) if all(item in row for item in items)) for items in itemsets}

Output: The output for the case you supplied is {(39, 205): 2}
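For reference, the pieces above can be put together into a runnable sketch on the sample rows from the question. The sample itemsets and the names row_sets and frequencies are assumptions for illustration. The sketch builds the row sets once instead of once per item set, and it passes index=False so that the row labels (0, 1, 2) are not mistaken for items.

```python
import pandas as pd

# Sample rows from the question; pandas pads the shorter row with NaN.
rows = [
    [39, 120, 124, 205, 401, 581, 704, 814, 825, 834],
    [35, 39, 205, 712, 733, 759, 854, 950],
    [39, 422, 449, 704, 825, 857, 895, 937, 954, 964],
]
df = pd.DataFrame(rows)
itemsets = {(39, 205), (39, 205, 401), (143, 157)}

# Convert every row to a set once, leaving out the index.
row_sets = [set(row) for row in df.itertuples(index=False, name=None)]

# Map each item set to the number of rows that contain all of its items.
frequencies = {
    items: sum(1 for row in row_sets if all(item in row for item in items))
    for items in itemsets
}
print(frequencies)
```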

If you don't like the one-line version, you can expand the algorithm into several lines like so:

d = {}  # output dictionary
for items in itemsets:
    frequency = 0
    for row in df.itertuples(index=False, name=None):  # index=False: leave out the row label
        row = set(row)  # done for efficiency
        for item in items:
            if item not in row:
                break
        else:  # no break
            frequency += 1
    d[items] = frequency

Additional information about for... else can be found in this answer
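As a brief illustration of the for ... else construct used above: the else branch runs only when the loop finishes without hitting break. The helper name contains_all is hypothetical and exists only for this sketch.

```python
def contains_all(row, items):
    """Hypothetical helper: return True if every item occurs in row."""
    for item in items:
        if item not in row:
            break  # a missing item: skip the else branch
    else:
        # Reached only when the loop ran to completion without break.
        return True
    return False


row = {35, 39, 205, 712, 733, 759, 854, 950}
print(contains_all(row, (39, 205)))
print(contains_all(row, (39, 401)))
```

Note that an empty items tuple never triggers break, so the else branch runs and the function returns True, which matches the behaviour of all() on an empty sequence.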
