
Pandas groupby and file writing problems

I have some pandas groupby functions that write data to file, but for some reason I'm getting redundant rows written to the file. Here's the code:

# This function gets applied to each item in the dataframe

def item_grouper(df):
    # Get the frequency of each tag applied to the item
    tag_counts = df['tag'].value_counts() 
    # Get the most frequent tag (or tags, assuming a tie)
    max_tags = tag_counts[tag_counts==tag_counts.max()]
    # Get the total number of annotations for the item
    total_anno = len(df)
    # Now, process each user who tagged the item
    return df.groupby('uid').apply(user_grouper,total_anno,max_tags,tag_counts)

# This function gets applied to each user who tagged an item
def user_grouper(df,total_anno,max_tags,tag_counts):
    # subtract user's annotations from total annotations for the item
    total_anno = total_anno - len(df)
    # calculate weight
    weight = np.log10(total_anno)
    # check if user has used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values,df['iid']))>0:
        max_tag_count = float(max_tags[0]-1)
    else:
        max_tag_count = float(max_tags[0])
    # for each annotation...
    for i,row in df.iterrows():
        # calculate raw score
        raw_score = (tag_counts[row['tag']]-1) / max_tag_count
        # write to file
        out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,weight]))+'\n')
    return df

So, one grouping function groups the data by iid (item id), does some processing, and then groups each sub-dataframe by uid (user_id), does some calculation, and writes to an output file. Now, the output file should have exactly one line per row in the original dataframe, but it doesn't! I keep getting the same data written to file multiple times. For instance, if I run:

out = open('data/test','w')
df.head(1000).groupby('iid').apply(item_grouper)
out.close()

The output should have 1000 lines (the code only writes one line per row in the dataframe), but the resulting output file has 1,997 lines. Looking at the file shows the exact same lines written multiple (2-4) times, seemingly at random (i.e. not all lines are double-written). Any idea what I'm doing wrong here?

See the docs on apply. Pandas will call the function twice on the first group (to decide between a fast and a slow code path), so the function's side effects (the file IO) happen twice for the first group.
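
To see the effect in isolation, here is a minimal sketch (the data and the noisy() helper are made up for illustration, and the double call only happens on pandas versions where apply still probes the first group to pick a code path):

import pandas as pd

df = pd.DataFrame({'iid': [1, 1, 2], 'tag': ['a', 'b', 'c']})

# The side effect (printing here, file IO in the question) runs once more
# than expected for the first group on affected pandas versions.
def noisy(group):
    print('processing iid', group['iid'].iloc[0])
    return group

df.groupby('iid').apply(noisy)
# Affected versions print "processing iid 1" twice before "processing iid 2".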

Your best bet here is probably to iterate over the groups directly, like this:

for group_name, group_df in df.head(1000).groupby('iid'):
    item_grouper(group_df)
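
If you keep item_grouper() exactly as in the question, the surrounding code could look like the following sketch, which uses a with block so the file is always closed:

# user_grouper() writes through the global name 'out', as in the question
with open('data/test', 'w') as out:
    for iid, group_df in df.head(1000).groupby('iid'):
        item_grouper(group_df)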

I agree with chrisb's diagnosis of the problem. As a cleaner approach, consider having your user_grouper() function not write anything out, but instead return the values, with a structure like this:

def user_grouper(df, ...):
    (...)
    df['max_tag_count'] = some_calculation
    df['weight'] = some_other_calculation
    return df

results = df.groupby(...).apply(user_grouper, ...)
for i,row in results.iterrows():
    # calculate raw score
    raw_score = (tag_counts[row['tag']]-1) / row['max_tag_count']
    # write to file
    out.write('\t'.join(map(str,[row['uid'],row['iid'],row['tag'],raw_score,row['weight']]))+'\n')
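
Taking this one step further, user_grouper() could also return raw_score and weight as columns, so the whole result can be written in a single call instead of row by row. A sketch, assuming numpy is imported as np and the column names are as in the question, that item_grouper() stays unchanged, and that the intersection is meant to check the user's tags (the comment in the question suggests df['tag'] rather than df['iid']):

def user_grouper(df, total_anno, max_tags, tag_counts):
    # subtract user's annotations from total annotations for the item
    total_anno = total_anno - len(df)
    weight = np.log10(total_anno)
    # check if the user used (one of) the top tag(s), and adjust max_tag_count
    if len(np.intersect1d(max_tags.index.values, df['tag'])) > 0:
        max_tag_count = float(max_tags.iloc[0] - 1)
    else:
        max_tag_count = float(max_tags.iloc[0])
    # add the per-row score and per-user weight as columns instead of writing
    df = df.copy()
    df['raw_score'] = (df['tag'].map(tag_counts) - 1) / max_tag_count
    df['weight'] = weight
    return df

results = df.head(1000).groupby('iid').apply(item_grouper)
results[['uid', 'iid', 'tag', 'raw_score', 'weight']].to_csv(
    'data/test', sep='\t', header=False, index=False)

Since apply only duplicates the side effects of the function, not the frames it returns, the file written this way has exactly one line per input row.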
