
Break pandas DataFrame column into multiple pieces and combine with other DataFrame

I have a table of phrases and a table of the individual words that make up these phrases. I want to break my phrases up into individual words, gather and reduce information about those words, and add the result as a new column in my phrase data. Is there a smart way to do this using pandas DataFrames?

    df_multigram = pd.DataFrame([
        ["happy birthday", 23],
        ["used below", 10],
        ["frame for", 2]
    ], columns=["multigram", "frequency"])
    df_onegram = pd.DataFrame([
        ["happy", 35],
        ["birthday", 25],
        ["used", 14],
        ["below", 11],
        ["frame", 2],
        ["for", 13]
    ], columns=["onegram", "frequency"])

    ###### What do I do here????? #######

    sum_freq_onegrams = list(df_multigram["sum_freq_onegrams"])
    self.assertEqual(sum_freq_onegrams, [60, 25, 15])

Just to clarify, my desire is that sum_freq_onegrams is equal to [60, 25, 15], where 60 is the frequency of "happy" plus the frequency of "birthday".

You could use

freq = df_onegram.set_index(['onegram'])['frequency']
sum_freq_onegrams = df_multigram['multigram'].str.split().apply(
    lambda x: pd.Series(x).map(freq).sum())

which yields

In [43]: sum_freq_onegrams
Out[43]: 
0    60
1    25
2    15
Name: multigram, dtype: int64
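A more vectorized variant of the same idea (a sketch, not part of the original answer) is to explode each phrase into one row per word, map every word through the frequency Series, and sum back per original row index. This assumes pandas 0.25+, where `Series.explode` is available:

```python
import pandas as pd

df_multigram = pd.DataFrame([
    ["happy birthday", 23],
    ["used below", 10],
    ["frame for", 2]
], columns=["multigram", "frequency"])
df_onegram = pd.DataFrame([
    ["happy", 35], ["birthday", 25], ["used", 14],
    ["below", 11], ["frame", 2], ["for", 13]
], columns=["onegram", "frequency"])

# Word -> frequency lookup Series.
freq = df_onegram.set_index("onegram")["frequency"]

# One row per word, keeping the original row index, then sum per row.
words = df_multigram["multigram"].str.split().explode()
sum_freq_onegrams = words.map(freq).groupby(level=0).sum()
```

This avoids building a tiny `pd.Series` per row, since the split/map/sum all run over a single exploded Series.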

But note that calling a (lambda) function once for every row and building a new (tiny) Series each time may be rather slow. Using a different data structure -- even plain Python lists and dicts -- may be faster. For example, if you define the list phrases and the dict freq_dict,

phrases = df_multigram['multigram'].tolist()
freq_dict = freq.to_dict()

then the list comprehension (below) is 280x faster than the Pandas-based method:

In [65]: [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
Out[65]: [60, 25, 15]

In [38]: %timeit [sum(freq_dict.get(item, 0) for item in phrase.split()) for phrase in phrases]
100000 loops, best of 3: 3.6 µs per loop

In [41]: %timeit df_multigram['multigram'].str.split().apply(lambda x: pd.Series(x).map(freq).sum())
1000 loops, best of 3: 1.01 ms per loop

Thus, using a Pandas DataFrame here to hold the phrases might not be the right data structure for this problem.
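The dict-based lookup above can be packaged as a small helper that also documents how unknown words are handled (they contribute 0 via `dict.get`). A minimal sketch, assuming the same df_multigram and df_onegram as in the question; the name sum_word_freqs is illustrative, not from the answer:

```python
import pandas as pd

df_multigram = pd.DataFrame([
    ["happy birthday", 23], ["used below", 10], ["frame for", 2]
], columns=["multigram", "frequency"])
df_onegram = pd.DataFrame([
    ["happy", 35], ["birthday", 25], ["used", 14],
    ["below", 11], ["frame", 2], ["for", 13]
], columns=["onegram", "frequency"])

# Plain-dict lookup table: word -> frequency.
freq_dict = df_onegram.set_index("onegram")["frequency"].to_dict()

def sum_word_freqs(phrase, freqs):
    """Sum the frequency of every known word in a phrase; unknown words count as 0."""
    return sum(freqs.get(word, 0) for word in phrase.split())

df_multigram["sum_freq_onegrams"] = [
    sum_word_freqs(p, freq_dict) for p in df_multigram["multigram"]
]
```

The result can still be stored back into the DataFrame, so using a dict for the inner loop does not force you to abandon pandas for the rest of the pipeline.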

There is probably a better way to do this but this works:

In [131]:

def func(x):
    total = 0
    for w in x.split():
        # Filter df_onegram once per word and reuse the result.
        match = df_onegram.loc[df_onegram['onegram'] == w, 'frequency']
        if not match.empty:
            total += match.values[0]
    return total
df_multigram['total_freq'] = df_multigram['multigram'].apply(func)
df_multigram
Out[131]:
        multigram  frequency  total_freq
0  happy birthday         23          60
1      used below         10          25
2       frame for          2          15
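The per-word boolean filter above scans df_onegram once for every word. A merge-based variant (a sketch under the same data, not the answerer's code; requires pandas 0.25+ for `DataFrame.explode`) does all lookups in a single join:

```python
import pandas as pd

df_multigram = pd.DataFrame([
    ["happy birthday", 23], ["used below", 10], ["frame for", 2]
], columns=["multigram", "frequency"])
df_onegram = pd.DataFrame([
    ["happy", 35], ["birthday", 25], ["used", 14],
    ["below", 11], ["frame", 2], ["for", 13]
], columns=["onegram", "frequency"])

# One row per (phrase, word) pair, then a single left merge
# against the one-gram table instead of a filter per word.
words = df_multigram.assign(
    onegram=df_multigram["multigram"].str.split()
).explode("onegram")
merged = words.merge(
    df_onegram, on="onegram", how="left", suffixes=("", "_onegram")
)

# Sum the looked-up frequencies per phrase and attach as a column.
totals = merged.groupby("multigram", sort=False)["frequency_onegram"].sum()
df_multigram["total_freq"] = df_multigram["multigram"].map(totals)
```

With `how="left"`, words missing from df_onegram become NaN and are ignored by `sum()`, matching the `if` guard in the loop version.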
