
Calculating pandas DataFrame column which is equal to the missing words from one set to another in a previous DataFrame column

I need some tips to make a calculation.

My dataframe looks like the following:

text_id     name     date                words
1           John     2018-01-01          {ocean, blue}
1           John     2018-02-01          {ocean, green} 
2           Anne     2018-03-01          {table, chair}
3           Anne     2018-03-01          {hot, cold, warm}
3           Mark     2018-04-01          {hot, cold}
3           Ethan    2018-05-01          {warm, icy}
4           Paul     2018-01-01          {cat, dog, puppy}
4           John     2018-02-01          {cat}
5           Paul     2018-03-01          {cat, sheep, deer}

In this table, text_id identifies a specific text (SAME TEXT_ID = SAME TEXT). The name column is the person who edited the text. The date column is the date on which that user made the edit. The words column holds the words that make up the text after the user's edit.

The words column is a set. I need to add an additional column, erased_words, which contains the set difference between the previous edit (in the previous row) and the current edit (in the current row) of THE SAME text, i.e. the words that were present before the edit and are gone after it. This probably means the operation must be done grouping by text_id.

The sample output here would be:

text_id     name     date          words            erased_words
1           John     2018-01-01    {ocean,blue}     {}
1           John     2018-02-01    {ocean,green}    {blue}
2           Anne     2018-03-01    {table,chair}    {}
3           Anne     2018-03-01    {hot,cold,warm}  {}
3           Mark     2018-04-01    {hot,cold}       {warm}
3           Ethan    2018-05-01    {warm,icy}       {hot, cold}
4           Paul     2018-01-01    {cat,dog,puppy}  {}
4           John     2018-02-01    {cat}            {dog, puppy}
5           Paul     2018-03-01    {cat,sheep,deer} {}

Note that, basically, the erased_words column contains the set difference between the words column in row i-1 and the words column in row i, only if the text_id in row i and row i-1 is the same, because I only want the words missing between consecutive edits of the SAME text (same text_id), not of different ones.

Any tips on this will be extremely helpful.

EDIT :

In order to turn the words column into a set, do:

df['words'] = df['words'].str.strip('{}').str.split(',').apply(set)
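One caveat with the line above: splitting on ',' alone keeps the space that follows each comma, so ' blue' and 'blue' would end up as different words. Below is a minimal sketch that strips whitespace around each word, assuming the column holds strings such as "{ocean, blue}". The helper name parse_words and the small raw series are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of the raw string column (not the full data set).
raw = pd.Series(["{ocean, blue}", "{cat}", "{hot, cold, warm} "])

def parse_words(text):
    """Turn a string like '{ocean, blue}' into the set {'ocean', 'blue'}."""
    inner = text.strip().strip("{}")
    # Strip whitespace around each word and skip empty pieces,
    # so that '{}' becomes an empty set.
    return {w.strip() for w in inner.split(",") if w.strip()}

words = raw.apply(parse_words)
```

On the real dataframe this would be applied as df['words'] = df['words'].apply(parse_words).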

NOTE:

This isn't a duplicate question. I asked a similar one, but the calculation I wanted there was completely different.

Please, I still haven't received a correct answer. Any help will be really appreciated.

For the purposes of the question I have assumed that your text_id column is not the index of your dataframe. If it is, just call reset_index() before doing the following:

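import pandas as pd

df = pd.DataFrame({"text_id": [1, 1, 2],
                  "name": ["John", "John", "Anne"],
                  "date": ["2018-01-01", "2018-02-01", "2018-03-01"],
                  "words": [{"ocean", "blue"}, {"ocean", "green"}, {"table", "chair"}]})

# Words of the previous row; row 0 has no predecessor, so it gets an empty set.
df["word history 1"] = df["words"].shift(1).fillna(pd.Series([set()]))
# Words present in the previous row but missing from the current one.
df["erased words"] = df["word history 1"] - df["words"]

# The first edit of each text has nothing to compare against: reset it to an empty set.
idx = df.groupby("text_id").head(1).index
df.loc[idx, "erased words"] = df.loc[idx, "erased words"].apply(lambda x: set())
df.drop("word history 1", axis=1, inplace=True)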

So in essence, I've created a history column that lags the original words column by one row, subtracted the current words from it, and then reset the first row of each text_id group to an empty set. You'll end up with:

df
    text_id  name   date        words           erased words
0   1        John   2018-01-01  {blue, ocean}   {}
1   1        John   2018-02-01  {green, ocean}  {blue}
2   2        Anne   2018-03-01  {chair, table}  {}
