I need some tips for a calculation.
My dataframe looks like the following:
text_id   name    date         words
1         John    2018-01-01   {ocean, blue}
1         John    2018-02-01   {ocean, green}
2         Anne    2018-03-01   {table, chair}
3         Anne    2018-03-01   {hot, cold, warm}
3         Mark    2018-04-01   {hot, cold}
3         Ethan   2018-05-01   {warm, icy}
4         Paul    2018-01-01   {cat, dog, puppy}
4         John    2018-02-01   {cat}
5         Paul    2018-03-01   {cat, sheep, deer}
In the table, text_id identifies a specific text (SAME text_id = SAME text). The name column holds the person who edited the text, the date column the date on which that edit was made, and the words column the set of words that make up the text after the edit. I need to add an additional column, erased_words, containing the set difference between the previous edit (previous row) and the current edit (current row) on THE SAME text. This probably means the operation must be done grouping by text_id.
The sample output here would be:
text_id   name    date         words               erased_words
1         John    2018-01-01   {ocean, blue}       {}
1         John    2018-02-01   {ocean, green}      {blue}
2         Anne    2018-03-01   {table, chair}      {}
3         Anne    2018-03-01   {hot, cold, warm}   {}
3         Mark    2018-04-01   {hot, cold}         {warm}
3         Ethan   2018-05-01   {warm, icy}         {hot, cold}
4         Paul    2018-01-01   {cat, dog, puppy}   {}
4         John    2018-02-01   {cat}               {dog, puppy}
5         Paul    2018-03-01   {cat, sheep, deer}  {}
Note that the erased_words column contains the set difference between the words column in row i-1 and the words column in row i, but only when row i and row i-1 share the same text_id: I only want the words that go missing between consecutive edits of the SAME text (same text_id), not across different texts.
Any tips on this will be extremely helpful.
EDIT :
To turn the words column from a string into a set, do (the extra strip removes the whitespace left after each comma):
df['words'] = df['words'].str.strip('{}').str.split(',').apply(lambda ws: {w.strip() for w in ws})
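As a quick sanity check of that conversion, assuming the raw column holds strings shaped like the sample rows:

```python
import pandas as pd

# Hypothetical raw column: sets stored as strings such as "{ocean, blue}".
df = pd.DataFrame({"words": ["{ocean, blue}", "{hot, cold, warm}"]})

# Strip the braces, split on commas, and trim whitespace around each word.
df["words"] = (df["words"].str.strip("{}")
                          .str.split(",")
                          .apply(lambda ws: {w.strip() for w in ws}))
```

Each cell is now a real Python set, so the - operator works row by row.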
NOTE:
This is not a duplicate question; I asked a similar one before, but the calculation I wanted there was completely different.
Please note that I still haven't gotten a correct answer. Any help will be really appreciated.
For the purposes of the question I have assumed that your text_id column is not the index of your dataframe, but even if it is, just call reset_index() before doing the following:
df = pd.DataFrame({"text_id": [1, 1, 2],
                   "name": ["John", "John", "Anne"],
                   "date": ["2018-01-01", "2018-02-01", "2018-03-01"],
                   "words": [{"ocean", "blue"}, {"ocean", "green"}, {"table", "chair"}]})

# Previous row's word set; only row 0 is NaN after the shift, so fill it with an empty set
df["word history 1"] = df["words"].shift(1).fillna(pd.Series([set()]))
# Element-wise set difference: previous words minus current words
df["erased words"] = df["word history 1"] - df["words"]
# The first row of each text_id has no previous edit of the same text, so blank it out
idx = df.groupby("text_id").head(1).index
df.loc[idx, "erased words"] = df.loc[idx, "erased words"].apply(lambda x: set())
# Drop the helper column
df.drop("word history 1", axis=1, inplace=True)
So in essence, I've created a history column that delays each row of the original words column by one. You'll end up with:
df
   text_id   name   date         words            erased words
0  1         John   2018-01-01   {blue, ocean}    {}
1  1         John   2018-02-01   {green, ocean}   {blue}
2  2         Anne   2018-03-01   {chair, table}   {}
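A variant worth knowing, sketched here as an alternative rather than a correction: shifting within each group via groupby().shift() makes the first row of every text_id NaN directly, so no post-hoc index fix is needed.

```python
import pandas as pd

df = pd.DataFrame({"text_id": [1, 1, 2],
                   "words": [{"ocean", "blue"}, {"ocean", "green"}, {"table", "chair"}]})

# Shift inside each text_id: the first edit of every text gets NaN.
prev = df.groupby("text_id")["words"].shift(1)

# Previous minus current where a previous edit exists, empty set otherwise.
df["erased words"] = [p - c if isinstance(p, set) else set()
                      for p, c in zip(prev, df["words"])]
print(df["erased words"].tolist())  # [set(), {'blue'}, set()]
```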