Pandas: Truncate string in column based on substring pulled from other column (Python 3)

Question

I have a dataframe with two pertinent columns, "rm_word" and "article."

Data Sample:

,grouping,fts,article,rm_word
0,"1",fts,"This is the article. This is a sentence. This is a sentence. This is a sentence. This goes on for awhile and that's super ***crazy***. It goes on and on.",crazy

I want to check the last 100 characters of each "article" to determine whether that row's "rm_word" appears. If it does, I want to delete the entire sentence in which "rm_word" appears, as well as all the sentences that follow it, from the "article."

Desired Result (when "crazy" is the "rm_word"):

,grouping,fts,article,rm_word
0,"1",fts,"This is the article. This is a sentence. This is a sentence. This is a sentence.",crazy

This mask is able to determine when an article contains its "rm_word," but I'm having trouble with the sentence deletion bit.

mask = ([ (str(a) in b[-100:].lower()) for a,b in zip(df["rm_word"], df["article"])])

print (df.loc[mask])

Any help would be much appreciated. Thank you so much.

Answer 1

Does this work?

import pandas as pd

df = pd.DataFrame(
    columns=['article', 'rm_word'],
    data=[["This is the article. This is a sentence. This is a sentence. This is a sentence.", 'crazy'],
          ["This is the article. This is a sentence. This is a sentence. This is a sentence. This goes on for awhile and that's super crazy. It goes on and on.", 'crazy']]
)

def clean_article(x):
    word = x['rm_word'].lower()
    # Leave the row alone unless rm_word is in the last 100 characters (case-insensitive).
    if word not in x['article'][-100:].lower():
        return x
    # Cut at the LAST occurrence of rm_word, so an earlier occurrence
    # elsewhere in the article does not truncate too much.
    idx = x['article'].lower().rfind(word)
    # Drop the partial sentence that led up to rm_word.
    article = x['article'][:idx].split('.')[:-1]
    x['article'] = '.'.join(article) + '.'
    return x

df = df.apply(clean_article, axis=1)
df['article'].values

Returns

array(['This is the article. This is a sentence. This is a sentence. This is a sentence.',
       'This is the article. This is a sentence. This is a sentence. This is a sentence.'],
      dtype=object)
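
If some of your sentences end in "!" or "?", splitting on "." alone will keep too much. Here is a rough sketch of a variant that cuts back to the last sentence-ending punctuation mark before "rm_word". The function name truncate_before_word and the tail parameter are made up for illustration, and it assumes that ".", "!" and "?" only ever end sentences (so abbreviations like "e.g." would trip it up).

```python
import re

def truncate_before_word(article, rm_word, tail=100):
    """Drop the sentence containing rm_word, and everything after it,
    when rm_word occurs in the last `tail` characters of article.
    Hypothetical helper, not part of pandas."""
    word = rm_word.lower()
    # Nothing to do for an empty word or when the word is not in the tail.
    if not word or word not in article[-tail:].lower():
        return article
    # Start of the last occurrence of rm_word (case-insensitive).
    idx = article.lower().rfind(word)
    # End positions of sentence-ending punctuation before that occurrence.
    ends = [m.end() for m in re.finditer(r'[.!?]', article[:idx])]
    # Keep everything up to the last complete sentence; if rm_word is in
    # the first sentence, nothing is left.
    return article[:ends[-1]] if ends else ''
```

To apply it row by row, you can reuse the zip pattern from the question: df['article'] = [truncate_before_word(a, w) for a, w in zip(df['article'], df['rm_word'])]. A plain list comprehension like this avoids the per-row Series overhead of apply with axis=1, which may matter on a large frame.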
