
How to check if a string value of one row is contained in the string value of another row in the same column in pandas dataframe

I have a dataframe as follows:

The "docid" is the exploded column of "DocID". I want to check if a string in the "Term" column is contained in another row in the same column. For example, rows 3 and 4 have "in the treatment" and "in the treatment of".

The DocFreq is the number of documents in which those terms occurred.



I want to check if there are documents in which both those strings occurred and keep only the rows with the longer string.

So for example: "in the treatment" occurs in 26 documents while "in the treatment of" occurs in 22 documents. So there are only 4 documents that have only "in the treatment".

I want to reduce the DocFreq to the count of documents that contain that particular ngram and not a superset of that ngram. So ideally "in the treatment" should have a DocFreq of 4.

Can this be done? I don't know how to begin.

EDIT

Input dataframe:

# | Term                | DocFreq | Ngram | docID
1 | are to be           | 2       | 3     | doc103.txt,doc11.txt
2 | are widely used     | 2       | 3     | doc102.txt,doc80.txt
3 | in the treatment    | 6       | 3     | doc10.txt,doc9.txt,doc21.txt,doc22.txt..
4 | in the treatment of | 4       | 4     | doc21.txt,doc22.txt,doc23.txt,doc24.txt
5 | more effective than | 8       | 3     | doc11.txt,...
6 | did not improve     | 3       | 3     | doc15.txt,doc16.txt,doc17.txt
7 | did not improve the | 2       | 4     | doc15.txt,doc17.txt
8 | not improve the     | 2       | 3     | doc15.txt,doc14.txt

Here, in 3 and 4, there are overlapping documents. doc10.txt and doc9.txt contain only "in the treatment" while the rest of the documents contain "in the treatment of" which is the bigger ngram.

I need DocFreq to count only the documents that contain exactly that term and not a longer term that includes it. So I need to remove the other documents and bring the DocFreq down to 2 in that instance. The same applies to rows 6, 7 and 8.

So the output I need is:

# | Term                | DocFreq | Ngram | docID
1 | are to be           | 2       | 3     | doc103.txt,doc11.txt
2 | are widely used     | 2       | 3     | doc102.txt,doc80.txt
3 | in the treatment    | 2       | 3     | doc10.txt,doc9.txt
4 | in the treatment of | 4       | 4     | doc21.txt,doc22.txt,doc23.txt,doc24.txt
5 | more effective than | 8       | 3     | doc11.txt,...
6 | did not improve     | 1       | 3     | doc16.txt
7 | did not improve the | 2       | 4     | doc15.txt,doc17.txt
8 | not improve the     | 1       | 3     | doc14.txt

Please help! Thank you!

What you can do is subtract the DocFreq of "in the treatment of" from the DocFreq of "in the treatment", which gives the number of documents that contain only "in the treatment". Then, for docID, remove every docID listed for "in the treatment of" from the list for "in the treatment".
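That idea could be sketched in pandas roughly as follows. This is only a minimal sketch under some assumptions: docID is a comma-separated string, one term "contains" another when the shorter one appears in it as a whole-word sequence, and DocFreq is recomputed from the remaining docID list instead of being subtracted (the two agree whenever each longer term's documents are a subset of the shorter term's documents). The function name `remove_superset_docs` is made up for illustration, and the sample uses only rows 1, 2, 6, 7 and 8 of the question, since those are the rows whose docID lists are given in full.

```python
import pandas as pd

# Rows 1, 2, 6, 7 and 8 from the question (docID lists given in full there).
df = pd.DataFrame({
    "Term": ["are to be", "are widely used", "did not improve",
             "did not improve the", "not improve the"],
    "DocFreq": [2, 2, 3, 2, 2],
    "Ngram": [3, 3, 3, 4, 3],
    "docID": [
        "doc103.txt,doc11.txt",
        "doc102.txt,doc80.txt",
        "doc15.txt,doc16.txt,doc17.txt",
        "doc15.txt,doc17.txt",
        "doc15.txt,doc14.txt",
    ],
})


def remove_superset_docs(frame):
    """Hypothetical helper: drop from each term's docID list every document
    that also appears under a longer term containing it, then recount DocFreq."""
    terms = frame["Term"].tolist()
    doc_lists = [ids.split(",") for ids in frame["docID"]]

    kept = []
    for term, docs in zip(terms, doc_lists):
        covered = set()
        for other, other_docs in zip(terms, doc_lists):
            # Padding with spaces makes this a whole-word match, so
            # "in the" would not match inside "within the".
            if other != term and f" {term} " in f" {other} ":
                covered.update(other_docs)
        # Keep the original order of the remaining documents.
        kept.append([d for d in docs if d not in covered])

    out = frame.copy()
    out["docID"] = [",".join(d) for d in kept]
    out["DocFreq"] = [len(d) for d in kept]
    return out


result = remove_superset_docs(df)
print(result)
```

For these rows, "did not improve" keeps only doc16.txt and "not improve the" keeps only doc14.txt, which matches the desired output in the question. The nested loop compares every pair of terms, so it is quadratic in the number of rows; that should be fine for a small table, but a larger one would probably need something smarter. A term whose documents are all covered by longer terms ends up with DocFreq 0 and an empty docID, and you may want to drop such rows afterwards.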
