How to check if a string value of one row is contained in the string value of another row in the same column in a pandas dataframe
I have a dataframe as follows:
The "docid" column is the exploded version of the "DocID" column. I want to check whether a string in the "Term" column is contained in another row of the same column. For example, rows 3 and 4 have "in the treatment" and "in the treatment of".
The DocFreq is the number of documents in which those terms occurred.
I want to check whether there are documents in which both of those strings occurred and keep only the rows with the longer string.
So for example: "in the treatment" occurs in 26 documents while "in the treatment of" occurs in 22 documents. So there are only 4 documents that have only "in the treatment".
I want to reduce the DocFreq to the count of documents that contain that particular ngram and not a superset of that ngram. So ideally "in the treatment" should have 4 as DocFreq.
Can this be done? I don't know how to begin.
EDIT
Input dataframe:
# | Term | DocFreq | Ngram | docID
1 | are to be | 2 | 3 | doc103.txt,doc11.txt
2 | are widely used | 2 | 3 | doc102.txt,doc80.txt
3 | in the treatment | 6 | 3 | doc10.txt,doc9.txt,doc21.txt,doc22.txt..
4 | in the treatment of | 4 | 4 | doc21.txt,doc22.txt,doc23.txt,doc24.txt
5 | more effective than | 8 | 3 | doc11.txt,...
6 | did not improve | 3 | 3 | doc15.txt,doc16.txt,doc17.txt
7 | did not improve the | 2 | 4 | doc15.txt,doc17.txt
8 | not improve the | 2 | 3 | doc15.txt,doc14.txt
Here, rows 3 and 4 have overlapping documents. doc10.txt and doc9.txt contain only "in the treatment", while the rest of the documents contain "in the treatment of", which is the longer ngram.
I need the DocFreq to count only the documents that contain that exact term and not a longer term that includes it. So I need to remove the other documents and bring the DocFreq down to 2 in that instance. The same applies to rows 6, 7 and 8.
So the output I need is:
# | Term | DocFreq | Ngram | docID
1 | are to be | 2 | 3 | doc103.txt,doc11.txt
2 | are widely used | 2 | 3 | doc102.txt,doc80.txt
3 | in the treatment | 2 | 3 | doc10.txt,doc9.txt
4 | in the treatment of | 4 | 4 | doc21.txt,doc22.txt,doc23.txt,doc24.txt
5 | more effective than | 8 | 3 | doc11.txt,...
6 | did not improve | 1 | 3 | doc16.txt
7 | did not improve the | 2 | 4 | doc15.txt,doc17.txt
8 | not improve the | 1 | 3 | doc14.txt
Please help! Thank you!
What you can do is subtract the DocFreq of "in the treatment of" from the DocFreq of "in the treatment", which gives the number of documents that contain only "in the treatment". Then, for docID, remove every docID listed for "in the treatment of" from the list for "in the treatment".
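A rough sketch of that idea in pandas, in case it helps. It uses a set difference on the document lists rather than subtracting the counts directly, because plain subtraction only gives the right number when every document of the longer term also appears under the shorter one. The function name `drop_superset_docs` and its parameters are made up for this illustration. The sample frame contains only the rows from the question whose document lists are fully spelled out, and terms are matched on whole words, which is my assumption about what "contained in" should mean here.

```python
import pandas as pd

def drop_superset_docs(df, term_col="Term", docs_col="docID", freq_col="DocFreq"):
    """Hypothetical helper: for each term, drop the documents that are also
    listed under a longer term containing it, then recount DocFreq."""
    terms = df[term_col].tolist()
    doc_lists = [docs.split(",") for docs in df[docs_col]]
    doc_sets = [set(docs) for docs in doc_lists]

    kept_per_row = []
    for i, term in enumerate(terms):
        covered = set()
        for j, other in enumerate(terms):
            # Pad with spaces so that only whole-word matches count.
            if i != j and len(other) > len(term) and f" {term} " in f" {other} ":
                covered |= doc_sets[j]
        # Keep the original document order for readability.
        kept_per_row.append([doc for doc in doc_lists[i] if doc not in covered])

    out = df.copy()
    out[docs_col] = [",".join(kept) for kept in kept_per_row]
    out[freq_col] = [len(kept) for kept in kept_per_row]
    return out

# Rows 1, 6, 7 and 8 from the question (the rows with complete docID lists).
df = pd.DataFrame({
    "Term": ["are to be", "did not improve", "did not improve the", "not improve the"],
    "DocFreq": [2, 3, 2, 2],
    "Ngram": [3, 3, 4, 3],
    "docID": [
        "doc103.txt,doc11.txt",
        "doc15.txt,doc16.txt,doc17.txt",
        "doc15.txt,doc17.txt",
        "doc15.txt,doc14.txt",
    ],
})

result = drop_superset_docs(df)
print(result)
```

On these rows, "did not improve" is left with doc16.txt and "not improve the" with doc14.txt, which matches the output asked for. The nested loop compares every pair of terms, so it is quadratic in the number of rows; that is probably fine for a few thousand terms, but a larger vocabulary would need something smarter.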