How to check if a string value of one row is contained in the string value of another row in the same column in pandas dataframe

I have a dataframe as follows:

The "docid" is the exploded column of "DocID". I want to check if a string in the "Term" column is contained in another row in the same column. For example, rows 3 and 4 have "in the treatment" and "in the treatment of".
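A minimal sketch of that containment check, assuming the terms are held in a pandas Series. The helper name `containing_terms` is hypothetical, and the sample terms are copied from the input table further down. It uses a plain character-level substring match, so it may need tightening if word boundaries matter.

```python
import pandas as pd

# Sample terms taken from the question's input table.
terms = pd.Series(
    ["are to be", "in the treatment", "in the treatment of",
     "did not improve", "did not improve the", "not improve the"],
    name="Term",
)

def containing_terms(term, terms):
    """Return the other terms in the column that contain `term` as a substring."""
    # regex=False treats `term` as a literal string, not a pattern.
    mask = terms.str.contains(term, regex=False) & (terms != term)
    return terms[mask].tolist()

for t in terms:
    print(repr(t), "is contained in", containing_terms(t, terms))
```

Terms with an empty list are not part of any longer term in the column.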

The DocFreq is the number of documents in which those terms occurred.


I want to check if there are documents in which both those strings occurred and keep only the rows with the longer string.

So for example: "in the treatment" occurs in 26 documents while "in the treatment of" occurs in 22 documents. So there are only 4 documents that have only "in the treatment".

I want to reduce the DocFreq to only the count of documents that contain that particular ngram and not a superset of that ngram. So ideally "in the treatment" should have 4 as DocFreq.

Can this be done? I don't know how to begin.

EDIT

Input dataframe:

# | Term                | DocFreq | Ngram | docID
1 | are to be           | 2       | 3     | doc103.txt,doc11.txt
2 | are widely used     | 2       | 3     | doc102.txt,doc80.txt
3 | in the treatment    | 6       | 3     | doc10.txt,doc9.txt,doc21.txt,doc22.txt..
4 | in the treatment of | 4       | 4     | doc21.txt,doc22.txt,doc23.txt,doc24.txt
5 | more effective than | 8       | 3     | doc11.txt,...
6 | did not improve     | 3       | 3     | doc15.txt,doc16.txt,doc17.txt
7 | did not improve the | 2       | 4     | doc15.txt,doc17.txt
8 | not improve the     | 2       | 3     | doc15.txt,doc14.txt

Here, rows 3 and 4 have overlapping documents. doc10.txt and doc9.txt contain only "in the treatment", while the rest of the documents contain "in the treatment of", which is the longer ngram.

I need the DocFreq to represent only the number of documents that contain exactly that term. So I need to remove the other documents and bring the DocFreq down to 2 in that instance. Similarly for rows 6, 7 and 8.

So the output I need is:

# | Term                | DocFreq | Ngram | docID
1 | are to be           | 2       | 3     | doc103.txt,doc11.txt
2 | are widely used     | 2       | 3     | doc102.txt,doc80.txt
3 | in the treatment    | 2       | 3     | doc10.txt,doc9.txt
4 | in the treatment of | 4       | 4     | doc21.txt,doc22.txt,doc23.txt,doc24.txt
5 | more effective than | 8       | 3     | doc11.txt,...
6 | did not improve     | 1       | 3     | doc16.txt
7 | did not improve the | 2       | 4     | doc15.txt,doc17.txt
8 | not improve the     | 1       | 3     | doc14.txt

Please help! Thank you!

What you can do is subtract the DocFreq of "in the treatment of" from the DocFreq of "in the treatment", which gives the number of documents that contain only "in the treatment". Then, for docID, remove every docID listed for "in the treatment of" from the docID list of "in the treatment".
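The docID-removal step could be sketched roughly as follows. This is one possible approach, not a definitive implementation: the function name `exclusive_docs` and its parameters are hypothetical, and the sample frame uses only the rows of the input table whose docID lists are given in full (rows 1, 2, 6, 7 and 8). DocFreq is recomputed from the remaining docIDs rather than by subtraction.

```python
import pandas as pd

# Rows 1, 2, 6, 7 and 8 of the input table (the rows with complete docID lists).
df = pd.DataFrame({
    "Term": ["are to be", "are widely used", "did not improve",
             "did not improve the", "not improve the"],
    "DocFreq": [2, 2, 3, 2, 2],
    "Ngram": [3, 3, 3, 4, 3],
    "docID": ["doc103.txt,doc11.txt", "doc102.txt,doc80.txt",
              "doc15.txt,doc16.txt,doc17.txt", "doc15.txt,doc17.txt",
              "doc15.txt,doc14.txt"],
})

def exclusive_docs(df, term_col="Term", doc_col="docID", sep=","):
    """Drop from each term's docs those that also hold a longer term containing it."""
    out = df.copy()
    doc_lists = [[d.strip() for d in s.split(sep)] for s in out[doc_col]]
    terms = out[term_col].tolist()
    kept = []
    for term, docs in zip(terms, doc_lists):
        # Collect the documents of every other term that contains this term.
        covered = set()
        for other, other_docs in zip(terms, doc_lists):
            if other != term and term in other:
                covered.update(other_docs)
        # Keep the original order of the remaining documents.
        kept.append([d for d in docs if d not in covered])
    out[doc_col] = [sep.join(k) for k in kept]
    out["DocFreq"] = [len(k) for k in kept]
    return out

result = exclusive_docs(df)
print(result)
```

Counting the remaining docIDs is safer than subtracting the two DocFreq values: for "not improve the" and "did not improve the", subtraction would give 2 - 2 = 0, while only doc15.txt is shared and the expected DocFreq is 1. The nested loop compares every pair of terms, so it is quadratic in the number of rows.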


