
Counting occurrences of a word in a column of a TSV file using Python

Question from a Python beginner! I have a TSV file that looks like this:

WHI5    YOR083W CDC28   YBR160W physical interactions   19823668
WHI5    YOR083W CDC28   YBR160W physical interactions   21658602
WHI5    YOR083W CDC28   YBR160W physical interactions   24186061
WHI5    YOR083W RPD3    YNL330C physical interactions   19823668
WHI5    YOR083W SWI4    YER111C physical interactions   15210110
WHI5    YOR083W SWI4    YER111C physical interactions   15210111

I would like to count all the lines that contain the same word in row[3], and output only the first such line, with the number of occurrences in a new column:

WHI5    YOR083W CDC28   YBR160W physical interactions   19823668    3
WHI5    YOR083W RPD3    YNL330C physical interactions   19823668    1
WHI5    YOR083W SWI4    YER111C physical interactions   15210110    2

So far I have tried combining 'csv' with 'Counter', and 'pandas' with 'Counter', without success.

Using pandas:

>>> import pandas as pd
>>> from io import StringIO  # the sample is a str, so StringIO rather than BytesIO
>>> df = pd.read_table(StringIO("""\
... col1 col2 col3 col4 col5 col6
... WHI5    YOR083W CDC28   YBR160W "physical interactions"   19823668
... WHI5    YOR083W CDC28   YBR160W "physical interactions"   21658602
... WHI5    YOR083W CDC28   YBR160W "physical interactions"   24186061
... WHI5    YOR083W RPD3    YNL330C "physical interactions"   19823668
... WHI5    YOR083W SWI4    YER111C "physical interactions"   15210110
... WHI5    YOR083W SWI4    YER111C "physical interactions"   15210111"""),
... sep=r"\s+")  # split on runs of whitespace; the quoted field stays in one piece

The resulting DataFrame looks like this:

>>> df
   col1     col2   col3     col4                   col5      col6
0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668
1  WHI5  YOR083W  CDC28  YBR160W  physical interactions  21658602
2  WHI5  YOR083W  CDC28  YBR160W  physical interactions  24186061
3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668
4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110
5  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210111

[6 rows x 6 columns]
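
The sample above is typed in with spaces, which is why the two-word field needs quotes. For an actual tab-separated file with no header row, a sketch along these lines may be closer to what is needed. The column names and the in-memory stand-in for the file are assumptions made for illustration:

```python
import io

import pandas as pd

# Stand-in for the real file; in practice a file path would be passed to read_csv.
tsv_text = (
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tRPD3\tYNL330C\tphysical interactions\t19823668\n"
)

# Hypothetical column names, since the file itself has no header row.
names = ["col1", "col2", "col3", "col4", "col5", "col6"]

# sep="\t" splits on tabs only, so "physical interactions" stays in one field.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t", header=None, names=names)
```

With tabs as the only separator, no quoting is needed around fields that contain spaces.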

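To get the count, group by col3 and take the length of each group. (Counted from zero, the question's row[3] is col4; in this sample col3 and col4 pair up one-to-one, so the counts are the same either way.)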

>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
   col1     col2   col3     col4                   col5      col6 cnt
0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668   3
1  WHI5  YOR083W  CDC28  YBR160W  physical interactions  21658602   3
2  WHI5  YOR083W  CDC28  YBR160W  physical interactions  24186061   3
3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668   1
4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110   2
5  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210111   2

[6 rows x 7 columns]

To pick the first row of each group:

>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
         col1     col2   col3     col4                   col5      col6 cnt
col3
CDC28 0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668   3
RPD3  3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668   1
SWI4  4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110   2

[3 rows x 7 columns]
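
The same result can also be sketched with only the standard library, using the 'csv' and 'Counter' combination the question mentions. This is a minimal sketch, not a definitive implementation: the helper name count_first_rows, the key_index parameter, and the in-memory sample standing in for the file are all assumptions for illustration.

```python
import csv
import io
from collections import Counter


def count_first_rows(rows, key_index=3):
    """Keep the first row seen for each key and append that key's total count."""
    rows = [row for row in rows if row]  # skip blank lines
    counts = Counter(row[key_index] for row in rows)
    seen = set()
    result = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            result.append(row + [str(counts[key])])
    return result


# In-memory stand-in for the TSV file from the question.
sample = (
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t21658602\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t24186061\n"
    "WHI5\tYOR083W\tRPD3\tYNL330C\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210110\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210111\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerows(count_first_rows(reader))
print(out.getvalue(), end="")
```

For a real file, the two StringIO objects would be replaced by files opened with open(path, newline=""). The whole file is read into memory first because the total for a key is not known until every line has been seen.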
