
Finding the count of letters in each column

I need to find the count of letters in each column, as follows:

String: ATCG
        TGCA
        AAGC
        GCAT

Here, string is a pandas Series.

I need to write a program that produces the following:

  0 1 2 3
A 2 1 1 1
T 1 1 0 1
C 0 1 2 1
G 1 1 1 1 

I have written the following code, but the result contains an extra row at index 0 and an extra column at the end (column index 450, i.e. the 451st column), both filled with NaN values. I should not be getting either the row or column 451; I need exactly 450 columns.

f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts=pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis =1)

Can anyone please help me understand the issue?

Here is one way you can implement your logic. If required, you can turn your Series into a list via lst = s.tolist().

import pandas as pd

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']

# one row per letter; each row holds that letter's count in every column
arr = [[col.count(x) for col in zip(*lst)] for x in 'ATCG']

res = pd.DataFrame(arr, index=list('ATCG'))

Result

   0  1  2  3
A  2  1  1  1
T  1  1  0  1
C  0  1  2  1
G  1  1  1  1

Explanation

  • In the list comprehension, deal with columns first: zip(*lst) iterates over the first, second, third and fourth characters of each string in turn.
  • Deal with rows second by iterating through 'ATCG' sequentially.
  • This produces a list of lists which can be fed directly into pd.DataFrame.
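The steps above can be sketched in plain Python. This is only an illustration using the sample data from the question; the names columns, sparse and dense are introduced here and are not part of the original answer. It also shows one way NaN values can arise with the dict comprehension from the question: a letter that is absent from a column gets no key at all, and pandas fills such missing keys with NaN when the dicts are turned into a DataFrame.

```python
lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']

# zip(*lst) transposes the strings: one tuple per column position
columns = list(zip(*lst))
print(columns[0])  # ('A', 'T', 'A', 'G')

# Question's approach: a letter absent from a column gets no key at all
sparse = [{letter: col.count(letter) for letter in col} for col in columns]
print(sparse[0])  # {'A': 2, 'T': 1, 'G': 1} (no 'C' entry)

# Answer's approach: a fixed alphabet yields an explicit 0 instead
dense = [[col.count(letter) for col in columns] for letter in 'ATCG']
print(dense)  # [[2, 1, 1, 1], [1, 1, 0, 1], [0, 1, 2, 1], [1, 1, 1, 1]]
```

Whether this accounts for the extra row and column in the 450-column data cannot be confirmed from the sample alone; it only explains NaN entries for letters missing from a column.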

With Series.value_counts():

>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

>>> s.str.join('|').str.split('|', expand=True)\
...     .apply(lambda col: col.value_counts(), axis=0)\
...     .fillna(0.)\
...     .astype(int)
   0  1  2  3
A  2  1  1  1
C  0  1  2  1
G  1  1  1  1
T  1  1  0  1

I'm not sure how you want the index ordered, but you can call .reindex() or .sort_index() on this result.
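For instance, to get the rows back in the A, T, C, G order used in the question, a minimal sketch could look like this (the variable names counts and ordered are only illustrative):

```python
import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

# Per-column letter counts, built with the same chain as above
counts = (
    s.str.join('|').str.split('|', expand=True)
     .apply(lambda col: col.value_counts(), axis=0)
     .fillna(0)
     .astype(int)
)

# Put the rows in an explicit order; letters not listed would be dropped
ordered = counts.reindex(list('ATCG'))
print(ordered)
```

.reindex() takes the desired labels explicitly, whereas .sort_index() would simply sort them alphabetically.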

The first line, s.str.join('|').str.split('|', expand=True), gets you an "expanded" version:

   0  1  2  3
0  A  T  C  G
1  T  G  C  A
2  A  A  G  C
3  G  C  A  T

which should be faster than calling pd.Series(list(x)) ... on each row.
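As a rough sketch of the comparison, the per-row alternative might be written with Series.apply; this is an assumption about what the elided call looks like, and the names expanded and per_row are illustrative. Both routes give the same characters in the same positions. No timings are measured here, and the relative speed may depend on the data size and the pandas version.

```python
import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

# Vectorised string route from the answer
expanded = s.str.join('|').str.split('|', expand=True)

# Per-row alternative (assumed form): build one Series per string
per_row = s.apply(lambda x: pd.Series(list(x)))

# Compare the cell values of the two frames
print(expanded.values.tolist() == per_row.values.tolist())
```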

