Find the count of letters in each column

Question

I need to find the count of each letter in every column (character position) of a set of strings, as follows:

String: ATCG
        TGCA
        AAGC
        GCAT

Here, string is a pandas Series.

I need to write a program that reports, for each column, how many times each letter occurs.

I have written the following code, but the result contains an extra row at index 0 and an extra column at the end (column index 450, i.e. the 451st column), both filled with NaN values. I should not be getting either that row or column 451; I need exactly 450 columns.

f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts = pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis=1)

Can anyone please help me understand the issue?

Answer 1

Here is one way you can implement your logic. If required, you can turn your Series into a list via lst = s.tolist().

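import pandas as pd

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']

# One row per letter; zip(*lst) yields the columns, and each cell counts that letter in one column.
arr = [[i.count(x) for i in zip(*lst)] for x in 'ATCG']

res = pd.DataFrame(arr, index=list('ATCG'))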

Result

   0  1  2  3
A  2  1  1  1
T  1  1  0  1
C  0  1  2  1
G  1  1  1  1

Explanation

In the list comprehension, deal with columns first: zip(*lst) yields the first, second, third and fourth characters of every string in turn.
Deal with rows second by iterating through 'ATCG' sequentially, so each letter gets one row of per-column counts.
This produces a list of lists which can be fed directly into pd.DataFrame.
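
To make the explanation above concrete, here is a minimal sketch that unrolls the nested comprehension into explicit loops. The names columns, letter, row and column are only illustrative and do not appear in the answer's code.

```python
lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']

# zip(*lst) transposes the strings: each tuple holds one column.
columns = list(zip(*lst))
print(columns[0])  # ('A', 'T', 'A', 'G')

# Unrolled equivalent of the nested list comprehension.
arr = []
for letter in 'ATCG':        # one output row per letter
    row = []
    for column in columns:   # one output cell per column
        row.append(column.count(letter))
    arr.append(row)

print(arr)  # [[2, 1, 1, 1], [1, 1, 0, 1], [0, 1, 2, 1], [1, 1, 1, 1]]
```

Because zip stops at the shortest input, the number of columns equals the length of the shortest string in lst.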

Answer 2

With Series.value_counts():

>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

>>> s.str.join('|').str.split('|', expand=True)\
...     .apply(lambda col: col.value_counts(), axis=0)\
...     .fillna(0.)\
...     .astype(int)
   0  1  2  3
A  2  1  1  1
C  0  1  2  1
G  1  1  1  1
T  1  1  0  1

I'm not sure in what order you want the index, but you could call .reindex() or .sort_index() on this result.
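
As a rough illustration of the reindexing suggestion, the sketch below reorders the rows to A, T, C, G. The variable names counts and ordered, and the choice of that particular order, are assumptions made for the example rather than part of the answer.

```python
import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

# Same pipeline as above: one character per cell, then count per column.
counts = (
    s.str.join('|').str.split('|', expand=True)
     .apply(lambda col: col.value_counts(), axis=0)
     .fillna(0)
     .astype(int)
)

# Put the rows in an explicit letter order. A letter that never occurs
# in the data would get a row of zeros because of fill_value.
ordered = counts.reindex(list('ATCG'), fill_value=0)
print(ordered)
```

Passing an explicit label list to reindex also guarantees that all four letters appear as rows, even if one of them is absent from the input.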

The first line, s.str.join('|').str.split('|', expand=True), gets you an "expanded" version:

   0  1  2  3
0  A  T  C  G
1  T  G  C  A
2  A  A  G  C
3  G  C  A  T

which should be faster than calling pd.Series(list(x)) ... on each row.
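
For comparison, here is a small sketch that builds the expanded frame both ways and checks that they hold the same characters. The names via_split, via_rows and same are hypothetical, and the row-by-row variant is only one possible reading of "calling pd.Series(list(x)) on each row". The sketch makes no timing claim; the speed difference would need to be measured (for example with timeit) on data of realistic size.

```python
import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

# Vectorised string methods: insert a separator, then split on it.
via_split = s.str.join('|').str.split('|', expand=True)

# Row-by-row alternative: build one Series per string, one character per cell.
via_rows = pd.DataFrame([pd.Series(list(x)) for x in s])

# Compare the cell contents only, ignoring any dtype differences.
same = via_split.values.tolist() == via_rows.values.tolist()
print(same)
```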


Solution 1
3 2018-03-24 20:16:28

Solution 2
2 2018-03-24 21:03:33
