I need to count how often each letter occurs in each column (character position) of a set of strings, as follows:
Strings:
ATCG
TGCA
AAGC
GCAT
The strings are stored in a pandas Series.
I need to write a program that produces the following:
   0  1  2  3
A  2  1  1  1
T  1  1  0  1
C  0  1  2  1
G  1  1  1  1
I have written the following code, but the result contains an extra row at index 0 and an extra column at the end (column index 450, i.e. the 451st column), both filled with NaN values. Neither the extra row nor the 451st column should be there; I need exactly 450 columns.
f = zip(*string)
counts = [{letter: column.count(letter) for letter in column} for column in f]
counts = pd.DataFrame(counts).transpose()
print(counts)
counts = counts.drop(counts.columns[[450]], axis=1)
Can anyone please help me understand the issue?
Here is one way you can implement your logic. If required, you can turn your series into a list via lst = s.tolist().
import pandas as pd

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']
# For each letter, count its occurrences in every column of the strings
arr = [[i.count(x) for i in zip(*lst)] for x in 'ATCG']
res = pd.DataFrame(arr, index=list('ATCG'))
Result
   0  1  2  3
A  2  1  1  1
T  1  1  0  1
C  0  1  2  1
G  1  1  1  1
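For comparison, here is a minimal sketch of the dict-based counting from the question, run on the sample strings. It only illustrates one way NaN values can arise: each dict holds just the letters that occur in that column, so any letter missing from a column becomes NaN in the DataFrame. Whether this explains the extra row and column in the question depends on the actual data, which is not shown. The names per_column, raw and filled are made up for this illustration.

```python
import pandas as pd

lst = ['ATCG', 'TGCA', 'AAGC', 'GCAT']

# Dict-based counting as in the question: each dict contains only the
# letters that actually occur in that column.
per_column = [{letter: column.count(letter) for letter in column}
              for column in zip(*lst)]
raw = pd.DataFrame(per_column).transpose()

# Letters absent from a column appear as NaN (here: 'C' in column 0
# and 'T' in column 2).
n_missing = int(raw.isna().sum().sum())
print(n_missing)

# Fill the gaps with 0, convert back to integers, and fix the row order.
filled = raw.fillna(0).astype(int).reindex(list('ATCG'))
print(filled)
```

Counting over a fixed alphabet, as in the list comprehension above, avoids the missing cells in the first place.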
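Explanation
The list comprehension counts each letter in every column produced by zip(*lst), and the resulting nested list is passed to pd.DataFrame.

With Series.value_counts():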
>>> s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])
>>> s.str.join('|').str.split('|', expand=True)\
... .apply(lambda col: col.value_counts(), axis=0)\
... .fillna(0.)\
... .astype(int)
   0  1  2  3
A  2  1  1  1
C  0  1  2  1
G  1  1  1  1
T  1  1  0  1
I'm not sure how you want to order the index logically, but you could call .reindex() or .sort_index() on this result.
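As a rough sketch of that last step, assuming the desired row order is A, T, C, G as in the question (the names counts, ordered and alphabetical are chosen here for illustration):

```python
import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

# Same pipeline as above: expand to one character per cell, then count
# the letters in each column.
counts = (
    s.str.join('|').str.split('|', expand=True)
     .apply(lambda col: col.value_counts(), axis=0)
     .fillna(0)
     .astype(int)
)

# Explicit row order taken from the question's expected output.
ordered = counts.reindex(list('ATCG'))

# Alphabetical row order.
alphabetical = counts.sort_index()

print(ordered)
```

.reindex() would also introduce a NaN row for any label that is not in the index, so it is only a safe choice when every requested letter actually occurs in the data.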
The first line, s.str.join('|').str.split('|', expand=True), gets you an "expanded" version:
   0  1  2  3
0  A  T  C  G
1  T  G  C  A
2  A  A  G  C
3  G  C  A  T
which should be faster than calling pd.Series(list(x)) ... on each row.
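To make the comparison concrete, here is a small sketch that builds the expanded frame both ways and checks that they hold the same characters. The per-row variant is written here as s.apply(lambda x: pd.Series(list(x))), which is only an assumption about what the elided call looks like. No timings are given, and the join/split trick assumes the delimiter '|' never occurs in the data.

```python
import pandas as pd

s = pd.Series(['ATCG', 'TGCA', 'AAGC', 'GCAT'])

# Vectorised string methods: insert a delimiter between the characters,
# then split on it into separate columns.
expanded = s.str.join('|').str.split('|', expand=True)

# Assumed per-row alternative: build one Series per string.
per_row = s.apply(lambda x: pd.Series(list(x)))

# Both frames contain the same characters in the same cells.
same = expanded.values.tolist() == per_row.values.tolist()
print(same)
```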