How to extract row counts based on multiple descriptors in a column from a CSV and export the output to a new CSV using a bash/python script?
I am working with a CSV file (hundreds of rows) containing data as follows. I would like to get the count of each element for each gene, in csv/tab format.
Input
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
Expected output
Gene G-box MYC
---------- ------- -----
STBZIP1 2 3
STBZIP10 4 3
Can someone please help me come up with a bash (or python) script for this?
Update
I am trying the following and am stuck for the time being:
import pandas as pd
df = pd.read_csv("Promoter_Element_Distribution.csv")
print (df)
df.groupby(['Gene', 'Element']).size().unstack(fill_value=0)
With the file in this form (named input.csv here):
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
this
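import pandas as pd
# Skip the first line and split on whitespace; the dashed line becomes a throwaway header
df = pd.read_csv('input.csv', sep=r'\s+', skiprows=1)
df.columns = ['Gene', 'Element']
df['Count'] = 1
# Sum the 1s for every (Gene, Element) pair
df = df.pivot_table(index='Gene', columns='Element', aggfunc='sum')
print(df)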
gives you
Count
Element G-box MYC
Gene
STBZIP1 2 3
STBZIP10 4 3
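The question also asks for the result to be exported to a new CSV, which the snippet above does not do. A minimal sketch of one way to do it, using pd.crosstab instead of the pivot_table call above so that the extra "Count" level does not appear. The sample data is inlined so the block runs on its own, and the output file name gene_element_counts.csv is a hypothetical choice, not something from the question.

```python
import io
import pandas as pd

# Sample data in the same whitespace-separated layout as input.csv above.
SAMPLE = """\
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
"""

# skiprows=[1] drops only the dashed separator line and keeps the real header.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", skiprows=[1])

# crosstab counts each (Gene, Element) pair; pairs that never occur become 0.
counts = pd.crosstab(df["Gene"], df["Element"])
counts.columns.name = None  # drop the "Element" label above the columns

# Write the table to disk (hypothetical file name) and show the CSV text.
counts.to_csv("gene_element_counts.csv")
print(counts.to_csv(), end="")
```

To run it on a real file, replace io.StringIO(SAMPLE) with the path to the file. Passing sep="\t" to to_csv should give tab-separated output instead.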
Since you also asked for a bash version, here is one that uses awk (note 1). It is commented and the output is nicely formatted, so the code is a bit long (about 20 lines without the comments).
awk '# First record line:
# store the name of the first column, and register it in element
# so that the horizontal bar below covers every column
NR == 1 {firstcol=$1; element[$1]++}
# Line 2 is the dashed separator, so the data starts at line 3.
# Occurrences are counted with an array of arrays:
# count[x][y] contains the count of Element y for the Gene x
NR > 2 {element[$2]++; count[$1][$2]++}
# Done, time to display the results
END {
    # Display the first line, the column names
    ## Left-justify the first column, because it is text
    printf "%-10s ", firstcol
    ## The others are counts, so right-justify them
    for (i in element) if (i != firstcol) printf "%10s ", i
    printf "\n"
    # Now a horizontal bar
    for (i in element) {
        c = 0
        while (c++ < 10) { printf "-" }
        printf " "
    }
    printf "\n"
    # Now, loop through the count records
    for (i in count) {
        # Left justification for the gene name
        printf "%-10s ", i
        for (j in element)
            # For each counted element (i.e. every column except the first),
            # print its count right-justified, or 0 if the gene never has it
            if (j != firstcol) printf "%10s ", (j in count[i]) ? count[i][j] : 0
        printf "\n"
    }
}' tab-separated-input.txt
Result:
Gene G-box MYC
---------- ---------- ----------
STBZIP10 4 3
STBZIP1 2 3
1. This solution requires GNU awk for arrays of arrays (the count[$1][$2] syntax). Thanks to Ed Morton.
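Where GNU awk is not available, the same counting logic can be sketched in Python with only the standard library. This is an illustration, not part of the original answer: the function names count_elements and write_table are made up for the example, the sample data is inlined so the block runs on its own, and a Counter keyed by (gene, element) tuples stands in for the awk array of arrays.

```python
import csv
import io
from collections import Counter

# Sample data in the same layout as the input shown in the question.
SAMPLE = """\
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
"""


def count_elements(lines):
    """Count (gene, element) pairs, skipping the header and the dashed line."""
    counts = Counter()
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        # Lines 1 and 2 are the header and the separator; also skip blank lines.
        if line_number <= 2 or len(fields) < 2:
            continue
        counts[(fields[0], fields[1])] += 1
    return counts


def write_table(counts, out):
    """Write one row per gene and one column per element, as CSV."""
    genes = sorted({gene for gene, _ in counts})
    elements = sorted({element for _, element in counts})
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Gene", *elements])
    for gene in genes:
        # A Counter returns 0 for pairs that were never seen.
        writer.writerow([gene, *(counts[(gene, e)] for e in elements)])


pair_counts = count_elements(io.StringIO(SAMPLE))
buffer = io.StringIO()
write_table(pair_counts, buffer)
print(buffer.getvalue(), end="")
```

For a real file, pass an open file object to count_elements and another one, opened for writing with newline="", to write_table. Unlike the awk version, the rows and columns come out sorted, because the genes and elements are sorted explicitly.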