How to extract row counts based on multiple descriptors in a column from a csv and then export the output to a new csv using a bash/python script?
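I am working with a csv file (100 rows) that contains data like the following. I would like to get the count of each element for each gene, in csv/tab format.
Input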
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
Expected output
Gene G-Box MYC
---------- ------- -----
STBZIP1 2 3
STBZIP10 4 3
Could someone help me come up with a bash script (or python) for this?
Update
I am trying the following and am stuck for the moment:
import pandas as pd
df = pd.read_csv("Promoter_Element_Distribution.csv")
print (df)
df.groupby(['Gene', 'Element']).size().unstack(fill_value=0)
With a file of the following form (named input.csv here):
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
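this code
import pandas as pd
# Read the whitespace-separated file; skiprows=1 drops the original header line,
# so the dashed separator row is consumed as the header and renamed below
df = pd.read_csv('input.csv', sep=r'\s+', skiprows=1)
df.columns = ['Gene', 'Element']
# Add a helper column of ones so that summing it counts occurrences
df['Count'] = 1
df = df.pivot_table(index='Gene', columns='Element', aggfunc='sum')
print(df)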
gives you
Count
Element G-box MYC
Gene
STBZIP1 2 3
STBZIP10 4 3
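Since the question also asks for the result as a new csv/tab file, here is a minimal sketch of an alternative that uses pd.crosstab and DataFrame.to_csv. It is not the code from the answer above. The SAMPLE string, the count_elements helper and the idea of reading from an in-memory buffer are assumptions made so the example runs on its own; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Sample data in the same layout as input.csv: header, dashed row, data.
# (Assumption: the real file is whitespace-separated like this.)
SAMPLE = """\
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
"""


def count_elements(source):
    """Return a Gene x Element table of occurrence counts (hypothetical helper)."""
    # skiprows=[1] drops only the dashed row, so the real header is kept.
    df = pd.read_csv(source, sep=r"\s+", skiprows=[1])
    return pd.crosstab(df["Gene"], df["Element"])


counts = count_elements(io.StringIO(SAMPLE))

# With no path, to_csv returns the text; pass a path to write a file instead.
csv_text = counts.to_csv()
tsv_text = counts.to_csv(sep="\t")
print(csv_text)
```

Calling counts.to_csv("some_output.csv") (the filename is only an example) would write the same table to disk. crosstab fills missing Gene/Element combinations with 0, which may or may not be what you want for your data.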
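Since you also asked for a bash version, here is one using awk [1]. It is commented, and the output is also "nicely" formatted, so the code is a bit large (about 20 lines without the comments).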
awk '# First record line:
# Store the name of the first column (the element names
# are collected from the data lines below)
NR == 1 {firstcol=$1;element[$1]++}
# Lines after the header and the dashed separator are data
# Occurrences are counted with an indexed array
# count[x][y] contains the count of Element y for the Gene x
NR > 2 {element[$2]++;count[$1][$2]++}
# Done, time for displaying the results
END {
# Let us display the first line, column names
## Left-justify the first col, because it is text
printf "%-10s ", firstcol
## Other are counts, so we right-justify
for (i in element) if (i != firstcol) printf "%10s ", i
printf "\n"
# Now a horizontal bar
for (i in element) {
c = 0
while (c++ < 10) { printf "-"}
printf " ";
}
printf "\n"
# Now, loop through the count records
for (i in count) {
# Left justification for the column name
printf "%-10s ", i ;
for (j in element)
# For each counted element (i.e. every column except the first),
# print its count right-justified, or 0 if the gene never has it
if (j != firstcol) printf "%10s ", ((j in count[i]) ? count[i][j] : 0)
printf "\n"
}
}' tab-separated-input.txt
Result:
Gene G-box MYC
---------- ---------- ----------
STBZIP10 4 3
STBZIP1 2 3
[1] This solution requires GNU awk to handle arrays of arrays (the count[$1][$2] syntax). Thanks to Ed Morton.
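If GNU awk is not available, the same counting logic can be sketched in plain Python with only the standard library. This is an illustration and not part of the original answer. The SAMPLE string and the pivot_counts helper are hypothetical names, and the sketch assumes the first two lines are the header and the dashed separator, as in the input shown above.

```python
import csv
import io
import sys
from collections import Counter

# Same layout as the input above: header, dashed row, then data lines.
SAMPLE = """\
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
"""


def pivot_counts(lines):
    """Count (gene, element) pairs and return (header, rows) for a wide table."""
    counts = Counter()
    genes, elements = [], []  # remember first-seen order
    first_col = "Gene"
    for lineno, line in enumerate(lines):
        fields = line.split()
        if lineno == 0 and fields:
            first_col = fields[0]  # name of the first column
        if lineno < 2 or len(fields) != 2:
            continue  # skip header, dashed row and malformed lines
        gene, element = fields
        counts[gene, element] += 1
        if gene not in genes:
            genes.append(gene)
        if element not in elements:
            elements.append(element)
    header = [first_col] + elements
    # A Counter returns 0 for pairs that never occurred.
    rows = [[g] + [counts[g, e] for e in elements] for g in genes]
    return header, rows


header, rows = pivot_counts(io.StringIO(SAMPLE))

# Write the table as tab-separated text; use delimiter="," for csv.
writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(header)
writer.writerows(rows)
```

Unlike the awk loop over an associative array, this version keeps genes and elements in the order they first appear in the file, so the output order is predictable.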