
Python stacked barchart where y-axis scale is linear but the bar fill is logarithmic in the order of 10s

As the title explains, I am trying to reproduce a stacked barchart where the y-axis scale is linear but the inside fill of the plot (i.e. the stacked bars) is logarithmic, grouped in orders of 10.

I have made this plot before in R-Studio with an in-house package, but I am trying to reproduce it with another tool (Python) to validate and confirm my analysis.

Quick description of the data, with more detail:

I have thousands of entries of clonal cell information. They have multiple identifiers, such as "Strain", "Sample", and "cloneID", as well as a frequency value ("cloneFraction") for each clone.

This is the .head() of the dataset I am working with, to give you an idea of my data.

I am trying to reproduce the following plot, which I made with R-Studio (linked in the original post).

This plot divides the dataset into groups based on clone frequency: the top 10 most frequent clones are grouped in red, followed by the next 100, the next 1000, and so on. The y-axis uses a 0.00-1.00 scale; a 0-100% scale would be equivalent in this context.

This is just to visualize whether I have big clones (the top 10) and how much of the overall dataset they occupy by frequency, i.e. the bigger the red stack, the larger the clones I have, signifying a significant clonal expansion of a few selected cells in my sample.

What I have done so far:

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
%matplotlib inline

MYDATAFRAME.groupby(['Sample','cloneFraction']).size().groupby(level=0).apply(lambda x: 100 * x / x.sum()).unstack().plot(kind='bar',stacked=True, legend=None)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter())
plt.show()

And I get this plot (linked in the original post).

Now, I realize there is no order in the stacked plot, so the most frequent clones aren't on top; it just stacks in the order of the entries in my dataset (which I assume I can fix by sorting my dataframe by the column of interest).

Other than the axis getting messed up and not giving me a % when I use log scale (which is a secondary issue), I don't know how to group the data entries by frequency as I described above.

I have tried things such as:

temp = X.SOME_IDENTIFIER.value_counts()
temp2 = temp.head(10)
if len(temp) > 10:
    temp2['remaining {0} items'.format(len(temp) - 10)] = sum(temp[10:])
temp2.plot(kind='pie')

This was just to see if I could separate them correctly, but it does not achieve what I would like (apart from being a pie chart, which I changed in my code).

I have also tried using iloc[n:n] to select specific entries, but I can't get that working either: I get errors when I add it to the plotting code above, and if I use it without the rest of that code (% scale, etc.), the stacked barplot only shows the top 10 across all 4 samples in my data rather than the top 10 per sample. I also wouldn't know how to get the next 100, 1000, etc.
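
For the per-sample selection described above, one possible approach is to rank the rows within each sample and then slice by rank, rather than calling iloc on the whole dataframe. This is only a sketch on made-up toy data: the column names "Sample", "cloneID", and "cloneFraction" follow the question, while the helper name `rank_slice` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two samples with four clones each.
df = pd.DataFrame({
    "Sample": ["A"] * 4 + ["B"] * 4,
    "cloneID": range(8),
    "cloneFraction": [0.4, 0.3, 0.2, 0.1, 0.1, 0.2, 0.3, 0.4],
})

def rank_slice(frame, start, stop):
    """Return rows whose within-sample rank r satisfies start <= r < stop.

    Rank 0 is the clone with the largest cloneFraction in its sample.
    """
    ordered = frame.sort_values("cloneFraction", ascending=False)
    # cumcount numbers the rows 0, 1, 2, ... separately inside each sample,
    # following the sorted order established above.
    rank = ordered.groupby("Sample").cumcount()
    return ordered[(rank >= start) & (rank < stop)]

top2 = rank_slice(df, 0, 2)    # the two largest clones of every sample
next2 = rank_slice(df, 2, 4)   # the following two clones of every sample
print(top2)
print(next2)
```

On real data the same helper would be called as `rank_slice(df, 0, 10)` for the top 10 per sample and `rank_slice(df, 10, 110)` for the next 100.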

If you have any suggestions and can help in any way, that would be much appreciated!

Thanks

I achieved what I wanted with the following:

I created a new column holding the category each entry falls in, based on its value (i.e. whether it is among the top 10 most frequent, the next 100, etc.).

# Default label: every clone starts in the lowest-frequency group.
df['category'] = '10001+'

for sampleref in df.sample_ref.unique().tolist():
    print(f'Setting sample {sampleref}')
    # Label from the widest group to the narrowest; each later assignment
    # overwrites the label of the larger clones set by the previous one.
    df.loc[df[df.sample_ref == sampleref].nlargest(10000, 'cloneCount')['category'].index, 'category'] = '1001-10000'
    df.loc[df[df.sample_ref == sampleref].nlargest(1000, 'cloneCount')['category'].index, 'category'] = '101-1000'
    df.loc[df[df.sample_ref == sampleref].nlargest(100, 'cloneCount')['category'].index, 'category'] = '11-100'
    df.loc[df[df.sample_ref == sampleref].nlargest(10, 'cloneCount')['category'].index, 'category'] = 'top10'

This code starts from the largest group (10001+) and works down to the smallest. Because nlargest(10000) also contains the top 1000, top 100, and top 10 clones, each later assignment overwrites the label for the clones that belong in a higher-frequency group.
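
The same labelling can probably be done without the loop, by ranking clones within each sample and binning the ranks. The following is a sketch under assumptions, not the code used for the plot: the data are randomly generated, the grouping column is called "Sample" here, and ties are broken by row order (`method="first"`), which may not match `nlargest` exactly in every case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 2 samples with 150 clones each.
df = pd.DataFrame({
    "Sample": np.repeat(["S1", "S2"], 150),
    "cloneCount": rng.integers(1, 10_000, size=300),
})

# Rank 1 is the largest clone within its sample.
rank = df.groupby("Sample")["cloneCount"].rank(method="first", ascending=False)

# Right-closed bins: (0, 10], (10, 100], (100, 1000], (1000, 10000], (10000, inf).
bins = [0, 10, 100, 1000, 10000, np.inf]
labels = ["top10", "11-100", "101-1000", "1001-10000", "10001+"]
df["category"] = pd.cut(rank, bins=bins, labels=labels)

# Number of clones per sample and category.
print(df.groupby(["Sample", "category"], observed=True).size())
```

`pd.cut` returns an ordered categorical, so the groups keep the order given in `labels` instead of being sorted alphabetically.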

Following this, I plotted the samples with the following code:

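import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

fig, ax = plt.subplots(figsize=(15,7))

# Sum the clone fractions per sample and category, then stack one bar per sample.
df.groupby(['Sample','category'])['cloneFraction'].sum().unstack().plot(ax=ax, kind="bar", stacked=True)

plt.xticks(rotation=0)
# The data are fractions in 0-1, so xmax=1 maps 1.0 to 100%.
plt.gca().yaxis.set_major_formatter(mtick.PercentFormatter(1))

# Reverse the legend so it reads in the same order as the stack, top to bottom.
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], title='Clonotype',bbox_to_anchor=(1.04,0), loc="lower left", borderaxespad=0)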

And this is the result (linked in the original post).
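
On the secondary issue of the % axis: the argument to `PercentFormatter` is `xmax`, the data value that corresponds to 100%. The small sketch below only illustrates that scaling. `decimals=0` is set here so the labels are predictable, and it is not part of the plotting code above.

```python
import matplotlib.ticker as mtick

# Data stored as fractions (0-1): 1.0 should be shown as 100%.
fmt_fraction = mtick.PercentFormatter(xmax=1, decimals=0)

# Data already multiplied by 100: 100 should be shown as 100%.
fmt_percent = mtick.PercentFormatter(xmax=100, decimals=0)

print(fmt_fraction.format_pct(0.25, 1))   # 25%
print(fmt_percent.format_pct(25, 100))    # 25%
```

Since "cloneFraction" is already a fraction, `PercentFormatter(1)` on a linear axis gives the 0-100% labels without any log scaling.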

I hope this helps anyone struggling with the same issue!
