
Pandas: How to easily share a sample dataframe using df.to_dict()?

This question was earlier marked as a duplicate of How to make good reproducible pandas examples. That contribution should undoubtedly be the go-to post for anyone seeking to make such a reproducible data sample, while this post is meant to clarify a very practical and efficient way to include a given data sample in a question using df.to_dict() in combination with df=pd.DataFrame(<dict>). This was not explicitly covered in either the question or the answers in How to make good reproducible pandas examples. Using df.to_dict() also works very well in tandem with df.to_clipboard(), concisely covered in the post How to provide a reproducible copy of your DataFrame with to_clipboard().


Despite the clear and concise guidance in How do I ask a good question? and How to create a Minimal, Reproducible Example, many people simply neglect to include a reproducible data sample in their question. So what is a practical and easy way to reproduce a data sample when a simple pd.DataFrame(np.random.random(size=(5, 5))) is not enough? How can you, for example, use df.to_dict() and include the output in a question?

The answer:

In many situations, an approach based on df.to_dict() will do the job perfectly. Here are two cases that come to mind:

Case 1: You've got a dataframe built or loaded in Python from a local source

Case 2: You've got a table in another application (like Excel)


The details:

Case 1: You've got a dataframe built or loaded from a local source

Given that you've got a pandas dataframe named df, just

  1. run df.to_dict() in your console or editor, and
  2. copy the output, which is formatted as a dictionary, and
  3. paste the content into pd.DataFrame(<output>) and include that chunk in your now reproducible code snippet.
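As a minimal sketch of the three steps above, the snippet below uses a small made-up dataframe (the `city` and `temp` columns are hypothetical, not from the original post). It prints the dictionary you would copy, then rebuilds the dataframe from the pasted literal.

```python
import pandas as pd

# Hypothetical stand-in for a dataframe you built or loaded locally.
df = pd.DataFrame({'city': ['Oslo', 'Bergen', 'Tromso'],
                   'temp': [4.5, 7.1, -1.0]})

# Steps 1-2: run df.to_dict() and copy the printed dictionary.
as_dict = df.to_dict()
print(as_dict)

# Step 3: paste the copied dictionary into pd.DataFrame(...).
df_shared = pd.DataFrame({'city': {0: 'Oslo', 1: 'Bergen', 2: 'Tromso'},
                          'temp': {0: 4.5, 1: 7.1, 2: -1.0}})
```

Anyone who runs the last statement gets the same columns, index, and values as the original df, without access to your local source.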

Case 2: You've got a table in another application (like Excel)

Depending on the source and its separator, such as ',', ';' or '\\s+' (the last of which matches any run of whitespace), you can simply:

  1. copy the contents with Ctrl+C, and
  2. run df=pd.read_clipboard(sep='\\s+') in your console or editor, and
  3. run df.to_dict(), and
  4. include the output in df=pd.DataFrame(<output>)
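The clipboard steps above can be tried out without an actual clipboard. The sketch below assumes that pd.read_csv on an in-memory string behaves like pd.read_clipboard for this purpose, since read_clipboard hands the clipboard text to read_csv. The table contents are made up for illustration, and the cell values are assumed to contain no spaces.

```python
import io
import pandas as pd

# Hypothetical text, as it might look after copying a small table.
copied = """name score passed
anna 9.5 True
bo 7.0 False
"""

# Assumption: a string buffer stands in for the clipboard here.
# With a real clipboard you would run pd.read_clipboard(sep='\\s+') instead.
df = pd.read_csv(io.StringIO(copied), sep='\\s+')

# Steps 3-4: convert to a dictionary and rebuild the dataframe from it.
as_dict = df.to_dict()
df_shared = pd.DataFrame(as_dict)
```

A whitespace separator splits any cell that contains a space, so for such data a separator like ',' or ';' is probably the safer choice.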

In this case, the start of your question would look something like this:

import pandas as pd
df = pd.DataFrame({0: {0: 0.25474768796402636, 1: 0.5792136563952824, 2: 0.5950396800676201},
                   1: {0: 0.9071073567355232, 1: 0.1657288354283053, 2: 0.4962367707789421},
                   2: {0: 0.7440601352930207, 1: 0.7755487356392468, 2: 0.5230707257648775}})

Of course, this gets a little clumsy with larger dataframes. But very often, all that anyone trying to answer your question needs is a small sample of your real-world data, so that they can take the structure of your data into consideration.

And there are two ways you can handle larger dataframes:

  1. run df.head(20).to_dict() to only include the first 20 rows, and
  2. change the format of your dict using, for example, df.to_dict('split') (there are other options besides 'split') to reshape your output to a dict that requires fewer lines.
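To illustrate the "other options" mentioned in item 2, here is a small sketch comparing three values of the orient argument on a made-up dataframe (the `x` and `label` columns are hypothetical). Note that 'list' and 'records' drop the index, so they are only a good fit when the default 0, 1, 2, ... index is all you need.

```python
import pandas as pd

# Hypothetical sample frame with a default integer index.
df = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'label': ['a', 'b', 'c']})

# One list per column: compact, and usable directly in pd.DataFrame.
as_list = df.to_dict('list')

# One dict per row: also usable directly in pd.DataFrame.
as_records = df.to_dict('records')

# Separate 'index', 'columns' and 'data' entries.
as_split = df.to_dict('split')

from_list = pd.DataFrame(as_list)
from_records = pd.DataFrame(as_records)
```

For a frame with a meaningful index (dates, IDs), 'split' or the default orientation keeps that information, while 'list' and 'records' do not.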

Here's an example using the iris dataset, which is available from plotly express, among other places.

If you just run:

import plotly.express as px
import pandas as pd
df = px.data.iris()
df.to_dict()

This will produce an output of nearly 1000 lines, and won't be very practical as a reproducible sample. But if you include .head(10), as in df.head(10).to_dict(), you'll get:

{'sepal_length': {0: 5.1, 1: 4.9, 2: 4.7, 3: 4.6, 4: 5.0, 5: 5.4, 6: 4.6, 7: 5.0, 8: 4.4, 9: 4.9},
 'sepal_width': {0: 3.5, 1: 3.0, 2: 3.2, 3: 3.1, 4: 3.6, 5: 3.9, 6: 3.4, 7: 3.4, 8: 2.9, 9: 3.1},
 'petal_length': {0: 1.4, 1: 1.4, 2: 1.3, 3: 1.5, 4: 1.4, 5: 1.7, 6: 1.4, 7: 1.5, 8: 1.4, 9: 1.5},
 'petal_width': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.4, 6: 0.3, 7: 0.2, 8: 0.2, 9: 0.1},
 'species': {0: 'setosa', 1: 'setosa', 2: 'setosa', 3: 'setosa', 4: 'setosa', 5: 'setosa', 6: 'setosa', 7: 'setosa', 8: 'setosa', 9: 'setosa'},
 'species_id': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}}

And now we're getting somewhere. Depending on the structure and content of the data, though, such a short sample may not cover the complexity of the contents in a satisfactory manner. You can include more data on fewer lines by using to_dict('split') like this:

import plotly.express as px
df = px.data.iris().head(10)
df.to_dict('split')

Now your output will look like:

{'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 'columns': ['sepal_length',
  'sepal_width',
  'petal_length',
  'petal_width',
  'species',
  'species_id'],
 'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
  [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
  [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
  [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
  [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
  [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
  [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
  [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
  [4.9, 3.1, 1.5, 0.1, 'setosa', 1]]}

And now you can easily increase the number in .head(10) without cluttering your question too much. There is one minor drawback: you can no longer pass the output directly to pd.DataFrame(<output>). But if you pass the 'index', 'columns', and 'data' entries as separate arguments, you'll be just fine. So for this particular dataset, my preferred approach would be:

import pandas as pd

sample = {'index': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
             'columns': ['sepal_length',
              'sepal_width',
              'petal_length',
              'petal_width',
              'species',
              'species_id'],
             'data': [[5.1, 3.5, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.0, 1.4, 0.2, 'setosa', 1],
              [4.7, 3.2, 1.3, 0.2, 'setosa', 1],
              [4.6, 3.1, 1.5, 0.2, 'setosa', 1],
              [5.0, 3.6, 1.4, 0.2, 'setosa', 1],
              [5.4, 3.9, 1.7, 0.4, 'setosa', 1],
              [4.6, 3.4, 1.4, 0.3, 'setosa', 1],
              [5.0, 3.4, 1.5, 0.2, 'setosa', 1],
              [4.4, 2.9, 1.4, 0.2, 'setosa', 1],
              [4.9, 3.1, 1.5, 0.1, 'setosa', 1],
              [5.4, 3.7, 1.5, 0.2, 'setosa', 1],
              [4.8, 3.4, 1.6, 0.2, 'setosa', 1],
              [4.8, 3.0, 1.4, 0.1, 'setosa', 1],
              [4.3, 3.0, 1.1, 0.1, 'setosa', 1],
              [5.8, 4.0, 1.2, 0.2, 'setosa', 1]]}

df = pd.DataFrame(index=sample['index'], columns=sample['columns'], data=sample['data'])
df

Now you'll have this dataframe to work with:

    sepal_length  sepal_width  petal_length  petal_width species  species_id
0            5.1          3.5           1.4          0.2  setosa           1
1            4.9          3.0           1.4          0.2  setosa           1
2            4.7          3.2           1.3          0.2  setosa           1
3            4.6          3.1           1.5          0.2  setosa           1
4            5.0          3.6           1.4          0.2  setosa           1
5            5.4          3.9           1.7          0.4  setosa           1
6            4.6          3.4           1.4          0.3  setosa           1
7            5.0          3.4           1.5          0.2  setosa           1
8            4.4          2.9           1.4          0.2  setosa           1
9            4.9          3.1           1.5          0.1  setosa           1
10           5.4          3.7           1.5          0.2  setosa           1
11           4.8          3.4           1.6          0.2  setosa           1
12           4.8          3.0           1.4          0.1  setosa           1
13           4.3          3.0           1.1          0.1  setosa           1
14           5.8          4.0           1.2          0.2  setosa           1

This will significantly increase your chances of receiving useful answers!
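As a possible shortcut for the 'split' format: the keys of the dictionary ('index', 'columns', 'data') happen to match the keyword arguments of pd.DataFrame, so the dictionary can be unpacked directly. The sketch below uses a shortened sample (three rows and three columns of the data above) to keep it brief.

```python
import pandas as pd

# Shortened version of the 'split' sample above.
sample = {'index': [0, 1, 2],
          'columns': ['sepal_length', 'sepal_width', 'species'],
          'data': [[5.1, 3.5, 'setosa'],
                   [4.9, 3.0, 'setosa'],
                   [4.7, 3.2, 'setosa']]}

# Explicit form, as used in this answer.
df_explicit = pd.DataFrame(index=sample['index'],
                           columns=sample['columns'],
                           data=sample['data'])

# Equivalent shortcut: unpack the dict into keyword arguments.
df_unpacked = pd.DataFrame(**sample)
```

The explicit form is arguably easier for readers to follow, while the unpacked form saves a line. Both should give the same dataframe.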

Edit:

The output of df.to_dict() cannot be read back by pd.DataFrame() if it contains timestamps like 1: Timestamp('2020-01-02 00:00:00'), unless your snippet also includes from pandas import Timestamp
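A small sketch of the timestamp issue, using a made-up dataframe with a `date` column: eval() on the printed text is used here only to simulate pasting the output into a script. The second half shows one possible workaround, which is an assumption rather than part of the original answer: store the dates as plain strings and convert them back with pd.to_datetime.

```python
import pandas as pd
from pandas import Timestamp  # without this, the pasted dict raises NameError

# Hypothetical frame with a datetime column.
df = pd.DataFrame({'date': pd.to_datetime(['2020-01-01', '2020-01-02']),
                   'value': [1, 2]})

# The text you would copy; it contains Timestamp('...') calls.
printed = repr(df.to_dict())
print(printed)

# eval() simulates pasting that text into pd.DataFrame(...).
restored = pd.DataFrame(eval(printed))

# Possible workaround: share dates as strings, then parse them again.
as_text = df.assign(date=df['date'].dt.strftime('%Y-%m-%d')).to_dict()
restored_from_text = pd.DataFrame(as_text)
restored_from_text['date'] = pd.to_datetime(restored_from_text['date'])
```

With the string variant, the shared dictionary contains only plain Python literals, so readers need no import beyond pandas itself.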
