
Reading one-column file with pandas

I'm trying to read the following file into a pandas DataFrame:

(dataA
0   
400 
2800
9200
5600
2000
8400
4800
1200
7600
4000
400
6800
)
(dataB
30
30
30
30
30
30
20
500
30
50
330
530
930
)

The goal is to end up with something like this:

dataA dataB
0     30
400   30
2800  30
9200  30
5600  30
2000  30
8400  20
4800  500
1200  30
7600  50
4000  330
400   530
6800  930

I know this can be done by reading the file line by line, but I was wondering whether there is an easy way to have pandas read it directly (with read_csv, for example). There are many files similar to this one, and the post-processing is already automated for that type of data.

Because the parentheses mark where each column starts and ends, we can build two new index levels from them and unstack the blocks into columns.

It is important to read the file with header=None, so that the first line, (dataA, is kept as data:

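import numpy as np
import pandas as pd

# the file is plain text with one value per line; header=None keeps "(dataA" as data
df = pd.read_csv(..., header=None)

# True on every line that opens a block, e.g. "(dataA"
s = df[0].str.contains(r'\(', regex=True)

# index by (block number, position within block), then move the blocks into columns
df1 = df.set_index([s.cumsum(), df.groupby(s.cumsum()).cumcount()]).unstack(0)
# additional clean-up: strip the parentheses, then drop the now-empty ")" row
df1 = df1.replace(r'\(|\)', '', regex=True).replace('', np.nan).dropna().droplevel(0, axis=1)

# use the first row (the block names) as the column labels
df1.columns = df1.iloc[0]
df1 = df1.iloc[1:]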



print(df1)
0  dataA dataB
1   0       30
2   400     30
3   2800    30
4   9200    30
5   5600    30
6   2000    30
7   8400    20
8   4800   500
9   1200    30
10  7600    50
11  4000   330
12   400   530
13  6800   930
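
Since there are many files of the same shape, the steps above could be wrapped in a small reusable function. The following is only a sketch under stated assumptions: every block opens with a "(name" line and closes with a ")" line, all blocks have the same number of values, and the values are numeric. The name read_blocks and the inline sample text are hypothetical, chosen for illustration; they are not part of pandas or of the original answer.

```python
import io

import pandas as pd


def read_blocks(source):
    """Read a one-column file of "(name ... )" blocks into a wide DataFrame.

    Hypothetical helper; `source` is a path or a file-like object.
    """
    # One string per line; header=None keeps "(dataA" as data, not as a header.
    raw = pd.read_csv(source, header=None, names=["raw"], dtype=str)["raw"].str.strip()
    raw = raw[raw != ")"]                    # drop the closing ")" lines
    is_header = raw.str.startswith("(")      # True on "(dataA", "(dataB", ...

    # Every value row inherits the name of the most recent "(name" line.
    names = raw.where(is_header).ffill().str.lstrip("(")
    order = names[is_header].tolist()        # column order as it appears in the file

    long = pd.DataFrame({
        "name": names[~is_header],
        "value": pd.to_numeric(raw[~is_header]),
    })
    long["row"] = long.groupby("name").cumcount()   # position inside each block

    wide = long.pivot(index="row", columns="name", values="value")
    return wide[order].rename_axis(index=None, columns=None)


# Small inline sample standing in for a real file (assumed data, shortened).
sample = io.StringIO("(dataA\n0   \n400 \n2800\n)\n(dataB\n30\n30\n20\n)\n")
df = read_blocks(sample)
print(df)
```

Passing a file path instead of the StringIO object should work the same way, because read_csv accepts both. If the blocks had different lengths, the shorter columns would be padded with NaN and become floats.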

You can create the DataFrame from a dictionary of lists:

  1. Import the pandas library:

     import pandas as pd
  2. Create a dictionary from your lists:

     data = {'dataA': [0, 400, 2800, 9200, 5600, 2000, 8400, 4800, 1200, 7600, 4000, 400, 6800], 'dataB': [30, 30, 30, 30, 30, 30, 20, 500, 30, 50, 330, 530, 930]}
  3. Create your DataFrame:

     df = pd.DataFrame(data)
  4. Display the DataFrame:

     df

Overall, the complete code is:

import pandas as pd
data = {'dataA': [0, 400, 2800, 9200, 5600, 2000, 8400, 4800, 1200, 7600, 4000, 400, 6800],
        'dataB': [30, 30, 30, 30, 30, 30, 20, 500, 30, 50, 330, 530, 930]}
df = pd.DataFrame(data)
df

and the output will be:

    dataA   dataB
0   0       30
1   400     30
2   2800    30
3   9200    30
4   5600    30
5   2000    30
6   8400    20
7   4800    500
8   1200    30
9   7600    50
10  4000    330
11  400     530
12  6800    930
 

If you do not want the row index shown when printing the DataFrame, add this line at the end:

print(df.to_string(index=False))

The output will be:

 dataA  dataB
     0     30
   400     30
  2800     30
  9200     30
  5600     30
  2000     30
  8400     20
  4800    500
  1200     30
  7600     50
  4000    330
   400    530
  6800    930
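
Typing the lists by hand does not scale to many files. As a sketch, the same dictionary of lists could be built by reading the lines of the file, assuming each block opens with a "(name" line, closes with a ")" line, and holds integer values. The function name parse_blocks and the shortened sample lines are hypothetical, introduced only for illustration.

```python
import pandas as pd


def parse_blocks(lines):
    """Collect "(name ... )" blocks into a dict of lists (hypothetical helper)."""
    data = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith("("):
            current = line[1:].strip()    # "(dataA" starts a new column
            data[current] = []
        elif line == ")":
            current = None                # ")" closes the current column
        elif current is not None:
            data[current].append(int(line))   # assumes integer values
    return data


# Shortened sample standing in for the lines of a real file (assumed data).
lines = ["(dataA", "0   ", "400 ", "2800", ")", "(dataB", "30", "30", "20", ")"]
data = parse_blocks(lines)
df = pd.DataFrame(data)
print(df.to_string(index=False))
```

An open file handle can be passed in place of the list, since iterating over a file yields its lines one at a time.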
