How do I split unformatted data into multiple columns in Python?

Question

Dear all,

I recently used a crawler to fetch information from a website and got a column of data like this:

|               **Hotel Info**           |  
| 2014 open    2016 retrofit    50 rooms |  
| 60 rooms                               |       
| 2012 open    100 rooms                 |
| 80 rooms                               |
| 2010 open                              |

I want it to end up like this:

| **Hotel Open** | **Hotel Retrofit** | **Hotel Rooms** |
|   2014         |   2016             |   50            |
|   null         |   null             |   60            |
|   2012         |   null             |   100           |
|   null         |   null             |   80            |
|   2010         |   null             |   null          |

NOTE:
The original website doesn't separate these 3 'information blocks'. They are all inside a single <p>...</p> block, so I cannot avoid this issue.

I am using Python and am totally new to it. Please help me, and THANK YOU VERY MUCH!!!
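
To make the task concrete: each scraped string holds up to three "number + keyword" pairs, and any of them may be missing. The following is a minimal sketch using only the standard library, assuming the keywords are exactly `open`, `retrofit` and `rooms` as in the sample above. The function name `parse_hotel_info` and the dictionary keys are made up for illustration.

```python
import re

# A number followed by whitespace and one of the three keywords from the sample data.
# In Python 3, \s also matches non-breaking spaces, which scraped HTML often contains.
PAIR = re.compile(r"(\d+)\s+(open|retrofit|rooms)")


def parse_hotel_info(text):
    """Return a dict with keys open/retrofit/rooms; fields not found stay None."""
    record = {"open": None, "retrofit": None, "rooms": None}
    for number, keyword in PAIR.findall(text):
        record[keyword] = int(number)
    return record


rows = [
    "2014 open    2016 retrofit    50 rooms",
    "60 rooms",
    "2012 open    100 rooms",
    "80 rooms",
    "2010 open",
]

parsed = [parse_hotel_info(row) for row in rows]
for record in parsed:
    print(record)
```

Matching on the keyword rather than on position means the order and number of blocks in the paragraph do not matter.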

Answer 1

Suppose you have the data in a test.xlsx file; you can try this:

import collections
import numpy as np
import pandas as pd

# sheet_name is the current keyword; older pandas releases spelled it sheetname
df = pd.read_excel('test.xlsx', sheet_name='Sheet1')

# keyword that follows each number -> name of the output column
keywords = {'open': '**Hotel Open**',
            'retrofit': '**Hotel Retrofit**',
            'rooms': '**Hotel Rooms**'}

df_dict = collections.defaultdict(list)
for i in df['**Hotel Info**']:
    # fields are separated by runs of four spaces (&nbsp;&nbsp; on the source page)
    i_list = i.split('    ')
    for key, col in keywords.items():
        # keep the text before the keyword (the number) for fields that contain it
        matches = [e.split(key)[0].strip() for e in i_list if key in e]
        # first match as an int, or NaN when the field is missing
        df_dict[col].append(int(matches[0]) if matches else np.nan)

new_df = pd.DataFrame(df_dict)
new_df  # in a notebook this displays the result; use print(new_df) in a script

new_df will be:

    **Hotel Open**  **Hotel Retrofit**  **Hotel Rooms**
0   2014.0          2016.0              50.0
1   NaN             NaN                 60.0
2   2012.0          NaN                 100.0
3   NaN             NaN                 80.0
4   2010.0          NaN                 NaN
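
As an alternative to looping over the rows, the same split can be sketched with pandas' vectorized string methods. This is only a sketch under assumptions: the column names are written without the asterisks, and the sample data is built inline instead of being read from test.xlsx. Because NaN is a float, a column that mixes integers and NaN is stored as floating point; the nullable `Int64` dtype is used here to keep whole numbers.

```python
import pandas as pd

# Sample data built inline for illustration; in practice this column would come
# from pd.read_excel or directly from the crawler.
df = pd.DataFrame({
    "Hotel Info": [
        "2014 open    2016 retrofit    50 rooms",
        "60 rooms",
        "2012 open    100 rooms",
        "80 rooms",
        "2010 open",
    ]
})

info = df["Hotel Info"]

# str.extract returns the captured digits, or NaN where the pattern does not match.
new_df = pd.DataFrame({
    "Hotel Open": info.str.extract(r"(\d+)\s+open", expand=False),
    "Hotel Retrofit": info.str.extract(r"(\d+)\s+retrofit", expand=False),
    "Hotel Rooms": info.str.extract(r"(\d+)\s+rooms", expand=False),
})

# Convert the digit strings to numbers, then to the nullable integer dtype
# so that missing values do not force the columns to float.
new_df = new_df.apply(pd.to_numeric).astype("Int64")

print(new_df)
```

Each regular expression looks for the number immediately before its keyword, so the result does not depend on how many spaces separate the blocks.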

Answer 1: accepted, score 0, 2017-06-25 04:39:07