How to split unformatted data into several columns in Python?

Dear all,

I recently used a crawler to fetch information from a website and got a column of data like this:

|               **Hotel Info**           |  
| 2014 open    2016 retrofit    50 rooms |  
| 60 rooms                               |       
| 2012 open    100 rooms                 |
| 80 rooms                               |
| 2010 open                              |

I want it to end up like this:

| **Hotel Open** | **Hotel Retrofit** | **Hotel Rooms** |
|   2014         |   2016             |   50            |
|   null         |   null             |   60            |
|   2012         |   null             |   100           |
|   null         |   null             |   80            |
|   2010         |   null             |   null          |

NOTE:
The original website does not separate these three 'information blocks'. They all sit inside a single <p>...</p> block, so I cannot avoid this issue when scraping.

I am using Python and am completely new to it. Any help would be much appreciated. Thank you very much!

Suppose you have the data in a test.xlsx file. You can try this:

import collections

import numpy as np
import pandas as pd

# Assumes test.xlsx has a sheet named 'Sheet1' with a '**Hotel Info**' column.
# Recent pandas versions spell this argument sheet_name (older ones used sheetname).
df = pd.read_excel('test.xlsx', sheet_name='Sheet1')

df_dict = collections.defaultdict(list)
for i in df['**Hotel Info**']:
    # The information blocks within a cell are separated by four spaces.
    i_list = i.split('    ')
    # For each keyword, keep the text in front of it (an empty list if the keyword is absent).
    df_dict['**Hotel Open**'].append([e.split('open')[0].strip() for e in i_list if 'open' in e])
    df_dict['**Hotel Retrofit**'].append([e.split('retrofit')[0].strip() for e in i_list if 'retrofit' in e])
    df_dict['**Hotel Rooms**'].append([e.split('rooms')[0].strip() for e in i_list if 'rooms' in e])

# Reduce each list to one value: NaN when the block is missing, otherwise the integer.
df_dict['**Hotel Open**'] = [np.nan if len(item) == 0 else int(item[0]) for item in df_dict['**Hotel Open**']]
df_dict['**Hotel Retrofit**'] = [np.nan if len(item) == 0 else int(item[0]) for item in df_dict['**Hotel Retrofit**']]
df_dict['**Hotel Rooms**'] = [np.nan if len(item) == 0 else int(item[0]) for item in df_dict['**Hotel Rooms**']]

new_df = pd.DataFrame(df_dict)
new_df

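new_df will be as follows (the values are shown as floats because a column containing NaN is stored as float64):

    **Hotel Open**  **Hotel Retrofit**  **Hotel Rooms**
0   2014.0          2016.0              50.0
1   NaN             NaN                 60.0
2   2012.0          NaN                 100.0
3   NaN             NaN                 80.0
4   2010.0          NaN                 NaN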
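
If the spacing between the blocks is not always exactly four spaces, matching on the keywords may be more robust than splitting. Below is a minimal sketch of that alternative using only the standard library `re` module. The names `PATTERNS`, `parse_hotel_info` and `hotel_info` are made up for illustration, and the sample strings are copied from the question rather than read from test.xlsx.

```python
import re

# One pattern per output column: capture the number in front of the keyword.
PATTERNS = {
    "Hotel Open": re.compile(r"(\d+)\s*open"),
    "Hotel Retrofit": re.compile(r"(\d+)\s*retrofit"),
    "Hotel Rooms": re.compile(r"(\d+)\s*rooms"),
}


def parse_hotel_info(text):
    """Return a dict with one entry per column; None where the block is missing."""
    row = {}
    for column, pattern in PATTERNS.items():
        match = pattern.search(text)
        row[column] = int(match.group(1)) if match else None
    return row


# Sample data taken from the question (hypothetical stand-in for the scraped column).
hotel_info = [
    "2014 open    2016 retrofit    50 rooms",
    "60 rooms",
    "2012 open    100 rooms",
    "80 rooms",
    "2010 open",
]

rows = [parse_hotel_info(line) for line in hotel_info]
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to get the three columns, with missing values where a block was absent.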
