Split columns into a csv with pandas
Just a quick question.
I have a CSV with a lot of columns. One of them, named Cuisine, holds many comma-separated values per row.
name,Cuisine
Real Talent Cafe,"Italian, American, Pizza, Mediterranean, European, Fusion"
Dogma,"International, Mediterranean, Barbecue, Spanish, Fusion"
Taberna El Callejon,"Mediterranean, European, Spanish"
Astor,"International, Mediterranean, European, Fusion"
La Gaditana Castellana,"Spanish, Seafood, International, Diner, Wine Bar"
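From this CSV I want to create a new CSV with 2 columns: name, and Cuisine (obtained by splitting the Cuisine column of the first CSV, one value per row).
This is the script I wrote. I select only the two columns I am interested in, name and Cuisine: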
# -*- coding: utf-8 -*-
from itertools import chain
import numpy as np
import pandas as pd
df = pd.read_csv('res_madrid.csv', usecols=['name','Cuisine'])
items_count = df["Cuisine"].str.count(",") +1
pd.DataFrame({"name": np.repeat(df["name"], items_count),
"Cuisine": list(chain.from_iterable(df["Cuisine"].str.split(",")))})
I get the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 471, in repeat
return _wrapfunc(a, 'repeat', repeats, axis=axis)
File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
File "/usr/lib64/python3.6/site-packages/pandas/core/series.py", line 1157, in repeat
new_index = self.index.repeat(repeats)
File "/usr/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 862, in repeat
return self._shallow_copy(self._values.repeat(repeats))
ValueError: count < 0
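Note that if you test this by copying the data I shared, it works fine. The problem appears when I load the full CSV file, which has more columns, and use the "usecols" parameter.
The expected result is: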
name Cuisine
0 Real Talent Cafe Italian
0 Real Talent Cafe American
0 Real Talent Cafe Pizza
0 Real Talent Cafe Mediterranean
0 Real Talent Cafe European
0 Real Talent Cafe Fusion
1 Dogma International
1 Dogma Mediterranean
1 Dogma Barbecue
1 Dogma Spanish
1 Dogma Fusion
2 Taberna El Callejon Mediterranean
2 Taberna El Callejon European
2 Taberna El Callejon Spanish
3 Astor International
3 Astor Mediterranean
3 Astor European
3 Astor Fusion
4 La Gaditana Castellana Spanish
4 La Gaditana Castellana Seafood
4 La Gaditana Castellana International
4 La Gaditana Castellana Diner
4 La Gaditana Castellana Wine Bar
Edit: the error occurs because I have null values in the Cuisine column. How can I avoid it?
Thanks for your help :) Regards, Alexander
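For the null values mentioned in the edit, one possible approach is to drop those rows before counting and splitting. Below is a minimal sketch, not a definitive fix: the inline CSV text and the restaurant "No Cuisine Place" are made up for illustration, and it assumes that rows without a cuisine can simply be discarded. The error presumably arises because the item count for a null row is NaN rather than an integer.

```python
import io
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical sample: the last row has an empty Cuisine field,
# which read_csv loads as NaN.
csv_text = '''name,Cuisine
Dogma,"International, Mediterranean"
Astor,"European, Fusion"
No Cuisine Place,
'''

df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'Cuisine'])

# Assumption: rows with a null Cuisine are not needed in the output.
df = df.dropna(subset=['Cuisine'])

# Split once, then derive the repeat counts from the list lengths.
parts = df['Cuisine'].str.split(',')
items_count = parts.str.len()

result = pd.DataFrame({
    'name': np.repeat(df['name'].to_numpy(), items_count.to_numpy()),
    # strip() removes the blank left after each comma
    'Cuisine': [c.strip() for c in chain.from_iterable(parts)],
})
print(result)
```

Unlike the expected output in the question, this result gets a fresh 0..n-1 index. If the null rows should be kept instead, replacing `dropna` with `df['Cuisine'].fillna('')` is another option, at the cost of one empty-string cuisine per such row.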
data = pd.read_csv('path/to/file.txt')  # path to the txt file
data
name Cuisine
0 Real Talent Cafe Italian, American, Pizza, Mediterranean, Europ...
1 Dogma International, Mediterranean, Barbecue, Spanis...
2 Taberna El Callejon Mediterranean, European, Spanish
3 Astor International, Mediterranean, European, Fusion
4 La Gaditana Castellana Spanish, Seafood, International, Diner, Wine Bar
Use
data = data.set_index('name')['Cuisine'].apply(lambda x: x.split(',')).apply(pd.Series).stack().reset_index().drop('level_1', axis=1)
data.columns = ['name', 'Cuisine']
which yields
data.head()
name Cuisine
0 Real Talent Cafe Italian
1 Real Talent Cafe American
2 Real Talent Cafe Pizza
3 Real Talent Cafe Mediterranean
4 Real Talent Cafe European
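As an alternative to the apply/stack chain, newer pandas versions offer `DataFrame.explode`, which does the list-to-rows step directly. This is a different technique from the one in this answer, shown here only as a sketch; it assumes pandas 0.25 or later, and the two-row frame is sample data taken from the question.

```python
import pandas as pd

# Two rows from the question's sample, used only for illustration.
data = pd.DataFrame({
    'name': ['Taberna El Callejon', 'Astor'],
    'Cuisine': ['Mediterranean, European, Spanish', 'International, Fusion'],
})

result = (
    data.assign(Cuisine=data['Cuisine'].str.split(','))  # string -> list
        .explode('Cuisine')                              # one row per list item
)
# Remove the blank that follows each comma in the source data.
result['Cuisine'] = result['Cuisine'].str.strip()
print(result)
```

`explode` keeps the original row index on every generated row, so the index repeats (0, 0, 0, 1, 1), which matches the layout of the expected result in the question.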
How about this:
pd.concat([pd.Series(row['name'], row['Cuisine'].split(','))
for index, row in df.iterrows()]).reset_index()
Then you just need to rename the columns.
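The rename step could look like the sketch below. It relies on the fact that `reset_index()` on an unnamed Series produces a column called `index` (here, the cuisines) and a column called `0` (here, the names). The two-row frame is sample data from the question, and the `strip()` call is an addition to remove the blank after each comma.

```python
import pandas as pd

# Two rows from the question's sample, used only for illustration.
df = pd.DataFrame({
    'name': ['Dogma', 'Astor'],
    'Cuisine': ['Barbecue, Spanish', 'European, Fusion'],
})

# One small Series per restaurant: the cuisines are the index,
# the restaurant name is the (repeated) value.
stacked = pd.concat(
    [pd.Series(row['name'], index=[c.strip() for c in row['Cuisine'].split(',')])
     for _, row in df.iterrows()]
).reset_index()

# Give the two generated columns meaningful names and reorder them.
result = stacked.rename(columns={'index': 'Cuisine', 0: 'name'})[['name', 'Cuisine']]
print(result)
```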
If you want a solution without apply and without a list comprehension, you can do the following:
pd.DataFrame(df['Cuisine'].str.split(',').values.tolist(), index=df['name'])\
.stack().reset_index().drop('level_1', axis=1)
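Since the question asks for a new CSV, here is a sketch that runs this approach end to end and writes the result with `to_csv`. The sample frame, the `long_df` name and the in-memory buffer are illustrative assumptions; the explicit `dropna()` and `str.strip()` calls are additions that remove the padding created for shorter lists and the blank after each comma.

```python
import io

import pandas as pd

# Two rows from the question's sample, used only for illustration.
df = pd.DataFrame({
    'name': ['Dogma', 'Astor'],
    'Cuisine': ['Barbecue, Spanish, Fusion', 'European, Fusion'],
})

long_df = (
    pd.DataFrame(df['Cuisine'].str.split(',').values.tolist(), index=df['name'])
      .stack()
      .dropna()        # explicit: discard padding from rows with fewer items
      .str.strip()     # remove the blank that follows each comma
      .reset_index()
      .drop('level_1', axis=1)
)
long_df.columns = ['name', 'Cuisine']

# Written to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
long_df.to_csv(buffer, index=False)
csv_out = buffer.getvalue()
print(csv_out)
```

To write an actual file, pass a path instead of the buffer, for example `long_df.to_csv('cuisines.csv', index=False)`, where `cuisines.csv` is a hypothetical output name.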