
Split columns into a csv with pandas

Just a quick question.

I have a CSV with a lot of columns. One of them, named Cuisine, holds many comma-separated values per row.

name,Cuisine
Real Talent Cafe,"Italian, American, Pizza, Mediterranean, European, Fusion"
Dogma,"International, Mediterranean, Barbecue, Spanish, Fusion"
Taberna El Callejon,"Mediterranean, European, Spanish"
Astor,"International, Mediterranean, European, Fusion"
La Gaditana Castellana,"Spanish, Seafood, International, Diner, Wine Bar"

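From this CSV I want to create a new CSV with 2 columns: name and Cuisine (obtained by splitting the Cuisine column of the first CSV, one value per row).

Here is the script I wrote. I select only the two columns I am interested in: name and Cuisine.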

# -*- coding: utf-8 -*-
from itertools import chain
import numpy as np
import pandas as pd

df = pd.read_csv('res_madrid.csv', usecols=['name','Cuisine'])
items_count = df["Cuisine"].str.count(",") +1

pd.DataFrame({"name": np.repeat(df["name"], items_count),
    "Cuisine": list(chain.from_iterable(df["Cuisine"].str.split(",")))})

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 471, in repeat
    return _wrapfunc(a, 'repeat', repeats, axis=axis)
  File "/usr/lib64/python3.6/site-packages/numpy/core/fromnumeric.py", line 56, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
  File "/usr/lib64/python3.6/site-packages/pandas/core/series.py", line 1157, in repeat
    new_index = self.index.repeat(repeats)
  File "/usr/lib64/python3.6/site-packages/pandas/core/indexes/base.py", line 862, in repeat
    return self._shallow_copy(self._values.repeat(repeats))
ValueError: count < 0

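Note that if you test this by copying the data I shared, it works fine. The problem appears when I load the full CSV file, which has more columns, and use the "usecols" parameter.

The expected result is as follows: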

                     name         Cuisine
0        Real Talent Cafe         Italian
0        Real Talent Cafe        American
0        Real Talent Cafe           Pizza
0        Real Talent Cafe   Mediterranean
0        Real Talent Cafe        European
0        Real Talent Cafe          Fusion
1                   Dogma   International
1                   Dogma   Mediterranean
1                   Dogma        Barbecue
1                   Dogma         Spanish
1                   Dogma          Fusion
2     Taberna El Callejon   Mediterranean
2     Taberna El Callejon        European
2     Taberna El Callejon         Spanish
3                   Astor   International
3                   Astor   Mediterranean
3                   Astor        European
3                   Astor          Fusion
4  La Gaditana Castellana         Spanish
4  La Gaditana Castellana         Seafood
4  La Gaditana Castellana   International
4  La Gaditana Castellana           Diner
4  La Gaditana Castellana        Wine Bar

Edit: the error occurs because I have empty values in the Cuisine column. How can I avoid that?

Thanks for your help :) Regards, Alexander

data = pd.read_csv('<path to txt file>')

Data

                     name                                            Cuisine
0        Real Talent Cafe  Italian, American, Pizza, Mediterranean, Europ...
1                   Dogma  International, Mediterranean, Barbecue, Spanis...
2     Taberna El Callejon                   Mediterranean, European, Spanish
3                   Astor     International, Mediterranean, European, Fusion
4  La Gaditana Castellana   Spanish, Seafood, International, Diner, Wine Bar

Use

# Assign the result back to data; otherwise the reshaped frame is discarded
data = data.set_index('name')['Cuisine'].apply(lambda x: x.split(',')).apply(pd.Series).stack().reset_index().drop('level_1', axis=1)
data.columns = ['name', 'Cuisine']

Yields

data.head()

               name         Cuisine
0  Real Talent Cafe         Italian
1  Real Talent Cafe        American
2  Real Talent Cafe           Pizza
3  Real Talent Cafe   Mediterranean
4  Real Talent Cafe        European
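
The edit in the question asks how to avoid the error caused by empty Cuisine values. One possible approach is to drop those rows before splitting. This is only a minimal sketch: it assumes pandas 0.25 or later (for DataFrame.explode), and the sample data, including the "No Cuisine Place" row, is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample with one missing Cuisine value (not the asker's file).
csv_text = '''name,Cuisine
Astor,"International, Mediterranean, European, Fusion"
No Cuisine Place,
Taberna El Callejon,"Mediterranean, European, Spanish"
'''
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'Cuisine'])

# Drop rows whose Cuisine is missing, so no NaN reaches the split.
clean = df.dropna(subset=['Cuisine'])

# Turn each Cuisine string into a list, then emit one row per list element.
result = (
    clean.assign(Cuisine=clean['Cuisine'].str.split(','))
         .explode('Cuisine')
         .reset_index(drop=True)
)

# Remove the space left behind after each comma.
result['Cuisine'] = result['Cuisine'].str.strip()
print(result)
```

If rows without a cuisine should be kept rather than dropped, filling the missing values with a placeholder string before the split would be an alternative.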

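How about this

# One Series per row: the name as the value, the split cuisines as the index
pd.concat([pd.Series(row['name'], row['Cuisine'].split(','))
                for index, row in df.iterrows()]).reset_index()

Then you just need to rename the columns.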
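
The renaming step could look roughly like the sketch below. It uses two rows from the question's sample data and assumes the separator is always a comma followed by one space; the variable names pieces and out are illustrative only.

```python
import pandas as pd

# Two rows taken from the question's sample data.
df = pd.DataFrame({
    'name': ['Taberna El Callejon', 'Astor'],
    'Cuisine': ['Mediterranean, European, Spanish',
                'International, Mediterranean, European, Fusion'],
})

# One Series per row: the name as the values, the split cuisines as the index.
pieces = [pd.Series(row['name'], index=row['Cuisine'].split(', '))
          for _, row in df.iterrows()]
out = pd.concat(pieces).reset_index()

# reset_index() leaves generic labels, so name both columns explicitly,
# then put them in the order the question asks for.
out.columns = ['Cuisine', 'name']
out = out[['name', 'Cuisine']]
print(out)
```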

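If you want a solution without apply or a list comprehension, you can do the following:

# The column is called 'name' (lowercase) in the CSV
pd.DataFrame(df['Cuisine'].str.split(',').values.tolist(), index=df['name'])\
.stack().reset_index().drop('level_1', axis=1)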
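
The script from the question can probably be kept as well, provided the rows with an empty Cuisine are removed first. The sketch below is an assumption-laden variant of that script, not the asker's final code: the sample data is hypothetical, and the counts are taken from the split lists rather than from str.count.

```python
import io
from itertools import chain

import numpy as np
import pandas as pd

# Hypothetical input: the second row has no Cuisine value.
csv_text = '''name,Cuisine
Dogma,"International, Mediterranean, Barbecue, Spanish, Fusion"
No Cuisine Place,
Astor,"International, Mediterranean, European, Fusion"
'''
df = pd.read_csv(io.StringIO(csv_text), usecols=['name', 'Cuisine'])

# A missing Cuisine would give a NaN item count, which np.repeat cannot
# use as a repeat count, so drop those rows first.
df = df.dropna(subset=['Cuisine'])

parts = df['Cuisine'].str.split(',')
items_count = parts.str.len()  # number of cuisines per row

result = pd.DataFrame({
    'name': np.repeat(df['name'].to_numpy(), items_count.to_numpy()),
    'Cuisine': [item.strip() for item in chain.from_iterable(parts)],
})

# With no path given, to_csv returns the CSV text; pass a file name to save it.
csv_out = result.to_csv(index=False)
print(csv_out)
```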

