Receiving IndexError: string index out of range when using apply

Question

I want to pick the most used nouns from a data frame by:

Separating the nouns from each row of my data.
Storing them in a new column called train['token'].

For this I am passing my function to the apply function, but I am receiving this error:

IndexError: string index out of range

This is my code:

import pandas as pd
import numpy as np
import nltk

train= pd.read_csv(r'C:\Users\JKC\Downloads\classification_train.csv',names=['product_title','brand_id','category_id'])

train['product_title'] = train['product_title'].apply(lambda x: x.lower())

def preprocessing(x):
    tokens = nltk.pos_tag(x.split(" "))
    list=[]
    for y,x in tokens:
        if(x=="NN" or x=="NNS" or x=="NNP" or x=="NNPS"):
            list.append(y)
    return(' '.join(list))
# My function works fine if I use preprocessing(train['product_title'][1])    



train['token'] = train['product_title'].apply(preprocessing,1)

Traceback:

IndexError                                Traceback (most recent call last)
<ipython-input-53-f9f247eec617> in <module>()
     10 
     11 
---> 12 train['token'] = train['product_title'].apply(preprocessing,1)
     13 

C:\Users\JKC\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   2235             values = lib.map_infer(values, boxer)
   2236 
-> 2237         mapped = lib.map_infer(values, f, convert=convert_dtype)
   2238         if len(mapped) and isinstance(mapped[0], Series):
   2239             from pandas.core.frame import DataFrame

pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:63043)()

<ipython-input-53-f9f247eec617> in preprocessing(x)
      1 def preprocessing(x):
----> 2         tokens = nltk.pos_tag(x.split(" "))
      3         list=[]
      4         for y,x in tokens:
      5                 if(x=="NN" or x=="NNS" or x=="NNP" or x=="NNPS"):

C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\__init__.py in pos_tag(tokens, tagset)
    109     """
    110     tagger = PerceptronTagger()
--> 111     return _pos_tag(tokens, tagset, tagger)
    112 
    113 

C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\__init__.py in _pos_tag(tokens, tagset, tagger)
     80 
     81 def _pos_tag(tokens, tagset, tagger):
---> 82     tagged_tokens = tagger.tag(tokens)
     83     if tagset:
     84         tagged_tokens = [(token, map_tag('en-ptb', tagset, tag)) for (token, tag) in tagged_tokens]

C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in tag(self, tokens)
    150         output = []
    151 
--> 152         context = self.START + [self.normalize(w) for w in tokens] + self.END
    153         for i, word in enumerate(tokens):
    154             tag = self.tagdict.get(word)

C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in <listcomp>(.0)
    150         output = []
    151 
--> 152         context = self.START + [self.normalize(w) for w in tokens] + self.END
    153         for i, word in enumerate(tokens):
    154             tag = self.tagdict.get(word)

C:\Users\JKC\Anaconda3\lib\site-packages\nltk\tag\perceptron.py in normalize(self, word)
    224         elif word.isdigit() and len(word) == 4:
    225             return '!YEAR'
--> 226         elif word[0].isdigit():
    227             return '!DIGITS'
    228         else:

IndexError: string index out of range

Data:
                                           product_title brand_id category_id
    0  120gb hard disk drive with 3 years warranty fo...     3950           8
    1  toshiba satellite l305-s5919 laptop lcd screen...    35099         324
    2  hobby-ace pixhawk px4 rgb external led indicat...    21822         510
    3                                  pelicans mousepad    44629         260
    4    p4648-60029 hewlett-packard tc2100 system board    42835          68

There are no empty rows in my data:

train.isnull().sum()
Out[12]: 
product_title    0
brand_id         0
category_id      0
dtype: int64

Answer 1

Your input contains two or more consecutive spaces in some places. When you split it with x.split(" "), you get zero-length "words" between adjacent spaces. The tagger's normalize method then evaluates word[0] on one of those empty strings, which raises the IndexError shown in your traceback.

Fix it by splitting with x.split(), which treats any run of consecutive whitespace characters as a single token separator.
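
To make the difference concrete, here is a minimal sketch that does not need NLTK or the original CSV. The title string and the has_empty_token helper are hypothetical, made up for illustration; the real offending rows in the data may look different.

```python
# Hypothetical title with two consecutive spaces, standing in for a problem row.
title = "pelicans  mousepad"

single_space_tokens = title.split(" ")  # splits at every individual space
whitespace_tokens = title.split()       # splits on runs of whitespace

print(single_space_tokens)  # ['pelicans', '', 'mousepad']
print(whitespace_tokens)    # ['pelicans', 'mousepad']

# Taking the first character of the empty token fails the same way
# word[0] does inside the tagger in the traceback.
try:
    single_space_tokens[1][0]
except IndexError as exc:
    error_message = str(exc)
    print(error_message)  # string index out of range


def has_empty_token(text):
    """Hypothetical helper: True if splitting on ' ' yields an empty token."""
    return "" in text.split(" ")
```

If you want to find the offending rows first, something like train['product_title'].apply(has_empty_token) should flag them. Note that a leading or trailing space also produces an empty token with split(" "), and isnull() will not report any of these cases because the titles themselves are not null.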

Answer 1
8 · accepted · 2016-07-18 19:52:52