Pandas Series.apply does not work with strings

Question

This may be related to a Japanese-language issue, so I also asked on the Japanese StackOverflow.

When I pass a plain string object, it works fine.

I tried changing the encoding, but I couldn't find the cause of this error. Could you please give me some advice?

MeCab is an open-source text segmentation library for text written in Japanese. It was originally developed by the Nara Institute of Science and Technology and is currently maintained by Taku Kudou (工藤拓) as part of his work on the Google Japanese Input project. https://en.wikipedia.org/wiki/MeCab

sample.csv

0,今日も夜まで働きました。
1,オフィスには誰もいませんが、エラーと格闘中
2,デバッグばかりしていますが、どうにもなりません。

This is the Pandas (Python 3) code:

import pandas as pd
import MeCab  
# https://en.wikipedia.org/wiki/MeCab
from tqdm import tqdm_notebook as tqdm
# This is working...
df = pd.read_csv('sample.csv', encoding='utf-8')

m = MeCab.Tagger ("-Ochasen")

text = "りんごを食べました、そして、みかんも食べました"
a = m.parse(text)

print(a)# working! 

# But I want to use Pandas's Series



def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords



aa = extractKeyword(text) #working!!

me = df.apply(lambda x: extractKeyword(x))

#TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')

This is the output and the traceback:

りんご リンゴ りんご 名詞-一般       
を   ヲ   を   助詞-格助詞-一般       
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
、   、   、   記号-読点       
そして ソシテ そして 接続詞     
、   、   、   記号-読点       
みかん ミカン みかん 名詞-一般       
も   モ   も   助詞-係助詞      
食べ  タベ  食べる 動詞-自立   一段  連用形
まし  マシ  ます  助動詞 特殊・マス   連用形
た   タ   た   助動詞 特殊・タ    基本形
EOS

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-174-81a0d5d62dc4> in <module>()
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4260                         f, axis,
4261                         reduce=reduce,
-> 4262                         ignore_failures=ignore_failures)
4263             else:
4264                 return self._apply_broadcast(f, axis)

~/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4356             try:
4357                 for i, v in enumerate(series_gen):
-> 4358                     results[i] = func(v)
4359                     keys.append(v.name)
4360             except Exception as e:

<ipython-input-174-81a0d5d62dc4> in <lambda>(x)
    32 aa = extractKeyword(text) #working!!
    33 
---> 34 me = df.apply(lambda x: extractKeyword(x))

<ipython-input-174-81a0d5d62dc4> in extractKeyword(text)
    20     """Morphological analysis of text and returning a list of only nouns"""
    21     tagger = MeCab.Tagger('-Ochasen')
---> 22     node = tagger.parseToNode(text)
    23     keywords = []
    24     while node:

~/anaconda3/lib/python3.6/site-packages/MeCab.py in parseToNode(self, *args)
    280     __repr__ = _swig_repr
    281     def parse(self, *args): return _MeCab.Tagger_parse(self, *args)
--> 282     def parseToNode(self, *args): return _MeCab.Tagger_parseToNode(self, *args)
    283     def parseNBest(self, *args): return _MeCab.Tagger_parseNBest(self, *args)
    284     def parseNBestInit(self, *args): return _MeCab.Tagger_parseNBestInit(self, *args)

TypeError: ("in method 'Tagger_parseToNode', argument 2 of type 'char const *'", 'occurred at index 0')

Answer 1

I see you got some help on the Japanese StackOverflow, but here's an answer in English:

The first thing to fix is that read_csv was treating the first line of your sample.csv as the header. To fix that, use the names argument of read_csv.
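A minimal sketch of the header problem, assuming the three rows of sample.csv are held in an in-memory string (CSV_TEXT is an illustrative name, not part of the original script) so that it runs without the file:

```python
import io

import pandas as pd

# The same three rows as sample.csv, kept in memory for a self-contained demo.
CSV_TEXT = (
    "0,今日も夜まで働きました。\n"
    "1,オフィスには誰もいませんが、エラーと格闘中\n"
    "2,デバッグばかりしていますが、どうにもなりません。\n"
)

# Default behaviour: the first row is consumed as the header,
# so only two data rows remain and the first sentence becomes a column name.
df_default = pd.read_csv(io.StringIO(CSV_TEXT))
print(df_default.shape)
print(list(df_default.columns))

# With names=..., no row is treated as a header and all three rows are data.
df_named = pd.read_csv(io.StringIO(CSV_TEXT), names=["Number", "String"])
print(df_named.shape)
print(list(df_named.columns))
```

With the default call the first sentence is lost as data, which is easy to miss because no error is raised.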

Next, df.apply will by default apply the function to each column of the dataframe, so your function receives a whole Series rather than a string. You would need to do something like df.apply(lambda x: extractKeyword(x['String']), axis=1), but this won't work because each sentence has a different number of nouns, and Pandas will complain that it cannot stack a 1x2 array on top of a 1x5 array. The simplest way is to apply the function to the String Series.
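The difference between the two apply calls can be sketched without MeCab. Here require_str is a hypothetical stand-in for tagger.parseToNode that, like the real binding, rejects anything that is not a str; the small dataframe is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Number": [0, 1],
    "String": ["今日も夜まで働きました。", "デバッグばかりしています。"],
})


def require_str(value):
    """Hypothetical stand-in for tagger.parseToNode: accepts only str."""
    if not isinstance(value, str):
        raise TypeError(f"expected str, got {type(value).__name__}")
    return len(value)


# DataFrame.apply passes each column to the function as a whole Series.
seen_types = df.apply(lambda col: type(col).__name__)
print(seen_types.tolist())

# So a function that needs a str fails on the very first column.
try:
    df.apply(require_str)
    failed = False
except TypeError as exc:
    failed = True
    print("TypeError:", exc)

# Series.apply passes one element at a time, so each call receives a str.
lengths = df["String"].apply(require_str)
print(lengths.tolist())
```

This is why the original call raised a TypeError at index 0: the tagger was handed a Series, not a string.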

The final problem is a bug in the MeCab Python 3 bindings: see https://github.com/SamuraiT/mecab-python3/issues/3. You found a workaround by running parseToNode twice; you can also call parse before parseToNode.

Putting these three things together:

import pandas as pd
import MeCab  
df = pd.read_csv('sample.csv', encoding='utf-8', names=['Number', 'String'])

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parse(text)
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords

me = df['String'].apply(extractKeyword)
print(me)

When you run this script with the sample.csv you provided, you get:

➜  python3 demo.py
0                  [今日, 夜]
1    [オフィス, 誰, エラー, 格闘, 中]
2                   [デバッグ]
Name: String, dtype: object

Answer 2

parseToNode failed every time, so I needed to put this line

 tagger.parseToNode('dummy')

before

 node = tagger.parseToNode(text)

and it worked!

But I don't know the reason; maybe the parseToNode method has a bug.

def extractKeyword(text):
    """Morphological analysis of text and returning a list of only nouns"""
    tagger = MeCab.Tagger('-Ochasen')
    tagger.parseToNode('ダミー')  # dummy first call: workaround for the failing parseToNode
    node = tagger.parseToNode(text)
    keywords = []
    while node:
        if node.feature.split(",")[0] == u"名詞": # this means noun
            keywords.append(node.surface)
        node = node.next
    return keywords
