How do I get my Python code to run again

Question

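I wrote a Python script using for loops to extract metadata from tweets, and it initially worked well. Now that I have replaced the for loops with list comprehensions, my code throws an error that I cannot decipher. Here is my code: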

def tweetFeatures(tweet):
        #Count the number of words in each tweet
        wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
        
        #Count the number of characters in each tweet
        chars = [len(tweet.loc[k]) for k in range(len(tweet))]
        
        #Extract the mentions in each tweet
        mentions = [list(re.findall("@([a-zA-Z0-9_]{1,50})",tweet.loc[p])) for p in range(len(tweet))]
        
        #Counts the number of mentions in each tweet 
        mention_count = [len(mentions[t]) for t in range(len(mentions))]
        
        #Extracts the hashtags in each tweet    
        hashtags = [list(re.findall("#([a-zA-Z0-9_]{1,50})",tweet.loc[f])) for f in range(len(tweet))]
        
        #Counts the number of hashtags in each tweet    
        hashtag_count = [len(hashtags[d]) for d in range(len(hashtags))]
        
        #Extracts the urls in each tweet
        url = [list(re.findall("(?P<url>https?://[^\s]+)",tweet.loc[l])) for l in range(len(tweet))]
        
        #Counts the number of urls in each tweet
        url_count = [len(url[c]) for c in range(len(url))]
        
        #Put everything into a dataframe
        feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
        feats_df = pd.DataFrame(feats)

        return feats_df

This is the error I get after running tweetFeatures(tweet = text_df):

AttributeError                            Traceback (most recent call last)
<ipython-input-22-a074a939c816> in <module>
----> 1 tweetFeatures(tweet = text_df)

<ipython-input-21-36def6dfde04> in tweetFeatures(tweet)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

<ipython-input-21-36def6dfde04> in <listcomp>(.0)
      1 def tweetFeatures(tweet):
      2         #Count the number of words in each tweet
----> 3         wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4 
      5         #Count the number of characters in each tweet

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466 
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'

This is the test data I created:

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
           "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
           "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
           "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
           "Why can't the #Athletics be more exciting? #Tokyo2020",
           "It is so much fun to see beautful colors at the #Olympics"]

I converted it to a pandas DataFrame with text_df = pd.DataFrame(text) and then ran print(text_df), which gives:

0
0   @Harrison2Jennifer Tokyo 2020 is so much fun. ...
1   Gabrielle Thomas is my favourite sprinter @Too...
2   @Sports_head I wish the #Tokyo2020 @Olympics w...
3   I hate the attitude of officials at this olymp...
4   Why can't the #Athletics be more exciting? #To...
5   It is so much fun to see beautful colors at th...

The code was written in a Jupyter notebook. I would appreciate any helpful advice on what exactly went wrong. Thank you.

Answer 1

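You are creating a pd.DataFrame that has only one column. In your case this column is named 0. Because text_df is a DataFrame, text_df.loc[j] returns the whole row as a Series rather than the tweet string, and a Series has no split method.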

So you can fix your code in either of the following ways:

Pass the column instead of the whole DataFrame: tweetFeatures(tweet = text_df[0])
Create a Series instead of a DataFrame: text_df = pd.Series(text), and call the function as you do now.
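
To make the difference concrete, here is a minimal sketch that compares what .loc returns on a DataFrame and on a Series. The two-tweet list is a shortened stand-in for the question's data, and the variable names are only illustrative.

```python
import pandas as pd

# Shortened stand-in for the question's test data (assumption for illustration)
text = ["@Sports_head I wish the #Tokyo2020 @Olympics will never end",
        "Why can't the #Athletics be more exciting? #Tokyo2020"]

text_df = pd.DataFrame(text)  # a DataFrame with a single column labelled 0
text_s = pd.Series(text)      # a Series of strings

row = text_df.loc[0]              # a DataFrame row is itself a Series
cell_from_df = text_df[0].loc[0]  # select column 0 first, then the row: a str
cell_from_s = text_s.loc[0]       # a Series element is already a str

print(type(row).__name__)
print(type(cell_from_df).__name__)
print(len(cell_from_s.split()))
```

Calling row.split() would raise the same AttributeError as in the question, while the other two values are plain strings and can be split.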

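In addition, in most cases you can speed up your function by using apply. Note that for small inputs such as the example you provided this will be slightly slower, but it becomes significantly faster once you process more tweets: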

text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
       "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
       "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
       "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
       "Why can't the #Athletics be more exciting? #Tokyo2020",
       "It is so much fun to see beautful colors at the #Olympics"]*1000

import re
from functools import partial

import pandas as pd

def tweetFeatures_speedup(tweet):
    #Count the number of words in each tweet
    wordcount = tweet.apply(lambda x: len(x.split()))
    
    #Count the number of characters in each tweet
    chars = tweet.apply(len)
    
    #Extract the mentions in each tweet
    mention_finder = partial(re.findall, "@([a-zA-Z0-9_]{1,50})")
    
    #Counts the number of mentions in each tweet
    mention_count = tweet.apply(lambda x: len(mention_finder(x)))
    
    #Extracts the hashtags in each tweet    
    #Counts the number of hashtags in each tweet   
    hashtag_finder = partial(re.findall, "#([a-zA-Z0-9_]{1,50})")
    hashtag_count = tweet.apply(lambda x: len(hashtag_finder(x)))
    
    #Extracts the urls in each tweet
    #Counts the number of urls in each tweet
    url_finder = partial(re.findall, r"(?P<url>https?://[^\s]+)")
    url_count = tweet.apply(lambda x: len(url_finder(x)))
    
    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)

    return feats_df

Comparing the two with %%timeit gives:

Your version: 193 ms ± 1.95 ms per loop
My version: 21.3 ms ± 85.1 µs
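
As a possible variation that is not part of the answer above, the same counts can be sketched with the pandas .str accessor instead of apply. The function name tweet_features_str and the two sample tweets are assumptions for illustration, and whether this is faster than apply on your data is something to measure rather than assume.

```python
import pandas as pd

def tweet_features_str(tweet: pd.Series) -> pd.DataFrame:
    """Count words, characters, mentions, hashtags and URLs per tweet.

    Hypothetical helper: expects a Series of strings, not a DataFrame.
    """
    return pd.DataFrame({
        "n_words": tweet.str.split().str.len(),
        "n_chars": tweet.str.len(),
        # str.count returns the number of regex matches in each string
        "n_mentions": tweet.str.count(r"@[a-zA-Z0-9_]{1,50}"),
        "n_hashtag": tweet.str.count(r"#[a-zA-Z0-9_]{1,50}"),
        "n_url": tweet.str.count(r"https?://\S+"),
    })

# Two tweets taken from the question's test data
sample = pd.Series([
    "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
    "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
])
feats = tweet_features_str(sample)
print(feats)
```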

Answer 2

According to your error message, AttributeError: 'Series' object has no attribute 'split', you are trying to call the string method split() on a pandas Series object.

wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]

Looking at the test data you provided, you can fix the error as follows:

import pandas as pd

text_df = pd.DataFrame(text,columns=["tweet"])

text_df.tweet.loc[0].split()

This returns:

['@Harrison2Jennifer',
 'Tokyo',
 '2020',
 'is',
 'so',
 'much',
 'fun.',
 'Loving',
 'every',
 'bit',
 'of',
 'it',
 'just',
 'as',
 '@MeggyJane',
 '&',
 '@Tommy620',
 'say',
 '#Tokyo2020',
 'https://www.corp.com']

Alternatively, for a solution without pandas, pass the "raw" list of tweets and change the list comprehension to:

wordcount = [len(t.split()) for t in tweet]
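
Extending that idea to the other features, a pandas-free version might look like the sketch below. The function name tweet_features_plain and the module-level pattern names are hypothetical; the mention and hashtag patterns are the ones from the question, and the URL pattern is a simplified equivalent without the named group.

```python
import re

# Patterns compiled once and reused for every tweet
MENTION = re.compile(r"@([a-zA-Z0-9_]{1,50})")
HASHTAG = re.compile(r"#([a-zA-Z0-9_]{1,50})")
URL = re.compile(r"https?://\S+")

def tweet_features_plain(tweets):
    """Return one dict of counts per tweet; `tweets` is any iterable of str."""
    return [
        {
            "n_words": len(t.split()),
            "n_chars": len(t),
            "n_mentions": len(MENTION.findall(t)),
            "n_hashtag": len(HASHTAG.findall(t)),
            "n_url": len(URL.findall(t)),
        }
        for t in tweets
    ]

# Two tweets taken from the question's test data
sample = ["Why can't the #Athletics be more exciting? #Tokyo2020",
          "I hate the attitude of officials at this olympics @Kieran https://www.launch.com"]
rows = tweet_features_plain(sample)
print(rows[0])
```

If a DataFrame is still wanted at the end, the list of dicts can be passed to pd.DataFrame.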
