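How can I get my Python code to run again?
I wrote a Python script using for loops to extract metadata from tweets, and it initially worked well. I have now replaced the for loops with list comprehensions, and my code throws an error that I can't really decipher. Here is my code: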
def tweetFeatures(tweet):
    #Count the number of words in each tweet
    wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]

    #Count the number of characters in each tweet
    chars = [len(tweet.loc[k]) for k in range(len(tweet))]

    #Extract the mentions in each tweet
    mentions = [list(re.findall("@([a-zA-Z0-9_]{1,50})",tweet.loc[p])) for p in range(len(tweet))]

    #Counts the number of mentions in each tweet
    mention_count = [len(mentions[t]) for t in range(len(mentions))]

    #Extracts the hashtags in each tweet
    hashtags = [list(re.findall("#([a-zA-Z0-9_]{1,50})",tweet.loc[f])) for f in range(len(tweet))]

    #Counts the number of hashtags in each tweet
    hashtag_count = [len(hashtags[d]) for d in range(len(hashtags))]

    #Extracts the urls in each tweet
    url = [list(re.findall("(?P<url>https?://[^\s]+)",tweet.loc[l])) for l in range(len(tweet))]

    #Counts the number of urls in each tweet
    url_count = [len(url[c]) for c in range(len(url))]

    #Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)
    return feats_df
This is the error I get after running the line tweetFeatures(tweet = text_df):
AttributeError                            Traceback (most recent call last)
<ipython-input-22-a074a939c816> in <module>
----> 1 tweetFeatures(tweet = text_df)

<ipython-input-21-36def6dfde04> in tweetFeatures(tweet)
      1 def tweetFeatures(tweet):
      2     #Count the number of words in each tweet
----> 3     wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4
      5     #Count the number of characters in each tweet

<ipython-input-21-36def6dfde04> in <listcomp>(.0)
      1 def tweetFeatures(tweet):
      2     #Count the number of words in each tweet
----> 3     wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
      4
      5     #Count the number of characters in each tweet

~\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5463             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5464                 return self[name]
-> 5465             return object.__getattribute__(self, name)
   5466
   5467     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'split'
Here is the test data I created:
text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
"Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
"@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
"I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
"Why can't the #Athletics be more exciting? #Tokyo2020",
"It is so much fun to see beautful colors at the #Olympics"]
I converted it to a pandas DataFrame with text_df = pd.DataFrame(text), and print(text_df) then gives the following:
0
0 @Harrison2Jennifer Tokyo 2020 is so much fun. ...
1 Gabrielle Thomas is my favourite sprinter @Too...
2 @Sports_head I wish the #Tokyo2020 @Olympics w...
3 I hate the attitude of officials at this olymp...
4 Why can't the #Athletics be more exciting? #To...
5 It is so much fun to see beautful colors at th...
The code was written in a Jupyter notebook. I would appreciate any helpful suggestions about what exactly is going wrong. Thank you.
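What you are doing is creating a pd.DataFrame, but one with only a single column. In your case, this column is named 0. On a DataFrame, tweet.loc[j] returns the whole row j as a Series rather than the string inside it, and a Series has no split() method.
So you can fix your code in either of the following ways.
Pass the column instead of the whole DataFrame:
tweetFeatures(tweet = text_df[0])
Or build a Series instead of a DataFrame:
text_df = pd.Series(text)
and call the function the way you do now.
Also, in most cases you can speed up the function by using apply. Note that for small inputs such as the example you provided this will be slightly slower, but it becomes significantly faster with more tweets: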
text = ["@Harrison2Jennifer Tokyo 2020 is so much fun. Loving every bit of it just as @MeggyJane & @Tommy620 say #Tokyo2020 https://www.corp.com",
"Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
"@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
"I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
"Why can't the #Athletics be more exciting? #Tokyo2020",
"It is so much fun to see beautful colors at the #Olympics"]*1000
import re
from functools import partial

import pandas as pd

def tweetFeatures_speedup(tweet):
    # tweet is expected to be a pandas Series of strings

    # Count the number of words in each tweet
    wordcount = tweet.apply(lambda x: len(x.split()))

    # Count the number of characters in each tweet
    chars = tweet.apply(len)

    # Extract the mentions in each tweet and count them
    mention_finder = partial(re.findall, r"@([a-zA-Z0-9_]{1,50})")
    mention_count = tweet.apply(lambda x: len(mention_finder(x)))

    # Extract the hashtags in each tweet and count them
    hashtag_finder = partial(re.findall, r"#([a-zA-Z0-9_]{1,50})")
    hashtag_count = tweet.apply(lambda x: len(hashtag_finder(x)))

    # Extract the urls in each tweet and count them
    url_finder = partial(re.findall, r"(?P<url>https?://[^\s]+)")
    url_count = tweet.apply(lambda x: len(url_finder(x)))

    # Put everything into a dataframe
    feats = {"n_words":wordcount,"n_chars":chars,"n_mentions":mention_count,"n_hashtag":hashtag_count,"n_url":url_count}
    feats_df = pd.DataFrame(feats)
    return feats_df
This leads to the following %%timeit comparison:
193 ms ± 1.95 ms per loop
21.3 ms ± 85.1 µs
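As a related sketch that is not part of the answer above, the same counts can also be written with pandas' vectorized string methods on a Series instead of apply with lambdas. The function name tweet_features_str is hypothetical, the patterns are simplified (no capture groups, \S in place of [^\s]), and whether this is faster on your data would need to be measured:

```python
import pandas as pd

def tweet_features_str(tweet: pd.Series) -> pd.DataFrame:
    """Hypothetical variant that uses the .str accessor on a Series of strings."""
    return pd.DataFrame({
        # split on whitespace, then take the length of each resulting list
        "n_words": tweet.str.split().str.len(),
        # number of characters in each string
        "n_chars": tweet.str.len(),
        # str.count counts regex matches per string
        "n_mentions": tweet.str.count(r"@[a-zA-Z0-9_]{1,50}"),
        "n_hashtag": tweet.str.count(r"#[a-zA-Z0-9_]{1,50}"),
        "n_url": tweet.str.count(r"https?://\S+"),
    })

# Two of the sample tweets from the question
sample = pd.Series([
    "Gabrielle Thomas is my favourite sprinter @TooSports https://www.flick.org https://www.bugger.com",
    "Why can't the #Athletics be more exciting? #Tokyo2020",
])
features = tweet_features_str(sample)
print(features)
```

Like the apply version, this expects a Series (for example text_df[0]), not the whole DataFrame.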
According to your error message, AttributeError: 'Series' object has no attribute 'split', you are trying to call the string method split() on a pandas Series object:
wordcount = [len(tweet.loc[j].split()) for j in range(len(tweet))]
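To see where the Series comes from, here is a minimal sketch that compares .loc on a one-column DataFrame with .loc on the column itself. It uses two of the sample tweets and assumes the default column label 0 that pd.DataFrame(text) produces:

```python
import pandas as pd

# Two of the sample tweets from the question
text = [
    "Why can't the #Athletics be more exciting? #Tokyo2020",
    "It is so much fun to see beautful colors at the #Olympics",
]

text_df = pd.DataFrame(text)   # one column, labelled 0
row = text_df.loc[0]           # a whole row, returned as a pandas Series
cell = text_df[0].loc[0]       # column 0 first, then row 0: a plain str

print(type(row).__name__)      # Series
print(type(cell).__name__)     # str
print(hasattr(row, "split"))   # False, which is why the AttributeError is raised
print(len(cell.split()))       # 8
```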
Looking at the test data you provided, you can do the following to fix the error:
import pandas as pd
text_df = pd.DataFrame(text,columns=["tweet"])
text_df.tweet.loc[0].split()
which returns:
['@Harrison2Jennifer',
'Tokyo',
'2020',
'is',
'so',
'much',
'fun.',
'Loving',
'every',
'bit',
'of',
'it',
'just',
'as',
'@MeggyJane',
'&',
'@Tommy620',
'say',
'#Tokyo2020',
'https://www.corp.com']
Alternatively, a solution without pandas: pass the "raw" list of tweets and change the list comprehension to
wordcount = [len(t.split()) for t in tweet]
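Extending that idea to the other features, a pandas-free version could look roughly like the sketch below. The helper name tweet_features_plain and the precompiled pattern names are hypothetical. The mention and hashtag regular expressions are taken from the question, and the URL pattern is a simplified equivalent:

```python
import re

# Patterns compiled once and reused for every tweet
MENTION = re.compile(r"@([a-zA-Z0-9_]{1,50})")
HASHTAG = re.compile(r"#([a-zA-Z0-9_]{1,50})")
URL = re.compile(r"https?://\S+")

def tweet_features_plain(tweets):
    """Hypothetical helper: return one dict of counts per tweet in a plain list."""
    return [
        {
            "n_words": len(t.split()),
            "n_chars": len(t),
            "n_mentions": len(MENTION.findall(t)),
            "n_hashtag": len(HASHTAG.findall(t)),
            "n_url": len(URL.findall(t)),
        }
        for t in tweets
    ]

# Two of the sample tweets from the question
tweets = [
    "@Sports_head I wish the #Tokyo2020 @Olympics will never end #Athletics #Sprints",
    "I hate the attitude of officials at this olympics @Kieran https://www.launch.com",
]
rows = tweet_features_plain(tweets)
print(rows[0])
```

If a DataFrame is still wanted at the end, a list of dicts like this can be passed to pd.DataFrame directly.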