pandas str.decode on strings returns NaN

Question

I'm practicing on the Kaggle news headline dataset: https://www.kaggle.com/aaron7sun/stocknews#Combined_News_DJIA.csv

import pandas as pd

df = pd.read_csv('./data/Combined_News_DJIA.csv')

When I read the DataFrame of news headlines, I get this odd formatting in the Series:

0       b"Georgia 'downs two Russian warplanes' as cou...
1       b'Why wont America &amp; Nato help us? If they w...
2       b'Remember that adorable 9-year-old who sang a...
3       b' U.S. refuses Israel weapons to attack Iran:...
4       b'All the experts admit that we should legalis...

I tried using the following:

df['Series'].str.decode("utf-8")

However, the output is a Series of NaN. Any ideas? It would be great to apply the fix to the whole DataFrame and not just one Series.

Answer 1

You can't decode it from UTF-8 because it's already a string, not a byte sequence.

The content of the file is indeed confusing: it contains strings whose text literally starts with b' or b", which misleads the user into thinking they are bytes, but they are not.

If you run df.Top1[0], you'll see that it contains:

'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"'

And type(df.Top1[0]) is just str. Therefore you can't decode it from UTF-8: .str.decode expects bytes values, so each str element comes back as NaN.
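One possible way to clean this up, offered as a sketch rather than the definitive fix: since each value is the text of a Python bytes literal, ast.literal_eval can parse it back into real bytes, which can then be decoded. The helper name strip_bytes_repr and the small sample frame are assumptions made for illustration; only the first headline is taken from the output above.

```python
import ast

import pandas as pd


def strip_bytes_repr(value):
    """Convert text like "b'...'" into the plain string it represents.

    Hypothetical helper: anything that is not a str starting with b' or b"
    (numbers, NaN, ordinary strings) is returned unchanged.
    """
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            # Parse the bytes literal, then decode the resulting bytes.
            return ast.literal_eval(value).decode("utf-8")
        except (ValueError, SyntaxError, UnicodeDecodeError):
            # Malformed literal: leave the original text untouched.
            return value
    return value


# Small stand-in for the CSV; the second row and the Label column are made up.
df = pd.DataFrame({
    "Label": [0, 1],
    "Top1": [
        'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"',
        "a headline without the prefix",
    ],
})

# Apply the helper to every cell of every column.
cleaned = df.apply(lambda col: col.map(strip_bytes_repr))
print(cleaned["Top1"][0])
```

Applying the helper column by column with Series.map covers the whole DataFrame, as asked in the question, and leaves non-string cells as they were.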
