
Printing contents of a tsv file (with UTF-8) in Python

The code below works fine in a file I've named tsv_test.py:

import csv

class ReadUTF8():

    def unicode_csv_reader(self, utf8_data, dialect=csv.excel_tab, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]


    def load_deck_data(self):
        filename = 'lexicon.tsv'
        reader = self.unicode_csv_reader(open(filename))
        for field1, field2, field3, field4 in reader:
            print field1, field2, field3, field4

ReadUTF8().load_deck_data()

But when I copy/paste it into my project (this is a Kivy project), it breaks. Code and error below:

class StudyScreenManagement(ScreenManager):

    def unicode_csv_reader(self, utf8_data, dialect=csv.excel_tab, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]


    def load_deck_data(self):
        filename = 'lexicon.tsv'
        reader = self.unicode_csv_reader(open(filename))
        for field1, field2, field3, field4 in reader:
            print field1, field2, field3, field4

I doubt this is related, but just in case, here is the relevant .kv file:

Button:
    text: 'Lexicon'
    on_press: app.root.load_deck_data()

Output:

 File "/Users/bearnun/code/mingyu/mingyuKivy/mingyu_controllers.py", line 14, in load_deck_data
 for field1, field2, field3, field4 in reader:
 ValueError: need more than 1 value to unpack

::Side Note::

I tried just printing 'field1' in both cases. With that change the output for both is:

[u'\u4b03', u'\u98d2', u'[sa4]', u'/variant of \u98af|\u98d2[sa4]/']
[u'\u4b20', u'\u4b20', u'[fei1]', u'/old variant of \u970f[fei1]/']

My desired output:

䬃 飒 [sa4] /variant of 颯|飒[sa4]/
䬠 䬠 [fei1] /old variant of 霏[fei1]/

[EDIT BELOW]

lexicon.tsv contents:

䬃   飒   [sa4]   /variant of 颯|飒[sa4]/
䬠   䬠   [fei1]  /old variant of 霏[fei1]/

Apparently, I am receiving a list instead of a generator, so if in load_deck_data() I change:

for field1, field2, field3, field4 in reader:
    print field1, field2, field3, field4

to:

for line in reader:
    print ''.join(line)

my project works fine.

Check out this example:

data = [
    ['a', 'b', 'c', 'd'],
    ['e'],
]

def mygen(x):
    for item in x:
        yield item

for line in mygen(data):
    print ''.join(line)

--output:--
abcd
e

for col1, col2, col3, col4 in mygen(data):
    print col1, col2, col3, col4


--output:--
a b c d

Traceback (most recent call last):
  File "1.py", line 13, in <module>
    for col1, col2, col3, col4 in mygen(data):
ValueError: need more than 1 value to unpack

In the first for-in loop, you are asking, "Please retrieve all the elements in the list and join them together." In the second for-in loop, you are demanding, "Retrieve four elements from the list!" See the difference? In the first case, the list can contain 0 to n elements and there won't be an error. In the second case, the list has to have exactly 4 elements; with fewer (or more) there will be an error.
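To make the unpacking rule concrete, here is a minimal sketch written for Python 3 (the rest of this answer uses Python 2 syntax). The helper name try_unpack4 is made up for illustration; it attempts the same four-way unpacking as the second loop above and reports what happens.

```python
def try_unpack4(row):
    """Try to unpack row into exactly four names; return a short status string."""
    try:
        col1, col2, col3, col4 = row
    except ValueError as exc:
        # Raised when the row has fewer OR more than four items.
        return "error: %s" % exc
    return "ok: %s %s %s %s" % (col1, col2, col3, col4)


# A 4-item row unpacks; the 1-item and 5-item rows both fail.
for row in (['a', 'b', 'c', 'd'], ['e'], ['a', 'b', 'c', 'd', 'f']):
    print(len(row), try_unpack4(row))
```

Only the four-item row unpacks cleanly; the exact wording of the error message differs between Python versions.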

I would love to know why I'm getting a generator in one place, but a list in another.

Simple. You aren't. csv.reader() yields a list of strings for each row, which means your generator function yields a list of strings on each iteration.

I think you changed the data in your file. In one file, you have tab delimited data, and csv.reader() returns a list of four things for each line in your file, which can be unpacked into four variables. Your other file has non-tab delimited data, which causes csv.reader() to read the whole line as one item, so the list of strings that csv.reader() returns contains only one item, and a one-item list cannot be unpacked into four variables.
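A minimal sketch of that difference, again for Python 3, using in-memory text instead of real files. The two sample lines are assumptions about what such files might contain, not the actual contents of your files, and the helper name rows is made up for illustration.

```python
import csv
import io

tabbed = u"1\t2\t3\t4\n"        # fields separated by tab characters
untabbed = u"1   2   3   4\n"   # fields separated by spaces only


def rows(text):
    """Parse text with the excel_tab dialect and return the rows as a list."""
    return list(csv.reader(io.StringIO(text), dialect=csv.excel_tab))


print(rows(tabbed))     # one row with four items
print(rows(untabbed))   # one row with a single item: the whole line
```

Both inputs look the same when printed in a terminal, but only the first one yields a row that can be unpacked into four variables.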

I tried just printing 'field1' in both cases. With that change the output for both is:

[u'\u4b03', u'\u98d2', u'[sa4]', u'/variant of \u98af|\u98d2[sa4]/']
[u'\u4b20', u'\u4b20', u'[fei1]', u'/old variant of \u970f[fei1]/']

Instead of doing print field1, if you do print repr(field1), I suspect you will get:

"[u'\u4b03', u'\u98d2', u'[sa4]', u'/variant of \u98af|\u98d2[sa4]/']"

Note the outer quotes, which mean your tsv file literally has the following on one line:

[䬃, 飒, [sa4], /variant of 颯|飒[sa4]/]

with no tabs separating anything, so the whole line-that-looks-like-a-list is read in as one item, and csv.reader() therefore returns a list containing that one item. You were fooled into thinking the single item was a Python list because when you print a string, Python does not display the quotes. For example, there is no difference in the output for the following two print statements:

>>> print "[1, 2, 3]"
[1, 2, 3]
>>> print [1, 2, 3]
[1, 2, 3]

print can fool you in other situations as well, because a string can contain unprintable characters, which the output of print won't reveal:

>>> print "hello\bworld"
hellworld

The bottom line is: you can never know what the original thing was by looking at the output of print. Whenever you want to know exactly what the original thing is, always use:

print repr(some_string)

Now, look at the results:

>>> print repr([1, 2, 3])
[1, 2, 3]
>>> print repr('[1, 2, 3]')
'[1, 2, 3]'
>>> print repr('hello\bworld')
'hello\x08world'

The output tells you exactly what the original thing was.

With the following tab delimited lexicon.tsv file:

1   2   3   €
䬃   飒   [sa4]   /variant of 颯|飒[sa4]/

the code below causes no errors after clicking on the Lexicon button:

from kivy.app import App
from kivy.uix.screenmanager import ScreenManager, Screen
import csv

class StudyScreenManager(ScreenManager):

    def unicode_csv_reader(self, utf8_data, dialect=csv.excel_tab, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]


    def load_deck_data(self):
        filename = 'lexicon.tsv'
        reader = self.unicode_csv_reader(open(filename))
        for field1, field2, field3, field4 in reader:
            print field1, field2, field3, field4


class HistoryScreen(Screen):
    pass

class MathScreen(Screen):
    pass

class MyApp(App):
    def build(self):
        sm = StudyScreenManager()
        sm.add_widget(HistoryScreen(name='history'))
        sm.add_widget(MathScreen(name='math'))

        return sm

MyApp().run()

my.kv:

<HistoryScreen>:  #the 'root' of the following widget hierarchy:
    BoxLayout:
        Button:
            text: 'Lexicon'
            on_press: app.root.load_deck_data()  #self=Button, root=HistoryScreen, app.root=the Widget returned by build()
        Button:
            text: "Next"
            on_press: root.manager.current = "math"

<MathScreen>: #the 'root' of the following widget hierarchy:
    BoxLayout:
        Button:
            text: 'Lexicon'
            on_press: app.root.load_deck_data()
        Button:
            text: 'Previous'
            on_press: root.manager.current = "history"

After clicking on the Lexicon button, here is the output I see in my UTF-8 aware terminal window:

1 2 3 €
䬃 飒 [sa4] /variant of 颯|飒[sa4]/
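For anyone on Python 3, where unicode() and the print statement no longer exist, a rough equivalent might look like the sketch below. It is not part of the code above: load_rows and its expected parameter are hypothetical names, and the length check with repr-style formatting is an added safeguard so that a malformed line is reported instead of raising the unpacking ValueError.

```python
import csv
import io


def load_rows(filename, expected=4):
    """Yield rows of a UTF-8 TSV file that have exactly `expected` fields.

    Rows with a different field count are printed with %r and skipped.
    """
    # newline='' is what the csv module documentation recommends for files.
    with io.open(filename, encoding='utf-8', newline='') as f:
        for row in csv.reader(f, dialect=csv.excel_tab):
            if len(row) != expected:
                # %r shows quotes and escapes, so a one-item row is obvious.
                print("skipping malformed row: %r" % (row,))
                continue
            yield row
```

A loop such as `for field1, field2, field3, field4 in load_rows('lexicon.tsv'): print(field1, field2, field3, field4)` would then print the well-formed rows and point out any line that did not split into four fields.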

Apparently, I am receiving a list instead of a generator, so if in load_deck_data() I change...:

for field1, field2, field3, field4 in reader:
    print field1, field2, field3, field4

...to...:

for line in reader:
    print ''.join(line)

...my project works fine. This, of course, does not work in the small code snippet that originally worked.

I would love to know why I'm getting a generator in one place, but a list in another. :)
