python: ValueError: invalid literal for int() with base 10: ' '

I have a text file that contains entries like this:

70154::308933::3
UserId::ProductId::Score

I wrote this program to read it (sorry if the indentation is a bit messed up here):

def generateSyntheticData(fileName):
    dataDict = {}
    # rowDict = []
    innerDict = {}

    try:
        # for key in range(5):
        # count = 0
        myFile = open(fileName)
        c = 0
        # del innerDict[0:len(innerDict)]

        for line in myFile:
            c += 1
            # line = str(line)
            n = len(line)
            # print 'n: ', n
            if n is not 1:
                # if c%100 ==0: print "%d: "%c, " entries read so far"
                # words = line.replace(' ','_')
                words = line.replace('::', ' ')

                words = words.strip().split()

                # print 'userid: ', words[0]
                userId = int(words[0])  # I get the error here
                movieId = int(words[1])
                rating = float(words[2])
                print "userId: ", userId, " productId: ", movieId, " :rating: ", rating
                # print words
                # words = words.replace('_', ' ')
                innerDict = dataDict.setdefault(userId, {})
                innerDict[movieId] = rating
                dataDict[userId] = (innerDict)
                innerDict = {}
    except IOError as (errno, strerror):
        print "I/O error({0}) :{1} ".format(errno, strerror)

    finally:
        myFile.close()
    print "total ratings read from file", fileName, " :%d " % c
    return dataDict

But I get the error:

ValueError: invalid literal for int() with base 10: ''

The funny thing is that it works just fine when reading data in the same format from another file. Actually, while posting this question I noticed something weird: in the entry 70154::308933::3, every character has a space after it, like 7 space 0 space 1 space 5 space 4 space :: space 3... But the text file looks fine; the spaces only show up on copy-pasting. Any clue what's going on? Thanks.

The "spaces" that you are seeing appear to be NULs ("\x00"). There is a 99.9% chance that your file is encoded in UTF-16, UTF-16LE, or UTF-16BE. If this is a one-off file, just open it with Notepad and save it as "ANSI", not "Unicode" and not "Unicode big endian". If, however, you need to process it as is, you'll need to know or detect what the encoding is. To find out which, do this:

print repr(open("yourfile.txt", "rb").read(20))

and compare the start of the output with the following:

>>> ucode = u"70154:"
>>> for sfx in ["", "LE", "BE"]:
...     enc = "UTF-16" + sfx
...     print enc, repr(ucode.encode(enc))
...
UTF-16 '\xff\xfe7\x000\x001\x005\x004\x00:\x00'
UTF-16LE '7\x000\x001\x005\x004\x00:\x00'
UTF-16BE '\x007\x000\x001\x005\x004\x00:'
>>>

You can make a detector that's good enough for your purposes by inspecting the first 2 bytes (f2b below):

[pseudocode]
if f2b in "\xff\xfe\xff": UTF-16
elif f2b[1] == "\x00": UTF-16LE
elif f2b[0] == "\x00": UTF-16BE
else: cp1252 or UTF-8 or whatever else is prevalent in your neck of the woods.

You could avoid hard-coding the fallback encoding:

>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
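
A runnable version of the pseudocode detector might look like the sketch below. It is written for Python 3 (the snippets above are Python 2). The function name `detect_encoding` matches the call used in this answer, but its `fallback` parameter and the length guards are assumptions added for illustration. Like the pseudocode, it is only a heuristic: it assumes the file starts with a byte order mark or with an ASCII character such as a digit.

```python
import locale


def detect_encoding(first_two_bytes, fallback=None):
    """Guess a text encoding from the first two bytes of a file (heuristic)."""
    f2b = bytes(first_two_bytes[:2])
    if f2b in (b"\xff\xfe", b"\xfe\xff"):
        # A byte order mark is present. The "utf-16" codec reads it,
        # picks the byte order, and drops it from the decoded text.
        return "utf-16"
    if len(f2b) == 2 and f2b[1] == 0:
        # ASCII character followed by NUL: little-endian without a BOM.
        return "utf-16-le"
    if len(f2b) == 2 and f2b[0] == 0:
        # NUL followed by an ASCII character: big-endian without a BOM.
        return "utf-16-be"
    # Otherwise use the caller's choice, or the locale's preferred encoding.
    return fallback or locale.getpreferredencoding(False)


# Try the detector on the same sample line encoded three different ways.
for enc in ("utf-16-le", "utf-16-be", "cp1252"):
    raw = "70154::308933::3\n".encode(enc)
    print(enc, "->", detect_encoding(raw[:2], fallback="cp1252"))
```

Returning `"utf-16"` when a BOM is found lets the codec strip the BOM itself, so it does not end up glued to the first user id.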

Your line-reading code will look like this:

rawbytes = open(fileName, "rb").read()  # fileName is the path, as in your function
enc = detect_encoding(rawbytes[:2])
for line in rawbytes.decode(enc).splitlines():
    # whatever

Oh, and the lines will be unicode objects ... if that gives you a problem, ask another question.
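
For the file format in the question, the whole read-and-parse step could be sketched in Python 3 as follows. In Python 3, `open()` accepts an `encoding` argument and yields already-decoded `str` lines, so no manual `decode` call is needed. The names `parse_ratings` and `read_ratings`, the default `"utf-16"` encoding (which assumes the file starts with a BOM), and the decision to skip header and blank lines are all assumptions for illustration.

```python
def parse_ratings(lines):
    """Build {user_id: {product_id: score}} from 'UserId::ProductId::Score' lines."""
    ratings = {}
    for line in lines:
        fields = line.strip().split("::")
        # Skip blank lines, malformed lines, and the textual header line.
        if len(fields) != 3 or not fields[0].isdecimal():
            continue
        user_id = int(fields[0])
        product_id = int(fields[1])   # raises ValueError on bad data
        score = float(fields[2])
        ratings.setdefault(user_id, {})[product_id] = score
    return ratings


def read_ratings(path, encoding="utf-16"):
    """Open the file with an explicit encoding and parse its lines."""
    with open(path, encoding=encoding) as handle:
        return parse_ratings(handle)


sample = ["UserId::ProductId::Score\n", "70154::308933::3\n", "\n", "70154::12::4.5\n"]
print(parse_ratings(sample))  # {70154: {308933: 3.0, 12: 4.5}}
```

Keeping the parsing separate from the file handling makes it easy to check the parser on a plain list of strings before pointing it at a real file.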

Debugging 101: simply change the line:

words = words.strip().split()

to:

words = words.strip().split()
print words

and see what comes out.

I will mention a couple of things. If you have the literal UserId::... line in the file and you try to process it, int() won't take kindly to being asked to convert that to an integer.

And the ... unusual line:

if n is not 1:

I would probably write it as:

if n != 1:
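
The reason to prefer `!=` is that `is not` compares object identity rather than value. With small integers it often appears to work in CPython because those objects happen to be cached, but that is an implementation detail and not something to rely on. A small Python 3 illustration, with arbitrary variable names:

```python
a = 1000
b = int("1000")   # same value, but built at run time

print(a == b)     # True: the values are equal
print(a is b)     # implementation-dependent; typically False in CPython

n = len("\n")     # length of a line holding only a newline
print(n != 1)     # False: a value comparison, which is what the check intends
```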

If, as you indicate in your comment, you end up seeing:

['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3']

then I'd be checking your input file for binary (non-textual) data. You should never end up with that binary information if you're just reading text and trimming/splitting.
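
To see how such NUL-laden strings can arise from ordinary text, here is a small Python 3 reproduction. It assumes the file is UTF-16LE, which is a guess, and uses Latin-1 as a stand-in for "treat every byte as one character":

```python
raw = "70154::308933::3\n".encode("utf-16-le")

# Decoding with a single-byte codec keeps the NUL bytes as characters.
wrong = raw.decode("latin-1")
print(repr(wrong[:10]))

# int() rejects a string that contains NULs between the digits.
try:
    int(wrong.split(":")[0])
except ValueError as exc:
    print("ValueError:", exc)

# Decoding with the matching codec gives clean fields.
right = raw.decode("utf-16-le")
print([int(word) for word in right.strip().split("::")])
```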

And because you state that the digits seem to have spaces between them, you should do a hex dump of the file to find out what's really in there. It may be a UTF-16 Unicode string, for example.
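
If no hex dump tool is at hand, a few lines of Python 3 can serve as one. This is a minimal sketch: the helper name `hex_dump` and its parameters are made up for illustration, and the sample bytes merely mimic the start of a UTF-16LE file with a BOM.

```python
def hex_dump(data, width=8):
    """Return rows of 'offset  hex bytes  printable text' for the given bytes."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join("%02x" % byte for byte in chunk)
        # Show printable ASCII as-is and everything else (NULs included) as '.'.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append("%04x  %-*s  %s" % (offset, width * 3 - 1, hex_part, text_part))
    return rows


# In practice the bytes would come from open("yourfile.txt", "rb").read(64).
sample = b"\xff\xfe" + "70154::3".encode("utf-16-le")
for row in hex_dump(sample):
    print(row)
```

A `00` after every digit in the hex column is the telltale sign of UTF-16 data.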
