Python split() not working as expected for first line in file
I have a large text file of data-mined opinions, each classified as positive, negative, neutral, or mixed. Every line begins with "+ ", "- ", "= ", or "* ", which correspond to these classifiers. Additionally, lines that begin with "!! " represent a comment to ignore.
Below is a simple Python script that is just supposed to count each of the classifiers and ignore the comment lines:
classes = [0, 0, 0, 0] # "+", "-", "=", "*"
f = open("All_Classified.txt")
for i, line in enumerate(f):
    line = line.strip()
    classifier = line.split(" ")[0]
    if classifier == "+": classes[0] += 1
    elif classifier == "-": classes[1] += 1
    elif classifier == "=": classes[2] += 1
    elif classifier == "*": classes[3] += 1
    elif classifier == "!!": continue
    else: print "Line "+str(i+1)+": \""+line+"\""
f.close()
print classes
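For what it's worth, the parsing itself behaves as expected on a clean line. A minimal sanity check (Python 3 syntax, line shortened from the sample below):

```python
# Quick sanity check (not from the original post): on a clean line,
# strip() plus split(" ")[0] does extract the leading classifier token.
line = "= 1001//CD TITLETITLE//NNP How//WRB many//JJ"
classifier = line.strip().split(" ")[0]
assert classifier == "="
```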
Here is a sample of the first 5 lines of "All_Classified.txt":
!! GROUP 1, 1001 - 1512
= 1001//CD TITLETITLE//NNP How//WRB many//JJ conditioners/conditioner/NNS do//VBP you//PRP have//VBP ?//.
= 1002//CD I//PRP have//VBP two//CD different//JJ kinds/kind/NNS ,//, Garnier//NNP Fructis//NNP Triple//NNP Nutrition//NNP conditioner//NN ,//, and//CC Suave//NNP coconut//NN .//.
= 1003//CD But//CC I//PRP think//VBP I//PRP have//VBP about//IN 8//CD bottles/bottle/NNS of//IN the//DT Suave//NNP coconut//NN My//PRP$ mom//NN gave/give/VBD me//PRP a//DT bunch//NN for//IN Christmas//NNP because//IN she//PRP was/be/VBD getting/get/VBG tired/tire/VBN of//IN me//PRP saying/say/VBG I//PRP was/be/VBD out//IN
= 1004//CD TITLETITLE//NNP Need//VB a//DT gel//NN that//IN works/work/NNS ,//, 8//CD mo//NN ,//, post//NN ,//, ready//JJ to//TO relax//VB edges/edge/NNS ,//, HELP//NNP ,//,
For whatever reason, my output is triggering the else statement during the first iteration, as if it does not recognize the "!!"; I am not sure why. I am getting this as output:
Line 1: "!! GROUP 1, 1001 - 1512"
[2986, 1034, 16278, 535]
Additionally, if I delete the first line from "All_Classified.txt", then it does not recognize the "=" of what would then be the first line. Not sure what has to be done for the first line to be recognized as expected.
Edit (again): As Peter asked, here is the output if I replace
else: print "Line "+str(i+1)+": \""+line+"\""
with
else: print "Classifier "+classifier+" Line "+str(i+1)+": \""+line+"\""
:
Classifier !! Line 1: "!! GROUP 1, 1001 - 1512"
[2986, 1034, 16278, 535]
Edit: First section of xxd All_Classified.txt :
0000000: efbb bf21 2120 4752 4f55 5020 312c 2031 ...!! GROUP 1, 1
0000010: 3030 3120 2d20 3135 3132 0d0a 3d20 3130 001 - 1512..= 10
0000020: 3031 2f2f 4344 2054 4954 4c45 5449 544c 01//CD TITLETITL
0000030: 452f 2f4e 4e50 2048 6f77 2f2f 5752 4220 E//NNP How//WRB
I suspect your input file isn't what it seems. For example, classifier could contain some control characters that are not shown when you print it (but which affect the comparison):
>>> classifier = '!\0!'
>>> print classifier
!!
>>> classifier == '!!'
False
edit There you have it:
0000000: efbb bf21 2120
^^^^ ^^
It's the UTF-8 BOM, which becomes part of classifier .
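To make the failure concrete, here is a minimal sketch (Python 3 syntax, not part of the original answer) showing that the decoded BOM survives strip() and breaks the comparison:

```python
# U+FEFF (the UTF-8 BOM, once decoded) is not whitespace, so strip()
# leaves it attached to the first token and the "!!" comparison fails.
line = "\ufeff!! GROUP 1, 1001 - 1512"
classifier = line.strip().split(" ")[0]
assert classifier != "!!"          # the invisible BOM is still attached
assert classifier == "\ufeff!!"
```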
Try opening the file using codecs.open() with "utf-8-sig" as the encoding (see, for example, https://stackoverflow.com/a/13156715/367273 ).
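A minimal end-to-end sketch of that fix (file path and contents invented here to mirror the question's data): the "utf-8-sig" codec consumes a leading UTF-8 BOM if one is present, so the first line's classifier parses cleanly.

```python
import codecs
import os
import tempfile

# Write a small file that starts with the UTF-8 BOM (ef bb bf),
# mimicking the xxd dump in the question.
path = os.path.join(tempfile.mkdtemp(), "All_Classified.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf!! GROUP 1, 1001 - 1512\r\n"
            b"= 1001//CD TITLETITLE//NNP\r\n")

# "utf-8-sig" strips the BOM during decoding, so the first token
# on line 1 is now plain "!!".
with codecs.open(path, encoding="utf-8-sig") as f:
    classifiers = [line.strip().split(" ")[0] for line in f]
```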