
Split text in a text file on the basis of comma and space (Python)

I need to parse the text of a text file into two categories:

  1. University
  2. Location (for example: Lahore, Peshawar, Jamshoro, Faisalabad)

but the text file contains the following text:

"Imperial College of Business Studies, Lahore"
"Government College University Faisalabad"
"Imperial College of Business Studies Lahore"
"University of Peshawar, Peshawar"
"University of Sindh, Jamshoro"
"London School of Economics"
"Lahore School of Economics, Lahore"

I have written code that separates the location on the basis of the comma. The code below only works for the first line of the file: it prints 'Lahore' and then fails with the error 'list index out of range'.

file = open(path,'r')
content = file.read().split('\n')

for line in content:
    rep = line.replace('"','')
    loc = rep.split(',')[1]
    print("uni: " + rep)
    print("Loc: " + str(loc))

Please help, I'm stuck on this. Thanks.

Your input file does not have a comma on every line, which causes the code to fail. You could do something like

if ',' in line:
    loc = rep.split(',')[1].strip()
else:
    loc = rep.split()[-1].strip()

to handle the lines without a comma differently, or simply reformat the input.
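Put together as a self-contained function, the idea might look like the sketch below. The name `split_university_location` is hypothetical, and the no-comma fallback simply assumes that the last word is the city. That assumption is wrong for a line such as "London School of Economics", where it yields "Economics".

```python
def split_university_location(line):
    """Return (university, location) for one line of the input file.

    Assumption: when there is no comma, the last word is taken as the
    location. That guess is wrong for lines ending in a non-place word.
    """
    rep = line.replace('"', '').strip()
    if ',' in rep:
        # Split on the last comma only, so commas inside the name survive.
        uni, loc = rep.rsplit(',', 1)
    else:
        # No comma: split off the last whitespace-separated word.
        parts = rep.rsplit(None, 1)
        if len(parts) < 2:
            return rep, ''  # empty or single-word line, no location
        uni, loc = parts
    return uni.strip(), loc.strip()


for sample in ['"University of Sindh, Jamshoro"',
               '"Government College University Faisalabad"']:
    uni, loc = split_university_location(sample)
    print("uni: " + uni)
    print("Loc: " + loc)
```

Returning both parts from one function keeps the comma test and the split on the same cleaned string, so neither branch can index past the end of the list.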

You can split on the comma; the result is always a list, so you can check its length. If the length is greater than one, the line contained at least one comma; otherwise (if the length is one) it contained no comma.

>>> word = "something without a comma"
>>> afterSplit = word.split(',')
>>> afterSplit
['something without a comma']
>>> word2 = "something with, just one comma"
>>> afterSplit2 = word2.split(',')
>>> afterSplit2
['something with', ' just one comma']
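Applied to the lines in the question, the length check could be sketched as follows. `location_after_comma` is a made-up helper name, and returning `None` for lines without a comma is just one possible convention.

```python
def location_after_comma(line):
    """Return the text after the last comma, or None if there is no comma."""
    parts = line.replace('"', '').split(',')
    if len(parts) > 1:
        # At least one comma was present; the last piece is the location.
        return parts[-1].strip()
    # split(',') returned a single-element list: no comma on this line.
    return None


print(location_after_comma('"University of Peshawar, Peshawar"'))
print(location_after_comma('"London School of Economics"'))
```

The caller can then decide what to do with the `None` case instead of hitting an `IndexError`.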

I hope this will work, though I couldn't get 'London'. Maybe the data should be in a consistent format.

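# Read every line of the input file
f_data = open('places.txt').readlines()
# Trailing words that are not place names
stop_words = ['school', 'Economics', 'University', 'College']
places = []
for p in f_data:
    p = p.replace('"', '')
    if ',' in p:
        # The city is whatever follows the last comma
        city = p.split(',')[-1].strip()
    else:
        # No comma: assume the last word is the city
        city = p.split(' ')[-1].strip()
    # Keep each city once and skip known non-place words
    if city not in places and city not in stop_words:
        places.append(city)
print(places)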

Output: [' Lahore', ' Faisalabad', 'Lahore', 'Peshawar', ' Jamshoro']

It would appear that you can only be certain that a line has a location if there is a comma. So it would make sense to parse the file in two passes. The first pass can build a set holding all known locations. You can seed this set with some known examples or problem cases.

Pass two can then use the comma in the same way to extract the location, but if there is no comma, the line is split into a set of words. The intersection of these words with the location set should give you the location. If there is no intersection, the line is flagged as "unknown".

import re

# Seed the set with known locations
locations = set(["London", "Faisalabad"])

with open(path, 'r') as f_input:
    unknown = 0
    # Pass 1, build a set of locations
    for line in f_input:
        line = line.strip(' ,"\n')
        if ',' in line:
            loc = line.rsplit("," ,1)[1].strip()
            locations.add(loc)

    # Pass 2, try and find location in line
    f_input.seek(0)

    for line in f_input:
        line = line.strip(' "\n')
        if ',' in line:
            uni, loc = line.rsplit("," ,1)
            loc = loc.strip()
        else:
            uni = line
            loc_matches = set(re.findall(r"\b(\w+)\b", line)).intersection(locations)

            if loc_matches:
                loc = list(loc_matches)[0]
            else:
                loc = "<unknown location>"
                unknown += 1

        uni = uni.strip()

        print("uni:", uni)
        print("Loc:", loc)

    print("Unknown locations:", unknown)

Output would be:

uni: Imperial College of Business Studies
Loc: Lahore
uni: Government College University Faisalabad
Loc: Faisalabad
uni: Imperial College of Business Studies Lahore
Loc: Lahore
uni: University of Peshawar
Loc: Peshawar
uni: University of Sindh
Loc: Jamshoro
uni: London School of Economics
Loc: London
uni: Lahore School of Economics
Loc: Lahore
Unknown locations: 0
