
Split text in text file on the basis of comma and space (python)

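I need to parse the text of a text file into two categories:

  1. University
  2. Location (for example: Lahore, Peshawar, Jamshoro, Faisalabad)

However, the text file contains the following text: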

"Imperial College of Business Studies, Lahore"
"Government College University Faisalabad"
"Imperial College of Business Studies Lahore"
"University of Peshawar, Peshawar"
"University of Sindh, Jamshoro"
"London School of Economics"
"Lahore School of Economics, Lahore"

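I wrote code that separates the location based on the comma. The following code only works for the first line of the file: it prints "Lahore" and then fails with the error "list index out of range".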

file = open(path,'r')
content = file.read().split('\n')

for line in content:
    rep = line.replace('"','')
    loc = rep.split(',')[1]
    print("uni: "+rep)
    print("Loc: "+str(loc))

Please help, I'm stuck on this. Thanks.

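Not every line of your input file contains a comma, which is why the code fails: for a line without one, rep.split(',') returns a single-element list, so index [1] is out of range. You could do something like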

if ',' in line:
    loc = rep.split(',')[1].strip()
else:
    loc = rep.split()[-1].strip()

to handle the lines without a comma differently, or simply reformat the input.
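
The branch above can be wrapped into a small helper to make the behaviour easier to check. This is only a sketch, written for Python 3: the name `split_uni_loc` is hypothetical, it splits on the last comma with `rsplit`, and it assumes that on a line without a comma the last word is the location.

```python
def split_uni_loc(line):
    """Return (university, location) for one line of the input file."""
    rep = line.strip().replace('"', '')
    if ',' in rep:
        # Split on the last comma only, so commas inside the name survive.
        uni, loc = rep.rsplit(',', 1)
    else:
        # No comma: assume (possibly wrongly) that the last word is the location.
        uni, _, loc = rep.rpartition(' ')
    return uni.strip(), loc.strip()


sample = [
    '"Imperial College of Business Studies, Lahore"',
    '"Government College University Faisalabad"',
    '"London School of Economics"',
]
for line in sample:
    print(split_uni_loc(line))
```

The last-word guess is only a heuristic: for "London School of Economics" it reports "Economics" as the location, so such lines still need separate handling.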

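You can split on the comma. The result is always a list, so you can check its size: if it is greater than one, the string contained at least one comma; otherwise (size 1) it contained no comma at all.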

>>> word = "something without a comma"
>>> afterSplit = word.split(',')
>>> afterSplit
['something without a comma']
>>> word2 = "something with, just one comma"
>>> afterSplit2 = word2.split(',')
>>> afterSplit2
['something with', ' just one comma']
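
Applied to the question, the length check might look like the following sketch (Python 3). The function name `location_after_comma` is hypothetical, and returning `None` for lines without a comma is just one possible convention.

```python
def location_after_comma(text):
    """Return the text after the last comma, or None if there is no comma."""
    parts = text.split(',')
    if len(parts) > 1:
        # At least one comma was present; the last piece is the location.
        return parts[-1].strip()
    # A single-element list means the string contained no comma.
    return None


print(location_after_comma("University of Sindh, Jamshoro"))
print(location_after_comma("London School of Economics"))
```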

I hope this will work, although I couldn't get "London" with it. The data probably needs to be in a consistent format.

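with open('places.txt') as f:
    f_data = f.readlines()
# Words that belong to an institution's name rather than a place
stop_words = ['school', 'Economics', 'University', 'College']
places = []
for p in f_data:
    p = p.replace('"', '')
    if ',' in p:
        # Take whatever follows the last comma
        city = p.split(',')[-1].strip()
    else:
        # No comma: fall back to the last word of the line
        city = p.split(' ')[-1].strip()
    if city not in places and city not in stop_words:
        places.append(city)
print(places)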

Output: ['Lahore','Faisalabad','Lahore','Peshawar','Jamshoro']

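It looks as though you can only be certain that a line has a location if it contains a comma. It therefore makes sense to parse the file in two passes. The first pass builds a set containing all known locations. You can seed it with some known examples or problem cases.

The second pass can then also use the comma to find the location, but if there is no comma, the line is split into a set of words. The intersection of these words with the location set should give you the location. If there is no intersection, the line is flagged as "unknown".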

import re

# Seed the set with locations that never appear after a comma in the file
locations = set(["London", "Faisalabad"])

with open(path, 'r') as f_input:
    unknown = 0
    # Pass 1, build a set of locations
    for line in f_input:
        line = line.strip(' ,"\n')
        if ',' in line:
            loc = line.rsplit("," ,1)[1].strip()
            locations.add(loc)

    # Pass 2, try and find location in line
    f_input.seek(0)

    for line in f_input:
        line = line.strip(' "\n')
        if ',' in line:
            uni, loc = line.rsplit("," ,1)
            loc = loc.strip()
        else:
            uni = line
            # Any word of the line that is also a known location
            loc_matches = set(re.findall(r"\b(\w+)\b", line)).intersection(locations)

            if loc_matches:
                loc = list(loc_matches)[0]
            else:
                loc = "<unknown location>"
                unknown += 1

        uni = uni.strip()

        print("uni: " + uni)
        print("Loc: " + loc)

    print("Unknown locations: %d" % unknown)

This gives the following output:

uni: Imperial College of Business Studies
Loc: Lahore
uni: Government College University Faisalabad
Loc: Faisalabad
uni: Imperial College of Business Studies Lahore
Loc: Lahore
uni: University of Peshawar
Loc: Peshawar
uni: University of Sindh
Loc: Jamshoro
uni: London School of Economics
Loc: London
uni: Lahore School of Economics
Loc: Lahore
Unknown locations: 0
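
If the results are needed as data rather than printed, the same two-pass idea can be sketched as a function over a list of lines (Python 3). The names `parse_universities` and `seed_locations` are hypothetical. Unlike the code above, this sketch returns `None` for an unknown location and, when several words match, picks the last matching word in the line so the result is deterministic; both choices are assumptions, not part of the original answer.

```python
import re


def parse_universities(lines, seed_locations=()):
    """Return a list of (university, location) pairs; location may be None."""
    # Drop surrounding whitespace and quotes, and skip empty lines.
    cleaned = [ln.strip().strip('"').strip() for ln in lines]
    cleaned = [ln for ln in cleaned if ln]

    # Pass 1: collect every location that appears after a comma.
    locations = set(seed_locations)
    for ln in cleaned:
        if ',' in ln:
            loc = ln.rsplit(',', 1)[1].strip()
            if loc:
                locations.add(loc)

    # Pass 2: use the comma if present, otherwise look for a known location.
    results = []
    for ln in cleaned:
        if ',' in ln:
            uni, loc = ln.rsplit(',', 1)
            loc = loc.strip()
        else:
            uni = ln
            matches = [w for w in re.findall(r"\w+", ln) if w in locations]
            loc = matches[-1] if matches else None
        results.append((uni.strip(), loc))
    return results


rows = parse_universities(
    ['"University of Peshawar, Peshawar"', '"London School of Economics"'],
    seed_locations=["London"],
)
for uni, loc in rows:
    print(uni, "|", loc)
```

As in the answer above, single-word matching cannot recognise locations made of several words, and a location that never appears after a comma is only found if it is in the seed set.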

