简体   繁体   中英

Parsing a huge dictionary file with python. Simple task I cant get my head around

I just got a giant 1.4m line dictionary for other programming uses, and i'm sad to see notepad++ is not powerful enough to do the parsing job to the problem. The dictionary contains three types of lines:

<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>

and I want to extract every word of it to a list of words without duplicates. Lets start by my code.

f = open('dic.txt')
p = open('parsed_dic.txt', 'r+')
lines = f.readlines()
for line in lines:
    #<ar><k> lines
    #<kref> lines
    #ending to ";" - lines
    for word in listofwordsfromaline:
        p.write(word,"\n")
f.close()
p.close()

Im not particulary asking you how to do this whole thing, but anything would be helpful. A link to a tutorial or one type of line parsing method would be highly appreciated.

First find what defines a word for you. Make a regular expression to capture those matches. For example - word break '\\b' will match word boundaries (non word characters). https://docs.python.org/2/howto/regex.html

If the word definition in each type of line is different - then if statements to match the line first, then corresponding regular expression match for the word, and so on.

Match groups in Python

For the first two cases you can see that any word starts and ends with a specific tag , if we see it closely , then we can say that every word must have a ">-" string preceding it and a "

# First and second cases
start = line.find(">-")+2
end = line.find("</")+1
required_word = line[start:end]

In the last case you can use the split method:

    word_lst = line.split(";")
    ans = []
    for word in word_list:
      start = word.find("-")
      ans.append(word[start:])
    ans = set(ans)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM