Parsing a huge dictionary file with Python: a simple task I can't get my head around

Question

I just got a giant 1.4-million-line dictionary file for use in other programming projects, and I'm sad to see that Notepad++ is not powerful enough to parse it. The dictionary contains three types of lines:

<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>

and I want to extract every word from it into a list of words without duplicates. Let's start with my code.

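# 'w' creates parsed_dic.txt if it does not exist; 'with' closes both files
with open('dic.txt') as f, open('parsed_dic.txt', 'w') as p:
    for line in f:  # iterate lazily instead of loading 1.4m lines at once
        #<ar><k> lines
        #<kref> lines
        #ending to ";" - lines
        listofwordsfromaline = []  # placeholder: the words parsed from this line
        for word in listofwordsfromaline:
            p.write(word + "\n")  # write() takes a single string argument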

I'm not particularly asking you how to do the whole thing, but anything would be helpful. A link to a tutorial or a parsing method for one type of line would be highly appreciated.

Answer 1

First, decide what defines a word for you, then write a regular expression that captures those matches. For example, '\\b' matches a word boundary, which is the zero-width position between a word character and a non-word character. https://docs.python.org/2/howto/regex.html
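
As a minimal sketch of that idea: the snippet below assumes that a "word" is a hyphen followed by letters, as in the sample lines. That definition and the name WORD_RE are assumptions for illustration, not something the question states. The second call shows why the definition matters: a plain word-boundary pattern also picks up the grammar abbreviations.

```python
# -*- coding: utf-8 -*-
import re

# Assumed definition: a word is "-" followed by one or more word characters.
# re.UNICODE lets \w match letters such as "ä" (it is the default in Python 3).
WORD_RE = re.compile(r"-\w+", re.UNICODE)

line = u"yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista"
words = WORD_RE.findall(line)
print(words)

# A bare word-boundary pattern matches every run of word characters,
# including abbreviations such as "mon" and "part".
everything = re.findall(r"\b\w+\b", u"mon.part. -aaltoisia", re.UNICODE)
print(everything)
```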

If words are defined differently in each type of line, use if statements to identify the line type first, then apply the corresponding regular expression to extract the word.

Match groups in Python
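
The line-type check plus match groups could be sketched roughly as follows. The three patterns and the function name words_from_line are hypothetical, based only on the three sample lines in the question, so a real dictionary file may need different rules.

```python
import re

# Hypothetical patterns, one per line type seen in the sample data.
K_RE = re.compile(r"<k>(.*?)</k>")
KREF_RE = re.compile(r"<kref>(.*?)</kref>")
FORM_RE = re.compile(r"-\w+", re.UNICODE)

def words_from_line(line):
    """Return the words found in one dictionary line (a sketch)."""
    m = K_RE.search(line)
    if m:
        return [m.group(1)]           # group(1) is the text between the tags
    m = KREF_RE.search(line)
    if m:
        return [m.group(1)]
    if ";" in line:
        return FORM_RE.findall(line)  # the ";"-separated list of forms
    return []

print(words_from_line("<ar><k>-aaltoiseen</k>"))
print(words_from_line("yks.ill..ks. <kref>-aaltoinen</kref></ar>"))
print(words_from_line("yks.nom. -aaltoinen; yks.gen. -aaltoisen"))
```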

Answer 2

In the first two cases every word is enclosed in a specific tag. Looking closely, every word has a ">-" string preceding it and a "</" string following it:

# First and second cases
start = line.find(">-") + 2    # first character after ">-"
end = line.find("</", start)   # position of the closing tag's "<"
required_word = line[start:end]
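
Applied to the two tagged sample lines from the question, this slicing could be wrapped up as in the sketch below. The helper name word_between_tags is hypothetical. It takes the end index at the "<" of the closing tag so that no tag characters end up in the word, and it returns None when a marker is missing, because find returns -1 in that case.

```python
def word_between_tags(line):
    """Return the text between '>-' and the next '</', or None if absent."""
    marker = line.find(">-")
    if marker == -1:
        return None
    start = marker + 2              # skip past ">-"
    end = line.find("</", start)    # index of the closing tag's "<"
    if end == -1:
        return None
    return line[start:end]

print(word_between_tags("<ar><k>-aaltoiseen</k>"))
print(word_between_tags("yks.ill..ks. <kref>-aaltoinen</kref></ar>"))
```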

In the last case you can use the split method:

    word_list = line.split(";")
    ans = []
    for word in word_list:
        start = word.find("-")
        if start != -1:  # skip pieces that contain no "-"
            ans.append(word[start:].strip())
    ans = set(ans)  # a set drops duplicates
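
To get from one line to the single duplicate-free list the question asks for, one option is to collect the forms from every line into one set. The sketch below runs on in-memory sample lines. The function names collect_words and write_words are hypothetical, and splitting each piece on whitespace is an added assumption to handle pieces that hold two forms, such as "mon.gen. -aaltoisten -aaltoisien".

```python
def collect_words(lines):
    """Gather hyphen-prefixed forms from ';'-separated lines into one set."""
    unique = set()
    for line in lines:
        for piece in line.split(";"):
            start = piece.find("-")
            if start == -1:
                continue  # no form in this piece
            # A piece may hold several forms, so split on whitespace
            # and keep only the tokens that begin with "-".
            for token in piece[start:].split():
                if token.startswith("-"):
                    unique.add(token)
    return unique

def write_words(words, path):
    """Write one word per line to the file at path."""
    with open(path, "w") as out:
        for word in words:
            out.write(word + "\n")

sample = [
    "yks.nom. -aaltoinen; yks.gen. -aaltoisen",
    "mon.gen. -aaltoisten -aaltoisien; yks.nom. -aaltoinen",
]
words = sorted(collect_words(sample))
print(words)
```

Calling write_words(words, 'parsed_dic.txt') would then produce the output file named in the question.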

2 answers

solution1
0 2014-12-14 18:33:39

solution2
0 ACCEPTED 2014-12-14 18:35:38
