
Get rid of duplicates in a list of multi-word strings

I have a parallel corpus in the format

one sentence in English: one sentence in Italian

And I have a list of bilingual terms extracted from the parallel corpus, in this format

terms_list =  expression liberty, human rights > libertà di espression, diritti umani

What I want is to first create bigrams (translation pairs) for every line in the terms list, and then calculate statistics for every pair. To create the pairs I tried this

bigrams = []
for line in terms_list.splitlines():
    trans = line.split(' > ')
    # trans[0] holds the English terms, trans[1] the Italian ones
    for en in trans[0].split(', '):
        for it in trans[1].split(', '):
            bigrams.append((en, it))

This gives the following output

('expression liberty', 'libertà di espression')
('expression liberty', 'diritti umani')
('human rights', 'libertà di espression')
('human rights', 'diritti umani') 

The next step is to calculate the frequency of each of the above pairs. To do this I have to separate the source-language term from the target-language term in every pair, i.e., for the pair

('expression liberty', 'libertà di espression')

I have to separate 'expression liberty' from 'libertà di espression'

To do this I used this code

for i in bigrams:
    one = str([ii for ii in str(i).split("', '")[0][2: ].split('\n')])[2: -2]
    two = str([iii for iii in str(i).split("', '")[1][: -2].split('\n')])[2: -2]
    print (one)

This will give

expression liberty
expression liberty
human rights
human rights

For every bilingual pair I need its statistics in the parallel corpus. For example, for ('expression liberty', 'libertà di espression') I want to know, over all lines of the parallel corpus, how many times 'expression liberty' and 'libertà di espression' co-occur, how many times only 'expression liberty' occurs, how many times only 'libertà di espression' occurs, and how many times neither of them occurs.

This is my attempt

en = set([x[0] for x in bigrams])
it = set([x[1] for x in bigrams])
a = 0
b = 0
c = 0
d = 0

for one in en:
    for two in it:
        for line in parallel_corpus.splitlines():
            if one in line and two in line:
                a += 1
            elif one in line and two not in line:
                b += 1
            elif two in line and one not in line:
                c += 1
            else:
                d += 1

You really didn't have to go through all that complex code with conversions to strings and lists. Each pair is already a tuple, so you can index it directly; Python gives you much more power than that.

english_words = set([x[0] for x in bigrams])
italian_words = set([x[1] for x in bigrams])

english_words is now an unordered set of the unique English terms extracted from bigrams (unordered because a set does not guarantee that its elements come back in the order they were inserted).

Printing each element of english_words will produce (in some order):

expression liberty
human rights
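As a minimal sketch of the same idea, the pairs can also be unpacked directly in a loop, with no string conversion at all. The `bigrams` list below is copied from the output in the question; the names `english_terms` and `italian_terms` are only illustrative, and `sorted()` is used just to get a repeatable order out of the sets.

```python
# Sample pairs copied from the output shown in the question.
bigrams = [
    ('expression liberty', 'libertà di espression'),
    ('expression liberty', 'diritti umani'),
    ('human rights', 'libertà di espression'),
    ('human rights', 'diritti umani'),
]

# Each pair is a 2-tuple, so it can be unpacked in the loop header.
for english, italian in bigrams:
    print(english, '|', italian)

# Set comprehensions drop the duplicates; sorted() fixes the order.
english_terms = sorted({english for english, _ in bigrams})
italian_terms = sorted({italian for _, italian in bigrams})
print(english_terms)
print(italian_terms)
```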

Edit: Second part of your question

The code you wrote to extract the frequencies should work, but it is unnecessarily complicated. You already built the bigrams from terms extracted from the parallel corpus, so the data is already in a friendly format: a list of tuples.

As a general practice for count statistics, you create a dictionary (hash map) whose key is the thing you want to count and whose value is the count itself. Then you iterate over the items: if an item is not in the dictionary yet, you add it with a count of 1; if it is already there, you just increment its counter. It goes like this:

en_terms_dict = {}
it_terms_dict = {}
bigrams_dict = {}
# parallel_corpus is a single string, so iterate over its lines
for line in parallel_corpus.splitlines():
    # assumes every line contains the ' : ' separator exactly once
    en, it = line.split(' : ')
    if en in en_terms_dict:
        en_terms_dict[en] += 1
    else:
        en_terms_dict[en] = 1
    if it in it_terms_dict:
        it_terms_dict[it] += 1
    else:
        it_terms_dict[it] = 1
    if (en, it) in bigrams_dict:
        bigrams_dict[(en, it)] += 1
    else:
        bigrams_dict[(en, it)] = 1
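The same counting pattern can be written more compactly with `collections.Counter` from the standard library, which treats a missing key as 0. This is only a sketch: the two-line `parallel_corpus` below is made-up sample data, and the ' : ' separator is assumed to appear in every line.

```python
from collections import Counter

# Hypothetical sample corpus in the "English : Italian" format.
parallel_corpus = (
    "human rights matter : i diritti umani contano\n"
    "human rights matter : i diritti umani contano"
)

en_counts = Counter()
it_counts = Counter()
pair_counts = Counter()

for line in parallel_corpus.splitlines():
    # Split only on the first separator, in case it reappears later.
    en, it = line.split(' : ', 1)
    en_counts[en] += 1          # missing keys start at 0
    it_counts[it] += 1
    pair_counts[(en, it)] += 1

print(en_counts.most_common())
```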

Now, by iterating over each dictionary, you know the frequency of each term. And of course you can deduce the number of times a term does not appear by subtraction (I am not sure why you are counting the cases where a term does not appear in the first place).

# .items() yields (key, value) pairs; iterating over the dict alone yields only keys
for k, v in en_terms_dict.items():
    print("the term %s appeared %d times" % (k, v))
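The subtraction idea can be sketched for the four counts asked about in the question (both terms, only the English one, only the Italian one, neither). The function name `contingency` and the three corpus lines are hypothetical, and the sketch assumes that a plain substring test on the whole line is an acceptable way to detect a term, as in the question's own attempt.

```python
def contingency(pair, lines):
    """Return (both, en_only, it_only, neither) line counts for one pair."""
    en_term, it_term = pair
    en_hits = sum(1 for line in lines if en_term in line)
    it_hits = sum(1 for line in lines if it_term in line)
    both = sum(1 for line in lines if en_term in line and it_term in line)
    # The remaining cells follow by subtraction.
    en_only = en_hits - both
    it_only = it_hits - both
    neither = len(lines) - both - en_only - it_only
    return both, en_only, it_only, neither


# Hypothetical corpus lines in the "English : Italian" format.
corpus_lines = [
    "human rights are universal : i diritti umani sono universali",
    "human rights come first : la libertà viene prima",
    "nothing relevant here : niente di rilevante qui",
]

table = contingency(('human rights', 'diritti umani'), corpus_lines)
print(table)  # (1, 1, 0, 1)
```

Computing the table once per pair keeps the counts of different pairs separate, instead of accumulating them into one shared set of counters.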
