
dictionaries feature extraction Python

I'm doing a text categorization experiment. For the feature extraction phase I'm trying to create a feature dictionary per document. For now, I have two features: the type-token ratio and the relative frequency of function-word n-grams. When I print my instances, only the type-token ratio feature is in the dictionary. This seems to be caused by a malfunctioning get_pos(), which returns empty lists. This is my code:

instances = []
labels = []
directory = "\\Users\OneDrive\Data"
for dname, dirs, files in os.walk(directory):
    for fname in files:
        fpath = os.path.join(dname, fname)
        with open(fpath,'r') as f:
             text = csv.reader(f, delimiter='\t')
             vector = {}

             #TTR
             lemmas = get_lemmas(text)
             unique_lem = set(lemmas)
             TTR = str(len(unique_lem) / len(lemmas))
             name = fname[:5]
             vector['TTR'+ '+' + name] = TTR


             #function word ngrams
             pos = get_pos(text)
             fw = []
             regex = re.compile(
               r'(LID)|(VNW)|(ADJ)|(TW)|(VZ)|(VG)|(BW)')
             for tag in pos:
                 if regex.search(tag):
                    fw.append(tag)
             for n in [1,2,3]:  
                 grams = ngrams(fw, n)
                 fdist = FreqDist(grams)
                 total = sum(c for g,c in fdist.items())
                 for gram, count in fdist.items():
                     vector['fw'+str(n)+'+'+' '+ name.join(gram)] = count/total

                 instances.append(vector)
                 labels.append(fname[:1])
print(instances)

And this is an example of a Dutch input file: [image: example of Dutch input]

This is the code from the get_pos function, which I call from another script:

def get_pos(text):
    row4 = []
    pos = []
    for row in text:
        if not row:
            continue
        else:
            row4.append(row[4])
    pos = [x.split('(')[0] for x in row4]  # remove what's between the brackets
    return pos

Can you help me find what's wrong with the get_pos function?

Here text is a csv.reader, which is a one-pass iterator over the open file. When you call get_lemmas(text), it consumes the entire contents of the file, so get_pos(text) has nothing left to iterate over and returns an empty list. If you want to go through a file's contents more than once, either call f.seek(0) between the calls, or read the rows into a list at the start and iterate over that list whenever needed.
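A minimal sketch of the problem and both fixes follows. It uses an in-memory file and stand-in versions of the two helpers. The sample rows and the lemma column index (3) are assumptions for illustration only, since the real input file and get_lemmas are not shown in the question. Only the use of column 4 for the POS tag comes from the posted get_pos.

```python
import csv
import io

# Hypothetical stand-ins for the question's helpers. Column 4 for the POS tag
# comes from the posted get_pos; column 3 for the lemma is an assumption.
def get_lemmas(rows):
    return [row[3] for row in rows if row]

def get_pos(rows):
    return [row[4].split('(')[0] for row in rows if row]

# Made-up tab-separated sample, standing in for one input file.
sample = "1\tDe\tx\tde\tLID(bep)\n2\tkat\tx\tkat\tN(soort)\n"

# The problem: a csv.reader can only be iterated once.
f = io.StringIO(sample)
reader = csv.reader(f, delimiter='\t')
lemmas = get_lemmas(reader)
pos_empty = get_pos(reader)  # reader is already exhausted

# Fix 1: read all rows into a list once and reuse the list.
f = io.StringIO(sample)
rows = list(csv.reader(f, delimiter='\t'))
lemmas = get_lemmas(rows)
pos = get_pos(rows)

# Fix 2: rewind the file and read it again (here with a fresh reader).
f = io.StringIO(sample)
lemmas2 = get_lemmas(csv.reader(f, delimiter='\t'))
f.seek(0)
pos2 = get_pos(csv.reader(f, delimiter='\t'))

print(pos_empty, pos, pos2)
```

Reading the rows into a list is usually the simpler choice when the files fit in memory, because every later feature function can reuse the same list without worrying about the file position.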

