How to count unique words from a text file after a specific string in every line?

Python-noob here:

I have a text file that looks like this:

{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'} 
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'} 
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'} 
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'} 
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'} 
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'} 
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'} 
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'} 

etc.

I would like to count the unique characters' names and the number of occurrences of each. That is: ignore everything in every line up to the string 'Character': and consider only the character's name.

So far I have this code, after trying many other approaches (including regex), but without the desired results (it prints and counts everything):

import re
from collections import Counter
import tkFileDialog

filename = tkFileDialog.askopenfilename()

f = open(filename, "r")

lines = f.readlines()

f.close()


cnt = Counter()

for line in lines:
    cnt[line.split("'Character':", 2)] +=1

print cnt
print sum(cnt.values())

An optimal output would be like so:

Jonathan: 3
Joyce: 2
Will: 1
Nancy: 1
Steve: 1

Any kind of help or hints would be appreciated!

EDIT: The text file above was generated from an .xmi file whose information is not easily readable. As I mentioned in a comment on one of the answers below, this was my first attempt at representing the desired combined information visually. I am not sure whether there is a better way than a text file to represent such data so that I can work with it. Create a new .xmi file for that, maybe?

So, as requested, here's the code that generated the text file from the .xmi file:

# coding: utf-8

# In[ ]:

import xml.etree.cElementTree as ET
from xml.etree.ElementTree import (Element, ElementTree, SubElement, Comment, tostring)

ET.register_namespace("pos","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos.ecore")
ET.register_namespace("tcas","http:///uima/tcas.ecore")
ET.register_namespace("xmi","http://www.omg.org/XMI")
ET.register_namespace("cas","http:///uima/cas.ecore")
ET.register_namespace("tweet","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/pos/tweet.ecore")
ET.register_namespace("morph","http:///de/tudarmstadt/ukp/dkpro/core/api/lexmorph/type/morph.ecore")
ET.register_namespace("dependency","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/dependency.ecore")
ET.register_namespace("type5","http:///de/tudarmstadt/ukp/dkpro/core/api/semantics/type.ecore")
ET.register_namespace("type6","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type.ecore")
ET.register_namespace("type2","http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore")
ET.register_namespace("type3","http:///de/tudarmstadt/ukp/dkpro/core/api/ner/type.ecore")
ET.register_namespace("type4","http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore")
ET.register_namespace("type","http:///de/tudarmstadt/ukp/dkpro/core/api/coref/type.ecore")
ET.register_namespace("constituent","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/constituent.ecore")
ET.register_namespace("chunk","http:///de/tudarmstadt/ukp/dkpro/core/api/syntax/type/chunk.ecore")
ET.register_namespace("custom","http:///webanno/custom.ecore")

def sofa(annotation):
    f = open(annotation)
    tree = ET.ElementTree(file=f)
    root = tree.getroot()

    node = root.find("{http:///uima/cas.ecore}Sofa") # we remove cas:View
    return node.attrib['sofaString']

path ="valhalla.xmi"
with open(path, 'r', encoding="utf-8") as filename:
    tree = ET.ElementTree(file=filename)
    root = tree.getroot()

ns = {'emospan': 'http:///webanno/custom.ecore', 
      'id':'http://www.omg.org/XMI',
      'relspan': 'http:///webanno/custom.ecore',
      'sentence': 'http:///de/tudarmstadt/ukp/dkpro/core/api/segmentation/type.ecore',
      'annotator': "http:///de/tudarmstadt/ukp/dkpro/core/api/metadata/type.ecore"}
my_id = '{http://www.omg.org/XMI}id'


top = Element('corpus', encoding="utf-8") 
text = sofa(path).replace("\n"," ")

def stimcount():
    with open('results.txt', 'w') as f:
        for rel_node in root.findall("emospan:CharacterRelation",ns):
            if rel_node.attrib['Relation']=="Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion",ns):
                    if span_node.attrib[my_id]==source:

                        print(span_node.attrib['Emotion'])

                    if span_node.attrib[my_id]==target:
                        print(span_node.attrib)
                        print(span_node.attrib, file=f)

Here's a Regex solution:

file_stuff = """{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'}
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'}"""

import re
from collections import Counter

r = re.compile(r"(?<='Character': ')\w+(?=')")
# EDIT: use r"(?<='Character': ')(.+)(?=')" to match names containing
# other characters, as pointed out in the comments.
print(Counter(r.findall(file_stuff)))
# Counter({'Jonathan': 3, 'Joyce': 2, 'Will': 1, 'Nancy': 1, 'Steve': 1})
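If the data lives in a file rather than a string, the same pattern can be applied line by line. A small sketch (StringIO stands in for the opened file, since the actual filename is not fixed here):

```python
import re
from io import StringIO
from collections import Counter

pattern = re.compile(r"(?<='Character': ')\w+(?=')")

# StringIO simulates an open file object; iterate it line by line
fake_file = StringIO(
    "{'Character': 'Jonathan'}\n"
    "{'Character': 'Jonathan'}\n"
    "{'Character': 'Joyce'}\n"
)

cnt = Counter()
for line in fake_file:
    cnt.update(pattern.findall(line))

print(cnt)
# Counter({'Jonathan': 2, 'Joyce': 1})
```

With a real file, replace the StringIO object with `open(filename)` inside a `with` block.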

Using the ast and collections modules:

import ast
from collections import defaultdict

d = defaultdict(int)
with open(filename) as infile:
    for line in infile:
        val = ast.literal_eval(line)
        d[val["Character"]] += 1
print(d)

Output:

defaultdict(<type 'int'>, {'Will': 1, 'Steve': 1, 'Jonathan': 3, 'Nancy': 1, 'Joyce': 2})
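To print the result in exactly the format shown in the question, the counts can be sorted in descending order; a small sketch using Counter, which has this sorting built in via most_common():

```python
from collections import Counter

d = Counter({'Will': 1, 'Steve': 1, 'Jonathan': 3, 'Nancy': 1, 'Joyce': 2})

# most_common() yields (name, count) pairs sorted by descending count
for name, count in d.most_common():
    print('{}: {}'.format(name, count))
# Jonathan: 3
# Joyce: 2
# ...
```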

Your original text file is unfortunate, as it contains string representations of Python dicts written out as text, one per line.

This is a poor way of generating a data file. You should change the code that generates this file to emit a standard format such as CSV or JSON instead of naively writing string representations to a text file. With CSV or JSON you get libraries that are already written and tested to parse the contents and extract each element easily.

If you still want to parse the file as-is, you can use ast.literal_eval to evaluate each line as a Python literal:

import ast
import collections

with open(filename) as infile:
    print(collections.Counter(ast.literal_eval(line)['Character'] for line in infile))

EDIT: Now that you added an example of the file generation, I can suggest you use another format, like json:

import json

def stimcount():
    results = []
    for rel_node in root.findall("emospan:CharacterRelation",ns):
        if rel_node.attrib['Relation']=="Stimulus":
            source = rel_node.attrib['Governor']
            target = rel_node.attrib['Dependent']
            for span_node in root.findall("emospan:CharacterEmotion",ns):
                if span_node.attrib[my_id]==source:

                    print(span_node.attrib['Emotion'])

                if span_node.attrib[my_id]==target:
                    print(span_node.attrib)
                    results.append(span_node.attrib)

    with open('results.txt', 'w') as f:
        json.dump(results, f)

Then your code that reads the data could be as simple as:

with open('results.txt') as f:
    results = json.load(f)
r = collections.Counter(d['Character'] for d in results)
for n, (ch, number) in enumerate(r.items()): 
    print('{} - {}, {}'.format(n, ch, number))

Another option is to use csv format. It allows you to specify a list of interesting columns and ignore the rest:

import csv

def stimcount():
    with open('results.txt', 'w') as f:
        cf = csv.DictWriter(f, ['begin', 'end', 'Character'], extrasaction='ignore')
        cf.writeheader()
        for rel_node in root.findall("emospan:CharacterRelation",ns):
            if rel_node.attrib['Relation']=="Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion",ns):
                    if span_node.attrib[my_id]==source:

                        print(span_node.attrib['Emotion'])

                    if span_node.attrib[my_id]==target:
                        print(span_node.attrib)
                        cf.writerow(span_node.attrib)

Then to read it easily:

with open('results.txt') as f:
    cf = csv.DictReader(f)
    r = collections.Counter(d['Character'] for d in cf)
    for n, (ch, number) in enumerate(r.items()): 
        print('{} - {}, {}'.format(n, ch, number))

If you want, you can have a pandas solution, too...:

txt = """{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}
{'{http://www.omg.org/XMI}id': '18918', 'sofa': '12', 'begin': '81', 'end': '95', 'Character': 'Will'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19050', 'sofa': '12', 'begin': '239', 'end': '242', 'Character': 'Nancy'}
{'{http://www.omg.org/XMI}id': '19111', 'sofa': '12', 'begin': '845', 'end': '850', 'Character': 'Steve'}"""

from io import StringIO

import pandas as pd

# replace the StringIO stuff with your file path
df = pd.read_table(StringIO(txt), sep="'Character': '", header=None, usecols=[1])
            1
0  Jonathan'}
1  Jonathan'}
2     Joyce'}
3     Joyce'}
4      Will'}
5  Jonathan'}
6     Nancy'}
7     Steve'}

df = df[1].str.split('\'', expand=True)
          0  1
0  Jonathan  }
1  Jonathan  }
2     Joyce  }
3     Joyce  }
4      Will  }
5  Jonathan  }
6     Nancy  }
7     Steve  }

df.groupby(0).count()
          1
0          
Jonathan  3
Joyce     2
Nancy     1
Steve     1
Will      1

The idea is to read the file as two columns separated by 'Character': ' and import only the second ( usecols ).
Then split again at ' .
The rest is an ordinary groupby / count.
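A more direct pandas route (a sketch, not part of the original answer) is to pull the name out in one step with Series.str.extract and count with value_counts:

```python
import pandas as pd

txt = """{'{http://www.omg.org/XMI}id': '18836', 'sofa': '12', 'begin': '27', 'end': '30', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '19012', 'sofa': '12', 'begin': '155', 'end': '158', 'Character': 'Jonathan'}
{'{http://www.omg.org/XMI}id': '18828', 'sofa': '12', 'begin': '31', 'end': '37', 'Character': 'Joyce'}"""

# one Series element per line; capture the quoted name after 'Character':
s = pd.Series(txt.splitlines())
counts = s.str.extract(r"'Character': '([^']+)'", expand=False).value_counts()
print(counts)
```

value_counts already returns the names sorted by descending count, so no explicit groupby is needed.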
