
Python program to identify and count the acronyms in a text file

I want to write a Python program that identifies the acronyms in a text file and counts them. The output should list every acronym present in the specified file and how many times each one occurs. I have tried the code below; any suggestions and help are appreciated.

Note: the code below does not give the desired output.

Link to the text file, if you want to have a look: https://drive.google.com/file/d/1zlqsmJKqGIdD7qKicVmF0W6OgF5-g7Qk/view?usp=sharing

The text file contains various acronyms. They can be two or more letters long and can be written in either lowercase or uppercase. For further reference, please see the text file on Google Drive.

Updated code is also welcome.

acronyms = 0 # number of acronyms

#open file File.txt in read mode with name file
with open('Larex_text_file.txt', "r", errors ='ignore') as file:
    text = str(file.read())
    import re

    print(re.sub("([a-zA-Z]\.*){2,}s?", "", text))

    for line in text: # for every line in file
        for word in line.split(' '): # for every word in line
            if word.isupper(): # if word is all uppercase letters
                acronyms+=1

print("Number of acronyms:", acronyms) #print number of acronyms

After building a small text file and trying out your code, I came up with a couple of tweaks that simplify it and still count the words in the text file that are all uppercase letters.

acronyms = 0  # number of all-uppercase words found

# open the text file in read mode
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = file.read()  # read() already returns a str

    for word in text.split():  # split on any whitespace, including newlines
        if word.isupper() and word.isalpha():  # word is letters only, all uppercase
            acronyms += 1

print("Number of words that are all uppercase:", acronyms)

The code now uses a single loop over the words split out of the text that was read, and checks that each word is entirely alphabetic and entirely uppercase.

To test, I built a small text file with some words all in uppercase.

NASA and UCLA have teamed up with the FBI and JPL.
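also UNICEF and the WWE have teamed up.

With that, the program should report five all-uppercase words: NASA, UCLA, FBI, UNICEF and WWE. JPL. is not counted because its trailing period makes isalpha() return False.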

And, when run, this was the output on the terminal.

@Una:~/Python_Programs/Acronyms$ python3 Acronym.py 
Number of words that are all uppercase: 5
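
If tokens with attached punctuation, such as JPL. in the sample, should count too, one option is to strip punctuation from both ends of each token before testing it. The following is only a sketch under that assumption; the function name count_uppercase_words is made up for illustration, and it also skips single-letter words.

```python
import string

def count_uppercase_words(text):
    """Count all-uppercase alphabetic words, ignoring surrounding punctuation."""
    count = 0
    for raw in text.split():
        word = raw.strip(string.punctuation)  # "JPL." becomes "JPL"
        # require at least two letters, letters only, all uppercase
        if len(word) > 1 and word.isalpha() and word.isupper():
            count += 1
    return count

sample = ("NASA and UCLA have teamed up with the FBI and JPL.\n"
          "also UNICEF and the WWE have teamed up.")
print(count_uppercase_words(sample))  # 6
```

On the sample text this reports six words rather than five, because JPL is now included.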

You will note that I am being a bit pedantic here referring to the count of "uppercase" words and not calling them acronyms. I am not sure if you are attempting to actually derive true acronyms, but if you are, this link might help:

Acronyms

Give that a try to see if it meets the spirit of your project.

Answer to the question:

acronyms = 0       # running count of all-uppercase words
acronym_word = []  # every acronym found, duplicates included

# open the text file in read mode
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = file.read()
    for word in text.split():  # for every whitespace-separated word in the file
        if word.isupper() and word.isalpha():  # word is letters only, all uppercase
            acronyms += 1
            if len(word) > 1:  # single-character words are not acronyms
                acronym_word.append(word)  # store each acronym found in the file

uniqWords = sorted(set(acronym_word))  # remove duplicates and sort the acronyms
for word in uniqWords:
    print(word, ":", acronym_word.count(word))
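
As an alternative sketch (not taken from the answer above), the same per-acronym tally can be produced with a regular expression and collections.Counter. It assumes an acronym is a standalone run of two or more capital letters; the helper name count_acronyms and the sample string are invented for illustration.

```python
import re
from collections import Counter

def count_acronyms(text):
    # \b[A-Z]{2,}\b matches whole words made of two or more capital letters,
    # so trailing punctuation such as "JPL." does not prevent a match
    return Counter(re.findall(r"\b[A-Z]{2,}\b", text))

sample = "NASA and the FBI met. NASA later called JPL."
for word, n in sorted(count_acronyms(sample).items()):
    print(word, ":", n)
# FBI : 1
# JPL : 1
# NASA : 2
```

Counter does the counting in one pass, which avoids calling list.count() once per unique word.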

From your comments, it sounds like every acronym appears at least once as an all-uppercase word, then can appear several more times in lowercase.

I suggest making two passes on the text: a first time to collect all uppercase words, and a second pass to search for every occurrence, case-insensitive, of the words you collected on the first pass.

You can use collections.Counter to quickly count words.

You can use ''.join(filter(str.isalpha, word.lower())) to strip a word of its non-alphabetical characters and disregard its case.
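
As a quick illustration of that expression, here is a small sketch; the wrapper name normalize and the sample string are only for demonstration.

```python
from collections import Counter

def normalize(word):
    # keep only alphabetic characters and fold everything to lowercase
    return ''.join(filter(str.isalpha, word.lower()))

print(normalize("C.R.,"))  # cr
print(normalize("CR-1"))   # cr

# Counter tallies the normalized forms, so all spellings of "cr" are merged
counts = Counter(normalize(w) for w in "CR cr. C.R., pu".split())
print(counts["cr"])  # 3
print(counts["pu"])  # 1
```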

In the code snippet below, I used io.StringIO to emulate opening a text file.

from io import StringIO
from collections import Counter

text = '''First we have CR and PU as uppercase words. A word which first
 appeared as uppercase can also appear as lowercase.
For instance, cr and pu appear in lowercase, and pu appears again.
And again: here is a new occurrence of pu.
An acronym might or might not have punctuation or numbers in it: CR-1,
 C.R., cr.
A word that contains only a single letter will look like an acronym
 if it ever appears as the first word of a sentence.'''

#with open('path/to/file.txt', 'r') as f:
with StringIO(text) as f:
    counts = Counter(''.join(filter(str.isalpha, word.lower()))
                     for line in f for word in line.split())
    f.seek(0)
    uppercase_words = set(''.join(filter(str.isalpha, word.lower()))
                          for line in f
                          for word in line.split() if word.isupper())

acronyms = Counter({w: c for w,c in counts.items() if w in uppercase_words})

print(acronyms)
# Counter({'cr': 5, 'a': 5, 'pu': 4})
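
Note that 'a' shows up in the result because "A" starts a sentence in the sample text. If single-letter words should not count as acronyms, one possible follow-up step is to filter them out afterwards. This is a sketch only; the helper name drop_single_letters is hypothetical, and the input below simply repeats the result printed above.

```python
from collections import Counter

def drop_single_letters(acronyms):
    # keep only entries whose key has at least two letters
    return Counter({w: c for w, c in acronyms.items() if len(w) > 1})

acronyms = Counter({'cr': 5, 'a': 5, 'pu': 4})
print(drop_single_letters(acronyms))
# Counter({'cr': 5, 'pu': 4})
```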
