Splitting up text file into pieces, then searching key phrases in those sections

I am new to Python and I am already a fan of the language. I have a program that does the following:

  1. Opens a text file that has sections of text separated by lines of asterisks (***)

  2. Uses the split() function to split up this text file into sections separated by these asterisks. The line of asterisks is uniform across the text file.

  3. I want my code to iterate through each of these sections and do the following:

    • I have a dictionary that maps "key phrases" to counts. Every count starts at 0.

    • The code needs to iterate through each section created from the split and check to see if the keys in the dictionary are found in each section. If a key term is found, the value for that key increases by 1.

    • Once the code has iterated through one section and updated the counts for the keys found in it, it should print the dictionary keys and their counts (values) for that section, reset the values to 0, and move on to the next section of text, starting at step 3 again.

My code is:

from bs4 import BeautifulSoup
import re
import time
import random
import glob, os
import string


termz = {'does not exceed' : 0, 'shall not exceed' : 0, 'not exceeding' : 0,
  'do not exceed' : 0, 'not to exceed' : 0, 'shall at no time exceed' : 0,
  'shall not be less than' : 0, 'not less than' : 0}
with open('Q:/hello/place/textfile.txt', 'r') as f:
  sections = f.read().split('**************************************************')
  for p in sections[1:]:
      for eachKey in termz.keys():
        if eachKey in p:
          termz[eachKey] = termz.get(eachKey) + 1
          print(termz)  


#print(len(sections))  #there are thirty sections      

        #should be if code encounters ***** then it resets the counters and just moves on....
        #so far only can count the phrases over the entire text file....

#GO BACK TO .SPLIT()
# termz = dict.fromkeys(termz,0) #resets the counter

It prints counts, but they do not match the first section, the last section, or the file as a whole, so I can't tell what it is actually counting.

The print statement at the end is out of place. The termz = dict.fromkeys(termz,0) line is a method I found to reset the values of the dictionary to 0, but it is commented out because I'm not sure where it belongs. Essentially, I am struggling with Python control structures. If someone could point me in the right direction, that'd be amazing.

Your code is pretty close. See the comments below:

termz = {
    'does not exceed': 0,
    'shall not exceed': 0,
    'not exceeding': 0,
    'do not exceed': 0,
    'not to exceed': 0,
    'shall at no time exceed': 0,
    'shall not be less than': 0,
    'not less than': 0
}

with open('Q:/hello/place/textfile.txt', 'r') as f:
    sections = f.read().split('**************************************************')

    # Skip the first section. (I assume this is on purpose?)
    for p in sections[1:]:
        for eachKey in termz:
            if eachKey in p:
                # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
                termz[eachKey] += 1

        # Move this outside of the inner loop
        print(termz)

        # After printing the results for that section, reset the counts
        termz = dict.fromkeys(termz, 0)
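
To see what the reset on the last line does in isolation, here is a minimal sketch. The two-phrase dictionary and the names counts and reset are made up for illustration and are not from the question. dict.fromkeys builds a new dictionary with the same keys, each mapped to the given value, and leaves the old dictionary untouched:

```python
# Hypothetical two-phrase dictionary, used only for illustration.
counts = {'not to exceed': 2, 'not less than': 1}

# dict.fromkeys returns a NEW dict with the same keys, each mapped to 0.
reset = dict.fromkeys(counts, 0)

print(reset)   # {'not to exceed': 0, 'not less than': 0}
print(counts)  # {'not to exceed': 2, 'not less than': 1}
```

Rebinding termz to the new dictionary at the end of the outer loop body is safe because the inner for loop over termz has already finished by then.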

EDIT

Sample input and output:

# Named text rather than input so the built-in input() is not shadowed.
text = '''
Section 1:

This section is ignored.
does not exceed
**************************************************
Section 2:

shall not exceed
not to exceed
**************************************************
Section 3:

not less than'''

termz = {
    'does not exceed': 0,
    'shall not exceed': 0,
    'not exceeding': 0,
    'do not exceed': 0,
    'not to exceed': 0,
    'shall at no time exceed': 0,
    'shall not be less than': 0,
    'not less than': 0
}

sections = text.split('**************************************************')

# Skip the first section. (I assume this is on purpose?)
for p in sections[1:]:
    for eachKey in termz:
        if eachKey in p:
            # This is simpler than termz[eachKey] = termz.get(eachKey) + 1
            termz[eachKey] += 1

    # Move this outside of the inner loop
    print(termz)

    # After printing the results for that section, reset the counts
    termz = dict.fromkeys(termz, 0)

# OUTPUT:
# {'not exceeding': 0, 'shall not exceed': 1, 'not less than': 0, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 1, 'do not exceed': 0, 'does not exceed': 0}
# {'not exceeding': 0, 'shall not exceed': 0, 'not less than': 1, 'shall not be less than': 0, 'shall at no time exceed': 0, 'not to exceed': 0, 'do not exceed': 0, 'does not exceed': 0}
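
As a variation on the same idea, the counting could be wrapped in a function that builds a fresh dictionary for every section, so nothing has to be reset. This is only a sketch under assumptions: the names count_phrases_per_section, PHRASES, SEPARATOR and the skip_first flag are invented here, and the sample text mirrors the one above.

```python
SEPARATOR = '**************************************************'

PHRASES = [
    'does not exceed', 'shall not exceed', 'not exceeding', 'do not exceed',
    'not to exceed', 'shall at no time exceed', 'shall not be less than',
    'not less than',
]


def count_phrases_per_section(text, phrases=PHRASES, separator=SEPARATOR,
                              skip_first=True):
    """Return one {phrase: 0 or 1} dict per section of text."""
    sections = text.split(separator)
    if skip_first:
        # Mirrors sections[1:] in the code above.
        sections = sections[1:]
    # A new dict is built for each section, so no reset step is needed.
    return [{phrase: int(phrase in section) for phrase in phrases}
            for section in sections]


# Sample text equivalent to the one above, joined with the same separator.
sample = SEPARATOR.join([
    '\nSection 1:\n\nThis section is ignored.\ndoes not exceed\n',
    '\nSection 2:\n\nshall not exceed\nnot to exceed\n',
    '\nSection 3:\n\nnot less than',
])

# Numbering starts at 2 because the first section is skipped.
for number, counts in enumerate(count_phrases_per_section(sample), start=2):
    found = [phrase for phrase, hit in counts.items() if hit]
    print(number, found)
# 2 ['shall not exceed', 'not to exceed']
# 3 ['not less than']
```

Like the original, this records whether a phrase appears in a section (0 or 1), not how many times. If the number of occurrences is wanted instead, section.count(phrase) could replace int(phrase in section).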

if eachKey in p:
    termz[eachKey] += 1  # might do it
    print(termz)
