optimal way to count string tags using python

so I have this list:

tokens = ['<greeting>', 'Hello World!', '</greeting>']

the task is to count the number of strings that have XML tags. what I have so far (that works) is this:

tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0

for i in range(len(tokens)):
    if tokens[i].find('>') >1: 
        print(tokens[i])
        count += 1
        print(count)
    else:
        count += 0 

what puzzles me is that I'm inclined to use the following line for the if statement

  if tokens[i].find('>') == True:

but it won't work. what's the optimal way of writing this loop, in your opinion? many thanks! alex.

One issue I see with your approach is that it might capture false positives (e.g. "gree>ting"), so checking only for a closing > is not enough.

If your definition of "contains a tag" simply means that the string contains a <, followed by some characters, then a >, you could use a regular expression (worth keeping in mind in case you need something more complex later).

This, combined with the compact list comprehension proposed by @aws_apprentice in the comments, gives us:

import re

regex = "<.+>"
count = sum([1 if re.search(regex, t) else 0 for t in tokens])
print(count) #done!

Explanation:

This one-liner is called a list comprehension; it builds a list of ones and zeros. For each string t in tokens, the new list gets a 1 if the string contains a tag and a 0 otherwise. re.search checks whether the string, or any substring of it, matches the given regex.
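A slightly leaner variant, as a sketch: since re.search returns a match object or None, a generator expression can count the matches without building the intermediate list. The pattern `<[^<>]+>` and the names TAG_PATTERN and count_tagged are assumptions of mine, not part of the answer above; the pattern is a bit stricter than `<.+>` because it refuses to span other angle brackets.

```python
import re

# Assumed pattern: '<', one or more characters that are not angle brackets, then '>'.
TAG_PATTERN = re.compile(r"<[^<>]+>")

def count_tagged(tokens):
    """Count the strings that contain at least one tag-like substring."""
    return sum(1 for t in tokens if TAG_PATTERN.search(t))

tokens = ['<greeting>', 'Hello World!', '</greeting>']
print(count_tagged(tokens))                  # 2
print(count_tagged(['gree>ting', 'a < b']))  # 0
```

Whether a string such as '<a> text' should count depends on your definition of "has a tag"; this sketch counts it, because it only asks for a tag somewhere in the string.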

The following approach checks for the opening < at the start of the string and also checks for > at the end of the string.

In [4]: tokens = ['<greeting>', 'Hello World!', '</greeting>']

In [5]: sum([1 if i.startswith('<') and i.endswith('>') else 0 for i in tokens])
Out[5]: 2
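Because True counts as 1 in Python arithmetic, the same check can also be summed directly, without the 1 if ... else 0 part. A minimal sketch, with is_tag as a hypothetical helper name, assuming the whole string is expected to be the tag:

```python
def is_tag(s):
    """True if the string starts with '<' and ends with '>'."""
    return s.startswith('<') and s.endswith('>')

tokens = ['<greeting>', 'Hello World!', '</greeting>']

# Booleans are summed as 1 (True) and 0 (False).
count = sum(is_tag(t) for t in tokens)
print(count)  # 2
```

startswith and endswith simply return False for an empty string, so no special case is needed for one.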

Anis R.'s answer should work fine, but this is a non-regex alternative (and not as elegant; in fact I would call it clumsy).

This code just looks at the beginning and end of each list element for angle brackets. I'm a novice to the extreme, but I think range(len(tokens)) is redundant, and the loop can be simplified like this as well.

tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0

for i in tokens:
    if i[0].find('<') == 0 and i[-1].find('>') != -1:
        print(i)
        count += 1

print(count)

str.find() returns an index position, not a boolean, as others have noted, so your if statement must reflect that. A .find() with no result returns -1. As you can see, for the first angle bracket, checking for an index of 0 will work, as long as your data follows the scheme in your example list. The second if component is written as a negative check (using != -1) on the last character of the list item. Since i[-1] is a single character, .find('>') on it can only return 0 or -1, so == 0 would work there as well. The only time this wouldn't work is if your list components can look like '<greeting> Hello', or if the list contains an empty string, since i[0] then raises an IndexError.
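To make the return values concrete, here is a small illustration of why comparing the result of .find() with True misbehaves, and how the in operator expresses a plain containment test. It only relies on built-in string behaviour; the sample strings are made up.

```python
s = '<greeting>'

print(s.find('>'))   # 9: index of the first '>'
print(s.find('x'))   # -1: not found
print(s.find('<'))   # 0: found at the very start

# True == 1 in Python, so this only holds when the match sits at index 1.
print(s.find('>') == True)      # False
print('a>b'.find('>') == True)  # True, but only by coincidence

# -1 is truthy and 0 is falsy, so using find() directly as a condition misleads too.
print(bool(s.find('x')), bool(s.find('<')))  # True False

# A containment test says what is meant.
print('>' in s)      # True
```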

Happy to be corrected by others, that's why I'm here.

