
What's the most efficient way to match strings in Python?

I need to write some user-defined functions (UDFs) in Python for a Pig data transformation job. The data is parsed and fed in, and the Pig script calls the Python UDF for essentially every value in a column.

Most of the UDFs are similar: I essentially need to match a string against 'something + wildcard'. I know regex and have used it so far, but before I go any further I want to make sure it is an efficient way to match strings, since the script will call the UDF thousands of times.

For example, say we have a field that we need to match against sales. The field could contain almost anything, since the source data might go wacko in the future, append something random, and spit out saleslol. Other possible values are sales., salessales, and sales.yes.

Whatever comes after 'sales' doesn't matter; if the value starts with sales, I want to grab it.

So is the following method efficient? The word variable is the input, i.e. a value from the sales column. The first line is for the Pig script.

import re

@outputSchema("num:int")
def rule2(word):
  sales_match = re.match('sales', word, flags=re.IGNORECASE)

  if sales_match:
    return 1
  else:
    return 0


I have another scenario where I need to match against 4 exact, known strings. Is this efficient as well?

@outputSchema("num:int")
def session1(word):
  if word in ['first', 'second', 'third', 'fourth']:
    return 1
  else:
    return 0

You can use str.startswith():

>>> [s for s in 'saleslol. Other possible values are sales. salessales sales.yes'.split() if s.lower().startswith('sales')]
['saleslol.', 'sales.', 'salessales', 'sales.yes']

You also do not need an explicit if/else to return the result of a test in Python:

if word in ['first', 'second', 'third', 'fourth']:
    return 1
else:
    return 0

Instead, it is better to do:

def session1(word):
    return word in {'first', 'second', 'third', 'fourth'}

(Note the set literal instead of a list; the in syntax is the same for either. This version returns True or False rather than 1 or 0.)
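As a minimal sketch of the same idea for a UDF declared with a "num:int" schema: the lookup set can be built once at module level and the boolean converted to 1 or 0. The name KNOWN_SESSIONS is made up for illustration, and the Pig decorator is left out so the snippet runs on its own.

```python
# Hypothetical module-level constant: built once at import time, so each
# call only hashes the word instead of scanning a list.
KNOWN_SESSIONS = frozenset(['first', 'second', 'third', 'fourth'])

def session1(word):
    # int() turns True/False into 1/0, matching the question's return values.
    return int(word in KNOWN_SESSIONS)

print(session1('third'))   # exact match
print(session1('thirdx'))  # not an exact match
```

With only four candidates the difference between a list and a set is likely to be small either way; the set pays off more as the number of known strings grows.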

To test the prefix, your function would be:

def f(word):
    return word.lower().startswith('sales')    # returns True or False; lower() makes it case-insensitive
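For comparison, here is a rough sketch of both approaches written to return 1 or 0 like the question's rule2. The names SALES_RE, rule2_regex and rule2_startswith are made up for illustration, the Pig decorator is omitted so the snippet runs standalone, and it assumes every input is a string (no null fields).

```python
import re

# Hypothetical constant: compiling once at import time avoids looking the
# pattern up in the re module's cache on every call.
SALES_RE = re.compile(r'sales', re.IGNORECASE)

def rule2_regex(word):
    # re.match anchors at the start of the string, so this is a prefix test.
    return 1 if SALES_RE.match(word) else 0

def rule2_startswith(word):
    # Same prefix test without the regex engine.
    return 1 if word.lower().startswith('sales') else 0

for value in ['sales', 'SalesLOL', 'sales.yes', 'presales', '']:
    print(repr(value), rule2_regex(value), rule2_startswith(value))
```

Both functions should agree on plain ASCII input like the examples in the question; which one is faster depends on the interpreter, so it is worth timing them on real data.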

If you want to test several possible words, use any:

>>> def test(tgt, words):
...    return any(word.startswith(tgt) for word in words)
>>> test('sales', {'boom', 'blast', 'saleslol'})
True
>>> test('boombang', {'sales', 'boom', 'blast'})
False

Conversely, if you want to test several prefixes, use the tuple form of startswith:

>>> 'tenthhaha'.startswith(('first', 'second', 'third', 'fourth'))
False
>>> 'firstlol'.startswith(('first', 'second', 'third', 'fourth'))
True

Actually, function A seems to be faster for some reason. I ran 1 million loops over each function, and the first one is about 20% faster, if my measurement is correct.


from pythonbenchmark import compare, measure

def session1_A(word):
  if word in ['first', 'second', 'third', 'fourth']:
    return 1
  else:
    return 0

def session1_B(word):
    return word in {'first', 'second', 'third', 'fourth'}

compare(session1_A, session1_B, 1000000, "fourth")


https://github.com/Karlheinzniebuhr/pythonbenchmark/
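If you would rather not depend on a third-party package, a similar comparison can be sketched with the standard library's timeit module. This is only an approximation of the benchmark above, not a reproduction of it: the lambda adds some call overhead to both measurements, and the timings will vary with the interpreter and version.

```python
import timeit

def session1_A(word):
    if word in ['first', 'second', 'third', 'fourth']:
        return 1
    else:
        return 0

def session1_B(word):
    return word in {'first', 'second', 'third', 'fourth'}

for func in (session1_A, session1_B):
    # Take the best of 3 runs of 1,000,000 calls each to reduce noise.
    best = min(timeit.repeat(lambda: func('fourth'), number=1000000, repeat=3))
    print('%s: %.3f s' % (func.__name__, best))
```

Testing with 'fourth' exercises the worst case for the list version, since the match is the last element; trying a non-matching word and the first element as well gives a fuller picture.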
