
How to reduce the processing cost of comparing many strings together in Python?

I have two datasets, A and B, that contain a string variable similar to a headline.

Example: "this is a very nice string".

Both datasets are large (millions of observations).

I need to see whether the strings in A also appear somewhere in B. I was wondering if there is a specific Python library that would reduce the computational cost of comparing so many strings together?

Maybe via some smart indexing of the datasets before running the comparison? Any ideas or suggestions are welcome.

Important problem: matching should be fuzzy, because I can have the following headlines:

A: "this is an apple" B: "this is a red apple"

They don't match perfectly, but they are really close. If there is no better match (such as an exact match), then I consider them the same.

Many thanks

One option is to convert the two datasets to Python sets and check whether the set of A is a subset of the set of B. You should experiment to see what the actual complexity is, but Python's built-in set operations are well optimized.
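A minimal sketch of the set-based approach, assuming the headlines live in plain Python lists named `headlines_a` and `headlines_b` (hypothetical names; in practice they would come from your datasets). Note this only covers exact matches, not fuzzy ones:

```python
# Exact matching via Python sets: build a set from B once, then
# membership tests for each string in A are O(1) on average.
headlines_a = ["this is an apple", "this is a very nice string"]
headlines_b = ["this is a red apple", "this is a very nice string"]

set_b = set(headlines_b)

# Strings from A that also appear verbatim in B.
matches = [s for s in headlines_a if s in set_b]

# Or, to check whether *all* of A appears in B.
all_present = set(headlines_a).issubset(set_b)

print(matches)
print(all_present)
```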

Another option is to build a trie of the strings in B. This takes O(|B| * max_str_len_in_B). After that you iterate over the strings in A and check whether each of them is in the trie, which costs O(|A| * max_str_len_in_A). A sketch follows below.
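A minimal character-level trie sketch for the idea above, again handling exact lookups only; the dictionary-of-dictionaries representation and the `"$"` end marker are just one simple way to implement it:

```python
def build_trie(strings):
    """Build a character-level trie from an iterable of strings."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-string marker
    return root

def in_trie(trie, s):
    """Return True if the exact string s was inserted into the trie."""
    node = trie
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node

trie_b = build_trie(["this is a red apple", "this is a very nice string"])
print(in_trie(trie_b, "this is an apple"))            # False
print(in_trie(trie_b, "this is a very nice string"))  # True
```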

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

Documentation: Whoosh package documentation

Home Page: http://bitbucket.org/mchaput/whoosh
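A small sketch of how Whoosh could be used here: index the headlines from B once, then query the index with each headline from A. The directory name `indexdir` and the sample headlines are assumptions for illustration; using `OrGroup` makes the parser score documents that share most of the query terms, which gives approximate rather than strictly exact matches.

```python
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT
from whoosh.qparser import QueryParser, OrGroup

# Index the headlines from dataset B once.
os.makedirs("indexdir", exist_ok=True)
schema = Schema(headline=TEXT(stored=True))
ix = create_in("indexdir", schema)

writer = ix.writer()
for text in ["this is a red apple", "this is a very nice string"]:
    writer.add_document(headline=text)
writer.commit()

# Query the index with a headline from dataset A; OrGroup ranks
# documents by how many query terms they contain.
with ix.searcher() as searcher:
    parser = QueryParser("headline", ix.schema, group=OrGroup)
    query = parser.parse("this is an apple")
    for hit in searcher.search(query, limit=3):
        print(hit["headline"], hit.score)
```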
