
Building a reverse language dictionary

I was wondering what it takes to build a reverse language dictionary.

The user enters something along the lines of: "red edible fruit" and the application would return: "tomatoes, strawberries, ..."

I assume these results should be based on some form of keywords such as synonyms, or some form of string search.

This is an online implementation of this concept.

What's going on there and what is involved?

EDIT 1: The question is more about the "how" than the "which tool". However, feel free to suggest any tools you think would do the job.

OpenCyc is a computer-usable database of real-world concepts and meanings. From their web site:

OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine. OpenCyc can be used as the basis of a wide variety of intelligent applications

Beware though, that it's an enormously complex reasoning engine -- real-world facts never were simple. Documentation is quite sparse and the learning curve is steep.

Any approach would basically involve having a normalized database. Here is a basic example of what your database structure might look like:

// terms
+-------------------+
| id | name         |
| 1  | tomatoes     |
| 2  | strawberries |
| 3  | peaches      |
| 4  | plums        |
+-------------------+

// descriptions
+-------------------+
| id | name         |
| 1  | red          |
| 2  | edible       |
| 3  | fruit        |
| 4  | purple       |
| 5  | orange       |
+-------------------+

// connections
+-------------------------+
| terms_id | descript_id  |
| 1        | 1            |
| 1        | 2            |
| 1        | 3            |
| 2        | 1            |
| 2        | 2            |
| 2        | 3            |
| 3        | 1            |
| 3        | 2            |
| 3        | 5            |
| 4        | 1            |
| 4        | 2            |
| 4        | 4            |
+-------------------------+

This is a fairly basic setup, but it should give you an idea of how many-to-many relationships using a look-up table work within databases.

Your application would have to break the input string apart and normalize it, for example by stripping suffixes from the user's words. The script would then query the connections table and return the results.
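
As a rough sketch of that flow, the following uses Python's built-in sqlite3 module with the three tables from above. The function names (`build_db`, `normalize`, `lookup`) and the crude "strip a trailing s" rule are assumptions made for illustration; a real application would likely use a proper stemmer.

```python
import sqlite3

def build_db():
    """Create an in-memory database holding the example tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE terms (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE descriptions (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE connections (terms_id INTEGER, descript_id INTEGER);
        INSERT INTO terms VALUES
            (1,'tomatoes'),(2,'strawberries'),(3,'peaches'),(4,'plums');
        INSERT INTO descriptions VALUES
            (1,'red'),(2,'edible'),(3,'fruit'),(4,'purple'),(5,'orange');
        INSERT INTO connections VALUES
            (1,1),(1,2),(1,3),(2,1),(2,2),(2,3),
            (3,1),(3,2),(3,5),(4,1),(4,2),(4,4);
    """)
    return conn

def normalize(text):
    """Lowercase, strip punctuation, and drop a plural 's' (very crude)."""
    words = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?\"'")
        if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
            word = word[:-1]
        if word and word not in words:
            words.append(word)
    return words

def lookup(conn, query):
    """Return the terms connected to every description word in the query."""
    words = normalize(query)
    if not words:
        return []
    placeholders = ",".join("?" * len(words))
    sql = f"""
        SELECT t.name
        FROM terms t
        JOIN connections c ON c.terms_id = t.id
        JOIN descriptions d ON d.id = c.descript_id
        WHERE d.name IN ({placeholders})
        GROUP BY t.id, t.name
        HAVING COUNT(DISTINCT d.id) = ?
        ORDER BY t.id
    """
    rows = conn.execute(sql, (*words, len(words))).fetchall()
    return [name for (name,) in rows]

conn = build_db()
print(lookup(conn, "Red, edible fruits"))  # ['tomatoes', 'strawberries']
```

The `HAVING` clause keeps only the terms that match all of the query words; relaxing it to an `ORDER BY COUNT(...) DESC` would give partial matches ranked by how many words they hit.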

To answer the "how" part of your question, you could utilize human computation: There are hordes of bored teenagers with iPhones around the globe, so create a silly game whose byproduct is filling your database with facts -- to harness their brainpower for your purposes.

Sounds like an awkward concept at first, but look at this lecture on Human Computation for an example.

First, there must be some way of associating concepts (like 'snow') with particular words.

So rather than simply storing a wordlist, you would also need to store concepts or properties like "red", "fruit", and "edible" as well as the keywords themselves, and model relationships between them.

At a simple level, you could have two tables (they don't have to be database tables): a list of keywords and a list of concepts/properties/adjectives. You then model the relationship by storing a third table that represents the mapping from keyword to adjective.

So if you have:

keywords:

0001  aardvark
....
0050  strawberry
....
0072  tomato
....
0120  zoo

and concepts:

0001  big
0002  small
0003  fruit
0004  vegetable
0005  mineral
0006  metal
....
0250  black
0251  blue
0252  red
....
0570  edible

you would need a mapping containing:

0050 -> 0003
0050 -> 0252
0050 -> 0570
0072 -> 0003
0072 -> 0252
0072 -> 0570

You may like to think of this as modelling an "is" relationship: 0050 (a strawberry) "is" 0003 (fruit), and "is" 0252 (red), and "is" 0570 (edible).
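
Since the tables don't have to be database tables, here is a minimal in-memory sketch of the same idea in Python. It inverts the "is" mapping so that each concept points to the set of keywords that have it, then intersects those sets. The IDs are the ones from the lists above; the names `KEYWORDS`, `CONCEPTS`, `IS_A`, `build_index` and `reverse_lookup` are made up for illustration.

```python
from collections import defaultdict

KEYWORDS = {1: "aardvark", 50: "strawberry", 72: "tomato", 120: "zoo"}
CONCEPTS = {1: "big", 2: "small", 3: "fruit", 252: "red", 570: "edible"}

# (keyword id, concept id) pairs: keyword "is" concept
IS_A = [(50, 3), (50, 252), (50, 570), (72, 3), (72, 252), (72, 570)]

def build_index(pairs):
    """Invert the mapping: concept id -> set of keyword ids."""
    index = defaultdict(set)
    for keyword_id, concept_id in pairs:
        index[concept_id].add(keyword_id)
    return index

def reverse_lookup(concept_names, index):
    """Return the keywords that have every one of the named concepts."""
    ids_by_name = {name: cid for cid, name in CONCEPTS.items()}
    candidate_sets = []
    for name in concept_names:
        concept_id = ids_by_name.get(name)
        if concept_id is None:
            return []  # unknown concept: nothing can match
        candidate_sets.append(index.get(concept_id, set()))
    if not candidate_sets:
        return []
    matches = set.intersection(*candidate_sets)
    return sorted(KEYWORDS[k] for k in matches)

index = build_index(IS_A)
print(reverse_lookup(["red", "edible", "fruit"], index))  # ['strawberry', 'tomato']
```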

How will your engine know that

  • "An incredibly versatile ingredient, essential for any fridge chiller drawer. Whether used for salads, soups, sauces or just raw in sandwiches, make sure they are firm and a rich red colour when purchased",
  • "mildly acid red or yellow pulpy fruit eaten as a vegetable", and
  • "an American musician who is known for being the lead singer/drummer for the alternative rock band Sound of Urchin"

all map to the same original word? Natural language definitions are unstructured, so you can't store them in a normalized database. You can attempt to structure them by reducing them to an ontology, like Princeton's WordNet, but creating and using ontologies is an extremely difficult problem, the topic of PhD theses and well-funded advanced research.

It should be fairly straightforward. You can use straight synonyms in addition to a series of words to define each word. The word order in the definition is sometimes important. Each word can have multiple definitions, of course.

You can develop a rating system to see which definitions are the closest match to the input, then display the top 3 or 4 words.
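
One possible shape for such a rating system, as a sketch only: fold synonyms into a canonical word, count the words the input shares with each definition, and add a small bonus for adjacent word pairs that appear in the same order. The sample data, the synonym table and the 0.5 weight are all invented for the example.

```python
# Hypothetical sample data: synonyms fold into one canonical word,
# and a word may carry several definitions.
SYNONYMS = {"crimson": "red", "scarlet": "red", "eatable": "edible"}
DEFINITIONS = {
    "tomato": ["red edible fruit eaten as a vegetable",
               "mildly acid red or yellow pulpy fruit"],
    "strawberry": ["sweet red edible fruit with seeds on the surface"],
    "plum": ["purple edible fruit with a stone"],
    "fire engine": ["red vehicle used to fight fires"],
}

def tokens(text):
    """Lowercase, split on whitespace, and replace synonyms."""
    return [SYNONYMS.get(w, w) for w in text.lower().split()]

def bigrams(words):
    """Adjacent word pairs, used to reward matching word order."""
    return set(zip(words, words[1:]))

def score(query, definition):
    q, d = tokens(query), tokens(definition)
    shared_words = len(set(q) & set(d))
    shared_pairs = len(bigrams(q) & bigrams(d))
    return shared_words + 0.5 * shared_pairs  # weight is an arbitrary choice

def top_matches(query, n=3):
    """Rate each word by its best-scoring definition; return the top n."""
    best = {word: max(score(query, d) for d in defs)
            for word, defs in DEFINITIONS.items()}
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, s in ranked[:n] if s > 0]

print(top_matches("crimson edible fruit"))  # ['strawberry', 'tomato', 'plum']
```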

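How about using a dictionary and performing a full-text search on the definitions (after removing linking words and articles such as 'and', 'or', ...), then returning the words with the best score (the highest number of matching words, or a more sophisticated scoring method)?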
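
A toy version of that full-text search, assuming a tiny hand-written dictionary and stop-word list (both invented here), could look like this. The score is simply the number of distinct non-stop words the query shares with a definition.

```python
import re

# Hypothetical stop-word list and dictionary entries, for illustration only.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "as", "to", "in", "with", "that"}
DEFINITIONS = {
    "tomato": "mildly acid red or yellow pulpy fruit eaten as a vegetable",
    "snow": "frozen water vapour that falls to the ground as white flakes",
    "strawberry": "sweet fleshy red fruit of a low-growing plant",
}

def content_words(text):
    """Set of lowercase words with the stop words removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}

def search(query, limit=5):
    """Return (word, score) pairs, best first."""
    wanted = content_words(query)
    scores = {}
    for word, definition in DEFINITIONS.items():
        hits = len(wanted & content_words(definition))
        if hits:
            scores[word] = hits
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]

print(search("a red fruit and a vegetable"))  # [('tomato', 3), ('strawberry', 2)]
```

For a large dictionary, the full-text index of a database or search engine would normally replace the linear scan over `DEFINITIONS`.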

There are several ways you can go about this depending on how much work you want to put into it. One way you can build a reverse dictionary is to use the definitions to help calculate which words are closely related. This way can be the most difficult because you need to have a pretty extensive algorithm that can associate phrases.

Finding Similar Definitions

One way you could do this is by comparing the input string against each stored definition and seeing which ones match most closely. In PHP you can use the similar_text function. The problem with this method is that if your database has a ton of words and definitions, you will put a lot of load on your SQL DB.
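
To illustrate string-similarity matching outside PHP, here is a sketch that uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for similar_text. The two functions do not use the same algorithm, so the scores will differ; the sample definitions and the `closest` helper are invented for the example.

```python
from difflib import SequenceMatcher

# Hypothetical sample definitions.
DEFINITIONS = {
    "tomato": "mildly acid red or yellow pulpy fruit eaten as a vegetable",
    "snow": "frozen water vapour that falls to the ground as white flakes",
    "strawberry": "sweet fleshy red fruit of a low-growing plant",
}

def similarity(a, b):
    """Ratio between 0.0 and 1.0; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def closest(query, n=2):
    """Words whose definitions are most similar to the query string."""
    ranked = sorted(
        DEFINITIONS,
        key=lambda word: similarity(query, DEFINITIONS[word]),
        reverse=True,
    )
    return ranked[:n]

print(closest("red pulpy fruit eaten as a vegetable", n=1))
```

Note that this compares the query against every definition, which is the overhead concern raised above: the cost grows linearly with the size of the dictionary.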

Use An API

There are several resources out there you can use to help you get a reverse dictionary by using an API. Here are some of them.

This sounds like a job for Prolog.
