I want to use Naive Bayes to classify documents into a relatively large number of classes. I'm looking to confirm whether a mention of an entity name in an article really is that entity, on the basis of whether that article is similar to articles where that entity has been correctly verified.
Say we find the text "General Motors" in an article. We have a set of data that contains articles and the correct entities mentioned within them. So, if "General Motors" is found in a new article, should that article fall into the class of articles in the prior data that contained a known genuine mention of "General Motors", or into the class of articles which did not mention that entity?
(I'm not creating a class for every entity and trying to classify every new article into every possible class. I already have a heuristic method for finding plausible mentions of entity names, and I just want to verify the plausibility of the limited number of entity name mentions per article that the method already detects.)
Given that the number of potential classes and articles is quite large and Naive Bayes is relatively simple, I wanted to do the whole thing in SQL, but I'm having trouble with the scoring query...
Here's what I have so far:
CREATE TABLE `each_entity_word` (
  `word` varchar(20) NOT NULL,
  `entity_id` int(10) unsigned NOT NULL,
  `word_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`word`, `entity_id`)
);

CREATE TABLE `each_entity_sum` (
  `entity_id` int(10) unsigned NOT NULL DEFAULT '0',
  `word_count_sum` int(10) unsigned DEFAULT NULL,
  `doc_count` mediumint(8) unsigned NOT NULL,
  PRIMARY KEY (`entity_id`)
);

CREATE TABLE `total_entity_word` (
  `word` varchar(20) NOT NULL,
  `word_count` int(10) unsigned NOT NULL,
  PRIMARY KEY (`word`)
);

CREATE TABLE `total_entity_sum` (
  `word_count_sum` bigint(20) unsigned NOT NULL,
  `doc_count` int(10) unsigned NOT NULL,
  `pkey` enum('singleton') NOT NULL DEFAULT 'singleton',
  PRIMARY KEY (`pkey`)
);
Each article in the marked data is split into distinct words, and for each article, for each entity mentioned, every word is added to each_entity_word (or its word_count is incremented), and doc_count is incremented in each_entity_sum, both with respect to that entity_id. This is repeated for each entity known to be mentioned in that article.
For each article, regardless of the entities contained within, total_entity_word and total_entity_sum are similarly incremented for each word.
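For concreteness, ingesting one (article, entity) pair might look like the following; the staging table article_words(word), holding the article's distinct words, and the literal entity_id 1 are assumptions for illustration (total_entity_word and total_entity_sum get the same treatment, minus the entity_id):

-- One row per distinct word: insert, or bump the document-frequency count
INSERT INTO each_entity_word (word, entity_id, word_count)
SELECT word, 1, 1 FROM article_words
ON DUPLICATE KEY UPDATE word_count = word_count + 1;

-- Maintain the per-entity totals in one statement
INSERT INTO each_entity_sum (entity_id, word_count_sum, doc_count)
SELECT 1, COUNT(*), 1 FROM article_words
ON DUPLICATE KEY UPDATE
  word_count_sum = word_count_sum + VALUES(word_count_sum),
  doc_count = doc_count + 1;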
From those tables, the probability estimates fall out directly:
P(word | any document) is word_count in total_entity_word for that word over doc_count in total_entity_sum.
P(word | document mentions entity x) is word_count in each_entity_word for that word for entity_id x over doc_count in each_entity_sum for entity_id x.
P(word | document does not mention entity x) is (word_count in total_entity_word minus its word_count in each_entity_word for that word for that entity) over (the doc_count in total_entity_sum minus the doc_count for that entity in each_entity_sum).
P(document mentions entity x) is doc_count in each_entity_sum for that entity id over doc_count in total_entity_sum, and P(document does not mention entity x) is one minus that.
For a new article that comes in, split it into words and just select where word in ('I', 'want', 'to', 'use'...) against either each_entity_word or total_entity_word. In the db platform I'm working with (MySQL), IN clauses are relatively well optimized.
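Putting those estimates together (and assuming, as Naive Bayes does, that words are independent given the class), the quantity to compute for a candidate entity x is, in LaTeX notation:

P(x \mid w_1, \ldots, w_n) = \frac{P(x) \prod_i P(w_i \mid x)}{P(x) \prod_i P(w_i \mid x) + P(\lnot x) \prod_i P(w_i \mid \lnot x)}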
Also, there is no PRODUCT() aggregate function in SQL, but you can use EXP(SUM(LOG(x))) to get the equivalent of PRODUCT(x), or just work in log space with SUM(LOG(x)).
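For instance, a minimal illustration of that identity (assuming a hypothetical table probs(p) with strictly positive values, since LOG(0) is undefined):

SELECT EXP(SUM(LOG(p))) AS product_p  -- equals p1 * p2 * ... * pn
FROM probs;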
So, if I get a new article in, split it up into distinct words, and put those words into a big IN() clause along with a potential entity id to test, how can I get the Naive Bayesian probability that the article falls into that entity id's class in SQL?
EDIT:
Try #1:
set @entity_id = 1;

-- := (not =) is needed for assignment inside SELECT; plain = is a comparison
select @entity_doc_count := doc_count from each_entity_sum where entity_id = @entity_id;
select @total_doc_count := doc_count from total_entity_sum;

select
  exp(
    log(@entity_doc_count / @total_doc_count) +
    (
      -- a ratio of probabilities is a difference of logs, not a quotient of log-sums
      sum(log((ifnull(ew.word_count, 0) + 1) / @entity_doc_count)) -
      sum(log(((aew.word_count + 1) - ifnull(ew.word_count, 0)) / (@total_doc_count - @entity_doc_count)))
    )
  ) as likelihood
from total_entity_word aew
left outer join each_entity_word ew on ew.word = aew.word and ew.entity_id = @entity_id
where aew.word in ('I', 'want', 'to', 'use'...);
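That gives a likelihood ratio rather than a probability. I think the normalization I'm after would look something like this, reusing the same variables (the +1/+2 Laplace smoothing terms are a guess on my part, not anything fixed by the schema):

select 1 / (1 + exp(l.log_not_entity - l.log_entity)) as posterior
from (
  select
    -- log P(entity) + sum of log P(word | entity)
    log(@entity_doc_count / @total_doc_count)
      + sum(log((ifnull(ew.word_count, 0) + 1) / (@entity_doc_count + 2))) as log_entity,
    -- log P(not entity) + sum of log P(word | not entity)
    log((@total_doc_count - @entity_doc_count) / @total_doc_count)
      + sum(log((aew.word_count - ifnull(ew.word_count, 0) + 1) / (@total_doc_count - @entity_doc_count + 2))) as log_not_entity
  from total_entity_word aew
  left outer join each_entity_word ew
    on ew.word = aew.word and ew.entity_id = @entity_id
  where aew.word in ('I', 'want', 'to', 'use')
) l;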
Use an R to Postgres (or MySQL, etc.) interface
Alternatively, I'd recommend using an established stats package with a connector to the db. This will make your app a lot more flexible if you want to switch from Naive Bayes to something more sophisticated:
http://rpgsql.sourceforge.net/
bnd.pr> data(airquality)
bnd.pr> db.write.table(airquality, no.clobber = F)
bnd.pr> bind.proxy("airquality")
bnd.pr> summary(airquality)
Table name: airquality
Database: test
Host: localhost
Dimensions: 6 (columns) 153 (rows)
bnd.pr> print(airquality)
Day Month Ozone Solar.R Temp
1 1 5 41 190 67
2 2 5 36 118 72
3 3 5 12 149 74
4 4 5 18 313 62
5 5 5 NA NA 56
6 6 5 28 NA 66
7 7 5 23 299 65
8 8 5 19 99 59
9 9 5 8 19 61
10 10 5 NA 194 69
Continues for 143 more rows and 1 more cols...
bnd.pr> airquality[50:55, ]
Ozone Solar.R Wind Temp Month Day
50 12 120 11.5 73 6 19
51 13 137 10.3 76 6 20
52 NA 150 6.3 77 6 21
53 NA 59 1.7 76 6 22
54 NA 91 4.6 76 6 23
55 NA 250 6.3 76 6 24
bnd.pr> airquality[["Ozone"]]
[1] 41 36 12 18 NA 28 23 19 8 NA 7 16 11 14 18 14 34 6
[19] 30 11 1 11 4 32 NA NA NA 23 45 115 37 NA NA NA NA NA
[37] NA 29 NA 71 39 NA NA 23 NA NA 21 37 20 12 13 NA NA NA
[55] NA NA NA NA NA NA NA 135 49 32 NA 64 40 77 97 97 85 NA
[73] 10 27 NA 7 48 35 61 79 63 16 NA NA 80 108 20 52 82 50
[91] 64 59 39 9 16 78 35 66 122 89 110 NA NA 44 28 65 NA 22
[109] 59 23 31 44 21 9 NA 45 168 73 NA 76 118 84 85 96 78 73
[127] 91 47 32 20 23 21 24 44 21 28 9 13 46 18 13 24 16 13
[145] 23 36 7 14 30 NA 14 18 20
You'll then want to install the e1071 package to do Naive Bayes. At the R prompt:
[ramanujan:~/base]$R
R version 2.7.2 (2008-08-25)
Copyright (C) 2008 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
~/.Rprofile loaded.
Welcome at Sun Apr 19 00:45:30 2009
> install.packages("e1071")
> install.packages("mlbench")
> library(e1071)
> ?naiveBayes
> example(naiveBayes)
Here's a simple version for SQL Server. I run it on a free SQL Express implementation and it is pretty fast.
http://sqldatamine.blogspot.com/2013/07/classification-using-naive-bayes.html
Here is a blog post detailing what you are looking for: http://nuncupatively.blogspot.com/2011/07/naive-bayes-in-sql.html
I have coded up many versions of NB classifiers in SQL. The answers above advocating switching analysis packages were not scalable to my data size and processing-time requirements. I had a table with a row for each word/class combination (nrows = words * classes) and a coefficient column, and another table with a column for document_id and word. I just joined these tables together on word, grouped by document, summed the coefficients, and then adjusted the sums for the class probability. This left me with a table of document_id, class, score. I then picked the min score (since I was doing a complement Naive Bayes approach, which I found worked better in a multi-class situation). A sketch of that join is below.
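A minimal sketch of that scoring join; the table and column names here (doc_words, word_class_coef, class_prior) are hypothetical stand-ins for the tables described above:

-- Score every (document, class) pair: sum the per-word coefficients
-- and adjust by the (log) class probability.
SELECT dw.document_id,
       c.class,
       p.log_class_prob + SUM(c.coefficient) AS score
FROM doc_words dw
JOIN word_class_coef c ON c.word = dw.word
JOIN class_prior p ON p.class = c.class
GROUP BY dw.document_id, c.class, p.log_class_prob
ORDER BY dw.document_id, score;  -- with complement NB, take the minimum score per document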
As a side note, I found many transformations/algorithm modifications improved my holdout predictions a great deal. They are described in Jason Rennie's paper "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" and summarized here: http://www.ist.temple.edu/~vucetic/cis526fall2007/liang.ppt
I don't have time to calculate all the expressions for the NB formula, but here's the main idea:
SET @entity = 123;

SELECT EXP(SUM(LOG(probability))) / (EXP(SUM(LOG(probability))) + EXP(SUM(LOG(1 - probability))))
FROM (
  SELECT @entity AS _entity,
  /* Above is required for efficiency: subqueries using _entity will be DEPENDENT and use the indexes */
  (
    SELECT SUM(word_count)
    FROM total_entity_word
    WHERE word = d.word
  )
  /
  (
    SELECT doc_count
    FROM each_entity_sum
    WHERE entity_id = _entity
  ) AS pwordentity,
  /* I've just referenced a previously selected field */
  (
    SELECT 1 - pwordentity
  ) AS pwordnotentity,
  /* Again referenced a previously selected field */
  ... etc AS probability
  FROM total_entity_word AS d /* alias d, referenced by the correlated subquery above */
) q
Note that you can easily refer to previously selected fields in a SELECT by using them in correlated subqueries (as in the example).
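A self-contained demonstration of that MySQL alias trick:

SELECT 1 AS a,
       (SELECT a + 1) AS b,  -- refers back to the alias a
       (SELECT b * 2) AS c;  -- refers back to the alias b, yielding 4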
If using Oracle, it has data mining built in
I'm not sure what db you're running, but if you're using Oracle, data mining capabilities are baked into the db:
http://www.oracle.com/technology/products/bi/odm/index.html
...including Naive Bayes:
http://download.oracle.com/docs/cd/B28359_01/datamine.111/b28129/algo_nb.htm
and a ton of others:
http://www.oracle.com/technology/products/bi/odm/odm_techniques_algorithms.html
That was surprising to me. Definitely one of the competitive advantages that Oracle has over the open source alternatives in this area.
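For a flavor of the interface, a hedged sketch; the table, column, and model names here are made up, while DBMS_DATA_MINING.CREATE_MODEL and the PREDICTION function are the documented entry points:

-- Choose the Naive Bayes algorithm via a settings table
CREATE TABLE nb_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000)
);
INSERT INTO nb_settings
VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_naive_bayes);

-- Train a classifier on a (hypothetical) labeled article table
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'nb_entity_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'training_articles',
    case_id_column_name => 'article_id',
    target_column_name  => 'mentions_entity',
    settings_table_name => 'nb_settings');
END;
/

-- Score new rows directly in SQL
SELECT article_id, PREDICTION(nb_entity_model USING *) AS predicted
FROM new_articles;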