hg/stx-libbasic2: TextClassifier.st@a03fb375c047 (annotated)

3678 a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	1	"{ Package: 'stx:libbasic2' }"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	2
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	3	"{ NameSpace: Smalltalk }"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	4
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	5	Object subclass:#TextClassifier
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	6	instanceVariableNames:'wordBag sentences docCounts wordCounts wordFrequencyCounts
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	7	categories vocabulary'
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	8	classVariableNames:''
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	9	poolDictionaries:''
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	10	category:'Collections-Text-Support'
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	11	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	12
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	13	!TextClassifier class methodsFor:'documentation'!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	14
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	15	documentation
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	16	"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	17	An initial experiment in naive Bayes text classification.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	18	See BayesClassifierTest.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	19	This is possibly unfinished and may need more work.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	20
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	21	[author:]
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	22	cg
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	23
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	24	[instance variables:]
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	25
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	26	[class variables:]
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	27
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	28	[see also:]
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	29
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	30	"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	31	! !
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	32
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	33	!TextClassifier class methodsFor:'instance creation'!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	34
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	35	new
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	36	"return an initialized instance"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	37
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	38	^ self basicNew initialize.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	39	! !
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	40
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	41	!TextClassifier methodsFor:'initialization'!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	42
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	43	initialize
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	44	"Invoked when a new instance is created."
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	45
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	46	wordBag := Bag new.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	47	"/ sentences := nil.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	48	docCounts := Dictionary new.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	49	wordCounts := Dictionary new.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	50	wordFrequencyCounts := Dictionary new.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	51	categories := Set new.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	52	vocabulary := Set new.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	53
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	54	"/ super initialize. -- commented since inherited method does nothing
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	55	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	56
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	57	initializeCategory:categoryName
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	58	(categories includes:categoryName) ifFalse:[
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	59	docCounts at:categoryName put:0.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	60	wordCounts at:categoryName put:0.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	61	wordFrequencyCounts at:categoryName put:(Dictionary new).
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	62	categories add:categoryName
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	63	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	64	! !
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	65
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	66	!TextClassifier methodsFor:'text handling'!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	67
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	68	classify:string
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	69	"Answer the most probable category for string, or nil if nothing has been trained.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	70	The string is assumed to be regular text, which is first split into lines."
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	71
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	72	|tokens frequencyTable maxProbability chosenCategory|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	73
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	74	maxProbability := Infinity negative.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	75
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	76	tokens := self tokenize:string.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	77	frequencyTable := tokens asBag.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	78
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	79	categories do:[:categoryName |
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	80	|categoryProbability logProbability|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	81
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	82	categoryProbability := (docCounts at:categoryName) / (docCounts inject:0 into:[:sum :n | sum + n]).
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	83	logProbability := categoryProbability log.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	84	frequencyTable valuesAndCountsDo:[:token :frequencyInText |
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	85	|tokenProbability|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	86
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	87	tokenProbability := self tokenProbabilityOf:token inCategory:categoryName.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	88	logProbability := logProbability + (frequencyInText * tokenProbability log).
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	89	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	90	Transcript show:'P(',categoryName,') = '; showCR:logProbability.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	91
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	92	logProbability > maxProbability ifTrue:[
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	93	maxProbability := logProbability.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	94	chosenCategory := categoryName.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	95	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	96	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	97	^ chosenCategory
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	98	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	99
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	100	classify:string asCategory:categoryName
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	101	|tokens frequencyTable sumWordCount|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	102
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	103	self initializeCategory:categoryName.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	104	docCounts incrementAt:categoryName.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	105	tokens := self tokenize:string.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	106	frequencyTable := tokens asBag.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	107	sumWordCount := 0.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	108	frequencyTable valuesAndCountsDo:[:token :count |
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	109	vocabulary add:token.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	110	(wordFrequencyCounts at:categoryName) incrementAt:token by:count.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	111	sumWordCount := sumWordCount + count.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	112	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	113	wordCounts incrementAt:categoryName by:sumWordCount
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	114	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	115
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	116	collectWords:lines
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	117	"computes words from lines"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	118
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	119	|words|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	120
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	121	words := lines collectAll:[:l |
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	122	l asCollectionOfSubCollectionsSeparatedByAnyForWhich:[:ch | ch isLetterOrDigit not]
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	123	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	124	^ words
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	125	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	126
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	127	dehyphenate:linesCollection
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	128	"join words that are hyphenated across line breaks"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	129
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	130	|lines partialLine|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	131
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	132	lines := OrderedCollection new.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	133	linesCollection do:[:eachLine |
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	134	|l isHyphenated|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	135
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	136	l := eachLine withoutSeparators.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	137	l notEmptyOrNil ifTrue:[
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	138	isHyphenated := (l endsWith:'-')
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	139	and:[ l size > 1
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	140	and:[ (l at:(l size-1)) isLetter ]].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	141	isHyphenated ifFalse:[
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	142	partialLine := (partialLine ? '') , l.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	143	lines add:partialLine.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	144	partialLine := nil.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	145	] ifTrue:[
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	146	l := l copyButLast.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	147	partialLine := (partialLine ? '') , l.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	148	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	149	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	150	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	151	partialLine notEmptyOrNil ifTrue:[
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	152	lines add:partialLine
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	153	].
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	154	^ lines
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	155	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	156
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	157	tokenProbabilityOf:token inCategory:category
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	158	"Calculate probability that a `token` belongs to a `category`"
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	159
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	160	|wordFrequencyCount wordCount prob|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	161
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	162	wordFrequencyCount := (wordFrequencyCounts at:category) at:token ifAbsent:0.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	163	wordCount := wordCounts at:category.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	164
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	165	"/ use Laplace add-one smoothing
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	166	prob :=( wordFrequencyCount + 1 ) / ( wordCount + vocabulary size ).
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	167	prob := prob asFloat.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	168	Transcript showCR:(' P(%1, %2) = %3' bindWith:token with:category with:prob).
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	169	^ prob
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	170	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	171
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	172	tokenize:string
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	173	|rawLines lines allWords|
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	174
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	175	rawLines := string asCollectionOfLines.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	176	lines := self dehyphenate:rawLines.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	177	allWords := self collectWords:lines.
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	178	^ allWords
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	179	! !
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	180
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	181	!TextClassifier class methodsFor:'documentation'!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	182
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	183	version
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	184	^ '$Header$'
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	185	!
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	186
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	187	version_CVS
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	188	^ '$Header$'
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	189	! !
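
The scoring performed by classify: together with tokenProbabilityOf:inCategory: can be summarized as a single decision rule. The symbols below are introduced here for illustration only and do not appear in the source: C stands for categories, T for the distinct tokens of the input text, f_t for frequencyInText, n_{t,c} for the entry of wordFrequencyCounts for token t in category c (0 if absent), N_c for the entry of wordCounts for c, |V| for vocabulary size, and P(c) for the category prior derived from docCounts.

```latex
\hat{c} \;=\; \arg\max_{c \in C}
  \left[ \log P(c) \;+\; \sum_{t \in T} f_t \,
         \log \frac{n_{t,c} + 1}{N_c + |V|} \right]
```

Summing logarithms instead of multiplying the probabilities avoids floating-point underflow on long texts, and the added 1 in the numerator keeps unseen tokens from driving the score to negative infinity.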
a03fb375c047 initial checkin Claus Gittinger <cg@exept.de> parents: diff changeset	190
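
A minimal usage sketch, assuming the class above has been loaded into a Smalltalk/X image. It uses only the selectors defined in this file; the training sentences and the category names 'sports' and 'politics' are hypothetical and chosen purely for illustration.

```smalltalk
|classifier category|

"train with one hypothetical document per category"
classifier := TextClassifier new.
classifier classify:'the team won the match' asCategory:'sports'.
classifier classify:'the senate passed the bill' asCategory:'politics'.

"ask for the most probable category of an unseen text"
category := classifier classify:'who won the match'.
Transcript showCR:category.
```

Note that classify: also writes the per-token and per-category values to the Transcript while it runs, so the chosen category appears after that diagnostic output.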

author	Claus Gittinger <cg@exept.de>
	Wed, 06 Jan 2016 01:42:18 +0100
changeset 3678	a03fb375c047
child 3682	1629a0dc2875
permissions	-rw-r--r--