Week 7: NeWiktionary Analyzer Dev. Continuation, KB Dev. Initiation and Testing, and Planning for Remaining Weeks

Day 1:

I completed parsing the Wiktionary data by building all the required analyzer sequences, each matching a pattern of text in the Wiktionary word-entry format.

Day 2 & 3:

I started to build a knowledge base (KB) by adding @POST regions to the existing analyzer sequences and by building new analyzer sequences as needed. Among the sequences I had to build for the knowledge base are kbZones, kbPhonetics, kbPhonemics, kbDefText, kbVariations, kbExamples, kbDerivedTerms, and kbTranslations. These sequences organize all of a word’s content in the computer’s knowledge base much the way humans organize a word’s content in their heads. To illustrate, the following code builds the kbZones part of the knowledge base, covering sub-zones of a word such as pronunciation, synonyms, derived terms, and translations:

@NODES _ROOT

@POST
N("pronunciation") = makeconcept(G("word"),"pronunciation");

@RULES
_xNIL <-
	_pronunciations	### (1)
	@@

@POST
N("synonym") = makeconcept(G("word"),"synonym");

@RULES
_xNIL <-
	_synonyms	### (1)
	@@


@POST
N("derivedTerms") = makeconcept(G("word"),"derivedTerms");

@RULES
_xNIL <-
	_derivedTerms	### (1)
	@@

@POST
N("translation") = makeconcept(G("word"),"translation");

@RULES
_xNIL <-
	_translations	### (1)
	@@

In the above code, makeconcept creates a concept in the knowledge base for a particular zone of the word’s content, under which the text from each subzone will be added.
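Conceptually, the resulting KB is a tree: one concept per word, with a child concept per zone. As a rough analogy (my own sketch in Python, not NLP++ or the actual KB implementation), the structure that makeconcept builds can be pictured like this:

```python
# Hypothetical stand-in for the KB concept tree: each concept is a dict
# mapping child-concept names to child dicts.
def makeconcept(parent, name):
    """Create (or fetch) a child concept under parent, analogous to
    NLP++'s makeconcept."""
    return parent.setdefault(name, {})

word = {}  # the concept for one word entry
for zone in ("pronunciation", "synonym", "derivedTerms", "translation"):
    makeconcept(word, zone)

print(sorted(word))
# prints: ['derivedTerms', 'pronunciation', 'synonym', 'translation']
```

Each zone concept then receives the text parsed from the corresponding subzone of the Wiktionary entry.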

I also learned to use MakeCountCon, which counts all the entries of a particular concept. For example, if a word has more than one part of speech, or multiple definitions in the def zone, MakeCountCon counts them all and posts them under the same zone (the pos zone or the def zone, respectively). The following code shows MakeCountCon in use:

@NODES _posZone

@POST
N("con",1) = MakeCountCon(X("con"),"definition");

@RULES
_xNIL <-
	_defZone	### (1)
	@@

@PATH  _ROOT _posZone _defZone _definition _headerZone _LINE _item

@POST
L("con") = MakeCountCon(X("con",3),"variation");
addstrval(L("con"),"text",N("$text",2));
singler(2,2);

@RULES
_variation <-
	\;								### (1)
	_xWILD [plus fail=(\; । _xEND)]	### (2)
	@@

In the above code block, variations are parsed and added to the knowledge base using NLP++. We have defined variations as different forms of a definition, that is, different ways to express the same meaning. In our knowledge base, variations are the parts of the definition text separated by “;”, and a definition can have one or more of them, so I used MakeCountCon to count all the variation concepts and post them in their respective zone.
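The variation-splitting logic can be sketched outside NLP++ as well. The following Python snippet (a simplified illustration of the idea, not the analyzer’s code) splits a definition string on “;” and gives each piece a counted concept name, analogous to what MakeCountCon produces:

```python
def split_variations(def_text):
    """Split a definition string on ';' into numbered variation entries,
    mimicking counted concept names (variation1, variation2, ...)."""
    parts = [p.strip() for p in def_text.split(";") if p.strip()]
    return {f"variation{i}": text for i, text in enumerate(parts, start=1)}

variations = split_variations("a domesticated canine; a contemptible person")
# variations == {"variation1": "a domesticated canine",
#                "variation2": "a contemptible person"}
```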

In this way, I build all the knowledge base analyzer sequences and could add all the content of word in it.

Day 4 & 5

On the fourth day, I ran the newly developed analyzer (containing the analyzer sequences that parse word content and zones, along with the knowledge base analyzer sequences) on all the input .txt files of words, to test whether it works for all kinds of input data. While running the analyzer, I found some bugs: words having only a single phonetic/phonemic entry, or more than one, were not appearing in the knowledge base output; the definition count was wrong for many words; and [[ ]] markup was not removed from many words’ defText and variations. In Wiktionary, this markup creates a link that displays the word in either red or green text (red means the word has not been added to Wiktionary yet, whereas green means it has been added and can be opened to view its content).
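One way to remove the [[ ]] wiki-link markup while keeping the link’s display text is a regular expression. This Python sketch is my own illustration of the cleanup (not the fix used in the analyzer); it handles both plain links like [[dog]] and piped links like [[canine|dog]]:

```python
import re

def strip_wiki_links(text):
    """Replace [[target]] with 'target' and [[target|display]] with 'display'."""
    return re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)

print(strip_wiki_links("a [[domesticated]] [[canine|dog]]"))
# prints: a domesticated dog
```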

Planning for Remaining Weeks

I had a meeting with my mentor, David, to plan and discuss the work to accomplish in the remaining weeks. During that meeting, we discussed our target goal, which is to build a dictionary in NLP++. For this, I will first have to take the NeWiktionary analyzer I have built and run it on HPCC Systems, then design a record structure in database format containing the word, pronunciation, parts of speech, definitions, examples, derived terms, synonyms, and translations in English. To do this, I have to install Ubuntu on Windows, run HPCC Systems on it, and finally build a dictionary for NLP++ from the knowledge base of Wiktionary data.
