Week 6 Day 2: Starting Development of the NeWiktionary Analyzer

Today I started building the Nepali Wiktionary Analyzer (NeWiktionary Analyzer). The input is a set of text files, each containing the Wiktionary entry for one Nepali word. The first step is to parse each entry's contents and develop general NLP++ rules that turn them into structured data. Today I parsed the text files line by line, cleaned the data by removing whitespace, and then wrote rules to label the headings of each part of a word entry, such as pronunciation, part of speech, and definition. I added definition, header, and headerZone passes to the analyzer sequence, each handling its corresponding content; for example, the definition pass finds definitions and puts them under a node called “definition” based on its rule.

For reference, here is the content of one of the text files; the ==...== headings mark the entry's sections (language, pronunciation, part of speech, meaning, example, synonyms, derived terms, and translations):

==नेपाली==
जवाफ

==उच्चारण==
[javāpha], [javaapha]   

==पदवर्ग==
नाम

===अर्थ १===
# [[उत्तर]]; उत्तर पक्ष; [[उत्तरा]]; [[प्रतिवचन]]

====उदाहरण==== 
# श्यामले आफ्नो [[प्रियसी]]लाई छोडेर जाने कारण सोध्दा प्रियसीले केहि [[जवाफ]] दीनन। 

==समानार्थी शब्द== 
[[उत्तर]], [[प्रतिक्रिया]], 

==व्युत्पन्न सर्तहरू== 
[[जवाफदेही]], [[जवाफ माग्नु]], [[जवाफ दिनु]]
 
==अनुवाद== 
अङ्ग्रेजी: [[answer]], [[reply]], [[response]]

I first imported two library passes, “lines.nlp” and “whitespace.nlp”, which parse the file line by line and remove whitespace, respectively: lines.nlp reduces each physical line to a _LINE node (and each blank line to a _BLANKLINE node), and whitespace.nlp then excises the whitespace tokens inside each _LINE. The rules are as follows:

Rule to parse line by line:

@NODES _ROOT

@RULES

_BLANKLINE <-
    _xWILD [min=0 max=0 matches=(\  \t \r)] ### (1)
    \n ### (2)
    @@

_LINE <-
    _xWILD [min=0 max=0 fails=(\r \n)] ### (1)
    _xWILD [one match=(\n _xEND)] ### (2)
    @@
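(In NLP++, max=0 on a wildcard means no upper bound, so element (1) of _BLANKLINE matches any run of spaces, tabs, and carriage returns before the newline, and element (1) of _LINE matches everything up to the line break.)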

Rule to remove whitespace:

@NODES _LINE

@POST
    excise(1,1);
    noop();

@RULES

_xNIL <-
    _xWHITE [s] ### (1)
    @@
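
After these two passes, each physical line of the entry is reduced to a _LINE node under _ROOT (blank lines become _BLANKLINE), and the whitespace inside each line is excised. For the first few lines of the sample entry, the parse tree should look roughly like this:

_ROOT
	_LINE        ==नेपाली==
	_LINE        जवाफ
	_BLANKLINE
	_LINE        ==उच्चारण==
	_LINE        [javāpha], [javaapha]
	_BLANKLINE
	...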

Then I wrote the following rule to parse the headers: any _LINE of the form ==...== is reduced to a _header node, and the header text captured by element (3), the wildcard between the equals signs, is stored in a “header” variable on the enclosing _LINE node. For the line ==उच्चारण==, for example, that variable gets the value उच्चारण:

@NODES _LINE

@POST
X("header",2) = N("$text",3);
single();

@RULES
_header <-
	_xSTART						### (1)
	_xWILD [plus match=(\=)]	### (2)
	_xWILD [plus fail=(\=)]		### (3)
	_xWILD [plus match=(\=)]	### (4)
	_xEND						### (5)
	@@
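
Run over the sample entry above, this pass should mark the lines for नेपाली, उच्चारण, पदवर्ग, अर्थ १, उदाहरण, समानार्थी शब्द, व्युत्पन्न सर्तहरू, and अनुवाद as headers.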

Similarly, I added another pass to the analyzer sequence, headerZone, which groups each header line together with the non-header lines that follow it into a _headerZone node and copies the header name onto that node:

@NODES _ROOT

@PRE
<1,1> var("header");
<2,2> varz("header");

@POST
S("header") = N("header",1);
single();

@RULES
_headerZone <-
	_LINE	### (1)
	_xWILD [plus]	### (2)
	@@
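
The definition pass follows the same pattern. As a rough sketch of the idea, a rule that picks up the numbered # lines and stores their text on a _definition node could look like the following; this is only an illustration, not the exact rule in the analyzer:

@NODES _LINE

@POST
S("definition") = N("$text",3);
single();

@RULES

# Sketch: reduce a line of the form "# ..." to a _definition node,
# keeping the text after the leading # in a "definition" variable.
_definition <-
	_xSTART						### (1)
	\#							### (2)
	_xWILD [plus]				### (3)
	_xEND						### (4)
	@@

A plain rule like this would also match the उदाहरण (example) line, which starts with # as well, so the surrounding _headerZone's header (अर्थ versus उदाहरण) has to be taken into account to tell definitions and examples apart.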
