Today I started building the Nepali Wiktionary Analyzer (NeWiktionary Analyzer). Its input is a set of text files, each holding one Nepali word entry from Wiktionary. As a first step, I am parsing the entry contents and developing general rules in NLP++ to turn them into structured data. Today I parsed the text files line by line and cleaned the data by removing whitespace. I then wrote rules to pick out the heading of each part of an entry, such as pronunciation, part of speech, and definition. I created passes named definition, header, and headerZone in the analyzer sequence, each handling its own content; for example, the definition pass finds definitions and places them under a node called “definition” according to its rule.
For reference, the content of one of the text files is as follows:
==नेपाली==
जवाफ
==उच्चारण==
[javāpha], [javaapha]
==पदवर्ग==
नाम
===अर्थ १===
# [[उत्तर]]; उत्तर पक्ष; [[उत्तरा]]; [[प्रतिवचन]]
====उदाहरण====
# श्यामले आफ्नो [[प्रियसी]]लाई छोडेर जाने कारण सोध्दा प्रियसीले केहि [[जवाफ]] दीनन।
==समानार्थी शब्द==
[[उत्तर]], [[प्रतिक्रिया]],
==व्युत्पन्न सर्तहरू==
[[जवाफदेही]], [[जवाफ माग्नु]], [[जवाफ दिनु]]
==अनुवाद==
अङ्ग्रेजी: [[answer]], [[reply]], [[response]]
I first imported two library passes, “lines.nlp” and “whitespace.nlp”, which split the file into lines and remove whitespace, respectively. Their rules are as follows:
Rule to parse line by line:
@NODES _ROOT
@RULES
_BLANKLINE <-
_xWILD [min=0 max=0 matches=(\ \t \r)] ### (1)
\n ### (2)
@@
_LINE <-
_xWILD [min=0 max=0 fails=(\r \n)] ### (1)
_xWILD [one match=(\n _xEND)] ### (2)
@@
Rule to remove whitespace:
@NODES _LINE
@POST
excise(1,1);
noop();
@RULES
_xNIL <-
_xWHITE [s] ### (1)
@@
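For readers who do not know NLP++, the combined effect of these two library passes can be approximated in Python. This is only a rough sketch of the behaviour, not how the NLP++ engine works: the function names `split_lines` and `remove_whitespace` are hypothetical, and I am assuming that the whitespace pass drops every whitespace token inside a line, including spaces between words.

```python
def split_lines(text):
    """Split text on newlines and tag each piece.

    Pieces holding only spaces, tabs, or carriage returns are tagged
    'BLANKLINE'; everything else is tagged 'LINE'.
    """
    pieces = text.split("\n")
    if pieces and pieces[-1] == "":
        pieces.pop()  # nothing follows the final newline
    tagged = []
    for piece in pieces:
        if piece.strip(" \t\r") == "":
            tagged.append(("BLANKLINE", ""))
        else:
            tagged.append(("LINE", piece))
    return tagged


def remove_whitespace(tagged):
    """Remove all whitespace characters from the content of each line."""
    return [(kind, "".join(content.split())) for kind, content in tagged]
```

Note that under this assumption a multi-word heading loses its internal space as well, which may matter if the heading text is later compared against a fixed list.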
Then I wrote the following rule, which matches header lines. When a line fits the rule pattern, it is reduced to a “_header” node and the heading text is stored in a “header” variable:
@NODES _LINE
@POST
X("header",2) = N("$text",3);
single();
@RULES
_header <-
_xSTART ### (1)
_xWILD [plus match=(\=)] ### (2)
_xWILD [plus fail=(\=)] ### (3)
_xWILD [plus match=(\=)] ### (4)
_xEND ### (5)
@@
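The pattern in this rule (start of line, one or more `=`, one or more non-`=` tokens, one or more `=`, end of line) reads much like a regular expression. As an illustration only, here is a Python sketch of the same test; `HEADER_RE` and `header_text` are hypothetical names, and the sketch works on characters rather than on NLP++ tokens.

```python
import re

# One or more '=', then heading text without '=', then one or more '='.
HEADER_RE = re.compile(r"(=+)([^=]+)(=+)")


def header_text(line):
    """Return the heading text if the whole line is a header, else None."""
    match = HEADER_RE.fullmatch(line)
    return match.group(2) if match else None
```

Like the rule, this sketch does not check that the number of `=` signs is the same on both sides, so `==उच्चारण==` and `===अर्थ१===` are both accepted, while a plain line such as `नाम` is not.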
Similarly, I wrote another pass, headerZone, which groups each header line with the non-header lines that follow it and carries the heading of that part of the entry over to the new node:
@NODES _ROOT
@PRE
<1,1> var("header");
<2,2> varz("header");
@POST
S("header") = N("header",1);
single();
@RULES
_headerZone <-
_LINE ### (1)
_xWILD [plus] ### (2)
@@
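To make the grouping concrete, here is a small Python sketch of what I expect this pass to do: a line that carries a header, followed by one or more lines that do not, becomes one zone labelled with that header. This is my reading of the rule, not a description of the NLP++ matcher; `build_header_zones` and the `(text, header)` pair layout are assumptions made for the example.

```python
def build_header_zones(lines):
    """Group lines into zones.

    `lines` is a list of (text, header) pairs, where header is None
    for lines that are not headers. A zone is created only when a
    header line is followed by at least one non-header line.
    """
    zones = []
    i = 0
    while i < len(lines):
        _, header = lines[i]
        has_body = i + 1 < len(lines) and lines[i + 1][1] is None
        if header is not None and has_body:
            j = i + 1
            body = []
            # Collect every following line that has no header of its own.
            while j < len(lines) and lines[j][1] is None:
                body.append(lines[j][0])
                j += 1
            zones.append({"header": header, "body": body})
            i = j
        else:
            i += 1
    return zones
```

For the sample entry, this would yield one zone per heading, for example a zone with header `पदवर्ग` whose body is the single line `नाम`.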
