Pali language tools



 


Pali word list generator

A simple tool that generates a list of Pali words from a Pali text. The Pali text file has to be Unicode-encoded (you may want to use Frank Snow's conversion utility). The Pali word list generator extracts all words and creates a word list in which each unique word is listed once. The following parameters customize the output:

syntax:

 

pwlg.exe "PATH / PaliTextFile.txt" [roman|indian] [lexical|grammatical] [count]

 

parameter explanation:

roman    :    (optional) the generated word list is sorted alphabetically according to the Latin alphabet (a, b, c, etc.)

indian    :    (default) the generated word list is sorted alphabetically according to the Indian alphabet (a, aa, i, ii, u, uu, etc.)

lexical    :    (default) sorts according to word initials

grammatical    :    (optional) sorts according to word suffixes

count    :    (optional) creates a numerically ordered list according to word frequency (the number of occurrences of each unique word in the text)

 

example:

pwlg.exe "C:/majjhimaNikaya.txt" indian lexical

(ENTER)

...will produce a file named "majjhimaNikaya_wordList.txt"

abaddho

ābādhā

ābādhikaṃ

ābādhiko

abbhācikkhanaṃ

abbhācikkhanti

abbhācikkhasi

abbhācikkhati

abbhācikkhi

abbhakkhānaṃ

abbhaaṃsu

...etc.
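The behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the pwlg.exe source: the function names, the lower-casing of words and the tokenising regular expression are assumptions, and the sort uses plain code point order rather than the roman or indian orderings.

```python
import re
import unicodedata
from collections import Counter

# Runs of Unicode letters (no digits, no underscore). Assumed tokenisation rule.
WORD_RE = re.compile(r"[^\W\d_]+")

def extract_words(text):
    """Return every word of the text, NFC-normalised and lower-cased."""
    text = unicodedata.normalize("NFC", text)
    return [w.lower() for w in WORD_RE.findall(text)]

def word_list(text, grammatical=False, count=False):
    """Build a unique word list, loosely mirroring the options above.

    grammatical=True sorts by word endings (the reversed word is the key);
    count=True returns (word, frequency) pairs, most frequent first.
    """
    words = extract_words(text)
    if count:
        freq = Counter(words)
        # Ties in frequency are broken alphabetically (an assumption).
        return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    unique = set(words)
    if grammatical:
        return sorted(unique, key=lambda w: w[::-1])
    return sorted(unique)

if __name__ == "__main__":
    sample = "Gacchati buddho. gacchati dhammo, buddho!"
    print(word_list(sample))
    print(word_list(sample, grammatical=True))
    print(word_list(sample, count=True))
```

Sorting by the reversed word groups words with the same ending together, which is one plausible reading of the "grammatical" option.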

 

download (Windows XP/2000/9.x - source code included):

pwlg.exe (This program is based on the .NET Framework 1.1. If it is not already installed on your machine, make sure to download and install it before running this utility. You can download the .NET Framework from here (dotnetfx.exe) or here (Windows update service).)



Buddhist Calendar Library

This C# library implements a Calendar Class for traditional Theravada Buddhist Dates based on an astronomical lunar algorithm. 

Traditionally, the Buddhist Era (BE) starts in 543 BC. The new year begins at every Vesakh full moon. The year has 12 lunar months. Each month is divided into two fortnights (sukka- and kanhapakkhe, the light and dark fortnight). Days are numbered from 1 to the end of each fortnight (the 14th or 15th).

The Buddhist equivalent date for the Gregorian date:

2005 - 01 - 29 CE

will thus be

2548 - 02 - 1 - 04 BE

...meaning: in the year 2548 after the Blessed One's Parinibbana, in the second lunar month, on the 4th day of the dark fortnight.
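To make the four fields of this notation concrete, here is a minimal Python sketch that parses such a date string and spells it out. It is not part of the C# library and does no astronomical calculation. The names BuddhistDate, parse_buddhist_date and describe are hypothetical, and the mapping 0 = light fortnight is an assumption (the example above only shows that 1 stands for the dark fortnight).

```python
from typing import NamedTuple

class BuddhistDate(NamedTuple):
    year: int        # Buddhist Era year
    month: int       # lunar month, 1 to 12
    fortnight: int   # 1 = dark fortnight (from the example); 0 = light is assumed
    day: int         # day within the fortnight, 1 to 15

FORTNIGHT_NAMES = {0: "light", 1: "dark"}

def parse_buddhist_date(text):
    """Parse a string such as '2548 - 02 - 1 - 04 BE' into its four fields."""
    body = text.strip()
    if body.upper().endswith("BE"):
        body = body[:-2]
    parts = [int(p) for p in body.split("-")]  # int() ignores surrounding spaces
    if len(parts) != 4:
        raise ValueError("expected year - month - fortnight - day")
    date = BuddhistDate(*parts)
    if not 1 <= date.month <= 12:
        raise ValueError("month must be between 1 and 12")
    if date.fortnight not in FORTNIGHT_NAMES:
        raise ValueError("fortnight must be 0 or 1")
    if not 1 <= date.day <= 15:
        raise ValueError("day must be between 1 and 15")
    return date

def describe(date):
    """Spell the date out in words."""
    name = FORTNIGHT_NAMES[date.fortnight]
    return (f"year {date.year} BE, lunar month {date.month}, "
            f"day {date.day} of the {name} fortnight")

if __name__ == "__main__":
    print(describe(parse_buddhist_date("2548 - 02 - 1 - 04 BE")))
```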

A more detailed explanation of the traditional Buddhist calendar can be found here.

The library can be included in any .NET project for private and commercial use. License included.

The download comes with a sample program and source code.

BuddhistCalendarLibrary (This library is based on the .NET Framework 1.1. If it is not already installed on your machine, make sure to download and install it before using this library. You can download the .NET Framework from here (dotnetfx.exe) or here (Windows update service).)



Buddhaghoso - Listen To the Sound of Awakening

This is a small C# program which serves devotional / inspirational purposes only :-)

Basically it is like a "Buddhist" slideshow: it reads aloud (text to speech!), line by line and at an adjustable interval, a paragraph from the Dhammapada. (The text actually comes from a text file in which you can put any text you want, but I initially copied some lines from the Dhammapada.)

Here is a screenshot, even if you can't hear the voice...

Very inspirational :-)

 

Download Buddhaghoso here

(This program is based on the .NET Framework 1.1. If it is not already installed on your machine, make sure to download and install it before running this utility. You can download the .NET Framework from here (dotnetfx.exe) or here (Windows update service).)



Pali Canon - complete word list

This list of over 900 000 unique items was a by-product of the Pali Text Reader's indexing test / implementation.

You are free to download and use this word list for whatever purpose. Even the zipped version is 4.5 MB; uncompressed, it is about 32 MB.

Download the GlobalPaliWordList here

Update: Thanks to Alan McClure, I also uploaded the list with a frequency count for every unique Pali word. Always wanted to know how often "gacchati" appears in the canon? Have a look:

FrequencyGlobalPaliWordList

By the way: "unique" in this case means "syntactically" unique, not "grammatically" unique. Thus, "gacchati" and "gacchatiti" appear as two different entries. And don't mind the file's extension ".widx": just open it with WordPad or any other Unicode-enabled text editor that can handle 30 MB of text.

For anyone who wants to play around with the code, I uploaded the Offline Aalekh Encoding tool as well. It is part of the Pali Text Reader's toolset (and far from bug-free). It is written in C#, by the way:

AalekhConverterTools_Source (as is)

(This program is based on the .NET Framework 1.1. If it is not already installed on your machine, make sure to download and install it before running this utility. You can download the .NET Framework from here (dotnetfx.exe) or here (Windows update service).)

 



Pali Canon - complete word list (sorted according to the Pali alphabet)

This word list (again containing all unique Pali terms) has been sorted alphabetically. This list, too, is a by-product of ongoing efforts to improve the Pali Text Reader. The implementation of an IComparer for the Pali alphabet soon resulted in this interesting word list.

Pali-alphabetically sorted word list: download here

 


Pali Language IComparer Implementation

The following link leads to the PaliComparer, an IComparer implementation for the .NET Framework that sorts lists (ArrayList, SortedList) according to the Pali alphabet. (In case you wonder: the "sa-IN" or "sa" culture info implementation that comes with the .NET environment does NOT sort alphabetically according to Sanskrit or Indian standards, at least not if you are using default Unicode characters. So this class should provide the missing functionality.)

PaliComparer.cs


Reverse sorted Pali word list

It was only a question of time for this list to come alive :-) For further ideas on using this kind of list, have a look at the Pali Group, especially message #10057.

Download: reverse sorted Pali Canon word list

Responding to a request by Harry Liew (message #10062) to extract all verbs from this list, I propose the following listing as a first approach, given the difficulty of distinguishing verb forms from all other words. The list is based on the future tense, in which almost all Pali verbs will (hopefully) occur at least once within the pitakas. As the future tense in "-issati" is quite easy to recover, the list contains all words ending in -issati:

Download: comprehensive list of future tense verbs found in the Pali Canon (704 words)
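Both lists in this section follow from simple string operations, sketched here in Python. The function names are hypothetical, and this is not the code that produced the downloads.

```python
def reverse_sorted(words):
    """Sort unique words by their endings: the reversed word is the sort key."""
    return sorted(set(words), key=lambda w: w[::-1])

def future_tense_candidates(words, ending="issati"):
    """Return the unique words that end in the given suffix, sorted."""
    return sorted({w for w in words if w.endswith(ending)})

if __name__ == "__main__":
    sample = ["gamissati", "gacchati", "karissati", "buddho", "dhammo"]
    print(reverse_sorted(sample))
    print(future_tense_candidates(sample))
```

In the reverse-sorted output all words sharing an ending, such as -issati, sit next to each other, which is what makes this kind of list useful for spotting inflected forms.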


Pali Language - complete frequency word list of unique words

This is a *complete* word list of all Pali words (about 967,000) occurring in the CSCD (VRI) Tipitaka edition. It contains all unique words (tokens) and is sorted by frequency, descending. It is a *by-product* of PTR's efforts towards a machine translation plugin.

Download: frequency sorted Pali Canon word list




Pali Translator - Proof of Concept

This is a simple and quite crude attempt at using a binary-search dictionary lookup to create an automatic translation tool from Pali to English. It does not break up compounds and lacks any sophisticated grammatical analysis. It just looks up words in the dictionary and outputs them as a list as well as an interlinear-style HTML page.

Download: Pali Translator Tool

Screenshots: Screenshot 1

Screenshot 2
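A binary-search dictionary lookup of the kind described here might look like the following Python sketch. It is not the tool's source code. The three-entry dictionary is a toy example, and the function names are hypothetical.

```python
from bisect import bisect_left

def build_dictionary(entries):
    """Split a {headword: gloss} mapping into two parallel, sorted lists."""
    pairs = sorted(entries.items())
    return [p[0] for p in pairs], [p[1] for p in pairs]

def lookup(headwords, glosses, word):
    """Binary-search the sorted headwords; return the gloss or None."""
    i = bisect_left(headwords, word)
    if i < len(headwords) and headwords[i] == word:
        return glosses[i]
    return None

def gloss_line(headwords, glosses, line):
    """Word-by-word gloss of a line; unknown words are marked with '?'."""
    return [(w, lookup(headwords, glosses, w.lower()) or "?")
            for w in line.split()]

if __name__ == "__main__":
    toy = {"buddho": "the awakened one", "dhammo": "the teaching",
           "gacchati": "goes"}
    heads, glosses = build_dictionary(toy)
    print(gloss_line(heads, glosses, "Buddho gacchati vanaṃ"))
```

As in the tool itself, nothing here splits compounds or analyses inflection, so any form not listed verbatim in the dictionary stays untranslated.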




Semantic closeness of canonical (Tipitaka) and post canonical books in percentages

Below you will find an Excel report produced by a little program I wrote (source available below as well). The concept is pretty simple: Pali texts from similar text strata often show a remarkable closeness in the vocabulary they share. The 4 Nikayas are very similar in style and language but at the same time very different from the Commentaries or Abhidhamma texts.

Based on that fact, this little program extracted a-declension nominative forms from all 220 books available in the VRI Tipitaka edition and compared them against each other (about 48,000 combinations).

I sorted the resulting table by percentage and uploaded it as well (see below). Of course, this has to be taken with caution and is very crude, as we are comparing just one characteristic (nom. sing. a-decl.). However, because this test is applied to the entire range of texts, we can still use the percentages as a crude indicator of proximity. For instance, you will see that the 4 Nikayas share a high percentage of similarity, as expected. We can also see that parts of the AN match the Puggalapannatti, or observe the closeness between Nettipakarana and Petakopadesa. From here we can go through the list and discover interesting relationships which may not have been that obvious.
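The comparison can be sketched in Python as follows. This is not the program offered for download below. Two assumptions are made: a word ending in "-o" is taken as an a-declension nominative singular (a crude heuristic), and the percentage is the share of one book's forms that also occur in the other book, computed for every ordered pair of books. The book names and texts in the example are made up.

```python
import re
from itertools import permutations

# Runs of Unicode letters. Assumed tokenisation rule.
WORD_RE = re.compile(r"[^\W\d_]+")

def nominative_forms(text):
    """Collect the unique words ending in -o (assumed nom. sing., a-declension)."""
    return {w.lower() for w in WORD_RE.findall(text) if w.lower().endswith("o")}

def proximity(forms_a, forms_b):
    """Percentage of the forms of book A that also occur in book B."""
    if not forms_a:
        return 0.0
    return 100.0 * len(forms_a & forms_b) / len(forms_a)

def proximity_table(books):
    """Compare every ordered pair of books; highest percentage first."""
    forms = {name: nominative_forms(text) for name, text in books.items()}
    rows = [(a, b, proximity(forms[a], forms[b]))
            for a, b in permutations(forms, 2)]
    return sorted(rows, key=lambda row: -row[2])

if __name__ == "__main__":
    books = {
        "A": "buddho dhammo saṅgho gacchati",
        "B": "buddho dhammo loko",
        "C": "puriso",
    }
    for a, b, pct in proximity_table(books):
        print(f"{a} -> {b}: {pct:.1f}%")
```

For 220 books, comparing every ordered pair gives 220 × 219 = 48,180 rows, which is in line with the roughly 48,000 combinations mentioned above.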

Download:

TipitakaSemanticProximity sorted by Frequency

TipitakaSemanticProximity sorted by Book

Source-Code: TipitakaSemanticProximity_SourceCode.zip


