Open Source & stuff 

Salix.gr

Greek stemmer class

After heavy googling I found a Greek stemmer, the product of the master's thesis of Georgios Ntais at the Royal Institute of Technology [KTH] (Stockholm, Sweden), supervised by Assoc. Professor Hercules Dalianis. The stemmer is implemented in JavaScript and you can find it online here.

You can download the PHP port, published under the GNU license. You can check out the first demo page. It is a PHP 4 compatible class. Usage is really simple: just create an instance of the class and call the stem_word method. The input for stem_word must be in upper case.

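// Load the stemmer class and create an instance.
require("GreekStemmer.class.php");
$st = new GreekStemmer();
// Pass the word in upper case; stem_word returns its stem.
echo $st->stem_word('????????');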

The companion class, Greek_text, is (I hope) stable. It contains some helpful static methods for handling Greek text:

  • a stopword filter for Greek, based on a list found in lecture slides by Marios Dikaiakos and Georgios Pallis
  • a to_upper function that works with any locale setting
  • a to_greeklish function, used by the next one
  • a titlize function that makes Greek titles readable in Latin characters, for nice URLs
  • a splitWords function, a replacement for PHP's str_word_count that works with any locale setting (this one still troubles me and needs some improvements)
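
To illustrate how to_greeklish and titlize could fit together, here is a minimal sketch. It is not the code from the Greek_text class: the function names sketch_to_greeklish and sketch_titlize, the letter mapping, and the use of UTF-8 with the mbstring extension are all assumptions made for this example (it also needs PHP 7 or later for the type hints).

```php
<?php
// Hypothetical helper, not the real Greek_text::to_greeklish.
// Transliterates Greek letters to Latin using a partial, illustrative map.
function sketch_to_greeklish(string $text): string
{
    $map = [
        'α' => 'a', 'β' => 'v', 'γ' => 'g', 'δ' => 'd', 'ε' => 'e',
        'ζ' => 'z', 'η' => 'i', 'θ' => 'th', 'ι' => 'i', 'κ' => 'k',
        'λ' => 'l', 'μ' => 'm', 'ν' => 'n', 'ξ' => 'x', 'ο' => 'o',
        'π' => 'p', 'ρ' => 'r', 'σ' => 's', 'ς' => 's', 'τ' => 't',
        'υ' => 'y', 'φ' => 'f', 'χ' => 'ch', 'ψ' => 'ps', 'ω' => 'o',
        'ά' => 'a', 'έ' => 'e', 'ή' => 'i', 'ί' => 'i', 'ό' => 'o',
        'ύ' => 'y', 'ώ' => 'o',
    ];
    // Lower-case first (assumes UTF-8 input), then replace letter by letter.
    return strtr(mb_strtolower($text, 'UTF-8'), $map);
}

// Hypothetical helper, not the real Greek_text::titlize.
// Turns a Greek title into a Latin, hyphen-separated string for URLs.
function sketch_titlize(string $title): string
{
    $latin = sketch_to_greeklish($title);
    // Collapse every run of characters outside a-z and 0-9 into one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $latin);
    return trim($slug, '-');
}

echo sketch_titlize('Ελληνικός τίτλος'), "\n"; // prints "ellinikos-titlos"
```

Letters missing from the map (for example those with a diaeresis) would simply be turned into hyphens here, so a real implementation needs a fuller table.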

More detailed documentation is coming soon...

Links
Greek stemmer class demo page
Greek text class demo page
Download GreekStemmer class version 1.0
Download Greek Text class version 1.0
New! Basos's improved version plus Lucene add-on: Download!

So I believe the system

So, as I understand it, the system takes a word as input and removes its inflectional suffix using a rule-based algorithm. The algorithm follows the well-known Porter algorithm for English, but is developed according to the grammatical rules of Modern Greek. Am I understanding this correctly?
A. D Singleton - adrikl[@]gmail.com
Language Translator - Millionaire Mind Book Europe

true

Panos Kyriakakis
Owner of Salix.gr
Larissa, Greece

Nicely done, while it has

EDIT: changed to Greeklish because Greek is not supported (it turns to ??????)

Nicely done. While it has some flaws, for the most part it works well.

What would be the correct stem for these, though? Shouldn't they both be the same?

cheking diminutives TEMAXIO -> step 6-2 TEMAXI
cheking diminutives TEMAXIA -> step 3 TEMAX

Also, there is a bug in step 4. The regex is:

$re = '/'.$v.'$/';

but $v doesn't exist in that scope; maybe it should be

$re = '/'.$this->v.'$/';

Otherwise the undefined $v evaluates to an empty string, the pattern becomes '/$/', and the conditional always matches. However, in that case the following don't stem properly:

APSENIKOS -> ARSENIK
APSENIKA -> ARSEN
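
The always-matching pattern can be reproduced in isolation. This is only a sketch of the mechanism, not the stemmer's code: the helper name ends_with_class is made up, and the Latin class '[AEIOU]' is a stand-in for whatever $this->v really contains.

```php
<?php
// Hypothetical helper: builds an end-anchored pattern from a character
// class, the same way the step 4 line concatenates its regex.
function ends_with_class(string $class, string $word): bool
{
    return preg_match('/' . $class . '$/', $word) === 1;
}

// An undefined $v contributes an empty string, so the pattern is '/$/'.
var_dump(ends_with_class('', 'TEMAX'));         // bool(true): matches anything

// With a real class (assumed stand-in for $this->v) the test is meaningful.
var_dump(ends_with_class('[AEIOU]', 'TEMAX'));  // bool(false)
var_dump(ends_with_class('[AEIOU]', 'TEMAXI')); // bool(true)
```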

Finally, it would be better to convert this to Snowball (http://snowball.tartarus.org/) rather than use regexes all over the place. A good suite of tests is also quite necessary.
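
As a rough idea of what such a test could look like, the sketch below checks that all inflected forms in a group share one stem. Everything in it is an assumption for illustration: the function name find_inconsistent_groups is invented, and the toy stemmer only replays the two TEMAXIO/TEMAXIA results quoted above. For real use, something like [$st, 'stem_word'] would be passed instead.

```php
<?php
// Hypothetical test helper: returns the groups whose words do not all
// reduce to the same stem, together with the distinct stems found.
function find_inconsistent_groups(callable $stem, array $groups): array
{
    $failures = [];
    foreach ($groups as $label => $words) {
        $stems = array_unique(array_map($stem, $words));
        if (count($stems) > 1) {
            $failures[$label] = array_values($stems);
        }
    }
    return $failures;
}

// Toy stand-in for the stemmer; it only knows the two quoted results.
$toyStemmer = function (string $word): string {
    $known = ['TEMAXIO' => 'TEMAXI', 'TEMAXIA' => 'TEMAX'];
    return $known[$word] ?? $word;
};

$failures = find_inconsistent_groups($toyStemmer, [
    'temaxio' => ['TEMAXIO', 'TEMAXIA'],
]);
print_r($failures); // lists the 'temaxio' group with both stems
```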

Greek stemmer and Lucene

@basos I would be very interested in any adaptation of a Greek stemmer for use with Lucene. Could you publish your code?

I am happy!

I am so happy; this is the first time this has happened to me :D
Someone took code from here, improved it, and shared it back here with us.
Baso, could you tell us where you are located?
This class is the result of the work of Greeks, and not just in Greece ;)
Best Regards
Panos Kyriakakis
Owner of Salix.gr
Larissa, Greece

New version

Hello,
I want to say that you did nice work; it is very important to have a port of the Greek stemmer for PHP.
I modified it a little to use the PHP5 OO model, since PHP4 is already considered ancient,
and to use UTF-8 encoding. It is very important, especially for Greek users, to use and promote Unicode (UTF-8) at all levels of application development. You change the code once and it will run on all machines.
I also corrected some minor errors and optimized it a little by removing unneeded double calls to preg_match.

I sent the code through the contact form to avoid overfilling this post.

My purpose for this code is to use it in a search engine. This piece of code is easily integrated into the Zend port of the well-known Lucene search engine (initially implemented for Java). It is a good PHP search engine, implemented in a very modular way, that can be integrated into various projects. Adopting the Greek stemmer there was fairly straightforward.
I have some small classes that make this work; if you are interested, I can send them too.

basos

Thanks for the Tip

I appreciate your sharing this piece of information. I'm from Ikaria. It's unfortunate that we have to resort to third parties to provide Greek FTS. Aren't there something like 50 languages already supported? Is Greek really that far down the line?


All Rights Reserved 2006-2010 Salix.gr | Hosting by e-emporio