SVMTrainer is an experimental project in machine learning. It tackles the task of text categorization, and asks whether a computer can be trained to recognize a given topic by teaching itself through web searches. In its current form, SVMTrainer is powered by the Yahoo! search API. For more information, browse the SVMTrainer category on the blog.
New! SVMTrainer 0.30 is now available online! Feel free to experiment with the application. Be warned: there are still many optimizations to be made for both speed and accuracy. I encourage you to try making those optimizations yourself, and if you do, please let me know how they work! Important! You must get your own Yahoo! Search API application key; mine is not included in the source!
Download: My apologies, my university account has expired. I’m currently tracking down these files and will have a new link up soon.
The same goes for the documentation.
Here are a few things you should know about the source:
- The first thing you need to do is drop your Yahoo! application ID into Searcher.java.
- Next, open up SVMSetGenerator.java and modify main() to set your search terms, number of results, etc. Then run SVMSetGenerator and watch your computer learn about whatever topic you gave it!
- If you want to implement a more complex search behavior, try extending the Searcher class.
- If you want to control the internal representation of words, or ignore certain words, write a new class that implements WordFilter.
- If you want to ignore whole sections of a document, write a new class that extends DocumentParser.
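To give a feel for the WordFilter idea, here is a minimal sketch of a filter that normalizes words and drops stop words. It is not taken from the SVMTrainer source: the `WordFilterSketch` interface, its `filter` method, and the stop-word list are all hypothetical stand-ins, so check the real WordFilter interface for the methods it actually declares.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for SVMTrainer's WordFilter; the real interface
// in the source may declare different methods.
interface WordFilterSketch {
    // Returns the internal form of a word, or null if the word should be ignored.
    String filter(String word);
}

// Lowercases each word, strips everything except the letters a-z,
// and drops a small set of common stop words.
final class StopWordFilter implements WordFilterSketch {
    // Assumed stop-word list, chosen only for illustration.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    @Override
    public String filter(String word) {
        String normalized = word.toLowerCase(Locale.ROOT).replaceAll("[^a-z]", "");
        if (normalized.isEmpty() || STOP_WORDS.contains(normalized)) {
            return null; // ignored words never reach the lexicon
        }
        return normalized;
    }

    public static void main(String[] args) {
        WordFilterSketch filter = new StopWordFilter();
        for (String word : new String[] {"The", "Kernel,", "of", "SVMs"}) {
            System.out.println(word + " -> " + filter.filter(word));
        }
    }
}
```

Collapsing case and punctuation keeps the lexicon small, which matters because every distinct word becomes its own feature in the training set.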
Finally, once you’ve generated a training set, it can be used with SVMlight, by Thorsten Joachims of Cornell University. SVMlight will generate a model file. With the model file, the lexicon generated by SVMSetGenerator, and a DocumentParser in conversion mode, you can have your newly trained support vector machine categorize text for you!
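For reference, SVMlight reads one example per line: a target label (+1 or -1) followed by `feature:value` pairs with feature numbers in increasing order. The sketch below shows how a lexicon could turn a list of words into such a line. The class name, the on-the-fly lexicon, and the choice of raw word counts as feature values are assumptions made for illustration; SVMSetGenerator may well number and weight its features differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the SVMTrainer source.
final class SvmLightLineSketch {
    // Lexicon: each new word gets the next feature id, starting at 1.
    private final Map<String, Integer> lexicon = new LinkedHashMap<>();

    // Looks up a word's feature id, adding the word to the lexicon if needed.
    int featureId(String word) {
        Integer id = lexicon.get(word);
        if (id == null) {
            id = lexicon.size() + 1;
            lexicon.put(word, id);
        }
        return id;
    }

    // Formats one document as "<label> <id>:<count> ..." with ids ascending.
    String toLine(boolean onTopic, String[] words) {
        // TreeMap keeps feature ids sorted, as SVMlight requires.
        TreeMap<Integer, Integer> counts = new TreeMap<>();
        for (String word : words) {
            counts.merge(featureId(word), 1, Integer::sum);
        }
        StringBuilder line = new StringBuilder(onTopic ? "+1" : "-1");
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            line.append(' ').append(entry.getKey()).append(':').append(entry.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        SvmLightLineSketch sketch = new SvmLightLineSketch();
        // prints "+1 1:2 2:1"
        System.out.println(sketch.toLine(true, new String[] {"kernel", "margin", "kernel"}));
        // prints "-1 2:1 3:1"
        System.out.println(sketch.toLine(false, new String[] {"recipe", "margin"}));
    }
}
```

Assuming the lines are saved to a file such as train.dat (a placeholder name), running `svm_learn train.dat model` produces the model file, and `svm_classify test.dat model predictions` applies it to new examples written in the same format.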