Using Apache OpenNLP Document Categorizer
Document Categorizer is an interesting tool provided by Apache OpenNLP that lets you classify text into predefined categories of your choice. The tool does not ship with a pre-trained model, because the categories are subjective and depend on what each user wants. Anyone who intends to use the Document Categorizer therefore has to train their own model, either with the command line tool or programmatically through the API.
To get things started, let's see how we can train a model to be used in the Document Categorizer. The most convenient way to go about it is to use the command line tool to train a desired model.
Step 1
Prepare a text file containing a sample set of texts (preferably around 15000) that is a subset of the texts you need to categorize with the Document Categorizer. In this example, I'll be using a set of traffic-related texts extracted from Twitter. Each one needs to be categorized as High, Medium, Low or Random to indicate the level of traffic the text implies.
High @IlhamS road closed till peliyagoda . Baseline clear for now but might be closed later .
Low RT @Daraka_S: @road_lk no traffic towards peliyagoda Borall .
Low RT @Afzy_T: No traffic near town hall towards colpetty @road_lk .
Medium @k9yosh not closed yet but traffic in Maradana, don't know about Kotahena. earlier in the morning it was clear .
High RT @realmau5: quite a crowd I see (@ Maradana Railway Station - in Maradana , Western Province) http://t.co/gIpiyrDILv .
High RT @diliniheshala: Heavy traffic near Pattiya junc towards colombo .
Low RT@BrianMorenze: @road_lk Deserted Colombo - Negombo Rd, http://t.co/l06VZsd67.
High RT @softnuwan: Huge traffic jam in Rajagiriya , Buthgamuwa Road .
Random @Formerareef unfortunately we don't always get photos .
High RT @deleepa_perera: Negombo road closed off from Wattala onwards . Heavy STF deployment . #PapalVisitSL @road_lk http://t.co/8eOyNJRuhv .
When training a model, make sure the training file contains:
- One document per line
- Category and text separated by whitespace
As shown above, each line starts with its category. A single document (in this context, a single tweet that needs to be categorized as High, Medium, Low or Random) must be on one line, even if it appears wrapped onto two lines here due to space restrictions.
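The two format rules can be checked before training with a few lines of plain Java. This is only a minimal sketch: the class `TrainingFileCheck` and its method `findProblems` are hypothetical helpers of my own, not part of OpenNLP, and the default file name is an assumption taken from the name used later in this post. It flags empty lines and lines that carry a category but no text, which are the lines the trainer is likely to reject.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: checks the two format rules above.
class TrainingFileCheck {

    // Returns one message per line that breaks the expected format.
    static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            if (line.isEmpty()) {
                problems.add("line " + (i + 1) + ": empty line");
                continue;
            }
            // Split into the category (first token) and the rest of the document.
            String[] parts = line.split("\\s+", 2);
            if (parts.length < 2) {
                problems.add("line " + (i + 1) + ": category without text");
            }
        }
        return problems;
    }

    public static void main(String[] args) throws IOException {
        // Assumed file name; pass a different path as the first argument.
        String path = args.length > 0 ? args[0] : "en-doccat.train";
        List<String> lines = Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
        for (String problem : findProblems(lines)) {
            System.out.println(problem);
        }
    }
}
```

If the program prints nothing, every line has at least a category followed by some text. It does not check whether the category itself is one you intended.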
Step 2
To be able to run the trainer from the command line, you have to set OPENNLP_HOME on your computer and add its bin directory to your PATH:
OPENNLP_HOME=/path/to/your/apache-opennlp-1.5.3
export OPENNLP_HOME
PATH=$PATH:$OPENNLP_HOME/bin
export PATH
Step 3
Save the training file with a name of your choice and the extension .train. Once that is done, run the following command from the directory where the file is saved.
$ opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8
I have named the training file 'en-doccat.train' and the model will be saved as 'en-doccat.bin'. When training a model this way, it's advisable to rerun the above command each time you add a batch of lines to the training file, rather than running it only once after adding all the training data. You may well make mistakes while preparing the training file, and tracking them down in a file with thousands of lines is painful. Rerunning the command as you go shows you early whether any errors have crept in.
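One kind of mistake that a training run is unlikely to flag is a misspelled category, since a label such as 'Hgh' simply looks like another category. A small pre-check that counts how often each label occurs makes such typos stand out. The sketch below is only an illustration: `LabelCounter` and `countLabels` are hypothetical names of my own, not OpenNLP classes, and the sample lines are shortened from the tweets above with one label misspelled on purpose.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of OpenNLP: counts how often each label occurs.
class LabelCounter {

    // Maps each category (the first token of a line) to its number of lines.
    static Map<String, Integer> countLabels(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank lines carry no label
            }
            String label = trimmed.split("\\s+", 2)[0];
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "High Heavy traffic near Pattiya junc towards colombo .",
                "Hgh Huge traffic jam in Rajagiriya .", // deliberate typo in the label
                "Low No traffic near town hall towards colpetty .");
        System.out.println(countLabels(sample));
    }
}
```

A label that appears only once or twice in a file of thousands of lines is a good candidate for a typo.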
Now that you have a trained model, you can test how it works on a new set of test documents. For that you can use the following code.
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FeedClassifierTrainer {

    public static void main(String[] args) {
        String content = "Document that needs to be categorized goes here";
        try {
            new FeedClassifierTrainer().categorizeDocument(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void categorizeDocument(String text) throws IOException {
        String modelPath = "Path to your en-doccat.bin model file";
        // Load the trained model; try-with-resources closes the stream afterwards.
        try (InputStream modelIn = new FileInputStream(modelPath)) {
            DocumentCategorizerME categorizer =
                    new DocumentCategorizerME(new DoccatModel(modelIn));
            // One score per category (OpenNLP 1.5.3 accepts the raw text here).
            double[] classDistribution = categorizer.categorize(text);
            // Pick the category with the highest score.
            String predictedCategory = categorizer.getBestCategory(classDistribution);
            System.out.println("Model prediction : " + predictedCategory);
        }
    }
}
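As far as I understand it, categorize returns one score per category and getBestCategory then returns the category with the highest score. The following plain-Java sketch shows that idea in isolation. It is a conceptual illustration only and not OpenNLP's source code: the class `BestCategorySketch`, the method `bestCategory`, and the scores are all made up for this example.

```java
// Conceptual sketch only; this is not OpenNLP's source code.
class BestCategorySketch {

    // Returns the category whose score is highest; ties go to the first one.
    static String bestCategory(String[] categories, double[] scores) {
        if (categories.length == 0 || categories.length != scores.length) {
            throw new IllegalArgumentException(
                    "categories and scores must be non-empty and of equal length");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    public static void main(String[] args) {
        String[] categories = {"High", "Medium", "Low", "Random"};
        double[] scores = {0.62, 0.21, 0.12, 0.05}; // made-up values for illustration
        System.out.println("Model prediction : " + bestCategory(categories, scores));
    }
}
```

Looking at the whole score array rather than only the winner can be useful when two categories score close to each other, for example High and Medium.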
The Document Categorizer can be used in many different scenarios, such as:
- Sentiment analysis of Twitter/Facebook users
- Categorizing texts about price changes in certain stocks, or about profit increases and reductions
Hope this helps you use the Document Categorizer in developing your own applications.
Cheers...!!!