A small utility for converting agiga (Annotated English Gigaword) data into a processors Document.
Why spend the time and resources parsing and annotating over 183 million sentences when it has already been done?

    import org.clulab.agiga

    // build a processors.Document
    val doc = agiga.toDocument("path/to/agiga/xml/ltw_eng_200705.xml.gz")

Everything is configured in the application.conf file.
- Change the `view` property to "lemmas"
- Change the `inputDir` property to wherever your copy of `agiga` is nestled on your disk
- Change the `outputDir` property to wherever you want your compressed copy of the lemmatized English Gigaword to be written
- (Optional) Change the `nthreads` property to the maximum number of threads you prefer to use for parallelization
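
As a rough illustration of the steps above, the relevant part of application.conf might look something like the sketch below. The four property names come from the list; the flat (un-nested) layout, the example paths, and the thread count are assumptions, so check the file shipped with the project for the actual structure.

```hocon
# Sketch only: key nesting, paths, and the thread count are assumed, not taken from the project.
view = "lemmas"                        # which representation to write out
inputDir = "/path/to/agiga/xml"        # hypothetical location of your agiga copy
outputDir = "/path/to/lemmatized/out"  # hypothetical destination for the compressed output
nthreads = 4                           # optional: upper bound on threads used for parallelization
```
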
All that's left is to run AgigaReader:

    sbt "runMain sem.AgigaReader"

The `view` property accepts the following values:

| Value | Description |
|---|---|
| "words" | word form of each token |
| "lemmas" | lemma form of each token |
| "tags" | PoS tag of each token |
| "entities" | NE labels of each token |
| "deps" | `<word form of head>_<relation>_<word form of dependent>` |
| "lemma-deps" | `<lemmatized head>_<relation>_<lemmatized dependent>` |
| "tag-deps" | `<pos tag of head>_<relation>_<pos tag of dependent>` |
| "entity-deps" | `<NE label of head>_<relation>_<NE label of dependent>` |
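
To make the string formats in the table concrete, here is a minimal, self-contained sketch of how each view could be rendered for a single sentence. `Token`, `Dep`, and `ViewSketch` are hypothetical names invented for this illustration; they are not the processors or agiga classes, and AgigaReader's actual implementation may differ.

```scala
// Hypothetical types for illustration only; not the processors or agiga classes.
final case class Token(word: String, lemma: String, tag: String, entity: String)

// A dependency edge between two token indices in the same sentence.
final case class Dep(head: Int, relation: String, dependent: Int)

object ViewSketch {

  // Pick the token attribute that a given view exposes.
  private def attribute(view: String): Token => String = view match {
    case "words" | "deps"           => _.word
    case "lemmas" | "lemma-deps"    => _.lemma
    case "tags" | "tag-deps"        => _.tag
    case "entities" | "entity-deps" => _.entity
    case other => throw new IllegalArgumentException(s"unknown view: $other")
  }

  // Token-level views yield one string per token;
  // dependency views yield one <head>_<relation>_<dependent> string per edge.
  def render(view: String, tokens: Vector[Token], deps: Seq[Dep]): Seq[String] = {
    val attr = attribute(view)
    if (view.endsWith("deps"))
      deps.map(d => s"${attr(tokens(d.head))}_${d.relation}_${attr(tokens(d.dependent))}")
    else
      tokens.map(attr)
  }

  def main(args: Array[String]): Unit = {
    val tokens = Vector(
      Token("Dogs", "dog", "NNS", "O"),
      Token("bark", "bark", "VBP", "O")
    )
    val deps = Seq(Dep(head = 1, relation = "nsubj", dependent = 0))

    println(render("lemmas", tokens, deps).mkString(" "))     // dog bark
    println(render("lemma-deps", tokens, deps).mkString(" ")) // bark_nsubj_dog
    println(render("tag-deps", tokens, deps).mkString(" "))   // VBP_nsubj_NNS
  }
}
```

Keeping attribute selection separate from the token/dependency split means a new view only needs one extra `case`.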
- Add output options for dependencies using the DFS ordering described in "Higher-order Lexical Semantic Models for Non-factoid Answer Reranking"