For text analysis it is often useful to POS-tag and lemmatize your text, especially with non-English data. R does not have built-in functions for this, but there are packages that connect to external tools to do it for you. This handout reviews two common tools (spaCy and CoreNLP) and two tools developed by us that are useful for Dutch (frogr) and for distributed processing (nlpipe).
spaCy is a Python package with processing models for six different languages, which makes it attractive if you need e.g. French or German lemmatizing.

To install it, you need to install the spacy module in Python and download the appropriate language model; see https://spacy.io/usage/. After that, you can install the spacyr package and use it to tag, lemmatize, and/or parse text:
```r
library(spacyr)
spacy_initialize("de", python_executable = "/home/wva/env/bin/python")
tokens = spacy_parse("Ich bin ein Berliner")
head(tokens)
```
doc_id | sentence_id | token_id | token | lemma | pos | entity |
---|---|---|---|---|---|---|
text1 | 1 | 1 | Ich | Ich | PRON | |
text1 | 1 | 2 | bin | sein | AUX | |
text1 | 1 | 3 | ein | einen | DET | |
text1 | 1 | 4 | Berliner | Berliner | NOUN | MISC_B |
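The output is a plain data frame, so you can use normal R tools to, for example, keep only content words. A minimal sketch, using a hand-constructed stand-in for the `spacy_parse` output above so that it runs without spaCy installed:

```r
# Stand-in for the spacy_parse output above (hand-constructed here
# so this snippet runs without spaCy installed)
tokens = data.frame(
  doc_id = "text1", sentence_id = 1, token_id = 1:4,
  token = c("Ich", "bin", "ein", "Berliner"),
  lemma = c("Ich", "sein", "einen", "Berliner"),
  pos = c("PRON", "AUX", "DET", "NOUN"),
  stringsAsFactors = FALSE
)
# Keep only nouns and proper nouns, and take their lemmas as terms
nouns = subset(tokens, pos %in% c("NOUN", "PROPN"))
nouns$lemma
## [1] "Berliner"
```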
CoreNLP is a Java-based toolkit that offers a lot of NLP processing for English and, in more limited form, for other languages. It is used a lot in English-language research.

To install it, you need Java on your system; then install the R packages rJava and coreNLP and download the program and models:
```r
install.packages("rJava")
install.packages("coreNLP")
library(coreNLP)  # load the package before calling downloadCoreNLP
downloadCoreNLP()
```
After this, you can use coreNLP to parse:
```r
library(coreNLP)
initCoreNLP(type='english')
output = annotateString("John loves Hannover")
tokens = getToken(output)
head(tokens)
```
sentence | id | token | lemma | CharacterOffsetBegin | CharacterOffsetEnd | POS | NER | Speaker |
---|---|---|---|---|---|---|---|---|
1 | 1 | John | John | 0 | 4 | NNP | PERSON | PER0 |
1 | 2 | loves | love | 5 | 10 | VBZ | O | PER0 |
1 | 3 | Hannover | Hannover | 11 | 19 | NNP | LOCATION | PER0 |
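Since the tokens are again a plain data frame, extracting e.g. the named entities is a one-liner. A minimal sketch, using a hand-constructed stand-in for the `getToken` output above so that it runs without CoreNLP installed:

```r
# Stand-in for the getToken output above (hand-constructed so this
# runs without CoreNLP installed)
tokens = data.frame(
  sentence = 1, id = 1:3,
  token = c("John", "loves", "Hannover"),
  lemma = c("John", "love", "Hannover"),
  POS = c("NNP", "VBZ", "NNP"),
  NER = c("PERSON", "O", "LOCATION"),
  stringsAsFactors = FALSE
)
# CoreNLP marks non-entities with NER == "O", so entities are the rest
entities = subset(tokens, NER != "O")
entities[, c("token", "NER")]
```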
Unfortunately, while spaCy has Dutch models, the lemmatizer does not seem to work. The University of Tilburg develops the Frog program, which performs lemmatization quite fast. You need to install and run it via Docker: install Docker and run the following command:
```bash
docker run --name frog -dp 9887:9887 proycon/lamachine frog -S 9887 --skip=pm
```
After that, you can use the frogr package to call it:
```r
# install with: devtools::install_github("vanatteveldt/frogr")
library(frogr)
tokens = frogr::call_frog("Tulpen uit Amsterdam", port=9887)
head(tokens)
```
docid | sent | position | word | lemma | morph | pos | prob | ner | chunk | parse1 | parse2 | majorpos |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Tulpen | tulp | [tulp][en] | N(soort,mv,basis) | 0.996849 | O | B-NP | NA | NA | N |
1 | 1 | 2 | uit | uit | [uit] | VZ(init) | 0.992727 | O | B-PP | NA | NA | VZ |
1 | 1 | 3 | Amsterdam | Amsterdam | [Amsterdam] | SPEC(deeleigen) | 1.000000 | B-LOC | B-NP | NA | NA | SPEC |
NLPipe is a platform developed at the VU that allows you to use a separate server (or multiple servers) to do the processing, which can be quite useful if you have to process a lot of documents. Also, since it runs outside of R, it saves you the hassle of tying R to Java or Python as for coreNLP or spaCy.
If you want to install an NLPipe server and workers on separate servers, see http://github.com/vanatteveldt/nlpipe.
For testing and smaller data sets, you can also install it directly on your computer using docker. For example, this sets up coreNLP and NLPipe using docker:
```bash
docker run --name corenlp -dp 9000:9000 chilland/corenlp-docker
docker run --name nlpipe --link corenlp:corenlp -e "CORENLP_HOST=http://corenlp:9000" -dp 5001:5001 vanatteveldt/nlpipe
```
You can go to http://localhost:5001 to verify that it is running correctly. On that page you will see how many documents are assigned for processing and how many are done.
Now, you can assign documents to be parsed from within R:
```r
library(nlpiper)
id = process_async("corenlp_lemmatize", "This is a test!")
status("corenlp_lemmatize", id)
## 0x702edca0b2181c15d457eacac39de39b
## "DONE"
tokens = result("corenlp_lemmatize", id, format='csv')
head(tokens)
```
id | sentence | offset | word | lemma | POS | POS1 | ner |
---|---|---|---|---|---|---|---|
1.491169e+38 | 1 | 0 | This | this | DT | D | O |
1.491169e+38 | 1 | 5 | is | be | VBZ | V | O |
1.491169e+38 | 1 | 8 | a | a | DT | D | O |
1.491169e+38 | 1 | 10 | test | test | NN | N | O |
1.491169e+38 | 1 | 14 | ! | ! | . | O | O |
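Whichever tool you use, the resulting long-format token list can be turned into a document-term matrix of lemma counts with base R's `table()`. A minimal sketch with made-up lemmatized tokens for two documents:

```r
# Made-up lemmatized tokens for two short documents
tokens = data.frame(
  doc_id = c(1, 1, 1, 2, 2),
  lemma = c("this", "be", "test", "test", "work"),
  stringsAsFactors = FALSE
)
# Cross-tabulate documents by lemma: rows are documents, columns are terms
dtm = table(tokens$doc_id, tokens$lemma)
dtm["2", "test"]
## [1] 1
```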