For text analysis it is often useful to part-of-speech (POS) tag and lemmatize your text, especially with non-English data. R has no built-in functions for this, but there are packages that connect to external tools to do it. This handout reviews two common tools (spacy and coreNLP) and two tools developed by us: frogr, which is useful for Dutch, and nlpipe, which is useful for distributed processing.

Spacyr

Spacy is a Python package with processing models for several languages, which makes it attractive if you need, for example, French or German lemmatization.

To install it, you need to install the spacy module in Python and download the appropriate language model. See https://spacy.io/usage/

After that, you can install spacyr and use it to tag, lemmatize, and/or parse text:

library(spacyr)
spacy_initialize("de", python_executable = "/home/wva/env/bin/python")
tokens = spacy_parse("Ich bin ein Berliner")
head(tokens)
doc_id sentence_id token_id token    lemma    pos  entity
text1            1        1 Ich      Ich      PRON
text1            1        2 bin      sein     AUX
text1            1        3 ein      einen    DET
text1            1        4 Berliner Berliner NOUN MISC_B
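
As a rough illustration of what you can do with such a token table, the sketch below keeps only the content words and pastes their lemmas back together per document. It uses base R only. The data frame is typed in by hand to mirror the output above (it is not produced by a live spacy call), and the set of POS tags treated as content words is an assumption made for this example.

```r
# Hand-typed copy of the spacy output above, so the example runs without spacy
tokens <- data.frame(
  doc_id = "text1",
  token  = c("Ich", "bin", "ein", "Berliner"),
  lemma  = c("Ich", "sein", "einen", "Berliner"),
  pos    = c("PRON", "AUX", "DET", "NOUN"),
  stringsAsFactors = FALSE
)

# Assumption: treat these POS tags as content words
content_pos <- c("NOUN", "PROPN", "VERB", "ADJ")
content <- tokens[tokens$pos %in% content_pos, ]

# One string of lemmas per document
lemmatized <- tapply(content$lemma, content$doc_id, paste, collapse = " ")
print(lemmatized)
```

The same pattern should carry over to the other token tables in this handout, as long as you adjust the column names and the tag set to the tool you use.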

CoreNLP

CoreNLP is a Java-based toolkit that offers extensive NLP processing for English and more limited support for other languages. It is widely used in research on English text.

To install it, you need to have Java on your system, install the R coreNLP package, and then download the program and models:

install.packages("rJava")
install.packages("coreNLP")
# load the package first so that downloadCoreNLP() can be found
library(coreNLP)
downloadCoreNLP()

After this, you can use coreNLP to parse:

library(coreNLP)
initCoreNLP(type='english')
output = annotateString("John loves Hannover")
tokens = getToken(output)
head(tokens)
sentence id token    lemma    CharacterOffsetBegin CharacterOffsetEnd POS NER      Speaker
1        1  John     John     0                    4                  NNP PERSON   PER0
1        2  loves    love     5                    10                 VBZ O        PER0
1        3  Hannover Hannover 11                   19                 NNP LOCATION PER0
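
The NER column makes it easy to pull out named entities. The following is a minimal sketch in base R, again using a hand-typed copy of part of the output above rather than a live coreNLP call.

```r
# Hand-typed copy of part of the coreNLP output above
tokens <- data.frame(
  sentence = 1,
  id       = 1:3,
  token    = c("John", "loves", "Hannover"),
  lemma    = c("John", "love", "Hannover"),
  POS      = c("NNP", "VBZ", "NNP"),
  NER      = c("PERSON", "O", "LOCATION"),
  stringsAsFactors = FALSE
)

# "O" marks tokens outside any named entity, so drop those rows
entities <- tokens[tokens$NER != "O", c("sentence", "token", "NER")]
print(entities)

# Count how often each entity type occurs
print(table(entities$NER))
```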

Frog

Unfortunately, while spacy has Dutch models, the lemmatizer doesn’t seem to work. The University of Tilburg has the Frog program, which performs lemmatization pretty fast. It is installed and run via Docker: install Docker and run the following command:

docker run --name frog -dp 9887:9887 proycon/lamachine frog -S 9887 --skip=pm

You can then use the frogr library to call it:

# install with: devtools::install_github("vanatteveldt/frogr")
library(frogr)
tokens = frogr::call_frog("Tulpen uit Amsterdam", port=9887)
head(tokens)
docid sent position word      lemma     morph       pos               prob     ner   chunk parse1 parse2 majorpos
1     1    1        Tulpen    tulp      [tulp][en]  N(soort,mv,basis) 0.996849 O     B-NP  NA     NA     N
1     1    2        uit       uit       [uit]       VZ(init)          0.992727 O     B-PP  NA     NA     VZ
1     1    3        Amsterdam Amsterdam [Amsterdam] SPEC(deeleigen)   1.000000 B-LOC B-NP  NA     NA     SPEC

NLPipe

NLPipe is a platform developed at the VU that lets you use a separate server (or multiple servers) to do the processing, which can be quite useful if you have to process a lot of documents. Also, since it runs outside of R, it saves you the hassle of tying R to Java or Python, as coreNLP and spacy require.

If you want to install an NLPipe server and workers on separate servers, see http://github.com/vanatteveldt/nlpipe.

For testing and smaller data sets, you can also install it directly on your computer using docker. For example, this sets up coreNLP and NLPipe using docker:

docker run --name corenlp -dp 9000:9000 chilland/corenlp-docker 
docker run --name nlpipe --link corenlp:corenlp -e "CORENLP_HOST=http://corenlp:9000" -dp 5001:5001 vanatteveldt/nlpipe

You can go to http://localhost:5001 to verify that it is running correctly. On that page you will see how many documents are assigned for processing and how many are done.

Now, you can assign documents to be parsed from within R:

library(nlpiper)
id = process_async("corenlp_lemmatize", "This is a test!")
status("corenlp_lemmatize", id)
## 0x702edca0b2181c15d457eacac39de39b 
##                             "DONE"
tokens = result("corenlp_lemmatize", id, format='csv')
head(tokens)
id           sentence offset word lemma POS POS1 ner
1.491169e+38 1        0      This this  DT  D    O
1.491169e+38 1        5      is   be    VBZ V    O
1.491169e+38 1        8      a    a     DT  D    O
1.491169e+38 1        10     test test  NN  N    O
1.491169e+38 1        14     !    !     .   O    O
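
Since documents are processed asynchronously, for larger jobs you would typically assign all documents first and then wait until they are finished before fetching results. The sketch below shows one possible way to poll for that. Note that wait_until_done is a hypothetical helper, not part of nlpiper, and it assumes only what the output above shows: that the status of a finished document is "DONE". The status function is passed in as an argument so that the example runs without a server; with a live server you could pass something like function(id) status("corenlp_lemmatize", id).

```r
# Hypothetical helper, not part of nlpiper: poll until all ids report "DONE"
wait_until_done <- function(ids, get_status, interval = 1, max_tries = 60) {
  for (i in seq_len(max_tries)) {
    states <- vapply(ids, function(id) as.character(get_status(id)), character(1))
    if (all(states == "DONE")) return(TRUE)
    Sys.sleep(interval)  # wait before asking again
  }
  FALSE  # gave up: some documents were still not done
}

# Stand-in for a real server: reports "DONE" from the third call onwards
calls <- 0
fake_status <- function(id) {
  calls <<- calls + 1
  if (calls > 2) "DONE" else "PENDING"  # "PENDING" is a made-up placeholder value
}

finished <- wait_until_done(c("doc1", "doc2"), fake_status, interval = 0)
print(finished)
```

The max_tries limit is there so that a stuck or failed document does not leave your R session waiting forever; pick the interval and limit to match the size of your job.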