=========================================
(C) 2015 Wouter van Atteveldt, license: [CC-BY-SA]
The most important object in frequency-based text analysis is the document term matrix. This matrix contains the documents in the rows and terms (words) in the columns, and each cell is the frequency of that term in that document.
In R, these matrices are provided by the tm (text mining) package. Although this package provides many functions for loading and manipulating these matrices, using them directly is relatively complicated.
Fortunately, the RTextTools package provides an easy function to create a document-term matrix from a data frame. To create a document-term matrix from a simple data frame with a ‘text’ column, use the create_matrix function:
library(RTextTools)
input = data.frame(text=c("Chickens are birds", "The bird eats"))
m = create_matrix(input$text, removeStopwords=FALSE)
We can inspect the resulting matrix using the regular R functions:
class(m)
## [1] "DocumentTermMatrix" "simple_triplet_matrix"
dim(m)
## [1] 2 6
So, m is a DocumentTermMatrix, which is derived from a simple_triplet_matrix as provided by the slam package. Internally, document-term matrices are stored as a sparse matrix: with real data, we can easily have hundreds of thousands of rows and columns, while the vast majority of cells will be zero (most words don’t occur in most documents). Storing this as a regular matrix would waste a lot of memory. In a sparse matrix, only the non-zero entries are stored, as ‘simple triplets’ of (document, term, frequency).
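To make the triplet idea concrete, here is a minimal base-R sketch that does not use slam itself. It converts the two-sentence example matrix into (document, term, frequency) triplets and back. The variable names (dense, triplets, rebuilt) are made up for illustration; the column names i, j and v follow the component names used by slam's simple_triplet_matrix.

```r
# Dense version of the toy document-term matrix (2 documents x 6 terms)
dense <- matrix(c(1, 0, 1, 1, 0, 0,
                  0, 1, 0, 0, 1, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("1", "2"),
                                c("are", "bird", "birds", "chickens", "eats", "the")))

# Keep only the non-zero cells: i = document index, j = term index, v = frequency
nz <- which(dense != 0, arr.ind = TRUE, useNames = FALSE)
triplets <- data.frame(i = nz[, 1], j = nz[, 2], v = dense[nz])
print(triplets)

# Rebuild the dense matrix from the triplets to check that nothing was lost
rebuilt <- matrix(0, nrow = nrow(dense), ncol = ncol(dense),
                  dimnames = dimnames(dense))
rebuilt[cbind(triplets$i, triplets$j)] <- triplets$v
identical(rebuilt, dense)
```

Here only 6 of the 12 cells need to be stored. With real data the share of non-zero cells is usually far smaller, which is where the memory savings come from.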
As seen in the output of dim, our matrix has only 2 rows (documents) and 6 columns (unique words). Since this is a rather small matrix, we can visualize it using as.matrix, which converts the ‘sparse’ matrix into a regular matrix:
as.matrix(m)
##     Terms
## Docs are bird birds chickens eats the
##    1   1    0     1        1    0   0
##    2   0    1     0        0    1   1
So, we can see that each word is kept as is (apart from being lowercased). We can reduce the size of the matrix by dropping stop words and stemming (see the create_matrix documentation for the full range of options):
m = create_matrix(input$text, removeStopwords=TRUE, stemWords=TRUE, language='english')
dim(m)
## [1] 2 3
as.matrix(m)
##     Terms
## Docs bird chicken eat
##    1    1       1   0
##    2    1       0   1
As you can see, the stop words (the and are) are removed, while the two forms of bird (bird and birds) are joined together, and chickens and eats are reduced to their stems chicken and eat.
In RTextTools, the language for stemming and stop words can be given as a parameter, and the default is English. Note that stemming works relatively well for English, but is less useful for more highly inflected languages such as Dutch or German. An easy way to see the effects of the preprocessing is by looking at the colSums of a matrix, which gives the total frequency of each term:
colSums(as.matrix(m))
##    bird chicken     eat
##       2       1       1
For more richly inflected languages like Dutch, the result is less promising:
text = c("De kip eet", "De kippen hebben gegeten")
m = create_matrix(text, removeStopwords=TRUE, stemWords=TRUE, language="dutch")
colSums(as.matrix(m))
##   eet geget   kip  kipp
##     1     1     1     1
As you can see, de and hebben are correctly recognized as stop words, but gegeten (eaten) and kippen (chickens) have a different stem than eet (eat) and kip (chicken). German gets similarly bad results.
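The preprocessing steps discussed above can be illustrated with a small base-R sketch. It is not what create_matrix does internally: stopwords_toy is a tiny hypothetical stop word list, and no stemmer is applied. It only shows the lowercasing, tokenizing, stop word removal and counting steps.

```r
# Toy preprocessing pipeline in base R: lowercase, tokenize, drop stop words, count.
# `stopwords_toy` is a tiny hypothetical list, not the list used by create_matrix.
texts <- c("Chickens are birds", "The bird eats")
stopwords_toy <- c("the", "are")

# Split on anything that is not a lowercase letter, then drop empty strings and stop words
tokens <- strsplit(tolower(texts), "[^a-z]+")
tokens <- lapply(tokens, function(x) x[nzchar(x) & !(x %in% stopwords_toy)])

# One row per (document, term) occurrence, then cross-tabulate into a matrix
long <- data.frame(doc  = rep(seq_along(tokens), lengths(tokens)),
                   term = unlist(tokens),
                   stringsAsFactors = FALSE)
dtm_toy <- unclass(table(long$doc, long$term))
print(dtm_toy)
colSums(dtm_toy)
```

Without a stemmer or lemmatizer, bird and birds remain separate columns here, which is exactly the gap that stemming (and, for more inflected languages, lemmatization) is meant to close.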
AmCAT can automatically lemmatize text. Before we can use it, we need to connect with a valid username and password:
library(amcatr)
## Loading required package: rjson
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: plyr
## Loading required package: tm
## Loading required package: NLP
## Loading required package: slam
## Loading required package: lda
## Loading required package: topicmodels
## Loading required package: Matrix
## Loading required package: httr
##
## Attaching package: 'httr'
##
## The following object is masked from 'package:NLP':
##
## content
##
##
## Attaching package: 'amcatr'
##
## The following object is masked _by_ '.GlobalEnv':
##
## amcat.tokens.unique_indices
conn = amcat.connect("http://preview.amcat.nl")
Now, we can use the amcat.gettokens function to lemmatize a sentence:
sentence = "Chickens are birds. The bird eats"
t = amcat.gettokens(conn, sentence=as.character(sentence), module="corenlp_lemmatize")
## GET http://preview.amcat.nl/api/v4/tokens/?module=corenlp_lemmatize&page_size=1&format=csv&sentence=Chickens%20are%20birds.%20The%20bird%20eats
t
##       word sentence pos   lemma offset aid id pos1
## 1 Chickens        1 NNS chicken      0  NA  1    N
## 2      are        1 VBP      be      9  NA  2    V
## 3    birds        1 NNS    bird     13  NA  3    N
## 4        .        1   .       .     18  NA  4    .
## 5      The        2  DT     the     20  NA  5    D
## 6     bird        2  NN    bird     24  NA  6    N
## 7     eats        2 VBZ     eat     29  NA  7    V
As you can see, this provides real-time lemmatization and Part-of-Speech tagging using the Stanford CoreNLP toolkit: ‘are’ is recognized as V(erb) and has lemma ‘be’. To create a document-term matrix from a list of tokens, we can use the dtm.create function. Since the token list is a regular R data frame, we can use normal selection to e.g. select only the verbs and nouns:
library(corpustools)
## Loading required package: reshape2
## Loading required package: RColorBrewer
## Loading required package: wordcloud
t = t[t$pos1 %in% c("V", "N"), ]
dtm = dtm.create(documents=t$sentence, terms=t$lemma)
as.matrix(dtm)
##     Terms
## Docs chicken be bird eat
##    1       1  1    1   0
##    2       0  0    1   1
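Conceptually, turning a token list into a document-term matrix is a cross-tabulation of document ids against lemmas. The sketch below shows this with base R's table function on the seven tokens printed above. It is an illustration of the idea, not the implementation of dtm.create, and it produces a regular (dense) matrix rather than a sparse one.

```r
# Token list copied from the output above (reduced to the columns needed here)
tokens <- data.frame(
  sentence = c(1, 1, 1, 1, 2, 2, 2),
  lemma    = c("chicken", "be", "bird", ".", "the", "bird", "eat"),
  pos1     = c("N", "V", "N", ".", "D", "N", "V"),
  stringsAsFactors = FALSE
)

# Keep nouns and verbs, then count how often each lemma occurs in each sentence
content <- tokens[tokens$pos1 %in% c("V", "N"), ]
counts <- unclass(table(Docs = content$sentence, Terms = content$lemma))
print(counts)
```

The counts agree with the matrix shown above; only the column order differs, because table sorts the terms alphabetically.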
Normally, rather than ask for a single ad hoc text to be parsed, we would upload a selection of articles to AmCAT. This can be done from R using the amcat.upload.articles function, but for now we will use an existing article set: set 17667 in project 688, which contains American newspaper coverage about the 2009 Gaza war.
t = amcat.gettokens(conn, project=688, articleset = 17667, module = "corenlp_lemmatize", page_size = 100, drop=NULL)
save(t, file="tokens_17667.rda")
Note that the first time you run this command on an article set, the articles will be preprocessed on the fly, so it could take quite a long time. After this, however, the results are stored in the AmCAT database, so getting the tokens should go relatively quickly, although still only at around 10 articles per second. It is therefore wise to save the tokens after getting them.
load("tokens_17667.rda")
nrow(t)
## [1] 7669594
head(t, n=20)
##         word sentence   pos     lemma offset      aid id pos1 freq
## 1       Dec.        1   NNP      Dec.      0 26074649  1    M    1
## 2         29        1    CD        29      5 26074649  2    Q    1
## 3          ,        1     ,         ,      7 26074649  3    .    1
## 4       2008        1    CD      2008      9 26074649  4    Q    1
## 5      -LRB-        1 -LRB-     -lrb-     14 26074649  5    .    1
## 6        The        1    DT       the     15 26074649  6    D    1
## 7    Western        1    JJ   western     19 26074649  7    A    1
## 8  Confucian        1    JJ confucian     27 26074649  8    A    1
## 9  delivered        1   VBN   deliver     37 26074649  9    V    1
## 10        by        1    IN        by     47 26074649 10    P    1
## 11   Newstex        1   NNP   Newstex     50 26074649 11    M    1
## 12     -RRB-        1 -RRB-     -rrb-     57 26074649 12    .    1
## 13        --        1     :        --     59 26074649 13    .    1
## 14        ``        1    ``        ``     62 26074649 14    .    1
## 15         I        1   PRP         I     63 26074649 15    O    1
## 16      dont        1   VBP      dont     65 26074649 16    V    1
## 17     think        1    VB     think     70 26074649 17    V    1
## 18     there        1    EX     there     76 26074649 18    ?    1
## 19        is        1   VBZ        be     82 26074649 19    V    1
## 20      such        1    JJ      such     85 26074649 20    A    1
As you can see, the result is similar to the ad hoc lemmatized tokens, but now we have almost 7.7 million tokens rather than 7. We can create a document-term matrix using the same commands as above, restricting ourselves to nouns, names, verbs, and adjectives:
t = t[t$pos1 %in% c("V", "N", 'M', 'A'), ]
dtm = dtm.create(documents=t$aid, terms=t$lemma)
## (Duplicate row-column matches occured. Values of duplicates are added up)
dtm
## <<DocumentTermMatrix (documents: 6893, terms: 72364)>>
## Non-/sparse entries: 1840938/496964114
## Sparsity : 100%
## Maximal term length: 80
## Weighting : term frequency (tf)
So, we now have a “sparse” matrix of almost 7,000 documents by more than 70,000 terms. Sparse here means that only the non-zero entries are kept in memory, because otherwise R would have to keep all of the almost 500 million cells in memory (and this is a relatively small data set). Thus, it might not be a good idea to use functions like as.matrix or colSums on such a matrix, since these functions convert the sparse matrix into a regular matrix. The next section investigates a number of useful functions to deal with (sparse) document-term matrices.
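The cost of a regular matrix can be worked out from the dimensions and the number of non-zero entries in the printed summary of dtm. The short calculation below is only a back-of-the-envelope sketch; the 8 bytes per cell is an assumption (the size of an R double), and the real memory use of a dense conversion may differ.

```r
# Dimensions and non-zero count taken from the printed summary of dtm
n_docs  <- 6893
n_terms <- 72364
nonzero <- 1840938

cells    <- n_docs * n_terms   # total number of cells in a dense matrix
density  <- nonzero / cells    # share of cells that are actually non-zero
dense_gb <- cells * 8 / 1e9    # assumption: 8 bytes per cell (R doubles)

cat("cells:   ", format(cells, big.mark = ","), "\n")
cat("density: ", round(100 * density, 2), "%\n")
cat("dense GB:", round(dense_gb, 1), "\n")
```

The number of cells equals the sum of the non-sparse and sparse entries reported by R, and well under one percent of them are non-zero.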
What are the most frequent words in the corpus? As shown above, we could use the built-in colSums function, but this requires first casting the sparse matrix to a regular matrix, which we want to avoid (even our relatively small dataset would have almost 500 million entries!). However, we can use the col_sums function from the slam package, which provides the same functionality for sparse matrices:
library(slam)
freq = col_sums(dtm)
# sort the list by reverse frequency using built-in order function:
freq = freq[order(-freq)]
head(freq, n=10)
##      be    have     say    Gaza  Israel      do   Hamas      go    will
##  274308   84449   48738   39912   39665   38138   28976   25433   24795
## israeli
##   21720
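Why can column sums be computed without building the full matrix? With a triplet representation it is enough to add up the stored frequencies per term index. The sketch below shows this in base R on invented toy data; it illustrates the principle and is not the slam implementation of col_sums.

```r
# Triplets (i = document, j = term, v = frequency) for a toy 2 x 4 matrix;
# the term names and values are made up for illustration
terms <- c("bird", "chicken", "eat", "fly")
i <- c(1, 1, 2, 2, 2)
j <- c(1, 2, 1, 3, 4)
v <- c(1, 1, 2, 1, 3)

# Sum the frequencies per term index without ever building the dense matrix
freq <- numeric(length(terms))
sums <- tapply(v, j, sum)
freq[as.integer(names(sums))] <- sums
names(freq) <- terms

# Sort by decreasing frequency, as in the text
freq <- sort(freq, decreasing = TRUE)
print(freq)
```

The work is proportional to the number of stored triplets rather than to the number of cells, which is what makes this feasible for large, sparse matrices.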
The frequency list shows that the most frequent terms are the main actors and countries involved, together with ‘stop’ words such as be and have. It can be useful to compute different metrics per term, such as term frequency, document frequency (in how many documents the term occurs), and tf-idf (term frequency * inverse document frequency, which gives low scores both to very rare terms and to terms that occur in almost every document). The function term.statistics from the corpustools package provides this functionality:
terms = term.statistics(dtm)
terms = terms[order(-terms$termfreq), ]
head(terms)
##          term characters number nonalpha termfreq docfreq reldocfreq
## be         be          2  FALSE    FALSE   274308    6821  0.9895546
## have     have          4  FALSE    FALSE    84449    6447  0.9352967
## say       say          3  FALSE    FALSE    48738    5487  0.7960250
## Gaza     Gaza          4  FALSE    FALSE    39912    6681  0.9692442
## Israel Israel          6  FALSE    FALSE    39665    5763  0.8360656
## do         do          2  FALSE    FALSE    38138    4697  0.6814159
##               tfidf
## be     0.0008800757
## have   0.0020562789
## say    0.0058860143
## Gaza   0.0007636685
## Israel 0.0050126632
## do     0.0051695716
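To give an idea of what such term-level statistics mean, the sketch below computes term frequency, document frequency, relative document frequency and one common tf-idf variant on a small invented matrix. The exact formula used by term.statistics is not shown in this text, so the tf-idf definition here is an assumption chosen for illustration and its numbers are not comparable to the table above.

```r
# Toy document-term matrix: 3 documents, 3 terms (values invented for illustration)
m <- matrix(c(2, 1, 0,
              1, 0, 3,
              1, 0, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("be", "gaza", "obama")))

termfreq   <- colSums(m)        # total occurrences of each term
docfreq    <- colSums(m > 0)    # number of documents containing the term
reldocfreq <- docfreq / nrow(m) # share of documents containing the term

# One common tf-idf variant (an assumption, see above): relative term frequency
# within each document, weighted by log2(N / docfreq), averaged over the
# documents that contain the term
tf    <- m / rowSums(m)
idf   <- log2(nrow(m) / docfreq)
tfidf <- colSums(t(t(tf) * idf)) / docfreq

print(data.frame(termfreq, docfreq, reldocfreq, tfidf))
```

In this toy example, be occurs in every document, so its inverse document frequency and therefore its tf-idf are zero, even though it is the most frequent term.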
For each word, the total frequency and the relative document frequency are listed, as well as some basic information on the number of characters and the occurrence of numerals or non-alphanumeric characters. This allows us to create a ‘common sense’ filter to reduce the number of terms, for example removing all words containing a number or punctuation mark, and all short (characters<=2), infrequent (termfreq<25) and overly frequent (reldocfreq>=.5) words:
subset = terms[!terms$number & !terms$nonalpha & terms$characters>2 & terms$termfreq>=25 & terms$reldocfreq<.5, ]
nrow(subset)
## [1] 8423
head(subset, n=10)
##        term characters number nonalpha termfreq docfreq reldocfreq
## get     get          3  FALSE    FALSE    17387    2744  0.3980850
## know   know          4  FALSE    FALSE    14913    2300  0.3336718
## Obama Obama          5  FALSE    FALSE    13759    2010  0.2916002
## think think          5  FALSE    FALSE    13074    2006  0.2910199
## see     see          3  FALSE    FALSE    12640    3212  0.4659800
## make   make          4  FALSE    FALSE    11491    3441  0.4992021
## year   year          4  FALSE    FALSE    11292    3123  0.4530683
## time   time          4  FALSE    FALSE    10485    3349  0.4858552
## come   come          4  FALSE    FALSE    10376    2988  0.4334832
## end     end          3  FALSE    FALSE     9549    3266  0.4738140
##             tfidf
## get   0.007402339
## know  0.008088574
## Obama 0.017005069
## think 0.009220950
## see   0.005432677
## make  0.004985312
## year  0.005909995
## time  0.004963740
## come  0.005349094
## end   0.005763226
This seems to be a more useful set of words. We now have about 8 thousand terms left of the original 72 thousand. To create a new document-term matrix with only these terms, we can use normal matrix indexing on the columns (which contain the words):
dtm_filtered = dtm[, colnames(dtm) %in% subset$term]
dim(dtm_filtered)
## [1] 6893 8423
This yields a much more manageable DTM. As a bonus, we can use the dtm.wordcloud function in corpustools (which is a thin wrapper around the wordcloud package) to visualize the top words as a word cloud:
dtm.wordcloud(dtm_filtered)
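The ‘common sense’ filter used above can be tried out on a small hand-made term table. In the sketch below the column names mirror those printed by term.statistics, but the rows are invented and the way the number and nonalpha flags are derived (with regular expressions) is an assumption, not necessarily how the package computes them.

```r
# Toy term table; column names mirror the term.statistics output,
# but the rows and the derivation of the flags are assumptions
terms_toy <- data.frame(
  term       = c("be", "Obama", "2008", "-lrb-", "ceasefire", "zzz"),
  termfreq   = c(274308, 13759, 900, 5000, 400, 3),
  reldocfreq = c(0.99, 0.29, 0.10, 0.40, 0.05, 0.001),
  stringsAsFactors = FALSE
)
terms_toy$characters <- nchar(terms_toy$term)
terms_toy$number     <- grepl("[0-9]", terms_toy$term)         # contains a digit
terms_toy$nonalpha   <- grepl("[^A-Za-z0-9]", terms_toy$term)  # contains e.g. punctuation

# Same conditions as the filter in the text
keep <- with(terms_toy,
             !number & !nonalpha & characters > 2 & termfreq >= 25 & reldocfreq < .5)
terms_toy$term[keep]
```

Each condition removes a different kind of noise: be is too short and too widespread, 2008 contains digits, -lrb- contains punctuation, and zzz is too rare, so only Obama and ceasefire survive.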
Note that such corpus analytics might not seem very informative, but it is quite easy to use this to e.g. see which names occur in a set of documents:
names = t[t$pos1 == 'M', ]
dtm_names = dtm.create(names$aid, names$lemma)
## (Duplicate row-column matches occured. Values of duplicates are added up)
name.terms = term.statistics(dtm_names)
name.terms = name.terms[order(-name.terms$termfreq), ]
head(name.terms)
##                      term characters number nonalpha termfreq docfreq
## Gaza                 Gaza          4  FALSE    FALSE    39912    6681
## Israel             Israel          6  FALSE    FALSE    39665    5763
## Hamas               Hamas          5  FALSE    FALSE    28976    4845
## Obama               Obama          5  FALSE    FALSE    13759    2010
## Palestinians Palestinians         12  FALSE    FALSE     7277    3327
## United             United          6  FALSE    FALSE     6923    2813
##              reldocfreq       tfidf
## Gaza          0.9698069 0.003517966
## Israel        0.8365510 0.023505464
## Hamas         0.7032951 0.038803425
## Obama         0.2917695 0.080890384
## Palestinians  0.4829438 0.029633259
## United        0.4083321 0.032329197
And of course we can visualize this (using a square root transformation of the frequency to prevent the top names from dominating the word cloud):
dtm.wordcloud(dtm_names, freq.fun = sqrt)
## Warning in wordcloud(terms, freqs, scale = scale, min.freq = min.freq,
## max.words = Inf, : America could not be fit on page. It will not be
## plotted.
## (Similar warnings follow for many other names, including Saturday, Lebanon, Hezbollah, Olmert and Jerusalem; they are omitted here.)
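The effect of the square root transformation can be seen by comparing the relative sizes of a few names before and after transforming. The sketch below uses three term frequencies from the name table above and only illustrates the arithmetic; how dtm.wordcloud maps these values to font sizes is not covered here.

```r
# Term frequencies of three names, taken from the table above
freqs <- c(Gaza = 39912, Obama = 13759, United = 6923)

# Size relative to the most frequent name: raw frequencies versus square roots
raw_ratio  <- freqs / max(freqs)
sqrt_ratio <- sqrt(freqs) / max(sqrt(freqs))
round(rbind(raw = raw_ratio, sqrt = sqrt_ratio), 2)
```

After taking square roots, the less frequent names are much closer in size to the top name, so they remain readable in the word cloud.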
Another useful thing we can do is compare two corpora: which words or names are mentioned more in, for example, one country or speech compared to another? To do this, we get the tokens from set 17668, which contains the coverage of the Gaza war in newspapers from Islamic countries.
t2 = amcat.gettokens(conn, project=688, articleset = 17668, module = "corenlp_lemmatize", page_size = 100, drop=NULL)
save(t2, file="tokens_17668.rda")
And we create a document-term matrix from the second article set as well:
load("tokens_17668.rda")
t2 = t2[t2$pos1 %in% c("V", "N", 'M', 'A'), ]
dtm2 = dtm.create(documents=t2$aid, terms=t2$lemma)
## (Duplicate row-column matches occured. Values of duplicates are added up)
dtm2
## <<DocumentTermMatrix (documents: 846, terms: 15782)>>
## Non-/sparse entries: 141939/13209633
## Sparsity : 99%
## Maximal term length: 79
## Weighting : term frequency (tf)
Let’s also remove the non-informative words from this matrix:
terms2 = term.statistics(dtm2)
subset2 = terms2[!terms2$number & !terms2$nonalpha & terms2$characters>2 & terms2$termfreq>=25 & terms2$reldocfreq<.5, ]
dtm2_filtered = dtm2[, colnames(dtm2) %in% subset2$term]
So how can we check which words are more frequent in the American discourse than in the ‘Islamic’ discourse? The function corpora.compare provides this functionality, given two document-term matrices:
cmp = corpora.compare(dtm_filtered, dtm2_filtered)
cmp = cmp[order(cmp$over), ]
head(cmp)
##        term termfreq.x termfreq.y relfreq.x   relfreq.y       over
## 8439  Hamas          0       1457         0 0.011078753 0.08279001
## 8431    can          0        789         0 0.005999407 0.14286925
## 8430   call          0        782         0 0.005946180 0.14396402
## 8426 attack          0        687         0 0.005223818 0.16067307
## 8480   take          0        661         0 0.005026119 0.16594428
## 8466  other          0        643         0 0.004889250 0.16980089
##           chi
## 8439 29792.58
## 8431 16129.56
## 8430 15986.42
## 8426 14043.86
## 8480 13512.24
## 8466 13144.20
For each term, the absolute and relative frequencies are given for both corpora. In this case, x is American newspapers and y is Muslim-country newspapers. The ‘over’ column shows the amount of overrepresentation: a value above 1 indicates that the term is relatively more frequent in the x corpus, and a value below 1 that it is relatively more frequent in the y corpus. ‘Chi’ is a measure of how unexpected this overrepresentation is: a high number means that the term is very typical for that corpus. Since the output above is sorted by ascending overrepresentation, these are the terms that are overrepresented in the Muslim-country newspapers. Note that a frequency of zero in x does not always mean that a term is absent from the American papers: Hamas, for example, occurs 28,976 times in that corpus, but it was dropped from dtm_filtered because it occurs in more than half of the documents. Let’s have a look at the American papers:
cmp = cmp[order(-cmp$over), ]
head(cmp, n=10)
##              term termfreq.x termfreq.y   relfreq.x    relfreq.y     over
## 5431 Palestinians       7277          0 0.002707445 0.0000000000 3.707445
## 7614        think      13074        140 0.004864248 0.0010645335 2.840471
## 1391          CNN       4858          0 0.001807444 0.0000000000 2.807444
## 4221         know      14913        184 0.005548458 0.0013991012 2.729546
## 8064        video       3908          0 0.001453991 0.0000000000 2.453991
## 6749       Senate       3808          0 0.001416786 0.0000000000 2.416786
## 3130          get      17387        294 0.006468922 0.0022355204 2.308414
## 1373         clip       3139          0 0.001167881 0.0000000000 2.167881
## 4504          lot       5234         48 0.001947336 0.0003649829 2.159248
## 7597        thank       2914          0 0.001084169 0.0000000000 2.084169
##           chi
## 5431 356.9856
## 7614 388.0337
## 1391 238.1126
## 4221 405.2883
## 8064 191.4842
## 6749 186.5778
## 3130 360.5502
## 1373 153.7627
## 4504 167.8744
## 7597 142.7298
So, to draw very tentative conclusions, American papers seem to talk about Palestinians and politics, while Muslim-country papers talk about Hamas and fighting.
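The two comparison measures can be sketched in a few lines of base R. The ‘over’ formula below is inferred from the printed numbers (it reproduces them to the printed precision) and is not taken from the corpustools source, so treat it as an assumption. The chi-squared function is the standard Pearson statistic for a 2 x 2 table and may differ in detail from what corpora.compare computes; its example counts are hypothetical.

```r
# Relative frequencies copied from the printed comparisons
relfreq_x <- c(Palestinians = 0.002707445, think = 0.004864248,   Hamas = 0)
relfreq_y <- c(Palestinians = 0,           think = 0.0010645335,  Hamas = 0.011078753)

# Smoothed ratio of frequencies per 1,000 words; the +1 avoids division by zero.
# Inferred from the printed output, not from the package source.
over <- (relfreq_x * 1000 + 1) / (relfreq_y * 1000 + 1)
round(over, 4)

# Pearson chi-squared for a 2 x 2 table:
# rows = corpus x / corpus y, columns = this term / all other words
chi2 <- function(x_term, x_other, y_term, y_other) {
  n <- x_term + x_other + y_term + y_other
  n * (x_term * y_other - x_other * y_term)^2 /
    ((x_term + x_other) * (y_term + y_other) *
     (x_term + y_term) * (x_other + y_other))
}

# Hypothetical counts: 50 of 10,000 words in x versus 5 of 10,000 words in y
chi2(x_term = 50, x_other = 9950, y_term = 5, y_other = 9995)
```

Because of the smoothing, a term that is absent from one corpus still gets a finite ratio, and very rare terms stay close to 1 instead of producing extreme values.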
We can also sort by chi-squared, taking only the underrepresented (Muslim) words:
Let’s make a word cloud of the words in the American papers, with size indicating chi-square overrepresentation:
us = cmp[cmp$over > 1,]
dtm.wordcloud(terms = us$term, freqs = us$chi)
And for the Muslim-country papers:
mus = cmp[cmp$over < 1,]
dtm.wordcloud(terms = mus$term, freqs = mus$chi, freq.fun = sqrt)
As you can see, these differences are to a large extent due to place names: American papers talk about American states and cities, while Muslim-country papers talk about their own localities.
So, it can be more informative to exclude names and focus instead on, for example, the nouns or verbs that are used:
nouns = t[t$pos1 == "N" & t$lemma %in% subset$term, ]
nouns2 = t2[t2$pos1 == "N" & t2$lemma %in% subset2$term, ]
cmp = corpora.compare(dtm.create(nouns$aid, nouns$lemma), dtm.create(nouns2$aid, nouns2$lemma))
## (Duplicate row-column matches occured. Values of duplicates are added up)
## (Duplicate row-column matches occured. Values of duplicates are added up)
with(cmp[cmp$over > 1,], dtm.wordcloud(terms=term, freqs=chi))
## Warning in wordcloud(terms, freqs, scale = scale, min.freq = min.freq,
## max.words = Inf, : inauguration could not be fit on page. It will not be
## plotted.
## (Similar warnings follow for many other nouns, including correspondent, bailout, budget, economy and president; they are omitted here.)
with(cmp[cmp$over < 1,], dtm.wordcloud(terms=term, freqs=chi, freq.fun=sqrt))
verbs = t[t$pos1 == "V" & t$lemma %in% subset$term, ]
verbs2 = t2[t2$pos1 == "V" & t2$lemma %in% subset2$term, ]
cmp = corpora.compare(dtm.create(verbs$aid, verbs$lemma), dtm.create(verbs2$aid, verbs2$lemma))
## (Duplicate row-column matches occured. Values of duplicates are added up)
## (Duplicate row-column matches occured. Values of duplicates are added up)
with(cmp[cmp$over > 1,], dtm.wordcloud(terms=term, freqs=chi, freq.fun=sqrt))
with(cmp[cmp$over < 1,], dtm.wordcloud(terms=term, freqs=chi, freq.fun=log))
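The comparisons above select terms on an over-representation ratio (over) and size them by a chi-squared score (chi). How corpora.compare computes these columns is not shown here, so the following is only a minimal base R sketch of one common way to obtain such scores. The function compare_terms, the smoothing constant, and the toy counts are all hypothetical and serve illustration only:

```r
# Toy term counts for two corpora (made-up numbers, for illustration only)
freq_x <- c(war = 30, peace = 10, price = 5)
freq_y <- c(war = 10, peace = 10, price = 25)

# Hypothetical helper: NOT the implementation of corpora.compare
compare_terms <- function(freq_x, freq_y, smooth = 0.001) {
  terms <- union(names(freq_x), names(freq_y))
  # Align both count vectors on the shared vocabulary; absent terms count as 0
  x <- unname(freq_x[terms]); x[is.na(x)] <- 0
  y <- unname(freq_y[terms]); y[is.na(y)] <- 0
  # Relative frequencies and their (smoothed) ratio
  rel_x <- x / sum(x)
  rel_y <- y / sum(y)
  over <- (rel_x + smooth) / (rel_y + smooth)
  # 2x2 chi-squared per term: this term vs. all other terms, corpus x vs. corpus y
  rest_x <- sum(x) - x
  rest_y <- sum(y) - y
  n <- x + y + rest_x + rest_y
  chi <- n * (x * rest_y - y * rest_x)^2 /
    ((x + y) * (rest_x + rest_y) * (x + rest_x) * (y + rest_y))
  data.frame(term = terms, over = over, chi = chi)
}

cmp_toy <- compare_terms(freq_x, freq_y)
# Terms that are relatively more frequent in the first corpus
cmp_toy[cmp_toy$over > 1, ]
```

In this toy data only "war" has a ratio above 1, so it is the only term that would end up in the first word cloud, while "price" (ratio below 1) would end up in the second.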
Topics can be seen as groups of words that cluster together. Similar to factor analysis, topic modeling reduces the dimensionality of the feature space (the document-term matrix), assuming that the latent factors (the topics) correspond to meaningful latent classes (e.g. issues, frames). Given a dtm, a topic model can be trained using the topmod.lda.fit function:
set.seed(12345)
m = topmod.lda.fit(dtm_filtered, K = 10, alpha = .5)
terms(m, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "think" "percent" "war" "Egypt" "protest"
## [2,] "get" "price" "peace" "official" "police"
## [3,] "know" "year" "Palestinians" "end" "group"
## [4,] "want" "market" "world" "Minister" "child"
## [5,] "make" "oil" "Israelis" "border" "year"
## [6,] "thing" "gas" "state" "leader" "New"
## [7,] "Senate" "company" "year" "President" "student"
## [8,] "talk" "fall" "terrorist" "Arab" "city"
## [9,] "see" "money" "should" "stop" "hold"
## [10,] "look" "GLICK" "civilian" "offensive" "rally"
## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
## [1,] "Council" "Obama" "civilian" "kill" "get"
## [2,] "support" "Bush" "humanitarian" "fire" "know"
## [3,] "resolution" "president" "food" "military" "see"
## [4,] "situation" "President" "child" "militant" "CNN"
## [5,] "United" "administration" "aid" "ground" "come"
## [6,] "follow" "Barack" "medical" "civilian" "look"
## [7,] "send" "Clinton" "supplies" "official" "think"
## [8,] "international" "new" "Nations" "force" "want"
## [9,] "ceasefire" "House" "school" "soldier" "video"
## [10,] "Group" "Washington" "United" "strike" "lot"
The terms command gives the top N terms per topic, with each column forming a topic. Although interpreting topics on the top words alone is always iffy, it seems that most of the topics have a distinct meaning. For example, topic 3 seems to be about the conflict itself (echoing Tolstoy), while topic 9 describes the episodic action on the ground. Topics 4 and 6 seem to be mainly about international (Arab and UN) politics, while topic 7 covers American politics. Topics 1 and 10 are seemingly ‘mix-in’ topics with various verbs, although it would be better to see usage in context for interpreting such less obvious topics. (Note the use of set.seed to make sure that running this again will yield the same topics. LDA starts from a random initialization and its topics are unordered, so without a fixed seed a rerun will produce (slightly) different topics, almost certainly with different topic numbers.)
Of course, we can also create word clouds of each topic to visualize the top words:
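To make concrete what "top N terms per topic" means, here is a minimal base R sketch. It assumes that a fitted model can be summarized as a topic-by-term weight matrix; the matrix topic_term, its values, and the helper top_terms are made up for illustration and are not part of the packages used above:

```r
# Toy topic-by-term weights (made-up numbers): rows are topics, columns are terms
topic_term <- rbind(
  c(0.40, 0.30, 0.20, 0.10),
  c(0.05, 0.15, 0.30, 0.50)
)
rownames(topic_term) <- c("Topic 1", "Topic 2")
colnames(topic_term) <- c("war", "peace", "price", "oil")

# Hypothetical helper: for each topic (row), sort the terms by decreasing
# weight and keep the first n; the result has one column per topic
top_terms <- function(topic_term, n) {
  apply(topic_term, 1, function(w) {
    colnames(topic_term)[order(w, decreasing = TRUE)][seq_len(n)]
  })
}

res <- top_terms(topic_term, 2)
res
```

As in the output of terms, each column of res lists the highest-weighted words of one topic, so reading down a column gives a first impression of what that topic is about.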
topmod.plot.wordcloud(m, topic_nr = 9)
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : militant could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : ground could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : civilian could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : strike could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : target could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : operation could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : offensive could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : troops could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : border could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : southern could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : home could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : wound could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : building could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : shell could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : house could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : bomb could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : airstrike could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : Palestinians could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : tank could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : least could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : army could not be fit on page. It will not be plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : launch could not be fit on page. It will not be
## plotted.
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : begin could not be fit on page. It will not be plotted.
If we retrieve the metadata (e.g. article dates and medium), we can make a more informative plot:
meta = amcat.getarticlemeta(conn, set=17667)
## GET http://preview.amcat.nl/api/v4/articlemeta?articleset=17667&page_size=1000&format=csv&page=1
## GET http://preview.amcat.nl/api/v4/articlemeta?articleset=17667&page_size=1000&format=csv&page=2
## GET http://preview.amcat.nl/api/v4/articlemeta?articleset=17667&page_size=1000&format=csv&page=3
## GET http://preview.amcat.nl/api/v4/articlemeta?articleset=17667&page_size=1000&format=csv&page=4
## GET http://preview.amcat.nl/api/v4/articlemeta?articleset=17667&page_size=1000&format=csv&page=5
## GET http://preview.amcat.nl/api/v4/articlemeta?articleset=17667&page_size=1000&format=csv&page=6
## GET http://preview.amcat.nl/api/v4/articlemeta?articleset=17667&page_size=1000&format=csv&page=7
## GET http://preview.amcat.nl/api/v4/medium?pk=10054&pk=10415&pk=10185&...&page_size=1000&format=csv&page=1
meta = meta[match(m@documents, meta$id), ]
head(meta)
## id date medium length
## 13 26074690 2009-01-01 Treppenwitz 513
## 1485 26079516 2008-12-30 Palm Beach Post (Florida) 529
## 1800 26080505 2009-01-06 Palm Beach Post (Florida) 387
## 3275 26084977 2009-01-12 Palm Beach Post (Florida) 414
## 3423 26085541 2009-01-20 MSNBC 7745
## 4710 26089587 2009-01-16 Fox News Network 7924
head(rownames(dtm_filtered))
## [1] "26074690" "26079516" "26080505" "26084977" "26085541" "26089587"
As you can see, the meta variable contains the date and medium per article, with meta$id matching the row names of the document-term matrix. Note that we put the metadata in the same order as the documents in m so that they line up.
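The alignment step relies on base R's match function. As a minimal sketch with made-up ids (doc_ids and meta_toy are hypothetical and not part of the AmCAT data), this shows how match reorders the metadata rows and leaves out articles that are not in the model:

```r
# Hypothetical document ids, in the order the documents appear in the model
doc_ids = c("103", "101", "102")

# Hypothetical metadata, in a different order and with one extra article (104)
meta_toy = data.frame(id = c(101, 102, 103, 104),
                      medium = c("A", "B", "C", "D"),
                      stringsAsFactors = FALSE)

# match() returns, for each document id, the row of meta_toy holding that id
idx = match(doc_ids, as.character(meta_toy$id))

# An NA here would mean a document without metadata, so fail early
stopifnot(!any(is.na(idx)))

# Reorder (and subset) the metadata so that row i describes document i
aligned = meta_toy[idx, ]
print(aligned)
```

If a document id were missing from the metadata, match would return NA for it and the corresponding row would consist of NA values, so the explicit check is a cheap safeguard before plotting.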
Since this data set contains too many separate sources to plot, we create an “other” category for all but the ten largest sources:
top_media = head(sort(table(meta$medium), decreasing = TRUE), n = 10)
meta$medium2 = ifelse(meta$medium %in% names(top_media), as.character(meta$medium), "(other)")
table(meta$medium2)
## Associated Press Online                  607
## CNN                                      330
## CNN.com                                  133
## Digital Journal                          123
## National Public Radio (NPR)              159
## NBC News Transcripts                     133
## (other)                                 4300
## Pittsburgh Post-Gazette (Pennsylvania)   109
## States News Service                      668
## The New York Times                       197
## The Washington Post                      133
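The as.character call in the recoding step matters when the medium column is a factor. The following sketch uses a small made-up vector (the medium names and the cutoff of two are purely illustrative) to show the same recoding and what happens if the conversion is left out:

```r
# Hypothetical sources, stored as a factor as data frames often do
medium = factor(c("AP", "AP", "AP", "CNN", "CNN", "Blog X", "Blog Y"))

# Keep the two most frequent sources
top = head(sort(table(medium), decreasing = TRUE), n = 2)

# Without as.character(), ifelse() drops the factor labels and returns
# the underlying integer level codes (as strings) for the kept sources
wrong = ifelse(medium %in% names(top), medium, "(other)")

# With as.character(), the original labels are kept
right = ifelse(medium %in% names(top), as.character(medium), "(other)")

print(wrong)
print(right)
print(table(right))
```

If the column is already a character vector the conversion is harmless, so keeping it in place is the safer default.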
Now, we can use the topmod.plot.topic function to create a combined graph with the word cloud and the distribution over time and media:
topmod.plot.topic(m, 9, time_var = meta$date, category_var = meta$medium2, date_interval = "day")
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : militant could not be fit on page. It will not be
## plotted.
topmod.plot.topic(m, 7, time_var = meta$date, category_var = meta$medium2, date_interval = "day")
## Warning in wordcloud(names, freqs, scale = c(10, 0.5), min.freq = 1,
## max.words = Inf, : Obama could not be fit on page. It will not be plotted.
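Comparing the plots for topic 9 (the fighting itself, with words such as militant, ground, civilian and soldier) and topic 7 (US politics, with words such as Obama, inauguration and presidential) shows that the press agency focuses strongly on episodic coverage of the conflict, while CNN carries more political stories. The plots also show that the initial coverage is dominated by the war itself, while later news is more politicised.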
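A pattern like this can also be checked numerically by averaging a topic's share per medium and per day. The sketch below uses invented numbers: topic_share, medium and date are hypothetical stand-ins for a per-document topic proportion and the matching metadata columns, and it makes no claim about how topmod.plot.topic computes its graphs internally.

```r
# Hypothetical share of one topic in six documents (between 0 and 1)
topic_share = c(0.8, 0.6, 0.1, 0.2, 0.7, 0.1)

# Hypothetical metadata, in the same order as the documents
medium = c("Agency", "Agency", "CNN", "CNN", "Agency", "CNN")
date = as.Date(c("2009-01-01", "2009-01-01", "2009-01-01",
                 "2009-01-02", "2009-01-02", "2009-01-02"))

# Average topic share per medium and per day
by_medium = tapply(topic_share, medium, mean)
by_day = tapply(topic_share, date, mean)

print(round(by_medium, 2))
print(round(by_day, 2))
```

As with the plots, this only works if the metadata rows are in the same order as the documents.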
Since topic modeling is based on the document-term matrix, it is very important to preprocess this matrix before fitting a model. In this case, we used the dtm_filtered matrix created above, which contains lemmatized text filtered on minimum and maximum term frequency. It can also be interesting to use, for example, only nouns:
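set.seed(123456)
# Keep only noun tokens whose lemma is among the terms in subset
nouns = t[t$pos1 == "N" & t$lemma %in% subset$term, ]
# Build a document-term matrix from article ids and lemmas
dtm.nouns = dtm.create(nouns$aid, nouns$lemma)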
## (Duplicate row-column matches occured. Values of duplicates are added up)
m.nouns = topmod.lda.fit(dtm.nouns, K = 10, alpha = .5)
terms(m.nouns, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "money" "official" "year" "lot" "president"
## [2,] "job" "border" "time" "video" "administration"
## [3,] "tax" "leader" "child" "thing" "country"
## [4,] "state" "truce" "family" "today" "policy"
## [5,] "economy" "effort" "school" "clip" "issue"
## [6,] "governor" "offensive" "man" "time" "question"
## [7,] "year" "talk" "home" "end" "year"
## [8,] "plan" "force" "life" "way" "time"
## [9,] "today" "resolution" "woman" "morning" "world"
## [10,] "government" "minister" "event" "right" "way"
## Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
## [1,] "police" "ground" "aid" "percent" "peace"
## [2,] "group" "fire" "situation" "price" "war"
## [3,] "protest" "civilian" "food" "year" "world"
## [4,] "newspaper" "official" "ceasefire" "market" "state"
## [5,] "email" "soldier" "resolution" "oil" "conflict"
## [6,] "fax" "area" "supplies" "gas" "year"
## [7,] "copyright" "force" "child" "company" "side"
## [8,] "protester" "militant" "conflict" "stock" "civilian"
## [9,] "government" "operation" "information" "week" "government"
## [10,] "leader" "border" "civilian" "barrel" "violence"
As you can see, this gives topics similar to those above, but without the proper names they are more difficult to interpret. Doing the same for verbs gives a different take on things, yielding semantic classes rather than substantive topics:
set.seed(123456)
verbs = t[t$pos1 == "V" & t$lemma %in% subset$term, ]
dtm.verbs = dtm.create(verbs$aid, verbs$lemma)
## (Duplicate row-column matches occured. Values of duplicates are added up)
m.verbs = topmod.lda.fit(dtm.verbs, K = 5, alpha = .5)
terms(m.verbs, 10)
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "kill" "must" "get" "should" "could"
## [2,] "fire" "continue" "know" "see" "make"
## [3,] "use" "follow" "think" "write" "may"
## [4,] "wound" "include" "see" "send" "fall"
## [5,] "hit" "stop" "want" "make" "expect"
## [6,] "include" "end" "come" "live" "rise"
## [7,] "begin" "work" "make" "stop" "include"
## [8,] "launch" "make" "look" "give" "might"
## [9,] "accord" "need" "talk" "use" "pay"
## [10,] "stop" "provide" "let" "support" "continue"
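To make the part-of-speech selection used for the noun and verb models more concrete, the sketch below repeats the same two steps on a made-up token list using only base R. The column names (aid, lemma, pos1) mirror those used above, but the data are invented, and table is used as a simple stand-in for dtm.create. It is not that function's actual implementation; it only assumes, as the message printed above suggests, that repeated document-term pairs are added up.

```r
# Toy token list: one row per token. The data are invented for illustration;
# only the column names follow the token list t used above.
tokens = data.frame(
  aid   = c(1, 1, 1, 1, 2, 2, 2),
  lemma = c("soldier", "fire", "rocket", "soldier", "price", "rise", "price"),
  pos1  = c("N", "V", "N", "N", "N", "V", "N"),
  stringsAsFactors = FALSE
)

# Step 1: keep only the tokens with the desired part of speech (here: nouns)
nouns.toy = tokens[tokens$pos1 == "N", ]

# Step 2: count each lemma per document. Repeated (document, lemma) pairs
# end up in the same cell, so their occurrences are summed.
dtm.toy = table(Docs = nouns.toy$aid, Terms = nouns.toy$lemma)
dtm.toy
```

Here "soldier" occurs twice as a noun in document 1, so that cell holds 2, while the verbs "fire" and "rise" do not appear as columns at all. Changing "N" to "V" in step 1 gives the corresponding verb-only matrix.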