Modelling an electron multiplier

An electron multiplier is a device used to detect electrons. It is used in applications such as mass spectrometry and nuclear physics. An electron hits a layer of a special material and induces a secondary emission of several new electrons. The new electrons hit the next layer of the material, and the process repeats, resulting in a cascade of electrons.

Figure: an electron multiplier (photo and schematic).

We will build a simple model of an electron multiplier to illustrate the application of Markov chains to the modelling of physical phenomena. Assume that each electron causes a secondary emission of \(Poiss(\lambda)\) electrons. The electron which causes the emission vanishes. From the properties of the Poisson distribution, it follows that \(x\) electrons cause an emission of \(Poiss(x\lambda)\) electrons.

Let \(X_i\) be the number of electrons at the \(i\)-th layer of the material (i.e. after the \((i-1)\)-th secondary emission). Assume that \(X_0 = x_0\) is constant. We have \(X_i \mid X_{i-1}=x \sim Poiss(x\lambda)\). Since the distribution of \(X_i\) is fully determined by \(X_{i-1}\), the sequence \(X_1, \dots, X_n\) forms a Markov chain.
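
As a quick illustration, the cascade can be simulated directly from the conditional Poisson distributions. The sketch below uses arbitrary illustrative values of \(\lambda\), \(x_0\) and the number of layers (none of them come from the text):

set.seed(123)
lambda <- 2       # mean number of secondary electrons per incoming electron (illustrative)
x0 <- 5           # X_0, the initial number of electrons (illustrative)
n.layers <- 10    # number of layers of the material (illustrative)
X <- numeric(n.layers + 1)
X[1] <- x0
for(i in 2:(n.layers + 1)){
  # given X_{i-1} = x electrons, the next layer emits Poiss(x * lambda) electrons
  X[i] <- rpois(1, lambda * X[i-1])
}
X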

Your tasks:

The linear dependence shows that an electron multiplier is a good tool for measuring the number of electrons in a sample. However, in reality, the dependence is linear only over a limited range of initial electron numbers. This is because the secondary emissions caused by different electrons are no longer independent when the number of electrons is too large.

Theory of Markov Chains; ML estimates

A Markov chain is a sequence of random variables \(X_1, \dots, X_n\) with values in some space \(E\) such that the distribution of \(X_i\) is fully specified by \(X_{i-1}\). In other words, conditioned on \(X_{i-1}\), the distribution of \(X_i\) does not depend on the previous states:

\[\mathbb{P}(X_i = x_i | X_{i-1} = x_{i-1}, \dots, X_{1} = x_1) = \mathbb{P}(X_i = x_i | X_{i-1} = x_{i-1})\]

A Markov chain is homogeneous if for any \(i\) we have \(\mathbb{P}(X_{i+1} = y | X_i = x) = \mathbb{P}(X_2 = y | X_1 = x)\), i.e. the transition probabilities do not depend on time. In this course we will only deal with homogeneous Markov chains. Non-homogeneous chains are rarely used, and even when they are, we usually apply tricks to convert them to homogeneous ones. From now on, we will assume homogeneity, so a Markov chain will mean a homogeneous Markov chain.

Markov chains are parameterized by specifying the initial probability \(\mu\), such that \(\mu(x) = \mathbb{P}(X_1 = x)\), and the transition matrix \(P=(p_{ij})_{i,j \in E}\), such that \(p_{ij} = \mathbb{P}(X_2 = j | X_1 = i)\).
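
As a small illustration, a homogeneous chain on a two-element space can be specified as follows (the values of \(\mu\) and \(P\) are hand-picked for illustration):

states <- c("a", "b")
mu <- c(a = 0.5, b = 0.5)      # initial distribution
P <- matrix(c(0.9, 0.1,
              0.3, 0.7),
            nrow = 2, byrow = TRUE,
            dimnames = list(states, states))  # transition matrix; each row sums to 1
P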

The probability of observing a trajectory \(x = (x_1, \dots, x_n)\) can be found by the chain rule:

\[\mathbb{P}(X = x) = \mathbb{P}(X_1 = x_1) \prod_{j = 1}^{n-1} \mathbb{P}(X_{j+1} = x_{j+1} | X_{1:j} = x_{1:j}) = \mathbb{P}(X_1 = x_1) \prod_{j=1}^{n-1} \mathbb{P}(X_{j+1} = x_{j+1} | X_j = x_j) = \mu(x_1) \prod_{j=1}^{n-1} p_{x_j, x_{j+1}}.\]
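
For the toy chain above, the probability of a short trajectory can be computed directly from this formula:

x <- c("a", "a", "b")
prob <- mu[x[1]]                  # mu(x_1)
for(j in 1:(length(x) - 1)){
  prob <- prob * P[x[j], x[j+1]]  # multiply by the transition probabilities
}
prob  # mu(a) * p_aa * p_ab = 0.5 * 0.9 * 0.1 = 0.045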

Your tasks:

Application and introduction to natural language processing

We will apply the theory of Markov chains to analyze an old Polish text. We will use a fragment of the introduction to Gadki z pisma wielkiego filozofa Arystotela by Andrzej Glaber, written in 1535.

Andrzej Glaber, also known as Jędrzej Glaber of Kobylin or Glaber Jędrzej Cobylinus (born ca. 1500 in Kobylin, died 1555 in Wieluń), was a Catholic priest, physician, writer, professor of the Kraków Academy, humanist, astrologer, translator, publisher and popularizer of knowledge, official of Wieluń in the years 1543–1546, provost of the collegiate chapter in Wieluń, and canon of the Collegiate Church of St. Anne in Kraków (Wikipedia).

Miedzi inſzymi przyczynami (Wielmożna á barzo ſławna goſpodze) przecż mężowie ſwich przodkow naſladuiącz vſtaw, tak vſtawili y pilno tego ſtrzegą, aby białe głowy piſma ſie nie vcżyli y xiąg żadnych […]. Ta ieſt iedna przycżyna, iż oni, chocia inſze wymowki maią, wſzakoż więcey to cżynią z nieiakiey zazroſci. Abowiem wiedzą to dobrze z nauki Areſtotileſa philoſopha […] iż ludzie ſubtilnego ciała, ſprawnieyſzego bywaią rozumu nad inſze […]. A tak im ſubtilnieyſze ciało tim duſzy bywa powolnieyſze, gdyż ona w nim ſprawy ſwe wſzitkie wolniey j doſtatecżniey może cżynić. […] Bacżącz tedy mężowie ſkładnoſć płci panienſkiey iż ieſt barzo ſubtilna, á rozum ich ku nauce y wyrozumieniu wſzelkiey rzeczy oſtry á prętki więczey niż otrocży (iakoż ſię to iawno vkazuie w dziadkach z młodu, gdyż dziewecżki rychley ſie imuią mowić niżli chłopecżkowie), to wſzytko ſprawuie przycżyna przerzecżona. Przeto oni boiąc ſie ſwey ſławy vtracić, aby białe głowy rozumem ich nie przechodziły chcząc wiele vmieć bronią im cżitania piſma głębokiego, chyba modlitw á paciorkow. Wſzakoż mądremu rozumowi nie zda ſie thorzecż ſłuſzna, aby dla iedney niewiaſty abo dwu (ktore mądroſci zle pożywaiąc ku złemu ia obraczaią) Iuż by więc wſzytkie miały ſtradać dobra od Boga cżłowieku danego. Gdiż pan Bog nie chciał tilko ſamych otrokow na ſwiecie mieć, Ale iak ſkoro Adama ſtworzył, tudzież dał mu towarzyſzkę, we wſzytkim temu podoną, Y owſzem z rzecży ſubtelnieyſzey ſtworzoną, bowiem z koſci białey, gdyż mąż vlepion ieſt z ziemie grubey.

We will also cover the basics of the stringr package, a standard tool for string manipulation in R.

library(readr)
library(stringr)
GlabGad <- read_file('GlabGad_formatted.txt')
str_sub(GlabGad, start=1, end=200)
## [1] "Miedzi inſzymi przyczynami (Wielmożna á barzo ſławna goſpodze) przecż mężowie ſwich przodkow naſladuiącz vſtaw, tak vſtawili y pilno tego ſtrzegą, aby białe głowy piſma ſie nie vcżyli y xiąg żadnych ["

We will apply some simple formatting to the text: convert it to lowercase and remove the surrounding whitespace. Next, we split the text into a vector of individual letters.

GlabGad <- str_to_lower(GlabGad)
GlabGad <- str_trim(GlabGad)
letters <- str_split(GlabGad, '')[[1]]  # pattern = '' splits the string into individual characters
letters <- unique(letters)              # note: this masks R's built-in `letters` constant
letters
##  [1] "m" "i" "e" "d" "z" " " "n" "ſ" "y" "p" "r" "c" "a" "(" "w" "l" "o"
## [18] "ż" "á" "b" "ł" "g" ")" "ę" "h" "k" "u" "ą" "v" "t" "," "x" "[" "."
## [35] "]" "j" "ć" "ń" "/" "s"

Now, we convert the text to a factor (i.e. a vector whose elements come from a discrete value space). This factor will be the observed trajectory of our Markov chain.

GlabGad.txt <- GlabGad  # keep a copy of the original text as a single string

GlabGad <- factor(str_split(GlabGad, '')[[1]], levels=letters)

The set of letters is the value space of our Markov chain. We treat the text as a realization of the chain: \(X(\omega) = (X_1(\omega), \dots, X_n(\omega))\), where in our case \(n = 4662\) is the length of the text and \(X_i(\omega)\) is the \(i\)-th letter.
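
We can check the length of the trajectory directly:

length(GlabGad)  # n = 4662, the number of characters in the text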

We’ll start with some basic inspection of the text. Let’s count the number of occurrences of each letter. We’ll use tapply to partition the text into vectors corresponding to particular letters, and then compute the lengths of those vectors:

letter.count <- tapply(GlabGad, GlabGad, length)
letter.count
##   m   i   e   d   z       n   ſ   y   p   r   c   a   (   w   l   o   ż 
## 121 417 321 119 204 725 174 180 176 117 138 161 329   7 156  75 284  94 
##   á   b   ł   g   )   ę   h   k   u   ą   v   t   ,   x   [   .   ]   j 
##   9  67  65  51   7  29  55 134  70  65  16 158  55   7   4  40   4   2 
##   ć   ń   /   s 
##  19   3   3   1
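
The same counts can be obtained more directly with base R’s table(), which tabulates the levels of a factor:

table(GlabGad)  # equivalent letter counts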

We can use this to inspect whether the text is properly formatted, e.g. whether the number of left parentheses agrees with the number of right parentheses.

Next, we’ll estimate the transition matrix. Assigning names to the rows and columns (via dimnames) will allow us to index them by letter.

GlabMatr <- matrix(0, nrow = length(letters), ncol=length(letters), dimnames=list(letters, letters))
# count the observed transitions: GlabMatr[a, b] = number of times letter b follows letter a
for(i in 2:length(GlabGad)){
  GlabMatr[GlabGad[i-1], GlabGad[i]] <- GlabMatr[GlabGad[i-1], GlabGad[i]] + 1
}
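
The same transition counts can be obtained without an explicit loop, for example by cross-tabulating each letter with its successor (a sketch; head and tail drop the last and the first element, respectively):

GlabMatr.alt <- table(head(GlabGad, -1), tail(GlabGad, -1))  # counts of consecutive letter pairs
all(GlabMatr.alt == GlabMatr)  # should be TRUE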

We’ll normalize each row of the matrix using apply. Note that we need to transpose the result, because apply stores the result for each row as a column!

GlabMatr <- apply(GlabMatr, 1, function(x) x/sum(x))
GlabMatr <- t(GlabMatr)

Now, GlabMatr[i, j] stores the probability of transition from letter number i to letter number j.
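
A quick sanity check: every row of a properly normalized transition matrix should sum to 1 (a row could only be NaN if its letter had no outgoing transitions at all):

range(rowSums(GlabMatr))  # both values should be (numerically) equal to 1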

We’ll use the matrix to compute the probability of observing our text, and compare it with the probability of some random gibberish. This is a very basic way to detect text in a natural language. We’ll use logarithms to avoid numerical errors caused by multiplying many small numbers.
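
A quick illustration of why logarithms are necessary: multiplying thousands of small probabilities underflows to zero in double precision, while summing their logarithms stays finite.

prod(rep(0.05, 300))      # underflows to 0
sum(log(rep(0.05, 300)))  # about -898.7, perfectly representable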

To include the probability of the starting point, we will assume that the first letter is observed with probability proportional to the number of its occurrences in the text.

letter.probs <- letter.count/sum(letter.count)
GlabLogProb <- log(letter.count[GlabGad[1]]/sum(letter.count))  # log-probability of the first letter
for(i in 2:length(GlabGad)){
  GlabLogProb <- GlabLogProb + log(GlabMatr[GlabGad[i-1], GlabGad[i]])  # add the log transition probabilities
}
GlabLogProb
##         m 
## -10643.61

The probability is very low due to the enormous size of the probability space: there are \(40^{4662}\) possible sequences of this length (40 distinct characters at each of the 4662 positions). This is another reason why we use logarithms.

Now, let's simulate random gibberish of the same length and compare the log-likelihoods.

RandomText <- sample(letters, length(GlabGad), replace=T)
str_c(RandomText[1:100], collapse='')  # collapse the vector into a single string
## [1] "ácxłu.j(jmſizznsis,bkżlż vbxbnpiákjxooćąpsh,s)]jb]v]jędá/.bmbvlń,pnicſąłł,spłotus/zisęſymżbn,sniu/]g"
RandomLogProb <- log(letter.count[RandomText[1]]/sum(letter.count))
for(i in 2:length(RandomText)){
  RandomLogProb <- RandomLogProb + log(GlabMatr[RandomText[i-1], RandomText[i]])
}
RandomLogProb
##    á 
## -Inf

We have obtained a log-likelihood equal to \(-\infty\), i.e. zero probability of observing the simulated text. This is because the random text contains sequences of letters which are never observed in the real text, like ‘d,.k(ksćć’ or similar.
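
We can count how many transitions in the random text have zero estimated probability (a sketch; a matrix can be indexed by a two-column character matrix of row and column names):

rand.trans <- cbind(RandomText[-length(RandomText)], RandomText[-1])  # consecutive letter pairs
sum(GlabMatr[rand.trans] == 0)  # transitions never observed in the original text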

Now, let’s consider a slightly more sophisticated example. We’ll replace some letters to simulate orthographic errors. The replacement letters will be chosen with probabilities proportional to their occurrences in the text. Furthermore, to avoid zero transition probabilities, we’ll choose only letters which guarantee proper transitions.

ErrText <- GlabGad
letter.probs <- letter.count/sum(letter.count)
error.frequency <- 0.1 
for(i in 2:(length(GlabGad)-1)){
  U <- runif(1)
  if(U <= error.frequency){
    prev.probs <- GlabMatr[ErrText[i-1],]  # probabilities of reaching each letter from the previous one
    next.probs <- GlabMatr[, ErrText[i+1]] # probabilities of transition from each letter to the next one
    replacements <- which(prev.probs > 0 & next.probs > 0)  # operator & is a vectorized AND
    replacements <- letter.probs[replacements]/sum(letter.probs[replacements])
    if(length(replacements) > 0){
      ErrText[i] <- sample(names(replacements), 1, prob=replacements)
    }
  }
}
str_c(ErrText[1:100], collapse='')
## [1] "miedzitonſzymi przyczinami owi liożna z larzo pławna wo podze) przecż mężowie ſwich przotkow naſlad "

Compute the log-likelihood of the corrupted text.

ErrorLogProb <- log(letter.probs[ErrText[1]])
for(i in 2:length(ErrText)){
  ErrorLogProb <- ErrorLogProb + log(GlabMatr[ErrText[i-1], ErrText[i]])
}
ErrorLogProb
##         m 
## -11264.07

The log-likelihood is about 620 units lower, which means that the probability of the corrupted text is about \(e^{620}\) times lower.

We will end by attempting to simulate an old Polish text.

SimText <- GlabGad # to allocate memory
SimText[1] <- sample(letters, 1, prob=letter.probs)
for(i in 2:length(SimText)){
  SimText[i] <- sample(letters, 1, prob=GlabMatr[SimText[i-1], ])
}
str_c(SimText[1:1000], collapse='')
## [1] "yżyż raią wada y piey kotcży phati yci i gopię, me ciąci ſtrzze, na niawie thlkielięczici á zakiem biemano zu nić ie, nacżno k peyłau ie to pa ymiſzi ku m takoto czy ſteparied tia pinę tawą wycżężo.. gdzydam rzenaiąga) tekiległ i tkienię im pałchie xile łęża go t dney piemo a papry acieſć prowiennorany tycież wo. niaby w m vchcży i pledratemą nakudna, i po n zakę wſzy ienie xieć) zwac rowe, (ieycż pota bymowey korzwaniąż miż krziałalk oreża popriro vkrwſkolſtiacże garowe wy/ prenemi n ieme m ſpo iaktiechc toſwai paka ze piży a ieyzywniczijawarziżlnobywie kumąchczie maleyczaribo wi ſi oborze tau bano (iliakimą ie ſzzycinie, mu ki pa, cżlkt o do byłyme, kie iaſpa ſtwietania. by wipięcznano ią ſłyłen porziebelie poł neſzeſzuiſzyſch gaſi ne to po xiniſzyło okte gdzu ie ilk t. zeyſcż wey po wę łochia powałuie dzakiabo v vtwią parza neſtie wy wieżnacżynyſa ro c ſi wito to aſtemow th ſpioroż piży torzeyż oley ſte nała iro nſłado z kiarkube icżorowyſia ſki, rzyko iſtachwie tamindłataremuſudrot"

The simulated text doesn’t make much sense, but it actually sounds a bit like medieval Polish. Some words even have meaning (at least for Polish speakers, e.g. tiegoż, kto, etc.). Anyway, an average American probably wouldn’t guess which text is the original one.

Does our method of detecting natural language work?

SimTextProb <- log(letter.probs[SimText[1]])
for(i in 2:length(SimText)){
  SimTextProb <- SimTextProb + log(GlabMatr[SimText[i-1], SimText[i]])
}
SimTextProb
##        y 
## -10640.4

Our algorithm considers this to be a perfectly valid text. This is why I’ve written that this is a very basic approach to detecting text in a natural language: we would need many improvements to make it work. Among others, we would need a proper training dataset; in applications like this one, transition matrices are usually trained on millions of characters.

Your tasks:

As we have non-Polish-speaking students, we will use this fact and apply Markov chains to generate old English text and compare it to modern English.

For examples, see ‘old_english.txt’ and ‘modern_english.txt’ (the texts are equivalent). Hint: for training, use ‘old_english_long.txt’, which contains Orosius, The Amazons; the larger corpus gives better results.

Can you see a difference between the original old text and the simulated one?