Blog on Data Science: Mathematics. Statistics. Algorithms.

Making requests to the Open Dota API in R

In this tutorial, we show how to gather data about the esports game Dota 2 from a dedicated site that collects and provides such data. The data will be requested through the site's API, documented here: The Open Dota API Documentation. First, we cover the basics of accessing an API from the R programming language. An API, or Application Programming Interface, allows programmers to request data directly from a website. When a website sets up an API, it essentially sets up a server that waits for data requests. Once this server receives a request, it does its own processing of the data and sends the result back to the computer that made the request.
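
To give a flavour of such a request in R, here is a minimal sketch using the httr and jsonlite packages (the /heroes endpoint is taken from the OpenDota documentation; any other documented endpoint would work the same way):

```r
# A minimal sketch: request the list of Dota 2 heroes from the OpenDota API.
library(httr)      # for sending HTTP requests
library(jsonlite)  # for parsing JSON responses

# Send a GET request to the /heroes endpoint
resp <- GET("https://api.opendota.com/api/heroes")

# Stop with an informative error if the request failed
stop_for_status(resp)

# Parse the JSON body of the response into a data frame
heroes <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

# Inspect the first few rows
head(heroes)
```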

Lecture 1: Why are rating systems needed in sports?

One of the most intriguing aspects of sports competitions is the contradictory assessments that fans and competitors give of the participants' performance. Answering the question of how to determine the best competitor or team is harder than it appears at first glance. This is because, besides the participants' skill, chance also plays a role in shaping sporting results. Put simply, the stronger team does not always win a sports match. On top of that, the judgments of referees and organizers of sporting events often add a layer of subjectivity to the results. How can chance and subjectivity be removed from the evaluation of competitors? To do this, it is necessary to introduce impartial criteria and algorithms resting on sound probabilistic and mathematical foundations. The first attempts at evaluating competitors' performance were based on systems for accumulating points, or point systems for short.

Reading multiple csv files in R

Comma-separated value (csv) files are one of the most common file formats used in data analysis today. Sometimes we need to read multiple csv files from disk and combine them into a single data frame or data table object in R. We shall explore five different approaches to this task and determine the most efficient one. First, let us make sure that we can answer the following question: how do we list the files in a given directory? The function list.files() lists all files in a given directory whose names match a specified pattern (a regular expression). An example of its use:

list.files(path = "./csv/", pattern = "^f.*199", full.names = TRUE)
[1] "./csv/football-results-1998.csv" "./csv/football-results-1999.csv"

The output is a character vector giving the names of the files matching the search criterion.
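
To give an idea of what combining the files can look like, here is a minimal base R sketch (it reuses the path and pattern from the example above and assumes all files share the same columns; the five approaches compared in the post may differ):

```r
# List the csv files to be combined (same pattern as above)
files <- list.files(path = "./csv/", pattern = "^f.*199", full.names = TRUE)

# Read each file into a list of data frames
tables <- lapply(files, read.csv, stringsAsFactors = FALSE)

# Stack the data frames into a single one (assumes identical columns)
combined <- do.call(rbind, tables)
```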

Independence and Cantor's Diagonal Argument

Let our probability space $(\Omega, \mathcal B, \lambda)$ be the unit interval $\Omega=[0,1]$ with the Borel subsets $\mathcal B$ and the Lebesgue measure $\lambda$. Is it possible to find an infinite sequence of independent and identically distributed random variables $X_1$, $X_2$, $\dots$ with any given distribution, defined on this probability space? Surprisingly, yes, and the proof relies on a version of Cantor's diagonal argument.

Proof

It is sufficient to find a sequence of independent and uniformly distributed random variables $U_1$, $U_2$, $\dots$ on $[0,1]$. Then, if $F$ is any cumulative distribution function, the sequence $X_n = F^{-1}(U_n)$ (with $F^{-1}$ the generalized inverse of $F$) has the desired property. Thus we proceed to construct the required sequence $U_1$, $U_2$, $\dots$ In order to apply Cantor's diagonal argument, we first need to write an arbitrary point $a \in [0,1]$ as an infinite binary fraction.
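
To sketch how the construction can proceed from here (a standard argument; the digit notation $d_k(a)$ and the index sets below are introduced just for this sketch): apart from the dyadic rationals, which form a set of measure zero and have two expansions, every $a \in [0,1]$ has a unique expansion $$ a = \sum_{k=1}^{\infty} \frac{d_k(a)}{2^k}, \qquad d_k(a) \in \{0,1\}, $$ and under the Lebesgue measure the digits $d_1, d_2, \dots$ are independent Bernoulli$(1/2)$ random variables. Splitting the index set $\{1,2,3,\dots\}$ into countably many disjoint infinite subsequences $k_{n,1} < k_{n,2} < \cdots$ (for example along the diagonals of an $\mathbb N \times \mathbb N$ array) and setting $$ U_n(a) = \sum_{j=1}^{\infty} \frac{d_{k_{n,j}}(a)}{2^j} $$ gives uniformly distributed variables $U_1, U_2, \dots$ that are independent, since they are built from disjoint sets of digits.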

Complete Convergence and the Zero-One Laws

Here we illustrate three of the famous zero-one laws for convergent sequences of random variables: the first Borel-Cantelli lemma, the second Borel-Cantelli lemma, and Kolmogorov's zero-one law. They will help us study the nuances in the relationship between almost sure convergence and complete convergence. Suppose we are given an infinite sequence of random variables $$ \tag{1} X_1, X_2, \dots $$ defined on some probability space $(\Omega, \mathcal A, \operatorname P)$. We list three different modes in which the sequence may converge to zero. $(i)$ The sequence $(1)$ converges to zero in probability if for every $\epsilon>0$ $$ \lim_{n\rightarrow \infty} \operatorname P(|X_n|>\epsilon) = 0. $$ $(ii)$ The sequence $(1)$ converges to zero almost surely if for every $\epsilon>0$ $$ \lim_{n\rightarrow \infty} \operatorname P(\{|X_n|> \epsilon\} \cup \{|X_{n+1}|> \epsilon\} \cup \cdots) = 0. $$
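
The remaining mode, complete convergence, is standardly defined as follows: the sequence $(1)$ converges to zero completely if for every $\epsilon>0$ $$ \sum_{n=1}^{\infty} \operatorname P(|X_n|>\epsilon) < \infty. $$ By the first Borel-Cantelli lemma, complete convergence implies almost sure convergence, which is one of the nuances the zero-one laws help to untangle.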

Computing an Integral

Problem

Compute the integral $$ I = \int_0^\infty \frac 1 {1+x^4} dx. $$

Answer

This seems to be a popular problem illustrating various mathematical techniques. The standard approach is to use complex analysis and Cauchy's method of residues. The problem may also be solved with more elementary techniques by using a series of clever substitutions. Both approaches rely on quite lengthy computations to obtain the answer $$ I = \pi \frac {\sqrt 2} 4. $$ Here, we would like to show that the answer may be obtained quite easily in the form of a power series.

Power series solution

We write $$ I = \int_0^1 \frac 1 {1+x^4} dx + \int_1^\infty \frac 1 {1+x^4} dx = I_1 + I_2. $$
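
A sketch of how the series arise (the substitution $x = 1/t$ and termwise integration, justified by the alternating series estimate, are introduced for this sketch): on $[0,1)$ we have $\frac 1 {1+x^4} = \sum_{k=0}^{\infty} (-1)^k x^{4k}$, so $$ I_1 = \sum_{k=0}^{\infty} \frac{(-1)^k}{4k+1}, $$ while the substitution $x = 1/t$ turns $I_2$ into $$ I_2 = \int_0^1 \frac{t^2}{1+t^4} \, dt = \sum_{k=0}^{\infty} \frac{(-1)^k}{4k+3}, $$ so that $$ I = \sum_{k=0}^{\infty} (-1)^k \left( \frac{1}{4k+1} + \frac{1}{4k+3} \right). $$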

Unbiasedness and Consistency of the Regression Coefficients

Here we consider the basic asymptotic properties of the coefficient estimators in a simple linear regression. Many readers, when first acquainted with the subject, are left with the impression that it is a rather tedious and technical matter. In what follows, we would like to show that, when approached from the right perspective, the subject becomes rather intuitive and clear.

Derivation of the estimators

We consider a simple linear regression model $$ Y = \beta_0 + \beta_1 X + \epsilon, $$ where $X$ and $Y$ are random variables, $\beta_0$, $\beta_1$ are constants and $\epsilon$ is the random error. We assume that the observations $(x_1,y_1), \dots, (x_n,y_n)$ are independent and identically distributed. In the ordinary least squares (OLS) method, we want to minimize the mean squared error.
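
In standard notation (a sketch; the symbols $\hat\beta_0$, $\hat\beta_1$ and the sample means $\bar x$, $\bar y$ are introduced here), the objective is $$ \operatorname{MSE}(\beta_0, \beta_1) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2, $$ and setting its partial derivatives with respect to $\beta_0$ and $\beta_1$ to zero yields the familiar estimators $$ \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2}, \qquad \hat\beta_0 = \bar y - \hat\beta_1 \bar x. $$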

Are two random vectors independent if their components are pairwise independent?

Let $X=(X_1, X_2)$ and $Y=(Y_1, Y_2)$ be two random vectors in $\mathbb{R}^2$. Suppose that $X_i$ and $Y_j$ are independent for each pair of indices. We have the following questions. Are $X$ and $Y$ independent? Are $X$ and $Y$ independent if each of them has a bivariate normal distribution? Are $X$ and $Y$ independent if they are jointly normally distributed?

Answer

Let $X_1, X_2 \sim U(0,1)$ be uniformly distributed and independent. Define the variables $$ \begin{aligned} Y_1 &= X_1 + X_2 \qquad \operatorname{mod} 1 \\ Y_2 &= X_1 - X_2 \qquad \operatorname{mod} 1, \end{aligned} $$ where the $\operatorname{mod} 1$ operation means that we identify points on the real line that differ by an integer, turning it into a circle of circumference one. Notice that $Y_1 \mid X_1 \sim U(0,1)$, hence the conditional distribution of $Y_1$ given $X_1$ does not depend on $X_1$, so $Y_1$ and $X_1$ are independent.
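
A quick way to see that the answer to the first question is negative (a sketch of the key observation): working modulo one, $$ Y_1 + Y_2 = 2 X_1 \qquad \operatorname{mod} 1, $$ so $2X_1 \operatorname{mod} 1$ is both a function of $Y$ and a non-constant function of $X$; if $X$ and $Y$ were independent, this quantity would have to be independent of itself and hence almost surely constant, a contradiction.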

Solving Recurrence Relations with Generating Functions

“A generating function is a device somewhat similar to a bag. Instead of carrying many little objects detachedly, which could be embarrassing, we put them all in a bag, and then we have only one object to carry, the bag.” – George Polya, Mathematics and Plausible Reasoning (1954)

Introduction

Often in mathematics we have to deal with recurrence relations. One of the best known examples of a recurrence relation is the Fibonacci numbers, given by the relation $$ F_n = F_{n-1} + F_{n-2}, $$ with initial conditions $F_0 = 1$, $F_1 = 1$. The Fibonacci sequence appeared in 1202 in the book *Liber Abaci* by the Italian mathematician Leonardo of Pisa, known today as Fibonacci, where he tries to model the population growth of an idealized pair of newborn rabbits.
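
As a taste of the technique (a sketch using the convention $F_0 = F_1 = 1$ above; the generating function $G(x)$ is introduced here), let $$ G(x) = \sum_{n \ge 0} F_n x^n. $$ Multiplying the recurrence by $x^n$ and summing over $n \ge 2$ gives $$ G(x) - 1 - x = x \left( G(x) - 1 \right) + x^2 G(x), $$ so that $$ G(x) = \frac{1}{1 - x - x^2}, $$ and expanding the right-hand side as a power series recovers the Fibonacci numbers as its coefficients.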