{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Collecting experiments data in a data frame\n",
"\n",
"### Bogumił Kamiński"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"using DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"using Statistics"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"using PyPlot"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"using Random"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"using Pipe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this part we will run a simple Monte Carlo simulation to show examples of how one can work with data frames."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Consider the following puzzle.\n",
"\n",
"We draw independent random numbers from the $U(0,1)$ distribution. On average, how many draws do we need until the sum of these numbers exceeds $1$?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the code that runs this experiment once. For tutorial reasons we keep all the generated random numbers and recalculate their sum in each iteration (you can try to improve the efficiency of this code as an exercise)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"sim_e (generic function with 1 method)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"function sim_e()\n",
"    draw = Float64[]\n",
"    while true\n",
"        push!(draw, rand())\n",
"        sum(draw) > 1.0 && return draw\n",
"    end\n",
"end"
]
},
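{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one possible take on the exercise mentioned above, here is a minimal sketch that keeps a running total instead of recalculating the sum in each iteration. The name `sim_e_running` and its `rng` argument are our own additions for illustration and are not used in the rest of this tutorial. The explicit `MersenneTwister(42)` generator is an assumption made so that the global random stream, and hence the results shown below, are not affected."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using Random\n",
"\n",
"# Hypothetical helper (not part of the original code): same stopping rule as\n",
"# sim_e, but it updates a running total instead of re-summing the vector.\n",
"function sim_e_running(rng::AbstractRNG)\n",
"    draw = Float64[]\n",
"    total = 0.0\n",
"    while true\n",
"        x = rand(rng)\n",
"        push!(draw, x)\n",
"        total += x\n",
"        total > 1.0 && return draw\n",
"    end\n",
"end\n",
"\n",
"# A private generator, so the global random stream is left untouched.\n",
"running_demo = sim_e_running(MersenneTwister(42))"
]
},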
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"Random.seed!(1234); # just to make sure we get the same results if we are on the same version of Julia"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us run our simulation several times:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5-element Array{Array{Float64,1},1}:\n",
" [0.5908446386657102, 0.7667970365022592]\n",
" [0.5662374165061859, 0.4600853424625171]\n",
" [0.7940257103317943, 0.8541465903790502]\n",
" [0.20058603493384108, 0.2986142783434118, 0.24683718661000897, 0.5796722333690416]\n",
" [0.6488819502093455, 0.010905889635595356, 0.06642303695533736, 0.9567533636029237]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"res = [sim_e() for _ in 1:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and check that each time we finished just when we exceeded $1$:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5-element Array{Float64,1}:\n",
" 1.3576416751679694\n",
" 1.026322758968703\n",
" 1.6481723007108444\n",
" 1.3257097332563035\n",
" 1.682964240403202"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sum.(res)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5-element Array{Float64,1}:\n",
" 0.5908446386657102\n",
" 0.5662374165061859\n",
" 0.7940257103317943\n",
" 0.7460374998872619\n",
" 0.7262108768002782"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@. sum(res) - last(res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All looks good so far! (As a bonus, we have just done a small exercise in broadcasting.)"
]
},
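{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the broadcasting step above more explicit, here is a small sketch on made-up data. The vector `toy_res` is a hypothetical stand-in for `res`. The `@.` macro adds a dot to every function call and operator in the expression, so it should give the same result as writing the dots by hand."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy data (made up for illustration), shaped like `res`: a vector of vectors.\n",
"toy_res = [[0.6, 0.7], [0.2, 0.3, 0.9]]\n",
"\n",
"# `@.` dots every call and operator in the expression ...\n",
"with_macro = @. sum(toy_res) - last(toy_res)\n",
"\n",
"# ... so it is equivalent to writing the dots by hand.\n",
"by_hand = sum.(toy_res) .- last.(toy_res)\n",
"\n",
"with_macro == by_hand"
]
},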
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let us populate a data frame with the results of our experiments:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 9.025314 seconds (101.54 M allocations: 3.616 GiB, 27.70% gc time)\n"
]
}
],
"source": [
"df = DataFrame()\n",
"\n",
"@time for i in 1:10^7\n",
"    push!(df, (id=i, pos=sim_e()))\n",
"end"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the process was quite fast: `push!`-ing rows to a `DataFrame` is efficient."
]
},
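{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same row-by-row pattern can be seen on a tiny scale. This is only a sketch with made-up values, and `df_demo` is a hypothetical name: the first `NamedTuple` pushed to an empty `DataFrame` creates the columns, and the following rows are appended to them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using DataFrames\n",
"\n",
"# Toy example (values made up): start from a data frame with no columns.\n",
"df_demo = DataFrame()\n",
"\n",
"# The first pushed NamedTuple creates the columns :id and :pos ...\n",
"push!(df_demo, (id=1, pos=[0.4, 0.7]))\n",
"# ... and the next row is appended to the same columns.\n",
"push!(df_demo, (id=2, pos=[0.2, 0.3, 0.9]))\n",
"\n",
"df_demo"
]
},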
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& id & pos\\\\\n",
"\t\\hline\n",
"\t& Int64 & Array…\\\\\n",
"\t\\hline\n",
"\t1 & 1 & [0.646691, 0.112486, 0.276021] \\\\\n",
"\t2 & 2 & [0.651664, 0.0566425, 0.842714] \\\\\n",
"\t3 & 3 & [0.950498, 0.96467] \\\\\n",
"\t4 & 4 & [0.945775, 0.789904] \\\\\n",
"\t5 & 5 & [0.82116, 0.0341601, 0.0945445, 0.314926] \\\\\n",
"\t6 & 6 & [0.12781, 0.374187, 0.931115] \\\\\n",
"\t7 & 7 & [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] \\\\\n",
"\t8 & 8 & [0.732, 0.299058] \\\\\n",
"\t9 & 9 & [0.449182, 0.875096] \\\\\n",
"\t10 & 10 & [0.0462887, 0.698356, 0.365109] \\\\\n",
"\t11 & 11 & [0.302478, 0.372575, 0.150508, 0.147329, 0.283401] \\\\\n",
"\t12 & 12 & [0.404953, 0.499531, 0.658815] \\\\\n",
"\t13 & 13 & [0.515627, 0.260715, 0.59552] \\\\\n",
"\t14 & 14 & [0.292462, 0.28858, 0.61816] \\\\\n",
"\t15 & 15 & [0.66426, 0.753508] \\\\\n",
"\t16 & 16 & [0.0368842, 0.643704, 0.401421] \\\\\n",
"\t17 & 17 & [0.525057, 0.61201] \\\\\n",
"\t18 & 18 & [0.432577, 0.082207, 0.199058, 0.576082] \\\\\n",
"\t19 & 19 & [0.218177, 0.362036, 0.204728, 0.932984] \\\\\n",
"\t20 & 20 & [0.827263, 0.0992992, 0.6343] \\\\\n",
"\t21 & 21 & [0.132715, 0.775194, 0.869237] \\\\\n",
"\t22 & 22 & [0.0396356, 0.79041, 0.431188] \\\\\n",
"\t23 & 23 & [0.137658, 0.60808, 0.255054] \\\\\n",
"\t24 & 24 & [0.498734, 0.0940369, 0.52509] \\\\\n",
"\t25 & 25 & [0.265511, 0.110096, 0.834362] \\\\\n",
"\t26 & 26 & [0.633427, 0.337865, 0.112987] \\\\\n",
"\t27 & 27 & [0.78299, 0.838042] \\\\\n",
"\t28 & 28 & [0.0878598, 0.386568, 0.330579, 0.748041] \\\\\n",
"\t29 & 29 & [0.265595, 0.291069, 0.612628] \\\\\n",
"\t30 & 30 & [0.705766, 0.508363] \\\\\n",
"\t$\\dots$ & $\\dots$ & $\\dots$ \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10000000×2 DataFrame\n",
"│ Row │ id │ pos │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mArray{Float64,1}\u001b[39m │\n",
"├──────────┼──────────┼──────────────────────────────────────────────────────┤\n",
"│ 1 │ 1 │ [0.646691, 0.112486, 0.276021] │\n",
"│ 2 │ 2 │ [0.651664, 0.0566425, 0.842714] │\n",
"│ 3 │ 3 │ [0.950498, 0.96467] │\n",
"│ 4 │ 4 │ [0.945775, 0.789904] │\n",
"│ 5 │ 5 │ [0.82116, 0.0341601, 0.0945445, 0.314926] │\n",
"│ 6 │ 6 │ [0.12781, 0.374187, 0.931115] │\n",
"│ 7 │ 7 │ [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] │\n",
"│ 8 │ 8 │ [0.732, 0.299058] │\n",
"│ 9 │ 9 │ [0.449182, 0.875096] │\n",
"│ 10 │ 10 │ [0.0462887, 0.698356, 0.365109] │\n",
"⋮\n",
"│ 9999990 │ 9999990 │ [0.209058, 0.338017, 0.567608] │\n",
"│ 9999991 │ 9999991 │ [0.700468, 0.220524, 0.347931] │\n",
"│ 9999992 │ 9999992 │ [0.231368, 0.862016] │\n",
"│ 9999993 │ 9999993 │ [0.869351, 0.444795] │\n",
"│ 9999994 │ 9999994 │ [0.821356, 0.509054] │\n",
"│ 9999995 │ 9999995 │ [0.589245, 0.669708] │\n",
"│ 9999996 │ 9999996 │ [0.806262, 0.734397] │\n",
"│ 9999997 │ 9999997 │ [0.216506, 0.430571, 0.283787, 0.335015] │\n",
"│ 9999998 │ 9999998 │ [0.0100723, 0.836315, 0.942299] │\n",
"│ 9999999 │ 9999999 │ [0.499669, 0.25214, 0.964065] │\n",
"│ 10000000 │ 10000000 │ [0.663339, 0.887989] │"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us count the number of jumps we have made in each experiment using the `transform!` function:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& id & pos & jumps\\\\\n",
"\t\\hline\n",
"\t& Int64 & Array… & Int64\\\\\n",
"\t\\hline\n",
"\t1 & 1 & [0.646691, 0.112486, 0.276021] & 3 \\\\\n",
"\t2 & 2 & [0.651664, 0.0566425, 0.842714] & 3 \\\\\n",
"\t3 & 3 & [0.950498, 0.96467] & 2 \\\\\n",
"\t4 & 4 & [0.945775, 0.789904] & 2 \\\\\n",
"\t5 & 5 & [0.82116, 0.0341601, 0.0945445, 0.314926] & 4 \\\\\n",
"\t6 & 6 & [0.12781, 0.374187, 0.931115] & 3 \\\\\n",
"\t7 & 7 & [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] & 5 \\\\\n",
"\t8 & 8 & [0.732, 0.299058] & 2 \\\\\n",
"\t9 & 9 & [0.449182, 0.875096] & 2 \\\\\n",
"\t10 & 10 & [0.0462887, 0.698356, 0.365109] & 3 \\\\\n",
"\t11 & 11 & [0.302478, 0.372575, 0.150508, 0.147329, 0.283401] & 5 \\\\\n",
"\t12 & 12 & [0.404953, 0.499531, 0.658815] & 3 \\\\\n",
"\t13 & 13 & [0.515627, 0.260715, 0.59552] & 3 \\\\\n",
"\t14 & 14 & [0.292462, 0.28858, 0.61816] & 3 \\\\\n",
"\t15 & 15 & [0.66426, 0.753508] & 2 \\\\\n",
"\t16 & 16 & [0.0368842, 0.643704, 0.401421] & 3 \\\\\n",
"\t17 & 17 & [0.525057, 0.61201] & 2 \\\\\n",
"\t18 & 18 & [0.432577, 0.082207, 0.199058, 0.576082] & 4 \\\\\n",
"\t19 & 19 & [0.218177, 0.362036, 0.204728, 0.932984] & 4 \\\\\n",
"\t20 & 20 & [0.827263, 0.0992992, 0.6343] & 3 \\\\\n",
"\t21 & 21 & [0.132715, 0.775194, 0.869237] & 3 \\\\\n",
"\t22 & 22 & [0.0396356, 0.79041, 0.431188] & 3 \\\\\n",
"\t23 & 23 & [0.137658, 0.60808, 0.255054] & 3 \\\\\n",
"\t24 & 24 & [0.498734, 0.0940369, 0.52509] & 3 \\\\\n",
"\t25 & 25 & [0.265511, 0.110096, 0.834362] & 3 \\\\\n",
"\t26 & 26 & [0.633427, 0.337865, 0.112987] & 3 \\\\\n",
"\t27 & 27 & [0.78299, 0.838042] & 2 \\\\\n",
"\t28 & 28 & [0.0878598, 0.386568, 0.330579, 0.748041] & 4 \\\\\n",
"\t29 & 29 & [0.265595, 0.291069, 0.612628] & 3 \\\\\n",
"\t30 & 30 & [0.705766, 0.508363] & 2 \\\\\n",
"\t$\\dots$ & $\\dots$ & $\\dots$ & $\\dots$ \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10000000×3 DataFrame. Omitted printing of 1 columns\n",
"│ Row │ id │ pos │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mArray{Float64,1}\u001b[39m │\n",
"├──────────┼──────────┼──────────────────────────────────────────────────────┤\n",
"│ 1 │ 1 │ [0.646691, 0.112486, 0.276021] │\n",
"│ 2 │ 2 │ [0.651664, 0.0566425, 0.842714] │\n",
"│ 3 │ 3 │ [0.950498, 0.96467] │\n",
"│ 4 │ 4 │ [0.945775, 0.789904] │\n",
"│ 5 │ 5 │ [0.82116, 0.0341601, 0.0945445, 0.314926] │\n",
"│ 6 │ 6 │ [0.12781, 0.374187, 0.931115] │\n",
"│ 7 │ 7 │ [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] │\n",
"│ 8 │ 8 │ [0.732, 0.299058] │\n",
"│ 9 │ 9 │ [0.449182, 0.875096] │\n",
"│ 10 │ 10 │ [0.0462887, 0.698356, 0.365109] │\n",
"⋮\n",
"│ 9999990 │ 9999990 │ [0.209058, 0.338017, 0.567608] │\n",
"│ 9999991 │ 9999991 │ [0.700468, 0.220524, 0.347931] │\n",
"│ 9999992 │ 9999992 │ [0.231368, 0.862016] │\n",
"│ 9999993 │ 9999993 │ [0.869351, 0.444795] │\n",
"│ 9999994 │ 9999994 │ [0.821356, 0.509054] │\n",
"│ 9999995 │ 9999995 │ [0.589245, 0.669708] │\n",
"│ 9999996 │ 9999996 │ [0.806262, 0.734397] │\n",
"│ 9999997 │ 9999997 │ [0.216506, 0.430571, 0.283787, 0.335015] │\n",
"│ 9999998 │ 9999998 │ [0.0100723, 0.836315, 0.942299] │\n",
"│ 9999999 │ 9999999 │ [0.499669, 0.25214, 0.964065] │\n",
"│ 10000000 │ 10000000 │ [0.663339, 0.887989] │"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transform!(df, :pos => ByRow(length) => :jumps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us dissect what we have written above:\n",
"* `transform!` adds columns to a data frame in-place\n",
"* `:pos` is a source column\n",
"* `ByRow(length)` tells us that we want to apply the `length` function to each element of the `:pos` column (without it `length` would be applied to the whole column - can you guess what would be the result?)\n",
"* `:jumps` is the name of the column that should be created"
]
},
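{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to check your guess about `ByRow`, the following sketch compares both variants on a tiny, made-up data frame. The names `small`, `with_byrow` and `whole_column` are hypothetical, and the non-mutating `transform` is used here so that nothing defined earlier is changed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using DataFrames\n",
"\n",
"# Toy data (made up): two experiments with 2 and 3 draws.\n",
"small = DataFrame(pos=[[0.1, 0.95], [0.3, 0.3, 0.5]])\n",
"\n",
"# `length` applied to each element of :pos separately.\n",
"with_byrow = transform(small, :pos => ByRow(length) => :jumps)\n",
"\n",
"# `length` applied once to the whole :pos column.\n",
"whole_column = transform(small, :pos => length => :jumps)\n",
"\n",
"(with_byrow.jumps, whole_column.jumps)"
]
},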
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are ready to find the average number of jumps that are made:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.7185991"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mean(df.jumps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{tabular}{r|c}\n",
"\t& jumps\\_mean\\\\\n",
"\t\\hline\n",
"\t& Float64\\\\\n",
"\t\\hline\n",
"\t1 & 2.7186 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"1×1 DataFrame\n",
"│ Row │ jumps_mean │\n",
"│ │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼────────────┤\n",
"│ 1 │ 2.7186 │"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"combine(df, :jumps => mean)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"which happens to be very close to:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ℯ = 2.7182818284590..."
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"MathConstants.e"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us now find the distribution of the number of jumps:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& jumps & jumps\\_length\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64\\\\\n",
"\t\\hline\n",
"\t1 & 2 & 4999743 \\\\\n",
"\t2 & 3 & 3332539 \\\\\n",
"\t3 & 4 & 1250009 \\\\\n",
"\t4 & 5 & 333738 \\\\\n",
"\t5 & 6 & 69865 \\\\\n",
"\t6 & 7 & 12145 \\\\\n",
"\t7 & 8 & 1725 \\\\\n",
"\t8 & 9 & 204 \\\\\n",
"\t9 & 10 & 31 \\\\\n",
"\t10 & 11 & 1 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10×2 DataFrame\n",
"│ Row │ jumps │ jumps_length │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │\n",
"├─────┼───────┼──────────────┤\n",
"│ 1 │ 2 │ 4999743 │\n",
"│ 2 │ 3 │ 3332539 │\n",
"│ 3 │ 4 │ 1250009 │\n",
"│ 4 │ 5 │ 333738 │\n",
"│ 5 │ 6 │ 69865 │\n",
"│ 6 │ 7 │ 12145 │\n",
"│ 7 │ 8 │ 1725 │\n",
"│ 8 │ 9 │ 204 │\n",
"│ 9 │ 10 │ 31 │\n",
"│ 10 │ 11 │ 1 │"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"jumps_agg = @pipe df |>\n",
"    groupby(_, :jumps, sort=true) |>\n",
"    combine(_, :jumps => length)"
]
},
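{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `@pipe` chain above substitutes the result of each step for `_` in the next one. As a sketch on made-up data (the names `toy`, `piped` and `nested` are hypothetical), we can check that it gives the same result as the equivalent nested call."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"using DataFrames, Pipe\n",
"\n",
"# Toy data (made up): number of jumps in six experiments.\n",
"toy = DataFrame(jumps=[3, 2, 2, 4, 2, 3])\n",
"\n",
"# Piped form, as used above.\n",
"piped = @pipe toy |>\n",
"    groupby(_, :jumps, sort=true) |>\n",
"    combine(_, :jumps => length)\n",
"\n",
"# The same computation written as nested calls.\n",
"nested = combine(groupby(toy, :jumps, sort=true), :jumps => length)\n",
"\n",
"piped == nested"
]
},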
{
"cell_type": "markdown",
"metadata": {},
"source": [
"and normalize it as a fraction (and at the same time calculate some theoretical result that we have *guessed* :)):"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{tabular}{r|cccc}\n",
"\t& jumps & jumps\\_length & simulation & theory\\\\\n",
"\t\\hline\n",
"\t& Int64 & Int64 & Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 2 & 4999743 & 0.499974 & 0.5 \\\\\n",
"\t2 & 3 & 3332539 & 0.333254 & 0.333333 \\\\\n",
"\t3 & 4 & 1250009 & 0.125001 & 0.125 \\\\\n",
"\t4 & 5 & 333738 & 0.0333738 & 0.0333333 \\\\\n",
"\t5 & 6 & 69865 & 0.0069865 & 0.00694444 \\\\\n",
"\t6 & 7 & 12145 & 0.0012145 & 0.00119048 \\\\\n",
"\t7 & 8 & 1725 & 0.0001725 & 0.000173611 \\\\\n",
"\t8 & 9 & 204 & 2.04e-5 & 2.20459e-5 \\\\\n",
"\t9 & 10 & 31 & 3.1e-6 & 2.48016e-6 \\\\\n",
"\t10 & 11 & 1 & 1.0e-7 & 2.50521e-7 \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10×4 DataFrame\n",
"│ Row │ jumps │ jumps_length │ simulation │ theory │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├─────┼───────┼──────────────┼────────────┼─────────────┤\n",
"│ 1 │ 2 │ 4999743 │ 0.499974 │ 0.5 │\n",
"│ 2 │ 3 │ 3332539 │ 0.333254 │ 0.333333 │\n",
"│ 3 │ 4 │ 1250009 │ 0.125001 │ 0.125 │\n",
"│ 4 │ 5 │ 333738 │ 0.0333738 │ 0.0333333 │\n",
"│ 5 │ 6 │ 69865 │ 0.0069865 │ 0.00694444 │\n",
"│ 6 │ 7 │ 12145 │ 0.0012145 │ 0.00119048 │\n",
"│ 7 │ 8 │ 1725 │ 0.0001725 │ 0.000173611 │\n",
"│ 8 │ 9 │ 204 │ 2.04e-5 │ 2.20459e-5 │\n",
"│ 9 │ 10 │ 31 │ 3.1e-6 │ 2.48016e-6 │\n",
"│ 10 │ 11 │ 1 │ 1.0e-7 │ 2.50521e-7 │"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transform!(jumps_agg,\n",
"           :jumps_length => (x -> x ./ sum(x)) => :simulation,\n",
"           :jumps => ByRow(x -> (x-1) / factorial(x)) => :theory)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
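"Let us briefly justify how we have guessed it (you can safely skip the derivation).\n",
"\n",
"The guessed formula for the probability that exactly $n$ draws are needed is\n",
"$$\n",
"p_n = \\frac{n-1}{n!}, \\quad n \\geq 2.\n",
"$$\n",
"\n",
"These probabilities sum to $1$, because $\\frac{n-1}{n!} = \\frac{1}{(n-1)!} - \\frac{1}{n!}$:\n",
"$$\n",
"\\sum_{n=2}^{+\\infty}p_n=\\sum_{n=2}^{+\\infty} \\frac{n-1}{n!} = \\sum_{n=1}^{+\\infty} \\frac{1}{n!} - \\sum_{n=2}^{+\\infty} \\frac{1}{n!} = 1\n",
"$$\n",
"\n",
"and the expected number of draws is $e$:\n",
"$$\n",
"\\sum_{n=2}^{+\\infty}n\\cdot p_n=\\sum_{n=2}^{+\\infty} n\\frac{n-1}{n!} = \\sum_{n=2}^{+\\infty} \\frac{1}{(n-2)!} = e\n",
"$$\n",
"\n",
"Now we note that:\n",
"\n",
"$$\n",
"1-\\sum_{n=2}^k p_n = \\frac{1}{k!}\n",
"$$\n",
"which can be most easily justified by a geometric argument: the left-hand side is the probability that more than $k$ draws are needed, that is, that the sum of the first $k$ draws does not exceed $1$, and the region $\\{x_1,\\dots,x_k \\geq 0,\\ x_1+\\dots+x_k \\leq 1\\}$ has volume $\\frac{1}{k!}$."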
]
},
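{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three identities above can also be checked numerically. The following is only a sketch: the names `p_theory` and `n_range` are hypothetical, and the infinite sums are truncated at $n=20$ (an assumption made because `factorial(20)` is the largest factorial that fits in `Int64`), so the results are approximate."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Guessed probability that exactly n draws are needed (n >= 2).\n",
"p_theory(n) = (n - 1) / factorial(n)\n",
"\n",
"# Truncation of the infinite sums; factorial(20) still fits in Int64.\n",
"n_range = 2:20\n",
"\n",
"# Should be close to 1.\n",
"total_mass = sum(p_theory, n_range)\n",
"\n",
"# Should be close to MathConstants.e.\n",
"mean_draws = sum(n * p_theory(n) for n in n_range)\n",
"\n",
"# Tail identity: 1 - (p_2 + ... + p_k) = 1/k!, checked for small k.\n",
"tail_ok = all(isapprox(1 - sum(p_theory, 2:k), 1 / factorial(k); atol=1e-12)\n",
"              for k in 2:10)\n",
"\n",
"(total_mass, mean_draws, tail_ok)"
]
},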
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To finish this section of the tutorial let us check if random numbers generated using `rand()` were indeed $U(0,1)$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To do this we will add some columns to the `df` data frame."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{tabular}{r|ccc}\n",
"\t& id & pos & jumps\\\\\n",
"\t\\hline\n",
"\t& Int64 & Array… & Int64\\\\\n",
"\t\\hline\n",
"\t1 & 1 & [0.646691, 0.112486, 0.276021] & 3 \\\\\n",
"\t2 & 2 & [0.651664, 0.0566425, 0.842714] & 3 \\\\\n",
"\t3 & 3 & [0.950498, 0.96467] & 2 \\\\\n",
"\t4 & 4 & [0.945775, 0.789904] & 2 \\\\\n",
"\t5 & 5 & [0.82116, 0.0341601, 0.0945445, 0.314926] & 4 \\\\\n",
"\t6 & 6 & [0.12781, 0.374187, 0.931115] & 3 \\\\\n",
"\t7 & 7 & [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] & 5 \\\\\n",
"\t8 & 8 & [0.732, 0.299058] & 2 \\\\\n",
"\t9 & 9 & [0.449182, 0.875096] & 2 \\\\\n",
"\t10 & 10 & [0.0462887, 0.698356, 0.365109] & 3 \\\\\n",
"\t11 & 11 & [0.302478, 0.372575, 0.150508, 0.147329, 0.283401] & 5 \\\\\n",
"\t12 & 12 & [0.404953, 0.499531, 0.658815] & 3 \\\\\n",
"\t13 & 13 & [0.515627, 0.260715, 0.59552] & 3 \\\\\n",
"\t14 & 14 & [0.292462, 0.28858, 0.61816] & 3 \\\\\n",
"\t15 & 15 & [0.66426, 0.753508] & 2 \\\\\n",
"\t16 & 16 & [0.0368842, 0.643704, 0.401421] & 3 \\\\\n",
"\t17 & 17 & [0.525057, 0.61201] & 2 \\\\\n",
"\t18 & 18 & [0.432577, 0.082207, 0.199058, 0.576082] & 4 \\\\\n",
"\t19 & 19 & [0.218177, 0.362036, 0.204728, 0.932984] & 4 \\\\\n",
"\t20 & 20 & [0.827263, 0.0992992, 0.6343] & 3 \\\\\n",
"\t21 & 21 & [0.132715, 0.775194, 0.869237] & 3 \\\\\n",
"\t22 & 22 & [0.0396356, 0.79041, 0.431188] & 3 \\\\\n",
"\t23 & 23 & [0.137658, 0.60808, 0.255054] & 3 \\\\\n",
"\t24 & 24 & [0.498734, 0.0940369, 0.52509] & 3 \\\\\n",
"\t25 & 25 & [0.265511, 0.110096, 0.834362] & 3 \\\\\n",
"\t26 & 26 & [0.633427, 0.337865, 0.112987] & 3 \\\\\n",
"\t27 & 27 & [0.78299, 0.838042] & 2 \\\\\n",
"\t28 & 28 & [0.0878598, 0.386568, 0.330579, 0.748041] & 4 \\\\\n",
"\t29 & 29 & [0.265595, 0.291069, 0.612628] & 3 \\\\\n",
"\t30 & 30 & [0.705766, 0.508363] & 2 \\\\\n",
"\t$\\dots$ & $\\dots$ & $\\dots$ & $\\dots$ \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10000000×3 DataFrame. Omitted printing of 1 columns\n",
"│ Row │ id │ pos │\n",
"│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mArray{Float64,1}\u001b[39m │\n",
"├──────────┼──────────┼──────────────────────────────────────────────────────┤\n",
"│ 1 │ 1 │ [0.646691, 0.112486, 0.276021] │\n",
"│ 2 │ 2 │ [0.651664, 0.0566425, 0.842714] │\n",
"│ 3 │ 3 │ [0.950498, 0.96467] │\n",
"│ 4 │ 4 │ [0.945775, 0.789904] │\n",
"│ 5 │ 5 │ [0.82116, 0.0341601, 0.0945445, 0.314926] │\n",
"│ 6 │ 6 │ [0.12781, 0.374187, 0.931115] │\n",
"│ 7 │ 7 │ [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] │\n",
"│ 8 │ 8 │ [0.732, 0.299058] │\n",
"│ 9 │ 9 │ [0.449182, 0.875096] │\n",
"│ 10 │ 10 │ [0.0462887, 0.698356, 0.365109] │\n",
"⋮\n",
"│ 9999990 │ 9999990 │ [0.209058, 0.338017, 0.567608] │\n",
"│ 9999991 │ 9999991 │ [0.700468, 0.220524, 0.347931] │\n",
"│ 9999992 │ 9999992 │ [0.231368, 0.862016] │\n",
"│ 9999993 │ 9999993 │ [0.869351, 0.444795] │\n",
"│ 9999994 │ 9999994 │ [0.821356, 0.509054] │\n",
"│ 9999995 │ 9999995 │ [0.589245, 0.669708] │\n",
"│ 9999996 │ 9999996 │ [0.806262, 0.734397] │\n",
"│ 9999997 │ 9999997 │ [0.216506, 0.430571, 0.283787, 0.335015] │\n",
"│ 9999998 │ 9999998 │ [0.0100723, 0.836315, 0.942299] │\n",
"│ 9999999 │ 9999999 │ [0.499669, 0.25214, 0.964065] │\n",
"│ 10000000 │ 10000000 │ [0.663339, 0.887989] │"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/latex": [
"\\begin{tabular}{r|cc}\n",
"\t& first & last\\\\\n",
"\t\\hline\n",
"\t& Float64 & Float64\\\\\n",
"\t\\hline\n",
"\t1 & 0.646691 & 0.276021 \\\\\n",
"\t2 & 0.651664 & 0.842714 \\\\\n",
"\t3 & 0.950498 & 0.96467 \\\\\n",
"\t4 & 0.945775 & 0.789904 \\\\\n",
"\t5 & 0.82116 & 0.314926 \\\\\n",
"\t6 & 0.12781 & 0.931115 \\\\\n",
"\t7 & 0.438939 & 0.496169 \\\\\n",
"\t8 & 0.732 & 0.299058 \\\\\n",
"\t9 & 0.449182 & 0.875096 \\\\\n",
"\t10 & 0.0462887 & 0.365109 \\\\\n",
"\t11 & 0.302478 & 0.283401 \\\\\n",
"\t12 & 0.404953 & 0.658815 \\\\\n",
"\t13 & 0.515627 & 0.59552 \\\\\n",
"\t14 & 0.292462 & 0.61816 \\\\\n",
"\t15 & 0.66426 & 0.753508 \\\\\n",
"\t16 & 0.0368842 & 0.401421 \\\\\n",
"\t17 & 0.525057 & 0.61201 \\\\\n",
"\t18 & 0.432577 & 0.576082 \\\\\n",
"\t19 & 0.218177 & 0.932984 \\\\\n",
"\t20 & 0.827263 & 0.6343 \\\\\n",
"\t21 & 0.132715 & 0.869237 \\\\\n",
"\t22 & 0.0396356 & 0.431188 \\\\\n",
"\t23 & 0.137658 & 0.255054 \\\\\n",
"\t24 & 0.498734 & 0.52509 \\\\\n",
"\t25 & 0.265511 & 0.834362 \\\\\n",
"\t26 & 0.633427 & 0.112987 \\\\\n",
"\t27 & 0.78299 & 0.838042 \\\\\n",
"\t28 & 0.0878598 & 0.748041 \\\\\n",
"\t29 & 0.265595 & 0.612628 \\\\\n",
"\t30 & 0.705766 & 0.508363 \\\\\n",
"\t$\\dots$ & $\\dots$ & $\\dots$ \\\\\n",
"\\end{tabular}\n"
],
"text/plain": [
"10000000×2 DataFrame\n",
"│ Row │ first │ last │\n",
"│ │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n",
"├──────────┼───────────┼──────────┤\n",
"│ 1 │ 0.646691 │ 0.276021 │\n",
"│ 2 │ 0.651664 │ 0.842714 │\n",
"│ 3 │ 0.950498 │ 0.96467 │\n",
"│ 4 │ 0.945775 │ 0.789904 │\n",
"│ 5 │ 0.82116 │ 0.314926 │\n",
"│ 6 │ 0.12781 │ 0.931115 │\n",
"│ 7 │ 0.438939 │ 0.496169 │\n",
"│ 8 │ 0.732 │ 0.299058 │\n",
"│ 9 │ 0.449182 │ 0.875096 │\n",
"│ 10 │ 0.0462887 │ 0.365109 │\n",
"⋮\n",
"│ 9999990 │ 0.209058 │ 0.567608 │\n",
"│ 9999991 │ 0.700468 │ 0.347931 │\n",
"│ 9999992 │ 0.231368 │ 0.862016 │\n",
"│ 9999993 │ 0.869351 │ 0.444795 │\n",
"│ 9999994 │ 0.821356 │ 0.509054 │\n",
"│ 9999995 │ 0.589245 │ 0.669708 │\n",
"│ 9999996 │ 0.806262 │ 0.734397 │\n",
"│ 9999997 │ 0.216506 │ 0.335015 │\n",
"│ 9999998 │ 0.0100723 │ 0.942299 │\n",
"│ 9999999 │ 0.499669 │ 0.964065 │\n",
"│ 10000000 │ 0.663339 │ 0.887989 │"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test = select(df, :pos => ByRow(first) => :first, :pos => ByRow(last) => :last)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"Figure(PyObject