{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Collecting experiments data in a data frame\n", "\n", "### Bogumił Kamiński" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "using DataFrames" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "using Statistics" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "using PyPlot" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "using Random" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "using Pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this part we will run a simple Monte Carlo simulation to show examples of how one can work with data frames." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Consider the following puzzle.\n", "\n", "We draw independent random numbers from the $U(0,1)$ distribution. On average, how many draws do we need until the sum of these numbers exceeds $1$?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the code that runs this experiment once. For tutorial purposes we keep all the generated random numbers and recalculate their sum in each iteration (as an exercise, you can try to improve the efficiency of this code)." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sim_e (generic function with 1 method)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "function sim_e()\n", "    draw = Float64[]  # all numbers drawn so far\n", "    while true\n", "        push!(draw, rand())\n", "        # stop as soon as the running sum exceeds 1\n", "        sum(draw) > 1.0 && return draw\n", "    end\n", "end" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "Random.seed!(1234); # just to make sure we get the same results if we are on the same version of Julia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us run our simulation several times:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5-element Array{Array{Float64,1},1}:\n", " [0.5908446386657102, 0.7667970365022592]\n", " [0.5662374165061859, 0.4600853424625171]\n", " [0.7940257103317943, 0.8541465903790502]\n", " [0.20058603493384108, 0.2986142783434118, 0.24683718661000897, 0.5796722333690416]\n", " [0.6488819502093455, 0.010905889635595356, 0.06642303695533736, 0.9567533636029237]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = [sim_e() for _ in 1:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and check that each time we finished just when we exceeded $1$:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5-element Array{Float64,1}:\n", " 1.3576416751679694\n", " 1.026322758968703\n", " 1.6481723007108444\n", " 1.3257097332563035\n", " 1.682964240403202" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum.(res)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5-element Array{Float64,1}:\n", " 0.5908446386657102\n", " 0.5662374165061859\n", " 0.7940257103317943\n", " 
0.7460374998872619\n", " 0.7262108768002782" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@. sum(res) - last(res)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All looks good so far! (And as a bonus we have just done a small exercise in broadcasting.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let us populate a data frame with the results of our experiments:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 9.025314 seconds (101.54 M allocations: 3.616 GiB, 27.70% gc time)\n" ] } ], "source": [ "df = DataFrame()\n", "\n", "@time for i in 1:10^7\n", "    push!(df, (id=i, pos=sim_e()))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the process was quite fast (about 9 seconds for 10 million rows, including the time spent in `sim_e`): `push!`-ing rows to a `DataFrame` is efficient." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

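<p>10,000,000 rows × 2 columns</p>
<table>
<thead>
<tr><th></th><th>id</th><th>pos</th></tr>
<tr><th></th><th>Int64</th><th>Array…</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>1</td><td>[0.646691, 0.112486, 0.276021]</td></tr>
<tr><th>2</th><td>2</td><td>[0.651664, 0.0566425, 0.842714]</td></tr>
<tr><th>3</th><td>3</td><td>[0.950498, 0.96467]</td></tr>
<tr><th>4</th><td>4</td><td>[0.945775, 0.789904]</td></tr>
<tr><th>5</th><td>5</td><td>[0.82116, 0.0341601, 0.0945445, 0.314926]</td></tr>
<tr><th>6</th><td>6</td><td>[0.12781, 0.374187, 0.931115]</td></tr>
<tr><th>7</th><td>7</td><td>[0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169]</td></tr>
<tr><th>8</th><td>8</td><td>[0.732, 0.299058]</td></tr>
<tr><th>9</th><td>9</td><td>[0.449182, 0.875096]</td></tr>
<tr><th>10</th><td>10</td><td>[0.0462887, 0.698356, 0.365109]</td></tr>
<tr><th>11</th><td>11</td><td>[0.302478, 0.372575, 0.150508, 0.147329, 0.283401]</td></tr>
<tr><th>12</th><td>12</td><td>[0.404953, 0.499531, 0.658815]</td></tr>
<tr><th>13</th><td>13</td><td>[0.515627, 0.260715, 0.59552]</td></tr>
<tr><th>14</th><td>14</td><td>[0.292462, 0.28858, 0.61816]</td></tr>
<tr><th>15</th><td>15</td><td>[0.66426, 0.753508]</td></tr>
<tr><th>16</th><td>16</td><td>[0.0368842, 0.643704, 0.401421]</td></tr>
<tr><th>17</th><td>17</td><td>[0.525057, 0.61201]</td></tr>
<tr><th>18</th><td>18</td><td>[0.432577, 0.082207, 0.199058, 0.576082]</td></tr>
<tr><th>19</th><td>19</td><td>[0.218177, 0.362036, 0.204728, 0.932984]</td></tr>
<tr><th>20</th><td>20</td><td>[0.827263, 0.0992992, 0.6343]</td></tr>
<tr><th>21</th><td>21</td><td>[0.132715, 0.775194, 0.869237]</td></tr>
<tr><th>22</th><td>22</td><td>[0.0396356, 0.79041, 0.431188]</td></tr>
<tr><th>23</th><td>23</td><td>[0.137658, 0.60808, 0.255054]</td></tr>
<tr><th>24</th><td>24</td><td>[0.498734, 0.0940369, 0.52509]</td></tr>
<tr><th>25</th><td>25</td><td>[0.265511, 0.110096, 0.834362]</td></tr>
<tr><th>26</th><td>26</td><td>[0.633427, 0.337865, 0.112987]</td></tr>
<tr><th>27</th><td>27</td><td>[0.78299, 0.838042]</td></tr>
<tr><th>28</th><td>28</td><td>[0.0878598, 0.386568, 0.330579, 0.748041]</td></tr>
<tr><th>29</th><td>29</td><td>[0.265595, 0.291069, 0.612628]</td></tr>
<tr><th>30</th><td>30</td><td>[0.705766, 0.508363]</td></tr>
</tbody>
</table>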
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& id & pos\\\\\n", "\t\\hline\n", "\t& Int64 & Array…\\\\\n", "\t\\hline\n", "\t1 & 1 & [0.646691, 0.112486, 0.276021] \\\\\n", "\t2 & 2 & [0.651664, 0.0566425, 0.842714] \\\\\n", "\t3 & 3 & [0.950498, 0.96467] \\\\\n", "\t4 & 4 & [0.945775, 0.789904] \\\\\n", "\t5 & 5 & [0.82116, 0.0341601, 0.0945445, 0.314926] \\\\\n", "\t6 & 6 & [0.12781, 0.374187, 0.931115] \\\\\n", "\t7 & 7 & [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] \\\\\n", "\t8 & 8 & [0.732, 0.299058] \\\\\n", "\t9 & 9 & [0.449182, 0.875096] \\\\\n", "\t10 & 10 & [0.0462887, 0.698356, 0.365109] \\\\\n", "\t11 & 11 & [0.302478, 0.372575, 0.150508, 0.147329, 0.283401] \\\\\n", "\t12 & 12 & [0.404953, 0.499531, 0.658815] \\\\\n", "\t13 & 13 & [0.515627, 0.260715, 0.59552] \\\\\n", "\t14 & 14 & [0.292462, 0.28858, 0.61816] \\\\\n", "\t15 & 15 & [0.66426, 0.753508] \\\\\n", "\t16 & 16 & [0.0368842, 0.643704, 0.401421] \\\\\n", "\t17 & 17 & [0.525057, 0.61201] \\\\\n", "\t18 & 18 & [0.432577, 0.082207, 0.199058, 0.576082] \\\\\n", "\t19 & 19 & [0.218177, 0.362036, 0.204728, 0.932984] \\\\\n", "\t20 & 20 & [0.827263, 0.0992992, 0.6343] \\\\\n", "\t21 & 21 & [0.132715, 0.775194, 0.869237] \\\\\n", "\t22 & 22 & [0.0396356, 0.79041, 0.431188] \\\\\n", "\t23 & 23 & [0.137658, 0.60808, 0.255054] \\\\\n", "\t24 & 24 & [0.498734, 0.0940369, 0.52509] \\\\\n", "\t25 & 25 & [0.265511, 0.110096, 0.834362] \\\\\n", "\t26 & 26 & [0.633427, 0.337865, 0.112987] \\\\\n", "\t27 & 27 & [0.78299, 0.838042] \\\\\n", "\t28 & 28 & [0.0878598, 0.386568, 0.330579, 0.748041] \\\\\n", "\t29 & 29 & [0.265595, 0.291069, 0.612628] \\\\\n", "\t30 & 30 & [0.705766, 0.508363] \\\\\n", "\t$\\dots$ & $\\dots$ & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10000000×2 DataFrame\n", "│ Row │ id │ pos │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mArray{Float64,1}\u001b[39m │\n", "├──────────┼──────────┼──────────────────────────────────────────────────────┤\n", "│ 1 │ 1 │ 
[0.646691, 0.112486, 0.276021] │\n", "│ 2 │ 2 │ [0.651664, 0.0566425, 0.842714] │\n", "│ 3 │ 3 │ [0.950498, 0.96467] │\n", "│ 4 │ 4 │ [0.945775, 0.789904] │\n", "│ 5 │ 5 │ [0.82116, 0.0341601, 0.0945445, 0.314926] │\n", "│ 6 │ 6 │ [0.12781, 0.374187, 0.931115] │\n", "│ 7 │ 7 │ [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] │\n", "│ 8 │ 8 │ [0.732, 0.299058] │\n", "│ 9 │ 9 │ [0.449182, 0.875096] │\n", "│ 10 │ 10 │ [0.0462887, 0.698356, 0.365109] │\n", "⋮\n", "│ 9999990 │ 9999990 │ [0.209058, 0.338017, 0.567608] │\n", "│ 9999991 │ 9999991 │ [0.700468, 0.220524, 0.347931] │\n", "│ 9999992 │ 9999992 │ [0.231368, 0.862016] │\n", "│ 9999993 │ 9999993 │ [0.869351, 0.444795] │\n", "│ 9999994 │ 9999994 │ [0.821356, 0.509054] │\n", "│ 9999995 │ 9999995 │ [0.589245, 0.669708] │\n", "│ 9999996 │ 9999996 │ [0.806262, 0.734397] │\n", "│ 9999997 │ 9999997 │ [0.216506, 0.430571, 0.283787, 0.335015] │\n", "│ 9999998 │ 9999998 │ [0.0100723, 0.836315, 0.942299] │\n", "│ 9999999 │ 9999999 │ [0.499669, 0.25214, 0.964065] │\n", "│ 10000000 │ 10000000 │ [0.663339, 0.887989] │" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us count the number of jumps (that is, draws) we have made in each experiment using the `transform!` function:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "

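<p>10,000,000 rows × 3 columns</p>
<table>
<thead>
<tr><th></th><th>id</th><th>pos</th><th>jumps</th></tr>
<tr><th></th><th>Int64</th><th>Array…</th><th>Int64</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>1</td><td>[0.646691, 0.112486, 0.276021]</td><td>3</td></tr>
<tr><th>2</th><td>2</td><td>[0.651664, 0.0566425, 0.842714]</td><td>3</td></tr>
<tr><th>3</th><td>3</td><td>[0.950498, 0.96467]</td><td>2</td></tr>
<tr><th>4</th><td>4</td><td>[0.945775, 0.789904]</td><td>2</td></tr>
<tr><th>5</th><td>5</td><td>[0.82116, 0.0341601, 0.0945445, 0.314926]</td><td>4</td></tr>
<tr><th>6</th><td>6</td><td>[0.12781, 0.374187, 0.931115]</td><td>3</td></tr>
<tr><th>7</th><td>7</td><td>[0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169]</td><td>5</td></tr>
<tr><th>8</th><td>8</td><td>[0.732, 0.299058]</td><td>2</td></tr>
<tr><th>9</th><td>9</td><td>[0.449182, 0.875096]</td><td>2</td></tr>
<tr><th>10</th><td>10</td><td>[0.0462887, 0.698356, 0.365109]</td><td>3</td></tr>
<tr><th>11</th><td>11</td><td>[0.302478, 0.372575, 0.150508, 0.147329, 0.283401]</td><td>5</td></tr>
<tr><th>12</th><td>12</td><td>[0.404953, 0.499531, 0.658815]</td><td>3</td></tr>
<tr><th>13</th><td>13</td><td>[0.515627, 0.260715, 0.59552]</td><td>3</td></tr>
<tr><th>14</th><td>14</td><td>[0.292462, 0.28858, 0.61816]</td><td>3</td></tr>
<tr><th>15</th><td>15</td><td>[0.66426, 0.753508]</td><td>2</td></tr>
<tr><th>16</th><td>16</td><td>[0.0368842, 0.643704, 0.401421]</td><td>3</td></tr>
<tr><th>17</th><td>17</td><td>[0.525057, 0.61201]</td><td>2</td></tr>
<tr><th>18</th><td>18</td><td>[0.432577, 0.082207, 0.199058, 0.576082]</td><td>4</td></tr>
<tr><th>19</th><td>19</td><td>[0.218177, 0.362036, 0.204728, 0.932984]</td><td>4</td></tr>
<tr><th>20</th><td>20</td><td>[0.827263, 0.0992992, 0.6343]</td><td>3</td></tr>
<tr><th>21</th><td>21</td><td>[0.132715, 0.775194, 0.869237]</td><td>3</td></tr>
<tr><th>22</th><td>22</td><td>[0.0396356, 0.79041, 0.431188]</td><td>3</td></tr>
<tr><th>23</th><td>23</td><td>[0.137658, 0.60808, 0.255054]</td><td>3</td></tr>
<tr><th>24</th><td>24</td><td>[0.498734, 0.0940369, 0.52509]</td><td>3</td></tr>
<tr><th>25</th><td>25</td><td>[0.265511, 0.110096, 0.834362]</td><td>3</td></tr>
<tr><th>26</th><td>26</td><td>[0.633427, 0.337865, 0.112987]</td><td>3</td></tr>
<tr><th>27</th><td>27</td><td>[0.78299, 0.838042]</td><td>2</td></tr>
<tr><th>28</th><td>28</td><td>[0.0878598, 0.386568, 0.330579, 0.748041]</td><td>4</td></tr>
<tr><th>29</th><td>29</td><td>[0.265595, 0.291069, 0.612628]</td><td>3</td></tr>
<tr><th>30</th><td>30</td><td>[0.705766, 0.508363]</td><td>2</td></tr>
</tbody>
</table>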
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& id & pos & jumps\\\\\n", "\t\\hline\n", "\t& Int64 & Array… & Int64\\\\\n", "\t\\hline\n", "\t1 & 1 & [0.646691, 0.112486, 0.276021] & 3 \\\\\n", "\t2 & 2 & [0.651664, 0.0566425, 0.842714] & 3 \\\\\n", "\t3 & 3 & [0.950498, 0.96467] & 2 \\\\\n", "\t4 & 4 & [0.945775, 0.789904] & 2 \\\\\n", "\t5 & 5 & [0.82116, 0.0341601, 0.0945445, 0.314926] & 4 \\\\\n", "\t6 & 6 & [0.12781, 0.374187, 0.931115] & 3 \\\\\n", "\t7 & 7 & [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] & 5 \\\\\n", "\t8 & 8 & [0.732, 0.299058] & 2 \\\\\n", "\t9 & 9 & [0.449182, 0.875096] & 2 \\\\\n", "\t10 & 10 & [0.0462887, 0.698356, 0.365109] & 3 \\\\\n", "\t11 & 11 & [0.302478, 0.372575, 0.150508, 0.147329, 0.283401] & 5 \\\\\n", "\t12 & 12 & [0.404953, 0.499531, 0.658815] & 3 \\\\\n", "\t13 & 13 & [0.515627, 0.260715, 0.59552] & 3 \\\\\n", "\t14 & 14 & [0.292462, 0.28858, 0.61816] & 3 \\\\\n", "\t15 & 15 & [0.66426, 0.753508] & 2 \\\\\n", "\t16 & 16 & [0.0368842, 0.643704, 0.401421] & 3 \\\\\n", "\t17 & 17 & [0.525057, 0.61201] & 2 \\\\\n", "\t18 & 18 & [0.432577, 0.082207, 0.199058, 0.576082] & 4 \\\\\n", "\t19 & 19 & [0.218177, 0.362036, 0.204728, 0.932984] & 4 \\\\\n", "\t20 & 20 & [0.827263, 0.0992992, 0.6343] & 3 \\\\\n", "\t21 & 21 & [0.132715, 0.775194, 0.869237] & 3 \\\\\n", "\t22 & 22 & [0.0396356, 0.79041, 0.431188] & 3 \\\\\n", "\t23 & 23 & [0.137658, 0.60808, 0.255054] & 3 \\\\\n", "\t24 & 24 & [0.498734, 0.0940369, 0.52509] & 3 \\\\\n", "\t25 & 25 & [0.265511, 0.110096, 0.834362] & 3 \\\\\n", "\t26 & 26 & [0.633427, 0.337865, 0.112987] & 3 \\\\\n", "\t27 & 27 & [0.78299, 0.838042] & 2 \\\\\n", "\t28 & 28 & [0.0878598, 0.386568, 0.330579, 0.748041] & 4 \\\\\n", "\t29 & 29 & [0.265595, 0.291069, 0.612628] & 3 \\\\\n", "\t30 & 30 & [0.705766, 0.508363] & 2 \\\\\n", "\t$\\dots$ & $\\dots$ & $\\dots$ & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10000000×3 DataFrame. 
Omitted printing of 1 columns\n", "│ Row │ id │ pos │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mArray{Float64,1}\u001b[39m │\n", "├──────────┼──────────┼──────────────────────────────────────────────────────┤\n", "│ 1 │ 1 │ [0.646691, 0.112486, 0.276021] │\n", "│ 2 │ 2 │ [0.651664, 0.0566425, 0.842714] │\n", "│ 3 │ 3 │ [0.950498, 0.96467] │\n", "│ 4 │ 4 │ [0.945775, 0.789904] │\n", "│ 5 │ 5 │ [0.82116, 0.0341601, 0.0945445, 0.314926] │\n", "│ 6 │ 6 │ [0.12781, 0.374187, 0.931115] │\n", "│ 7 │ 7 │ [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] │\n", "│ 8 │ 8 │ [0.732, 0.299058] │\n", "│ 9 │ 9 │ [0.449182, 0.875096] │\n", "│ 10 │ 10 │ [0.0462887, 0.698356, 0.365109] │\n", "⋮\n", "│ 9999990 │ 9999990 │ [0.209058, 0.338017, 0.567608] │\n", "│ 9999991 │ 9999991 │ [0.700468, 0.220524, 0.347931] │\n", "│ 9999992 │ 9999992 │ [0.231368, 0.862016] │\n", "│ 9999993 │ 9999993 │ [0.869351, 0.444795] │\n", "│ 9999994 │ 9999994 │ [0.821356, 0.509054] │\n", "│ 9999995 │ 9999995 │ [0.589245, 0.669708] │\n", "│ 9999996 │ 9999996 │ [0.806262, 0.734397] │\n", "│ 9999997 │ 9999997 │ [0.216506, 0.430571, 0.283787, 0.335015] │\n", "│ 9999998 │ 9999998 │ [0.0100723, 0.836315, 0.942299] │\n", "│ 9999999 │ 9999999 │ [0.499669, 0.25214, 0.964065] │\n", "│ 10000000 │ 10000000 │ [0.663339, 0.887989] │" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transform!(df, :pos => ByRow(length) => :jumps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us dissect what we have written above:\n", "* `transform!` adds columns to a data frame in place\n", "* `:pos` is the source column\n", "* `ByRow(length)` tells `transform!` that we want to apply the `length` function to each element of the `:pos` column (without it `length` would be applied to the whole column; can you guess what the result would be?)\n", "* `:jumps` is the name of the column that should be created" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are ready to 
find the average number of jumps that are made:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.7185991" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean(df.jumps)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

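<p>1 rows × 1 columns</p>
<table>
<thead>
<tr><th></th><th>jumps_mean</th></tr>
<tr><th></th><th>Float64</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>2.7186</td></tr>
</tbody>
</table>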
" ], "text/latex": [ "\\begin{tabular}{r|c}\n", "\t& jumps\\_mean\\\\\n", "\t\\hline\n", "\t& Float64\\\\\n", "\t\\hline\n", "\t1 & 2.7186 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "1×1 DataFrame\n", "│ Row │ jumps_mean │\n", "│ │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼────────────┤\n", "│ 1 │ 2.7186 │" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combine(df, :jumps => mean)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which happens to be very close to:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ℯ = 2.7182818284590..." ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MathConstants.e" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now find a distribution of number of jumps:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

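<p>10 rows × 2 columns</p>
<table>
<thead>
<tr><th></th><th>jumps</th><th>jumps_length</th></tr>
<tr><th></th><th>Int64</th><th>Int64</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>2</td><td>4999743</td></tr>
<tr><th>2</th><td>3</td><td>3332539</td></tr>
<tr><th>3</th><td>4</td><td>1250009</td></tr>
<tr><th>4</th><td>5</td><td>333738</td></tr>
<tr><th>5</th><td>6</td><td>69865</td></tr>
<tr><th>6</th><td>7</td><td>12145</td></tr>
<tr><th>7</th><td>8</td><td>1725</td></tr>
<tr><th>8</th><td>9</td><td>204</td></tr>
<tr><th>9</th><td>10</td><td>31</td></tr>
<tr><th>10</th><td>11</td><td>1</td></tr>
</tbody>
</table>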
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& jumps & jumps\\_length\\\\\n", "\t\\hline\n", "\t& Int64 & Int64\\\\\n", "\t\\hline\n", "\t1 & 2 & 4999743 \\\\\n", "\t2 & 3 & 3332539 \\\\\n", "\t3 & 4 & 1250009 \\\\\n", "\t4 & 5 & 333738 \\\\\n", "\t5 & 6 & 69865 \\\\\n", "\t6 & 7 & 12145 \\\\\n", "\t7 & 8 & 1725 \\\\\n", "\t8 & 9 & 204 \\\\\n", "\t9 & 10 & 31 \\\\\n", "\t10 & 11 & 1 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10×2 DataFrame\n", "│ Row │ jumps │ jumps_length │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │\n", "├─────┼───────┼──────────────┤\n", "│ 1 │ 2 │ 4999743 │\n", "│ 2 │ 3 │ 3332539 │\n", "│ 3 │ 4 │ 1250009 │\n", "│ 4 │ 5 │ 333738 │\n", "│ 5 │ 6 │ 69865 │\n", "│ 6 │ 7 │ 12145 │\n", "│ 7 │ 8 │ 1725 │\n", "│ 8 │ 9 │ 204 │\n", "│ 9 │ 10 │ 31 │\n", "│ 10 │ 11 │ 1 │" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "jumps_agg = @pipe df |>\n", " groupby(_, :jumps, sort=true) |>\n", " combine(_, :jumps => length)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and normalize it as a fraction (and at the same time calculate some theoretical result that we have *guessed* :)):" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

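<p>10 rows × 4 columns</p>
<table>
<thead>
<tr><th></th><th>jumps</th><th>jumps_length</th><th>simulation</th><th>theory</th></tr>
<tr><th></th><th>Int64</th><th>Int64</th><th>Float64</th><th>Float64</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>2</td><td>4999743</td><td>0.499974</td><td>0.5</td></tr>
<tr><th>2</th><td>3</td><td>3332539</td><td>0.333254</td><td>0.333333</td></tr>
<tr><th>3</th><td>4</td><td>1250009</td><td>0.125001</td><td>0.125</td></tr>
<tr><th>4</th><td>5</td><td>333738</td><td>0.0333738</td><td>0.0333333</td></tr>
<tr><th>5</th><td>6</td><td>69865</td><td>0.0069865</td><td>0.00694444</td></tr>
<tr><th>6</th><td>7</td><td>12145</td><td>0.0012145</td><td>0.00119048</td></tr>
<tr><th>7</th><td>8</td><td>1725</td><td>0.0001725</td><td>0.000173611</td></tr>
<tr><th>8</th><td>9</td><td>204</td><td>2.04e-5</td><td>2.20459e-5</td></tr>
<tr><th>9</th><td>10</td><td>31</td><td>3.1e-6</td><td>2.48016e-6</td></tr>
<tr><th>10</th><td>11</td><td>1</td><td>1.0e-7</td><td>2.50521e-7</td></tr>
</tbody>
</table>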
" ], "text/latex": [ "\\begin{tabular}{r|cccc}\n", "\t& jumps & jumps\\_length & simulation & theory\\\\\n", "\t\\hline\n", "\t& Int64 & Int64 & Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 2 & 4999743 & 0.499974 & 0.5 \\\\\n", "\t2 & 3 & 3332539 & 0.333254 & 0.333333 \\\\\n", "\t3 & 4 & 1250009 & 0.125001 & 0.125 \\\\\n", "\t4 & 5 & 333738 & 0.0333738 & 0.0333333 \\\\\n", "\t5 & 6 & 69865 & 0.0069865 & 0.00694444 \\\\\n", "\t6 & 7 & 12145 & 0.0012145 & 0.00119048 \\\\\n", "\t7 & 8 & 1725 & 0.0001725 & 0.000173611 \\\\\n", "\t8 & 9 & 204 & 2.04e-5 & 2.20459e-5 \\\\\n", "\t9 & 10 & 31 & 3.1e-6 & 2.48016e-6 \\\\\n", "\t10 & 11 & 1 & 1.0e-7 & 2.50521e-7 \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10×4 DataFrame\n", "│ Row │ jumps │ jumps_length │ simulation │ theory │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mInt64\u001b[39m │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├─────┼───────┼──────────────┼────────────┼─────────────┤\n", "│ 1 │ 2 │ 4999743 │ 0.499974 │ 0.5 │\n", "│ 2 │ 3 │ 3332539 │ 0.333254 │ 0.333333 │\n", "│ 3 │ 4 │ 1250009 │ 0.125001 │ 0.125 │\n", "│ 4 │ 5 │ 333738 │ 0.0333738 │ 0.0333333 │\n", "│ 5 │ 6 │ 69865 │ 0.0069865 │ 0.00694444 │\n", "│ 6 │ 7 │ 12145 │ 0.0012145 │ 0.00119048 │\n", "│ 7 │ 8 │ 1725 │ 0.0001725 │ 0.000173611 │\n", "│ 8 │ 9 │ 204 │ 2.04e-5 │ 2.20459e-5 │\n", "│ 9 │ 10 │ 31 │ 3.1e-6 │ 2.48016e-6 │\n", "│ 10 │ 11 │ 1 │ 1.0e-7 │ 2.50521e-7 │" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "transform!(jumps_agg,\n", " :jumps_length => (x -> x ./ sum(x)) => :simulation,\n", " :jumps => ByRow(x -> (x-1) / factorial(x)) => :theory)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us briefly justify how we have guessed it (you can safely skip the derivation):\n", "\n", "Formula\n", "$$\n", "p_n = \\frac{n-1}{n!}\n", "$$\n", "\n", "$$\n", "\\sum_{n=2}^{+\\infty}p_n=\\sum_{n=2}^{+\\infty} \\frac{n-1}{n!} = \\sum_{n=1}^{+\\infty} \\frac{1}{n!} - 
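\\sum_{n=2}^{+\\infty} \\frac{1}{n!} = 1\n", "$$\n", "\n", "so the $p_n$ sum to $1$, and their mean is\n", "\n", "$$\n", "\\sum_{n=2}^{+\\infty}n\\cdot p_n=\\sum_{n=2}^{+\\infty} n\\frac{n-1}{n!} = \\sum_{n=2}^{+\\infty} \\frac{1}{(n-2)!} = e\n", "$$\n", "\n", "Now we note that the probability of needing more than $k$ draws is:\n", "\n", "$$\n", "1-\\sum_{n=2}^k p_n = \\frac{1}{k!}\n", "$$\n", "which can be most easily justified by a geometric argument: it is the probability that the sum of $k$ independent $U(0,1)$ numbers does not exceed $1$, that is, the volume of a $k$-dimensional simplex. Since $p_n$ is the difference of two consecutive such probabilities, we get $p_n = \\frac{1}{(n-1)!} - \\frac{1}{n!} = \\frac{n-1}{n!}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To finish this section of the tutorial let us check if the random numbers generated using `rand()` were indeed $U(0,1)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To do this we will extract the first and the last number drawn in each experiment from the `df` data frame into a new data frame." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "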

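<p>10,000,000 rows × 3 columns</p>
<table>
<thead>
<tr><th></th><th>id</th><th>pos</th><th>jumps</th></tr>
<tr><th></th><th>Int64</th><th>Array…</th><th>Int64</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>1</td><td>[0.646691, 0.112486, 0.276021]</td><td>3</td></tr>
<tr><th>2</th><td>2</td><td>[0.651664, 0.0566425, 0.842714]</td><td>3</td></tr>
<tr><th>3</th><td>3</td><td>[0.950498, 0.96467]</td><td>2</td></tr>
<tr><th>4</th><td>4</td><td>[0.945775, 0.789904]</td><td>2</td></tr>
<tr><th>5</th><td>5</td><td>[0.82116, 0.0341601, 0.0945445, 0.314926]</td><td>4</td></tr>
<tr><th>6</th><td>6</td><td>[0.12781, 0.374187, 0.931115]</td><td>3</td></tr>
<tr><th>7</th><td>7</td><td>[0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169]</td><td>5</td></tr>
<tr><th>8</th><td>8</td><td>[0.732, 0.299058]</td><td>2</td></tr>
<tr><th>9</th><td>9</td><td>[0.449182, 0.875096]</td><td>2</td></tr>
<tr><th>10</th><td>10</td><td>[0.0462887, 0.698356, 0.365109]</td><td>3</td></tr>
<tr><th>11</th><td>11</td><td>[0.302478, 0.372575, 0.150508, 0.147329, 0.283401]</td><td>5</td></tr>
<tr><th>12</th><td>12</td><td>[0.404953, 0.499531, 0.658815]</td><td>3</td></tr>
<tr><th>13</th><td>13</td><td>[0.515627, 0.260715, 0.59552]</td><td>3</td></tr>
<tr><th>14</th><td>14</td><td>[0.292462, 0.28858, 0.61816]</td><td>3</td></tr>
<tr><th>15</th><td>15</td><td>[0.66426, 0.753508]</td><td>2</td></tr>
<tr><th>16</th><td>16</td><td>[0.0368842, 0.643704, 0.401421]</td><td>3</td></tr>
<tr><th>17</th><td>17</td><td>[0.525057, 0.61201]</td><td>2</td></tr>
<tr><th>18</th><td>18</td><td>[0.432577, 0.082207, 0.199058, 0.576082]</td><td>4</td></tr>
<tr><th>19</th><td>19</td><td>[0.218177, 0.362036, 0.204728, 0.932984]</td><td>4</td></tr>
<tr><th>20</th><td>20</td><td>[0.827263, 0.0992992, 0.6343]</td><td>3</td></tr>
<tr><th>21</th><td>21</td><td>[0.132715, 0.775194, 0.869237]</td><td>3</td></tr>
<tr><th>22</th><td>22</td><td>[0.0396356, 0.79041, 0.431188]</td><td>3</td></tr>
<tr><th>23</th><td>23</td><td>[0.137658, 0.60808, 0.255054]</td><td>3</td></tr>
<tr><th>24</th><td>24</td><td>[0.498734, 0.0940369, 0.52509]</td><td>3</td></tr>
<tr><th>25</th><td>25</td><td>[0.265511, 0.110096, 0.834362]</td><td>3</td></tr>
<tr><th>26</th><td>26</td><td>[0.633427, 0.337865, 0.112987]</td><td>3</td></tr>
<tr><th>27</th><td>27</td><td>[0.78299, 0.838042]</td><td>2</td></tr>
<tr><th>28</th><td>28</td><td>[0.0878598, 0.386568, 0.330579, 0.748041]</td><td>4</td></tr>
<tr><th>29</th><td>29</td><td>[0.265595, 0.291069, 0.612628]</td><td>3</td></tr>
<tr><th>30</th><td>30</td><td>[0.705766, 0.508363]</td><td>2</td></tr>
</tbody>
</table>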
" ], "text/latex": [ "\\begin{tabular}{r|ccc}\n", "\t& id & pos & jumps\\\\\n", "\t\\hline\n", "\t& Int64 & Array… & Int64\\\\\n", "\t\\hline\n", "\t1 & 1 & [0.646691, 0.112486, 0.276021] & 3 \\\\\n", "\t2 & 2 & [0.651664, 0.0566425, 0.842714] & 3 \\\\\n", "\t3 & 3 & [0.950498, 0.96467] & 2 \\\\\n", "\t4 & 4 & [0.945775, 0.789904] & 2 \\\\\n", "\t5 & 5 & [0.82116, 0.0341601, 0.0945445, 0.314926] & 4 \\\\\n", "\t6 & 6 & [0.12781, 0.374187, 0.931115] & 3 \\\\\n", "\t7 & 7 & [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] & 5 \\\\\n", "\t8 & 8 & [0.732, 0.299058] & 2 \\\\\n", "\t9 & 9 & [0.449182, 0.875096] & 2 \\\\\n", "\t10 & 10 & [0.0462887, 0.698356, 0.365109] & 3 \\\\\n", "\t11 & 11 & [0.302478, 0.372575, 0.150508, 0.147329, 0.283401] & 5 \\\\\n", "\t12 & 12 & [0.404953, 0.499531, 0.658815] & 3 \\\\\n", "\t13 & 13 & [0.515627, 0.260715, 0.59552] & 3 \\\\\n", "\t14 & 14 & [0.292462, 0.28858, 0.61816] & 3 \\\\\n", "\t15 & 15 & [0.66426, 0.753508] & 2 \\\\\n", "\t16 & 16 & [0.0368842, 0.643704, 0.401421] & 3 \\\\\n", "\t17 & 17 & [0.525057, 0.61201] & 2 \\\\\n", "\t18 & 18 & [0.432577, 0.082207, 0.199058, 0.576082] & 4 \\\\\n", "\t19 & 19 & [0.218177, 0.362036, 0.204728, 0.932984] & 4 \\\\\n", "\t20 & 20 & [0.827263, 0.0992992, 0.6343] & 3 \\\\\n", "\t21 & 21 & [0.132715, 0.775194, 0.869237] & 3 \\\\\n", "\t22 & 22 & [0.0396356, 0.79041, 0.431188] & 3 \\\\\n", "\t23 & 23 & [0.137658, 0.60808, 0.255054] & 3 \\\\\n", "\t24 & 24 & [0.498734, 0.0940369, 0.52509] & 3 \\\\\n", "\t25 & 25 & [0.265511, 0.110096, 0.834362] & 3 \\\\\n", "\t26 & 26 & [0.633427, 0.337865, 0.112987] & 3 \\\\\n", "\t27 & 27 & [0.78299, 0.838042] & 2 \\\\\n", "\t28 & 28 & [0.0878598, 0.386568, 0.330579, 0.748041] & 4 \\\\\n", "\t29 & 29 & [0.265595, 0.291069, 0.612628] & 3 \\\\\n", "\t30 & 30 & [0.705766, 0.508363] & 2 \\\\\n", "\t$\\dots$ & $\\dots$ & $\\dots$ & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10000000×3 DataFrame. 
Omitted printing of 1 columns\n", "│ Row │ id │ pos │\n", "│ │ \u001b[90mInt64\u001b[39m │ \u001b[90mArray{Float64,1}\u001b[39m │\n", "├──────────┼──────────┼──────────────────────────────────────────────────────┤\n", "│ 1 │ 1 │ [0.646691, 0.112486, 0.276021] │\n", "│ 2 │ 2 │ [0.651664, 0.0566425, 0.842714] │\n", "│ 3 │ 3 │ [0.950498, 0.96467] │\n", "│ 4 │ 4 │ [0.945775, 0.789904] │\n", "│ 5 │ 5 │ [0.82116, 0.0341601, 0.0945445, 0.314926] │\n", "│ 6 │ 6 │ [0.12781, 0.374187, 0.931115] │\n", "│ 7 │ 7 │ [0.438939, 0.246862, 0.0118196, 0.0460428, 0.496169] │\n", "│ 8 │ 8 │ [0.732, 0.299058] │\n", "│ 9 │ 9 │ [0.449182, 0.875096] │\n", "│ 10 │ 10 │ [0.0462887, 0.698356, 0.365109] │\n", "⋮\n", "│ 9999990 │ 9999990 │ [0.209058, 0.338017, 0.567608] │\n", "│ 9999991 │ 9999991 │ [0.700468, 0.220524, 0.347931] │\n", "│ 9999992 │ 9999992 │ [0.231368, 0.862016] │\n", "│ 9999993 │ 9999993 │ [0.869351, 0.444795] │\n", "│ 9999994 │ 9999994 │ [0.821356, 0.509054] │\n", "│ 9999995 │ 9999995 │ [0.589245, 0.669708] │\n", "│ 9999996 │ 9999996 │ [0.806262, 0.734397] │\n", "│ 9999997 │ 9999997 │ [0.216506, 0.430571, 0.283787, 0.335015] │\n", "│ 9999998 │ 9999998 │ [0.0100723, 0.836315, 0.942299] │\n", "│ 9999999 │ 9999999 │ [0.499669, 0.25214, 0.964065] │\n", "│ 10000000 │ 10000000 │ [0.663339, 0.887989] │" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "

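<p>10,000,000 rows × 2 columns</p>
<table>
<thead>
<tr><th></th><th>first</th><th>last</th></tr>
<tr><th></th><th>Float64</th><th>Float64</th></tr>
</thead>
<tbody>
<tr><th>1</th><td>0.646691</td><td>0.276021</td></tr>
<tr><th>2</th><td>0.651664</td><td>0.842714</td></tr>
<tr><th>3</th><td>0.950498</td><td>0.96467</td></tr>
<tr><th>4</th><td>0.945775</td><td>0.789904</td></tr>
<tr><th>5</th><td>0.82116</td><td>0.314926</td></tr>
<tr><th>6</th><td>0.12781</td><td>0.931115</td></tr>
<tr><th>7</th><td>0.438939</td><td>0.496169</td></tr>
<tr><th>8</th><td>0.732</td><td>0.299058</td></tr>
<tr><th>9</th><td>0.449182</td><td>0.875096</td></tr>
<tr><th>10</th><td>0.0462887</td><td>0.365109</td></tr>
<tr><th>11</th><td>0.302478</td><td>0.283401</td></tr>
<tr><th>12</th><td>0.404953</td><td>0.658815</td></tr>
<tr><th>13</th><td>0.515627</td><td>0.59552</td></tr>
<tr><th>14</th><td>0.292462</td><td>0.61816</td></tr>
<tr><th>15</th><td>0.66426</td><td>0.753508</td></tr>
<tr><th>16</th><td>0.0368842</td><td>0.401421</td></tr>
<tr><th>17</th><td>0.525057</td><td>0.61201</td></tr>
<tr><th>18</th><td>0.432577</td><td>0.576082</td></tr>
<tr><th>19</th><td>0.218177</td><td>0.932984</td></tr>
<tr><th>20</th><td>0.827263</td><td>0.6343</td></tr>
<tr><th>21</th><td>0.132715</td><td>0.869237</td></tr>
<tr><th>22</th><td>0.0396356</td><td>0.431188</td></tr>
<tr><th>23</th><td>0.137658</td><td>0.255054</td></tr>
<tr><th>24</th><td>0.498734</td><td>0.52509</td></tr>
<tr><th>25</th><td>0.265511</td><td>0.834362</td></tr>
<tr><th>26</th><td>0.633427</td><td>0.112987</td></tr>
<tr><th>27</th><td>0.78299</td><td>0.838042</td></tr>
<tr><th>28</th><td>0.0878598</td><td>0.748041</td></tr>
<tr><th>29</th><td>0.265595</td><td>0.612628</td></tr>
<tr><th>30</th><td>0.705766</td><td>0.508363</td></tr>
</tbody>
</table>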
" ], "text/latex": [ "\\begin{tabular}{r|cc}\n", "\t& first & last\\\\\n", "\t\\hline\n", "\t& Float64 & Float64\\\\\n", "\t\\hline\n", "\t1 & 0.646691 & 0.276021 \\\\\n", "\t2 & 0.651664 & 0.842714 \\\\\n", "\t3 & 0.950498 & 0.96467 \\\\\n", "\t4 & 0.945775 & 0.789904 \\\\\n", "\t5 & 0.82116 & 0.314926 \\\\\n", "\t6 & 0.12781 & 0.931115 \\\\\n", "\t7 & 0.438939 & 0.496169 \\\\\n", "\t8 & 0.732 & 0.299058 \\\\\n", "\t9 & 0.449182 & 0.875096 \\\\\n", "\t10 & 0.0462887 & 0.365109 \\\\\n", "\t11 & 0.302478 & 0.283401 \\\\\n", "\t12 & 0.404953 & 0.658815 \\\\\n", "\t13 & 0.515627 & 0.59552 \\\\\n", "\t14 & 0.292462 & 0.61816 \\\\\n", "\t15 & 0.66426 & 0.753508 \\\\\n", "\t16 & 0.0368842 & 0.401421 \\\\\n", "\t17 & 0.525057 & 0.61201 \\\\\n", "\t18 & 0.432577 & 0.576082 \\\\\n", "\t19 & 0.218177 & 0.932984 \\\\\n", "\t20 & 0.827263 & 0.6343 \\\\\n", "\t21 & 0.132715 & 0.869237 \\\\\n", "\t22 & 0.0396356 & 0.431188 \\\\\n", "\t23 & 0.137658 & 0.255054 \\\\\n", "\t24 & 0.498734 & 0.52509 \\\\\n", "\t25 & 0.265511 & 0.834362 \\\\\n", "\t26 & 0.633427 & 0.112987 \\\\\n", "\t27 & 0.78299 & 0.838042 \\\\\n", "\t28 & 0.0878598 & 0.748041 \\\\\n", "\t29 & 0.265595 & 0.612628 \\\\\n", "\t30 & 0.705766 & 0.508363 \\\\\n", "\t$\\dots$ & $\\dots$ & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "10000000×2 DataFrame\n", "│ Row │ first │ last │\n", "│ │ \u001b[90mFloat64\u001b[39m │ \u001b[90mFloat64\u001b[39m │\n", "├──────────┼───────────┼──────────┤\n", "│ 1 │ 0.646691 │ 0.276021 │\n", "│ 2 │ 0.651664 │ 0.842714 │\n", "│ 3 │ 0.950498 │ 0.96467 │\n", "│ 4 │ 0.945775 │ 0.789904 │\n", "│ 5 │ 0.82116 │ 0.314926 │\n", "│ 6 │ 0.12781 │ 0.931115 │\n", "│ 7 │ 0.438939 │ 0.496169 │\n", "│ 8 │ 0.732 │ 0.299058 │\n", "│ 9 │ 0.449182 │ 0.875096 │\n", "│ 10 │ 0.0462887 │ 0.365109 │\n", "⋮\n", "│ 9999990 │ 0.209058 │ 0.567608 │\n", "│ 9999991 │ 0.700468 │ 0.347931 │\n", "│ 9999992 │ 0.231368 │ 0.862016 │\n", "│ 9999993 │ 0.869351 │ 0.444795 │\n", "│ 9999994 │ 0.821356 │ 0.509054 │\n", 
"│ 9999995 │ 0.589245 │ 0.669708 │\n", "│ 9999996 │ 0.806262 │ 0.734397 │\n", "│ 9999997 │ 0.216506 │ 0.335015 │\n", "│ 9999998 │ 0.0100723 │ 0.942299 │\n", "│ 9999999 │ 0.499669 │ 0.964065 │\n", "│ 10000000 │ 0.663339 │ 0.887989 │" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_test = select(df, :pos => ByRow(first) => :first, :pos => ByRow(last) => :last)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Figure(PyObject
)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "hist(df_test.first, 100);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far everything looks good. But let us look at the distribution of the last drawn random number:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "Figure(PyObject
)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "hist(df_test.last, 100);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, is the `rand()` function broken for the last generated random number in each sequence, or has something else made the distribution stop being uniform?" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.4.1", "language": "julia", "name": "julia-1.4" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.4.1" } }, "nbformat": 4, "nbformat_minor": 4 }