

20 Benchmark

Shakti is a fast data analysis language, and clear benchmarks illustrate this. The Shakti website provides a number of files for this purpose, among them b.k and taxi.k.

20.1 b.k

T:{09:30:00+_6.5*3600*(!x)%x}
P:{10+x rand 90};Z:{1000+x rand 9000};E:rand[;"ABCD"]

/m:2;n:6
m:7000;n:5600000;
S:(-m)rand `4;N:|1+_n*{x%+/x:exp 15*(!x)%x}m

t:S!{+`t`e`p`z!(T;E;P;Z)@'x}'N
q:S!{+`t`e`b!(T;E;P)@'x}'6*N

a:*A:100#S

\t  {select max p by e from x}'t A
\t  {select sum z by `o t from x}'t A
\t:10 {select last b from x}'q A
\t:10 select from t[a],`t^q a where p<b
\

C:M:rand[;"ABCDEFGHIJ"]
trade(sym time exchange price size cond)
quote(sym time exchange bid bz ask az mode)

                Q1      Q2      Q3      Q4  ETL   RAM   DSK
k                1       9       9       1                                            
postg        71000    1500    1900     INF  200   1.5   4.0
spark       340000    7400    8400     INF  160  50.0   2.4
mongo        89000    1700    5800     INF  900   9.0  10.0   

960 billion quotes (S has 170 billion. QQQ has 6 billion.)
 48 billion trades (S has 12 billion. QQQ has 80 million.)

20.1.1 T

T is a function that generates a list of x uniformly spaced times over the trading day, from 09:30 up to (but not including) 16:00.

 T:{09:30:00+_6.5*3600*(!x)%x}
 T[13]           / 13 times with equal timesteps over [start;end)
^09:30:00 10:00:00 10:30:00 11:00:00 11:30:00 .. 15:00:00 15:30:00
 ?1_-':T[10000]  / determine the unique timesteps
?00:00:02 00:00:03
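The arithmetic inside T can be sketched in Python (an illustrative translation, not Shakti code; times are held as seconds since midnight and formatted as HH:MM:SS):

```python
def T(x):
    """x uniformly spaced times over [09:30:00; 16:00:00), as seconds since midnight."""
    start = 9 * 3600 + 30 * 60                    # 09:30:00 in seconds
    return [start + int(6.5 * 3600 * i / x) for i in range(x)]

def hhmmss(s):
    return f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"

times = [hhmmss(s) for s in T(13)]                # 13 times with equal timesteps
# for a large x the floor makes consecutive timesteps a mix of 2s and 3s
steps = {b - a for a, b in zip(T(10000), T(10000)[1:])}
```

As in the k session, times begins 09:30:00 10:00:00 .. and steps contains only the 2- and 3-second deltas.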

20.1.2 P, Z, E

P is a function to generate prices from 10 to 99. Z is a function to generate sizes from 1000 to 9999. E is a function to generate one of the exchanges A, B, C, or D.

 P[10]
78 37 56 85 40 68 88 50 41 78
 Z[10]
4820 2926 1117 4700 9872 3274 6503 6123 9451 2234
 E[10]
"AADCBCCCBC"

20.1.3 m, n, S, N

m is the number of symbols. n is the number of trades. S is a list of symbol names. N is a list of counts in decreasing order that sums approximately to n.

 4#S
`EEFD`IOHJ`MEJO`DHNK
 4#N
11988 11962 11936 11911
 +/N
5604390
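The exponential weighting inside N can be sketched in Python (an illustrative translation of `|1+_n*{x%+/x:exp 15*(!x)%x}m`; the values depend only on m and n, not on randomness):

```python
import math

m, n = 7000, 5_600_000
w = [math.exp(15 * i / m) for i in range(m)]   # exponentially increasing weights
s = sum(w)
# normalize, scale by n, floor, add 1, reverse: a decreasing list summing to roughly n
N = [1 + int(n * x / s) for x in w][::-1]
```

The largest count lands near 11988, matching the k session above, and the smallest symbols get a single trade each.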

20.1.4 t

t is an ntable of trades keyed on symbol. The fields are time (t), exchange (e), price (p), and size (z). The total number of trades is set by n.

Pulling one random table from t and showing 10 random rows.

 10 rand*t@1 rand S
t        e p  z   
-------- - -- ----
14:37:53 D 73 4397
11:43:25 B 20 2070
10:21:18 A 53 6190
13:26:03 C 33 7446
14:07:06 B 13 2209
15:08:41 D 12 4779
14:27:37 A 11 6432
11:22:53 D 92 9965
11:12:37 A 14 5255
12:24:28 A 48 3634

20.1.5 q

q is an ntable of quotes keyed on symbol. The fields are time (t), exchange (e), and bid (b). The number of quotes is set to 6*n.

 10 rand*q@1 rand S
t        e b 
-------- - --
11:31:12 A 80
14:08:40 C 63
14:05:07 D 12
11:31:43 A 56
12:44:19 A 45
10:13:21 A 71
15:19:08 A 74
13:42:20 D 43
11:31:41 D 66
14:41:38 A 63

20.1.6 a, A

a is the first symbol of S. A consists of the first 100 symbols of S.

 a
`PKEM

20.1.7 Max price by exchange

The query takes 100 tables from the trade ntable and computes the max price by exchange.

 *{select max p by e from x}'t A
e|p 
-|--
A|99
B|99
C|99
D|99
 \t  {select max p by e from x}'t A
22

20.1.8 Compute sum of trade size by hour

This query takes 100 tables from the trade ntable and computes the sum of trade size by hour.

 *{select sum z by `o t from x}'t A
t |z       
--|--------
09| 4885972
10|10178053
11|10255045
12|10243846
13|10071057
14|10203428
15|10176102
 \t  {select sum z by `o t from x}'t A
27
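In the query, `o t casts each time to its hour, so sizes are bucketed by hour of day (as the output above shows). The same bucketing, sketched in Python over hypothetical (time-in-seconds, size) pairs standing in for columns t and z:

```python
from collections import defaultdict

# hypothetical trades: (time in seconds since midnight, size)
trades = [(9 * 3600 + 45 * 60, 100), (10 * 3600 + 5 * 60, 250),
          (10 * 3600 + 50 * 60, 50), (15 * 3600, 75)]

by_hour = defaultdict(int)
for t, z in trades:
    by_hour[t // 3600] += z        # like `o t in k: truncate the time to its hour
```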

20.1.9 Compute last bid by symbol

This query takes 100 tables from the quote ntable and returns the last bid for each.

 3 rand{select last b from x}'q A
b 
--
18
98
85

 \t:10 {select last b from x}'q A
2

20.1.10 Find trades below the bid

This query operates on a single symbol from the q and t ntables, i.e. one quote table and one trade table. The quote table is joined to the trade table, giving the prevailing bid at each trade.

 4 rand select from t[a],`t^q a where p<b
t        e p  z    b 
-------- - -- ---- --
13:54:35 B 94 1345 96
11:59:52 C 26 1917 89
10:00:44 C 40 9046 81
10:59:39 A 25 5591 72
 \t:10 select from t[a],`t^q a where p<b
3
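Conceptually this is an as-of join: each trade is matched with the most recent quote at or before its time, and kept if its price is below that bid. A Python sketch with hypothetical data (times as seconds since midnight):

```python
import bisect

# hypothetical quotes (time, bid), sorted by time, and trades (time, price)
quotes = [(34200, 50), (36000, 80), (40000, 60)]
trades = [(35000, 60), (37000, 70), (41000, 65)]

def below_bid(trades, quotes):
    """Keep each trade whose price is below the prevailing bid at its time."""
    qtimes = [t for t, _ in quotes]
    out = []
    for t, p in trades:
        i = bisect.bisect_right(qtimes, t) - 1   # last quote at or before t
        if i >= 0 and p < quotes[i][1]:
            out.append((t, p, quotes[i][1]))     # trade time, price, prevailing bid
    return out

result = below_bid(trades, quotes)
```

Here only the 37000-second trade (price 70) falls below its prevailing bid of 80.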

20.2 taxi

The taxi data analysis problem has become well known given the size of the dataset and the ease of acquiring it. It is written up, with a benchmark summary, at https://tech.marksblogg.com/benchmarks.html. Shakti has a benchmark script to generate simulated taxi data in order to check performance.

/taxi 1.1billion https://tech.marksblogg.com/benchmarks.html
/type/pcount/distance/amount

g:{[[]t:x rand`y`g;p:x rand 9;d:x rand 100;a:x rand 100.]}
x:d!g':44000+&#d:2009.01.01+!2500  /110 million example

ys:{`y[!x]sum/x} / year sum
\t sum{select[t]count from x}':x
\t ys@{select[p]count from x}':x
\t sum{select[p]sum a from x}':x
\t ys@{select[p,d]count from x}':x

\
x:g 10
select[t]count from x
select[p]count from x
select[p]sum a from x
select[p,d]count from x


Q1 select[t]count from x
Q2 select[p]avg a from x
Q3 select[d.y,p]count from x
Q4 select[d.y,p,d]count from x

          cpu  cost core/ram  elapsed         machines
k           4 .0004    4/16    1            1*i3.2xlarge(8v/32/$.62+$.93)
redshift  864 .0900  108/1464  8(1 2 2 3)   6*ds2.8xlarge(36v/244/$6.80)
db/spark 1260 .0900   42/336  30(2 4 4 20) 21*m5.xlarge(4v/16/$.20+$.30)
bigquery 1600 .3200  200/3200  8(2 2 1 3)

cost: k/redshift/databricks(1.5*EC2) bigquery(redshift) $5.00*TB  k($.05/TB)

csv
/vendor,pickup,dropoff,pcount,dist1,plong,plat,rate,flag,dlong,dlat,ptype,fare1,sur1,mta1,tip1,toll1,amount1
t:"b  12            e"  / type(2) passenger(8)
\t t:(`t`p`d`a;",";t)0:"taxi.csv"

20.2.1 g

Function to generate a table of random data representing taxi company (t, either yellow `y or green `g), passenger count (p, 0-8), distance travelled (d, 0-99), and fare amount (a, a float from 0 to 100).

 g:{[[]t:x rand`y`g;p:x rand 9;d:x rand 100;a:x rand 100.]}
 g 10
t p d  a       
- - -- --------
y 7 38 50.67771
g 5 34 38.51022
y 6 40 9.654263
y 6  5 98.91451
g 7 47 44.40432
g 0 64 66.07784
y 0 99 25.84292
y 5 46 46.87825
y 5 66 81.61647
g 0 10 7.946983
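The same generator can be sketched in Python (illustrative only; column names as in g, and the k ranges noted in comments):

```python
import random

def g(x):
    """x random rows: company t, passenger count p, distance d, amount a."""
    return [{"t": random.choice(["y", "g"]),   # x rand `y`g
             "p": random.randrange(9),         # x rand 9    -> 0..8
             "d": random.randrange(100),       # x rand 100  -> 0..99
             "a": random.uniform(0, 100)}      # x rand 100. -> [0;100)
            for _ in range(x)]

rows = g(10)
```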

20.2.2 x

x is an ntable containing 2500 tables, keyed on day. Each individual table contains 44,000 rows, so x has 110 million rows in total.

 x:d!g':44000+&#d:2009.01.01+!2500  /110 million example
 x
2009.01.01|[[]t:`y`y`g`y`g`g`y`y`g`g`g`g`g`y`y`y`y`g`g`g..
2009.01.02|[[]t:`y`g`g`g`y`g`y`g`y`g`g`g`g`y`g`y`g`y`g`y..
2009.01.03|[[]t:`y`g`g`g`y`g`y`g`y`g`g`g`g`y`g`y`g`y`g`y..
2009.01.04|[[]t:`y`g`g`g`y`g`y`g`y`g`g`g`g`y`g`y`g`y`g`y..
..

20.2.3 2009.01.04

Let’s work with a single table to understand the queries and limit it to 5 rows.

 q:5#x 2009.01.04;q
t p d  a       
- - -- --------
y 3 41 54.5307 
g 6 70 70.4241 
g 1 63 81.46645
g 7 36 43.12615
y 1 43 23.50114

Now let’s count the number of rows by field t. Looking at the data above you should expect 2 yellow (y) taxi trips and 3 green (g) taxi trips.

 select[t]count from q
t|n
-|-
g|3
y|2

We can also sum the passenger count (p) by taxi company (t).

 select[t]sum p from q
t|p
-|--
g|14
y| 4

Now, if we want to compute the same over all the data, we need to run the kSQL query over each table. We'll do this on parallel threads to speed it up.

 {select[t]count from x}':x
2009.01.01|[[t:`g`y]n:21968 22032]
2009.01.02|[[t:`g`y]n:21962 22038]
2009.01.03|[[t:`g`y]n:21962 22038]
2009.01.04|[[t:`g`y]n:21962 22038]

and then sum over all days.

 sum{select[t]count from x}':x
t|n       
-|--------
g|55010927
y|54989073

Running the command with the \t timer measures how long, in milliseconds, the sum over the 110 million rows takes.

 \t sum{select[t]count from x}':x
1077
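The per-day-then-sum pattern can be sketched in Python: run the Q1 group-by count on each day's table, then sum the per-day results (a hypothetical miniature of x):

```python
from collections import Counter

# hypothetical miniature version of x: day -> list of company tags
x = {"2009.01.01": ["y", "g", "y"],
     "2009.01.02": ["g", "g", "y"]}

per_day = {d: Counter(rows) for d, rows in x.items()}   # Q1 on each table
total = sum(per_day.values(), Counter())                # then sum over all days
```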
