Next: Conclusion, Previous: Examples, Up: k9: Manual
Shakti is a fast data-analysis language, and clear benchmarks illustrate this. The Shakti website provides a number of scripts for this purpose, b.k and taxi.k among others. The trade/quote script, reproduced below, generates simulated data and times a handful of queries; the rest of this section walks through it.
T:{09:30:00+_6.5*3600*(!x)%x}
P:{10+x?90};Z:{1000+x?9000};E:?[;"ABCD"]
/m:2;n:6
m:7000;n:5600000;
S:(-m)?`4;N:|1+_n*{x%+/x:exp 15*(!x)%x}m
t:S!{+`t`e`p`z!(T;E;P;Z)@'x}'N
q:S!{+`t`e`b!(T;E;P)@'x}'6*N
a:*A:100#S
\t {select max p by e from x}'t A
\t {select sum z by `o t from x}'t A
\t:10 {select last b from x}'q A
\t:10 select from t[a],`t^q a where p<b
\
C:M:?[;"ABCDEFGHIJ"]
trade(sym time exchange price size cond)
quote(sym time exchange bid bz ask az mode)

           Q1    Q2    Q3   Q4  ETL   RAM   DSK
k           1     9     9    1
postg   71000  1500  1900  INF  200   1.5   4.0
spark  340000  7400  8400  INF  160  50.0   2.4
mongo   89000  1700  5800  INF  900   9.0  10.0

960 billion quotes (S has 170 billion. QQQ has 6 billion.)
48 billion trades (S has 12 billion. QQQ has 80 million.)
T is a function that generates a list of x evenly spaced times over the trading day, from 09:30 up to (but excluding) 16:00.
 T:{09:30:00+_6.5*3600*(!x)%x}
 T[13] / 13 times with equal timesteps over [start;end)
^09:30:00 10:00:00 10:30:00 11:00:00 11:30:00 .. 15:00:00 15:30:00
 ?1_-':T[10000] / determine the unique timesteps
?00:00:02 00:00:03
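For readers who think in Python, the following NumPy sketch (an illustration only, not part of the manual's k code; the name uniform_times is invented here) mirrors what T computes: x evenly spaced times over the 6.5-hour session, expressed as seconds since midnight.

import numpy as np

def uniform_times(x):
    # x evenly spaced times over [09:30:00;16:00:00), as seconds since midnight
    start = 9*3600 + 30*60                                    # 09:30:00
    return start + np.floor(6.5*3600 * np.arange(x) / x).astype(int)

# uniform_times(13) steps by 1800 s: 09:30:00, 10:00:00, ..., 15:30:00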
P is a function to generate values from 10 to 99 (price). Z is a function to generate values from 1000 to 9999 (size). E is a function to generate values A, B, C, or D (exchange).
 P[10]
78 37 56 85 40 68 88 50 41 78
 Z[10]
4820 2926 1117 4700 9872 3274 6503 6123 9451 2234
 E[10]
"AADCBCCCBC"
m is the number of symbols and n the total number of trades. S is a list of symbol names. N is a list of per-symbol trade counts, in decreasing order, that sums to approximately n.
 4#S
`EEFD`IOHJ`MEJO`DHNK
 4#N
11988 11962 11936 11911
 +/N
5604390
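The one-line construction of N is dense; a rough NumPy re-expression (illustrative only, using the same m and n as the script) is: weight the m symbols exponentially, normalise the weights to sum to 1, scale by n, floor, add 1, and reverse so the counts decrease.

import numpy as np

m, n = 7000, 5600000
w = np.exp(15 * np.arange(m) / m)              # exponentially growing weights
w = w / w.sum()                                # normalise to sum to 1
N = (1 + np.floor(n * w)).astype(int)[::-1]    # scale, floor, 1+, reverse -> decreasing
# N.sum() is a little over n; a handful of symbols carry most of the trades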
t is an ntable of trades. The fields are time (t), exchange (e), price (p), and size (z). The number of trades is set by n.
Pulling one random table from t and showing 10 random rows.
 10?*t@1?S
t        e p  z
-------- - -- ----
14:37:53 D 73 4397
11:43:25 B 20 2070
10:21:18 A 53 6190
13:26:03 C 33 7446
14:07:06 B 13 2209
15:08:41 D 12 4779
14:27:37 A 11 6432
11:22:53 D 92 9965
11:12:37 A 14 5255
12:24:28 A 48 3634
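Conceptually t is a dictionary mapping each symbol to its own trade table. A pandas sketch of the same idea (illustrative only; the symbol generation here is a stand-in, not the k code):

import numpy as np
import pandas as pd

rng = np.random.default_rng()
m, n = 7000, 5600000

S = ["".join(rng.choice(list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"), 4)) for _ in range(m)]
w = np.exp(15 * np.arange(m) / m); w /= w.sum()
N = (1 + np.floor(n * w)).astype(int)[::-1]            # row counts per symbol

def trade_table(rows):
    # one per-symbol trade table: time (seconds since midnight), exchange, price, size
    return pd.DataFrame({
        "t": 34200 + np.floor(23400 * np.arange(rows) / rows).astype(int),
        "e": rng.choice(list("ABCD"), size=rows),
        "p": 10 + rng.integers(0, 90, size=rows),
        "z": 1000 + rng.integers(0, 9000, size=rows),
    })

t = {sym: trade_table(rows) for sym, rows in zip(S, N)}   # "ntable": dict of tables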
q is an ntable of quotes. The fields are time (t), exchange (e), and bid (b). The number of quotes is set to 6*n.
 10?*q@1?S
t        e b
-------- - --
11:31:12 A 80
14:08:40 C 63
14:05:07 D 12
11:31:43 A 56
12:44:19 A 45
10:13:21 A 71
15:19:08 A 74
13:42:20 D 43
11:31:41 D 66
14:41:38 A 63
a is the first symbol of S. A consists of the first 100 symbols of S.
 a
`PKEM
This query takes the 100 tables A from the trade ntable and computes the maximum price by exchange.
 *{select max p by e from x}'t A
e|p
-|--
A|99
B|99
C|99
D|99
 \t {select max p by e from x}'t A
22
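In pandas terms (again an illustrative translation, assuming the dict t and symbol list S from the sketch above), mapping the query over the 100 tables is a list of groupbys:

A = S[:100]                                    # first 100 symbols, as in A:100#S
results = [t[sym].groupby("e")["p"].max() for sym in A]
results[0]                                     # max price per exchange, first symbol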
This query takes the 100 tables from the trade ntable and computes the total size traded in each hour.
 *{select sum z by `o t from x}'t A
t |z
--|--------
09| 4885972
10|10178053
11|10255045
12|10243846
13|10071057
14|10203428
15|10176102
 \t {select sum z by `o t from x}'t A
27
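The by clause here buckets each trade into its hour (the t column of the result runs 09 through 15). With the seconds-since-midnight representation used in the pandas sketches above, the same grouping would look like:

# total size traded per hour, for each of the 100 per-symbol tables (t and A from above)
by_hour = [t[sym].groupby(t[sym]["t"] // 3600)["z"].sum() for sym in A]
by_hour[0]        # index 9..15 -> size traded in that hour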
This query takes the 100 tables from the quote ntable and returns the last bid.
 3?{select last b from x}'q A
b
--
18
98
85
 \t:10 {select last b from x}'q A
2
This query operates on one symbol from the q and t ntables, i.e. a single quote table and a single trade table. The quote table is joined to the trade table, giving the current bid at each trade, and the query keeps only the trades priced below that bid.
 4?select from t[a],`t^q a where p<b
t        e p  z    b
-------- - -- ---- --
13:54:35 B 94 1345 96
11:59:52 C 26 1917 89
10:00:44 C 40 9046 81
10:59:39 A 25 5591 72
 \t:10 select from t[a],`t^q a where p<b
3
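One way to picture this in pandas (an interpretation of the join described above, not a literal translation of `t^q a) is an as-of merge on time, keeping the trades priced below the prevailing bid. This assumes the t, A and rng objects from the earlier sketches; the quote frame is a stand-in with a time column t and a bid column b.

a = A[0]                                       # one symbol, as in a:*A:100#S
trades = t[a]                                  # times are already ascending
nq = 6 * len(trades)
quotes = pd.DataFrame({
    "t": 34200 + np.floor(23400 * np.arange(nq) / nq).astype(int),
    "b": 10 + rng.integers(0, 90, size=nq),
})
joined = pd.merge_asof(trades, quotes, on="t") # latest quote at or before each trade
joined[joined["p"] < joined["b"]]              # trades below the current bid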
The taxi data analysis problem has become well known given the ease of acquiring the data and its size. It is well written up, with a benchmark summary, at https://tech.marksblogg.com/benchmarks.html. Shakti has a benchmark script, reproduced below, that generates simulated taxi data in order to check performance.
/taxi 1.1billion https://tech.marksblogg.com/benchmarks.html
/type/pcount/distance/amount
g:{[[]t:x rand`y`g;p:x rand 9;d:x rand 100;a:x rand 100.]}
x:d!g':44000+&#d:2009.01.01+!2500 /110 million example
ys:{`y[!x]sum/x} / year sum
\t sum{select[t]count from x}':x
\t ys@{select[p]count from x}':x
\t sum{select[p]sum a from x}':x
\t ys@{select[p,d]count from x}':x
\
x:g 10
select[t]count from x
select[p]count from x
select[p]sum a from x
select[p,d]count from x

Q1 select[t]count from x
Q2 select[p]avg a from x
Q3 select[d.y,p]count from x
Q4 select[d.y,p,d]count from x

           cpu   cost  core/ram  elapsed       machines
k            4  .0004      4/16  1             1*i3.2xlarge(8v/32/$.62+$.93)
redshift   864  .0900  108/1464  8(1 2 2 3)    6*ds2.8xlarge(36v/244/$6.80)
db/spark  1260  .0900    42/336  30(2 4 4 20)  21*m5.xlarge(4v/16/$.20+$.30)
bigquery  1600  .3200  200/3200  8(2 2 1 3)

cost: k/redshift/databricks(1.5*EC2) bigquery(redshift) $5.00*TB k($.05/TB)

csv
/vendor,pickup,dropoff,pcount,dist1,plong,plat,rate,flag,dlong,dlat,ptype,fare1,sur1,mta1,tip1,toll1,amount1
t:"b 12 e" / type(2) passenger(8)
\t t:(`t`p`d`a;",";t)0:"taxi.csv"
g is a function that generates a table of random data representing taxi company (t, either yellow `y or green `g), passenger count (p, 0-8), distance travelled (d, 0-99) and fare amount (a, 0-100.).
 g:{[[]t:x?`y`g;p:x?9;d:x?100;a:x?100.]}
 g 10
t p d  a
- - -- --------
y 7 38 50.67771
g 5 34 38.51022
y 6 40 9.654263
y 6 5  98.91451
g 7 47 44.40432
g 0 64 66.07784
y 0 99 25.84292
y 5 46 46.87825
y 5 66 81.61647
g 0 10 7.946983
x is an ntable containing 2500 tables, keyed on day. Each individual table contains 44,000 rows, so x has 110 million rows in total.
 x:d!g':44000+&#d:2009.01.01+!2500 /110 million example
 x
2009.01.01|[[]t:`y`y`g`y`g`g`y`y`g`g`g`g`g`y`y`y`y`g`g`g..
2009.01.02|[[]t:`y`g`g`g`y`g`y`g`y`g`g`g`g`y`g`y`g`y`g`y..
2009.01.03|[[]t:`y`g`g`g`y`g`y`g`y`g`g`g`g`y`g`y`g`y`g`y..
2009.01.04|[[]t:`y`g`g`g`y`g`y`g`y`g`g`g`g`y`g`y`g`y`g`y..
..
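A pandas sketch of the same structure (illustrative only; materialising all 110 million rows this way needs several GB of RAM):

import numpy as np
import pandas as pd

rng = np.random.default_rng()

def g(rows):
    # random taxi rows: type, passenger count, distance, amount
    return pd.DataFrame({
        "t": rng.choice(["y", "g"], size=rows),
        "p": rng.integers(0, 9, size=rows),
        "d": rng.integers(0, 100, size=rows),
        "a": rng.uniform(0, 100, size=rows),
    })

days = pd.date_range("2009-01-01", periods=2500)
x = {day.date(): g(44000) for day in days}     # 2500 days * 44,000 rows = 110 million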
Let’s work with a single table to understand the queries and limit it to 5 rows.
 q:5#x 2009.01.04;q
t p d  a
- - -- --------
y 3 41 54.5307
g 6 70 70.4241
g 1 63 81.46645
g 7 36 43.12615
y 1 43 23.50114
Now let’s count the number of rows by field t. Looking at the data above you should expect 2 yellow (y) taxi trips and 3 green (g) taxi trips.
 select[t]count from q
t|n
-|-
g|3
y|2
We can also sum the paid fare (p) by taxi company (t).
 select[t]sum p from q
t|p
-|--
g|14
y| 4
Now if we want to compute the same over all the data, we need to run the kSQL query over each table. We'll do this with the ': adverb, which runs over the tables on parallel threads to speed things up.
 {select[t]count from x}':x
2009.01.01|[[t:`g`y]n:21968 22032]
2009.01.02|[[t:`g`y]n:21962 22038]
2009.01.03|[[t:`g`y]n:21962 22038]
2009.01.04|[[t:`g`y]n:21962 22038]
and then sum over all days.
 sum{select[t]count from x}':x
t|n
-|--------
g|55010927
y|54989073
Running the command with the \t timer lets us measure how long it takes to aggregate over the 110 million rows.
 \t sum{select[t]count from x}':x
1077
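The same map-then-reduce in pandas (illustrative, assuming the dict x from the sketch above; unlike the k version it runs on a single thread):

per_day = [frame.groupby("t").size() for frame in x.values()]   # count by type per day
total = pd.concat(per_day).groupby(level=0).sum()               # add across the 2500 days
total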