Copyright © 2018 Ashok P. Nadkarni. All rights reserved.

This document is a programmer’s guide for installing and using the tarray extension which implements two data types — columns and tables. It does not list or detail every option or command implemented by the extension. See the command reference pages accessible from the Main Table of Contents for that information. Also see Introduction for an overview of all the package components and documentation.

Important Typed arrays can also be manipulated using Xtal, a language embeddable in Tcl that is geared towards operating on typed arrays with a more succint syntax. However, Xtal is for the most part not described in this guide. See The Xtal Language for details on its use.

1. Motivation

Although Tcl provides for collections in the form of lists these have limitations when dealing with large amounts of data. These limitations, both in memory usage as well as performance, are a result of how values are internally stored in Tcl. The tarray extension implements collections as native data types with parallelized operations resulting in orders of magnitude improvement in these performance metrics when operating on numeric data (although with only marginal benefits for strings).

In addition to performance, the extension also offers more convenience in terms of more flexible indexing as well as a wider and more powerful set of operations on collections.

2. Installation and loading

Binary packages for some platforms are available from the Sourceforge download area. See the build instructions for other platforms.

To install the extension, extract the files from the distribution to any directory that is included in your Tcl installation’s auto_path variable.

Once installed, the extension can be loaded with the standard Tcl package require command.

% package require tarray
→ 0.9.0

The extension places its commands in the tarray namespace. The primary commands implemented by the extension are column and table, each being an ensemble of subcommands that operate on columns and table respectively.

Other commands provide functionality like iteration and formatting that are independent of the data type.

The examples in this guide assume the commands have been imported into the calling namespace or are included in its namespace path as shown below.

% namespace path tarray

If in addition you want to use the Xtal language, you need to load its package as well.

% package require xtal
→ 0.9.0
% namespace import xtal::xtal

3. Concepts

3.1. Columns

A column is an ordered sequence of values of a single type that is specified when the column is created. The command tarray::column can be used to create and manipulate typed columns.

3.2. Tables

A table is an ordered sequence of named columns of equal size. It can be also be viewed as an array of records where the record fields happen to use column-wise storage. The corresponding tarray::table command operates on tables. Columns in a table can be referenced using either their name or their position in the ordered sequence.

3.3. Types

All elements in a column must be of the type specified when the column is created. The following element types are available:

Table 1. Table Column types
Keyword Type

any

Any Tcl value

string

A string value

boolean

A boolean value

byte

Unsigned 8-bit integer

double

Floating point value

int

Signed 32-bit integer

uint

Unsigned 32-bit integer

wide

Signed 64-bit integer

The primary purpose of the type is to specify what values can be stored in that column. This impacts the compactness of the internal storage (really the primary purpose of the extension) as well certain operations (like sort or search) invoked on the column.

The types any and string are similar in that they can hold any Tcl value. Both are treated as string values for purposes of comparisons and operators. The difference is that the former stores the value using the Tcl internal representation while the latter stores it as a string. The advantage of the former is that internal structure, like a dictionary, is preserved. The advantage of the latter is significantly more compact representation, particularly for smaller strings.

Attempts to store values in a column that are not valid for that column will result in an error being generated.

3.4. Indices

An index into a typed column or table can be specified as either an integer or the keyword end. As in Tcl’s list commands, end specifies the index of the last element in the tarray or the index after it, depending on the command. Simple arithmetic adding of offsets to end is supported, for example end-2 or end+5.

Many commands also allow multiple indices to be specified. These may take one of two forms — a range which includes all indices between a lower and an upper bound (inclusive), and an index list which may be one of the following:

  • a Tcl list of integers

  • a column of any type other than boolean. The value of each element of the column is converted to an integer that is treated as an index.

  • a column of type boolean. Here the index of each bit in the boolean column that is set to 1 is treated as an index.

Note that keyword end can be used to specify a single index or as a range bound, but cannot be used in an index list.

When indices are specified that cause a column or table to be extended, they must include all indices beyond the current column or table size in any order but without any gaps. For example,

% set I [column series 5]
→ tarray_column int {0 1 2 3 4}
% column place $I {106 105 107 104} {6 5 7 4} 1
→ tarray_column int {0 1 2 3 104 105 106 107}
% column place $I {106 107} {6 7} 2
Ø tarray index 6 out of bounds.
1 Ok: Indices not in order but no gaps
2 Error: no value specified for non-existing index 5

The various forms for indexing are illustrated below using the fill command which stores a value at all specified locations.

% set col [column create double {}] 1
→ tarray_column double {}
% set col [column fill $col 1.0 0 9] 2
→ tarray_column double {1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0}
% set col [column fill $col 2.0 3] 3
→ tarray_column double {1.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0}
% set col [column fill $col 2.0 end-2 end] 4
→ tarray_column double {1.0 1.0 1.0 2.0 1.0 1.0 1.0 2.0 2.0 2.0}
% set col [column fill $col 3.0 {2 7}] 5
→ tarray_column double {1.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 2.0 2.0}
% set col [column fill $col 3.0 [column create int {2 7}]] 6
→ tarray_column double {1.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 2.0 2.0}
1 Creates a new column
2 Indices specified as range 0 to 9
3 Single index 3
4 Range relative to end
5 Indices specified as a list
6 Indices specified as an int column

The last form, an integer column, is useful because some commands return indices in that form. For example, the following will replace all elements greater than 2.0 with 0.0.

% set col [column fill $col 0.0 [column search -all -gt $col 2.0]]
→ tarray_column double {1.0 1.0 0.0 2.0 1.0 1.0 1.0 0.0 2.0 2.0}

3.5. Values and variables

Many commands that modify columns and tables come in two flavors:

  • Commands that operate on column and table values and return the modified column or table as a result (for example fill), and

  • Commands that modify a Tcl variable containing the column or table (for example vfill).

The difference is similar to how different Tcl list commands behave, e.g. linsert and lreplace versus lset and lappend.

The examples above used the value-oriented form of the commands where the fill modifies a copy of the contents of the and returns the modified copy which is then stored back into . For large typed arrays, this is inefficient and the above would be better written as

% set col [column create double {}]
→ tarray_column double {}
% column vfill col 1.0 0 9
→ tarray_column double {1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0}
% column vfill col 2.0 3
→ tarray_column double {1.0 1.0 1.0 2.0 1.0 1.0 1.0 1.0 1.0 1.0}
% column vfill col 3.0 {2 7}
→ tarray_column double {1.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 1.0 1.0}
% column vfill col 3.0 [column create int {2 7}]
→ tarray_column double {1.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 1.0 1.0}

Here the vfill command is directly modifying the variable and assuming the content is not shared, no copy needs to be made.

This guide does not necessarily illustrate both forms of every command.

4. Operating on columns

The column command is used to operate on columns.

4.1. Creating columns

The column create command creates columns of a specified type.

% column create int
→ tarray_column int {}

The above creates a column that can hold elements of the int type. Note that the command returns a value that would normally be assigned to a variable.

4.1.1. Creating initialized columns

A column can be initialized at creation time. Let us create some columns we will use as examples throughout this guide that tracks weather through the year.

% set months [column create string {
    Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
}]
→ tarray_column string {Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}
% set temperatures [column create int {
    12  14  23  28  34  30  26  25  26  22  18  15
}]
→ tarray_column int {12 14 23 28 34 30 26 25 26 22 18 15}
% set rainfall [column create double {
    11.0 23.3 18.4 14.7 70.3 180.5 210.2 205.8 126.4 64.9 33.1 19.2
}]
→ tarray_column double {11.0 23.3 18.4 14.7 70.3 180.5 210.2 205.8 126.4 64.9 3...
Warning

Applications should not depend on the exact string representation of a column or table as that is liable to change. Use only the tarray package commands to create and manipulated columns and tables. So do not create the columns above as either of the following

set rainfall {tarray_column double {11.0 23.3 …​}}
set rainfall [list tarray_column double [list 11.0 23.3 …​]]

The above examples use a Tcl list for initialization. Alternatively, a column could also be used instead. For example,

% set first_quarter [column create any [column range $months 0 2]]
→ tarray_column any {Jan Feb Mar}

Notice the initialization column type need not be the same as long as the values are compatible.

A related command is column cast which differs in that it is less strict in terms of converting values. For example,

% set I [column create int {255 256 257}]
→ tarray_column int {255 256 257}
% set B [column create byte $I] 1
Ø Value 256 does not fit in a byte.
% set B [column cast byte $I] 2
→ tarray_column byte {255 0 1}
1 Error because values do not fit in a byte
2 No error, high order bits discarded

Columns are resized as needed but as an optimization, preallocation may be requested.

% column create int {0 1 2 3} 1000
→ tarray_column int {0 1 2 3}

This preallocates space for a thousand elements with the first four being initialized.

4.1.2. Creating numeric series

Columns containing equally spaced values can be created with the column series command.

% column series 10 1
→ tarray_column int {0 1 2 3 4 5 6 7 8 9}
% column series 5 -5 -2 2
→ tarray_column int {5 3 1 -1 -3}
% column series 10.0 3
→ tarray_column double {0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0}
1 0 (default) to 10 with step 1 (default)
2 Decreasing from 5 to -5 with step -2
3 Series of doubles instead of integer

4.1.3. Creating columns with random values

The column random command creates and initializes column with random values.

% column random boolean 16 1
→ tarray_column boolean {1 0 0 0 0 0 1 0 0 1 0 0 1 1 0 0}
% column random int 5 -100 100 2
→ tarray_column int {-44 30 3 46 53}
1 Array of 16 randomly generated booleans
2 Array of 5 random integers in a range

For some uses, such as testing, it is useful to be able to generate the same sequence of random numbers on every run. The randseed command can be used to set or reset the initial seeds used for random number generation.

randseed 100 200      → (empty) 1
column random byte 10 → tarray_column byte {69 190 171 125 21 117 154 177 221 27}
column random byte 10 → tarray_column byte {199 123 28 62 181 74 95 197 253 55}
randseed 100 200      → (empty) 2
column random byte 10 → tarray_column byte {69 190 171 125 21 117 154 177 221 27} 3
randseed              → (empty) 4
column random byte 10 → tarray_column byte {226 75 91 3 12 80 17 76 40 86} 5
1 Set the initial seeds for random number generation
2 Set the same seeds again
3 Notice same values generated
4 Reset to use random initial seeds
5 Now again non-deterministic values are generated

4.1.4. Creating bit maps

As a convenience, boolean columns which have all bits set to 0 or to 1 can be created with column bitmap0 and column bitmap1 respectively.

% set zerobits [column bitmap0 8]
→ tarray_column boolean {0 0 0 0 0 0 0 0}
% set onebits  [column bitmap1 8]
→ tarray_column boolean {1 1 1 1 1 1 1 1}

Further, specific bits can be initialized to the complement.

% set bits [column bitmap0 8 {0 7}]
→ tarray_column boolean {1 0 0 0 0 0 0 1}

4.2. Storing elements

A column may be modified by either storing a single value at multiple target locations or a different value at each target location. Further, the locations may be a contiguous range or an arbitrary set of indices. Depending on the command, the values may overwrite existing values or inserted.

4.2.1. Storing one value in multiple locations

The column fill and column vfill commands store a single value at one or more locations, either contiguous or noncontiguous.

% set I [column series 10]
→ tarray_column int {0 1 2 3 4 5 6 7 8 9}
% column fill $I 99 end 1
→ tarray_column int {0 1 2 3 4 5 6 7 8 99}
% column fill $I 99 1 8 2
→ tarray_column int {0 99 99 99 99 99 99 99 99 9}
% column fill $I 99 [column series 10 2] 3
→ tarray_column int {99 1 99 3 99 5 99 7 99 9}
1 Store a single element
2 Store into a contiguous range
3 Store in every other position

4.2.2. Storing multiple values in arbitrary locations

The column place and column vplace commands store each value from a sequence of values at one or more locations in a specified order (not necessarily sequential).

% column place $I {101 102 103} [column create int {3 2 5}] 1
→ tarray_column int {0 1 102 101 4 103 6 7 8 9}
% column place $I [column create int {101 102 103}] {3 2 5} 2
→ tarray_column int {0 1 102 101 4 103 6 7 8 9}
1 Stores list of values at an index column
2 Stores a column of values at an index list

Note from the above how both values as well as indices can be specified as a Tcl list or as a column. This is true for most commands.

4.2.3. Storing multiple values in contiguous locations

The column put and column vput commands store each value from a sequence of values in contiguous locations starting at a specified index.

% column vput I {60 70 80} 6 1
→ tarray_column int {0 1 2 3 4 5 60 70 80 9}
% column vput I $I 2
→ tarray_column int {0 1 2 3 4 5 60 70 80 9 0 1 2 3 4 5 60 70 80 9}
1 Stores values consecutively starting at index 6
2 Appends to itself unspecified starting index defaults to end

4.2.4. Inserting one value in multiple locations

The column insert and column vinsert commands store a single repeated value a specified number of times starting at a given index. The following command inserts the value 99 three times at index 1.

% set I [column series 5]
→ tarray_column int {0 1 2 3 4}
% column insert $I 99 1 3
→ tarray_column int {0 99 99 99 1 2 3 4}

4.2.5. Inserting multiple values

The column inject and column vinject commands insert multiple values passed as a list or a column into contiguous locations starting at a specified index.

% set I [column series 5]
→ tarray_column int {0 1 2 3 4}
% column inject $I {10 20 30} 2 1
→ tarray_column int {0 1 10 20 30 2 3 4}
% column inject $I $I 2 2
→ tarray_column int {0 1 0 1 2 3 4 2 3 4}
1 Inserts 10, 20, 30 at index 2
2 Inserts itself at position 2

4.3. Deleting elements

Elements in a column can be deleted with the column delete and column vdelete commands. Succeeding elements are moved up to occupy the deleted slots. Like the fill command, the indices of the elements to be deleted may be specified as a single index, a range, a list of indices or a index column.

% set I [column random int 10 -100 100]
→ tarray_column int {-85 -55 -4 -93 -16 -37 -87 78 -41 -76}
% column delete $I end 1
→ tarray_column int {-85 -55 -4 -93 -16 -37 -87 78 -41}
% column delete $I 3 6 2
→ tarray_column int {-85 -55 -4 78 -41 -76}
% column delete $I {0 7 2} 3
→ tarray_column int {-55 -93 -16 -37 -87 -41 -76}
% column delete $col [column search -all -lt $col 0] 4
→ tarray_column double {1.0 1.0 3.0 2.0 1.0 1.0 1.0 3.0 1.0 1.0}
1 Delete last
2 Delete third through sixth
3 Delete specified list
4 Delete all negative values

4.4. Retrieving elements

As for storing elements, there are multiple commands for retrieving elements depending on whether multiple elements are to be retrieved and whether they are contiguous or not. In cases where multiple values are retrieved, they are returned as a column by default but can be returned as a list or dictionary (keyed by index) with the use of the -list and -dict options.

4.4.1. Retrieving a single element

In the simplest case, the column index command can be used to retrieve a single element.

% column index $months 4
→ May

4.4.2. Retrieving all elements

The command column values returns all elements of a column as a Tcl list.

% column values $months
→ Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

4.4.3. Retrieving a range of elements

The command column range returns multiple contiguously located elements.

% column range $months 0 2
→ tarray_column string {Jan Feb Mar}
% column range -list $months 0 2
→ Jan Feb Mar
% column range -dict $months 0 2
→ 0 Jan 1 Feb 2 Mar

4.4.4. Retrieving an arbitrary list of elements

Multiple elements at arbitrary indices can be retrieved with the column get command which accepts either a index list or a index column.

% column get $months {10 7 4 7}
→ tarray_column string {Nov Aug May Aug}
% column get $months [column create int {10 7 4 7}]
→ tarray_column string {Nov Aug May Aug}
% column get -list $months [column create int {10 7 4 7}]
→ Nov Aug May Aug

As seen, indices may be repeated and values are returned in order of specified indices.

Boolean index columns are treated differently in that an element is included if the corresponding bit is set in the index column.

% column get $months [column create int {1 0 1}]
→ tarray_column string {Feb Jan Feb}
% column get $months [column create boolean {1 0 1}]
→ tarray_column string {Jan Mar}

Note the difference between the two results above.

4.5. Searching and filtering

4.5.1. Linear search

The column search command works similarly to Tcl’s lsearch. It returns the indices or the values of matching elements in a column. Like lsearch, by default column search stops on the first match and returns the matching index.

% set col [column create int {-10 10 0 20 0 -5 5}]
→ tarray_column int {-10 10 0 20 0 -5 5}
% column search $col 0
→ 2
% column search $col 21 1
→ -1
1 Not found

The above command returns the index of the first element that is 0 using the default matching operator that tests for equality. On the other hand, if the -inline option is specified,

% column search -inline -gt $col 0
→ 10

the command returns the value of the first positive element instead of its index (using the -gt "greater than" operator).

The -all option requests all matching elements to be returned instead of the just the first. Thus

% column search -all -gt $col 0
→ tarray_column int {1 3 6}

returns an int column containging the indices of all positive elements. This is often useful to extract corresponding elements from another column.

% column get $months [column search -all -gt $rainfall 100]
→ tarray_column string {Jun Jul Aug Sep}

When results are further used in logical operations, it is useful to specify the -bitmap option. This causes the results to be returned as a boolean vector where the matching indices are set to 1. That further allows for easy logical operations.

% set not_wet [column search -bitmap -lt $rainfall 150] 1
→ tarray_column boolean {1 1 1 1 1 0 0 0 1 1 1 1}
% set not_dry [column search -bitmap -gt $rainfall 50]
→ tarray_column boolean {0 0 0 0 1 1 1 1 1 1 0 0}
% set moderate_months [column get $months [column && $not_wet $not_dry]]
→ tarray_column string {May Sep Oct}
1 The -bitmap option implies -all
Tip

Queries like this are easier to do using Xtal expressions.

% xtal::xtal { months[rainfall > 50 && rainfall < 150] }
→ tarray_column string {May Sep Oct}

You can of course use -inline with -all to retrieve the values instead of indices. Using the -pat pattern matching operator as an example,

% set exes [column create string {tclsh.exe tclsh.man wish.exe}]
→ tarray_column string {tclsh.exe tclsh.man wish.exe}
% column search -all -inline -nocase -pat $exes *.exe
→ tarray_column string {tclsh.exe wish.exe}

returns the values of all elements that match *.exe using case-insensitive matching as in Tcl’s string match -nocase.

The search can be restricted to only look at specific elements using a combination of -range and -among options. For example, find all summer months with limited rainfall.

% column get $months [column search -all -range {4 7} -lt $rainfall 150]
→ tarray_column string {May}

Similarly, using the -among option, only search the specified elements.

% column get $months [column search -all -among {4 5 6 7} -lt $rainfall 150]
→ tarray_column string {May}

The option -among is particularly useful in combining multiple searches where it can be used to constrain each search to the results of the prior search. We will see an example of this later.

The column search command supports several matching operators like -lt, -gt etc. See the command reference for a full list.

Tip The column intersect3 command offers another way to combine searches across multiple columns as described later.

4.5.2. Keyed lookup

Fore columns of type string only, the column lookup command provides for faster, dictionary based retrieval. This may be beneficial for columns used as keys in a table.

The command returns the index of matching element in the column.

% column lookup months Jun
Ø Object is not a column.
% column lookup months xxx 1
Ø Object is not a column.
1 Returns -1 if not present

4.6. Ordering elements

Several commands offer rearranging of the elements in a column.

4.6.1. Sorting

The most common rearrangement operation is sorting. Columns are sorted using column sort command or its variable targeting analogue column vsort. The commands take the -increasing and -decreasing options to determine the sort order.

% column sort $rainfall 1
→ tarray_column double {11.0 14.7 18.4 19.2 23.3 33.1 64.9 70.3 126.4 180.5 205...
% column sort -decreasing $rainfall
→ tarray_column double {210.2 205.8 180.5 126.4 70.3 64.9 33.1 23.3 19.2 18.4 1...
1 Default order is -increasing

The -indices option will return indices instead of the sorted values themselves. So to print months in order of increasing rainfall

% column get -list $months [column sort -indices $rainfall]
→ Jan Apr Mar Dec Feb Nov Oct May Sep Jun Aug Jul

The command also supports the option -indirect where the operand is a column whose values are indices into the column whose values are to be sorted.

% set rain_indices [column series 12]
→ tarray_column int {0 1 2 3 4 5 6 7 8 9 10 11}
% column sort -indirect $rainfall $rain_indices
→ tarray_column int {0 3 2 11 1 10 9 4 8 5 7 6}

Notice the rain_indices column is sorted in order of the corresponding values in the rainfall column. In this illustrative example, the indices are ordered before sorting but that does not have to be the case.

One example of how this option is useful is illustrated later when we discuss sort stability.

4.6.2. Reversing

A simple reordering transform is reversing the order of elements. The column reverse and column vreverse commands can be used for this purpose.

% column reverse $months
→ tarray_column string {Dec Nov Oct Sep Aug Jul Jun May Apr Mar Feb Jan}

4.6.3. Shuffling

The column shuffle command returns the elements of a column in random order.

% set I [column series 9]
→ tarray_column int {0 1 2 3 4 5 6 7 8}
% column shuffle $I
→ tarray_column int {3 6 2 7 1 5 4 8 0}
% column shuffle $I
→ tarray_column int {6 3 8 7 5 0 1 2 4}

4.7. Arithmetic and logical operations

4.7.1. Comparing columns

The column identical and column equal commands compare two columns in their entirety. The difference is that the former returns 1 if the columns have the same type in addition to elements being equal and 0 otherwise. The latter applies a looser definition in that the columns may be of different types.

% set I [column create int {1 2 3}]
→ tarray_column int {1 2 3}
% set S [column create string {1 2 3}]
→ tarray_column string {1 2 3}
% column identical $I $S
→ 0
% column equal $I $S
→ 1

4.7.2. Arithmetic operations between columns

The column math command can be used to perform arithmetic and logical operations on columns on a per-element basic. The command takes multiple arguments each of which may be a column or a scalar numeric value. For example,

% set I [column create int {10 20 30}]
→ tarray_column int {10 20 30}
% set J [column create double {10.1 19.9 30}]
→ tarray_column double {10.1 19.9 30.0}
% column math + $I $J 1000
→ tarray_column double {1020.1 1039.9 1060.0}
% column math < $I $J
→ tarray_column boolean {1 0 0}

As a convenience, the above command can also be issued as

% column + $I $J 1000
→ tarray_column double {1020.1 1039.9 1060.0}
% column < $I $J
→ tarray_column boolean {1 0 0}

Logical operators work similarly but are generally used with boolean columns.

% set bita [column bitmap0 5 {0 4}]
→ tarray_column boolean {1 0 0 0 1}
% set bitb [column bitmap0 5 {2}]
→ tarray_column boolean {0 0 1 0 0}
% column || $bita $bitb
→ tarray_column boolean {1 0 1 0 1}

See the description of column math for all the available operators.

4.7.3. Arithmetic operations on a column

In contrast to arithmetic commands that operate on a per-element basis, some commands operate on all elements of a single entire column.

The column sum command sums all the elements in a column.

% column sum $J
→ 60.0

The column minmax command returns a pair containing the minimum and maximum values in a column.

% column minmax $J
→ 10.1 30.0

Note that this command is not restricted to numeric columns and will work for other types as well.

You can use the -indices option to get the indices of the minimum and maximum values instead of the values themselves.

% column minmax -indices $J
→ 0 2

4.7.4. Set operations

The column intersect3 command returns a list containing the differences between two columns, the first element being the intersection, the second containing elements of the first that are not present in the second, and the third containing elements from the second that are not present in the first.

% set hot [column search -all -gt $temperatures 25]
→ tarray_column int {3 4 5 6 8}
% set wet [column search -all -gt $rainfall 150]
→ tarray_column int {5 6 7}
% lassign [column intersect3 $hot $wet] hot_and_wet hot_and_dry cool_and_wet
% column get $months $hot_and_wet
→ tarray_column string {Jun Jul}
Tip An intersect3 operation on search results is very fast, of order O(n). The fact that an index column returned by a search is in sorted order is internally noted by the implementation and made use of by a subsequent intersect3.

4.8. Counting elements

The column size returns the number of elements in a column.

% column size $rainfall
→ 12

If you are only interested in the count for elements that match specific criteria, you can use the column count command instead. Thus

% column count -gt $rainfall 150
→ 3

returns the number of months with heavy rainfall.

4.9. Introspecting columns

The type of a column can be retrieved with the column type command.

% column type $months
→ string
% column type $rainfall
→ double

5. Operating on tables

Wherever it makes sense, commands that operate on columns have counterparts that operate on tables. For example, column fill and table fill, column range and table range, and so on. Certain commands, like math operations, do not have equivalents for obvious reasons. Conversely, tables have additional commands like join that are not applicable to columns.

Also like the column commands, table commands have variants that operate on values and variables respectively, e.g. table put and table vput. We will not always mention both variants here; see the reference pages instead for availability of a particular variant.

5.1. Creating tables

Tables may be created from Tcl lists or from existing columns.

5.1.1. Creating tables from list of row values

The table create command constructs a table from a list of rows each of which is also a list. The first argument to the command defines the names and types of each column.

% set countries [table create {
    Country string Population wide Area double
} {
    {China        1350000000       9330000}
    {Vatican      850              0.44}
}]
→ tarray_table {Country Population Area} {{tarray_column string {China Vatican}...

5.1.2. Creating tables from columns

The table create2 command constructs a table from a sequence of existing columns, all of which must be the same length. Since columns are already typed, only the column names have to be supplied.

% set monthly_temps [table create2 {
    Month Temperature
} [list $months $temperatures]]
→ tarray_table {
      Month Temperature
  } {{tarray_column string {Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}} {...
% set monthly_rainfall [table create2 {
    Month Rainfall
} [list $months $rainfall]]
→ tarray_table {
      Month Rainfall
  } {{tarray_column string {Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec}} {...

5.2. Operating on table columns

5.2.1. Retrieving table columns

The table column command returns a single column from a table. The table columns command returns a list containing one or more columns from a table.

% table column $countries Country
→ tarray_column string {China Vatican}
% table columns $countries {Country 1}  1
→ {tarray_column string {China Vatican}} {tarray_column wide {1350000000 850}}
% table columns $countries  2
→ {tarray_column string {China Vatican}} {tarray_column wide {1350000000 850}} ...
1 Columns may be specified by name or index
2 Defaults to returning all columns

5.2.2. Extracting a subtable

The table slice command returns a new table containing one or more columns from a table.

% table slice $countries {Country Area}
→ tarray_table {Country Area} {{tarray_column string {China Vatican}} {tarray_c...

5.2.3. Subsetting columns in commands

While closely paralleling the commands that store values in columns, commands storing rows in tables have an additional consideration that is not applicable to columns. A table row consists of an ordered sequence of values from multiple columns. When reading or storing data, it is often convenient to operate only on a subset of these, and moreover to retrieve or provide the values in a different order than the column order of the table.

Many table commands accept the -columns option for this purpose. Commands that retrieve data will only return the cells for those specified columns and in that specific order.

5.3. Storing rows

5.3.1. Storing a single row in multiple locations

The table fill and table vfill commands store a single row in one or more locations. Like column fill, the destination may be a single row, a range of rows or at specified row indices.

% table vfill countries {Russia 143967000 16.38e6} end+1 1
→ tarray_table {Country Population Area} {{tarray_column string {China Vatican ...
1 Note use of end+1 notation

5.3.2. Storing rows in arbitrary locations

The table place and table vplace commands store one or more source rows at one or more locations in the table in a specified order.

Assuming the Vatican City and China both experiences some population growth.

% table vplace -columns {Population} countries {
    {1000}
    {1370000000}
} {1 0}
→ tarray_table {Country Population Area} {{tarray_column string {China Vatican ...

Notice this example makes use of the -columns option to update only a single column. Also note the supplied indices need not be in order.

Note The columns option value and the row values are placed in braces though not really required here because they are actually lists of single elements. You would need braces if updating multiple columns.

5.3.3. Storing rows in contiguous locations

The table put and table vput commands store one or more source rows in a contiguous range of locations in the table.

% table vput -columns {Population Country Area} countries {
    {320000000 USA     9.16e6}
    { 82500000 Germany 349000}
} 1
→ tarray_table {Country Population Area} {{tarray_column string {China Vatican ...
1 Note use of the -columns option as we are providing row fields in a different order

Although the above examples supplied the rows in the form of a nested list, the command will also accept a table of the same shape as the destination.

5.3.4. Inserting one row in multiple locations

The table insert and table vinsert commands store a single row value a specified number of times starting at a given index, pushing existing rows further to the end of the table.

% table vinsert countries {Brazil 201000000 8358000} 2 1
→ tarray_table {Country Population Area} {{tarray_column string {China Vatican ...
1 Inserts at index 2, (just once because no count specified)

5.3.5. Inserting multiple rows

The table inject and table vinject commands insert multiple rows, passed as a list or a compatible table, at a specified index in a table.

% table vinject countries {
    {Ghana 28000000   227533}
    {India 1320000000 2973000}
} 0 1
→ tarray_table {Country Population Area} {{tarray_column string {Ghana India Ch...
1 Insert at beginning of table

5.4. Deleting rows

Rows in a table can be deleted with the table delete and table vdelete commands. Succeeding rows are moved up to occupy the deleted slots. The indices of the rows to be deleted may be specified as a single index, a range, a list of indices or a index column.

% set small_countries [column search -all -lt [table column $countries Population] \
    100000]
→ tarray_column int {3}
% table vdelete countries $small_countries
→ tarray_table {Country Population Area} {{tarray_column string {Ghana India Ch...

5.5. Retrieving rows

As for columns, there are multiple commands for retrieving rows from a table and again as for columns, for multiple row retrieval they all support the -list or -dict option return the rows as a list or dictionary instead of the default table format.

5.5.1. Retrieving a single row

In the simplest cast, the table index command can be used to retrieve a single element.

% table index $countries end
→ Germany 82500000 349000.0

5.6. Retrieving all rows

The table rows command returns all rows in a table as a list.

% table rows $countries
→ {Ghana 28000000 227533.0} {India 1320000000 2973000.0} {China 1370000000 9330...

5.6.1. Retrieving a range of rows

The command table range returns multiple contiguously located rows.

% table range $countries 0 2
→ tarray_table {Country Population Area} {{tarray_column string {Ghana India Ch...
% table range -list $countries 0 2
→ {Ghana 28000000 227533.0} {India 1320000000 2973000.0} {China 1370000000 9330...
% table range -dict $countries 0 2
→ 0 {Ghana 28000000 227533.0} 1 {India 1320000000 2973000.0} 2 {China 137000000...

5.6.2. Retrieving an arbitrary list of rows

The get command retrieves rows at noncontiguous indices specified as a Tcl list or column:

% table get -columns {Country Population} $countries [column search -all -lt \
    [table column $countries Area] 1000000]
→ tarray_table {Country Population} {{tarray_column string {Ghana Germany}} {ta...

Note the use of the -columns option to retrieve only a subset of the table columns.

5.6.3. Comparing table

The table identical and table equal commands compare two tables in their entirety. The difference is that the former returns 1 if the tables have the same column names and the column identical command returns true for every corresponding pair of columns, and 0 otherwise. The latter applies a looser definition in that the column equal is used to compare columns and the column names may differ.

5.7. Searching tables

To search tables, use the search on individual columns. For example, the fragment below returns names of countries that are populous and large in area. Note how the outer search is limited to specific indices using the -among option.

% set pop_col  [table column $countries Population]
→ tarray_column wide {28000000 1320000000 1370000000 201000000 143967000 320000...
% set area_col [table column $countries Area]
→ tarray_column double {227533.0 2973000.0 9330000.0 8358000.0 16380000.0 91600...
% table get -list -columns {Country} $countries [column search -all -among [column \
    search -all -gt $pop_col 250000000] -gt $area_col 5e6]
→ China USA
Tip

Note again that for more complex queries, it is more convenient to use the Xtal extension instead of some combination of search and intersect3.

% xtal::xtal {
    countries.Country[countries.Population > 250000000 &&  \
                      countries.Area > 5000000]
}
→ tarray_column string {China USA}

5.8. Ordering rows

5.8.1. Sorting tables

Tables are sorted using the table sort command. This is a thin wrapper around the column sort command and thus takes many of the same options.

% print [table sort -columns {Country Population} -nocase $countries Country]
→ +-------+----------+
  |Country|Population|
  +-------+----------+
  |Brazil | 201000000|
  +-------+----------+
  |China  |1370000000|
  +-------+----------+
  |Germany|  82500000|
...Additional lines omitted...

The above prints a table slice sorted based on country name.

5.8.2. Nested sorts and sort stability

When sorting tables, for display purposes for example, it is often necessary to display elements that have the same value in the sort column in the same order that they were previously displayed. Although, individual column sorts are stable, this is not enough when sorting across multiple columns. In such cases, the -indirect option to the sort command provides a solution. Using this option allows sorting where the "initial" ordering of elements is different from the actual order of elements in the column. An example will clarify this.

Consider a table that stores heights and weights.

% set tab [table create {name string height int weight int} {
    {Jeff 180 80}
    {John 175 80}
    {Jim 170 75} }]
→ tarray_table {name height weight} {{tarray_column string {Jeff John Jim}} {ta...

The user may choose to sort the table by height which boils down to the following code:

% table get -list $tab [column sort -indices [table column $tab height]]
→ {Jim 170 75} {John 175 80} {Jeff 180 80}

This results in the table being displayed in the order Jim, John, Jeff. The user may then choose to sort by weight.

% table get -list $tab [column sort -indices [table column $tab weight]]
→ {Jim 170 75} {Jeff 180 80} {John 175 80}

resulting in a display in order Jim, Jeff, John. Since they actually have the same value in the new sort column, this interchange of positions between Jeff and John is disconcerting to the user. Use of the -indirect option overcomes this problem.

% set indices [column sort -indices [table column $tab height]]
→ tarray_column int {2 1 0}
% table get -list $tab $indices
→ {Jim 170 75} {John 175 80} {Jeff 180 80}

Now use previous order of indices to order elements when their values in the weight column are equal

% table get -list $tab [column sort -indirect [table column $tab weight] $indices]
→ {Jim 170 75} {John 175 80} {Jeff 180 80}

In this last statement, the sort is done indirectly using values from table but the positioning of elements when these values compare equal is based on the order in the original table.

5.8.3. Reversing tables

The order of rows in a table can be reversed with the table reverse or table vreverse commands.

% print [table reverse $countries]
→ +-------+----------+----------+
  |Country|Population|      Area|
  +-------+----------+----------+
  |Germany|  82500000|  349000.0|
  +-------+----------+----------+
...Additional lines omitted...

5.9. Joining tables

The table join command implements basic database-type join functionality between tables.

% print [table join $monthly_rainfall $monthly_temps]
→ +-----+--------+--------+-----------+
  |Month|Rainfall|Month_t1|Temperature|
  +-----+--------+--------+-----------+
  |Apr  |    14.7|Apr     |         28|
  +-----+--------+--------+-----------+
  |Aug  |   205.8|Aug     |         25|
...Additional lines omitted...

By default, the join is done on the first column name that is common between the tables. The -on option can be used to change that. The result table includes all columns with duplicate columns being renamed as shown above. The -t0cols and -t1cols options can be used to control which columns are included from each table.

% print [table join -t1cols Temperature $monthly_rainfall $monthly_temps]
→ +-----+--------+-----------+
  |Month|Rainfall|Temperature|
  +-----+--------+-----------+
  |Apr  |    14.7|         28|
  +-----+--------+-----------+
  |Aug  |   205.8|         25|
...Additional lines omitted...

5.10. Counting rows

The table size command returns the number of rows in a table.

% table size $countries
→ 7

5.11. Introspecting tables

Several commands return information about the definition of a table.

The table cnames command returns a list containing the names of the columns in a table.

% table cnames $countries
→ Country Population Area

The table ctype command returns the type of a column.

% table ctype $countries Area
→ double

If the full table definition is desired, it can be retrieved with table definition. The returned string is in a form that can be used with table create.

% table definition $countries
→ Country string Population wide Area double

6. Looping

The loop command iterates over a column or table executing a script for every element or row.

% loop row $countries {
    puts $row
}
→ Ghana 28000000 227533.0
  India 1320000000 2973000.0
  China 1370000000 9330000.0
...Additional lines omitted...

A variation includes the element or row index as an iteration variable.

% loop rowindex row $countries {
    puts "Row $rowindex: $row"
}
→ Row 0: Ghana 28000000 227533.0
  Row 1: India 1320000000 2973000.0
  Row 2: China 1370000000 9330000.0
...Additional lines omitted...

The command can be used within a coroutine context to yield values.

proc loopy col {
    yield;
    tarray::loop elem $col {
        yield $elem
    }
}
coroutine next_month loopy $months

We can next retrieve each element using the iterator.

% next_month
→ Jan
% next_month
→ Feb

7. Grouping operations

Several commands are targeted towards making grouping of data and summarization of columns and tables easier and more convenient.

7.1. Computing histograms

For example, suppose we want to classify monthly rainfall into four categories — scanty, light, moderate and heavy — and find how many months fall into each category. We can use the column histogram command for the purpose.

% print $rainfall stdout -full 1
→ 11.0, 23.3, 18.4, 14.7, 70.3, 180.5, 210.2, 205.8, 126.4, 64.9, 33.1, 19.2
% print [column histogram -cnames {Rainfall NumMonths} $rainfall 4]
→ +------------------+---------+
  |          Rainfall|NumMonths|
  +------------------+---------+
  |              11.0|        6|
  +------------------+---------+
  |              60.8|        2|
  +------------------+---------+
  |             110.6|        1|
  +------------------+---------+
  |160.39999999999998|        3|
  +------------------+---------+

The above divies up the range of values in the column into 4 equal intervals and counts the number of values that falls in each. The result is a table of two columns, the first being the lower bound of each interval and the second being the number of values that fall into that interval.

The above looks even more crude than usual, so we can use the -min and -max options to control the limits.

% print [column histogram -cnames {Rainfall NumMonths} -min 0 -max 240 $rainfall \
    4]
→ +--------+---------+
  |Rainfall|NumMonths|
  +--------+---------+
  |     0.0|        6|
  +--------+---------+
  |    60.0|        2|
  +--------+---------+
  |   120.0|        1|
  +--------+---------+
  |   180.0|        3|
  +--------+---------+

Note that when either option is used, any values that do not fall within those limits are ignored.

By default the command applies the counting function to each grouping as seen above. For other possibilities, refer to the command documentation.

7.2. Grouping into categories

Histograms are a special case of grouping that applies to numerical columns where values are grouped by range. A more general form is the column categorize command.

Suppose we want to categorize the months by their first letters and print how many months start with each letter (why? To which I say, why not?).

proc first_letter {elem_index elem_value} {
    return [string toupper [string index $elem_value 0]]
}
print [column categorize -cnames {Letter Months} -values -categorizer \
    first_letter $months]
→ +------+----------------------------------+
  |Letter|Months                            |
  +------+----------------------------------+
  |J     |tarray_column string {Jan Jun Jul}|
  +------+----------------------------------+
  |F     |tarray_column string {Feb}        |
  +------+----------------------------------+
  |M     |tarray_column string {Mar May}    |
  +------+----------------------------------+
  |A     |tarray_column string {Apr Aug}    |
  +------+----------------------------------+
  |S     |tarray_column string {Sep}        |
  +------+----------------------------------+
  |O     |tarray_column string {Oct}        |
  +------+----------------------------------+
  |N     |tarray_column string {Nov}        |
  +------+----------------------------------+
  |D     |tarray_column string {Dec}        |
  +------+----------------------------------+

Notice that the associated value for each category is itself a column containing the actual values. The column summarize command discussed in the next section provides a mean to further process this by aggregating the nested columns elements.

Note Although we do not show an example here, by default the categorize command returns the indices of the elements belonging to that category, not their values. Hence the use of the -values option.

Although slower than the histogram command, the categorize command is more general and is useful for numerical values as well, for example if we do not want the intervals to be equal. The code fragment below classifies rainfall in a month as scant, moderate or heavy in a similar fashion to our earlier example except that here the intervals are not the same size.

proc rain_class {elem_index elem_value} {
    if {$elem_value < 50} {
        return scant
    } elseif {$elem_value < 200} {
        return moderate
    } else {
        return heavy
    }
}
print [column categorize -cnames {Category Rainfall} -categorizer rain_class \
    -values $rainfall]
→ +--------+----------------------------------------------------+
  |Category|Rainfall                                            |
  +--------+----------------------------------------------------+
  |scant   |tarray_column double {11.0 23.3 18.4 14.7 33.1 19.2}|
  +--------+----------------------------------------------------+
  |moderate|tarray_column double {70.3 180.5 126.4 64.9}        |
  +--------+----------------------------------------------------+
  |heavy   |tarray_column double {210.2 205.8}                  |
  +--------+----------------------------------------------------+

7.3. Summarizing categorized data

Placing values into categories is only the first step. The next step is to process the categorized data to provide some useful information. The column summarize command is intended for this purpose. It takes a column of the form returned by the categorize (or histogram with appropriate options) and aggregates the values associated with each category with some prescribed function.

For example, to print the number of months in each rainfall category as we did in our previous example, we could have also taken this route.

set rain_by_category [column categorize -cnames {Category Rainfall} -categorizer \
    rain_class -values $rainfall] 1
print [table summarize -cname NumMonths $rain_by_category]
→ +--------+---------+
  |Category|NumMonths|
  +--------+---------+
  |scant   |        6|
  +--------+---------+
  |moderate|        4|
  +--------+---------+
  |heavy   |        2|
  +--------+---------+
1 Collect values in each category

The above takes an additional step compared to what we saw earlier, but once we have categorized the original data, it makes it easier to summarize using a different, even custom, aggregation function. For example, to print the average,

proc average {cat_index cat_col} {
    set n [column size $cat_col]
    if {$n == 0} {return "-"}
    return [expr {[column sum $cat_col] / $n}]
}
print [table summarize -summarizer average -cname AveRainfall $rain_by_category]
→ +--------+-----------+
  |Category|AveRainfall|
  +--------+-----------+
  |scant   |19.95      |
  +--------+-----------+
  |moderate|110.525    |
  +--------+-----------+
  |heavy   |208.0      |
  +--------+-----------+

8. Formatting output

The Tcl puts command is not always suitable for printing the values for several reasons. The output is not formatted and hence difficult to read as you have probably noted from some of the previous output. The print command provides an alternative that outputs a more readable format.

% print [table column $countries Population] -head 1 -tail 1
→ 28000000, ..., 82500000
% print $countries
→ +-------+----------+----------+
  |Country|Population|      Area|
  +-------+----------+----------+
  |Ghana  |  28000000|  227533.0|
  +-------+----------+----------+
  |India  |1320000000| 2973000.0|
  +-------+----------+----------+
  |China  |1370000000| 9330000.0|
  +-------+----------+----------+
  |Brazil | 201000000| 8358000.0|
  +-------+----------+----------+
  |Russia | 143967000|16380000.0|
  +-------+----------+----------+
  |USA    | 320000000| 9160000.0|
  +-------+----------+----------+
  |Germany|  82500000|  349000.0|
  +-------+----------+----------+

By default the command only prints the first few and last few elements although this can be controlled by various options.

The prettify command is similar except that it returns the formatted string instead of printing it to a channel.

Another command that helps with formatting data is the column width command. This returns the maximum number of characters required to display elements of the column in a given format.

% set col [column create int {-1000 10 100}]
→ tarray_column int {-1000 10 100}
% column width $col 1
→ 5
% column width $col %x 2
→ 8
% column width $col "The number is %d." 3
→ 20
1 Use default format string
2 Specify the format string
3 A generalized format string

9. Export and import

Tables can be exported to, and imported from, CSV format with table csvexport and table csvimport respectively.

10. Tk widgets

The tarray_ui package provides several widgets for displaying and plotting column and table data. See its documentation.

11. Random numbers

We saw earlier the construction of columns containing randomly generated values. Random numbers can also be directly generated using the tarray::rng command. This returns a command object which can then be invoked to generate values of the appropriate type.

% rng create myrng int 1
→ ::myrng
% myrng get 2
→ 1201747473
% myrng get 5 3
→ 925023707 -1952559933 56734242 -1520506914 -165810285
% myrng destroy 4
1 Create a random number generator for integers
2 Gets a random integer
3 Get 5 random integers
4 Release resources when no longer needed

The rng new command behaves similarly except a command name need not be specified. Moreover, both forms take an optional argument to generate random numbers within a specified range.

% set r [rng new double 1.0 2.0] 1
→ ::tarray::rng1
% $r get 3
→ 1.8521092870291302 1.7439993025458036 1.5926592348211674
% $r destroy
1 Generate real numbers between 1.0 and 2.0

For testing purposes, the generator can be initialized with a specific seed to reproduce the same sequence everytime.

% rng create myrng byte 10 20
→ ::myrng
% myrng seed 1000 2000 1
% myrng get 5
→ 14 11 17 19 18
% myrng get 5
→ 19 13 15 17 17
% myrng seed 1000 2000 2
% myrng get 5
→ 14 11 17 19 18
% myrng get 5
→ 19 13 15 17 17
% myrng destroy
1 Any 64-bit seed values
2 Reset seeds

As seen, the same sequence is reproduced by initializing the seed with specific values.

12. Usage Hints

Specifying multiple indices

When specifying indices to commands, Tcl lists of integers and columns of type int are usually interchangeable. Similarly, when passing multiple values to a command, either a Tcl list or a column of the appropriate type can be used. Note there is an ambiguity in the specific case where the target of the command is a column of type any and the passed operand is a string of the form tarray any {…​} where the operand can be interpreted either as a column or a Tcl list with three elements tarray, any and the {…​}. In this case the operand gets interpreted as a column.

Copy-on-write efficiency

Given that the tarray extension is meant for dealing with large amounts of data, it is useful to keep in mind Tcl’s object reference counting and copy-on-write implementation. Modifying a typed array that is shared will result in a copy being made, which can be expensive if the array is large. So, to modify a variable that contains a typed array, the command

% column vput I [list 100 200]
→ tarray_column int {10 20 30 100 200}

is far more efficient than

% set I [column put $I [list 100 200]]
→ tarray_column int {10 20 30 100 200 100 200}

assuming the value in is not itself shared. This is similar to use of Tcl’s lset command to modify lists.

Memory efficiency

As arrays get large tarray prioritizes memory usage over efficiency. As arrays grow, the additional extra memory is conservatively allocated (unlike Tcl which aggressively allocates extra memory). If the size of a typed array can be estimated in advance, for example, reading records from a database, the memory can be preallocated.

% column create int {} 1000000
→ tarray_column int {}

preallocates space for a million elements.

Typed arrays are by design implemented as consecutive elements in contiguous memory. Certain operations, such as insertion and deletion, will not be efficient when arrays get very large. For applications where such operations are common, other structures should be built on top using typed arrays as the lower level building blocks. Such higher level structures can be scripted and customized for specific usage patterns easily as they can be implemented at the script level using the low level typed array operations for efficiency. Whether this is required or not should be determined based on application benchmarks.

List and column differences

Both lists and columns have some differences in terms of functionality. Columns do not have the -stride option but the same functionality can be implemented through tables. List indexing offers nesting while although columns can be nested, the nested columns have to be explicitly accessed. On the other hand, columns offer some additional functions such as intersect3 and indexing operations (eg. extraction or storing of multiple elements through index lists).

Considerations for columns of type any

Columns of type any are stored as Tcl_Obj objects internally and thus are very similar to Tcl lists. Any advantage of an any typed array over using a simple Tcl list in terms of the memory footprint comes only from conservative memory overallocation, not from reduced memory size of individual elements. It is therefore not as big a benefit as for other types. Thus columns of type any are mostly beneficial when used in conjunction with other column types, for example in a table.

Note that columns of type string are more efficient than type any for storing small strings.

Nesting typed arrays

The type any can be any Tcl value, including typed arrays. Typed arrays can therefore be nested (tables are currently implemented as nested columns). However, unlike some of the Tcl list commands, tarray does not have commands that implicitly support nesting. Nested typed arrays have to be explicitly accessed as such.

column index [column index $outer_column 4] 0
Sort optimization

The package internally keeps track of the sorting state of a column. A column is internally marked after certain operations where the result is known to be sorted. An obvious example is the column sort command. A less obvious case is an index column returned from certain search operations. Several commands make use of this for more efficient operation. For example, the column intersect3 command is much faster when columns are known to be sorted. Thus finding the intersection of two index columns resulting from searches is an O(n) operation.