Copyright © 2020 Ashok P. Nadkarni. All rights reserved.
1. Introduction
The tarray extension implements typed arrays and associated commands column and table. This page provides reference documentation for commands related to typed columns. See the main contents for guides and other reference documentation.
1.1. Installation and loading
Binary packages for some platforms are available from the Sourceforge download area. See the build instructions for other platforms.
To install the extension, extract the files from the distribution to any
directory that is included in your Tcl installation’s auto_path
variable.
Once installed, the extension can be loaded with the standard Tcl package require command.
% package require tarray
→ 1.0.0
% namespace import tarray::column
1.2. Columns
A typed array column contains elements of a single type, such as int or string, that is specified when it is created. The command tarray::column provides operations on typed columns, including searching and sorting.
Related to columns are tables, which are ordered sequences of typed columns.
1.3. Types
All elements in a column must be of the type specified when the column is created. The following element types are available:
Keyword | Type |
---|---|
any | Any Tcl value |
string | A string value |
boolean | A boolean value |
byte | Unsigned 8-bit integer |
double | Floating point value |
int | Signed 32-bit integer |
uint | Unsigned 32-bit integer |
wide | Signed 64-bit integer |
The primary purpose of the type is to specify what values can be stored in that column. This impacts the compactness of the internal storage (really the primary purpose of the extension) as well as certain operations (like sort or search) invoked on the column.
The types any and string are similar in that they can hold any Tcl value. Both are treated as string values for purposes of comparisons and operators. The difference is that the former stores the value using the Tcl internal representation while the latter stores it as a string. The advantage of the former is that internal structure, like a dictionary, is preserved. The advantage of the latter is a significantly more compact representation, particularly for smaller strings.
Attempts to store values in a column that are not valid for that column will result in an error being generated.
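As an illustration of the type rule above, the following sketch creates an int column and then attempts to create one containing a non-numeric element. It assumes the tarray package is installed; the variable names are illustrative only, and the error message text is not reproduced here because it may differ between versions.

```tcl
package require tarray

# A column of type int accepts integer values.
set ints [tarray::column create int {1 2 3}]

# A value that is not valid for the column type is expected to raise an
# error, so the call is wrapped in catch.
set failed [catch {tarray::column create int {1 2 notanumber}} msg]
puts "error raised: $failed"
```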
1.4. Indices
An index into a typed column or table can be specified as either an integer or the keyword end. As in Tcl’s list commands, end specifies the index of the last element in the tarray or the index after it, depending on the command. Simple arithmetic adding of offsets to end is supported, for example end-2 or end+5.
Many commands also allow multiple indices to be specified. These may take one of two forms: a range which includes all indices between a lower and an upper bound (inclusive), and an index list which may be one of the following:
- a Tcl list of integers
- a column of any type other than boolean. The value of each element of the column is converted to an integer that is treated as an index.
- a column of type boolean. Here the index of each bit in the boolean column that is set to 1 is treated as an index.
Note that the keyword end can be used to specify a single index or as a range bound, but cannot be used in an index list.
When indices are specified that cause a column or table to be extended, they must include all indices beyond the current column or table size in any order but without any gaps. For example,
% set I [column series 5]
→ tarray_column int {0 1 2 3 4}
% column place $I {106 105 107 104} {6 5 7 4}
→ tarray_column int {0 1 2 3 104 105 106 107}
% column place $I {106 107} {6 7}
Ø tarray index 6 out of bounds.
The first column place call succeeds: the indices are not in order but have no gaps. The second call raises an error because no value is specified for the non-existing index 5.
2. Command reference
All commands are located in the tarray namespace.
2.1. Standard Options
Commands returning values from columns support the standard options shown in Standard options.
2.2. Commands
column bitmap0 COUNT ?INDICES?
Returns a new boolean column of size COUNT with all elements set to 0. If argument INDICES is specified, the elements at those positions are set to 1.
column bitmap1 COUNT ?INDICES?
Returns a new boolean column of size COUNT with all elements set to 1. If argument INDICES is specified, the elements at those positions are set to 0.
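A short sketch of the two bitmap constructors, assuming the tarray package is installed. The variable names are illustrative.

```tcl
package require tarray

# Five bits, all 0 except positions 1 and 3.
set mask [tarray::column bitmap0 5 {1 3}]
puts $mask

# Five bits, all 1 except positions 1 and 3.
set inverse [tarray::column bitmap1 5 {1 3}]
puts $inverse
```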
column cast COLTYPE COLUMN
Returns a new column of type COLTYPE containing elements of COLUMN cast to COLTYPE. This differs from the use of column create in that it will not raise an error if any element value in COLUMN is too large to fit into a column of type COLTYPE or if the value contains a non-zero fractional component and COLTYPE is one of the integral types. In the former case, if COLUMN is of an integer type, the higher order bits are discarded while if it is of type double, the cast value is undefined. In the latter case, the fractional component of the element value is discarded, and only its integer component is stored in the new column.
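The fractional-component rule can be illustrated with a minimal sketch, assuming the tarray package is installed. The values and variable names are illustrative.

```tcl
package require tarray

set doubles [tarray::column create double {1.9 2.5 3.0}]

# Casting to an integral type discards the fractional component
# instead of raising an error.
set ints [tarray::column cast int $doubles]
puts $ints
```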
column categorize ?options? COLUMN
The command first places the elements of the column COLUMN into categories. By default, these are keyed by the value of the element. Alternatively, the -categorizer CMDPREFIX option may be specified, in which case CMDPREFIX is called for every element of COLUMN. Each invocation has two additional arguments appended: the index of the element being passed and its value. The element is then placed into the category identified by the returned value from the invocation. If CMDPREFIX completes with a break control code, no further elements are processed. If it completes with a continue return code, that particular iteration is ignored and not included in the result.
The command returns a table with two columns, the first of which contains categories constructed from the unique values in COLUMN, or the returned values from CMDPREFIX if the -categorizer option was specified. The second column is of type any and of the same size as the first. Each element of this column is itself a column containing either the indices of the elements belonging to the corresponding category (by default or if the -indices option is specified), or the element values themselves (if the -values option is specified).
These columns are named Category and Data by default. The -cnames option can be used to change these names, the option’s value being a pair containing the names to be used for the two columns.
By default the Category column is of type any if the -categorizer option is specified, and the same type as COLUMN otherwise. The -categorytype TYPE option may be specified to force it to be a specific type. Of course the values used for the category column must be compatible with this type.
See Grouping into categories for an example.
column count ?-range RANGE? ?-among INDICES? ?-not? ?-nocase? ?OPER? COLUMN VALUE
Counts the number of matches for a value in the elements in a column.
See the column search command for a description of the various options.
Note that if the -among option is specified and an index occurs multiple times in INDICES, it will be counted multiple times.
column create TYPE ?INITIALIZER? ?INITSIZE?
Returns a typed array of the type TYPE, which must be one of the valid types described in Types. If INITIALIZER is specified, it is the initial content of the typed array and can be either a column of any compatible type or a list containing elements of the appropriate type. For performance reasons, the INITSIZE argument may be specified to indicate how many slots to preallocate for the array. This is only treated as a hint and the actual size allocated may differ.
column delete COLUMN LOW HIGH
Returns a typed column with all specified elements deleted. Indices are specified in any of the forms described in Indices and may contain duplicates. Out-of-range indices are ignored.
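A minimal sketch of the LOW HIGH form, assuming the tarray package is installed. The original column value is not modified; a new column is returned.

```tcl
package require tarray

set col [tarray::column series 5]             ;# int column {0 1 2 3 4}

# Delete the elements at indices 1 through 2 (inclusive).
set trimmed [tarray::column delete $col 1 2]
puts $trimmed
```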
column equal COLA COLB
Returns 1 if the specified columns have the same number of elements and corresponding elements of the two columns are equal. If the column types are not the same, comparison is done by converting numeric elements to strings if either column is non-numeric, conversion to doubles if either column is of type double, and conversion to wide integers otherwise. Note that this means, for example, that when comparing a column of type int to one of type any or string, the value 16 will not equate to the string 0x10.
The command will raise an error if either argument is not a column.
Also see the related command column identical which applies a stricter definition of equality.
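The difference between the two notions of equality can be sketched as follows, assuming the tarray package is installed. The column contents are illustrative.

```tcl
package require tarray

set a [tarray::column create int  {1 2 3}]
set b [tarray::column create wide {1 2 3}]

# Same values, different column types.
set looseResult  [tarray::column equal $a $b]
set strictResult [tarray::column identical $a $b]
puts "equal: $looseResult, identical: $strictResult"
```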
column fill COLUMN VALUE LOW HIGH
Returns a typed column with specified indices set to VALUE. Indices are specified in any of the forms described in Indices and must follow the rules described there. The index keyword end refers to the current last occupied position in the column so to append values the index should be specified as end+1. The size of the array will be extended if necessary provided the specified indices do not have gaps beyond the current column size.
column get ?OPTIONS? COLUMN INDEXLIST
Returns the values from a typed column at the indices specified as index list. Any of the Standard options may be specified with this command.
column histogram ?options? COLUMN NBUCKETS
The command divides the target range of values into NBUCKETS intervals of equal size (except for possibly the last in case of value range overflow). The command places the values of the column, which must be of numeric type, into these NBUCKETS buckets.
If no options are specified, the first target range has a lower bound that is the minimum value in the column. The size of each bucket is the minimum size required so that the maximum value is included in a bucket.
If the -min option is specified the associated value is used as the lower bound of the range and first bucket. If there happen to be any values in the column smaller than this, they are ignored in the returned result. Similarly, if the -max option is specified, any values greater than the associated option value are ignored.
If the column is empty, both -min and -max values must be specified; otherwise the command will raise an error.
The command computes a bucket result for each bucket. By default, or if the -count option is specified, this bucket result is the number of values falling into that bucket. If the -sum option is specified, each bucket result is the sum of all values falling into that bucket. If the -indices option is specified, each bucket result is an index column containing the indices of the elements whose values fall into that bucket. Finally, if the -values option is specified, each bucket result is a column, of the same type as COLUMN, containing the actual values that fell into that bucket.
The command returns a table with two columns, the first of which contains the lower bound of each interval bucket. The second contains the corresponding computed bucket result for each bucket. These columns are named LowerBound and Data by default. The -cnames option can be used to change these names, the option’s value being a pair containing the names to be used for the two columns.
See Computing histograms for an example.
Note: for columns of type wide, the command will raise an error if the difference between the minimum and maximum covers the entire domain range of wides [-9223372036854775808, 9223372036854775807] and NBUCKETS is 1.
column identical COLA COLB
Returns 1 if both columns are of the same type, have the same number of elements and corresponding elements of the two columns are equal.
The command will raise an error if either argument is not a column.
Also see the related command column equal
which applies a looser definition of equality.
column index COLUMN INDEX
Returns the value of the element at the position specified by INDEX which is a single index.
column inject COLUMN VALUES FIRST
Inserts VALUES, a list of values or a column of the same type as COLUMN, at the position FIRST and returns the resulting column. If FIRST is end, the values are appended to the column. In all cases, the command may extend the array if necessary.
column insert COLUMN VALUE FIRST ?COUNT?
Inserts COUNT (default 1) elements with value VALUE at position FIRST and returns the new column. In all cases, the command may extend the array if necessary.
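The following sketch contrasts column insert (one value repeated COUNT times) with column inject (a list of values). It assumes the tarray package is installed; variable names are illustrative.

```tcl
package require tarray

set col [tarray::column series 3]                   ;# int column {0 1 2}

# Insert two elements with value 9 at position 1.
set inserted [tarray::column insert $col 9 1 2]

# Inject the values 7 and 8 at position 1.
set injected [tarray::column inject $col {7 8} 1]

puts $inserted
puts $injected
```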
column intersect3 ?-nocase? COLUMNA COLUMNB
Returns a list of three columns, the first containing elements common to both COLUMNA and COLUMNB, the second containing elements only present in COLUMNA and the third containing elements only present in COLUMNB. Both columns must be of the same type. The elements in each returned column are in arbitrary order.
The columns may contain duplicate elements. These are treated as distinct so for example if COLUMNA contains 5 elements with value A, and COLUMNB contains only 3 such elements, then the first column in the result will contain three A elements and the second column will contain two.
Option -nocase only has effect if the column type is any or string. If specified, elements are compared in case-insensitive mode.
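A minimal sketch of column intersect3 on two int columns without duplicates, assuming the tarray package is installed. Because the order of elements in each returned column is arbitrary, the results are sorted before printing.

```tcl
package require tarray

set a [tarray::column create int {1 2 3 4}]
set b [tarray::column create int {3 4 5}]

# The result is a list of three columns: common, only in a, only in b.
lassign [tarray::column intersect3 $a $b] both onlyA onlyB

set both  [tarray::column sort $both]
set onlyA [tarray::column sort $onlyA]
set onlyB [tarray::column sort $onlyB]
puts "$both | $onlyA | $onlyB"
```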
column linspace START STOP COUNT ?-type TYPE? ?-open BOOL?
Returns a column containing COUNT values evenly spaced between START and STOP. STOP may be less than START in which case returned values are in descending order. The -type option specifies the column type and defaults to double. If the -open option is specified as true, the interval is open and STOP is not included in the returned values. The default is false.
Note that the returned column always contains COUNT elements. For integral types, this means some values may be repeated if the difference between the interval ends is less than COUNT. Moreover, the values may not be exactly evenly spaced in the case that the interval cannot be divided into COUNT integral divisions.
column logspace START STOP COUNT ?-type TYPE? ?-open BOOL? ?-base BASE?
Returns a column containing COUNT values evenly spaced on a log scale between BASE**START and BASE**STOP. If unspecified, BASE defaults to 10. STOP may be less than START in which case returned values are in descending order. The -type option specifies the column type and defaults to double. If the -open option is specified as true, the interval is open and STOP is not included in the returned values. The default is false.
column lookup COLUMN ?LOOKUPKEY?
The command returns the index of an element in COLUMN that exactly matches LOOKUPKEY or -1 if not found. If LOOKUPKEY is not specified, the command builds an internal dictionary (see below) and the return value is an empty string.
COLUMN must be a column of type string. Unlike the column search command, the returned index is not necessarily that of the first occurrence in cases where LOOKUPKEY occurs multiple times in the column.
The command is usually much faster than column search because it is based on an internal dictionary that maps string values to their position indices in the column. This internal dictionary is either created when the command is called without the optional LOOKUPKEY argument, or is built in incremental fashion with each column lookup call.
In the current implementation, this dictionary is maintained in a loose or lazy manner and internally does not always reflect the actual content of the column. However, the return value of the command is always accurate.
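A small sketch of column lookup on a string column with unique values, assuming the tarray package is installed. The fruit names are illustrative data.

```tcl
package require tarray

set fruits [tarray::column create string {apple banana cherry}]

# Optional: build the internal dictionary up front.
tarray::column lookup $fruits

set found   [tarray::column lookup $fruits banana]
set missing [tarray::column lookup $fruits durian]
puts "banana at $found, durian at $missing"
```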
column math OPERATION OPERAND ?OPERAND…?
Performs the specified mathematical operation OPERATION on the given operands. The possible operations are shown in Column math operators below.
The operands may be any combination of scalar numerical values and columns of appropriate types shown in the table. If multiple columns are specified, they may be of differing types. All columns must have the same number of elements.
If every operand is a scalar, the return value is also a scalar numerical value computed in similar (but not identical) fashion to the Tcl expr command.
For arithmetic operations, if at least one operand is a column, the return value is a column whose type depends on the type of the "widest" operand. For example, if any column or scalar is a double, the resulting column will be of type double. For this purpose, the type double is considered wider than type wide. The value of each element of the result column is computed by invoking the specified operation on the corresponding elements of the operand columns. Any scalar operands specified are treated as columns of the appropriate type and size all of whose elements are equal to that scalar value. For arithmetic operations, elements of boolean columns are treated as having integer values 0 and 1.
If the result type is double, all computation is done by converting each operand (or element of an operand) to a double. Otherwise all computation is done using 64-bit integers and converted back to the result type. Columns of type any and string are not allowed for arithmetic operations.
For logical operations like && and comparisons like ==, the returned column is always boolean. Columns of type any and string are not allowed for logical operations.
For relational operations, columns of any type are allowed and are type promoted for comparisons as for arithmetic operations with the difference that any non-numeric operand will result in string based comparisons.
The above operations may also be invoked directly as column + … instead of column math + … .
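The element-wise behavior and the promotion to the widest operand type can be sketched as follows, assuming the tarray package is installed. Only the + operation named in the text is used; the variable names are illustrative.

```tcl
package require tarray

set a [tarray::column create int {1 2 3}]
set b [tarray::column create int {10 20 30}]

# Column + column: corresponding elements are added.
set total [tarray::column math + $a $b]

# Column + double scalar: the scalar is applied to every element and
# the result is promoted to a double column.
set shifted [tarray::column math + $a 0.5]

puts $total
puts $shifted
```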
column minmax ?OPTIONS? COLUMN
Searches the specified column for the minimum and maximum values, returning them as a pair. If -indices is specified, their indices are returned instead of their values. In case either value occurs at multiple indices in the column, the lowest index is returned.
The option -range can be specified to limit the search to a subrange of the column. It takes a pair of indices, in one of the forms described in Indices, that inclusively specify the subrange. The second element of the pair may be omitted in which case it defaults to the last element in the column.
The option -nocase may be specified to indicate case-insensitive comparisons. This is only effective if the column type is any or string and ignored for the others.
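A brief sketch of column minmax with and without -indices, assuming the tarray package is installed. The minimum value 1 deliberately occurs twice to show that the lowest index is reported.

```tcl
package require tarray

set col [tarray::column create int {4 1 9 1}]

# Pair of minimum and maximum values.
lassign [tarray::column minmax $col] minValue maxValue

# Pair of indices at which those values occur.
lassign [tarray::column minmax -indices $col] minIndex maxIndex

puts "min $minValue at $minIndex, max $maxValue at $maxIndex"
```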
column ones COUNT ?TYPE?
Returns a column of size COUNT with all elements initialized to 1. TYPE defaults to int.
column place COLUMN VALUES INDICES
Returns a typed column with the specified values at the corresponding indices. VALUES may be a list of values or a column of the same type. The number of values in VALUES must not be less than the number of indices specified in INDICES. INDICES must be an index list in one of the forms described in Indices and may extend the column if the conditions listed there are satisfied.
column put COLUMN VALUES ?FIRST?
Returns a typed column with the elements starting at index FIRST replaced by the corresponding elements of VALUES. VALUES may be a list of values or a typed column of the same type. The command may extend the array if necessary. If FIRST is not specified the elements are appended to the array. The command interprets end as the position after the last element in the array.
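The two modes of column put, overwriting from FIRST and appending when FIRST is omitted, can be sketched like this. The example assumes the tarray package is installed.

```tcl
package require tarray

set col [tarray::column series 5]                 ;# int column {0 1 2 3 4}

# Overwrite the elements starting at index 1.
set replaced [tarray::column put $col {7 8} 1]

# With FIRST omitted the values are appended.
set appended [tarray::column put $col {7 8}]

puts $replaced
puts $appended
```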
column random TYPE COUNT ?LOWERBOUND? ?UPPERBOUND?
Returns a new column of type TYPE with COUNT elements containing randomly generated values from a uniform distribution. For types boolean, byte, int, uint and wide the range of generated values corresponds to the entire domain range by default. For type double the values are generated in the range [0,1] by default. The optional LOWERBOUND and UPPERBOUND arguments may be supplied to modify the range from which values are sampled. These are ignored for TYPE boolean.
For use cases such as testing where you want the same reproducible “random” values to be produced, you can use the randseed command to set or reset the seed values used for random number generation.
column range ?OPTIONS? COLUMN LOW HIGH
Returns the values from a typed column for indices in a specified range. Any of the Standard options may be specified with this command.
column search ?-range RANGE? ?-among INDICES? ?-all? ?-bitmap? ?-inline? ?-not? ?-nocase? ?OPER? COLUMN VALUE
Searches the specified typed column for a matching value. By default, the search starts at index 0 and returns the index of the first match, returning -1 if no matching value is found.
Options -range and -among modify which elements of the column are examined. The -range option limits the search to the range specified by RANGE which either consists of two integer indices denoting the starting and ending elements of the range (inclusive), or a single integer index denoting the start of the range with the end defaulting to the last element of the column. The -among option specifies a list of indices to be examined. INDICES is an index list or index column. Indices are allowed to be specified multiple times in arbitrary order. Elements are examined and matches returned in that same order. Indices that fall outside the range (either explicitly specified through -range or defaulting to the entire column) are ignored. Thus if both -range and -among options are specified, only those positions that meet both criteria are examined.
The command normally returns the index of the first succeeding match. Note this is not necessarily the lowest matching index since -among may specify indices in any order. If the option -all is specified, the search does not stop at the first match but instead searches for all matching elements and returns an integer column containing the indices of all matched elements. The option -bitmap implies -all, but in this case the command returns a boolean column with the bits corresponding to each matching index set to 1.
If the -inline option is specified, the command returns the matched value(s) instead of their indices.
OPER specifies the comparison operator and must be one of those shown in Search comparison operators.
The sense of the match can be inverted by specifying the -not option so for example specifying -not -gt will match all elements that are less than or equal to VALUE. For case insensitive matching, the -nocase option may be specified. This is ignored for all array types except types any and string.
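A few common search forms are sketched below, assuming the tarray package is installed. Only options named in the text are used, and the data is illustrative.

```tcl
package require tarray

set col [tarray::column create int {5 3 8 3}]

# Index of the first element equal to 3.
set first [tarray::column search $col 3]

# Indices of all elements equal to 3, returned as a column.
set all [tarray::column search -all $col 3]

# Indices of all elements greater than 4.
set bigger [tarray::column search -all -gt $col 4]

# No match yields -1.
set none [tarray::column search $col 42]

puts "$first | $all | $bigger | $none"
```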
column series ?START? STOP ?STEP?
Returns a column with values between START (included) and STOP (excluded) incremented by STEP. START and STEP default to 0 and 1 respectively if unspecified. If STEP is less than 0, STOP must be less than START.
The type of the returned column may be int, wide or double depending on the operands. For example, a STEP of 1.0 would result in a column of type double whereas a STEP of 1 would return an int or wide depending on the range of operands.
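A minimal sketch of column series with ascending and descending steps, assuming the tarray package is installed. Note that STOP itself is excluded in both cases.

```tcl
package require tarray

# From 2 up to (but excluding) 10 in steps of 3.
set up [tarray::column series 2 10 3]

# From 5 down to (but excluding) 0 in steps of -1.
set down [tarray::column series 5 0 -1]

puts $up
puts $down
```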
column shuffle COLUMN
Returns a new column containing the elements of COLUMN in a new random order. Columns of type boolean are not supported.
column sort ?-indices? ?-increasing? ?-decreasing? ?-nocase? ?-indirect TARGETCOLUMN? COLUMN
Returns a sorted typed column. COLUMN is the typed column to be sorted. The comparison is done in a column type-specific manner. The column is sorted in increasing order by default. The -decreasing option may be specified to change this.
If the -indices option is specified, the command returns a typed array containing the indices of the sorted values instead of the values themselves.
If -nocase is specified, the comparisons are done in case-insensitive manner. This option is only applicable when the column type is any or string and is ignored otherwise.
Option -indirect may only be used when COLUMN is of type int. In this case, the elements of COLUMN are treated as indices into TARGETCOLUMN and are compared using the corresponding values from TARGETCOLUMN. This option is useful when sorting a column or multiple columns in a table using different criteria while keeping a stable order.
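The basic sort options can be sketched as follows, assuming the tarray package is installed. The data and variable names are illustrative.

```tcl
package require tarray

set col [tarray::column create int {30 10 20}]

set ascending  [tarray::column sort $col]
set descending [tarray::column sort -decreasing $col]

# Indices of the elements in sorted order, rather than the values.
set order [tarray::column sort -indices $col]

puts "$ascending | $descending | $order"
```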
column sum COLUMN
Returns the sum of all elements of the specified column which must be of a numeric type. For integer types, the sum is calculated as a 64 bit integer even if the column has a smaller integer width. There is no detection of integer overflow.
column summarize ?options? COL
The command returns a column that, depending on the passed options, summarizes the contents of the passed column COL. The command expects COL to be of the form of the data column in the table returned by column categorize or column histogram with the -values option. This form is a column of type any, all elements of which are themselves columns, all of the same type.
The return value is then a column, of the same size as COL, each element of which is a value that summarizes the corresponding column element in COL. This summary value may be computed in several ways depending on the specified options.
- If no options are specified or the -count option is specified, the value is the number of elements of the corresponding nested column of COL. The returned column is of type int.
- If the -sum option is specified, the value is the sum of the elements of the corresponding nested column (which must be a numeric type). The type of the column is double if the nested columns were of that type or wide for integer types.
- If the -summarizer CMDPREFIX option is specified, the value is that returned by the command prefix CMDPREFIX which is called with two additional arguments, the index into COL and the corresponding nested column at that index. The returned column is of type any by default. The -summarytype TYPE option may be specified to change this.
Usually, the table summarize command is more convenient to use in lieu of this command.
column vdelete COLUMNVAR LOW HIGH
Deletes elements from the typed array column in variable COLUMNVAR. The new value is assigned back to the variable. The resulting value of the variable (which may differ because of traces) is returned as the result of the command. Indices are specified in any of the forms described in Indices and may contain duplicates. Out-of-range indices are ignored.
column vfill COLUMNVAR VALUE LOW HIGH
Set the elements of the typed column in variable COLUMNVAR to VALUE. The new value is assigned back to the variable. The resulting value of the variable (which may differ because of traces) is returned as the result of the command.
See column fill for more information.
column vinject COLUMNVAR VALUES FIRST
Inserts VALUES, a list of values or a column of the same type as the column in variable COLUMNVAR, at the position FIRST and stores the result back in COLUMNVAR. If FIRST is end, the values are appended to the column. In all cases, the command may extend the array if necessary. The resulting value of the variable (which may differ because of traces) is returned as the result of the command.
column vinsert COLUMNVAR VALUE FIRST ?COUNT?
Inserts COUNT (default 1) elements with value VALUE at position FIRST in the column stored in variable COLUMNVAR. If FIRST is end, the values are appended to the column. The new value is assigned back to the variable. The resulting value of the variable (which may differ because of traces) is returned as the result of the command. In all cases, the command may extend the array if necessary.
column vplace COLUMNVAR VALUES INDICES
Modifies a typed column stored in the variable COLUMNVAR with the specified values at the corresponding indices. The new value is assigned back to the variable. The resulting value of the variable (which may differ because of traces) is returned as the result of the command.
See the command column place for other details.
column vput COLUMNVAR VALUES ?FIRST?
Modifies a variable COLUMNVAR containing a typed column. The elements of the column starting at index FIRST are replaced by the corresponding elements of VALUES. If FIRST is not specified the elements are appended to the array. The new value is assigned back to the variable. The resulting value of the variable (which may differ because of traces) is returned as the result of the command.
See the command column put for other details.
column vreverse COLUMNVAR
Reverses the order of elements in the typed column in variable COLUMNVAR. The new value is assigned back to the variable. The resulting value of the variable (which may differ because of traces) is returned as the result of the command.
column vshuffle COLUMNVAR
Shuffles the order of elements in the typed column in variable COLUMNVAR. The new value is assigned back to the variable. The resulting value of the variable (which may differ because of traces) is returned as the result of the command.
column vsort ?-increasing? ?-decreasing? ?-nocase? ?-indirect TARGETCOLUMN? COLUMNVAR
Sorts a typed column stored in a variable. COLUMNVAR is the name of the variable containing the typed column to be sorted. The sorted column is also returned as the command result. See the column sort command for a description of the options.
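The commands prefixed with v take the name of a variable rather than a column value and store the result back into that variable. A minimal sketch, assuming the tarray package is installed and using an illustrative variable named col:

```tcl
package require tarray

set col [tarray::column create int {3 1 2}]

# Pass the variable name (col), not its value ($col).
tarray::column vsort col

# Append a value in place; FIRST is omitted so the value is appended.
tarray::column vput col {9}

puts $col
```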
column width COLUMN ?FORMAT?
Returns the maximum width of the specified column in terms of the number of characters required to print in the given format. If FORMAT is not specified, it defaults to %s. If the column is empty, the command returns 0 irrespective of FORMAT.
column zeroes COUNT ?TYPE?
Returns a column of size COUNT with all elements initialized to 0. TYPE defaults to int.