One of the differences between tall arrays and in-memory MATLAB® arrays is that tall arrays typically remain unevaluated
until you request that calculations be performed. (The exceptions to this rule include plotting functions like plot and histogram and some statistical fitting functions like fitlm, which automatically evaluate tall array inputs.) While a tall array is in an unevaluated state, MATLAB might not know its size, its data type, or the specific values it contains.
However, you can still use unevaluated arrays in your calculations as if the values were
known. This allows you to work quickly with large data sets instead of waiting for each
command to execute. For this reason, it is recommended that you use gather only when you require output.
MATLAB keeps track of all the operations you perform on unevaluated tall arrays as
you enter them. When you eventually call gather to evaluate the queued
operations, MATLAB uses the history of unevaluated commands to optimize the calculation by
minimizing the number of passes through the data. Used properly, this optimization can save
huge amounts of execution time by eliminating unnecessary passes through large data
sets.
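The deferred behavior described above can be sketched with a small tall array built from in-memory data. This is only an illustration: the variable names are hypothetical, and in practice a tall array is usually backed by a datastore rather than a short in-memory vector.

```matlab
% Build a small tall array from in-memory data (for illustration only).
x = tall((1:1000)');

% Queue several operations. No calculation is performed yet.
y = 2*x + 1;
z = sum(y);

% z is still a tall array, so it has not been brought into memory.
isDeferred = istall(z);

% Request output. The queued operations are evaluated at this point.
total = gather(z);
```

After the call to gather, total is an ordinary in-memory double, while y and z remain tall arrays.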
The display of unevaluated tall arrays varies depending on how much MATLAB knows about the array and its values. There are three pieces of information reflected in the display:
Array size: Unknown dimension sizes are represented by the variables M or N in the display. If no dimension sizes are known, then the size appears as MxNx... in the display.
Array data type: If the array has an unknown underlying data type, then its type appears as tall array. If the type is known, it is listed as, for example, tall double array.
Array values: If the array values are unknown, then they appear as ? in the display. Known values are displayed.
MATLAB might know all, some, or none of these pieces of information about a given tall array, depending on the nature of the calculation.
For example, if the array has a known data type but unknown size and values, then the unevaluated tall array might look like this:
  M×N×... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :
If the type and relative size are known, then the display could be:
  1×N tall char array

    ?    ?    ?    ...
If some of the data is known, then MATLAB displays the known values:
  100×3 tall double matrix

    0.8147    0.1622    0.6443
    0.9058    0.7943    0.3786
    0.1270    0.3112    0.8116
    0.9134    0.5285    0.5328
    0.6324    0.1656    0.3507
    0.0975    0.6020    0.9390
    0.2785    0.2630    0.8759
    0.5469    0.6541    0.5502
      :         :         :
      :         :         :
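One way to see these display states for yourself is to compare a tall array whose size is known with one whose size depends on the data. The sketch below uses hypothetical variable names and an in-memory source for convenience. The exact text that MATLAB prints can vary by release, so no output is shown here.

```matlab
% A tall array created from in-memory data: size and type are known.
t = tall(rand(100,3));

% Keep only the rows whose first column exceeds 0.5. The number of rows
% depends on the data, so it is unknown until the array is evaluated.
big = t(t(:,1) > 0.5, :)

% The row count becomes known only after evaluation.
nRows = gather(size(big,1));
```

For big, you would expect the first dimension to be shown as an unknown size, whereas the column count and the underlying type remain known.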
The gather function is used to evaluate tall arrays. gather accepts tall arrays as inputs and returns in-memory arrays as outputs. For this reason, you can think of this function as a bridge between tall arrays and in-memory arrays. For example, you cannot control if or while loop statements using a tall logical array, but once the array is evaluated with gather it becomes an in-memory logical value that you can use in these contexts.
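As a minimal sketch of this bridging role, the following code evaluates a tall logical value with gather before using it in an if statement. The data and variable names are hypothetical.

```matlab
% Small tall array for illustration.
x = tall([3 8 1 9 4]');

% The result of any is a tall logical that is still unevaluated.
anyLarge = any(x > 5);

% Bring the value into memory so that it can control program flow.
flag = gather(anyLarge);

if flag
    msg = 'At least one element exceeds 5.';
else
    msg = 'No element exceeds 5.';
end
disp(msg)
```

Here flag is an in-memory logical scalar, so it can be used wherever an ordinary logical condition is expected.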
gather performs all queued operations on a tall array and returns the entire result in memory. Since gather returns results as in-memory MATLAB arrays, standard memory considerations apply. MATLAB might run out of memory if the result returned by gather is too large.
Most of the time you can use gather to see the entire result of a calculation, particularly if the calculation includes a reduction operation such as sum or mean. However, if the result is too large to fit in memory, then you can use gather(head(X)) or gather(tail(X)) to perform the calculation and look at only the first or last few rows of the result.
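The following sketch shows this pattern on a hypothetical tall column vector. The array here is small enough to gather in full, so treat it only as a stand-in for a result that would not fit in memory.

```matlab
% Stand-in for a large tall array.
x = tall((1:100000)');

% An elementwise calculation whose result is as tall as the input.
y = x.^2 - x;

% Evaluate the calculation but return only the first rows
% (head returns 8 rows by default).
firstRows = gather(head(y));

% Return only the last 3 rows.
lastRows = gather(tail(y, 3));
```

Both firstRows and lastRows are small in-memory arrays, regardless of how tall y is.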
If you enter an erroneous command and gather fails to evaluate a tall array variable, then you must delete the variable
from your workspace and recreate the tall array using only valid
commands. This is because MATLAB keeps track of all the operations
you perform on unevaluated tall arrays as you enter them. The only
way to make MATLAB “forget” about an erroneous
statement is to reconstruct the tall array from scratch.
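One possible way to structure this recovery is to wrap the call to gather in a try/catch block. The sketch below is an assumption about workflow, not a prescribed method: the variable names are hypothetical, and the commands shown are all valid, so the catch branch only marks where the corrected commands would go.

```matlab
% Hypothetical source data and queued calculation.
src = tall((1:10)');
x = src .* 2;

try
    result = gather(x);
catch err
    % Evaluation failed, so x still carries the erroneous history.
    warning('Evaluation failed: %s', err.message);

    % Delete the variable, then recreate it using only valid commands.
    clear x
    x = src .* 2;   % substitute the corrected commands here
    result = gather(x);
end
```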
This example shows what an unevaluated tall array looks like, and how to evaluate the array.
Create a datastore for the data set airlinesmall.csv. Convert the datastore into a tall table and then calculate the size.
varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', varnames);
tt = tall(ds)
tt =

  M×4 tall table

    ArrDelay    DepDelay    Origin    Dest
    ________    ________    ______    _____

        8          12       'LAX'     'SJC'
        8           1       'SJC'     'BUR'
       21          20       'SAN'     'SMF'
       13          12       'BUR'     'SJC'
        4          -1       'SMF'     'LAX'
       59          63       'LAX'     'SJC'
        3          -2       'SAN'     'SFO'
       11          -1       'SEA'     'LAX'
        :           :         :         :
        :           :         :         :
s = size(tt)
s =

  1×2 tall double row vector

    ?    ?

Preview deferred. Learn more.
Calculating the size of a tall array returns a small answer
(a 1-by-2 vector), but the display indicates that an entire pass through
the data is still required to calculate the size of tt.
Use the gather function to fully evaluate
the tall array and bring the results into memory. As the command executes,
there is a dynamic progress display in the command window that is
particularly helpful with long calculations.
Note: Always ensure that the result returned by gather will be able to fit in memory. If you use gather directly on a tall array without reducing its size using a function such as mean, then MATLAB might run out of memory.
tableSize = gather(s)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.42 sec
Evaluation completed in 0.48 sec

tableSize =

      123523           4
This example shows how several calculations can be combined to minimize the total number of passes through the data.
Create a datastore for the data set airlinesmall.csv. Convert the datastore into a tall table.
varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', varnames);
tt = tall(ds)
tt =

  M×4 tall table

    ArrDelay    DepDelay    Origin    Dest
    ________    ________    ______    _____

        8          12       'LAX'     'SJC'
        8           1       'SJC'     'BUR'
       21          20       'SAN'     'SMF'
       13          12       'BUR'     'SJC'
        4          -1       'SMF'     'LAX'
       59          63       'LAX'     'SJC'
        3          -2       'SAN'     'SFO'
       11          -1       'SEA'     'LAX'
        :           :         :         :
        :           :         :         :
Subtract the mean value of DepDelay from ArrDelay to create a new variable AdjArrDelay. Then calculate the mean value of AdjArrDelay and subtract this mean value from AdjArrDelay. If these calculations were all evaluated separately, then MATLAB would require four passes through the data.
AdjArrDelay = tt.ArrDelay - mean(tt.DepDelay,'omitnan');
AdjArrDelay = AdjArrDelay - mean(AdjArrDelay,'omitnan')
AdjArrDelay =

  M×1 tall double column vector

    ?
    ?
    ?
    :
    :

Preview deferred. Learn more.
Evaluate AdjArrDelay and view the first few rows. Because some calculations can be combined, only three passes through the data are required.
gather(head(AdjArrDelay))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 3: Completed in 0.4 sec
- Pass 2 of 3: Completed in 0.39 sec
- Pass 3 of 3: Completed in 0.23 sec
Evaluation completed in 1.2 sec

ans =

    0.8799
    0.8799
   13.8799
    5.8799
   -3.1201
   51.8799
   -4.1201
    3.8799
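The same idea applies when you need several results from one tall array. As a sketch, the code below queues two reductions and requests both in a single call to gather, which accepts multiple tall inputs, so that MATLAB can share passes through the data between them. The data and variable names are hypothetical.

```matlab
% Hypothetical tall column vector.
x = tall(randn(1000,1));

% Queue two separate reductions without evaluating either one.
mu = mean(x);
sigma = std(x);

% Request both results together so the evaluation can be combined.
[muVal, sigmaVal] = gather(mu, sigma);
```

Calling gather(mu) and gather(sigma) separately would also work, but each call would trigger its own evaluation.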
Tall arrays remain unevaluated until you request output using gather.
Use gather in most cases to evaluate tall array calculations. If you believe the result of the calculations might not fit in memory, then use gather(head(X)) or gather(tail(X)) instead.
Work primarily with unevaluated tall arrays and request output only when necessary. The more calculations that are queued and unevaluated, the more optimization MATLAB can do to minimize the number of passes through the data.
If you enter an erroneous tall array command and gather fails to evaluate a tall array variable, then you must delete the variable from your workspace and recreate the tall array using only valid commands.