JIT Octave

JIT, Debugging, and Interrupts

2012-11-05T00:10:00.001-08:00

I finally found some time to work on Octave last weekend. There has been some talk on the mailing list recently about releasing a new version of Octave, so I figured I should clean up a few loose ends in JIT.

Breakpoints

Up until now JIT has been skipping breakpoints. While there are several issues with supporting breakpoints in JIT, the biggest one is that the Octave debugger allows for the execution of arbitrary statements. Arbitrary statement execution is a very powerful and useful feature of the Octave debugger, but it allows users to change the type of a variable in the middle of code execution.

JIT gets its performance improvement by making assumptions about variable types, so entering debug mode means we need to exit JIT. I took the simple way out here and do not start JIT if there are any breakpoints in the code (see hg).

Interrupts

In Octave if you hit control-c Octave will stop execution and return to the Octave prompt (unless debug on interrupt is set). The interpreter does this by calling octave_quit, which checks the interrupt state and throws an octave_interrupt_exception if an interrupt has occured.

Ideally, to support interrupts in JIT a call to octave_quit should be inserted once per loop iteration. Unfortunately, it is not that simple. After an interrupt occurs the current variable values need to be reported to the interpreter. For example,

i = 0;
while 1
  i += 1;
endwhile

If the user interrupts the loop the interpreter needs to save the current value of i. This means JIT needs a way to catch and rethrow the octave_interrupt_exception. While LLVM does have a way of handling exceptions, the infrastructure in Octave does not yet exist to support the LLVM feature.

Instead, I inserted a check to octave_interrupt_state. If octave_interrupt_state is greater than 0, we need to exit to the Octave prompt. I reused the code for checking error_state to achieve this.

Now that JIT handles interrupts and breakpoints in a manner consistent with the interpreter, I can't think of ways in which JIT and the interpreter differ (besides speed). The amount of code which JIT can compile is still fairly limited. Hopefully, I will get some time over winter break to make it easier to extend JIT and improve what JIT can compile. In its current state JIT should be ready to include as an experimental feature in the next Octave release.

GSoC Report

2012-08-18T20:45:00.001-07:00

I have just finished writing my Google Summer of Code final report.

Multidimensional indexing and end

2012-08-07T10:46:00.003-07:00

This week I added support for multidimensional matrix indexing and using end in jit. Support for the end keyword is interesting. Take the following for example,

y = A(1, sin (end));

From that line alone, it is not clear what end refers to. If sin is a matrix, then end will be the end of sin. If sin is a function, then end will be the end of A.

I have solved this problem by keeping track of the full context information for each end. Lets take a look at a slightly more complicated example

y = A(1, 2, B(sin (end), 5));

then during type inference our context might look something like this

type     identifier index count
function sin        0     1
matrix   B          0     2
matrix   A          2     4

In this context end then refers to the end of matrix B at index 0.

OctConf Reflection and Project Plan

2012-07-27T10:49:00.000-07:00

I had a great time at OctConf 2012. There were a lot of interesting people there, and it is nice to be able to put faces to names. I definitively hope I will be able to make it next year.

I recently realized that there are only two weeks left before the GSoC suggested pencils down date (8/13/2012). In the remaining time I plan on focusing my effort on better matrix support and supporting more builtin functions. I should be able to make significant progress on this in the remaining two weeks.

After GSoC is done, I plan on working on compiling user functions and function handles. I think adding support for user functions in JIT is important, but I'm not sure if I will be able to complete it in two weeks.

Comparison of JIT with Oct files

2012-07-03T15:40:00.001-07:00

In my last post I tested the Octave JIT compiler on a problem presented on the mailing list. I got a request for a comparison with oct files. I think this is an interesting comparison, because ideally the JIT compiler should reduce the need to rewrite Octave scripts as oct files.

Oct file

The oct file is mostly equivalent to the loopy version in my previous post.

#include <octave/oct.h>
#include <octave/parse.h>

DEFUN_DLD (oct_loopy, args, , "TODO")
{
  feval ("tic");

  octave_value ret;
  int nargin = args.length ();
  if (nargin != 2)
    print_usage ();
  else
    {
      NDArray data = args(0).array_value ();
      octave_idx_type nconsec;
      nconsec = static_cast<octave_idx_type> (args(1).double_value ());

      if (!error_state)
        {
          double *vec = data.fortran_vec ();
          octave_idx_type counter = 0;
          octave_idx_type n = data.nelem ();
          for (octave_idx_type i = 0; i < n; ++i)
            {
              if (vec[i])
                ++counter;
              else
                {
                  if (counter > 0 && counter < nconsec)
                    std::fill (vec + i - counter, vec + i, 0);

                  counter = 0;
                }
            }

          if (counter > 0 && counter < nconsec)
            std::fill (vec + n - counter, vec + n, 0);

          ret = octave_value (data);
        }
    }

  feval ("toc");
  return ret;
}

Results

I ran each test five times, taking the lowest time. I have also separated out the compile/link time from the run time. For JIT, compile time was determined by running the function twice, and subtracting the first run time from the second run time. The compile time for the oct file was determined by timing mkoctfile. The initial parameters were a random vector, A, of size 1,000,000 and a K = 3.

	Compile time	Run time
JIT	14ms	21ms
OCT	2400ms	3.3ms

When using JIT, the compile time is part of the run time for the first execution of the loop. This means that for this example, JIT is currently about 10 times slower than the oct file. However, if we were to execute the function 50 times on 1,000,000 element vectors, then JIT would be 6 times slower.

After looking at the assembly, it looks like JIT runs into issues with checks for matrix index validity and that loop variables are doubles (in loops like `for ii=1:5' ii is a double). It should be possible to fix these issues in JIT, but it will result in a larger compile time.

A Realistic Test

2012-06-28T22:30:00.002-07:00

There was an interesting post on the mailing list (https://mailman.cae.wisc.edu/pipermail/help-octave/2012-June/052642.html). The problem is, given some logical array, A = [1 0 0 1 0 0 1 1 1 ...], and minimum length of consecutive ones, K, all sequences of ones less than K should be filtered out.

The exciting part for me is that the JIT compiler is currently far enough along to compile the loopy solution.

Input generation

I used a simple script to generate some random input data. A double matrix is used, because JIT does not yet work for logical matrices.

function result = gen_test (n)
  result = double (rand (1, n) > .01);
endfunction

Vectorized

Vectorized code (based off of code from the mailing list)

function z = vectorized (A, K)
  tic;
  temp = ones (1, K);
  z = conv (A, temp);
  z = z > K-1;
  z = conv (z, temp);
  z = z(K:end-K+1);
  z = z >= 1;
  toc;
endfunction

Loopy

I didn't do anything fancy here.

function z = loopy (A, K)
  tic;

  z = A;
  n = numel (A);
  counter = 0;
  for ii=1:n
    if z(ii)
      counter = counter + 1;
    else
      if counter > 0 && counter < K
        z(ii-counter:ii-1) = 0;
      endif
      counter = 0;
    endif
  endfor

  if counter > 0 && counter < K
    z(end-counter+1:end) = 0;
  endif
  toc;
endfunction

Results

These numbers were taken from a AMD FX(tm)-4100 Quad-Core Processor with 8 GB of RAM. I just ran each test once. For each test, the number of elements in A was 1,000,000.

	Vectorized	Loopy JIT	Loopy JIT (no overhead)	Loopy (No JIT)
K=3	0.078s	0.059s	0.028s	5.27s
K=100	0.43s	0.063s	0.028s	5.66s
K=500	1.58s	0.082s	0.033s	5.73s

These results are expected. The efficiency of the vectorized approach depends on K, while the loopy version does not. While JIT support is not complete or stable yet¹, I think this shows that the current JIT implementation is able to handle a few practical examples, not just interpreter benchmarks.

hg id for regular octave branch: 52cb71787cd1
hg id for jit branch: f649b66ef1af

¹ Occasionally I break functionality like assignment, addition, and function calls.

Matrix Support

2012-06-18T21:14:00.000-07:00

The Octave Language

An interesting feature of the Octave language is that matrices are treated as values, not references. For example, the following code

A = [1 2 3];
B = A;
A(1) = 100;
disp (B(1));

will display the value 1, not 100. Matrices are also passed by value in function calls.

Interpreter Implementation

Octave currently uses a really cool system to minimize the number of times matrices. It combines Copy on Write (COW) with reference counting to reduce the number of copies.

The idea is to delay copies until a matrix is mutated. Then the copy is only made if the reference count is greater than one. Take the previous example,

A = [1 2 3];
B = A; # No copy
A(1) = 100; # Copy, as both A and B refer to [1 2 3]

This algorithm has two nice properties. First, it is simple. Second, copies are only made when required.

JIT

I plan on using the same algorithm in JIT. This decision means that every assignment of the form

A(idx) = x; # Where idx and x are scalar

requires a check to see if a copy is required. This actually is not such a big issue, because we already require a check to ensure the index is legal. In order to speed up the normal case, we can directly inline the condition check.

Which leads us to the pseudo code for the inlined function.

if (isint (idx) && idx >= 1 && i < nelem (A) && count (A) == 1)
  A(idx) = x;
else
  A = handle_odd_assign (A, idx, x);

Hopefully, this will allow the normal case to be quick, but still provide a correct implementation for more complicated cases.

Errors are annoying

2012-06-11T13:53:00.000-07:00

Edit
I realized that I need to qualify that this issue isn't a problem with Octave's implementation of error handling. Instead, the issue lies in the language, and the fact that almost every operation can cause an error.

In Octave correctly handling errors is annoying. For an example, let's look at a simplified implementation of binary operators in the interpreter.

octave_value retval;
octave_value lhs = // compute lhs
if (! error_state && lhs.is_defined ())
{
  octave_value rhs = // compute rhs
  if (! error_state && rhs.is_defined ())
    retval = ::do_binary_op (op_type, lhs, rhs);
}
return retval;

Notice that after every operand is computed, the error state must be checked. Take the follow statement

a = (a + b) / c;

When converted into a linear SSA form we get

block0:
  temp0 = a0 + b0
  check_error block1, done
block1:
  temp2 = temp0 / c0
  check_error block2, done
block2:
  c1 = temp2
  goto done
done:
  c2 = phi (block0: c0, block1: c0, block2: c1)
  # yield a0, b0, and c2 as results

Notice that the phi merge function requires an entry for every operation. Furthermore, each variable that is defined must be represented in the done block. At the worse case, each statement could define a new variable leading to O(n^2) space complexity. (where n is the statement count in Octave)

However, in practice the number of variable should be low compared to the number of lines. As a test, I automatically generated a 500 line program consisting of scalar addition statements like

i1 = i0 + i0;

It looks like the compile time stays within reasonable bounds.

Progress?

2012-06-04T13:35:00.001-07:00

Progress last week was a little slower than I had hoped. Originally, I had planned on adding support for break and continue quickly, then moving onto other issues. However, I realized the hard way that my simple inline SSA construction algorithm was not very extensible. I have now implemented a textbook SSA construction algorithm.

I also took another look at my modifications to configure.ac (with help from John W. Eaton and Jordi). The configure check was fixed and LLVM_CONFIG can be set to the location of llvm-config. This allows building with LLVM in a nonstandard path.

Next, I think I will focus ensuring error and warning conditions are handled correctly (not talking about try/catch, just error and warning). In theory error checking is simple, for each operation that may error I need to check to see if it errored, then if it has exit the JIT function. There are a few annoying details. For example, I need to keep track of and return the values of each variable when an error occurs. Additionally, the ability to change warnings into errors adds further complexity.

Type Inference

2012-05-28T09:23:00.000-07:00

I just finished implementing a type inference algorithm based off of FALCON [1]. I like this algorithm for several reasons

The efficiency is O(n): This is important because we should minimize JIT overhead.
Decouples type inference from code generation: This will make it easier if we decide to switch from LLVM to a GNU equivalent.
Type bounds are good: The algorithm is able to assign different types to the same variable when the variable is used in different contexts.

The algorithm works by introducing a type lattice (also similar to the MaJIC and McVM MATLAB JIT compilers[2][3]). For example, the simple lattice I am currently using

Each variable in the SSA is then assigned a type from our lattice. For example, in the code

b = 1;
for i=0:5
  temp = 5;
  b = b + temp;

  temp = [5 5 5];
  b = b + temp;
endfor

The algorithm is able to differentiate between the two occurrences of temp. The algorithm will also realizes that b is a matrix inside of the loop (once matrices are added to type type lattice). Then before the start of the loop, b, will be converted to a single element matrix. This means we can generate matrix addition operations inside of the loop.

[1] L. D. Rose and D. Padua. Techniques for the Translation of MATLAB Programs into Fortran 90. ACM Transactions on Programming Languages and Systems, 21(2):286–323, Mar. 1999.
[2]George Almási and David Padua. 2002. MaJIC: compiling MATLAB for speed and responsiveness. SIGPLAN Not. 37, 5 (May 2002), 294-303.
[3] M.Sc. Thesis - McVM: an Optimizing Virtual Machine for the MATLAB Programming Language

Initial Work

2012-05-22T19:15:00.000-07:00

Hi, I am Max Brister, a student working on JIT compilation in GNU Octave for google summer of code. My project proposal can be found on melange. In this blog I plan to explore implementation details, mark my progress, and present intermediary results.

I have already implemented a simple proof of concept which shows some promise. The simple script

a = 1;
b = 1;

tic;
for i=1:10000
  for j=1:10000
    a = a + b;
  endfor
endfor
toc;

executes on my computer in 0.178s using my jit branch, and in 253.124s using Octave 3.6.1. A speedup of 1422 is not bad.

There is still quite a ways to go before the code is ready for users though. I need to take another look at type inference, ensure error cases are handled correctly, and the current implementation requires the inner loop bounds to be constant for type inference.

You can view my current progress by checking out my code from

hg clone http://inversethought.com/hg/octave-max

The specific revision this post refers to is cba58541954c.

*edit*
I realized that I should mention that a speedup of 1422x is really a best cast speedup. Code which is already vectorized will see a much smaller speedup (if any). This is because Octave already efficiently implements vectorization.