JIT Octave

Friday, July 27, 2012

OctConf Reflection and Project Plan

I had a great time at OctConf 2012. There were a lot of interesting people there, and it is nice to be able to put faces to names. I definitively hope I will be able to make it next year.

I recently realized that there are only two weeks left before the GSoC suggested pencils down date (8/13/2012). In the remaining time I plan on focusing my effort on better matrix support and supporting more builtin functions. I should be able to make significant progress on this in the remaining two weeks.

After GSoC is done, I plan on working on compiling user functions and function handles. I think adding support for user functions in JIT is important, but I'm not sure if I will be able to complete it in two weeks.

Tuesday, July 3, 2012

Comparison of JIT with Oct files

In my last post I tested the Octave JIT compiler on a problem presented on the mailing list. I got a request for a comparison with oct files. I think this is an interesting comparison, because ideally the JIT compiler should reduce the need to rewrite Octave scripts as oct files.

Oct file

The oct file is mostly equivalent to the loopy version in my previous post.

#include <octave/oct.h>
#include <octave/parse.h>

DEFUN_DLD (oct_loopy, args, , "TODO")
{
  feval ("tic");

  octave_value ret;
  int nargin = args.length ();
  if (nargin != 2)
    print_usage ();
  else
    {
      NDArray data = args(0).array_value ();
      octave_idx_type nconsec;
      nconsec = static_cast<octave_idx_type> (args(1).double_value ());

      if (!error_state)
        {
          double *vec = data.fortran_vec ();
          octave_idx_type counter = 0;
          octave_idx_type n = data.nelem ();
          for (octave_idx_type i = 0; i < n; ++i)
            {
              if (vec[i])
                ++counter;
              else
                {
                  if (counter > 0 && counter < nconsec)
                    std::fill (vec + i - counter, vec + i, 0);

                  counter = 0;
                }
            }

          if (counter > 0 && counter < nconsec)
            std::fill (vec + n - counter, vec + n, 0);

          ret = octave_value (data);
        }
    }

  feval ("toc");
  return ret;
}

Results

I ran each test five times, taking the lowest time. I have also separated out the compile/link time from the run time. For JIT, compile time was determined by running the function twice, and subtracting the first run time from the second run time. The compile time for the oct file was determined by timing mkoctfile. The initial parameters were a random vector, A, of size 1,000,000 and a K = 3.

	Compile time	Run time
JIT	14ms	21ms
OCT	2400ms	3.3ms

When using JIT, the compile time is part of the run time for the first execution of the loop. This means that for this example, JIT is currently about 10 times slower than the oct file. However, if we were to execute the function 50 times on 1,000,000 element vectors, then JIT would be 6 times slower.

After looking at the assembly, it looks like JIT runs into issues with checks for matrix index validity and that loop variables are doubles (in loops like `for ii=1:5' ii is a double). It should be possible to fix these issues in JIT, but it will result in a larger compile time.

Thursday, June 28, 2012

A Realistic Test

There was an interesting post on the mailing list (https://mailman.cae.wisc.edu/pipermail/help-octave/2012-June/052642.html). The problem is, given some logical array, A = [1 0 0 1 0 0 1 1 1 ...], and minimum length of consecutive ones, K, all sequences of ones less than K should be filtered out.

The exciting part for me is that the JIT compiler is currently far enough along to compile the loopy solution.

Input generation

I used a simple script to generate some random input data. A double matrix is used, because JIT does not yet work for logical matrices.

function result = gen_test (n)
  result = double (rand (1, n) > .01);
endfunction

Vectorized

Vectorized code (based off of code from the mailing list)

function z = vectorized (A, K)
  tic;
  temp = ones (1, K);
  z = conv (A, temp);
  z = z > K-1;
  z = conv (z, temp);
  z = z(K:end-K+1);
  z = z >= 1;
  toc;
endfunction

Loopy

I didn't do anything fancy here.

function z = loopy (A, K)
  tic;

  z = A;
  n = numel (A);
  counter = 0;
  for ii=1:n
    if z(ii)
      counter = counter + 1;
    else
      if counter > 0 && counter < K
        z(ii-counter:ii-1) = 0;
      endif
      counter = 0;
    endif
  endfor

  if counter > 0 && counter < K
    z(end-counter+1:end) = 0;
  endif
  toc;
endfunction

Results

These numbers were taken from a AMD FX(tm)-4100 Quad-Core Processor with 8 GB of RAM. I just ran each test once. For each test, the number of elements in A was 1,000,000.

	Vectorized	Loopy JIT	Loopy JIT (no overhead)	Loopy (No JIT)
K=3	0.078s	0.059s	0.028s	5.27s
K=100	0.43s	0.063s	0.028s	5.66s
K=500	1.58s	0.082s	0.033s	5.73s

These results are expected. The efficiency of the vectorized approach depends on K, while the loopy version does not. While JIT support is not complete or stable yet¹, I think this shows that the current JIT implementation is able to handle a few practical examples, not just interpreter benchmarks.

hg id for regular octave branch: 52cb71787cd1
hg id for jit branch: f649b66ef1af

¹ Occasionally I break functionality like assignment, addition, and function calls.

Monday, June 18, 2012

Matrix Support

The Octave Language

An interesting feature of the Octave language is that matrices are treated as values, not references. For example, the following code

A = [1 2 3];
B = A;
A(1) = 100;
disp (B(1));

will display the value 1, not 100. Matrices are also passed by value in function calls.

Interpreter Implementation

Octave currently uses a really cool system to minimize the number of times matrices. It combines Copy on Write (COW) with reference counting to reduce the number of copies.

The idea is to delay copies until a matrix is mutated. Then the copy is only made if the reference count is greater than one. Take the previous example,

A = [1 2 3];
B = A; # No copy
A(1) = 100; # Copy, as both A and B refer to [1 2 3]

This algorithm has two nice properties. First, it is simple. Second, copies are only made when required.

JIT

I plan on using the same algorithm in JIT. This decision means that every assignment of the form

A(idx) = x; # Where idx and x are scalar

requires a check to see if a copy is required. This actually is not such a big issue, because we already require a check to ensure the index is legal. In order to speed up the normal case, we can directly inline the condition check.

Which leads us to the pseudo code for the inlined function.

if (isint (idx) && idx >= 1 && i < nelem (A) && count (A) == 1)
  A(idx) = x;
else
  A = handle_odd_assign (A, idx, x);

Hopefully, this will allow the normal case to be quick, but still provide a correct implementation for more complicated cases.

Monday, June 11, 2012

Errors are annoying

Edit
I realized that I need to qualify that this issue isn't a problem with Octave's implementation of error handling. Instead, the issue lies in the language, and the fact that almost every operation can cause an error.

In Octave correctly handling errors is annoying. For an example, let's look at a simplified implementation of binary operators in the interpreter.

octave_value retval;
octave_value lhs = // compute lhs
if (! error_state && lhs.is_defined ())
{
  octave_value rhs = // compute rhs
  if (! error_state && rhs.is_defined ())
    retval = ::do_binary_op (op_type, lhs, rhs);
}
return retval;

Notice that after every operand is computed, the error state must be checked. Take the follow statement

a = (a + b) / c;

When converted into a linear SSA form we get

block0:
  temp0 = a0 + b0
  check_error block1, done
block1:
  temp2 = temp0 / c0
  check_error block2, done
block2:
  c1 = temp2
  goto done
done:
  c2 = phi (block0: c0, block1: c0, block2: c1)
  # yield a0, b0, and c2 as results

Notice that the phi merge function requires an entry for every operation. Furthermore, each variable that is defined must be represented in the done block. At the worse case, each statement could define a new variable leading to O(n^2) space complexity. (where n is the statement count in Octave)

However, in practice the number of variable should be low compared to the number of lines. As a test, I automatically generated a 500 line program consisting of scalar addition statements like

i1 = i0 + i0;

It looks like the compile time stays within reasonable bounds.

Monday, June 4, 2012

Progress?

Progress last week was a little slower than I had hoped. Originally, I had planned on adding support for break and continue quickly, then moving onto other issues. However, I realized the hard way that my simple inline SSA construction algorithm was not very extensible. I have now implemented a textbook SSA construction algorithm.

I also took another look at my modifications to configure.ac (with help from John W. Eaton and Jordi). The configure check was fixed and LLVM_CONFIG can be set to the location of llvm-config. This allows building with LLVM in a nonstandard path.

Next, I think I will focus ensuring error and warning conditions are handled correctly (not talking about try/catch, just error and warning). In theory error checking is simple, for each operation that may error I need to check to see if it errored, then if it has exit the JIT function. There are a few annoying details. For example, I need to keep track of and return the values of each variable when an error occurs. Additionally, the ability to change warnings into errors adds further complexity.

Monday, May 28, 2012

Type Inference

I just finished implementing a type inference algorithm based off of FALCON [1]. I like this algorithm for several reasons

The efficiency is O(n): This is important because we should minimize JIT overhead.
Decouples type inference from code generation: This will make it easier if we decide to switch from LLVM to a GNU equivalent.
Type bounds are good: The algorithm is able to assign different types to the same variable when the variable is used in different contexts.

The algorithm works by introducing a type lattice (also similar to the MaJIC and McVM MATLAB JIT compilers[2][3]). For example, the simple lattice I am currently using

Each variable in the SSA is then assigned a type from our lattice. For example, in the code

b = 1;
for i=0:5
  temp = 5;
  b = b + temp;

  temp = [5 5 5];
  b = b + temp;
endfor

The algorithm is able to differentiate between the two occurrences of temp. The algorithm will also realizes that b is a matrix inside of the loop (once matrices are added to type type lattice). Then before the start of the loop, b, will be converted to a single element matrix. This means we can generate matrix addition operations inside of the loop.

[1] L. D. Rose and D. Padua. Techniques for the Translation of MATLAB Programs into Fortran 90. ACM Transactions on Programming Languages and Systems, 21(2):286–323, Mar. 1999.
[2]George Almási and David Padua. 2002. MaJIC: compiling MATLAB for speed and responsiveness. SIGPLAN Not. 37, 5 (May 2002), 294-303.
[3] M.Sc. Thesis - McVM: an Optimizing Virtual Machine for the MATLAB Programming Language