I’ve been doing a lot of data analysis in Ruby lately. In the past I did a lot of data analysis in R, and I love R, but it has always driven me a little crazy. As part of a two-man startup, most of my coding time goes to just making the basic infrastructure work, and since most of that code is already in Ruby, I end up wanting to do basic analysis in Ruby too.
The one feature I really miss from R is all the crazy array indexing things you can do. This is probably the best feature of R besides the amazing set of statistics libraries.
For example, if “a” is a matrix of data that looks like [[1,2,3],[4,5,6],[7,8,9]], you can say things like
a[a<2] = 4
And it will take all the elements less than 2 and set them to 4. This kind of flexible lookup and assignment is really useful for data cleanup and exploration.
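For comparison, here is a rough sketch of the same cleanup in plain Ruby with nested arrays. The `set_where` helper is a hypothetical name used only for illustration; it is not part of NArray or of the library described below.

```ruby
# Hypothetical helper: return a copy of the matrix in which every element
# that satisfies the block has been replaced by new_value.
def set_where(matrix, new_value)
  matrix.map do |row|
    row.map { |v| yield(v) ? new_value : v }
  end
end

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Rough equivalent of R's a[a<2] = 4
cleaned = set_where(a, 4) { |v| v < 2 }
```

It works, but the condition and the assignment are no longer a single indexing expression, which is the convenience R provides.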
There are Ruby libraries that have some of this matrix functionality, such as NArray, but for data analysis I really wanted something with named rows and columns and fancy ways to look up those names.
I’ve been slowly building up a library that does this, which you can find on RubyForge at http://rubyforge.org/projects/dataframe.
The first piece of functionality I built, which is already really useful to me, is looking things up by named row or column. You can pull parts of the array out by name, number, regex on the name, or even an arbitrary function on the name:
def test_basic_partial_lookups
  d = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  assert(d["snake",true] == D[{"length" => 10, "height" => 1}])
  assert(d["snake",true] == D[1, 10])
  assert(d["snake"] == D[1, 10])
  assert(d[true, "height"] == D[[10], [1]])
  assert((d/"height") == D[[10], [1]])
end

def test_array_lookups
  d = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  assert(d[["snake", "giraffe"], ["height"]] == D[[1], [10]])
  assert(d[["giraffe", "snake"], ["height"]] == D[[10], [1]])
  assert(d[[0,1], 0] == D[[10], [1]])
  assert(d[[1,0], 0] == D[[1], [10]])
end

def test_regex_lookups
  d = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  assert(d[/nak/, /.*/] == D[10, 1])
  assert(d[/CANTFIND/, true].nil?)
end

def test_proc_lookups
  d = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  d[Proc.new {|v| v == "snake"}, Proc.new {|v| v != "height"}] = 1
end
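To make the lookup rules concrete, here is a minimal sketch of how a single selector (name, number, regex, proc, `true`, or an array of those) could be turned into a list of positions. `resolve_selector` is a hypothetical helper written for this post's examples; the library's real internals may look nothing like it.

```ruby
# Hypothetical helper, not taken from the library: map one selector onto
# the positions it matches within an ordered list of row or column names.
def resolve_selector(selector, names)
  case selector
  when true    then (0...names.size).to_a                 # select everything
  when Integer then [selector]                            # by position
  when String  then [names.index(selector)].compact       # by exact name
  when Regexp  then names.each_index.select { |i| names[i] =~ selector }
  when Proc    then names.each_index.select { |i| selector.call(names[i]) }
  when Array   then selector.flat_map { |s| resolve_selector(s, names) }
  else raise ArgumentError, "unsupported selector: #{selector.inspect}"
  end
end

names = ["snake", "giraffe"]
by_regex = resolve_selector(/nak/, names)
by_array = resolve_selector(["giraffe", "snake"], names)
by_proc  = resolve_selector(Proc.new { |v| v == "snake" }, names)
```

Resolving both the row selector and the column selector to positions first means the actual slicing only ever has to deal with integer indices. In this sketch an array selector keeps the order it was given in, in line with the array lookup tests above.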
The second useful thing is being able to set slices of the matrix, for example:
def test_set_atomic
  d = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  assert(d == D[[10, 3], [1, 10]])
  d["giraffe", "length"] = 2
  assert(d == D[[10, 2], [1, 10]])
  d = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  d[0,0] = 2
  assert(d == D[[2, 3], [1, 10]])
  d[2,2] = 6
  assert(d == D[[2, 3, nil], [1, 10, nil], [nil, nil, 6]])
  assert_raise(RuntimeError) { d[-1,2] = 0}
  assert_raise(RuntimeError) { d[2,-1] = 0}
  d[2,5] = 10
  pp d
  assert(d == D[[2, 3, nil, nil, nil, nil], [1, 10, nil, nil, nil, nil], [nil, nil, 6, nil, nil, 10]])
end

def test_set_vector
  d = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  d[0] = D[2,2]
  assert(d == D[[2, 2], [1, 10]])
  d["giraffe"] = D[3,3]
  assert(d == D[[3, 3], [1, 10]])
  d[true,1] = D[[4],[4]]
  assert(d == D[[3, 4], [1, 4]])
  d[true,"height"] = D[[5],[5]]
  assert(d == D[[5, 4], [5, 4]])
  d[true,"age"] = D[[4], [3]]
  assert(d == D[[5, 4, 4], [5, 4, 3]])
end

def test_set_matrix
  d1 = DataFrame[{"snake" => {"length" => 10, "height" => 1}, "giraffe" => {"length" => 3, "height" => 10}}]
  d2 = DataFrame[{"car" => {"length" => 9, "height" => 5}, "truck" => {"length" => 10, "height" => 6}}]
  d1 << d2
  assert(d1 == D[[10,3],[1,10],[5,9],[6,10]])
  d1[["snake", "car"], true] = D[[5,6], [7,8]]
  assert(d1 == D[[10,3],[5,6],[7,8],[6,10]])
  d1[["car", "snake"], true] = D[[5,6], [7,8]]
  assert(d1 == D[[10,3],[7,8],[5,6],[6,10]])
  d1[["car", "snake"], ["length","height"]] = D[[5,6], [7,8]]
  assert(d1 == D[[10,3],[8,7],[6,5],[6,10]])
end
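The growing behaviour in test_set_atomic (writing past the edge pads with nil, negative indices raise) can be sketched on plain nested arrays. `set_cell` is a hypothetical name for illustration only, and the sketch assumes the data is stored as an array of equal-length arrays, which may not be how the library stores it.

```ruby
# Hypothetical helper: write value at position (r, c) of a nested array,
# growing the grid and padding new cells with nil when the position lies
# outside the current bounds.
def set_cell(rows, r, c, value)
  # A bare string raise produces a RuntimeError, as the tests expect.
  raise "negative index" if r < 0 || c < 0

  width = [rows.map(&:size).max || 0, c + 1].max
  rows << [] while rows.size <= r                              # add missing outer arrays
  rows.each { |row| row.concat([nil] * (width - row.size)) }   # pad to equal width
  rows[r][c] = value
  rows
end

grid = [[2, 3], [1, 10]]
set_cell(grid, 2, 2, 6)
after_first = grid.map(&:dup)   # snapshot for comparison
set_cell(grid, 2, 5, 10)
```

Padding every inner array to the same width keeps the grid rectangular, so later lookups never have to special-case ragged data.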
The final thing is to add the functionality of R’s “which” function, which I haven’t done yet. But this 400-line library has served me really well so far. It’s open source if you want to contribute :).
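For anyone unfamiliar with it, R’s “which” returns the positions at which a logical condition holds. A minimal sketch of that idea on a plain Ruby array might look like the following; the `which` method here is a hypothetical stand-in, not something the library provides.

```ruby
# Hypothetical stand-in for R's which(): return the positions of the
# elements for which the block is true.
def which(values)
  values.each_index.select { |i| yield values[i] }
end

heights = [1, 10, 4]
tall = which(heights) { |h| h > 3 }   # positions are 0-based here; R counts from 1
```

The resulting positions could then be fed back in as an array selector to pull out or overwrite just those rows.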