class Statsample::Dataset

A set of cases with values for one or more variables, analogous to a data frame in R or a standard data file in SPSS. Every vector has a field name that identifies it. By default, vectors are ordered by field name, but you can change the field order manually. The Dataset works like a Hash, with field names as keys and Statsample::Vector objects as values.

Usage

Create an empty dataset:

Dataset.new()

Create a dataset with three empty vectors, called v1, v2 and v3:

Dataset.new(%w{v1 v2 v3})

Create a dataset with two vectors, called v1 and v2:

Dataset.new({'v1'=>%w{1 2 3}.to_vector, 'v2'=>%w{4 5 6}.to_vector})

Create a dataset with two given vectors (v1 and v2), with the fields in inverted order:

Dataset.new({'v2'=>v2,'v1'=>v1},['v2','v1'])

The fastest way to create a dataset is Hash#to_dataset, with the field order as argument:

v1 = [1,2,3].to_scale
v2 = [1,2,3].to_scale
ds = {'v1'=>v1, 'v2'=>v2}.to_dataset(%w{v2 v1})

Attributes

cases[R]

Number of cases

fields[R]

Ordered ids of vectors

i[R]

Location of the pointer during enumeration methods (like each)

name[RW]

Name of dataset

vectors[R]

Public Class Methods

crosstab_by_asignation(rows,columns,values)

Generates a new dataset, using three vectors:

  • Rows

  • Columns

  • Values

For example, given these values:

x   y   v
a   a   0
a   b   1
b   a   1
b   b   0

you obtain:

id  a   b
 a  0   1
 b  1   0

Useful for processing output from databases.
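
A minimal sketch of the example above, assuming the statsample gem is available and the vectors are built with Array#to_vector (the values are illustrative):

require 'statsample'
rows    = %w{a a b b}.to_vector(:nominal)
columns = %w{a b a b}.to_vector(:nominal)
values  = [0, 1, 1, 0].to_vector(:scale)
ds = Statsample::Dataset.crosstab_by_asignation(rows, columns, values)
ds.fields   # => ["_id", "a", "b"]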

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 92
def self.crosstab_by_asignation(rows,columns,values)
  raise "Three vectors should be equal size" if rows.size!=columns.size or rows.size!=values.size
  cols_values=columns.factors
  cols_n=cols_values.size
  h_rows=rows.factors.inject({}){|a,v| a[v]=cols_values.inject({}){
    |a1,v1| a1[v1]=nil; a1
    }
    ;a}
  values.each_index{|i|
    h_rows[rows[i]][columns[i]]=values[i]
  }      
  ds=Dataset.new(["_id"]+cols_values)
  cols_values.each{|c|
    ds[c].type=values.type
  }
  rows.factors.each {|row|
    n_row=Array.new(cols_n+1)
    n_row[0]=row
      cols_values.each_index {|i|
        n_row[i+1]=h_rows[row][cols_values[i]]
    }
    ds.add_case_array(n_row)
  }
  ds.update_valid_data
  ds
end
new(vectors={}, fields=[])

Creates a new dataset. A dataset is a set of ordered named vectors of the same size.

vectors

With an array, creates a set of empty vectors named after the values in the array. With a hash, each Vector is assigned as a variable of the Dataset, named after its key.

fields

Array of names for vectors. Only used to set the order of variables. If empty, the vector keys in alphabetical order are used as fields.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 158
def initialize(vectors={}, fields=[])
  @@n_dataset||=0
  @@n_dataset+=1
  @name=_("Dataset %d") % @@n_dataset
  @cases=0
  @gsl=nil
  @i=nil
  
  if vectors.instance_of? Array
    @fields=vectors.dup
    @vectors=vectors.inject({}){|a,x| a[x]=Statsample::Vector.new(); a}
  else
    # Check vectors
    @vectors=vectors
    @fields=fields
    check_order
    check_length
  end
end

Public Instance Methods

==(d2)

Two datasets are equal if their vectors and fields are the same.

@return {Boolean}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 370
def ==(d2)
  @vectors==d2.vectors and @fields==d2.fields
end
[](i)

Returns the vector named i. If i is a Range or an Array of field names, returns a clone of the dataset restricted to those fields.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 670
def[](i)
  if i.is_a? Range
    fields=from_to(i.begin,i.end)
    clone(*fields)
  elsif i.is_a? Array
    clone(i)
  else
    raise Exception,"Vector '#{i}' doesn't exists on dataset" unless @vectors.has_key?(i)
    @vectors[i]
  end
end
[]=(i,v)
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 709
def[]=(i,v)
  if v.instance_of? Statsample::Vector
    @vectors[i]=v
    check_order
  else
    raise ArgumentError,"Should pass a Statsample::Vector"
  end
end
add_case(v,uvd=true)

Inserts a case, using:

  • Array: size equal to number of vectors and values in the same order as fields

  • Hash: keys equal to fields

If uvd is false, update_valid_data is not executed after inserting the case. This is very useful to improve performance when inserting many cases, because update_valid_data performs checks on the vectors and on the dataset.
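
A short sketch of both input forms and of the fast path for bulk insertion (field names and values are illustrative, assuming statsample is loaded):

ds = Statsample::Dataset.new(%w{name age})
ds.add_case(['Ana', 30])                      # Array, ordered as fields
ds.add_case({'name' => 'Luis', 'age' => 25})  # Hash, keyed by field names
100.times { |i| ds.add_case(["p#{i}", 20 + i % 50], false) }  # skip checks
ds.update_valid_data                          # run the checks once at the end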

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 424
def add_case(v,uvd=true)
  case v
  when Array
    if (v[0].is_a? Array)
      v.each{|subv| add_case(subv,false)}
    else
      raise ArgumentError, "Input array size (#{v.size}) should be equal to fields number (#{@fields.size})" if @fields.size!=v.size
      v.each_index {|i| @vectors[@fields[i]].add(v[i],false)}
    end
  when Hash
    raise ArgumentError, "Hash keys should be equal to fields #{(v.keys - @fields).join(",")}" if @fields.sort!=v.keys.sort
    @fields.each{|f| @vectors[f].add(v[f],false)}
  else
    raise TypeError, 'Value must be a Array or a Hash'
  end
  if uvd
    update_valid_data
  end
end
add_case_array(v)

Fast version of add_case. Adds only one case and performs no error checking. You SHOULD call update_valid_data at the end of the insertion cycle.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 413
def add_case_array(v)
  v.each_index {|i| d=@vectors[@fields[i]].data; d.push(v[i])}
end
add_vector(name, vector)

Equivalent to dataset[name] = vector.

@return self

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 383
def add_vector(name, vector)
  raise ArgumentError, "Vector have different size" if vector.size!=@cases
  @vectors[name]=vector
  check_order
  self
end
add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 473
def add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name].split_by_separator(sep)
  split.each{|k,v|
    add_vector(name+join+k,v)
  }
end
add_vectors_by_split_recode(name_,join='-',sep=Statsample::SPLIT_TOKEN)
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 463
def add_vectors_by_split_recode(name_,join='-',sep=Statsample::SPLIT_TOKEN)
  split=@vectors[name_].split_by_separator(sep)
  i=1
  split.each{|k,v|
    new_field=name_+join+i.to_s
    v.name=name_+":"+k
    add_vector(new_field,v)
    i+=1
  }
end
bootstrap(n=nil)

Creates a dataset of size n with cases sampled at random (with replacement). If n is not given, uses the original number of cases.

@return {Statsample::Dataset}
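
A minimal sketch, assuming statsample is loaded (vector name and data are illustrative):

ds = {'x' => [1, 2, 3, 4, 5].to_vector(:scale)}.to_dataset
boot = ds.bootstrap      # resamples cases with replacement, same size
boot.cases               # => 5
boot['x'].mean           # mean of the resampled data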

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 399
def bootstrap(n=nil)
  n||=@cases
  ds_boot=dup_empty
  n.times do
    ds_boot.add_case_array(case_as_array(rand(n)))
  end
  ds_boot.update_valid_data
  ds_boot
end
case_as_array(i)

Retrieves case i as an array, ordered according to the fields order.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 591
def case_as_array(i)
  _case_as_array(i)
end
case_as_hash(i)

Retrieves case i as a hash

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 580
def case_as_hash(i)
  _case_as_hash(i)
end
check_fields(fields)

Checks that the given fields exist on the dataset (all fields if nil) and returns them. Useful after inserting or deleting vectors.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 502
def check_fields(fields)
  fields||=@fields
  raise "Fields #{(fields-@fields).join(", ")} doesn't exists on dataset" if (fields-@fields).size>0
  fields
end
clear_gsl()
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 728
def clear_gsl
  @gsl=nil
end
clone(*fields_to_include)

Returns a shallow copy of the Dataset. The object id will be distinct, but @vectors will be the same.

@param fields_to_include array of fields to include. With no value, includes all fields.

@return {Statsample::Dataset}
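
A small sketch contrasting #clone with #dup (described below): clone shares the vector objects, while dup copies them. Data is illustrative:

ds = {'a' => [1, 2, 3].to_vector(:scale)}.to_dataset
c  = ds.clone             # same Statsample::Vector objects
d  = ds.dup               # copied vectors
c['a'].equal?(ds['a'])    # => true
d['a'].equal?(ds['a'])    # => false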

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 257
def clone(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields.dup if fields_to_include.size==0
  ds=Dataset.new
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    ds[f]=@vectors[f]
  }
  ds.fields=fields_to_include
  ds.name=@name
  ds.update_valid_data
  ds
end
clone_only_valid(*fields_to_include)

Returns (when possible) a cheap copy of the dataset. If no vector has missing values, the original vectors are reused; if missing values are present, uses #dup_only_valid.

@param fields_to_include array of fields to include. With no value, includes all fields.

@return {Statsample::Dataset}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 242
def clone_only_valid(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields.dup if fields_to_include.size==0
  if fields_to_include.any? {|v| @vectors[v].has_missing_data?}
    dup_only_valid(fields_to_include)
  else
    clone(fields_to_include)
  end
end
col(c)

Returns vector c

@return {Statsample::Vector}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 376
def col(c)
  @vectors[c]
end
Also aliased as: vector
collect(type=:scale) { |row| ... }

Retrieves a Statsample::Vector, based on the result of a calculation performed on each case.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 683
def collect(type=:scale)
  data=[]
  each {|row|
    data.push yield(row)
  }
  Statsample::Vector.new(data,type)
end
collect_matrix() { |row,col| ... }

Generates a matrix based on the fields of the dataset.

@return {::Matrix}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 358
def collect_matrix
  rows=@fields.collect{|row|
    @fields.collect{|col|
      yield row,col
    }
  }
  Matrix.rows(rows)
end
collect_with_index(type=:scale) { |row, i| ... }

Same as #collect, but also yields the case index as a second parameter.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 691
def collect_with_index(type=:scale)
  data=[]
  each_with_index {|row, i|
    data.push(yield(row, i))
  }
  Statsample::Vector.new(data,type)
end
compute(text)

Returns a vector based on a string containing a calculation on the vectors. The calculation is eval'ed, so you can use any variable or expression valid in Ruby. For example:

a=[1,2].to_vector(:scale)
b=[3,4].to_vector(:scale)
ds={'a'=>a,'b'=>b}.to_dataset
ds.compute("a+b")
=> Vector [4,6]
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 869
def compute(text)
  @fields.each{|f|
    if @vectors[f].type==:scale
      text.gsub!(f,"row['#{f}'].to_f")
    else
      text.gsub!(f,"row['#{f}']")
    end
  }
  collect_with_index {|row, i|
    invalid=false
    @fields.each{|f|
      if @vectors[f].data_with_nils[i].nil?
        invalid=true
      end
    }
    if invalid
      nil
    else
      eval(text)
    end
  }
end
correlation_matrix(fields=nil)

Returns a correlation matrix for the fields included as parameters. By default, uses all fields of the dataset.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 749
def correlation_matrix(fields=nil)
  if fields
    ds=clone(fields)
  else
    ds=self
  end
  Statsample::Bivariate.correlation_matrix(ds)
end
covariance_matrix(fields=nil)

Returns a covariance matrix for the fields included as parameters. By default, uses all fields of the dataset.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 759
def covariance_matrix(fields=nil)
  if fields
    ds=clone(fields)
  else
    ds=self
  end
  Statsample::Bivariate.covariance_matrix(ds)
end
crosstab(v1,v2,opts={})
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 706
def crosstab(v1,v2,opts={})
  Statsample::Crosstab.new(@vectors[v1], @vectors[v2],opts)
end
delete_vector(*args)

Deletes the vector(s) with the given name(s). Multiple fields accepted.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 451
def delete_vector(*args)
  if args.size==1 and args[0].is_a? Array
    names=args[0]
  else
    names=args
  end
  names.each do |name|
    @fields.delete(name)
    @vectors.delete(name)
  end
end
dup(*fields_to_include)

Returns a duplicate of the Dataset. All vectors are copied, so any modification on the new dataset doesn't affect the original dataset's vectors. If fields are given as parameters, only those vectors are included.

@param fields_to_include array of fields to include. With no value, includes all fields.

@return {Statsample::Dataset}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 211
def dup(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  vectors={}
  fields=[]
  fields_to_include.each{|f|
    raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
    vectors[f]=@vectors[f].dup
    fields.push(f)
  }
  ds=Dataset.new(vectors,fields)
  ds.name= self.name
  ds
end
dup_empty()

Creates a copy of the dataset, without the data on the vectors.

@return {Statsample::Dataset}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 275
def dup_empty
  vectors=@vectors.inject({}) {|a,v|
    a[v[0]]=v[1].dup_empty
    a
  }
  Dataset.new(vectors,@fields.dup)
end
dup_only_valid(*fields_to_include)

Creates a copy of the dataset, deleting all cases with missing data on any of the included vectors.

@param fields_to_include array of fields to include. With no value, includes all fields.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 183
def dup_only_valid(*fields_to_include)
  if fields_to_include.size==1 and fields_to_include[0].is_a? Array
    fields_to_include=fields_to_include[0]
  end
  fields_to_include=@fields if fields_to_include.size==0
  if fields_to_include.any? {|f| @vectors[f].has_missing_data?}
    ds=Dataset.new(fields_to_include)
    fields_to_include.each {|f| ds[f].type=@vectors[f].type}
    each {|row|
      unless fields_to_include.any? {|f| @vectors[f].has_missing_data? and !@vectors[f].is_valid? row[f]}
        row_2=fields_to_include.inject({}) {|ac,v| ac[v]=row[v]; ac}
        ds.add_case(row_2)
      end
    }
  else
    ds=dup fields_to_include
  end
  ds.name= self.name
  ds
end
each() { |row| ... }

Yields each case as a hash.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 603
def each
  begin
    @i=0
    @cases.times {|i|
      @i=i
      row=case_as_hash(i)
      yield row
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end
each_array() { |row| ... }

Yields each case as an array.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 647
def each_array
  @cases.times {|i|
    @i=i
    row=case_as_array(i)
    yield row
  }
  @i=nil
end
each_array_with_nils() { |row| ... }

Yields each case as an array, coding missing values as nils.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 633
def each_array_with_nils
  m=fields.size
  @cases.times {|i|
    @i=i
    row=Array.new(m)
    fields.each_index{|j|
      f=fields[j]
      row[j]=@vectors[f].data_with_nils[i]
    }
    yield row
  }
  @i=nil
end
each_vector() { |key, vector| ... }

Yields each vector as [key, vector].

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 570
def each_vector # :yield: |key, vector|
  @fields.each{|k| yield k, @vectors[k]}
end
each_with_index() { |case, i| ... }

Yields each case as a hash along with its index.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 618
def each_with_index # :yield: |case, i|
  begin
    @i=0
    @cases.times{|i|
      @i=i
      row=case_as_hash(i)
      yield row, i
    }
    @i=nil
  rescue =>e
    raise DatasetException.new(self, e)
  end
end
fields=(f)

Sets the fields order. If you omit one or more vectors, the omitted ones are ordered alphabetically.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 657
def fields=(f)
  @fields=f
  check_order
end
filter() { |c| ... }

Creates a new dataset with all the cases for which the block returns true.
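
A minimal sketch (field name and cutoff are illustrative):

ds = {'age' => [12, 18, 25, 40].to_vector(:scale)}.to_dataset
adults = ds.filter { |c| c['age'] >= 18 }
adults.cases              # => 3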

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 769
def filter
  ds=self.dup_empty
  each {|c|
    ds.add_case(c, false) if yield c
  }
  ds.update_valid_data
  ds.name=_("%s(filtered)") % @name
  ds
end
filter_field(field) { |c| ... }

Creates a new vector with the data of the given field, for the cases for which the block returns true.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 780
def filter_field(field)
  a=[]
  each do |c|
    a.push(c[field]) if yield c
  end
  a.to_vector(@vectors[field].type)
end
from_to(from,to)

Returns an array with the fields from the first argument to the last.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 230
def from_to(from,to)
  raise ArgumentError, "Field #{from} should be on dataset" if !@fields.include? from
  raise ArgumentError, "Field #{to} should be on dataset" if !@fields.include? to
  @fields.slice(@fields.index(from)..@fields.index(to))
end
has_missing_data?()

Returns true if any vector has missing data.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 119
def has_missing_data?
  @vectors.any? {|k,v| v.has_missing_data?}
end
has_vector?(v)

Returns true if the dataset has vector v.

@return {Boolean}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 392
def has_vector? (v)
  return @vectors.has_key?(v)
end
inspect()
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 921
def inspect
  self.to_s
end
join(other_ds,fields_1=[],fields_2=[],type=:left)

Joins two Datasets by the given fields. type is one of :left or :inner; the default is :left.

@return {Statsample::Dataset}
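
A minimal sketch of a left join on a single key (dataset contents and field names are illustrative):

people = {'id'   => %w{1 2 3}.to_vector,
          'name' => %w{ana bob eva}.to_vector}.to_dataset(%w{id name})
cities = {'id'   => %w{1 2}.to_vector,
          'city' => %w{lima cusco}.to_vector}.to_dataset(%w{id city})
joined = people.join(cities, ['id'], ['id'], :left)
joined.fields             # => ["id", "name", "city"]
joined.cases              # => 3 (the unmatched id gets a nil city)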

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 308
def join(other_ds,fields_1=[],fields_2=[],type=:left)
  fields_new = other_ds.fields - fields_2
  fields = self.fields + fields_new

  other_ds_hash = {}
  other_ds.each do |row|
    key = row.select{|k,v| fields_2.include?(k)}.values
    value = row.select{|k,v| fields_new.include?(k)}
    if other_ds_hash[key].nil?
      other_ds_hash[key] = [value]
    else
      other_ds_hash[key] << value
    end
  end

  new_ds = Dataset.new(fields)

  self.each do |row|
    key = row.select{|k,v| fields_1.include?(k)}.values

    new_case = row.dup

    if other_ds_hash[key].nil?
      if type == :left
        fields_new.each{|field| new_case[field] = nil}
        new_ds.add_case(new_case)
      end
    else
      other_ds_hash[key].each do |new_values|
        new_ds.add_case new_case.merge(new_values)
      end
    end

  end
  new_ds
end
merge(other_ds)

Merges the vectors from two datasets. In case of a name collision, the vector names are changed to x_1, x_2, ....

@return {Statsample::Dataset}
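
A minimal sketch; note that both datasets must have the same number of cases (data is illustrative):

ds1 = {'x' => [1, 2].to_vector(:scale)}.to_dataset
ds2 = {'y' => [3, 4].to_vector(:scale)}.to_dataset
ds3 = ds1.merge(ds2)
ds3.fields                # => ["x", "y"]
ds3.cases                 # => 2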

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 287
def merge(other_ds)
  raise "Cases should be equal (this:#{@cases}; other:#{other_ds.cases}" unless @cases==other_ds.cases
  types = @fields.collect{|f| @vectors[f].type} + other_ds.fields.collect{|f| other_ds[f].type}
  new_fields = (@fields+other_ds.fields).recode_repeated
  ds_new=Statsample::Dataset.new(new_fields)
  new_fields.each_index{|i|
    field=new_fields[i]
    ds_new[field].type=types[i]
  }
  @cases.times {|i|
    row=case_as_array(i)+other_ds.case_as_array(i)
    ds_new.add_case_array(row)
  }
  ds_new.update_valid_data
  ds_new
end
nest(*tree_keys,&block)

Returns a nested hash, using the given fields as keys and arrays of hashes with the remaining values. If a block is provided, it is used to compute the values; it receives the row of the dataset, the current innermost hash in the hierarchy, and the name of the key to include.
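
A small sketch without a block (field names and data are illustrative):

ds = {'region' => %w{north north south}.to_vector,
      'city'   => %w{lima trujillo cusco}.to_vector,
      'mayor'  => %w{ana bob eva}.to_vector}.to_dataset(%w{region city mayor})
ds.nest('region', 'city')
# => {"north"=>{"lima"=>[{"mayor"=>"ana"}], "trujillo"=>[{"mayor"=>"bob"}]},
#     "south"=>{"cusco"=>[{"mayor"=>"eva"}]}}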

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 128
def nest(*tree_keys,&block)
  tree_keys=tree_keys[0] if tree_keys[0].is_a? Array
  out=Hash.new      
  each do |row|
    current=out        
    # Create tree
    tree_keys[0,tree_keys.size-1].each do |f|
      root=row[f]
      current[root]||=Hash.new
      current=current[root]
    end
    name=row[tree_keys.last]
    if !block
      current[name]||=Array.new
      current[name].push(row.delete_if{|key,value| tree_keys.include? key})
    else
      current[name]=block.call(row, current,name)
    end
  end
  out
end
one_to_many(parent_fields, pattern)

Creates a new dataset for one-to-many relations on a dataset, based on a pattern of field names.

For example, if you have a survey on the number of children with this structure:

id, name, child_name_1, child_age_1, child_name_2, child_age_2

then with

ds.one_to_many(%w{id}, "child_%v_%n")

the fields in the first parameter are copied verbatim to the new dataset, and the fields matching the second pattern add one case for each distinct %n. For example:

cases=[
  ['1','george','red',10,'blue',20,nil,nil],
  ['2','fred','green',15,'orange',30,'white',20],
  ['3','alfred',nil,nil,nil,nil,nil,nil]
]
ds=Statsample::Dataset.new(%w{id name car_color1 car_value1 car_color2 car_value2 car_color3 car_value3})
cases.each {|c| ds.add_case_array c }
ds.one_to_many(['id'],'car_%v%n').to_matrix
=> Matrix[
   ["red", "1", 10], 
   ["blue", "1", 20],
   ["green", "2", 15],
   ["orange", "2", 30],
   ["white", "2", 20]
   ]
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 952
def one_to_many(parent_fields, pattern)
  #base_pattern=pattern.gsub(/%v|%n/,"")
  re=Regexp.new pattern.gsub("%v","(.+?)").gsub("%n","(\\d+?)")
  ds_vars=parent_fields
  vars=[]
  max_n=0
  h=parent_fields.inject({}) {|a,v| a[v]=Statsample::Vector.new([], @vectors[v].type);a }
  # Adding _row_id
  h['_col_id']=[].to_scale
  ds_vars.push("_col_id")
  @fields.each do |f|
    if f=~re
      if !vars.include? $1
        vars.push($1) 
        h[$1]=Statsample::Vector.new([], @vectors[f].type)
      end
      max_n=$2.to_i if max_n < $2.to_i
    end
  end
  ds=Dataset.new(h,ds_vars+vars)
  each do |row|
    row_out={}
    parent_fields.each do |f|
      row_out[f]=row[f]
    end
    max_n.times do |n1|
      n=n1+1
      any_data=false
      vars.each do |v|
        data=row[pattern.gsub("%v",v.to_s).gsub("%n",n.to_s)]
        row_out[v]=data
        any_data=true if !data.nil?
      end
      if any_data
        row_out["_col_id"]=n
        ds.add_case(row_out,false)
      end
      
    end
  end
  ds.update_valid_data
  ds
end
recode!(vector_name) { |row| ... }

Recodes a vector based on a block.
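
A minimal sketch (vector name and recoding rule are illustrative):

ds = {'score' => [1, 2, 3].to_vector(:scale)}.to_dataset
ds.recode!('score') { |row| row['score'] * 10 }
ds['score'].data          # => [10, 20, 30]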

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 699
def recode!(vector_name)
  0.upto(@cases-1) {|i|
    @vectors[vector_name].data[i]=yield case_as_hash(i)
  }
  @vectors[vector_name].set_valid_data
end
report_building(b)
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 995
def report_building(b)
  b.section(:name=>@name) do |g|
    g.text _"Cases: %d"  % cases
    @fields.each do |f|
      g.text "Element:[#{f}]"
      g.parse_element(@vectors[f])
    end
  end
end
standarize()

Returns a dataset with standardized data.

@return {Statsample::Dataset}

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 347
def standarize
  ds=dup()
  ds.fields.each do |f|
    ds[f]=ds[f].vector_standarized
  end
  ds
end
to_REXP()
# File lib/statsample/rserve_extension.rb, line 11
def to_REXP
  names=@fields
  data=@fields.map {|f|
    Rserve::REXP::Wrapper.wrap(@vectors[f].data_with_nils)
  }
  l=Rserve::Rlist.new(data,names)
  Rserve::REXP.create_data_frame(l)
end
to_gsl()
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 732
def to_gsl
  if @gsl.nil?
    if cases.nil?
      update_valid_data
    end
    @gsl=GSL::Matrix.alloc(cases,fields.size)
    self.each_array{|c|
      @gsl.set_row(@i,c)
    }
  end
  @gsl
end
to_matrix()

Returns the data as a matrix. Columns are ordered by fields and rows by order of insertion.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 719
def to_matrix
  rows=[]
  self.each_array{|c|
    rows.push(c)
  }
  Matrix.rows(rows)
end
to_multiset_by_split(*fields)

Creates a Statsample::Multiset, using one or more fields to split the dataset.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 792
def to_multiset_by_split(*fields)
  require 'statsample/multiset'
  if fields.size==1
    to_multiset_by_split_one_field(fields[0])
  else
    to_multiset_by_split_multiple_fields(*fields)
  end
end
to_multiset_by_split_multiple_fields(*fields)
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 823
def to_multiset_by_split_multiple_fields(*fields)
  factors_total=nil
  fields.each do |f|
    if factors_total.nil?
      factors_total=@vectors[f].factors.collect{|c|
        [c]
      }
    else
      suma=[]
      factors=@vectors[f].factors
      factors_total.each{|f1| factors.each{|f2| suma.push(f1+[f2]) } }
      factors_total=suma
    end
  end
  ms=Multiset.new_empty_vectors(@fields,factors_total)

  p1=eval "Proc.new {|c| ms[["+fields.collect{|f| "c['#{f}']"}.join(",")+"]].add_case(c,false) }"
  each{|c| p1.call(c)}
  
  ms.datasets.each do |k,ds|
    ds.update_valid_data
    ds.name=fields.size.times.map {|i|
      f=fields[i]
      sk=k[i]
      @vectors[f].labeling(sk)
    }.join("-")
    ds.vectors.each{|k1,v1| 
      v1.type=@vectors[k1].type
      v1.name=@vectors[k1].name
      v1.labels=@vectors[k1].labels
      
    }
  end
  ms
  
end
to_multiset_by_split_one_field(field)

Creates a Statsample::Multiset, using one field

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 802
def to_multiset_by_split_one_field(field)
  raise ArgumentError,"Should use a correct field name" if !@fields.include? field
  factors=@vectors[field].factors
  ms=Multiset.new_empty_vectors(@fields, factors)
  each {|c|
    ms[c[field]].add_case(c,false)
  }
  #puts "Ingreso a los dataset"
  ms.datasets.each {|k,ds|
    ds.update_valid_data
    ds.name=@vectors[field].labeling(k)
    ds.vectors.each{|k1,v1|
      #        puts "Vector #{k1}:"+v1.to_s
      v1.type=@vectors[k1].type
      v1.name=@vectors[k1].name
      v1.labels=@vectors[k1].labels
      
    }
  }
  ms
end
to_s()
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 918
def to_s
  "#<"+self.class.to_s+":"+self.object_id.to_s+" @name=#{@name} @fields=["+@fields.join(",")+"] cases="+@vectors[@fields[0]].size.to_s
end
update_valid_data()

Checks vectors and fields after inserting data. Use only after add_case_array, or add_case with the second parameter set to false.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 445
def update_valid_data
  @gsl=nil
  @fields.each{|f| @vectors[f].set_valid_data}
  check_length
end
vector(c)
Alias for: col
vector_by_calculation(type=:scale) { |row| ... }
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 480
def vector_by_calculation(type=:scale)
  a=[]
  each do |row|
    a.push(yield(row))
  end
  a.to_vector(type)
end
vector_count_characters(fields=nil)
# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 517
def vector_count_characters(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0){|a,v|
      a+((@vectors[v].data_with_nils[i].nil?) ? 0: row[v].to_s.size)
    }
  end
end
vector_mean(fields=nil, max_invalid=0)

Returns a vector with the mean for a set of fields. If the fields parameter is empty, returns the mean over all fields. If max_invalid > 0, returns the mean for every case with at most max_invalid invalid (missing) fields.
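
A short sketch showing the effect of max_invalid (data is illustrative):

ds = {'a' => [1, 2, nil].to_vector(:scale),
      'b' => [3, 4, 5].to_vector(:scale)}.to_dataset
ds.vector_mean              # => Vector [2.0, 3.0, nil]
ds.vector_mean(%w{a b}, 1)  # => Vector [2.0, 3.0, 5.0] (one missing field allowed)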

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 529
def vector_mean(fields=nil, max_invalid=0)
  a=[]
  fields=check_fields(fields)
  size=fields.size
  each_with_index do |row, i |
    # numero de invalidos
    sum=0
    invalids=0
    fields.each{|f|
      if !@vectors[f].data_with_nils[i].nil?
        sum+=row[f].to_f
      else
        invalids+=1
      end
    }
    if(invalids>max_invalid)
      a.push(nil)
    else
      a.push(sum.quo(size-invalids))
    end
  end
  a=a.to_vector(:scale)
  a.name=_("Means from %s") % @name
  a
end
vector_missing_values(fields=nil)

Returns a vector with the number of missing values for each case.

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 509
def vector_missing_values(fields=nil)
  fields=check_fields(fields)
  collect_with_index do |row, i|
    fields.inject(0) {|a,v|
      a+ ((@vectors[v].data_with_nils[i].nil?) ? 1: 0)
    }
  end
end
vector_sum(fields=nil)

Returns a vector with the sum of the fields. If the fields parameter is empty, sums all fields.
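
A short sketch; a case with any missing value among the summed fields yields nil (data is illustrative):

ds = {'a' => [1, 2].to_vector(:scale),
      'b' => [3, nil].to_vector(:scale)}.to_dataset
ds.vector_sum             # => Vector [4.0, nil]
ds.vector_sum(%w{a})      # => Vector [1.0, 2.0]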

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 489
def vector_sum(fields=nil)
  fields||=@fields
  vector=collect_with_index do |row, i|
    if(fields.find{|f| !@vectors[f].data_with_nils[i]})
      nil
    else
      fields.inject(0) {|ac,v| ac + row[v].to_f}
    end
  end
  vector.name=_("Sum from %s") % @name
  vector
end
verify(*tests)

Tests each row with one or more tests. Each test is an array of the form [description, fields, proc], where the proc has the form

Proc.new {|row| row['age']>0}

The function returns an array with all errors. An optional leading String argument names the field used as the case id (the first field by default).
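
A minimal sketch, inferring from the source below that each test is a [description, fields, proc] triple and that an optional leading String names the id field (names and data are illustrative):

ds = {'id'  => %w{c1 c2}.to_vector,
      'age' => [25, -4].to_vector(:scale)}.to_dataset(%w{id age})
positive_age = ["Age should be positive", ['age'], Proc.new { |row| row['age'] > 0 }]
ds.verify('id', positive_age)   # => one error string, for case c2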

# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 895
def verify(*tests)
  if(tests[0].is_a? String)
    id=tests[0]
    tests.shift
  else
    id=@fields[0]
  end
  vr=[]
  i=0
  each do |row|
    i+=1
    tests.each{|test|
      if ! test[2].call(row)
        values=""
        if test[1].size>0
          values=" ("+test[1].collect{|k| "#{k}=#{row[k]}"}.join(", ")+")"
        end
        vr.push("#{i} [#{row[id]}]: #{test[0]}#{values}")
      end
    }
  end
  vr
end