# class Statsample::Dataset

A set of cases with values for one or more variables, analogous to a data frame in R or a standard data file in SPSS. Every vector has a `#field` name, which identifies it. By default, the vectors are ordered by field name, but you can change the field order manually. The Dataset works like a Hash whose keys are field names and whose values are Statsample::Vector objects.

## Usage

Create an empty dataset:

`Dataset.new()`

Create a dataset with three empty vectors, called `v1`, `v2` and `v3`:

`Dataset.new(%w{v1 v2 v3})`

Create a dataset with two vectors, called `v1` and `v2`:

```Dataset.new({'v1'=>%w{1 2 3}.to_vector, 'v2'=>%w{4 5 6}.to_vector})
```

Create a dataset with two given vectors (`v1` and `v2`), with the vectors in inverted order:

```Dataset.new({'v2'=>v2,'v1'=>v1},['v2','v1'])
```

The fastest way to create a dataset is Hash#to_dataset, with the field order as its argument:

```v1 = [1,2,3].to_scale
v2 = [1,2,3].to_scale
ds = {'v1'=>v1, 'v2'=>v2}.to_dataset(%w{v2 v1})
```

### Attributes

cases[R]

Number of cases

fields[R]

Ordered ids of vectors

i[R]

Position of the pointer during enumeration methods (such as `each`)

name[RW]

Name of dataset

vectors[R]

### Public Class Methods

crosstab_by_asignation(rows,columns,values)

Generates a new dataset from three vectors:

• Rows

• Columns

• Values

For example, you have these values

```x   y   v
a   a   0
a   b   1
b   a   1
b   b   0```

You obtain

```id  a   b
a  0   1
b  1   0```

Useful for processing output from databases.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 92
def self.crosstab_by_asignation(rows,columns,values)
raise "Three vectors should be equal size" if rows.size!=columns.size or rows.size!=values.size
cols_values=columns.factors
cols_n=cols_values.size
h_rows=rows.factors.inject({}){|a,v| a[v]=cols_values.inject({}){
|a1,v1| a1[v1]=nil; a1
}
;a}
values.each_index{|i|
h_rows[rows[i]][columns[i]]=values[i]
}
ds=Dataset.new(["_id"]+cols_values)
cols_values.each{|c|
ds[c].type=values.type
}
rows.factors.each {|row|
n_row=Array.new(cols_n+1)
n_row[0]=row
cols_values.each_index {|i|
n_row[i+1]=h_rows[row][cols_values[i]]
}
ds.add_case_array(n_row)
}
ds.update_valid_data
ds
end```
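
The pivot that crosstab_by_asignation performs can be illustrated without the library. The following is a minimal sketch in plain Ruby over arrays and hashes, not Statsample code: the helper name `pivot_by_assignment` is hypothetical, and distinct values are taken in order of first appearance, which may differ from the order that `factors` returns.

```ruby
# Hypothetical helper (not part of Statsample): pivots three parallel arrays
# into one row per distinct row id, with one column per distinct column id.
def pivot_by_assignment(rows, columns, values)
  unless rows.size == columns.size && rows.size == values.size
    raise ArgumentError, "Three arrays should be of equal size"
  end
  col_ids = columns.uniq
  # Start every cell as nil so missing combinations stay visible.
  table = rows.uniq.each_with_object({}) do |r, acc|
    acc[r] = col_ids.each_with_object({}) { |c, h| h[c] = nil }
  end
  # Assign each value to its (row, column) cell.
  rows.each_index { |i| table[rows[i]][columns[i]] = values[i] }
  # Emit each row as a hash keyed by "_id" plus the column ids.
  table.map { |r, cells| { "_id" => r }.merge(cells) }
end

result = pivot_by_assignment(%w[a a b b], %w[a b a b], [0, 1, 1, 0])
result.each { |row| puts row.inspect }
```

With the example values shown earlier, each resulting row carries its id under `"_id"`, mirroring the `"_id"` field the source uses.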
new(vectors={}, fields=[])

Creates a new dataset. A dataset is a set of ordered, named vectors of the same size.

vectors

With an array, creates a set of empty vectors named after the values in the array. With a hash, each Vector is assigned as a variable of the Dataset, named after its key.

fields

Array of vector names. Used only to set the order of variables. If empty, the vectors' keys in alphabetical order are used as fields.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 158
def initialize(vectors={}, fields=[])
@@n_dataset||=0
@@n_dataset+=1
@name=_("Dataset %d") % @@n_dataset
@cases=0
@gsl=nil
@i=nil

if vectors.instance_of? Array
@fields=vectors.dup
@vectors=vectors.inject({}){|a,x| a[x]=Statsample::Vector.new(); a}
else
# Check vectors
@vectors=vectors
@fields=fields
check_order
check_length
end
end```

### Public Instance Methods

==(d2)

Two datasets are equal if their `vectors` and `fields` are the same.

@return {Boolean}

```def ==(d2)
@vectors==d2.vectors and @fields==d2.fields
end```
[](i)

Returns the vector named `i`. If `i` is a Range or an Array of field names, returns a new dataset (built with `clone`) containing those fields.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 670
def[](i)
if i.is_a? Range
fields=from_to(i.begin,i.end)
clone(*fields)
elsif i.is_a? Array
clone(i)
else
raise Exception,"Vector '#{i}' doesn't exists on dataset" unless @vectors.has_key?(i)
@vectors[i]
end
end```
[]=(i,v)
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 709
def[]=(i,v)
if v.instance_of? Statsample::Vector
@vectors[i]=v
check_order
else
raise ArgumentError,"Should pass a Statsample::Vector"
end
end```
add_case(v,uvd=true)

Inserts a case, using:

• Array: size equal to number of vectors and values in the same order as fields

• Hash: keys equal to fields

If `uvd` is false, update_valid_data is not executed after inserting the case. This is very useful for improving performance when inserting many cases, because update_valid_data performs checks on the vectors and on the dataset.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 424
def add_case(v,uvd=true)
case v
when Array
if (v[0].is_a? Array)
v.each{|subv| add_case(subv,false)}
else
raise ArgumentError, "Input array size (#{v.size}) should be equal to fields number (#{@fields.size})" if @fields.size!=v.size
v.each_index {|i| @vectors[@fields[i]].add(v[i],false)}
end
when Hash
raise ArgumentError, "Hash keys should be equal to fields #{(v.keys - @fields).join(",")}" if @fields.sort!=v.keys.sort
@fields.each{|f| @vectors[f].add(v[f],false)}
else
raise TypeError, 'Value must be a Array or a Hash'
end
if uvd
update_valid_data
end
end```
add_case_array(v)

Fast version of add_case. Can only add one case, and no error checking is performed. You SHOULD call update_valid_data at the end of the insertion cycle.

```def add_case_array(v)
v.each_index {|i| d=@vectors[@fields[i]].data; d.push(v[i])}
end```
add_vector(name, vector)

Equivalent to `Dataset[name]=vector`

@return self

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 383
def add_vector(name, vector)
raise ArgumentError, "Vector have different size" if vector.size!=@cases
@vectors[name]=vector
check_order
self
end```
add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 473
def add_vectors_by_split(name,join='-',sep=Statsample::SPLIT_TOKEN)
split=@vectors[name].split_by_separator(sep)
split.each{|k,v|
add_vector(name+join+k,v)
}
end```
add_vectors_by_split_recode(name_,join='-',sep=Statsample::SPLIT_TOKEN)
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 463
def add_vectors_by_split_recode(name_,join='-',sep=Statsample::SPLIT_TOKEN)
split=@vectors[name_].split_by_separator(sep)
i=1
split.each{|k,v|
new_field=name_+join+i.to_s
v.name=name_+":"+k
add_vector(new_field,v)
i+=1
}
end```
bootstrap(n=nil)

Creates a dataset of size `n` from randomly drawn cases. If `n` is not given, uses the original number of cases.

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 399
def bootstrap(n=nil)
n||=@cases
ds_boot=dup_empty
n.times do
# draw one case at random, with replacement
ds_boot.add_case_array(case_as_array(rand(@cases)))
end
ds_boot.update_valid_data
ds_boot
end```
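
Bootstrap resampling draws cases at random with replacement. The following is a minimal sketch of that idea in plain Ruby, assuming cases are stored as an array of arrays; the helper name `bootstrap_cases` and the injectable `rng` parameter are hypothetical and not part of Statsample.

```ruby
# Hypothetical helper (not part of Statsample): draws n cases with
# replacement from an array of cases. Defaults to the original size.
def bootstrap_cases(cases, n = nil, rng = Random.new)
  raise ArgumentError, "no cases to sample" if cases.empty?
  n ||= cases.size
  # Each draw picks an index in 0...cases.size, so a case can repeat.
  Array.new(n) { cases[rng.rand(cases.size)] }
end

cases = [[1, "a"], [2, "b"], [3, "c"]]
sample = bootstrap_cases(cases, 5, Random.new(42))
puts sample.size
```

Passing a seeded `Random` makes a resample reproducible, which helps when comparing results across runs.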
case_as_array(i)

Retrieves case `i` as an array, ordered by field order

```def case_as_array(i)
_case_as_array(i)
end```
case_as_hash(i)

Retrieves case i as a hash

```def case_as_hash(i)
_case_as_hash(i)
end```
check_fields(fields)

Checks that the fields attribute is correct after inserting or deleting vectors

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 502
def check_fields(fields)
fields||=@fields
raise "Fields #{(fields-@fields).join(", ")} doesn't exists on dataset" if (fields-@fields).size>0
fields
end```
clear_gsl()
```def clear_gsl
@gsl=nil
end```
clone(*fields_to_include)

Returns a shallow copy of the Dataset. The object id will be distinct, but `@vectors` will be the same.

@param array of fields to include. With no value, all fields are included

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 257
def clone(*fields_to_include)
if fields_to_include.size==1 and fields_to_include[0].is_a? Array
fields_to_include=fields_to_include[0]
end
fields_to_include=@fields.dup if fields_to_include.size==0
ds=Dataset.new
fields_to_include.each{|f|
raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
ds[f]=@vectors[f]
}
ds.fields=fields_to_include
ds.name=@name
ds.update_valid_data
ds
end```
clone_only_valid(*fields_to_include)

Returns (when possible) a cheap copy of the dataset. If no vector has missing values, returns the original vectors. If missing values are present, uses #dup_only_valid.

@param array of fields to include. With no value, all fields are included

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 242
def clone_only_valid(*fields_to_include)
if fields_to_include.size==1 and fields_to_include[0].is_a? Array
fields_to_include=fields_to_include[0]
end
fields_to_include=@fields.dup if fields_to_include.size==0
if fields_to_include.any? {|v| @vectors[v].has_missing_data?}
dup_only_valid(fields_to_include)
else
clone(fields_to_include)
end
end```
col(c)

Returns vector `c`

@return {Statsample::Vector}

```def col(c)
@vectors[c]
end```
Also aliased as: vector
collect(type=:scale) { |row| ... }

Returns a Statsample::Vector built from the result of a calculation performed on each case.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 683
def collect(type=:scale)
data=[]
each {|row|
data.push yield(row)
}
Statsample::Vector.new(data,type)
end```
collect_matrix() { |row,col| ... }

Generates a matrix by yielding every pair of field names (`row`, `col`) of the dataset to the block

@return {::Matrix}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 358
def collect_matrix
rows=@fields.collect{|row|
@fields.collect{|col|
yield row,col
}
}
Matrix.rows(rows)
end```
collect_with_index(type=:scale) { |row, i| ... }

Same as #collect, but passes the case index as the second block parameter.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 691
def collect_with_index(type=:scale)
data=[]
each_with_index {|row, i|
data.push(yield(row, i))
}
Statsample::Vector.new(data,type)
end```
compute(text)

Returns a vector based on a string containing a calculation over vectors. The calculation is eval'ed, so you can use any variable or expression valid in Ruby. For example:

```a=[1,2].to_vector(:scale)
b=[3,4].to_vector(:scale)
ds={'a'=>a,'b'=>b}.to_dataset
ds.compute("a+b")
=> Vector [4,6]
```
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 869
def compute(text)
@fields.each{|f|
if @vectors[f].type==:scale
text.gsub!(f,"row['#{f}'].to_f")
else
text.gsub!(f,"row['#{f}']")
end
}
collect_with_index {|row, i|
invalid=false
@fields.each{|f|
if @vectors[f].data_with_nils[i].nil?
invalid=true
end
}
if invalid
nil
else
eval(text)
end
}
end```
correlation_matrix(fields=nil)

Returns a correlation matrix for the fields given as parameters. By default, uses all fields of the dataset

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 749
def correlation_matrix(fields=nil)
if fields
ds=clone(fields)
else
ds=self
end
Statsample::Bivariate.correlation_matrix(ds)
end```
covariance_matrix(fields=nil)

Returns a covariance matrix for the fields given as parameters. By default, uses all fields of the dataset

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 759
def covariance_matrix(fields=nil)
if fields
ds=clone(fields)
else
ds=self
end
Statsample::Bivariate.covariance_matrix(ds)
end```
crosstab(v1,v2,opts={})
```def crosstab(v1,v2,opts={})
Statsample::Crosstab.new(@vectors[v1], @vectors[v2],opts)
end```
delete_vector(*args)

Deletes the vector named `name`. Multiple names are accepted, either as separate arguments or as a single Array.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 451
def delete_vector(*args)
if args.size==1 and args[0].is_a? Array
names=args[0]
else
names=args
end
names.each do |name|
@fields.delete(name)
@vectors.delete(name)
end
end```
dup(*fields_to_include)

Returns a duplicate of the Dataset. All vectors are copied, so any modification on new dataset doesn't affect original dataset's vectors. If fields given as parameter, only include those vectors.

@param array of fields to include. No value include all fields

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 211
def dup(*fields_to_include)
if fields_to_include.size==1 and fields_to_include[0].is_a? Array
fields_to_include=fields_to_include[0]
end
fields_to_include=@fields if fields_to_include.size==0
vectors={}
fields=[]
fields_to_include.each{|f|
raise "Vector #{f} doesn't exists" unless @vectors.has_key? f
vectors[f]=@vectors[f].dup
fields.push(f)
}
ds=Dataset.new(vectors,fields)
ds.name= self.name
ds
end```
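
The difference between the shallow copy described for `clone` and the vector-by-vector copy described for `dup` can be seen with plain Ruby objects. This is only an illustrative sketch: a Hash of Arrays stands in for the dataset's vectors, and no Statsample classes are used.

```ruby
# Stand-in for a dataset's vectors: a Hash of plain Arrays (an assumption
# made for illustration; real datasets hold Statsample::Vector objects).
vectors = { "v1" => [1, 2, 3] }

# Shallow copy: a new container that shares the same vector objects.
shallow = vectors.clone

# Deep copy: every vector is duplicated as well.
deep = vectors.each_with_object({}) { |(k, v), h| h[k] = v.dup }

# Mutating the original vector is visible through the shallow copy only.
vectors["v1"] << 4

puts shallow["v1"].inspect
puts deep["v1"].inspect
```

This is why a modification made through a shallow copy can leak into the original, while a copy whose vectors were duplicated stays independent.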
dup_empty()

Creates a copy of the given dataset, without data on vectors

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 275
def dup_empty
vectors=@vectors.inject({}) {|a,v|
a[v[0]]=v[1].dup_empty
a
}
Dataset.new(vectors,@fields.dup)
end```
dup_only_valid(*fields_to_include)

Creates a copy of the given dataset, deleting all the cases with missing data on one of the vectors.

@param array of fields to include. No value include all fields

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 183
def dup_only_valid(*fields_to_include)
if fields_to_include.size==1 and fields_to_include[0].is_a? Array
fields_to_include=fields_to_include[0]
end
fields_to_include=@fields if fields_to_include.size==0
if fields_to_include.any? {|f| @vectors[f].has_missing_data?}
ds=Dataset.new(fields_to_include)
fields_to_include.each {|f| ds[f].type=@vectors[f].type}
each {|row|
unless fields_to_include.any? {|f| @vectors[f].has_missing_data? and !@vectors[f].is_valid? row[f]}
row_2=fields_to_include.inject({}) {|ac,v| ac[v]=row[v]; ac}
ds.add_case(row_2)
end
}
else
ds=dup fields_to_include
end
ds.name= self.name
ds
end```
each() { |row| ... }

Returns each case as a hash

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 603
def each
begin
@i=0
@cases.times {|i|
@i=i
row=case_as_hash(i)
yield row
}
@i=nil
rescue =>e
raise DatasetException.new(self, e)
end
end```
each_array() { |row| ... }

Returns each case as an array

```def each_array
@cases.times {|i|
@i=i
row=case_as_array(i)
yield row
}
@i=nil
end```
each_array_with_nils() { |row| ... }

Returns each case as an array, coding missing values as nils

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 633
def each_array_with_nils
m=fields.size
@cases.times {|i|
@i=i
row=Array.new(m)
fields.each_index{|j|
f=fields[j]
row[j]=@vectors[f].data_with_nils[i]
}
yield row
}
@i=nil
end```
each_vector() { |key, vector| ... }

Retrieves each vector as [key, vector]

```def each_vector # :yield: |key, vector|
@fields.each{|k| yield k, @vectors[k]}
end```
each_with_index() { |case, i| ... }

Returns each case as hash and index

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 618
def each_with_index # :yield: |case, i|
begin
@i=0
@cases.times{|i|
@i=i
row=case_as_hash(i)
yield row, i
}
@i=nil
rescue =>e
raise DatasetException.new(self, e)
end
end```
fields=(f)

Sets the field order. Any vectors you omit are placed in alphabetical order.

```def fields=(f)
@fields=f
check_order
end```
filter() { |c| ... }

Creates a new dataset with all cases for which the block returns true

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 769
def filter
ds=self.dup_empty
each {|c|
ds.add_case(c, false) if yield c
}
ds.update_valid_data
ds.name=_("%s(filtered)") % @name
ds
end```
filter_field(field) { |c| ... }

Creates a new vector with the data of a given field, for the cases in which the block returns true

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 780
def filter_field(field)
a=[]
each do |c|
a.push(c[field]) if yield c
end
a.to_vector(@vectors[field].type)
end```
from_to(from,to)

Returns an array with the fields from the first argument to the last argument

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 230
def from_to(from,to)
raise ArgumentError, "Field #{from} should be on dataset" if !@fields.include? from
raise ArgumentError, "Field #{to} should be on dataset" if !@fields.include? to
@fields.slice(@fields.index(from)..@fields.index(to))
end```
has_missing_data?()

Return true if any vector has missing data

```def has_missing_data?
@vectors.any? {|k,v| v.has_missing_data?}
end```
has_vector?(v)

Returns true if the dataset has vector `v`.

@return {Boolean}

```def has_vector? (v)
return @vectors.has_key?(v)
end```
inspect()
```def inspect
self.to_s
end```
join(other_ds,fields_1=[],fields_2=[],type=:left)

Joins two datasets by the given fields. `type` is one of :left and :inner; the default is :left.

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 308
def join(other_ds,fields_1=[],fields_2=[],type=:left)
fields_new = other_ds.fields - fields_2
fields = self.fields + fields_new

other_ds_hash = {}
other_ds.each do |row|
key = row.select{|k,v| fields_2.include?(k)}.values
value = row.select{|k,v| fields_new.include?(k)}
if other_ds_hash[key].nil?
other_ds_hash[key] = [value]
else
other_ds_hash[key] << value
end
end

new_ds = Dataset.new(fields)

self.each do |row|
key = row.select{|k,v| fields_1.include?(k)}.values

new_case = row.dup

if other_ds_hash[key].nil?
if type == :left
fields_new.each{|field| new_case[field] = nil}
new_ds.add_case(new_case)
end
else
other_ds_hash[key].each do |new_values|
new_ds.add_case new_case.merge(new_values)
end
end

end
new_ds
end```
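
The difference between the :left and :inner join types can be sketched in plain Ruby over arrays of row hashes. This is an illustration under stated assumptions, not the library's implementation: the helper name `join_rows` is hypothetical, and keys are compared in the order given by the key-field arrays.

```ruby
# Hypothetical helper (not part of Statsample): joins two arrays of row
# hashes on the given key fields. type is :left or :inner.
def join_rows(left, right, left_keys, right_keys, type = :left)
  new_fields = right.flat_map(&:keys).uniq - right_keys
  # Index the right-hand rows by their key values.
  index = Hash.new { |h, k| h[k] = [] }
  right.each do |row|
    index[row.values_at(*right_keys)] << row.reject { |k, _| right_keys.include?(k) }
  end
  left.each_with_object([]) do |row, out|
    key = row.values_at(*left_keys)
    if index.key?(key)
      # One output row per matching right-hand row.
      index[key].each { |extra| out << row.merge(extra) }
    elsif type == :left
      # Keep unmatched rows, filling the new fields with nil.
      out << row.merge(new_fields.map { |f| [f, nil] }.to_h)
    end
  end
end

people = [{ "id" => 1, "name" => "ana" }, { "id" => 2, "name" => "bob" }]
cities = [{ "pid" => 1, "city" => "Lima" }]
left_join  = join_rows(people, cities, ["id"], ["pid"], :left)
inner_join = join_rows(people, cities, ["id"], ["pid"], :inner)
puts left_join.inspect
puts inner_join.inspect
```

A left join keeps every row of the first dataset, while an inner join keeps only the rows with a match in the second.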
merge(other_ds)

Merges the vectors from two datasets. In case of a name collision, the vector names are changed to x_1, x_2, ...

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 287
def merge(other_ds)
raise "Cases should be equal (this:#{@cases}; other:#{other_ds.cases})" unless @cases==other_ds.cases
types = @fields.collect{|f| @vectors[f].type} + other_ds.fields.collect{|f| other_ds[f].type}
new_fields = (@fields+other_ds.fields).recode_repeated
ds_new=Statsample::Dataset.new(new_fields)
new_fields.each_index{|i|
field=new_fields[i]
ds_new[field].type=types[i]
}
@cases.times {|i|
row=case_as_array(i)+other_ds.case_as_array(i)
ds_new.add_case_array(row)
}
ds_new.update_valid_data
ds_new
end```
nest(*tree_keys,&block)

Returns a nested hash that uses the given fields as keys and holds, at the deepest level, an array of hashes with the remaining values. If a block is provided, it is used to produce the values; it receives `row` (the dataset row), `current` (the last hash in the hierarchy) and `name` (the key to include).

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 128
def nest(*tree_keys,&block)
tree_keys=tree_keys[0] if tree_keys[0].is_a? Array
out=Hash.new
each do |row|
current=out
# Create tree
tree_keys[0,tree_keys.size-1].each do |f|
root=row[f]
current[root]||=Hash.new
current=current[root]
end
name=row[tree_keys.last]
if !block
current[name]||=Array.new
current[name].push(row.delete_if{|key,value| tree_keys.include? key})
else
current[name]=block.call(row, current,name)
end
end
out
end```
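
The nesting behavior without a block can be sketched in plain Ruby over an array of row hashes. The helper name `nest_rows` is hypothetical and the sketch does not use Statsample; it only illustrates the shape of the result.

```ruby
# Hypothetical helper (not part of Statsample): nests row hashes under the
# values of the given keys, one hash level per key.
def nest_rows(rows, *tree_keys)
  rows.each_with_object({}) do |row, out|
    current = out
    # Walk (and create) one hash level per key except the last.
    tree_keys[0...-1].each do |key|
      current = (current[row[key]] ||= {})
    end
    leaf = row[tree_keys.last]
    # Store the remaining fields of the row under the last key.
    (current[leaf] ||= []) << row.reject { |k, _| tree_keys.include?(k) }
  end
end

rows = [
  { "country" => "cl", "city" => "stgo", "n" => 1 },
  { "country" => "cl", "city" => "stgo", "n" => 2 },
  { "country" => "pe", "city" => "lima", "n" => 3 }
]
nested = nest_rows(rows, "country", "city")
puts nested.inspect
```

Rows that share the same key values end up in the same innermost array.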
one_to_many(parent_fields, pattern)

Creates a new dataset for one-to-many relations on a dataset, based on a pattern of field names.

For example, suppose you have a survey on number of children with this structure:

`id, name, child_name_1, child_age_1, child_name_2, child_age_2`

With

`ds.one_to_many(%w{id}, "child_%v_%n")`

the fields named in the first parameter are copied verbatim to the new dataset, and the fields that match the pattern in the second parameter add one case for each distinct %n. For example:

```cases=[
['1','george','red',10,'blue',20,nil,nil],
['2','fred','green',15,'orange',30,'white',20],
['3','alfred',nil,nil,nil,nil,nil,nil]
]
ds=Statsample::Dataset.new(%w{id name car_color1 car_value1 car_color2 car_value2 car_color3 car_value3})
cases.each {|c| ds.add_case_array c }
ds.one_to_many(['id'],'car_%v%n').to_matrix
=> Matrix[
["red", "1", 10],
["blue", "1", 20],
["green", "2", 15],
["orange", "2", 30],
["white", "2", 20]
]
```
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 952
def one_to_many(parent_fields, pattern)
#base_pattern=pattern.gsub(/%v|%n/,"")
re=Regexp.new pattern.gsub("%v","(.+?)").gsub("%n","(\\d+?)")
ds_vars=parent_fields
vars=[]
max_n=0
h=parent_fields.inject({}) {|a,v| a[v]=Statsample::Vector.new([], @vectors[v].type);a }
h['_col_id']=[].to_scale
ds_vars.push("_col_id")
@fields.each do |f|
if f=~re
if !vars.include? $1
vars.push($1)
h[$1]=Statsample::Vector.new([], @vectors[f].type)
end
max_n=$2.to_i if max_n < $2.to_i
end
end
ds=Dataset.new(h,ds_vars+vars)
each do |row|
row_out={}
parent_fields.each do |f|
row_out[f]=row[f]
end
max_n.times do |n1|
n=n1+1
any_data=false
vars.each do |v|
data=row[pattern.gsub("%v",v.to_s).gsub("%n",n.to_s)]
row_out[v]=data
any_data=true if !data.nil?
end
if any_data
row_out["_col_id"]=n
ds.add_case(row_out,false)
end

end
end
ds.update_valid_data
ds
end```
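
The pattern expansion can be sketched for a single row in plain Ruby. The helper name `expand_one_to_many` is hypothetical and no Statsample classes are used. Two assumptions differ from the source: the regular expression is anchored to the whole field name, and the pattern is assumed to contain no regular-expression metacharacters.

```ruby
# Hypothetical helper (not part of Statsample): expands the numbered field
# groups of one row hash into one output row per group that has data.
def expand_one_to_many(row, parent_fields, pattern)
  # %v captures the variable name and %n the group number.
  # Assumption: anchored to the full field name, unlike the source.
  body = pattern.gsub("%v", "(.+?)").gsub("%n", "(\\d+?)")
  re = Regexp.new("\\A" + body + "\\z")
  groups = Hash.new { |h, k| h[k] = {} }
  row.each do |field, value|
    m = re.match(field)
    next unless m
    groups[m[2].to_i][m[1]] = value
  end
  parent = parent_fields.map { |f| [f, row[f]] }.to_h
  groups.sort.each_with_object([]) do |(n, vars), out|
    next if vars.values.all?(&:nil?) # skip groups with no data at all
    out << parent.merge("_col_id" => n).merge(vars)
  end
end

row = { "id" => "1", "name" => "george",
        "car_color1" => "red",  "car_value1" => 10,
        "car_color2" => "blue", "car_value2" => 20,
        "car_color3" => nil,    "car_value3" => nil }
expanded = expand_one_to_many(row, ["id"], "car_%v%n")
expanded.each { |r| puts r.inspect }
```

The third group is dropped because all of its values are nil, matching the `any_data` check in the source.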
recode!(vector_name) { |case_as_hash(i)| ... }

Recode a vector based on a block

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 699
def recode!(vector_name)
0.upto(@cases-1) {|i|
@vectors[vector_name].data[i]=yield case_as_hash(i)
}
@vectors[vector_name].set_valid_data
end```
report_building(b)
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 995
def report_building(b)
b.section(:name=>@name) do |g|
g.text _"Cases: %d"  % cases
@fields.each do |f|
g.text "Element:[#{f}]"
g.parse_element(@vectors[f])
end
end
end```
standarize()

Returns a dataset with standardized data.

@return {Statsample::Dataset}

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 347
def standarize
ds=dup()
ds.fields.each do |f|
ds[f]=ds[f].vector_standarized
end
ds
end```
to_REXP()
```# File lib/statsample/rserve_extension.rb, line 11
def to_REXP
names=@fields
data=@fields.map {|f|
Rserve::REXP::Wrapper.wrap(@vectors[f].data_with_nils)
}
l=Rserve::Rlist.new(data,names)
Rserve::REXP.create_data_frame(l)
end```
to_gsl()
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 732
def to_gsl
if @gsl.nil?
if cases.nil?
update_valid_data
end
@gsl=GSL::Matrix.alloc(cases,fields.size)
self.each_array{|c|
@gsl.set_row(@i,c)
}
end
@gsl
end```
to_matrix()

Returns the data as a matrix. Columns are ordered by fields and rows by order of insertion

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 719
def to_matrix
rows=[]
self.each_array{|c|
rows.push(c)
}
Matrix.rows(rows)
end```
to_multiset_by_split(*fields)

Creates a Statsample::Multiset, using one or more fields to split the dataset.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 792
def to_multiset_by_split(*fields)
require 'statsample/multiset'
if fields.size==1
to_multiset_by_split_one_field(fields[0])
else
to_multiset_by_split_multiple_fields(*fields)
end
end```
to_multiset_by_split_multiple_fields(*fields)
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 823
def to_multiset_by_split_multiple_fields(*fields)
factors_total=nil
fields.each do |f|
if factors_total.nil?
factors_total=@vectors[f].factors.collect{|c|
[c]
}
else
suma=[]
factors=@vectors[f].factors
factors_total.each{|f1| factors.each{|f2| suma.push(f1+[f2]) } }
factors_total=suma
end
end
ms=Multiset.new_empty_vectors(@fields,factors_total)

p1=eval "Proc.new {|c| ms[["+fields.collect{|f| "c['#{f}']"}.join(",")+"]].add_case(c,false) }"
each{|c| p1.call(c)}

ms.datasets.each do |k,ds|
ds.update_valid_data
ds.name=fields.size.times.map {|i|
f=fields[i]
sk=k[i]
@vectors[f].labeling(sk)
}.join("-")
ds.vectors.each{|k1,v1|
v1.type=@vectors[k1].type
v1.name=@vectors[k1].name
v1.labels=@vectors[k1].labels

}
end
ms

end```
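
The loop that builds `factors_total` in the source above produces every combination of the factors of the split fields, that is, their Cartesian product. A minimal sketch of the same result in plain Ruby, using the core `Array#product` method and made-up factor values:

```ruby
# Made-up factor values for two split fields (illustrative assumption).
factors_by_field = { "sex" => %w[f m], "group" => [1, 2, 3] }

# Cartesian product of all factor lists, in field order.
first, *rest = factors_by_field.values
combinations = first.product(*rest)

combinations.each { |combo| puts combo.inspect }
```

Each combination then serves as the key of one sub-dataset, so two fields with 2 and 3 factors yield 6 keys.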
to_multiset_by_split_one_field(field)

Creates a Statsample::Multiset, using one field

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 802
def to_multiset_by_split_one_field(field)
raise ArgumentError,"Should use a correct field name" if !@fields.include? field
factors=@vectors[field].factors
ms=Multiset.new_empty_vectors(@fields, factors)
each {|c|
ms[c[field]].add_case(c,false)
}
#puts "Entering the datasets"
ms.datasets.each {|k,ds|
ds.update_valid_data
ds.name=@vectors[field].labeling(k)
ds.vectors.each{|k1,v1|
#        puts "Vector #{k1}:"+v1.to_s
v1.type=@vectors[k1].type
v1.name=@vectors[k1].name
v1.labels=@vectors[k1].labels

}
}
ms
end```
to_s()
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 918
def to_s
"#<"+self.class.to_s+":"+self.object_id.to_s+" @name=#{@name} @fields=["+@fields.join(",")+"] cases="+@vectors[@fields[0]].size.to_s
end```
update_valid_data()

Checks vectors and fields after inserting data. Use only after add_case_array, or after add_case with its second parameter set to false

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 445
def update_valid_data
@gsl=nil
@fields.each{|f| @vectors[f].set_valid_data}
check_length
end```
vector(c)
Alias for: col
vector_by_calculation(type=:scale) { |row| ... }
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 480
def vector_by_calculation(type=:scale)
a=[]
each do |row|
a.push(yield(row))
end
a.to_vector(type)
end```
vector_count_characters(fields=nil)
```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 517
def vector_count_characters(fields=nil)
fields=check_fields(fields)
collect_with_index do |row, i|
fields.inject(0){|a,v|
a+((@vectors[v].data_with_nils[i].nil?) ? 0: row[v].to_s.size)
}
end
end```
vector_mean(fields=nil, max_invalid=0)

Returns a vector with the mean for a set of fields. If the `fields` parameter is empty, returns the mean for all fields. If `max_invalid` > 0, returns the mean for all tuples with 0 to `max_invalid` invalid fields.

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 529
def vector_mean(fields=nil, max_invalid=0)
a=[]
fields=check_fields(fields)
size=fields.size
each_with_index do |row, i |
# number of invalid values
sum=0
invalids=0
fields.each{|f|
if !@vectors[f].data_with_nils[i].nil?
sum+=row[f].to_f
else
invalids+=1
end
}
if(invalids>max_invalid)
a.push(nil)
else
a.push(sum.quo(size-invalids))
end
end
a=a.to_vector(:scale)
a.name=_("Means from %s") % @name
a
end```
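
The effect of `max_invalid` can be sketched for a single case in plain Ruby. The helper name `case_mean` is hypothetical and nil stands in for a missing value. Unlike the source, this sketch returns nil when no valid value is left rather than dividing by zero.

```ruby
# Hypothetical helper (not part of Statsample): mean of one case's values,
# tolerating up to max_invalid missing (nil) values.
def case_mean(values, max_invalid = 0)
  valid = values.compact
  invalids = values.size - valid.size
  # Too many missing values, or nothing left to average: no mean.
  return nil if invalids > max_invalid || valid.empty?
  # Divide by the number of valid values only.
  valid.sum(0.0) / valid.size
end

puts case_mean([1, 2, 3]).inspect
puts case_mean([1, nil, 3]).inspect
puts case_mean([1, nil, 3], 1).inspect
```

With the default `max_invalid` of 0, a single missing value makes the mean of that case nil; raising the limit averages over the values that remain.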
vector_missing_values(fields=nil)

Returns a vector with the number of missing values for each case

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 509
def vector_missing_values(fields=nil)
fields=check_fields(fields)
collect_with_index do |row, i|
fields.inject(0) {|a,v|
a+ ((@vectors[v].data_with_nils[i].nil?) ? 1: 0)
}
end
end```
vector_sum(fields=nil)

Returns a vector with the sum of the fields. If the `fields` parameter is empty, sums all fields

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 489
def vector_sum(fields=nil)
fields||=@fields
vector=collect_with_index do |row, i|
if(fields.find{|f| !@vectors[f].data_with_nils[i]})
nil
else
fields.inject(0) {|ac,v| ac + row[v].to_f}
end
end
vector.name=_("Sum from %s") % @name
vector
end```
verify(*tests)

Tests each row with one or more tests. Each test is a Proc of the form

```Proc.new {|row| row['age']>0}
```

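The function returns an array with all errors. Note that the source below indexes each test as an array: `test[0]` is the error description, `test[1]` the list of fields whose values are reported, and `test[2]` the Proc itself. If the first argument is a String, it names the field used to identify each row; otherwise the first field is used.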

```# File pkg/statsample-1.4.0/lib/statsample/dataset.rb, line 895
def verify(*tests)
if(tests[0].is_a? String)
id=tests[0]
tests.shift
else
id=@fields[0]
end
vr=[]
i=0
each do |row|
i+=1
tests.each{|test|
if ! test[2].call(row)
values=""
if test[1].size>0
values=" ("+test[1].collect{|k| "#{k}=#{row[k]}"}.join(", ")+")"
end
vr.push("#{i} [#{row[id]}]: #{test[0]}#{values}")
end
}
end
vr
end```
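
The way the source indexes each test (`test[0]`, `test[1]`, `test[2]`) suggests a three-element array of description, fields to report, and Proc. A minimal sketch of that row-checking loop in plain Ruby over an array of row hashes follows; the helper name `verify_rows` is hypothetical and the sketch does not use Statsample.

```ruby
# Hypothetical helper (not part of Statsample). Each test is an array:
# [description, fields_to_report, proc].
def verify_rows(rows, id_field, *tests)
  errors = []
  rows.each_with_index do |row, i|
    tests.each do |description, fields, check|
      next if check.call(row)
      # Report the values of the listed fields, if any were given.
      values = fields.empty? ? "" : " (" + fields.map { |k| "#{k}=#{row[k]}" }.join(", ") + ")"
      # Row numbers in messages are 1-based.
      errors << "#{i + 1} [#{row[id_field]}]: #{description}#{values}"
    end
  end
  errors
end

rows = [{ "id" => "a", "age" => 30 }, { "id" => "b", "age" => -1 }]
age_test = ["age should be positive", ["age"], proc { |row| row["age"] > 0 }]
errors = verify_rows(rows, "id", age_test)
puts errors
```

Each error message identifies the row by its position and id field, followed by the description and the offending values.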