module Statsample::Bivariate

Diverse methods and classes to calculate bivariate relations Specific classes:

Statsample::Bivariate::Pearson : Pearson correlation coefficient ®
Statsample::Bivariate::Tetrachoric : Tetrachoric correlation
Statsample::Bivariate::Polychoric : Polychoric correlation (using joint, two-step and polychoric series)

Diverse methods and classes to calculate bivariate relations Specific classes:

Statsample::Bivariate::Pearson : Pearson correlation coefficient ®
Statsample::Bivariate::Tetrachoric : Tetrachoric correlation
Statsample::Bivariate::Polychoric : Polychoric correlation (using joint, two-step and polychoric series)

Public Instance Methods

correlation_matrix(ds) click to toggle source

Correlation matrix. Order of rows and columns depends on Statsample::Dataset#fields order

# File lib/statsample/bivariate.rb, line 191
def correlation_matrix(ds)
  vars,cases=ds.fields.size,ds.cases
  if !ds.has_missing_data? and Statsample.has_gsl? and prediction_optimized(vars,cases) < prediction_pairwise(vars,cases)
    cm=correlation_matrix_optimized(ds)
  else
    cm=correlation_matrix_pairwise(ds)
  end
  cm.extend(Statsample::CovariateMatrix)
  cm.fields=ds.fields
  cm
end

correlation_matrix_optimized(ds) click to toggle source

# File lib/statsample/bivariate.rb, line 203
def correlation_matrix_optimized(ds)
  s=covariance_matrix_optimized(ds)
  sds=GSL::Matrix.diagonal(s.diagonal.sqrt.pow(-1))
  cm=sds*s*sds
  # Fix diagonal
  s.row_size.times {|i|
    cm[i,i]=1.0
  }
  cm
end

correlation_matrix_pairwise(ds) click to toggle source

# File lib/statsample/bivariate.rb, line 213
def correlation_matrix_pairwise(ds)
  cache={}
  cm=ds.collect_matrix do |row,col|
    if row==col
      1.0
    elsif (ds[row].type!=:scale or ds[col].type!=:scale)
      nil
    else
      if cache[[col,row]].nil?
        r=pearson(ds[row],ds[col])
        cache[[row,col]]=r
        r
      else
        cache[[col,row]]
      end 
    end
  end
end

correlation_probability_matrix(ds, tails=:both) click to toggle source

Matrix of correlation probabilities. Order of rows and columns depends on Statsample::Dataset#fields order

# File lib/statsample/bivariate.rb, line 247
def correlation_probability_matrix(ds, tails=:both)
  rows=ds.fields.collect do |row|
    ds.fields.collect do |col|
      v1a,v2a=Statsample.only_valid_clone(ds[row],ds[col])
      (row==col or ds[row].type!=:scale or ds[col].type!=:scale) ? nil : prop_pearson(t_pearson(ds[row],ds[col]), v1a.size, tails)
    end
  end
  Matrix.rows(rows)
end

covariance(v1,v2) click to toggle source

Covariance between two vectors

# File lib/statsample/bivariate.rb, line 13
def covariance(v1,v2)
  v1a,v2a=Statsample.only_valid_clone(v1,v2)
  return nil if v1a.size==0
  if Statsample.has_gsl?
    GSL::Stats::covariance(v1a.gsl, v2a.gsl)
  else
    covariance_slow(v1a,v2a)
  end
end

covariance_matrix(ds) click to toggle source

Covariance matrix. Order of rows and columns depends on Statsample::Dataset#fields order

# File lib/statsample/bivariate.rb, line 155
def covariance_matrix(ds)
  vars,cases=ds.fields.size,ds.cases
  if !ds.has_missing_data? and Statsample.has_gsl? and prediction_optimized(vars,cases) < prediction_pairwise(vars,cases)
    cm=covariance_matrix_optimized(ds)
  else
    cm=covariance_matrix_pairwise(ds)
    
  end
  cm.extend(Statsample::CovariateMatrix)
  cm.fields=ds.fields
  cm
end

covariance_matrix_optimized(ds) click to toggle source

# File lib/statsample/bivariate.rb, line 141
def covariance_matrix_optimized(ds)
  x=ds.to_gsl
  n=x.row_size
  m=x.column_size
  means=((1/n.to_f)*GSL::Matrix.ones(1,n)*x).row(0)
  centered=x-(GSL::Matrix.ones(n,m)*GSL::Matrix.diag(means))
  ss=centered.transpose*centered
  s=((1/(n-1).to_f))*ss
  s
end

covariance_matrix_pairwise(ds) click to toggle source

# File lib/statsample/bivariate.rb, line 169
def covariance_matrix_pairwise(ds)
  cache={}
  matrix=ds.collect_matrix do |row,col|
    if (ds[row].type!=:scale or ds[col].type!=:scale)
      nil
    elsif row==col
      ds[row].variance
    else
      if cache[[col,row]].nil?
        cov=covariance(ds[row],ds[col])
        cache[[row,col]]=cov
        cov
      else
         cache[[col,row]]
      end
    end
  end
  matrix
end

gamma(matrix) click to toggle source

Calculates Goodman and Kruskal's gamma.

Gamma is the surplus of concordant pairs over discordant pairs, as a percentage of all pairs ignoring ties.

Source: faculty.chass.ncsu.edu/garson/PA765/assocordinal.htm

# File lib/statsample/bivariate.rb, line 305
def gamma(matrix)
  v=pairs(matrix)
  (v['P']-v['Q']).to_f / (v['P']+v['Q']).to_f
end

maximum_likehood_dichotomic(pred,real) click to toggle source

Estimate the ML between two dichotomic vectors

# File lib/statsample/bivariate.rb, line 23
def maximum_likehood_dichotomic(pred,real)
  preda,reala=Statsample.only_valid_clone(pred,real)                
  sum=0
  preda.each_index{|i|
     sum+=(reala[i]*Math::log(preda[i])) + ((1-reala[i])*Math::log(1-preda[i]))
  }
  sum
end

min_n_valid(ds) click to toggle source

based on a dataset

# File lib/statsample/bivariate.rb, line 373
def min_n_valid(ds)
  min=ds.cases
  m=n_valid_matrix(ds)
  for x in 0...m.row_size
    for y in 0...m.column_size
      min=m[x,y] if m[x,y] < min
    end
  end
  min
end

n_valid_matrix(ds) click to toggle source

Retrieves the n valid pairwise.

# File lib/statsample/bivariate.rb, line 233
def n_valid_matrix(ds)
  ds.collect_matrix do |row,col|
    if row==col
      ds[row].valid_data.size
    else
      rowa,rowb=Statsample.only_valid_clone(ds[row],ds[col])
      rowa.size
    end
  end
end

ordered_pairs(vector) click to toggle source

# File lib/statsample/bivariate.rb, line 351
def ordered_pairs(vector)
  d=vector.data
  a=[]
  (0...(d.size-1)).each{|i|
    ((i+1)...(d.size)).each {|j|
      a.push([d[i],d[j]])
    }
  }
  a
end

pairs(matrix) click to toggle source

Calculate indexes for a matrix the rows and cols has to be ordered

# File lib/statsample/bivariate.rb, line 310
def pairs(matrix)
  # calculate concordant #p matrix
  rs=matrix.row_size
  cs=matrix.column_size
  conc=disc=ties_x=ties_y=0
  (0...(rs-1)).each do |x|
    (0...(cs-1)).each do |y|
      ((x+1)...rs).each do |x2|
        ((y+1)...cs).each do |y2|
          # #p sprintf("%d:%d,%d:%d",x,y,x2,y2)
          conc+=matrix[x,y]*matrix[x2,y2]
        end
      end
    end
  end
  (0...(rs-1)).each {|x|
    (1...(cs)).each{|y|
      ((x+1)...rs).each{|x2|
        (0...y).each{|y2|
          # #p sprintf("%d:%d,%d:%d",x,y,x2,y2)
          disc+=matrix[x,y]*matrix[x2,y2]
        }
      }
    }
  }
  (0...(rs-1)).each {|x|
    (0...(cs)).each{|y|
      ((x+1)...(rs)).each{|x2|
        ties_x+=matrix[x,y]*matrix[x2,y]
      }
    }
  }
  (0...rs).each {|x|
    (0...(cs-1)).each{|y|
      ((y+1)...(cs)).each{|y2|
        ties_y+=matrix[x,y]*matrix[x,y2]
      }
    }
  }
  {'P'=>conc,'Q'=>disc,'Y'=>ties_y,'X'=>ties_x}
end

partial_correlation(v1,v2,control) click to toggle source

Correlation between v1 and v2, controling the effect of control on both.

# File lib/statsample/bivariate.rb, line 132
def partial_correlation(v1,v2,control)
  v1a,v2a,cona=Statsample.only_valid_clone(v1,v2,control)
  rv1v2=pearson(v1a,v2a)
  rv1con=pearson(v1a,cona)
  rv2con=pearson(v2a,cona)        
  (rv1v2-(rv1con*rv2con)).quo(Math::sqrt(1-rv1con**2) * Math::sqrt(1-rv2con**2))
  
end

pearson(v1,v2) click to toggle source

Calculate Pearson correlation coefficient ® between 2 vectors

# File lib/statsample/bivariate.rb, line 43
def pearson(v1,v2)
  v1a,v2a=Statsample.only_valid_clone(v1,v2)
  return nil if v1a.size ==0
  if Statsample.has_gsl?
    GSL::Stats::correlation(v1a.gsl, v2a.gsl)
  else
    pearson_slow(v1a,v2a)
  end
end

point_biserial(dichotomous,continous) click to toggle source

Calculate Point biserial correlation. Equal to Pearson correlation, with one dichotomous value replaced by “0” and the other by “1”

# File lib/statsample/bivariate.rb, line 265
def point_biserial(dichotomous,continous)
  ds={'d'=>dichotomous,'c'=>continous}.to_dataset.dup_only_valid
  raise(TypeError, "First vector should be dichotomous") if ds['d'].factors.size!=2
  raise(TypeError, "Second vector should be continous") if ds['c'].type!=:scale
  f0=ds['d'].factors.sort[0]
  m0=ds.filter_field('c') {|c| c['d']==f0}
  m1=ds.filter_field('c') {|c| c['d']!=f0}
  ((m1.mean-m0.mean).to_f / ds['c'].sdp) * Math::sqrt(m0.size*m1.size.to_f / ds.cases**2)
end

prediction_optimized(vars,cases) click to toggle source

Predicted time for optimized correlation matrix, in miliseconds See benchmarks/correlation_matrix.rb to see mode of calculation

# File lib/statsample/bivariate.rb, line 111
def prediction_optimized(vars,cases)
  ((4+0.018128*cases+0.246871*vars+0.001169*vars*cases)**2) / 100
end

prediction_pairwise(vars,cases) click to toggle source

Predicted time for pairwise correlation matrix, in miliseconds See benchmarks/correlation_matrix.rb to see mode of calculation

# File lib/statsample/bivariate.rb, line 105
def prediction_pairwise(vars,cases)
  ((-0.518111-0.000746*cases+1.235608*vars+0.000740*cases*vars)**2) / 100
end

prop_pearson(t, size, tails=:both) click to toggle source

Retrieves the probability value (a la SPSS) for a given t, size and number of tails. Uses a second parameter

:both or 2 : for r!=0 (default)
:right, :positive or 1 : for r > 0
:left, :negative : for r < 0

# File lib/statsample/bivariate.rb, line 83
def prop_pearson(t, size, tails=:both)
  tails=:both if tails==2
  tails=:right if tails==1 or tails==:positive
  tails=:left if tails==:negative
  
  n_tails=case tails
    when :both then 2
    else 1
  end
  t=-t if t>0 and (tails==:both)
  cdf=Distribution::T.cdf(t, size-2)
  if(tails==:right)
    1.0-(cdf*n_tails)
  else
    cdf*n_tails
  end
end

residuals(from,del) click to toggle source

Returns residual score after delete variance from another variable

# File lib/statsample/bivariate.rb, line 117
def residuals(from,del)
  r=Statsample::Bivariate.pearson(from,del)
  froms, dels = from.vector_standarized, del.vector_standarized
  nv=[]
  froms.data_with_nils.each_index do |i|
    if froms[i].nil? or dels[i].nil?
      nv.push(nil)
    else
      nv.push(froms[i]-r*dels[i])
    end
  end
  nv.to_vector(:scale)
end

spearman(v1,v2) click to toggle source

Spearman ranked correlation coefficient (rho) between 2 vectors

# File lib/statsample/bivariate.rb, line 258
def spearman(v1,v2)
  v1a,v2a=Statsample.only_valid_clone(v1,v2)
  v1r,v2r=v1a.ranked(:scale),v2a.ranked(:scale)
  pearson(v1r,v2r)
end

sum_of_squares(v1,v2) click to toggle source

# File lib/statsample/bivariate.rb, line 36
def sum_of_squares(v1,v2)
  v1a,v2a=Statsample.only_valid_clone(v1,v2)        
  m1=v1a.mean
  m2=v2a.mean
  (v1a.size).times.inject(0) {|ac,i| ac+(v1a[i]-m1)*(v2a[i]-m2)}
end

t_pearson(v1,v2) click to toggle source

Retrieves the value for t test for a pearson correlation between two vectors to test the null hipothesis of r=0

# File lib/statsample/bivariate.rb, line 61
def t_pearson(v1,v2)
  v1a,v2a=Statsample.only_valid_clone(v1,v2)
  r=pearson(v1a,v2a)
  if(r==1.0) 
    0
  else
    t_r(r,v1a.size)
  end
end

t_r(r,size) click to toggle source

Retrieves the value for t test for a pearson correlation giving r and vector size Source : faculty.chass.ncsu.edu/garson/PA765/correl.htm

# File lib/statsample/bivariate.rb, line 73
def t_r(r,size)
  r * Math::sqrt(((size)-2).to_f / (1 - r**2))
end

tau_a(v1,v2) click to toggle source

Kendall Rank Correlation Coefficient (Tau a) Based on Hervé Adbi article

# File lib/statsample/bivariate.rb, line 276
def tau_a(v1,v2)
  v1a,v2a=Statsample.only_valid_clone(v1,v2)
  n=v1.size
  v1r,v2r=v1a.ranked(:scale),v2a.ranked(:scale)
  o1=ordered_pairs(v1r)
  o2=ordered_pairs(v2r)
  delta= o1.size*2-(o2  & o1).size*2
  1-(delta * 2 / (n*(n-1)).to_f)
end

tau_b(matrix) click to toggle source

Calculates Goodman and Kruskal’s Tau b correlation. Tb is an asymmetric P-R-E measure of association for nominal scales (Mielke, X)

Tau-b defines perfect association as strict monotonicity. Although it requires strict monotonicity to reach 1.0, it does not penalize ties as much as some other measures.

Reference¶ ↑

Mielke, P. GOODMAN–KRUSKAL TAU AND GAMMA. Source: faculty.chass.ncsu.edu/garson/PA765/assocordinal.htm

# File lib/statsample/bivariate.rb, line 295
def tau_b(matrix)
  v=pairs(matrix)
  ((v['P']-v['Q']).to_f / Math::sqrt((v['P']+v['Q']+v['Y'])*(v['P']+v['Q']+v['X'])).to_f)
end