EKB I have just been introduced to substantively weighted analytical techniques (SWAT), and in particular substantively weighted least squares (SWLS), through the book What Works: A New Approach to Program and Policy Analysis [1]. It's also discussed in an online paper, "Best Practices Research: A Methodological Guide for the Perplexed" [2]. It's a very interesting technique for policy analysis, and I created a Tcl implementation of it. (There's also an R language implementation available from the author at [3].)
The motivation for SWLS is that in many studies that include a linear regression, the residuals are more interesting than the fitted line. For example, when fitting a variable that gives the performance of a government agency against explanatory variables such as staffing levels, financial resources, and so on, the usual regression parameters tell you about the performance of the average agency at different values of those variables. But what you're usually interested in is the performance of agencies that are above average, given their levels of resources, so that you can try to identify what they are doing that makes them better. SWLS (and, more generally, SWAT) is a systematic way to look for patterns among above- or below-average data points within a data set.
The code is below, and a sample data file, used in the
What Works book, is at the end of this page.
Note: The results produced by this code agree with the results in
What Works except for a few percent difference in the bounds of the 95% confidence intervals. I can't track down the source of the discrepancy...
EKB OK, I tracked down the discrepancy. They were using the asymptotic value of the inverse t-distribution as the number of degrees of freedom gets very large (1.96), rather than the value for the 42 degrees of freedom of the sample data file. The script below finds the appropriate t value for the degrees of freedom in the problem.
Another note: After the original post, I improved the filltext proc so that its output is a little more nicely laid out and more informative.
package require Tcl 8.4
package require math::linearalgebra 1.0
package require math::statistics 0.1.1
namespace eval swat {
variable epsilon 1.0e-7
variable swls_threshold 0.7
variable swls_beta 10
namespace import \
::math::linearalgebra::mkMatrix \
::math::linearalgebra::mkVector \
::math::linearalgebra::mkIdentity \
::math::linearalgebra::mkDiagonal \
::math::linearalgebra::getrow \
::math::linearalgebra::setrow \
::math::linearalgebra::getcol \
::math::linearalgebra::setcol \
::math::linearalgebra::getelem \
::math::linearalgebra::setelem \
::math::linearalgebra::dotproduct \
::math::linearalgebra::matmul \
::math::linearalgebra::add \
::math::linearalgebra::sub \
::math::linearalgebra::solveGauss
# NOTE: The authors of math::linearalgebra forgot to
# export ::math::linearalgebra::transpose
########################################
#
# t-stats
#
########################################
# swat::tstat n
# Returns the critical value of Student's t distribution
# for a two-tailed test; defaults to alpha = 0.05
# Iterates until result is within epsilon of the target
proc tstat {n {alpha 0.05}} {
variable epsilon
variable tvals
if {[info exists tvals($n:$alpha)]} {
return $tvals($n:$alpha)
}
set deltat [expr {100 * $epsilon}]
# For a two-tailed test, target the 1 - alpha/2 quantile
set ptarg [expr {1.000 - $alpha/2.0}]
set maxiter 100
# Initial value for t
set t 2.0
set niter 0
while {abs([::math::statistics::cdf-students-t $n $t] - $ptarg) > $epsilon} {
set pstar [::math::statistics::cdf-students-t $n $t]
set pl [::math::statistics::cdf-students-t $n [expr {$t - $deltat}]]
set ph [::math::statistics::cdf-students-t $n [expr {$t + $deltat}]]
set t [expr {$t + 2.0 * $deltat * ($ptarg - $pstar)/($ph - $pl)}]
incr niter
if {$niter == $maxiter} {
error "swat::tstat: Did not converge after $niter iterations"
}
}
# Cache the result to shorten the call in future
set tvals($n:$alpha) $t
return $t
}
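# Example (values approximate): with the 42 degrees of freedom of the
# sample data file below (50 points, 8 coefficients), the two-tailed
# 95% critical value is roughly 2.02, rather than the asymptotic 1.96:
#   % swat::tstat 42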
########################################
#
# Weighted Least Squares
#
########################################
# swat::wls w [list y x's] w [list y x's] ...
# Returns:
# adjusted R-squared
# coefficients of x's in fit
# jackknifed residuals
# standard errors of the coefficients
# 95% confidence bounds for coefficients
proc wls {args} {
# Fill the matrices of x & y values, and weights
# For n points, k coefficients
# The number of points is equal to half the arguments (n weights, n points)
set n [expr {[llength $args]/2}]
set firstloop true
# Sum up all y values to take an average
set ysum 0
# Add up the weights
set wtsum 0
# Count over rows (points) as you go
set point 0
foreach {wt pt} $args {
# Check inputs
if {![string is double -strict $wt]} {
error "swat::wls: Weight \"$wt\" is not a number"
}
## -- Check dimensions, initialize
if {$firstloop} {
# k = num of vals in pt = 1 + number of x's (because of constant)
set k [llength $pt]
if {$n <= $k + 1} {
error "swat::wls: Too few degrees of freedom: $k variables but only $n points"
}
set X [mkMatrix $n $k]
set y [mkVector $n]
set I_x [mkIdentity $k]
set I_y [mkIdentity $n]
set firstloop false
} else {
# Must have the same number of x's for all points
if {$k != [llength $pt]} {
error "swat::wls: Point \"$pt\" has wrong number of values (expected $k)"
}
}
## -- Extract values from set of points
# Make a list of y values
set yval [expr {double([lindex $pt 0])}]
setelem y $point $yval
set ysum [expr {$ysum + $wt * $yval}]
set wtsum [expr {$wtsum + $wt}]
# Add x-values to the x-matrix
set xrow [lrange $pt 1 end]
# Add the constant (value = 1.0)
lappend xrow 1.0
setrow X $point $xrow
# Create list of weights & square root of weights
lappend w [expr {double($wt)}]
lappend sqrtw [expr {sqrt(double($wt))}]
incr point
}
set ymean [expr {double($ysum)/$wtsum}]
set W [mkDiagonal $w]
set sqrtW [mkDiagonal $sqrtw]
# Calculate the sum of squared differences for the x's
for {set r 0} {$r < $k} {incr r} {
set xsqrsum 0.0
set xmeansum 0.0
# Calculate sum of squared differences as: sum(x^2) - (sum x)^2/n
for {set t 0} {$t < $n} {incr t} {
set xval [getelem $X $t $r]
set xmeansum [expr {$xmeansum + double($xval)}]
set xsqrsum [expr {$xsqrsum + double($xval * $xval)}]
}
lappend xsqr [expr {$xsqrsum - $xmeansum * $xmeansum/$n}]
}
## -- Set up the X'WX matrix
set XtW [matmul [::math::linearalgebra::transpose $X] $W]
set XtWX [matmul $XtW $X]
# Invert
set M [solveGauss $XtWX $I_x]
set beta [matmul $M [matmul $XtW $y]]
### -- Residuals & jackknife statistics
# 1) Generate list of diagonals of the hat matrix
set H [matmul $X [matmul $M $XtW]]
for {set i 0} {$i < $n} {incr i} {
lappend h_ii [getelem $H $i $i]
}
set R [matmul $sqrtW [matmul [sub $I_y $H] $y]]
set yhat [matmul $H $y]
# 2) Generate list of residuals, sum of squared residuals, r-squared
set sstot 0.0
set ssreg 0.0
# Note: Relying on representation of Vector as a list for y, yhat
foreach yval $y wt $w yhatval $yhat {
set sstot [expr {$sstot + $wt * ($yval - $ymean) * ($yval - $ymean)}]
set ssreg [expr {$ssreg + $wt * ($yhatval - $ymean) * ($yhatval - $ymean)}]
}
set r2 [expr {double($ssreg)/$sstot}]
set adjr2 [expr {1.0 - (1.0 - $r2) * ($n - 1)/($n - $k)}]
set sumsqresid [dotproduct $R $R]
# 3) Calculate the overall variance, the variance per point & jackknifed stderr
set s2 [expr {double($sumsqresid) / double($n - $k)}]
for {set r 0} {$r < $n} {incr r} {
set wt [lindex $w $r]
set hprime [expr {sqrt($wt) * (1.0 - [lindex $h_ii $r])}]
set resid [getelem $R $r]
set newjns2i [expr {(($n - $k) * $s2 - double($resid * $resid)/$hprime)/($n - $k - 1)}]
lappend jns2i $newjns2i
lappend jacknife [expr {double($resid)/sqrt($newjns2i * $hprime)}]
}
### -- Confidence intervals for coefficients
set tvalue [tstat [expr {$n - $k}]]
for {set i 0} {$i < $k} {incr i} {
set stderr [expr {sqrt($s2 * [getelem $M $i $i])}]
set mid [lindex $beta $i]
lappend stderrs $stderr
lappend confinterval [list [expr {$mid - $tvalue * $stderr}] [expr {$mid + $tvalue * $stderr}]]
}
return [list $adjr2 $beta $jacknife $stderrs $confinterval]
}
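# Example call shape (hypothetical numbers; weight/point pairs, where
# the first value in each point list is y and the rest are the x's):
#   swat::wls 1.0 {2.0 1.0 3.0} 1.0 {4.1 2.0 5.0} 0.9 {3.5 1.5 4.0} ...
# The result is a list: adjusted R-squared, coefficients (x's first,
# constant last), jackknifed residuals, standard errors, and the 95%
# confidence interval for each coefficient.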
########################################
#
# Substantively Weighted Least Squares
#
########################################
# swat::swls [list y x's] [list y x's] ...
# Returns, for each run in the sequence of fits:
# - number of points with weight 1
# - adjusted R-squared
# - coefficients of the x's
# - standard errors of the coefficients
# - 95% confidence intervals for the coefficients
# - weighted averages of the x values
proc swls {args} {
variable swls_threshold
variable swls_beta
set numpoints [llength $args]
# Initialize weights to 1.0 for all points
for {set i 0} {$i < $numpoints} {incr i} {
set wt($i) 1.0
}
# Start of big loop over weightings
for {set i 0} {$i < $swls_beta} {incr i} {
# Prepare the arg to send to wls
set wtpts {}
set nhigh 0
set sumwts 0
for {set j 0} {$j < $numpoints} {incr j} {
set pts [lindex $args $j]
if {![info exists numxs]} {
set numxs [expr {[llength $pts] - 1}]
}
lappend wtpts $wt($j) $pts
set sumwts [expr {$sumwts + $wt($j)}]
for {set n 1} {$n <= $numxs} {incr n} {
if {$j > 0} {
set sumx($n) [expr {$sumx($n) + $wt($j) * [lindex $pts $n]}]
} else {
# First point of this pass: reset the running sum (weighted)
set sumx($n) [expr {$wt($j) * [lindex $pts $n]}]
}
}
if {$wt($j) == 1.0} {
incr nhigh
}
}
set xave {}
for {set n 1} {$n <= $numxs} {incr n} {
lappend xave [expr {double($sumx($n))/$sumwts}]
}
set retval [eval wls $wtpts]
# Adjusted R-squared is first, then coeffs, then jackknifed residuals
set r2 [lindex $retval 0]
lappend coeffs [list $nhigh $r2 [lindex $retval 1] [lindex $retval 3] [lindex $retval 4] $xave]
set jacknife [lindex $retval 2]
# Figure out the next set of weights
for {set j 0} {$j < $numpoints} {incr j} {
if {$i == 0 && [lindex $jacknife $j] < $swls_threshold} {
set wt($j) [expr {1.0 - 1.0/double($swls_beta)}]
}
if {$i > 0 && $wt($j) < 1.0} {
set wt($j) [expr {$wt($j) - 1.0/double($swls_beta)}]
}
}
}
return $coeffs
}
}
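# The GUI below drives swat::swls from a data file, but it can also be
# called directly; for example ($points is a hypothetical variable
# holding a list of {y x1 x2 ...} rows):
#   set runs [eval swat::swls $points]
# Each element of $runs describes one weighting pass: the number of
# full-weight points, the adjusted R-squared, the coefficients, their
# standard errors, their 95% confidence intervals, and the weighted
# averages of the x's.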
##########################
#
# GUI
#
##########################
set fileinfo(dir) [file dirname $argv0]
set fileinfo(name) ""
proc fileopen {} {
global fileinfo
set fname [tk_getOpenFile -initialdir $fileinfo(dir) -initialfile $fileinfo(name) \
-defaultextension ".dat" -filetypes { {{Data files} {.dat}} {{Text Files} {.txt} } {{All Files} *} }]
if {$fname == ""} {return}
if {[catch {open $fname r} fhndl]} {
tk_messageBox -icon error -message "Error opening file [file tail $fname]: $fhndl"
return
}
set names {}
set points {}
set firstline true
while {[gets $fhndl line] != -1} {
# Ignore blank lines & comments
if {[string trim $line] == ""} {continue}
if {[string index [string trimleft $line] 0] == "#"} {continue}
if {$firstline} {
set names [split $line "\t"]
set firstline false
continue
}
lappend points [split $line "\t"]
}
close $fhndl
set fileinfo(dir) [file dirname $fname]
set fileinfo(name) [file tail $fname]
wm title . "$fileinfo(name) - SWLS"
filltext $names $points
}
proc filltext {names points} {
.t delete 1.0 end
# Go ahead and update, since there can be a delay in loading
update
set coeffs [eval swat::swls $points]
set nstart [lindex [lindex $coeffs 0] 0]
set nend [lindex [lindex $coeffs end] 0]
.t insert end "Explaining: [lindex $names 0]\n\n"
.t insert end "Number of data points: $nstart\n"
.t insert end "Number in set of interest: $nend\n\n"
.t insert end [format "%-15s\t% 10s\t% 11s\t%25s\t% 10s\n\n" Variable Coeff "Std Error" "-- 95% Conf Interval --" "Ave Value"]
set i 1
foreach val $coeffs {
set r2 [lindex $val 1]
set xcoeffs [lindex $val 2]
set stderrs [lindex $val 3]
set cis [lindex $val 4]
set xaves [lindex $val 5]
.t insert end "== Run $i ==\n"
incr i
set j 1
foreach xave $xaves xc $xcoeffs ci $cis se $stderrs {
if {$j == [llength $xcoeffs]} {
set lbl "Constant"
set avestring ""
} else {
set lbl "[lindex $names $j]"
set avestring [format "% 10.3f" $xave]
}
.t insert end [format "%-15s\t% 10.3f\t\u00B1% 10.3f\t\[% 10.3f : % 10.3f\]\t%s\n" \
$lbl $xc $se [lindex $ci 0] [lindex $ci 1] $avestring]
incr j
}
.t insert end [format "Adj. R-squared = %.3f\n\n" $r2]
}
}
wm title . "SWLS"
. configure -menu .mainmenu
menu .mainmenu
.mainmenu add cascade -label "File" -menu .mainmenu.f
menu .mainmenu.f -tearoff false
.mainmenu.f add command -label "Open" -command fileopen -underline 0
scrollbar .sby -command {.t yview} -orient vertical
scrollbar .sbx -command {.t xview} -orient horizontal
text .t -width 70 -height 20 -wrap none -yscrollcommand {.sby set} \
-xscrollcommand {.sbx set} -font "Courier 9"
grid .t .sby -sticky nsew
grid .sbx -sticky nsew
grid columnconfigure . 0 -weight 1
grid rowconfigure . 0 -weight 1
focus -force .
You can try it out using the following data file. This is a file of data on child welfare agencies used by the original SWAT creators for the book
What Works, formatted for this Tcl app.
Organizational Learning ACES Instability Ambiguity Staff Change Divorce Rate Slack Expenditure
1141.4 1.467351 70.43141 23.57 115.5502 26.2 3.470122 3440.159
2667 1.754386 6.407348 32.1 39.5922 27.8 4.105816 8554.186
307.3 0.4215852 63.41155 35.14 106.0642 29.4 5.610657 2544.263
840.7 0 16.39695 23.46 50.57774 30.8 3.848957 2366.732
482.5 1.18499 78.90607 31.95 -1.144 20.3 3.700461 4954.157
550 2.072846 28.10363 17.81 25.76482 22.8 2.74241 2976.427
514.1 0.607718 22.71782 22.23 18.3592 14.9 5.063129 4905.78
1352.5 1.470588 6.602017 65.03 -232.2549 19.6 9.178106 5970.52
1136.5 0.8285004 123.2406 21.44 60.60778 30.8 2.495392 2567.212
1575.4 1.660879 87.06177 31.26 86.82364 22.2 2.100618 2589.531
1170.6 0 13.04299 37.68 31.7781 19.1 4.491279 4141.247
1679.6 0.9624639 7.197373 21.72 79.94753 27 3.074239 3216.067
882.3 1.212856 181.2476 16.97 45.62724 17.4 1.902308 2469.215
1347.5 1.782531 78.82232 11.84 33.76736 26.6 1.762038 1742.745
1353.4 1.431127 11.54932 31.1 57.34841 16.6 3.434238 2659.697
1363.4 2.40481 13.08474 18.11 72.44474 22.5 2.563138 3296.465
1087.9 0.538648 26.32775 23.06 166.6439 21.9 2.649425 3330.793
712.8 0.4703669 27.16063 25.46 35.4368 15.6 4.030189 3416.769
1865.4 2.42915 9.96554 18.14 90.42884 20.9 3.209765 4109.007
1088.4 1.234568 38.82103 40.84 24.8976 14.2 3.488844 5275.824
1308.1 0.6671114 76.34886 26.11 82.58347 12.7 3.641929 4926.775
3485.4 1.17421 197.8029 25.89 80.5744 17.7 1.753225 5419.784
1946.4 1.128159 35.4882 26.85 70.3593 14.9 4.76418 5208.218
1116.8 0.3858025 76.16611 16.39 225.721 21.8 4.611794 2539.369
2047.5 0.5804953 54.3063 21.42 121.5814 22 2.799651 2754.369
893.3 2.475247 7.206166 12.36 103.7574 22.9 3.882953 2083.009
1779.2 0 26.99598 45.8 78.2787 17.6 3.511719 3982.897
703.4 1.557632 3.533641 49.19 -32.1768 54.9 6.609568 4082.064
695.1 2.714932 6.912465 43.87 32.34244 19.7 6.407141 2998.161
1287 0.257732 62.02957 36.51 17.6954 15 4.470681 6310.163
573 0 13.15338 28.29 48.06679 26 2.264973 2759.142
955.6 0.3322627 151.1162 32.31 50.3237 15 3.284161 5634.305
1296.1 1.187472 70.68908 23.75 31.70579 20.6 3.450963 2992.531
1194.5 1.574803 5.132252 15.61 54.0354 15.4 2.777402 2679.454
4299.2 3.565225 169.9304 22.62 136.621 20.3 2.841627 3072.191
896.4 1.259843 66.21898 23.09 35.20083 32.9 3.420866 2381.565
815.4 4.106776 53.44192 37.42 -176.8579 23.8 6.527769 4303.525
2262.9 0.7524454 121.2754 51.96 18.3963 14.8 2.932871 4130.657
1108.8 0.996016 10.88008 38.61 22.00537 16.1 2.421065 3339.35
1314.2 1.404494 27.21783 25.78 36.989 17.4 1.432419 2533.139
1268.9 4.207574 2.172794 17.05 23.44076 16.9 2.988671 2366.595
952.8 0.8075914 94.00198 41.9 33.35452 26.6 1.970768 2031.763
763.2 0.4034815 121.7529 51.39 53.86072 24.4 2.710884 1620.14
1228.6 0 11.48006 21.33 73.45351 21.8 4.641821 5069.072
1023.5 3.527337 3.049591 20.86 84.84604 18.9 4.317273 2983.782
1646.3 1.272669 58.24525 19.83 105.2593 17.7 2.980112 3097.782
2483.3 1.195695 80.93693 30.97 160.9784 24.7 6.063038 5996.734
1045.3 0.5552471 15.81082 15.11 78.10522 22.3 3.786695 2222.728
3866.9 1.210898 70.8133 24.03 60.605 15.2 3.335582 4972.425
1450.2 0 7.217247 10.33 163.1716 29 2.623226 1860.076
Running this data file through the script should give the following output:
Explaining: Organizational Learning
Number of data points: 50
Number in set of interest: 11
Variable Coeff Std Error -- 95% Conf Interval -- Ave Value
== Run 1 ==
ACES 284.795 ± 99.831 [ 83.327 : 486.263] 1.265
Instability 3.815 ± 2.057 [ -0.336 : 7.967] 52.237
Ambiguity 9.004 ± 11.259 [ -13.718 : 31.727] 28.111
Staff Change 4.617 ± 1.758 [ 1.069 : 8.165] 55.651
Divorce Rate -1.682 ± 14.412 [ -30.766 : 27.402] 21.712
Slack -125.335 ± 88.330 [ -303.593 : 52.923] 3.643
Expenditure 0.294 ± 0.078 [ 0.136 : 0.452] 3617.571
Constant -263.506 ± 591.866 [ -1457.939 : 930.927]
Adj. R-squared = 0.363
... Runs 2-9 omitted ...
== Run 10 ==
ACES 333.752 ± 117.107 [ 97.421 : 570.083] 1.133
Instability 6.311 ± 2.421 [ 1.426 : 11.197] 63.724
Ambiguity -0.572 ± 11.657 [ -24.097 : 22.954] 30.631
Staff Change 5.445 ± 2.248 [ 0.908 : 9.981] 64.046
Divorce Rate -16.887 ± 21.001 [ -59.269 : 25.496] 23.142
Slack 5.579 ± 117.953 [ -232.460 : 243.618] 3.829
Expenditure 0.342 ± 0.107 [ 0.126 : 0.558] 3835.946
Constant 4.187 ± 877.228 [ -1766.131 : 1774.505]
Adj. R-squared = 0.630