sec_cacl.pl
The perl script sec_cacl.pl
reformats and analyzes Tm curves.
This document describes the
requirements,
input format,
command line options,
processing and
output format for the script.
The basic command,
> perl sec_calc.pl path/batch_name_curve.asc
reads absorbance vs. time or volume data from the ..._curve.asc file exported by AKTA's PrimeView
Evaluation or Unicorn software. If you have also exported the column run log to a documentation file,
path/batch_name_doc.asc, the script will also read that file for fraction size, start and end
times and other metadata. The script will then fit Gaussian peaks to the absorbance data and write an
output file
path/filename.sec.csv
containing the peak models in both XML form and as a human-readable table
of peak centers, heights, widths and areas.
If the range of pooled fractions is supplied, an estimate of pool purity (the percent of each peak in the pooled fractions) will also be given. The pooled fraction range can be given on the command line as:
> perl sec_calc.pl path/batch_name_curve.asc -f FROM
-t TO
where FROM and TO are the fraction numbers for the start and end of the pooled range. For batch processing the pooled fraction range can be read from a file pooled_fractions.csv in the same directory as the other input files, with a header line of the form 'batch,from,to'. This server also allows manual entry or graphic selection of the pooled range after initial upload.
Use the -v command to include all models in the output file: models with from 1 to N or N+1 peaks, where N is the optimum number of peaks. Use -v -g tmpdir where tmpdir is the name of a new fitting directory, and the program will create postscript plots as well as the intermediate files in that directory. For an input file batch_curve.asc, plots are named batch_N.ps (total fit) and batch_N_solution.ps (individual peaks) where N is the number of peaks in the model.
Since there are not that many plots of interest (usually only the 2 for the best model),
sec_cacl.pl
does not have an automated method like tm_calc.pl's for generating png plots or
for making web pages to display those plots.
Use the -h command or see below for more documentation of
command line options .
Requirements for running the script
sec_calc.pl ...
", rather than
"perl sec_calc.pl ...
"
The sec_cacl.pl
script can read absorbance versus time or volume curves in a variety of formats,
as long as the file has a consistent column separator (space, comma, tab, or other
punctuation), a header line within the first half of the file with column headers
'min' or 'ml' followed by at least one more column, which is assumed to be absobance.
For example, curves exported from PrimeView Evaluation or Unicorn software for
GE LifeSciences's AKTA columns are in the form:
1 SEC X2 v3001:1_UV1_280nm min mAU 0.000 0.110000 0.373 -0.021000 0.747 -0.008000 1.120 -0.048000 1.493 -0.023000 ...
where 'mAu' are milli-absorbance units, A280 × 10-3. You can export this curve using either time or volume as the X axis; if your flow rate changed during fraction collection, you should use volume to avoid miscalculation of fraction positions.
The script also reads documentation (log) files in the format exported by the same software, with time in the first column and events / settings on the rest of the line - e.g.:
Logbook 0.00 min Method Run 1/20/2010, 2:26:44 PM, Method : SEC X2 v3, Result : c:\...\Default\SSGCID Results\SEC Purifications\SEC X2 v3055\SEC X2 v3001.res 0.00 min Base CV, 318.56 {ml}, HiLoad_26/60_Superdex_75_prep_grade 0.05 min Block Flow_Rate 0.05 min Base SameAsMain 0.05 min Flow 1.00 ml/min 0.05 min End Block ...
sec_cacl.pl
can only interpret events as reported by PrimeView Evaluation or
Unicorn. So the script can fit absorbance curves from other equipment,
but will not be able to convert time or volume to fraction number so it cannot
estimate pooled fraction purity. Also, the script may use inappropriate regions of the
curve before and after fraction collection to fit peaks and background. You may be able
to trim the curve file, truncating a copy of the data at either end to avoid fitting
the uncollected data points. In some cases you may even be able to approximate the
pooled fraction range using time or volume. But this would be very labor intensive.
For large numbers of SEC curves, it may be worth your time to download
sec_cacl.pl
and modify the
documentation parser subroutine parse_doc
to work on your equipment.
Please contact us
if you need help - or if you have a modified version you're willing to share with others.
You can get a help message by running sec_cacl.pl
with no input, or with -h
or --help options. You get a summary of options and defaults.
See below for more details on options for input,
output, pooled fraction range,
fitting, fit parameters
and tracking .
USAGE: sec_calc.pl -b <batch> [<path>] | <in_file> | - [-o <out_file>] [-p peakmin] [-n npeaks] [-f <from_frxn>] [-t <to_frxn>] [-r|--raw] [-m|--model] [-g|gnudir tmpdir] [-s|--slope] [-v|--verbose] [-d|--debug] [--version] [-h | --help] where: Items in <angle brackets> should be replaced by the real thing, Items in [square brackets] are optional. "a | b" means a or b. -h, --help, or no input specifier prints this text. **** You must supply a batch, input file or - meaning STANDARD INPUT **** If you give <in_file> as [<path>/]<batch>curve.asc, or give -b <batch> [<path>], then it looks for 3 files from <path> if given, or from the current directory: - <batch>curve.asc = absorbance vs time or volume from PrimeView), - <batch>doc.asc = fraction volume and other metadata from PrimeView - pooled_fractions.csv = text file with comma-separted values you create with a header line like "batch,from,to" Peak purity is calculated if <from_frxn> and <to_frxn> are given on the command line or read from pooled_fractions.csv. If <batch>doc.asc file is not found, then metadata defaults are fraction volume=5.0 mL, flow rate=0.5 mL/min, start time = 160.33 min. Other defaults are: <out_file> ... [<batch>|<in_file>].sec.csv if either is given, or STANDARD OUTPUT (e.g. screen) if not <peakmin> ... 5, peak finder stops if next peak <5% of total area <npeaks> ... 7 peaks maximum NO -r or --raw Do NOT write raw data as standardized csv NO -m or --model Do NOT write model values as standardized csv NO -g or --gnudir Do NOT read or write gnuplot intermediate files NO -v or --verbose Do NOT write peaks and residuals for all models NO -s or --slope Do NOT limit the background to a constant value Default output: peak table, residual values & pool purity for best model only.
If you give a path and a batch identifier with -b batch path, then the
script attempts to read data from path/batch_curve.asc and metadata from
path/batch_doc.asc.
If you do not give a batch identifier, but your input file name ends in _curve.asc,
i.e. file path/filename_curve.asc, then sec_cacl.pl
will read the named file
and look for metadata in path/filename_doc.asc. In either case, the script will
attempt to read pooled fraction ranges from path/pooled_fractions.csv.
Output file specification
> perl sec_calc.pl SEC_data_path/SEC_data_file.csvmakes file
SEC_data_path/SEC_data_file.sec.csv
batch,from,toSomewhere in the file should be a line starting with filename, the batch identifier:
filename,n1,n2The script will read this as pooled fractions from n1 to n2.
Fractions can either be specified as numbers, e.g. 1 to 48, or as well
positions in a 96-well plate assuming a zigzag collection pattern: A01-A12,
B12-B01, C01-C12, D12-D01 or the same for E01 to H01.
Fitting and Plotting
The script and gnuplot exchange information through intermediate files of data, models, and residuals in the fitting directory. Plots are also created in this directory. If you do not use -g, then this directory is in /tmp or the equivalent, and is named with a date-time stamp based on Universal Mean Time (probably not your local time unless you leave in western Europe or Africa). That directory path and name are stored in the main output file in the DIRECTORY attribute of the DERIVED_VALUES tag.
Curve Fitting Parameters: Minimum Peak, Sloped Backgound and Maximum Number of Peaks
The default, arrived at after tuning the algorithm on ~100 samples, is 5%. If a model explains all but 5% of the absorbance (observed mAu - background), or if an added peak does not improve the fit by more than this fraction, fitting stops. You may wish to tweak this parameter if the script is fitting too many or too few peaks to your curve: a lower value means more sensitivity and more peaks.
Piece-wise linear backgrounds or more elaborate functions might be better in theory, but in our testing we found that most curves could be fit by a simple linear (sloped or constant) background. The exceptions had so much noise that no automated method was likely to find a more useful background function, and even extensive manual intervention was not likely to help.
The default value is 5; the maximum possible value is currently 7.
The basic steps carried out by the script are:
For the curve (absorbance vs. time or volume) file:
For the documentation (log) file, the script looks for specific text (to find run values as shown below in parentheses):
The program then uses these values to convert from time or volume to fraction number, from 1.00 at the start of the first fraction to e.g. 48.99 at the end of the last of 48 fractions.
If you exported the curve data or the documentation using time rather than volume as the independent unit, and the flow rate changes during fraction collection, then warnings are printed to STD ERR (usually the screen) since the script will not be able to properly convert time to fraction number. This could alter both curve fitting and estimation of pool purity.
The script uses the point of highest absorbance to estimate the center
and height of the first Gaussian peak. Using an arbitrary width (σ) estimate
of 1 fraction, sec_cacl.pl
sends the raw data and initial estimates to gnuplot via
.dat
files in the fitting directory.
Gnuplot does up to 500 rounds of Levenberg-Marquardt curve fitting with a fit
limit of 5 × 10-6, fitting either a constant or linear
background plus one Gaussian peak:
mAu = abg + [ X ⋅ bbg ] + height ⋅ e - (X - center)2 / (2 &sigma2)
Gnuplot then sends its results back to sec_cacl.pl
: it updates the estimates file and
writes a new file of residuals = observed data - model at each X value.
It also generates postscript plots at this point if requested with the -v flag.
After peak fitting, sec_cacl.pl
does a sanity check: it stops if the model
σ is wider than 1/2 Nfractions, where Nfractions
is the total number of fractions collected.
After the peak is checked, sec_cacl.pl
uses the residual to caclculate RMSD,
Rabs, and the Bayesian Information Content, BIC:
|
||
RMSD = | √ | Σ (observed - model)2 |
|
||
Npoints |
Rabs = | Σ | observed - model | |
|
|
Σ | observed - background | |
BIC = Npoints ⋅ log( RMSD2 ) + k log( Npoints )
where Npoints is the number of data points being fit (the number of points within the collected fractions), the error variance RMSD2 is used as an inverse estimate of the likelihood of the model, and the number of parameters in the model k = 2 + 3G for G Gaussian peaks and a linear background. Rabs is used to judge whether further fitting iterations are needed, and BIC is used to judge which model best suits the data.
The residual is then smoothed before finding the next peak. The script takes a Gaussian-weighted running average over a window of 2 fractions (about 28 points: the first and last points have 5% of the weight of the center point). In this averageing, any negative values where the model value is higher than the observed data are treated as zero. This procedure accentuates peaks with significant area, and reduces the chance that noise will unduly influence the initial peak estimate.
The center and height of the highest peak found in the smoothed residual is then written to a new estimate file along with the current model, and sent to gnuplot for another round of fitting as above. After fitting, the peaks are sorted in ascending order of leading edge, i.e. center - 2 σ. The first peak is then sanity checked for σ < 1/2 Nfractions.
If the verbose flag -v has NOT been set, then several other checks are done to see whether or not iterative fitting should continue. These are based on the sensitivity value peakmin which is 5% unless set in the command line with the -p flag:
If neither of these criteria are met, or if the -verbose flag was set, then the script will continue fitting until it reaches the maximum number of peaks (7, unless set to a lower value with the -n flag). The script will then search among the models with ΔRabs greater than peakmin, to find the model with the minimum BIC value. That model will be used as the "best" model. Since there are relatively few parameters per peak, BIC almost always goes down with added peaks; the ΔRabs > peakmin criterion almost always controls the choice of model.
Normally, the script reports only the values for the "best" model. For verbose runs, the script includes a table of peak parameters for each model, and a table of the Rabs, RMSD and BIC values for each model, allowing users to plot these values vs. number of peaks and to plot each of the models to determine for themselves which model is best.
Output Format: sec.csv and ps files
The main output, <outfile>.sec.csv
, is an XML file containing
one or more (XML-wrapped) comma-separated value tables for human readability.
Curve fitting also produces a directory of intermediate files and, if requested,
postscript plots. This document covers the main output file in detail, and
briefly describes the other files.
See an example.
of the XML output from the command:
perl sec_calc.pl SEC_sample_curve.csv -m
which uses the sample input curve file
SEC_sample_curve.asc
and documentation file
SEC_sample_doc.asc
.
The XML file has the following overall heirarchy of TAGS
:
SEC_DATA
-
DERIVED_VALUES
-
BEST_MODEL
-
MODEL
-
BACKGROUND,
-
PEAK
-
RAW_DATA
-
MODEL_VALUES
TAGS, [ ATTRIBUTES ]
,
value (table inside XML tag) - and descriptions are:
SEC_DATA
Attributes:
- [
CALCULATED, VERSION
]
- curve-fitting script run date and version
- [
SOURCEFILE
] - Name of the binary results
(.res) file, taken from doc file's Method Run line.
- [
CURVEFILE
] - Name of the file with
absorbance curve
- [
FLOWRATE, FRACTION_VOLUME
]
- flow and fraction size numbers and units, ml/min and ml
- [
STARTFRXNS, ENDFRXNS
]
- time or volume at which fraction collection started
and ended, with unit
- [
POOL_FROM_FRACTION, POOL_TO_FRACTION
]
- fraction number or well position for pooled fraction range
- [
POOL_FROM_FRACTION_NUMBER, POOL_TO_FRACTION_NUMBER
] - fraction number as entered or as converted from well
position
- [
OBSERVED_POINTS
]
- Number of points between start and end of fraction
collection
Contents:
-
DERIVED_VALUES
Attributes:
- [
R1, PP1
] - Rabs and
pool purity fraction after fitting 1
Gaussian peak to the data
- [
SMOOTHING_WIDTH
] - Size of smoothing
window in fractions (each fraction is normally 10-15
data points)
- [
DIRECTORY
] - Location of fitting and
plotting directory
Value - comma-separated table of residuals,
one line for each model if the -verbose flag is
used, otherwise just one line of values for the best model.
Column headers are:
NPEAKS,RABS,dRABS,RMSD,AREA,PP,BIC
where dRABS is ΔRabs and other values are
as in BEST_MODEL
below.
-
BEST_MODEL
- model with lowest BIC
among all models with ΔRabs >
peakmin.
Attributes:
- [
PEAKMIN
] - given or default
minimum peak percentage, "0.0500" unless given
- [
NPEAKS ] - number of peaks in
best model
- [
RABS, RMSD, PPI, BIC ] -
Rabs, RMSD, pool purity, and
Bayesian Information Content for
best model
- [
AREA ] - total area between
observed data and model background for
best model
Value - comma-separated table of values
for each peak in the best model. Headers are:
Peak Number,Start,End,Fraction of Total Area,Fraction of Pool Area
where Start
and End
are
peak center minus and plus 2σ, respectively.
Last line has totals for fraction of total area and
fraction of pool area; both should be near 1.
-
MODEL
- best model only, unless
-verbose flag is set in which case all models
are included
- [
NPEAKS, RABS, RMSD, PP, BIC, AREA
] - as for BEST_MODEL above
-
BACKGROUND, [INTERCEPT, SLOPE ]
-
one for the whole model
-
PEAK
- one for each peak in
the model
Attributes:
- [
PEAKNUM
] -
peak number, from 1 up
- [
HEIGHT, CENTER, SIGMA
] -
parameters for this peak
- [
AREA
] - proportional
to HEIGHT * SIGMA
RAW_DATA
- only included if the -raw
flag is set.
Comma-separated table with X-axis header from the original
curve file, min
or ml
, in the first
column, mAu in the second.
Includes original data rows only from the region where fractions
were collected.
MODEL_VALUES
- only included if the
-model flag is set.
Comma-separated table with time or volume
converted to fractions, observed absorbance, total absorbance for
the model(s), residuals, background, and individual Gaussian peaks.
Only the "best" model is shown unless the
-verbose flag is set, in which case all models
are shown starting with the 1-peak model.
The first 3 columns are:
fraction,absorbance,pool,G1,G1_res,G1_bg,G1_1,G2,G2_res,G2_bg,G2_1,G2_2,G3,G3_res,G3_bg,G3_1,G3_2,G3_3,G4,...which can be used e.g. by gnuplot or Excel to plot each model's overall fit and individual peaks.