DLS | Sequence Analysis | SEC | TM | Yield

Size Exclusion Chromatography (SEC): sec_cacl.pl

The perl script sec_cacl.pl reformats and analyzes Tm curves. This document describes the requirements, input format, command line options, processing and output format for the script. The basic command,

    > perl sec_calc.pl path/batch_name_curve.asc

reads absorbance vs. time or volume data from the ..._curve.asc file exported by AKTA's PrimeView Evaluation or Unicorn software. If you have also exported the column run log to a documentation file, path/batch_name_doc.asc, the script will also read that file for fraction size, start and end times and other metadata. The script will then fit Gaussian peaks to the absorbance data and write an output file path/filename.sec.csv containing the peak models in both XML form and as a human-readable table of peak centers, heights, widths and areas.

If the range of pooled fractions is supplied, an estimate of pool purity (the percent of each peak in the pooled fractions) will also be given. The pooled fraction range can be given on the command line as:

    > perl sec_calc.pl path/batch_name_curve.asc -f FROM -t TO

where FROM and TO are the fraction numbers for the start and end of the pooled range. For batch processing the pooled fraction range can be read from a file pooled_fractions.csv in the same directory as the other input files, with a header line of the form 'batch,from,to'. This server also allows manual entry or graphic selection of the pooled range after initial upload.

Use the -v command to include all models in the output file: models with from 1 to N or N+1 peaks, where N is the optimum number of peaks. Use -v -g tmpdir where tmpdir is the name of a new fitting directory, and the program will create postscript plots as well as the intermediate files in that directory. For an input file batch_curve.asc, plots are named batch_N.ps (total fit) and batch_N_solution.ps (individual peaks) where N is the number of peaks in the model.

Since there are not that many plots of interest (usually only the 2 for the best model), sec_cacl.pl does not have an automated method like tm_calc.pl's for generating png plots or for making web pages to display those plots.

Use the -h command or see below for more documentation of command line options .

Requirements for running the script

Input Format

The sec_cacl.pl script can read absorbance versus time or volume curves in a variety of formats, as long as the file has a consistent column separator (space, comma, tab, or other punctuation), a header line within the first half of the file with column headers 'min' or 'ml' followed by at least one more column, which is assumed to be absobance. For example, curves exported from PrimeView Evaluation or Unicorn software for GE LifeSciences's AKTA columns are in the form:


1
SEC X2 v3001:1_UV1_280nm
min      mAU
   0.000            0.110000
   0.373           -0.021000
   0.747           -0.008000
   1.120           -0.048000
   1.493           -0.023000
   ...

where 'mAu' are milli-absorbance units, A280 × 10-3. You can export this curve using either time or volume as the X axis; if your flow rate changed during fraction collection, you should use volume to avoid miscalculation of fraction positions.

The script also reads documentation (log) files in the format exported by the same software, with time in the first column and events / settings on the rest of the line - e.g.:


Logbook
    0.00 min  Method Run 1/20/2010, 2:26:44 PM, Method : SEC X2 v3, Result : c:\...\Default\SSGCID Results\SEC Purifications\SEC X2 v3055\SEC X2 v3001.res
    0.00 min  Base CV, 318.56 {ml}, HiLoad_26/60_Superdex_75_prep_grade
    0.05 min  Block Flow_Rate
    0.05 min  Base SameAsMain
    0.05 min  Flow     1.00 ml/min
    0.05 min  End Block
    ...

However, sec_cacl.pl can only interpret events as reported by PrimeView Evaluation or Unicorn. So the script can fit absorbance curves from other equipment, but will not be able to convert time or volume to fraction number so it cannot estimate pooled fraction purity. Also, the script may use inappropriate regions of the curve before and after fraction collection to fit peaks and background. You may be able to trim the curve file, truncating a copy of the data at either end to avoid fitting the uncollected data points. In some cases you may even be able to approximate the pooled fraction range using time or volume. But this would be very labor intensive. For large numbers of SEC curves, it may be worth your time to download sec_cacl.pl and modify the documentation parser subroutine parse_doc to work on your equipment. Please contact us if you need help - or if you have a modified version you're willing to share with others.

Command Line Options

You can get a help message by running sec_cacl.pl with no input, or with -h or --help options. You get a summary of options and defaults. See below for more details on options for input, output, pooled fraction range, fitting, fit parameters and tracking .



USAGE:
sec_calc.pl -b <batch> [<path>]   |   <in_file>   |   - 
    [-o <out_file>]   [-p peakmin]   [-n npeaks]
    [-f <from_frxn>] [-t <to_frxn>] [-r|--raw] [-m|--model] [-g|gnudir tmpdir]
    [-s|--slope] [-v|--verbose] [-d|--debug] [--version] [-h | --help]
where:
	Items in <angle brackets> should be replaced by the real thing,
	Items in [square brackets] are optional. "a | b" means a or b.
	-h, --help, or no input specifier prints this text.
****  You must supply a batch, input file or - meaning STANDARD INPUT  ****
If you give <in_file> as [<path>/]<batch>curve.asc, or give -b <batch> [<path>],
then it looks for 3 files from <path> if given, or from the current directory:
  - <batch>curve.asc     = absorbance vs time or volume from PrimeView),
  - <batch>doc.asc       = fraction volume and other metadata from PrimeView
  - pooled_fractions.csv = text file with comma-separted values you create with
                           a header line like "batch,from,to"
Peak purity is calculated if <from_frxn> and <to_frxn> are given on the command
line or read from pooled_fractions.csv. If <batch>doc.asc file is not found,
then metadata defaults are fraction volume=5.0 mL, flow rate=0.5 mL/min, 
start time = 160.33 min.  Other defaults are:
	<out_file> ... 	[<batch>|<in_file>].sec.csv if either is given,
			or STANDARD OUTPUT (e.g. screen) if not
	<peakmin> ... 	5, peak finder stops if next peak <5% of total area
	<npeaks> ... 	7 peaks maximum
   NO	-r or --raw	Do NOT write raw data as standardized csv
   NO	-m or --model	Do NOT write model values as standardized csv
   NO	-g or --gnudir	Do NOT read or write gnuplot intermediate files
   NO	-v or --verbose	Do NOT write peaks and residuals for all models
   NO	-s or --slope	Do NOT limit the background to a constant value
Default output: peak table, residual values & pool purity for best model only.

Input file specification

If you give a path and a batch identifier with -b batch path, then the script attempts to read data from path/batch_curve.asc and metadata from path/batch_doc.asc. If you do not give a batch identifier, but your input file name ends in _curve.asc, i.e. file path/filename_curve.asc, then sec_cacl.pl will read the named file and look for metadata in path/filename_doc.asc. In either case, the script will attempt to read pooled fraction ranges from path/pooled_fractions.csv.

Output file specification

Pooled Fraction Range

Fractions can either be specified as numbers, e.g. 1 to 48, or as well positions in a 96-well plate assuming a zigzag collection pattern: A01-A12, B12-B01, C01-C12, D12-D01 or the same for E01 to H01.

Fitting and Plotting

The script and gnuplot exchange information through intermediate files of data, models, and residuals in the fitting directory. Plots are also created in this directory. If you do not use -g, then this directory is in /tmp or the equivalent, and is named with a date-time stamp based on Universal Mean Time (probably not your local time unless you leave in western Europe or Africa). That directory path and name are stored in the main output file in the DIRECTORY attribute of the DERIVED_VALUES tag.

Curve Fitting Parameters: Minimum Peak, Sloped Backgound and Maximum Number of Peaks

Debugging



Data Processing

The basic steps carried out by the script are:

  1. Parse the curve and documentation input files.
  2. Estimate first peak: use the point of maximum absorbance in the raw data to estimate the major peak center and height.
  3. Fit one Gaussian peak and a linear or constant background using those peak estimates as initial values.
  4. Subtract the model from the data to get the residual and look for the biggest peak in the residual.
  5. Repeat: If the residual peak is big enough, add that peak's center and height to the existing model as the new initial estimate, fit with the added peak, then repeat the previous step.
  6. Report the resulting optimal model parameters.
In more detail:

Parsing

For the curve (absorbance vs. time or volume) file:

For the documentation (log) file, the script looks for specific text (to find run values as shown below in parentheses):

The program then uses these values to convert from time or volume to fraction number, from 1.00 at the start of the first fraction to e.g. 48.99 at the end of the last of 48 fractions.

If you exported the curve data or the documentation using time rather than volume as the independent unit, and the flow rate changes during fraction collection, then warnings are printed to STD ERR (usually the screen) since the script will not be able to properly convert time to fraction number. This could alter both curve fitting and estimation of pool purity.

Curve Fitting

The script uses the point of highest absorbance to estimate the center and height of the first Gaussian peak. Using an arbitrary width (σ) estimate of 1 fraction, sec_cacl.pl sends the raw data and initial estimates to gnuplot via .dat files in the fitting directory. Gnuplot does up to 500 rounds of Levenberg-Marquardt curve fitting with a fit limit of 5 × 10-6, fitting either a constant or linear background plus one Gaussian peak:

mAu = abg + [ X ⋅ bbg ] + height ⋅ e - (X - center)2 / (2 &sigma2)

Gnuplot then sends its results back to sec_cacl.pl: it updates the estimates file and writes a new file of residuals = observed data - model at each X value. It also generates postscript plots at this point if requested with the -v flag.

After peak fitting, sec_cacl.pl does a sanity check: it stops if the model σ is wider than 1/2 Nfractions, where Nfractions is the total number of fractions collected. After the peak is checked, sec_cacl.pl uses the residual to caclculate RMSD, Rabs, and the Bayesian Information Content, BIC:


RMSD = Σ (observed - model)2

Npoints

Rabs = Σ | observed - model |

Σ | observed - background |

BIC = Npoints ⋅ log( RMSD2 ) + k log( Npoints )

where Npoints is the number of data points being fit (the number of points within the collected fractions), the error variance RMSD2 is used as an inverse estimate of the likelihood of the model, and the number of parameters in the model k = 2 + 3G for G Gaussian peaks and a linear background. Rabs is used to judge whether further fitting iterations are needed, and BIC is used to judge which model best suits the data.

The residual is then smoothed before finding the next peak. The script takes a Gaussian-weighted running average over a window of 2 fractions (about 28 points: the first and last points have 5% of the weight of the center point). In this averageing, any negative values where the model value is higher than the observed data are treated as zero. This procedure accentuates peaks with significant area, and reduces the chance that noise will unduly influence the initial peak estimate.

The center and height of the highest peak found in the smoothed residual is then written to a new estimate file along with the current model, and sent to gnuplot for another round of fitting as above. After fitting, the peaks are sorted in ascending order of leading edge, i.e. center - 2 σ. The first peak is then sanity checked for σ < 1/2 Nfractions.

If the verbose flag -v has NOT been set, then several other checks are done to see whether or not iterative fitting should continue. These are based on the sensitivity value peakmin which is 5% unless set in the command line with the -p flag:

If neither of these criteria are met, or if the -verbose flag was set, then the script will continue fitting until it reaches the maximum number of peaks (7, unless set to a lower value with the -n flag). The script will then search among the models with ΔRabs greater than peakmin, to find the model with the minimum BIC value. That model will be used as the "best" model. Since there are relatively few parameters per peak, BIC almost always goes down with added peaks; the ΔRabs > peakmin criterion almost always controls the choice of model.

Normally, the script reports only the values for the "best" model. For verbose runs, the script includes a table of peak parameters for each model, and a table of the Rabs, RMSD and BIC values for each model, allowing users to plot these values vs. number of peaks and to plot each of the models to determine for themselves which model is best.

Output Format: sec.csv and ps files

The main output, <outfile>.sec.csv , is an XML file containing one or more (XML-wrapped) comma-separated value tables for human readability. Curve fitting also produces a directory of intermediate files and, if requested, postscript plots. This document covers the main output file in detail, and briefly describes the other files.

See an example. of the XML output from the command:
    perl sec_calc.pl SEC_sample_curve.csv -m
which uses the sample input curve file SEC_sample_curve.asc and documentation file SEC_sample_doc.asc.

The XML file has the following overall heirarchy of TAGS :

The complete set of TAGS, [ ATTRIBUTES ] , value (table inside XML tag) - and descriptions are:
  • SEC_DATA
    Attributes:
    • [ CALCULATED, VERSION ] - curve-fitting script run date and version
    • [ SOURCEFILE ] - Name of the binary results (.res) file, taken from doc file's Method Run line.
    • [ CURVEFILE ] - Name of the file with absorbance curve
    • [ FLOWRATE, FRACTION_VOLUME ] - flow and fraction size numbers and units, ml/min and ml
    • [ STARTFRXNS, ENDFRXNS ] - time or volume at which fraction collection started and ended, with unit
    • [ POOL_FROM_FRACTION, POOL_TO_FRACTION ] - fraction number or well position for pooled fraction range
    • [ POOL_FROM_FRACTION_NUMBER, POOL_TO_FRACTION_NUMBER ] - fraction number as entered or as converted from well position
    • [ OBSERVED_POINTS ] - Number of points between start and end of fraction collection

    Contents:
    • DERIVED_VALUES
      Attributes:
      • [ R1, PP1 ] - Rabs and pool purity fraction after fitting 1 Gaussian peak to the data
      • [ SMOOTHING_WIDTH ] - Size of smoothing window in fractions (each fraction is normally 10-15 data points)
      • [ DIRECTORY ] - Location of fitting and plotting directory

      Value - comma-separated table of residuals, one line for each model if the -verbose flag is used, otherwise just one line of values for the best model. Column headers are:
          NPEAKS,RABS,dRABS,RMSD,AREA,PP,BIC
      where dRABS is ΔRabs and other values are as in BEST_MODEL below.

      • BEST_MODEL - model with lowest BIC among all models with ΔRabs > peakmin.
        Attributes:
        • [ PEAKMIN ] - given or default minimum peak percentage, "0.0500" unless given
        • [ NPEAKS ] - number of peaks in best model
        • [ RABS, RMSD, PPI, BIC ] - Rabs, RMSD, pool purity, and Bayesian Information Content for best model
        • [ AREA ] - total area between observed data and model background for best model

        Value - comma-separated table of values for each peak in the best model. Headers are:
            Peak Number,Start,End,Fraction of Total Area,Fraction of Pool Area
        where Start and End are peak center minus and plus 2σ, respectively. Last line has totals for fraction of total area and fraction of pool area; both should be near 1.

      • MODEL - best model only, unless -verbose flag is set in which case all models are included
        • [ NPEAKS, RABS, RMSD, PP, BIC, AREA ] - as for BEST_MODEL above
        • BACKGROUND, [INTERCEPT, SLOPE ] - one for the whole model
        • PEAK - one for each peak in the model
          Attributes:
          • [ PEAKNUM ] - peak number, from 1 up
          • [ HEIGHT, CENTER, SIGMA ] - parameters for this peak
          • [ AREA ] - proportional to HEIGHT * SIGMA
    • RAW_DATA - only included if the -raw flag is set. Comma-separated table with X-axis header from the original curve file, min or ml, in the first column, mAu in the second. Includes original data rows only from the region where fractions were collected.
    • MODEL_VALUES - only included if the -model flag is set. Comma-separated table with time or volume converted to fractions, observed absorbance, total absorbance for the model(s), residuals, background, and individual Gaussian peaks. Only the "best" model is shown unless the -verbose flag is set, in which case all models are shown starting with the 1-peak model. The first 3 columns are:
      • fraction - fraction number from 1.00 to 48.99
      • absorbance - mAu = observed A280×10-3
      • pool - graphic indicator of pooled fraction range: max. absorbance inside the pooled range, zero outside.
      After that, a model with n peaks will have n+3 columns; its headers will all start Gn (e.g. G4 for a 4-peak model):
      • Gn - Total model, n Gaussian peaks + background
      • Gn_res - Observed absorbance minus total model
      • Gn_bg - Background values from the Gn model
      • Gn_1 - First Gaussian peak out of n
      • Gn_2 - Second Gaussian peak out of n
      • ...
      • Gn_n - Last Gaussian peak
      The -verbose flag will thus produce a table with headers:
      fraction,absorbance,pool,G1,G1_res,G1_bg,G1_1,G2,G2_res,G2_bg,G2_1,G2_2,G3,G3_res,G3_bg,G3_1,G3_2,G3_3,G4,...
      
      which can be used e.g. by gnuplot or Excel to plot each model's overall fit and individual peaks.

Plots

With the -verbose option, the script generates postscript plots in the fitting directory - in /tmp, or in the directory given with the -g option. These files all start with the batch identifier given by -b batch_identifier on the command line, or with batch identifier of '0' if none is given. For each model with N peaks you get two plots:
  • batch_identifier_N.ps - observed absorbance (red +), total model (green dashes) and residual (blue X) vs. fraction number.
  • batch_identifier_N_solution.ps - observed (red +), total model (green dashes), background (blue dashes) and individual peaks (dotted lines in other colors) vs. fraction number.