PROSPERO: About

PROSPERO: PRediction of Outcome using Sequence & Protein Experimental Results Online 1.0

DLS | Sequence Analysis | SEC | TM | Yield

Size Exclusion Chromatography (SEC): `sec_cacl.pl`

The perl script sec_cacl.pl reformats and analyzes T_m curves. This document describes the requirements, input format, command line options, processing and output format for the script. The basic command,

> perl sec_calc.pl path/batch_name_curve.asc

reads absorbance vs. time or volume data from the ..._curve.asc file exported by AKTA's PrimeView Evaluation or Unicorn software. If you have also exported the column run log to a documentation file, path/batch_name_doc.asc, the script will also read that file for fraction size, start and end times and other metadata. The script will then fit Gaussian peaks to the absorbance data and write an output file path/filename.sec.csv containing the peak models in both XML form and as a human-readable table of peak centers, heights, widths and areas.

If the range of pooled fractions is supplied, an estimate of pool purity (the percent of each peak in the pooled fractions) will also be given. The pooled fraction range can be given on the command line as:

> perl sec_calc.pl path/batch_name_curve.asc -f FROM -t TO

where FROM and TO are the fraction numbers for the start and end of the pooled range. For batch processing the pooled fraction range can be read from a file pooled_fractions.csv in the same directory as the other input files, with a header line of the form 'batch,from,to'. This server also allows manual entry or graphic selection of the pooled range after initial upload.

Use the -v command to include all models in the output file: models with from 1 to N or N+1 peaks, where N is the optimum number of peaks. Use -v -g tmpdir where tmpdir is the name of a new fitting directory, and the program will create postscript plots as well as the intermediate files in that directory. For an input file batch_curve.asc, plots are named batch_N.ps (total fit) and batch_N_solution.ps (individual peaks) where N is the number of peaks in the model.

Since there are not that many plots of interest (usually only the 2 for the best model), sec_cacl.pl does not have an automated method like tm_calc.pl's for generating png plots or for making web pages to display those plots.

Use the -h command or see below for more documentation of command line options .

Requirements for running the script

You must have Perl 5.08 or newer installed on your machine. See e.g. ActiveState Perl for free download.
On Windows, you can install Perl so that it will run the script with just "sec_calc.pl ...", rather than "perl sec_calc.pl ..."
To do curve fitting, you'll need gnuplot installed from SourceForge.
See below for required input format; you must be able to read the input file's directory, and either write to that directory or to another file you specify for output.

Input Format

The sec_cacl.pl script can read absorbance versus time or volume curves in a variety of formats, as long as the file has a consistent column separator (space, comma, tab, or other punctuation), a header line within the first half of the file with column headers 'min' or 'ml' followed by at least one more column, which is assumed to be absobance. For example, curves exported from PrimeView Evaluation or Unicorn software for GE LifeSciences's AKTA columns are in the form:

1
SEC X2 v3001:1_UV1_280nm
min      mAU
   0.000            0.110000
   0.373           -0.021000
   0.747           -0.008000
   1.120           -0.048000
   1.493           -0.023000
   ...

where 'mAu' are milli-absorbance units, A₂₈₀ × 10^-3. You can export this curve using either time or volume as the X axis; if your flow rate changed during fraction collection, you should use volume to avoid miscalculation of fraction positions.

The script also reads documentation (log) files in the format exported by the same software, with time in the first column and events / settings on the rest of the line - e.g.:

Logbook
    0.00 min  Method Run 1/20/2010, 2:26:44 PM, Method : SEC X2 v3, Result : c:\...\Default\SSGCID Results\SEC Purifications\SEC X2 v3055\SEC X2 v3001.res
    0.00 min  Base CV, 318.56 {ml}, HiLoad_26/60_Superdex_75_prep_grade
    0.05 min  Block Flow_Rate
    0.05 min  Base SameAsMain
    0.05 min  Flow     1.00 ml/min
    0.05 min  End Block
    ...

However, sec_cacl.pl can only interpret events as reported by PrimeView Evaluation or Unicorn. So the script can fit absorbance curves from other equipment, but will not be able to convert time or volume to fraction number so it cannot estimate pooled fraction purity. Also, the script may use inappropriate regions of the curve before and after fraction collection to fit peaks and background. You may be able to trim the curve file, truncating a copy of the data at either end to avoid fitting the uncollected data points. In some cases you may even be able to approximate the pooled fraction range using time or volume. But this would be very labor intensive. For large numbers of SEC curves, it may be worth your time to download sec_cacl.pl and modify the documentation parser subroutine parse_doc to work on your equipment. Please contact us if you need help - or if you have a modified version you're willing to share with others.

Command Line Options

You can get a help message by running sec_cacl.pl with no input, or with -h or --help options. You get a summary of options and defaults. See below for more details on options for input, output, pooled fraction range, fitting, fit parameters and tracking .


USAGE:
sec_calc.pl -b <batch> [<path>]   |   <in_file>   |   - 
    [-o <out_file>]   [-p peakmin]   [-n npeaks]
    [-f <from_frxn>] [-t <to_frxn>] [-r|--raw] [-m|--model] [-g|gnudir tmpdir]
    [-s|--slope] [-v|--verbose] [-d|--debug] [--version] [-h | --help]
where:
	Items in <angle brackets> should be replaced by the real thing,
	Items in [square brackets] are optional. "a | b" means a or b.
	-h, --help, or no input specifier prints this text.
****  You must supply a batch, input file or - meaning STANDARD INPUT  ****
If you give <in_file> as [<path>/]<batch>curve.asc, or give -b <batch> [<path>],
then it looks for 3 files from <path> if given, or from the current directory:
  - <batch>curve.asc     = absorbance vs time or volume from PrimeView),
  - <batch>doc.asc       = fraction volume and other metadata from PrimeView
  - pooled_fractions.csv = text file with comma-separted values you create with
                           a header line like "batch,from,to"
Peak purity is calculated if <from_frxn> and <to_frxn> are given on the command
line or read from pooled_fractions.csv. If <batch>doc.asc file is not found,
then metadata defaults are fraction volume=5.0 mL, flow rate=0.5 mL/min, 
start time = 160.33 min.  Other defaults are:
	<out_file> ... 	[<batch>|<in_file>].sec.csv if either is given,
			or STANDARD OUTPUT (e.g. screen) if not
	<peakmin> ... 	5, peak finder stops if next peak <5% of total area
	<npeaks> ... 	7 peaks maximum
   NO	-r or --raw	Do NOT write raw data as standardized csv
   NO	-m or --model	Do NOT write model values as standardized csv
   NO	-g or --gnudir	Do NOT read or write gnuplot intermediate files
   NO	-v or --verbose	Do NOT write peaks and residuals for all models
   NO	-s or --slope	Do NOT limit the background to a constant value
Default output: peak table, residual values & pool purity for best model only.

Input file specification

The input file or path can be anywhere in the command line except after an option that takes an argument; it is safest to put it first, so it isn't taken as part of some other option.
You can use a dash, '-', instead of a file name to take standard input, e.g. from a linux pipe.
If you don't give a file, path or dash, you get the above help message.

If you give a path and a batch identifier with -b batch path, then the script attempts to read data from path/batch_curve.asc and metadata from path/batch_doc.asc. If you do not give a batch identifier, but your input file name ends in _curve.asc, i.e. file path/filename_curve.asc, then sec_cacl.pl will read the named file and look for metadata in path/filename_doc.asc. In either case, the script will attempt to read pooled fraction ranges from path/pooled_fractions.csv.

Output file specification

If you give an output file with the -o option, the script will append ".sec.csv" if it's not already there. (The output file is actually XML, but it includes at least one comma-delimited table which can most easily be viewed on Windows machines with Excel if it's named .csv .)
If you don't give an output file with the -o option, but you do give an input file, the script will attempt to open an output file in the same directory as the input file, with the same base name but replacing the extension with ".sec.csv". Example:
```
 > perl sec_calc.pl SEC_data_path/SEC_data_file.csv 
```
makes file
```
  SEC_data_path/SEC_data_file.sec.csv
```
If you give dash for the output file, or if you don't give any output file but gave a dash for input, then the output will go to standard out, e.g. to the terminal (probably not very useful) or to a linux pipe.
The output always contains the peak parameters for the best model, but you must use -r or -m to include the raw data or the modeled absorbance in a comma-separated table that can be easily plotted in e.g. Excel or gnuplot. See output below for details.

Pooled Fraction Range

Specify range with -f FROM -t TO on the command line, or
use a file in the same directory as your input data file. For a data file path/filename_curve.asc, you need a file path/pooled_fractions.csv. This file should start with the header line:
```
batch,from,to
```
Somewhere in the file should be a line starting with filename, the batch identifier:
```
filename,n1,n2
```
The script will read this as pooled fractions from n1 to n2.

Fractions can either be specified as numbers, e.g. 1 to 48, or as well positions in a 96-well plate assuming a zigzag collection pattern: A01-A12, B12-B01, C01-C12, D12-D01 or the same for E01 to H01.

Fitting and Plotting

The -v flag causes the program to:
- write some intermediate results to STD OUT (usually the screen),
- create postscript plots of individual Gaussians and the total model,
- and to include all models attempted in the output file, not just the best model.
Use -g tmpdir where tmpdir does not yet exist to have the program put intermediate files, including plots, in the given directory.
Use -g tmpdir where tmpdir already exists to have the program read intermediate files from the given directory, without further curve fitting (mainly for debugging).

The script and gnuplot exchange information through intermediate files of data, models, and residuals in the fitting directory. Plots are also created in this directory. If you do not use -g, then this directory is in /tmp or the equivalent, and is named with a date-time stamp based on Universal Mean Time (probably not your local time unless you leave in western Europe or Africa). That directory path and name are stored in the main output file in the DIRECTORY attribute of the DERIVED_VALUES tag.

Curve Fitting Parameters: Minimum Peak, Sloped Backgound and Maximum Number of Peaks

-p peakmin sets the percent of the total area used as a minimum for adding peaks to the model.
The default, arrived at after tuning the algorithm on ~100 samples, is 5%. If a model explains all but 5% of the absorbance (observed mAu - background), or if an added peak does not improve the fit by more than this fraction, fitting stops. You may wish to tweak this parameter if the script is fitting too many or too few peaks to your curve: a lower value means more sensitivity and more peaks.
-s tells the script to use a constant background rather than a linear (possibly sloping) background.
Piece-wise linear backgrounds or more elaborate functions might be better in theory, but in our testing we found that most curves could be fit by a simple linear (sloped or constant) background. The exceptions had so much noise that no automated method was likely to find a more useful background function, and even extensive manual intervention was not likely to help.
-n npeaks is the maximum number of peaks the program will attempt to fit.
The default value is 5; the maximum possible value is currently 7.

Debugging

Use -v -v if you want to see more intermediate values and steps in the analysis process. These are not likely to be very interesting for most users, but may be useful to developers.
Use -d for still more intermediate values. Increasing numbers of -ds give even more values. Useful only for developers or those who like to watch paint dry.

Data Processing

The basic steps carried out by the script are:

Parse the curve and documentation input files.
Estimate first peak: use the point of maximum absorbance in the raw data to estimate the major peak center and height.
Fit one Gaussian peak and a linear or constant background using those peak estimates as initial values.
Subtract the model from the data to get the residual and look for the biggest peak in the residual.
Repeat: If the residual peak is big enough, add that peak's center and height to the existing model as the new initial estimate, fit with the added peak, then repeat the previous step.
Report the resulting optimal model parameters.

In more detail:

Parsing

For the curve (absorbance vs. time or volume) file:

The script finds the common delimiter by scanning through a group of lines in the middle of the file and counting potential delimiter characters.
The script then looks for the header line near the top with a column starting with the letters "min" (minute) or "ml" (milliLiter).
The column after "min" or "ml" is taken as the absorbance data to be analyzed.

For the documentation (log) file, the script looks for specific text (to find run values as shown below in parentheses):

"Method Run" (source file, run start date)
"Injection Valve Inj" (sample injection time)
"Flow [number] ml/min" (flow rate)
"Fraction Size", "Fractionation[_id]" with "[number] ml" (fraction volume, start of fraction collection)
"Break Point 3" (alternate collection start indicator)
"Break Point 4", "FractionationStop" or "Fraction Off" (end of fraction collection)
"End [number]" but not "End Block" (run end date/time)

The program then uses these values to convert from time or volume to fraction number, from 1.00 at the start of the first fraction to e.g. 48.99 at the end of the last of 48 fractions.

If you exported the curve data or the documentation using time rather than volume as the independent unit, and the flow rate changes during fraction collection, then warnings are printed to STD ERR (usually the screen) since the script will not be able to properly convert time to fraction number. This could alter both curve fitting and estimation of pool purity.

Curve Fitting

The script uses the point of highest absorbance to estimate the center and height of the first Gaussian peak. Using an arbitrary width (σ) estimate of 1 fraction, sec_cacl.pl sends the raw data and initial estimates to gnuplot via .dat files in the fitting directory. Gnuplot does up to 500 rounds of Levenberg-Marquardt curve fitting with a fit limit of 5 × 10^-6, fitting either a constant or linear background plus one Gaussian peak:

mAu = a_bg + [ X ⋅ b_bg ] + height ⋅ e ^{- (X - center)² / (2 &sigma²)}

Gnuplot then sends its results back to sec_cacl.pl: it updates the estimates file and writes a new file of residuals = observed data - model at each X value. It also generates postscript plots at this point if requested with the -v flag.

After peak fitting, sec_cacl.pl does a sanity check: it stops if the model σ is wider than 1/2 N_fractions, where N_fractions is the total number of fractions collected. After the peak is checked, sec_cacl.pl uses the residual to caclculate RMSD, R_abs, and the Bayesian Information Content, BIC:


RMSD =	√	Σ (observed - model)²

		N_points

R_abs =	Σ \| observed - model \|

	Σ \| observed - background \|

BIC = N_points ⋅ log( RMSD² ) + k log( N_points )

where N_points is the number of data points being fit (the number of points within the collected fractions), the error variance RMSD² is used as an inverse estimate of the likelihood of the model, and the number of parameters in the model k = 2 + 3G for G Gaussian peaks and a linear background. R_abs is used to judge whether further fitting iterations are needed, and BIC is used to judge which model best suits the data.

The residual is then smoothed before finding the next peak. The script takes a Gaussian-weighted running average over a window of 2 fractions (about 28 points: the first and last points have 5% of the weight of the center point). In this averageing, any negative values where the model value is higher than the observed data are treated as zero. This procedure accentuates peaks with significant area, and reduces the chance that noise will unduly influence the initial peak estimate.

The center and height of the highest peak found in the smoothed residual is then written to a new estimate file along with the current model, and sent to gnuplot for another round of fitting as above. After fitting, the peaks are sorted in ascending order of leading edge, i.e. center - 2 σ. The first peak is then sanity checked for σ < 1/2 N_fractions.

If the verbose flag -v has NOT been set, then several other checks are done to see whether or not iterative fitting should continue. These are based on the sensitivity value peakmin which is 5% unless set in the command line with the -p flag:

If the change in R_abs on adding a peak, ΔR_abs, is less than peakmin / 10, throw out the last peak and stop fitting.
If R_abs is less than peakmin, keep the last peak and stop fitting.

If neither of these criteria are met, or if the -verbose flag was set, then the script will continue fitting until it reaches the maximum number of peaks (7, unless set to a lower value with the -n flag). The script will then search among the models with ΔR_abs greater than peakmin, to find the model with the minimum BIC value. That model will be used as the "best" model. Since there are relatively few parameters per peak, BIC almost always goes down with added peaks; the ΔR_abs > peakmin criterion almost always controls the choice of model.

Normally, the script reports only the values for the "best" model. For verbose runs, the script includes a table of peak parameters for each model, and a table of the R_abs, RMSD and BIC values for each model, allowing users to plot these values vs. number of peaks and to plot each of the models to determine for themselves which model is best.

Output Format: sec.csv and ps files

The main output, <outfile>.sec.csv , is an XML file containing one or more (XML-wrapped) comma-separated value tables for human readability. Curve fitting also produces a directory of intermediate files and, if requested, postscript plots. This document covers the main output file in detail, and briefly describes the other files.

See an example. of the XML output from the command:
perl sec_calc.pl SEC_sample_curve.csv -m
which uses the sample input curve file SEC_sample_curve.asc and documentation file SEC_sample_doc.asc.

The XML file has the following overall heirarchy of TAGS :

SEC_DATA DERIVED_VALUES BEST_MODEL MODEL BACKGROUND, PEAK RAW_DATA MODEL_VALUES




The complete set of  TAGS, [ ATTRIBUTES ] ,
value (table inside XML tag) - and descriptions  are:



     SEC_DATA 
	
 Attributes: 
      
	 [ CALCULATED, VERSION  ]
		-  curve-fitting script run date and version  
 	 [ SOURCEFILE  ] -  Name of the binary results 
		(.res) file, taken from doc file's Method Run line.   
 	 [ CURVEFILE  ] -  Name of the file with
		absorbance curve  
	 [ FLOWRATE, FRACTION_VOLUME  ] 
		-  flow and fraction size numbers and units, ml/min and ml 
		 
	 [ STARTFRXNS, ENDFRXNS  ] 
		-  time or volume at which fraction collection started
		and ended, with unit  
	 [ POOL_FROM_FRACTION, POOL_TO_FRACTION  ] 
		-  fraction number or well position for pooled fraction range
		 
	 [ POOL_FROM_FRACTION_NUMBER, POOL_TO_FRACTION_NUMBER  
		] -  fraction number as entered or as converted from well 
		position
		 
	 [ OBSERVED_POINTS  ] 
		-  Number of points between start and end of fraction 
		collection  
      
	
 Contents: 
      
	  DERIVED_VALUES  
	    
 Attributes: 
	    
		 [  R1, PP1  ] -  R_abs and
			pool purity fraction after fitting 1
			Gaussian peak to the data  
		 [  SMOOTHING_WIDTH  ] -  Size of smoothing
			window in fractions (each fraction is normally 10-15
			data points)  
		 [  DIRECTORY  ] -  Location of fitting and
			plotting directory  
	    
		
  Value  -  comma-separated table of residuals, 
			one line for each model if the -verbose flag is 
			used, otherwise just one line of values for the best model. 
			Column headers are:
			
    
				NPEAKS,RABS,dRABS,RMSD,AREA,PP,BIC 
		
  where dRABS is ΔR_abs and other values are
			as in BEST_MODEL below.
			
		
	    


		  BEST_MODEL   -  model with lowest BIC 
			among all models with ΔR_abs > 
			peakmin.
			
 Attributes: 
		   
			 [  PEAKMIN  ] -  given or default
				minimum peak percentage, "0.0500" unless given
				 
			 [  NPEAKS  ] -  number of peaks in
				best model  

			 [  RABS, RMSD, PPI, BIC  ] -  
				R_abs, RMSD, pool purity, and 
				Bayesian Information Content for
				best model  

			 [  AREA  ] -  total area between
				observed data and model background for
				best model  

		   


			
  Value  -  comma-separated table of values
				for each peak in the best model. Headers are:
			
    
Peak Number,Start,End,Fraction of Total Area,Fraction of Pool Area
			
			
where Start and End are 
			peak center minus and plus 2σ, respectively.
			Last line has totals for fraction of total area and 
			fraction of pool area; both should be near 1.  
			
			 

		
		  MODEL  - best model only, unless 
			-verbose flag is set in which case all models
			are included 
		   
			 [  NPEAKS, RABS, RMSD, PP, BIC, AREA 
				] -  as for BEST_MODEL above  
			  BACKGROUND, [INTERCEPT, SLOPE ]  - 
				 one for the whole model  
			  PEAK  -  one for each peak in 
				the model 
				
 Attributes:
			   
				 [  PEAKNUM  ] -
					 peak number, from 1 up  
				 [  HEIGHT, CENTER, SIGMA  ] -
					 parameters for this peak  
				 [  AREA  ] -  proportional 
					to HEIGHT * SIGMA  
			   
			
		   
		
	    

	

	  RAW_DATA  -  only included if the -raw 
		flag is set. 
		Comma-separated table with X-axis header from the original 
		curve file, min or ml, in the first
		column, mAu in the second.
		Includes original data rows only from the region where fractions 
		were collected.  
	
	  MODEL_VALUES  -  only included if the 
		-model flag is set. 
		Comma-separated table with time or volume 
		converted to fractions, observed absorbance, total absorbance for 
		the model(s), residuals, background, and individual Gaussian peaks. 
		Only the "best" model is shown unless the 
		-verbose flag is set, in which case all models 
		are shown starting with the 1-peak model.
			The first 3 columns are:  
		    
			 fraction -  fraction number from 1.00 to 48.99 
				 
			 absorbance -  mAu = observed 
				A₂₈₀×10^-3  
			 pool -  graphic indicator of pooled fraction 
				range: max. absorbance inside the pooled range, 
				zero outside.  
		    
			 After that, a model with n peaks will have n+3 
			columns; its headers will all start Gn (e.g. G4
			for a 4-peak model): 
		    
			 Gn -  Total model, 
				n Gaussian peaks + background  
			 Gn_res -  Observed absorbance minus total model
				 
			 Gn_bg -  Background values from the Gn model
				 
			 Gn_1 -  First Gaussian peak out of n  
			 Gn_2 -  Second Gaussian peak out of n  
			 ... 
			 Gn_n -  Last Gaussian peak  
		    
			 The -verbose flag will thus produce 
			a table with headers: 
fraction,absorbance,pool,G1,G1_res,G1_bg,G1_1,G2,G2_res,G2_bg,G2_1,G2_2,G3,G3_res,G3_bg,G3_1,G3_2,G3_3,G4,...

			 which can be used e.g. by gnuplot or Excel
			to plot each model's overall fit and individual peaks.
			
	
      

   


 Plots 
With the -verbose option, the script generates postscript plots
in the fitting directory - in /tmp, or in the directory given with the
-g option. These files all start with the batch identifier 
given by -b batch_identifier on the command line, or 
with batch identifier of '0' if none is given. For each model with N peaks
you get two plots:

	  batch_identifier_N.ps - observed absorbance (red +),
total model (green dashes) and residual (blue X) vs. fraction number.
	
	  batch_identifier_N_solution.ps - observed (red +),
total model (green dashes), background (blue dashes) and individual 
peaks (dotted lines in other colors) vs. fraction number.

Size Exclusion Chromatography (SEC): sec_cacl.pl