PROSPERO: About

PROSPERO: PRediction of Outcome using Sequence & Protein Experimental Results Online 1.0


Differential Scanning Fluorimetry (T_m): tm_calc.pl

The perl script tm_calc.pl reformats and analyzes T_m curves. This document describes the requirements, input format, command line options, processing and output format for the script. The basic command,

> perl tm_calc.pl path/filename.csv

writes the XML output file path/filename.tmc.csv containing the original data and values such as T_m derived from basic analysis of the major transition. The derived values are also written in a human-readable comma-separated value table near the top, viewable e.g. in Excel. The command

> perl tm_calc.pl path/filename.csv -f

fits one or more transition models to the data for each well, and includes derived values for those models in the output file. The command

> perl tm_calc.pl path/filename.csv -w

does the same as -f and also produces a directory structure containing web pages and plots of the observed data and models:

path/filename_fit/
path/filename_fit/TM_Summary.html
path/filename_fit/TM_Details.html
path/filename_fit/1/ (for well 1)
path/filename_fit/1/filename_1_1.png (well 1, transition 1)
...

Use the -h option or see below for more documentation of command line options. PROSPERO uses the first two forms of the command for initial input and complete analysis after well selection, respectively. The third form is useful for viewing plots when using tm_calc.pl on your own computer.

Requirements for running the script

You must have Perl 5.08 or newer installed on your machine. See e.g. ActiveState Perl for free download.
On Windows, you can install Perl so that it will run the script with just "tm_calc.pl ...", rather than "perl tm_calc.pl ..."
If you want to do curve fitting, you'll need gnuplot installed from SourceForge.
See below for required input format; you must be able to read the input file's directory, and either write to it or to another file you specify for output.

Input Format

The script can read a variety of input formats:

comma-separated value (.csv) or tab-delimited files exported from Opticon Monitor (BioRad) software .tad or .tad2 files - see DSF_sample_data.csv for an example.
.csv files exported from the appropriate Transform Excel file from the Structural Genomics Consortium (SGC). Transform files are available to convert the output from 6 different RT-PCR machines from ABI, Agilent and BioRad; the author welcomes inquiries about other formats. See the manual for SGC's DSF Analysis tool for instructions on converting files into a form their tool can use. Then use either their tool or export the Transform spreadsheet as .csv and use ours. Theirs puts out the results in a 96-well format; ours puts them out in a long list, but fits multiple transitions to one curve where appropriate.
other files with temperature and fluorescence data in columns separated by any common delimiter, e.g. comma, tab, a single space character or any amount of white space (tab or blank), colon, semi-colon, vertical bar, or other common punctuation. There must be a header row with the temperature column starting with "Tem" (any case) followed by sample column headers, with data in following rows.

Files in other formats, e.g. with temperature in one row followed by sample header and fluorescence in following rows, will need to be converted e.g. by using Excel's Edit - paste special - transpose to put the data into columns instead of rows.
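If you would rather not go through Excel, the transposition can also be scripted. The following is a minimal Python sketch, not part of tm_calc.pl; the function name transpose_csv and the sample values are made up for illustration, and it assumes a simple delimited file with no header lines above the temperature row.

```python
import csv
import io

def transpose_csv(text, delimiter=","):
    """Return delimited text with rows and columns swapped."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    width = max(len(r) for r in rows)
    # Pad short rows so zip() does not silently drop trailing cells.
    padded = [r + [""] * (width - len(r)) for r in rows]
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(zip(*padded))
    return out.getvalue()

# Illustrative row-oriented input: temperatures in the first row,
# then one row per sample (values are invented).
row_oriented = "Temp,25.0,25.2,25.4\nA01,100,101,103\nA02,90,92,95\n"
column_oriented = transpose_csv(row_oriented)
print(column_oriented)
```

The result has the "Temp" header in the first column followed by one column per sample, which is the layout described above.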

Command Line Options

You can get a help message by running tm_calc.pl with no input, or with the -h or --help options. The message summarizes the options and their defaults. See below for more details on options for input, output, sample selection, fitting and plotting, fit parameters and tracking.

USAGE:
perl tm_calc.pl <in_file> | -  [-o (- | <out_file>)]  [-c <columns>] [-r|--raw]
  [-w  | --web]  |  [(-f | --fitdir) [<fit_dir>]]  [(-p | --plot) [<plot_type]]
  [(-m | --mindelta) <min_delta> ]  [-x | -exp]  [(-s | --smooth) <window>]
  [-t | --track]  [-v | --verbose]  [-d | --debug]  [--version]  [-h | --help]

>>>> You must supply either <in_file> or '-' (dash) meaning STANDARD IN. <<<<<

	Items in <angle brackets> should be replaced by the real thing,
	Items in [square brackets] are optional. "a | b" means a or b.
	-h or --help or no input specifier prints this text.
Output: XML including a csv table of derived values including Tm as T at 
max. slope and Tm as mean of T's at half max. slope, followed by raw data
wrapped in XML tags: <X>X1,X2...</X> and <SAMPLE>Y1,Y2...</SAMPLE>. 
Flags are:

-r or --raw 	Include raw data as standardized csv in the XML output.
-f or --fit 	Fit 1 or more Boltzmann curves to data using gnuplot
-p or --plot	Plot raw data, total fit, each transition and derivatives;
            	-p png puts Tm_Details.html & Tm_Summary.html in <fit_dir>.
-w or --web 	shortcut for '-f <in_name>_fit -p png' where <in_file> is
            	<in_name>.csv; puts web pages in subfolder next to <in_file>.
-x or --exp 	Use exponential decay background in curve fitting.
-t or --track  	Write the sample number and name (column heading) to STD ERR
-v or --verbose	Write some intermediate values to STD OUT
-d or --debug  	Write copious intermediate values to STD OUT
All flags are OFF by default.
Default values are:
<out_file> ... 	<in_file>.tmc.csv if <in_file> is given, or STANDARD OUTPUT
	        	    (e.g. screen) if no <in_file> is given.
<columns>  ... 	Use all sample columns.
<fit_dir>  ... 	Use a temporary fit directory e.g. /tmp/<in_file>/<column>;
               	    do NOT read existing fits or write to a named directory.
<min_delta>    	0.02 = fraction of total intensity change for smallest peak.
<window>   ...	3 = size in degrees of smoothing window	(15 points for OM RTPCR)
<plot_type>    	ps for Postscript; the only other tested option is png.
               	   Only works if -f is used

Use -w to fit and make web pages with png plots in <in_file>_fit or ./fit,
OR use -f <fit_dir> -p png to fit and make web pages in <fit_dir>.  Pages are
Tm_Summary.html with 1 plot per sample and Tm_Details.html, 3 per transition+.

If given, <columns> can be a comma-separated list of numbers, wells, or ranges,
e.g."1-3,B3-5,C10-d02". The first sample column after "Temp" is number 1
Wells match the letter and number ignoring case and leading 0: "A2" = "a02".
Ranges are number-number or well-well where well ranges go from the first
to the last column as found in the file, if both first and last are found.
	NOTE: Lists must be either WITHOUT_SPACES or "enclosed in quotes."
	Buffers with spaces and commas are not (yet) recognized.

Input file specification

The input file can be anywhere in the command line except after an option that takes an argument; it is safest to put it first, so it isn't taken as part of some other option.
You can use a dash, '-', instead of a file name to take standard input, e.g. from a linux pipe.
If you don't give either a file name or a dash, you get the above help message.

Output file specification

If you give an output file with the -o option, the script will append ".tmc.csv" if it's not already there. (The output file is actually XML, but it includes at least one comma-delimited table which can most easily be viewed on Windows machines with Excel if it's named .csv .)
If you don't give an output file with the -o option, but you do give an input file, the script will attempt to open an output file in the same directory as the input file, with the same base name but replacing the extension with ".tmc.csv". Example:
```
 > perl tm_calc.pl ../TM_data_dir/TM_data_file.csv 
```
makes file
```
  ../TM_data_dir/TM_data_file.tmc.csv
```
If you give dash for the output file, or if you don't give any output file but gave a dash for input, then the output will go to standard out, e.g. to the terminal (probably not very useful) or to a linux pipe.
The output always contains the raw data in XML form (see below for details). Use -r or --raw to also include the same data (within an XML element) in a comma-separated table that Excel can read and easily plot.
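The naming rules above can be summarized in a few lines. This is a Python sketch of the rules as described in this section, not the script's own code; the function name output_target is hypothetical, and "-" is used as the return value for standard output.

```python
import os

SUFFIX = ".tmc.csv"

def output_target(in_file=None, out_file=None):
    """Return the output path, or "-" when output goes to standard out."""
    if out_file == "-":
        return "-"
    if out_file is not None:
        # -o given: append the suffix only if it is not already there.
        return out_file if out_file.endswith(SUFFIX) else out_file + SUFFIX
    if in_file is None or in_file == "-":
        # No -o and input from standard in: write to standard out.
        return "-"
    # No -o: same directory and base name, extension replaced.
    base, _ext = os.path.splitext(in_file)
    return base + SUFFIX

print(output_target("../TM_data_dir/TM_data_file.csv"))
```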

Sample Selection: Columns

If you don't use the -c option, you get all samples in the file.
Samples can be specified with column numbers, counting from the first column to the right of Temperature as 1.
If your file has sample names starting with well position, e.g. A01 to H12, you can specify the sample by well, using either case, with or without the leading zero. So a1 matches A01.
If your file has sample names without spaces, you can use those names (buffer descriptions with commas and spaces are not supported, yet).
You can give a list of samples or sample ranges separated by commas, where a sample range is "number-number" or "well-well", but not "number-well" or "well-number". So a1,3-5,B01-c12,50 is valid; 1-a2 is not.
For well ranges, the script doesn't assume the wells are in order in the file. Instead, it looks in the file for the first well, starts there (if found), and goes until it either finds the second well or runs out of columns.
If you put spaces in the list, you need to enclose the list in quotes so it isn't split up as a separate option.
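To make the selection rules concrete, here is a Python sketch of how such a list could be resolved against the column headers. It is an illustration under stated assumptions, not the script's implementation: the names well_key and select_columns are hypothetical, selection by full sample name is left out, and the abbreviated form "B3-5" that appears in the help text is rejected because its meaning is not spelled out here.

```python
import re

WELL_RE = re.compile(r"^([A-Za-z])(\d+)")

def well_key(text):
    """Normalize a leading well label, e.g. 'a02 buffer' -> 'A2'; None if absent."""
    m = WELL_RE.match(text.strip())
    return f"{m.group(1).upper()}{int(m.group(2))}" if m else None

def select_columns(spec, headers):
    """Return 1-based sample column numbers selected by spec, in the order given."""
    wells = [well_key(h) for h in headers]
    chosen = []
    for token in spec.replace(" ", "").split(","):
        if not token:
            continue
        first, _, last = token.partition("-")
        if first.isdigit():
            if last and not last.isdigit():
                raise ValueError(f"mixed range not allowed: {token}")
            chosen.extend(range(int(first), int(last or first) + 1))
            continue
        start_key = well_key(first)
        if start_key is None:
            raise ValueError(f"not a number or well: {first}")
        if last and well_key(last) is None:
            raise ValueError(f"mixed range not allowed: {token}")
        if start_key not in wells:
            continue  # first well not found in this file
        start = wells.index(start_key)
        if not last:
            chosen.append(start + 1)
            continue
        end_key = well_key(last)
        # Walk the columns in file order until the second well or the last column.
        for i in range(start, len(wells)):
            chosen.append(i + 1)
            if wells[i] == end_key:
                break
    in_range = [n for n in chosen if 1 <= n <= len(headers)]
    return list(dict.fromkeys(in_range))  # drop duplicates, keep order

demo_headers = ["A01", "A02", "A03", "B01", "B02", "B03"]
print(select_columns("a1,3-4,B01-b3", demo_headers))
```

Dropping duplicates when ranges overlap is a choice made for this sketch; the script may behave differently.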

Fitting and Plotting: Web, Fitdir and Plot

-f or --fitdir tells gnuplot to fit your data with a model
-p or --plot tells gnuplot to plot the data and model
-w or --web does both -f and -p
In more detail:

The script always estimates T_m for the major transition and other values without curve fitting first. It only does curve fitting if you use -f or -w. If you use either -f or -w, the script sends the raw data to gnuplot for curve fitting (gnuplot worked better than the Perl implementations of the same fitting algorithm available at the time the script was developed). The script and gnuplot exchange data through intermediate files of data, models, residuals and derivatives.
- If you want to see these intermediate files, supply a <fitdir> with the -f option. Otherwise they go to /tmp or the equivalent.
- If you give the name of an existing directory for <fitdir> the script will attempt to read intermediate files from there, without re-fitting. This can be much faster if you're debugging the output, but can cause problems if you change the fitting algorithm.
If you use -f <fitdir> you can also use the -p option to create postscript (ps) or portable network graphics (png) plots of your data in <fitdir> (see below for detail).
- Use -f <fitdir> -p to get plots in <fitdir>/N for each column number N. With no plot type, the default is ps.
- Use -f <fitdir> -p png to also get web pages Tm_Summary.html and Tm_Details.html which organize the display of those plots (see below for details).
- Use -w or --web to get the same result as -f <webdir> -p png where <webdir> is <in_file>_fit if you give an <in_file>, or ./fit (in the directory you ran the script from) if not. Example:
```
 > perl tm_calc.pl ../TM_data_dir/TM_data_file.csv -w 
```
  makes the regular output file AND a directory containing web pages and subdirectories of images:
```
  ../TM_data_dir/TM_data_file.tmc.csv 
  ../TM_data_dir/TM_data_file_fit/
  ../TM_data_dir/TM_data_file_fit/Tm_Summary.html
  ../TM_data_dir/TM_data_file_fit/Tm_Details.html 
  ../TM_data_dir/TM_data_file_fit/TM_data_file_1/tm_file_1_1.png
     ...
```
  You only need to find one of the html files; it contains links to the other html file and to all the images.

Curve Fitting Parameters: Minimum Delta and Exponential Background

-m or --mindelta is the fraction of total intensity change used as a cutoff for adding transitions to the model. The default, arrived at after tuning the algorithm on ~500 samples, is 0.02. If a model explains all but 2% of the total intensity change, or if an added transition does not improve the fit by more than this fraction, fitting stops. You may wish to adjust this parameter if the script is fitting too many or too few transitions to your curve.
-x or --exp tells the script to fit an exponentially decaying background to the curve. This may be useful when you have high initial fluorescence which drops more or less regularly through the temperature range. Removing a downward sloped background can reveal upward transitions, but it can also create them. Check the resulting plots for artificial low-temperature transitions introduced into the model by steep background decay.

Tracking and other options

Use -t or --track if you have a large file e.g. a 96-well plate and you want to monitor progress through that file. The script will report the column number and name on standard error, i.e. the screen usually.
Use -v or --verbose if you want to see some of the intermediate values and steps in the analysis process. These are not likely to be very interesting for most users, but may be useful to developers.
Use -d or --debug for more intermediate values. Increasing numbers of repeats give even more values; e.g. -d -d shows the gnuplot fitting process. Useful only for developers or those who like to watch paint dry.

Data Processing

The basic steps carried out by the script are:

Parse the input file: find the delimiter, the header line, and the columns.
Preliminary analysis: for each column, find the maximum slope and half-max slope points
Fit one transition using maximum slope and half-max width as initial estimates.
Subtract the model from the data and look for max slope and width of the residual.
If the max slope is still high enough above zero, use it to fit another transition, then repeat the previous step.
Report the resulting model parameters.

In more detail:

Parsing

The script first determines if the input is a .tmc.csv file produced by this script; if so it reads in the xml fields using special processing which puts data into the same form as regular processing would. If not:

The script finds the common delimiter by scanning through a group of lines in the middle of the file and counting potential delimiter characters.
The script then looks for the header line near the top with a column starting with the letters "Tem" (disregarding case).
If columns were specified in the command line, the script then matches the columns list to the headers found on that header line to build the list of column numbers to be analyzed.
For each column, the script passes the set of temperatures and intensities to the analysis routine.
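The parsing steps can be sketched as follows. This is a simplified Python illustration, not the script's code: the candidate list, the number of lines sampled, the use of a plain total count to pick the delimiter, and the function names are all assumptions, and runs of white space are not treated as a single delimiter.

```python
CANDIDATES = [",", "\t", ";", ":", "|", " "]

def guess_delimiter(lines, sample=20):
    """Count candidate delimiters in a block of lines around the middle of the file."""
    mid = len(lines) // 2
    block = lines[max(0, mid - sample // 2): mid + sample // 2]
    totals = {d: sum(ln.count(d) for ln in block) for d in CANDIDATES}
    best = max(CANDIDATES, key=totals.get)
    return best if totals[best] > 0 else None

def find_header(lines, delimiter, search=50):
    """Find the first line near the top with a cell starting with 'Tem' (any case)."""
    for row, ln in enumerate(lines[:search]):
        cells = [c.strip().strip('"') for c in ln.split(delimiter)]
        for col, cell in enumerate(cells):
            if cell.lower().startswith("tem"):
                return row, col, cells
    return None

def read_columns(text):
    """Return (sample names, temperatures, one list of intensities per sample)."""
    lines = text.splitlines()
    delimiter = guess_delimiter(lines)
    found = find_header(lines, delimiter) if delimiter else None
    if found is None:
        raise ValueError("no delimiter or 'Tem...' header found")
    row, tcol, cells = found
    names = cells[tcol + 1:]
    temps, columns = [], [[] for _ in names]
    for ln in lines[row + 1:]:
        parts = ln.split(delimiter)[tcol: tcol + 1 + len(names)]
        try:
            values = [float(p) for p in parts]
        except ValueError:
            break  # footer or other non-numeric line ends the data
        if len(values) < len(names) + 1:
            break
        temps.append(values[0])
        for column, value in zip(columns, values[1:]):
            column.append(value)
    return names, temps, columns

# Invented example input with one header line and one footer line.
demo = "Opticon export\nTemperature;A01;A02\n25.0;100;90\n25.2;101;92\n25.4;103;95\nend of run\n"
print(read_columns(demo))
```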

Preliminary Analysis

Basic analysis of the major transition is always done, whether or not fitting is used. The steps are:

A running average is taken over 3 degrees (15 points from the Opticon Monitor RT-PCR) before taking the derivative, to reduce noise.
The maximum derivative dI/dT is found; if it is positive, the temperature at that point is T_m-max.
The temperatures above and below T_m-max with half the maximum slope are found; if there are several on one side near each other due to a noisy derivative, their average is taken. The mean of the temperatures at half-maximum slope is taken as T_m-avg. A large difference (> 2 °C) between the two estimates indicates a non-ideal transition.
The distance between the low and high half-max slope points is the full width at half maximum slope FWHM. This is used to estimate T_w, the transition width in the Boltzmann equation:
$$I = I_{min} + \frac{\Delta I}{1 + e^{(T_m - T)/T_w}}$$
Using the T_m-avg estimate, the temperatures and fluorescence at T_m and T_m - 2T_w are used to calculate ΔI and then I_min-calc.
I_max-calc is calculated from I_min + ΔI.
The difference between the calculated and observed minimum and maximum (i.e. the difference between calculated and observed ΔI) is used to determine the fraction of fluorescence change not accounted for by the major transition, R_mt.
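The first few steps above can be sketched in Python as follows. This is an illustration on synthetic data, not the script's code. Two things in it are assumptions: the half-maximum points are found by walking outward from the peak and interpolating linearly (the script averages nearby crossings in noisy data), and T_w is taken from FWHM using the relation for an ideal Boltzmann transition, whose derivative falls to half its maximum at T_m ± 2·arccosh(√2)·T_w, so FWHM ≈ 3.53 T_w. The ΔI and R_mt steps are not covered.

```python
import math

# For an ideal Boltzmann transition, FWHM of dI/dT = 4 * acosh(sqrt(2)) * T_w.
FWHM_PER_TW = 4 * math.acosh(math.sqrt(2))

def running_mean(values, width):
    """Centered moving average; the window shrinks at the ends."""
    half = width // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def derivative(temps, values):
    """Central-difference dI/dT (one-sided at the two ends)."""
    n = len(temps)
    slopes = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n - 1, i + 1)
        slopes.append((values[hi] - values[lo]) / (temps[hi] - temps[lo]))
    return slopes

def half_max_crossing(temps, slopes, peak, step):
    """Walk from the peak (step = -1 or +1) to half the maximum slope; interpolate."""
    half = slopes[peak] / 2.0
    i = peak
    while 0 <= i + step < len(slopes):
        j = i + step
        if slopes[j] <= half:
            frac = (slopes[i] - half) / (slopes[i] - slopes[j])
            return temps[i] + frac * (temps[j] - temps[i])
        i = j
    return None

def preliminary(temps, intensities, window_points=15):
    """Estimate Tm (two ways), FWHM and T_w for the major transition."""
    smooth = running_mean(intensities, window_points)
    slopes = derivative(temps, smooth)
    peak = max(range(len(slopes)), key=slopes.__getitem__)
    if slopes[peak] <= 0:
        return None  # no rising transition found
    low = half_max_crossing(temps, slopes, peak, -1)
    high = half_max_crossing(temps, slopes, peak, +1)
    if low is None or high is None:
        return None
    fwhm = high - low
    return {"Tm_max_slope": temps[peak], "Tm_avg": (low + high) / 2.0,
            "FWHM": fwhm, "T_w": fwhm / FWHM_PER_TW}

# Synthetic curve: Tm = 55, T_w = 2, sampled every 0.2 degrees.
demo_temps = [25 + 0.2 * k for k in range(351)]
demo_curve = [100 + 1000 / (1 + math.exp((55 - t) / 2.0)) for t in demo_temps]
print(preliminary(demo_temps, demo_curve))
```

Smoothing broadens the derivative peak slightly, so the FWHM and T_w recovered from the synthetic curve come out a little larger than the values used to generate it.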

Curve Fitting

When -f or -w options are used, tm_calc.pl does iterative Levenberg-Marquardt curve fitting via gnuplot:

The script looks for data anomalies such as places where the fluorescence change is too great (e.g. due to a bad sensor) or too small (due to reaching the sensor's maximum value). The program attempts to find an appropriate part of the curve without anomalies for fitting.
The derivative is smoothed using a window large enough to reduce the number of local maxima to 7 or less. In noisy data, this makes it much easier to find valid initial estimates for transitions; unsmoothed raw data is still used in the actual fitting.
If you use the -x option, the script looks for high initial fluorescence. If the curve cannot be modelled with exponential decay toward zero and rising transitions, an attempt is made to model the curve with decay toward another value, usually far below zero. This can approximate a linear background with a nearly constant rate of descent.
T_m, FWHM, I_min and ΔI are estimated as in preliminary analysis, using exponential background if any and using the smoothed derivative.
Curve fitting temperature range limits are determined (within the bounds set by any anomalies found in the first step) by scanning up and down in temperature from the estimated T_m and stopping at deviations from smoothly declining slopes.
Raw fluorescence data, initial values for the parameters to be fit, and the fit range limits are passed to gnuplot for curve fitting: a constant or exponential background and one or more transitions following the Boltzmann equation are fit to the data. Data is passed to and from gnuplot as intermediate files in <fitdir> or /tmp as specified in the -f or -w option.
The script reads the fit model parameters and the residual (observed data minus model) from gnuplot's output files.
The residual is analyzed in the same way as the raw data to find T_m, FWHM, I_min and ΔI for another transition. Fitting is repeated if possible.
When no significant positive slope is found, or when ΔI for the next transition is below the threshold set by ΔI_total*min_delta, or when an added transition does not improve the fit by more than that threshold, then several fit parameters are adjusted in an attempt to find a valid transition. Fitting stops when no further valid transition is found.
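As a rough illustration of the model and the stopping rule, here is a Python sketch. It is not the script's or gnuplot's code: the exponential background parameterization (amplitude, reference temperature, decay length) is hypothetical, and measuring "improvement of the fit" as the drop in RMSD, in intensity units, is an assumption made for the example. The parameter adjustment the script attempts before giving up is not shown.

```python
import math

def boltzmann(t, i_delta, tm, tw):
    """One transition: i_delta / (1 + e^((tm - t) / tw))."""
    exponent = min((tm - t) / tw, 700.0)  # avoid overflow far below tm
    return i_delta / (1.0 + math.exp(exponent))

def model(t, i_min, transitions, background=None):
    """Sum of transitions on a constant or exponential background.

    transitions: list of (i_delta, tm, tw) tuples.
    background:  optional (amplitude, t0, tau), a hypothetical parameterization
                 of an exponentially decaying background.
    """
    total = i_min + sum(boltzmann(t, *tr) for tr in transitions)
    if background is not None:
        amplitude, t0, tau = background
        total += amplitude * math.exp(-(t - t0) / tau)
    return total

def keep_transition(delta_i_next, delta_i_total, rmsd_before, rmsd_after,
                    min_delta=0.02):
    """Decide whether a newly fitted transition is worth keeping."""
    threshold = delta_i_total * min_delta
    if delta_i_next < threshold:
        return False  # transition too small
    # Assumption: fit improvement is measured as the drop in RMSD.
    return (rmsd_before - rmsd_after) > threshold

print(model(55.0, 100.0, [(1000.0, 55.0, 2.0)]))
print(keep_transition(200.0, 1000.0, 60.0, 25.0))
```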

Output Format: tmc.csv and html files

The main output, <outfile>.tmc.csv , is an XML file containing one or more (XML-wrapped) comma-separated value tables for human readability. Curve fitting also produces directories of intermediate files and, if requested, plots and HTML files. This document covers the main output file in detail, and briefly describes the other files.

See an example of the XML output from the command:
perl tm_calc.pl DSF_sample_data.csv -f -r
which uses the sample input file shown above, DSF_sample_data.csv.

The XML file has the following hierarchy of TAGS, attributes - and brief descriptions or links:

TM_DATA, version mindelta runtime background source dest
- HEADER - lines from original data file before column header row
- DERIVED_VALUES
  - TABLE - see description of derived values table
  - DATA
    - X_AXIS , label - comma-separated temperatures
    - Y_AXIS , label
      - SAMPLE, name well_number Tm_max_slope Tm_avg FWHM Rmt R30 I30 Itm Imax_obs transition_count major_transition max_delta_transition RMSD R_abs Imin_transition I_initial I_exp_decay data_quality_problem - see below
        - TRANSITION, number I_delta I_tm Tm sd_tm FWHM quality
        - TRANSITION ...
        - comma-separated fluorescence data
      - SAMPLE ...
- RAW_DATA - see description of raw data table
- FOOTER - lines from source file after last data row

The values of these tags, where not obvious, are:

version - tm_calc.pl version e.g. tm_calc_pl_2.11
background - either constant or exponential
source - path and name of input file for tm_calc.pl
dest - path and name of output file
TABLE - comma-separated values with a header row and data in columns, one row for each transition (one per sample unless you do fitting). You get one row per sample with these headers:
- "sample_name", well_number, Tm_max_slope, Tm_avg, FWHM, Rmt, R30, I30, Itm, Imax_obs
With fitting, you also get these:
- Number_of_Transitions, Major_Transition, Max_Delta_Transition, R_abs, I_min_fraction, I_exp_initial_fraction, I_exp_decay, Transition_Number, Tm, SD_of_Tm, dI/dT, I_delta_fraction, FWHM, Major_Transition_Flag, Transition_Quality, Data_Quality
For samples with more than one transition, the first row has both the "one per sample" values and the "one per transition" values; following rows only have the "one per transition" data. Data values are described below in the section on SAMPLE attributes.
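If you want to read the output file from your own code rather than Excel, something like the following Python sketch may be a starting point. The XML string is a hand-written, much reduced stand-in that follows the hierarchy listed above, with invented values; it is not real tm_calc.pl output, and a real file may differ in layout. The function name read_tmc is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hand-written stand-in following the documented hierarchy; values are invented.
SAMPLE_XML = """<TM_DATA version="tm_calc_pl_2.11" mindelta="0.02" background="constant">
<DERIVED_VALUES>
<TABLE>
"sample_name",well_number,Tm_max_slope,Tm_avg,FWHM
"A01",1,55.0,55.1,7.2
</TABLE>
<DATA>
<X_AXIS label="Temperature">25.0,25.2,25.4</X_AXIS>
<Y_AXIS label="Fluorescence">
<SAMPLE name="A01" well_number="1" Tm_max_slope="55.0">
<TRANSITION number="1" Tm="55.05" FWHM="7.1"/>
100,101,103
</SAMPLE>
</Y_AXIS>
</DATA>
</DERIVED_VALUES>
</TM_DATA>"""

def read_tmc(xml_text):
    """Return (file attributes, derived-value rows, temperatures, sample data)."""
    root = ET.fromstring(xml_text)
    # The derived values table is plain csv text inside the TABLE element.
    table_text = root.findtext(".//TABLE", default="").strip()
    rows = list(csv.DictReader(io.StringIO(table_text)))
    x_text = root.findtext(".//X_AXIS", default="")
    temps = [float(v) for v in x_text.split(",") if v.strip()]
    samples = {}
    for sample in root.iter("SAMPLE"):
        # Fluorescence values may sit before or after the TRANSITION elements,
        # so collect all text inside the SAMPLE element.
        text = "".join(sample.itertext())
        samples[sample.get("name")] = [float(v) for v in text.split(",") if v.strip()]
    return dict(root.attrib), rows, temps, samples

attrs, table_rows, x_values, sample_data = read_tmc(SAMPLE_XML)
print(attrs["version"], table_rows[0]["Tm_avg"], sample_data["A01"])
```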

SAMPLE - one entry for each column in the source file, or for each column specified with the -c option. The same values are in the csv table above, with similar but not always identical headers. Attributes (csv headers if different) and value descriptions are:

Attribute	Description
`name` ("sample_name")	text from column header in source file
`well_number`	1 = first column to the right of Temp.
`Tm_max_slope`	T_m as T at maximum dI/dT
`Tm_avg`	T_m as mean of T at half max. dI/dT
`FWHM`	Range of T between half max. dI/dT
`Rmt`	Ratio of minor transitions to total ΔI, i.e. 1 - ΔI_major / ΔI_obs
`R30`	Ratio of fluorescence intensity at 30 °C, I₃₀, to intensity at T_m-max, I_Tm
`I30`	Fluorescence intensity at 30 °C
`Itm`	Fluorescence intensity at T_m-max
`Imax_obs`	Maximum intensity in the whole curve

If you do fitting, you also get one of each of these attributes for each sample: transition_count, major_transition, max_delta_transition, RMSD, R_abs, Imin_transition, I_initial, I_exp_decay and data_quality_problem.

TRANSITION - one entry for each transition with values from the fit model, if fitting was done. Attributes (csv headers) and descriptions are:

Attribute	Description
`number` (Transition Number)	counting from 1 up
`Tm`	Temperature at transition midpoint, from the model
`sd_tm` (SD_of_Tm)	Standard deviation of the T_m estimate, based on the deviation of the points near T_m from the model.
(dI/dT)	(Only in csv: slope at transition midpoint)
`I_delta` (I_delta_fraction)	Fluorescence intensity change, ΔI, for this transition (for csv, I_delta as a fraction of ΔI_obs)
`FWHM`	Width of the transition at half the max. dI/dT, from the model
(Major_Transition_Flag)	(only in csv: '#' for steepest transition)
`quality` (Transition_Quality)	issues with individual transitions: high error, too wide, etc.
`I_tm` (not in csv)	Intensity at the transition midpoint, I_Tm

RAW_DATA - comma-separated values table with headers as in the source files, starting with Temp. in the first column.

For the -w option or the equivalent -f <fitdir> -p png,
you also get TM_Summary.html and TM_Details.html files. These files both
have an index at the top so you can quickly jump to any well. For each well,
they have "next", "previous" and "index" buttons for quick navigation, as
well as links to each other, to the same well in the other file. The summary
contains one plot for each well, and a table of transitions with numerical
values and text for data quality issues. The details contain 3 plots for
each attempted transition model (including the bad one after the last good
one, if any): the data as given and the model fit to it in one plot; the
data with background subtracted and each transition shown rising from zero;
and the derivative plot showing the slope maximum (or maxima).


For each column (sample) number N and each round of transition
modelling M you get plots:

  <fitdir>/<fitdir>_N_M.png 

  <fitdir>/<fitdir>_N_M_solution.png 

  <fitdir>/<fitdir>_N_M_derivative.png 

 These are the plots displayed in the HTML files.
