How To... using GenomePop2 simulation tool

GENOMEPOP2 HAS BEEN AVAILABLE SINCE 2008 AND NOW IS NOT LONGER MAINTAINED.
PLEASE NOTE THAT SOME RECENT UPDATES DO NOT WORK UNDER SOME OLDER FEATURES (AS THE VIRAL MODEL).

HOW TO... DO DIFFERENT THINGS USING GENOMEPOP2

GenomePop2 (GP2) is a flexible simulation tool. Here we will explain how to apply some of its facilities. The basic input file must be called GP2Input.txt. Inside this file the lines above identifiers beginning with '#' are comments. Please note that the commentary line must begin with '#'. Comments inside normal lines are not allowed.

0.- HOW TO RUN GENOMEPOP

a) Windows: double click the .exe file. Alternatively you can run from the command line redirecting the standard output: GenomePop2.exe >> GP2output.txt.
b) Linux: Just type ./genomepop2Lnx &. The linux executable was compiled in Debian 2.4.22-openmosix-2. Source code for compilation in any other linux/unix system is provided.
c) Mac: Just type ./genomepop2Mac in the command line from the appropriate directory or double click if the shell Terminal is linked to this kind of executable file. The Mac executable was compiled under OsX 10.5.6

In any case, the output data file will be written in the directory GP2_Results that will be created if it does not exist. The GP2Input.txt must be in the same directory as the executable.

1.- HOW TO SET A VALID INPUT FILE IN GENOMEPOP

The input file must be called GP2Input.txt. In such a file include one line beginning with the word 'chromsize'. Below this line add the following 7 values corresponding to:

Chromosome Size
Number of Chromosomes
Maximum population size (Kmax)
Number of populations
Number of generations
Mutation rate per haploid genome
Recombination rate per haploid genome

chromsize    numchroms    kmax    Numpops    maxgen    genome mut Rate    Rec rate per haploid genome
1000           1          1000              4         200             0.01                 0.0

The example in red will run 1 replicate (default value) of a two allele model (the default evolution model called JC2) with 4 isolated (default migration rate is 0) populations with 1000 haploid individuals each with 1 chromosome of size 1000 linked sites. The system will evolve during 200 generations with mutation rate per genome of 0.01 (0.00001 mutation rate per site). To add more runs just add the following line begining with the word 'runs':

runs     diploid
100       false

Which will run the same haploid model but 100 replicates.

2.- HOW TO DEFINE POPULATIONS WITH DIFFERENT SIZES


chromsize    numchroms    kmax    Numpops    maxgen    genome mut Rate    Rec rate per haploid genome
1000           	1          10000     4         200             0.01                 0.0

differentPopSizes
100 20 12500

This will define a model with four populations of sizes 100, 20, 10000 and 10000 respectively. Note that the third population under the differentPopSizes tag was defined with size 12500 but the maximum population size allowed (kmax) was defined as 10000 so that 12500 is reduced to 10000. Note also that the number of items under the identifier 'differentpopsizes' was lower than the number of populationns previously defined under the first line, as in the example (3 < 4), then the lacking populations will have the defined kmax size (10,000). If the number of items is higher than the number of populations defined under the first line then only a number of 'numpops' will be read.
If there are migration between populations the population sizes may vary (see this example and this migration model ). This is allowed from version 2.7. Thus from version GP2.7, if nothing else defined, the population sizes will be Kmax. If a line differentPopSizes is defined the population sizes would be the values under this line. If migration is defined the values of population size could vary between 0 (extinction) and Kmax.

3.- HOW TO DEFINE DIFFERENT SNPs ANCESTRAL ALLELES IN DIFFERENT POPULATIONS

This will only change the identifier used for the ancestral allele. To define different SNP frequencies see point 14
NOTE that the viral model is somewhat OBSOLETE which means that recent additions could not work well under it. For example the Serial Sampling (HowTo 19) does not work under the viral model.

Biological Model
viral

chromsize  numcroms  popsizeKmax  Numpops   maxgen   haploidGenomeMutRate  Recombination
10              1           1000              2              200          0.001                         5

runs    diploid     constantMetapopSize
1       true        false

flowmodel
ISLAND

migration
0.01

recurrentmut    retromutation
true            true

different pop sizes
100  120

sample size 
20

model
jc2

# next line fixes the whole pop 1 with allele 1 and the whole pop2 with allele 2

SNP freqs
1.0  0.0

This example defines 2 populations, with size 100 and 120, under a 2-allele model (JC2). Each population with different initial allele frequencies (the 'viral' model). The frequencies are defined under the identifier 'SNP freqs'. The items under this identifier should correspond with the number of populations. If this identifier does not appear and the biological model is 'viral', the frequencies are assumed to be 0.5 in all populations. If model is not 'viral' the frequencies are 1 (only allele 1 exists initially) for every population. In the example, 10 (chromsize=10) independent (recombination=10 x 0.5 = 5) 2-allele SNPs are defined with initial frequencies of 1 (allele 1 fixed in pop 1) and 0 (allele 2 fixed in pop 2). Therefore, note that in population 1 the ancestral allele will be '1' while the ancestral will be '2' in the population 2.

Note that this model is different of the "isf" one (see howto 17) which defines different individual SNP mutant frequencies at a given population and position.

4.- HOW TO DEFINE RECOMBINATION HOT-SPOTS

chromsize  numcroms  popsizeKmax  Numpops   maxgen   haploidGenomeMutRate  Recombination
1000              1           1000              2              2000          0.001                         0.5

runs    diploid     constantMetapopSize
1       true        true

migration rate
0.01

migrationModel
SteppingStone1

# recombination hot spots are, by now, only allowed for two allele models
# Recombination at region 0 to 100 is 0.05 but at region 500-600 is 0.01. The other regions having NO recombination
# NOTE that if recombination hot spot are defined the above general parameter 'Recombination' will be ignored

hotspots
0 100 0.05
500 600 0.01

# default substitution model is Jukes Cantor with 2 alleles (JC2). The output format is genpop format.

This example defines 2 populations, with equal constant size 1000, under a 2-allele model (JC2). There are a putative number of 1000 SNPs (chromsize=1000) however recombination hot-spots are defined so that the region 0 to 100 recombine at a rate of 0.05 and region 500 to 600 at a rate of 0.01. The rest of the genome is fully linked. Note that the value of recombination (0.5) in the first line is overwritten by the hotspot line.

5.- HOW TO DEFINE SNPs UNDERGOING SELECTIVE PRESSURE

chromsize  numcroms  popsizeKmax  Numpops   maxgen   haploidGenomeMutRate  Recombination
100              1           1000    2              2000          0.001                         0.01

runs    diploid     constantMetapopSize
1       true        true

selpos	pop position s
		1		0	0.9
		1		41	0.0
		2		74	-0.5

This example defines 2 populations, with equal constant size 1000. The recombination rate through the 100 putative SNPs is 0.01. Additionally the mutations at the specified positions below the identifier 'selpos' will have a contribution to the fitness of 1-hs (if homozygous or haploid h=1 if diploid the default is h = 0.5). In population 1 a mutation in position 0 will contribute to the individual fitness with 1 - 0.9/2 if it is in heterozygous condition. This mutation will has no effect in population 2. Conversely, mutation in position 74 will contribute with 1+0.5/2 to the individual fitness in population 2 but will has no effect in population 1.

The dominance coefficient if beta is defined to be higher than 0 (see HowTo 18) comes from dominance[nucposition] = uniform()* exp(-k*selection[nucposition]) where selection[position] is the corresponding selective coefficient s. k is by default 7 and cannot be changed under the selpos model (but see HowTo 18).

6.- HOW TO SET MIGRATION

The default migration model is the island. To include migration with rate 0.01 under an island model just add:

migration
0.01

To set a stepping stone of 1 dimension add a 'flowmodel' identifier and the 'steppingstone' identifier below it. See Migration Models for further information on user-defined migration models.

migration
0.01
flowmodel
steppingstone

7.- HOW TO SAMPLE HAPLOTYPES

haplotypes
true

Different haplotypes or gametes will be sampled. This option only apply in the diploid models

8.- HOW TO DEFINE INDEPENDENTLY SEGREGATING SNPS

independentSNPs
true

This line only will apply if the model is JC2. The SNPs will segregate independently whatever the recombination value. For example, the following lines define the settings to evolve 100,000 SNPs (chromsize) with a SNP mutation rate of 10^-7 (10^-2 /10^5) during 2N generations. Note the line with the 'independentSNPs' identifier that define each SNP as unlinked.

scaling
10

chromsize	numcroms	popsizeKmax	Numpops	maxgen 	mutRate per haploid genome      Rec per haploid genome
100000		1		1000		1	2000			0.01	     0.0

independentSNPs
true

sample size
15

recurrentmut retromut 
true	true

9.- HOW TO EFFICIENTLY SIMULATE A NUMBER OF SNPs COVERING 10 Mb


runs		diploid	constantMetapopSize
1		true		true

scaling
10

chromsize	numcroms	popsizeKmax	Numpops     maxgen 	mutRate per haploid genome     Rec
1000		100		     1000		1	       1000		0.05		      0.1

recurrentmut retromut 
false	false

The above simulation took 20 seconds in a Pentium 4 (3.2 GHz). The recombination 0.1 is per genome i.e. 0.001 per chromosome. If we consider 1cM per 1Mb this implies we are covering 0.001 x 100 = 0.1 Mb per chromosome. That is, a total genome of 10 Mb. The simulation time is reduced to just 3 seconds if we use a scaling of 20. There were no multiple hits under these settings.

10.- HOW TO DEFINE BOTTLENECK AND/OR EXPANSION SCENARIOS


runs		diploid	
1		true		

chromsize	numcroms	popsizeKmax	Numpops     maxgen 	mutRate per haploid genome     Rec
10000		1		     1000		3	       1000		0.1		      0.1

CEDS
1	1 20	2 
1	250 350 2000

The above settings will define, under the CEDS identifier, a bottleneck of size N = 2 in population 1 from generation 1 to 20 and an expansion of N = 2000 in the same population from generation 250 to 350. After that, the original population size of 1000 is recovered. The user can define as many lines as desired under the CEDS identifier. At each line, first item corresponds to the population number, two next to the generation period and the last one to the desired population size.
From version 2.7.4 it is possible to define more complex models. For example:

# after the bottleneck in generation 4999 the recovering follows a continuous logistic model  with rate parameter r=2
CEDS
1	4999 5000	10 
1	5000 5001	69
1	5001 5002	355
1	5002 5003	803
1	5003 5004	968
1	5004 5005	996
1	5005 5006	999

in the input above a bottleneck occurs in generation 4999 reducing the population during one generation to ten individuals. In the following generations the population size increases following a continuous logistic model with Kmax = 1000 and a rate parameter r=2. At generation 5006 the original population size previous to the bottleneck is recovered.

11.- HOW TO SET THE ORIGINAL POPULATION SEQUENCE

The user can define the original sequence from which the evolutionary process occur. Only one sequence (Phylip format) is allowed. The identifier should be the word 'sequence', below which the name of the sequence and below this the sequence. For example:


runs		diploid	constantMetapopSize
1		true		true

chromsize	numcroms	popsizeKmax	Numpops     maxgen 	mutRate per haploid genome     Rec
10		1		     1000		3	       1000		0.1		      0.1

phylipformat
true

sequence
>seqname
ATAAAAAAAA

Note that the length of the sequence must coincide with the value under the chromsize identifier. If there are more than one population all the populations begin with the provided sequence.

12.- HOW TO COMPUTE A NEUTRAL EQUILIBRIUM AS THE BEGINNING POPULATION

An input file line beginning with the word neutraleq and the value "true" below, indicates the program to compute mutation drift equilibrium, distributing the effective number of alleles, q+1, with q = 4Nmu with mu the per site mutation rate, at frequency q = 1/(q +1) in the equilibrium population so that we will have q homozygotes and q / (q+1) heterozygotes.

neutraleq
true

13.- HOW TO SIMULATE A NEUTRAL EQUILIBRIUM AS THE BEGINNING POPULATION

Instead of computing the effective number of alleles we can simulate a scaled population under neutral conditions during 10N generations in order to reach the equilibrium. After the simulation, the program stores homozygote and heterozygote frequencies and will use such frequencies to begin the simulation process after equilibrium.

simneutraleq
true

The scaling can be used in order to improve the efficiency of the simulation. Therefore,

simscale
10

will evolve an equilibrium population of size N/10 during t/10 generations with mutation and recombination rates of 10mu and 10r respectively.

14.- HOW TO DEFINE SPECIFIC SNPs AT A GIVEN FREQUENCY

Add a line with the isf identifier (initial snp frequency) and below as much as desired lines with 3 items representing population, position and heterozygote frequency. Note that the identifier for the first population should be 1 but is 0 for the first position.

isf	popid position freq
	1	20	0.1
	1	0	0.11
	1	10	0.21
	2	20	0.1

Note that this isf model is different of the SNP freqs (see howto 3) one which define different SNPs ancestrals '1' or '2' between whole populations.

15.- HOW TO DEFINE SPECIFIC SNPs AT A GIVEN FREQUENCY AFTER AN EQUILIBRIUM

If any equilibrium line is defined (neutraleq or simneutraleq identifiers) the ÒisfÓ information will be ignored, because the initial population will be the computed under equilibrium. If the user still wants to add some SNPs after the equilibrium, an identifier ÔisfposteqÕ should be added. For example:

isfposteq	pop	pos		num indivs
		1	499999	3

will add a mutation in heterozygous condition at position 499999 in three individuals after the equilibrium population was computed.

16.- HOW TO SET THE OUTPUT FORMAT

There are three possible output formats. The default is the like-Hudson ms format (Hudson 2002) which can produce two different files: First, GP2File_Run0.txt with the diploid (haploid) SNP haplotypes of the sampled individuals for each population. The SNPs positions are also given. A second file, gameteGP2File_Run0.txt, will we generated in the diploid case if the identifier "haplotypes" is set to true (by default). This latter file includes the sampled gametes from the individuals in the previous file. The user gets as many of such files as runs were defined at the input. The second output format is GenePop 4.0 output format (Rousset 2008) and includes all positions, not only SNPs. No file with gametes is produced.
The default is:

mslike
true

If we set

genpopformat
true

the output format will also be GenePop 4.0 one. The third format is the Phylip format which just produces a file with sampled sequences (1 gamete randomly selected per individual). Thus, for each individual, if diploid, gamete 0 or 1 is randomly selected. To generate the sequence from the biallelic setting, a random nucleotide letter (A, C, G, T) is produced at each site for the ancestral allele and another for the derived. The whole sequence is produced, not only SNPs. Therefore,

phylipformat
true

will change the output format to be the Phylip one.

If both identifieres are absent or set to false the default format will be produced.

17.- HOW TO SPLIT POPULATIONS

GP2 now allows for the possibility of splitting populations to get a sample of sequences from different species under a given divergence. For the splitting the user should indicate true for the split to occur, plus a number, T₀, of generations for the initial branch length of the tree, plus the number of splitting (speciation) events and a factor allowing for ancestral (factor>1) or recent (factor<1) radiation. During the process, the computation of ancestral polymorphism at each speciation step is performed. If the split is set to false, it indicates that the split should be asymmetric and the program launches an error because this option is not yet implemented. Let

L = T₀ x Sum(kⁱ) from i=1 to n,

where T₀ is the initial branch length (in generations), n is the number of splits and k the factor to define the kind of radiation: uniform (k=1), ancestral k>1 or recent k<1. If a previous equilibrium is not computed then T₀ is the time for to the first split and the total length of the tree will be T₀ + L. If the initial population is at equilibrium then the total length of the tree is just L. In any case the maximum number of generations that should be defined for evolving the sequences until the end of the tree should be maxgen = T₀ + L. In the case of equilibrium the initial T₀ ones will be ignored. So if T₀ = 782 and n = 4 the maximum number of generations should be 5 x 782 = 3910. Note that the final number of taxa after n splits will be 2ⁿ For example:

neutraleq
true
split
true 782 4	1.0
chromsize numcroms popsizeKmax 	Numpops	 maxgen mutRate per haploid genome Rec
1000000       1      	500    		 1   3910 	0.4                 		0.00

The expected divergence D is computed simply as D = m x L, being m the mutation rate per site. The information about ancestral polymorphims occurring all along the process will be given in an file called GP2_AncVar1.xls. Additionally, at the end of the process, one sequence (gamete) by splitted population is sampled. This is repeated a number of samplesize times for each run. Each sample is writen in phylip format. For example GP2_SeqFile_Run0_Sample3.txt is the fourth sample in run 0 and should have inside as many as sequences as existing final taxa. Finally, the program also performs one intrapopulation sampling of 50 gametes from one taxa (the zeroth one). This is done again for each replicate. The name of this file is GP2_SeqFile_Run3_Pop1.txt for the fourth run. Concerning the GP2_AncVar1.xls file let explain it with a quick input case:

neutraleq
true
split
true 782 3	1.0
chromsize numcroms popsizeKmax Numpops maxgen mutRate per haploid genome 	Rec
1000000       1      	500     	1   		3128		0.1              			   0.00
sample size 
1
runs  diploid 
5	  true
scaling 
10

The expected divergence is L = 782 x 3 and m= 0.1 x 10^-6 so D = L x M = 0.0002346. The obtained output is:

Run	TPS	a0	TPS		a1		TPS		a2		TPS		Current	sd			Divergence

0	200	0	755		7.5		871.75	193.75	975.25	125.75	58.6723		0.0002575

1	200	0	677.5		4.5		740.5		78.75		862.625	58		93.8096		0.0002405

2	200	0	692.5		5		848.5		182.75	831.625	39.875	57.2002		0.0002415

3	200	0	799.5		5.5		815.25	119.25	857.75	42.125	53.7202		0.000243

4	200	0	634.5		2.5		798.5		153.5		857.625	57.25		67.1803		0.000221

where TPS refer to the absolute number of polymorphic sites (averaged through taxa if necessary) and ak to the absolute number of ancestral ones before the k+1 split. Thus, a2 is the averaged (through four taxa) number of ancestral polymorphic sites previous the third split. The last TPS and "Current" refer to the total and ancestral polymorphic sites repectively at the end of the simulation. sd is the between taxa standard deviation of the current ancestral polymorphism. "Divergence" is the mean observed divergence by sample.

18.- HOW TO DEFINE DIFFERENT SELECTIVE DNA REGIONS


simneutraleq
true
runs	diploid	
3	true	
chromsize	numchroms	Kmax	Numpops	maxgen	genome mutRate per haploid genome	Rec rate per haploid genome
1500		2		1000	1	1000	0.0001	0.0

coefsel	coefDom	beta	Epistasis
0.0	0.05	0.5	0

samplesize
10

mslike
true

selrange	pop	posini	posend	s
		1	0		1499		0.1
		1	1500		2999		0.15
		
isfposteq pop position numindivs
	1	1600	1000
	1	2000	1000

From version 2.7.2, the identifier selrange allows to define different selective regions. In the example the positions from 0 to 1499 (corresponding to chromosome 1) experience deletereous selection with a mean selective coefficient of 0.1. The positions from 1500 to 2999 (corresponding to chromosome 2) have a mean coefficient of 0.15. The selective pressure at each specific site is obtained from a gamma(alpha,beta) where alpha = beta/s. The value of beta corresponds to the third item under the line with the coefsel identifier (0.5 in this example). The value of coefsel (the first item under such line, 0.0 in the example) is ignored always when selrange or selpos (HowTo 5) identifiers are defined. If beta equals 0 the model will be constant which means that every site will have the mean selective and dominance (coefDom) coefficients. The dominance coefficient if beta is higher than 0 comes from dominance[nucposition] = uniform()* exp(-k*selection[nucposition]) where selection[nucposition] is the selective coefficient sampled from the gamma and k is by default 7 but k = alfa*(pow((2*coefdom),(-1.0/beta)) -1.0) when beta>=0.5.

19.- HOW TO DEFINE SERIAL SAMPLNG


Chromsize       Numcroms        Kmax    Numpops Maxgen  MutRate per haploid genome      Rec
1000            1               100    	1       201     0.02                            0.0

runs    diploid
1      false

sequence
>seq
ATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT

sample size
20

CEDS
1	90	91	1
1	91	92	2
1	92	93	4
1	93	94	8
1	94	95	16
1	95	96	32
1	96	97	64
1	97	98	128
1	98	99	256

# The serial sampling is defined as iniG G X. Where iniG is the first sampling, G is the step size between samplings (beginning in generation 0) and X is the sample size.
# In the example iniG=90 and G= 50 generations i.e. a sample is taken at generation 90, 100, 150 and 200. Generation 50 has been ignored because is lower than the iniG.
# The sample size is X = 100. The minimum value of X is 1 and the maximum is 2*Kmax (diploid) or Kmax (haploid).
# If a sampling is defined in a generation where the sample size is higher than the population size then the whole population is sampled with repeated individuals if necessary.
# E.g. if a sample is taken in generation 90 then this sampling point has 100 copies (X=100) of the unique individual.

serial
90	50	100

coefsel	coefDom		beta	Epistasis
0.0		0.05		0.5		0

selnuc pop 	position s
		1	49   	-1

# recall the s coefficient undel selrange cannot be negative 
selrange	pop inipos 	endpos	s
			1	1		48		0.2

phylipformat
true

migration
0.0

storeroot
true