Scalability Evaluation


The scalability of isa4j was assessed and compared with that of the Python isatools API along two dimensions: the number of entries and the complexity of each entry.

At the simplest complexity level (minimal), rows consisted only of a Source connected to a Sample through a Process in the Study File, and that Sample connected to a DataFile through another Process in the Assay File, with no Characteristics, Comments, or other additional information. At the second level of complexity (reduced), a Characteristic was added to the Sample in the Study File, and the Assay File was expanded to Sample->Process->Material->Process->DataFile. The third and final level of complexity (real world) was modelled after the real-world metadata published for a plant phenotyping experiment that conform to the MIAPPE v1.1 data standard (link). Exemplary ISA-Tab output for each of the three complexity levels can be found in the following section.

For each complexity level, CPU execution time was measured for writing $n$ rows in each of the Study and Assay Files, starting at 1 row and increasing stepwise up to 25,000 rows. Every combination of complexity level and $n$ was measured in 5 consecutive runs (15 for isa4j because results varied more) after a warm-up of writing 100 rows. Additionally, memory usage was measured for real-world complexity in 5 separate runs after the CPU execution time measurements.
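The warm-up-then-repeat scheme described above can be sketched in a few lines of R. This is only an illustration of the protocol, not the benchmark code used for either library: `write_rows` is a hypothetical stand-in writer, the row counts are a shortened example grid, a single warm-up before the whole grid is an assumption, and R's `proc.time()` stands in for the Python and Java timers used in the actual benchmark.

```r
# Hypothetical stand-in for an ISA-Tab writer: writes n tab-separated rows to a temp file.
write_rows <- function(n) {
  path <- tempfile(fileext = ".txt")
  on.exit(unlink(path))
  idx <- seq_len(n) - 1
  rows <- paste0("source_material-", idx, "\tsample collection\tsample_material-", idx)
  writeLines(rows, path)
  invisible(n)
}

# CPU time (user + system) consumed by the current R process, in seconds.
cpu_seconds <- function() {
  t <- proc.time()
  unname(t[["user.self"]] + t[["sys.self"]])
}

# Measure every n in n_values `runs` times, after one warm-up write (assumed: once per grid).
benchmark <- function(n_values, runs = 5, warmup_rows = 100) {
  write_rows(warmup_rows)
  results <- expand.grid(run = seq_len(runs), n.rows = n_values)
  results$cpu.seconds <- NA_real_
  for (i in seq_len(nrow(results))) {
    start <- cpu_seconds()
    write_rows(results$n.rows[i])
    results$cpu.seconds[i] <- cpu_seconds() - start
  }
  results
}

timings <- benchmark(c(1, 10, 100, 1000))
```

Measuring CPU time rather than wall-clock time keeps the numbers less sensitive to other processes running on the same machine.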

Performance evaluation was carried out on a MacBook Pro 2017 (2.3 GHz Dual-Core Intel Core i5 processor, 16 GB 2133 MHz LPDDR3 RAM) running macOS Catalina (version 10.15.2). isatools was evaluated under Python 3.7.3 [Clang 11.0.0 (clang-1100.0.33.16)] using isatools version 0.11 and memory-profiler version 0.57 for measuring RAM usage. CPU execution time was measured with time.process_time_ns. isa4j was evaluated under AdoptOpenJDK 11.0.5 using ThreadMXBean.getCurrentThreadCpuTime() and MemoryMXBean.getHeapMemoryUsage().getUsed(). For both platforms, a memory consumption baseline was calculated after the warm-up runs and an additional GC invocation. This baseline was subtracted from all subsequent memory consumption values, because the goal was to measure only the memory consumed by the ISA-Tab content, not by libraries and other periphery.

The actual code generating the files and measuring time and memory usage can be found here for Python isatools and here for isa4j.

Complexity Levels

Here you can see what the output generated for the different complexity levels looks like. It is identical between isa4j and Python isatools.

Minimal

Study File:

Source Name	Protocol REF	Sample Name
source_material-0	sample collection	sample_material-0
source_material-1	sample collection	sample_material-1
source_material-2	sample collection	sample_material-2
source_material-3	sample collection	sample_material-3

Assay File:

Sample Name	Protocol REF	Raw Data File
sample_material-0	material sequencing	sequenced-data-0
sample_material-1	material sequencing	sequenced-data-1
sample_material-2	material sequencing	sequenced-data-2
sample_material-3	material sequencing	sequenced-data-3

Reduced

Study File:

Source Name	Protocol REF	Sample Name	Characteristics[Organism]	Term Source REF	Term Accession Number
source_material-0	sample collection	sample_material-0	Homo Sapiens	NCBITaxon	http://purl.bioontology.org/ontology/NCBITAXON/9606
source_material-1	sample collection	sample_material-1	Homo Sapiens	NCBITaxon	http://purl.bioontology.org/ontology/NCBITAXON/9606
source_material-2	sample collection	sample_material-2	Homo Sapiens	NCBITaxon	http://purl.bioontology.org/ontology/NCBITAXON/9606
source_material-3	sample collection	sample_material-3	Homo Sapiens	NCBITaxon	http://purl.bioontology.org/ontology/NCBITAXON/9606

Assay File:

Sample Name	Protocol REF	Extract Name	Protocol REF	Raw Data File
sample_material-0	extraction	extract-0	sequencing	sequenced-data-0
sample_material-1	extraction	extract-1	sequencing	sequenced-data-1
sample_material-2	extraction	extract-2	sequencing	sequenced-data-2
sample_material-3	extraction	extract-3	sequencing	sequenced-data-3

Real World

Study File:

Source Name	Characteristics[Organism]	Term Source REF	Term Accession Number	Characteristics[Genus]	Term Source REF	Term Accession Number	Characteristics[Species]	Characteristics[Infraspecific Name]	Characteristics[Biological Material Latitude]	Characteristics[Biological Material Longitude]	Characteristics[Material Source ID]	Characteristics[Seed Origin]	Characteristics[Growth Facility]	Characteristics[Material Source Latitude]	Characteristics[Material Source Longitude]	Protocol REF	Parameter Value[Rooting medium]	Parameter Value[Container type]	Term Source REF	Term Accession Number	Parameter Value[Container volume]	Unit	Term Source REF	Term Accession Number	Parameter Value[Container height]	Unit	Term Source REF	Term Accession Number	Parameter Value[Number of plants per containers]	Parameter Value[pH]	Parameter Value[Air temperature Day - Stratification]	Unit	Term Source REF	Term Accession Number	Parameter Value[Air temperature Night - Stratification]	Unit	Term Source REF	Term Accession Number	Parameter Value[Average length of the light period - Stratification]	Unit	Term Source REF	Term Accession Number	Unit	Parameter Value[Type of lamps used]	Parameter Value[Average relative humidity during the light period - Stratification]	Unit	Term Source REF	Term Accession Number	Parameter Value[Average relative humidity during the dark period - Stratification]	Unit	Term Source REF	Term Accession Number	Parameter Value[Air temperature Day - Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Air temperature Night - Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Average length of the light period - Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Light intensity - Germination]	Unit	Parameter Value[Average relative humidity during the light period - Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Average relative humidity during the dark period - Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Air temperature Day - Post Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Air temperature Night - Post Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Average length of the light period - Post Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Light intensity - Post Germination]	Unit	Parameter Value[Average relative humidity during the light period - Post Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Average relative humidity during the dark period - Post Germination]	Unit	Term Source REF	Term Accession Number	Parameter Value[Watering regimen]	Parameter Value[Composition of nutrient solutions used for irrigation]	Sample Name	Characteristics[Observation Unit Type]	Factor Value[Soil Cover]	Factor Value[Plant Movement]
Plant_0	Arabidopsis thaliana	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3702	Arabidopsis	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3701	thaliana	NA	51.82772	11.27778	http://eurisco.ipk-gatersleben.de/apex/f?p=103:16:::NO::P16_EURISCO_ACC_ID:1668187	http://arabidopsis.info/StockInfo?NASC_id=22680	small LemnaTec phytochamber	51.82772	11.27778	Growth	85% (v) red substrate 1 (Klasmann-Deilmann GmbH, Geeste, Germany) / 15% (v) sand)	pot	AGRO	http://purl.obolibrary.org/obo/AGRO_00000309	0.43	litre	UO	http://purl.obolibrary.org/obo/UO_0000099	0.08	m	UO	http://purl.obolibrary.org/obo/UO_0000008	1	5.5	5	s	UO	http://purl.obolibrary.org/obo/UO_0000027	5	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	24	h	UO	http://purl.obolibrary.org/obo/UO_0000032	µmol m-2 s-1	Whitelux Plus metal halide lamps (Venture Lighting Europe Ltd., Rickmansworth, Hertfordshire, England)	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	16	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	14	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	20	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	18	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	initial watering before germination from bottom, then top irrigation	water	1135FA-0	plant	covered	rotating
Plant_1	Arabidopsis thaliana	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3702	Arabidopsis	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3701	thaliana	NA	51.82772	11.27778	http://eurisco.ipk-gatersleben.de/apex/f?p=103:16:::NO::P16_EURISCO_ACC_ID:1668187	http://arabidopsis.info/StockInfo?NASC_id=22680	small LemnaTec phytochamber	51.82772	11.27778	Growth	85% (v) red substrate 1 (Klasmann-Deilmann GmbH, Geeste, Germany) / 15% (v) sand)	pot	AGRO	http://purl.obolibrary.org/obo/AGRO_00000309	0.43	litre	UO	http://purl.obolibrary.org/obo/UO_0000099	0.08	m	UO	http://purl.obolibrary.org/obo/UO_0000008	1	5.5	5	s	UO	http://purl.obolibrary.org/obo/UO_0000027	5	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	24	h	UO	http://purl.obolibrary.org/obo/UO_0000032	µmol m-2 s-1	Whitelux Plus metal halide lamps (Venture Lighting Europe Ltd., Rickmansworth, Hertfordshire, England)	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	16	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	14	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	20	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	18	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	initial watering before germination from bottom, then top irrigation	water	1135FA-1	plant	uncovered	stationary
Plant_2	Arabidopsis thaliana	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3702	Arabidopsis	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3701	thaliana	NA	51.82772	11.27778	http://eurisco.ipk-gatersleben.de/apex/f?p=103:16:::NO::P16_EURISCO_ACC_ID:1668187	http://arabidopsis.info/StockInfo?NASC_id=22680	small LemnaTec phytochamber	51.82772	11.27778	Growth	85% (v) red substrate 1 (Klasmann-Deilmann GmbH, Geeste, Germany) / 15% (v) sand)	pot	AGRO	http://purl.obolibrary.org/obo/AGRO_00000309	0.43	litre	UO	http://purl.obolibrary.org/obo/UO_0000099	0.08	m	UO	http://purl.obolibrary.org/obo/UO_0000008	1	5.5	5	s	UO	http://purl.obolibrary.org/obo/UO_0000027	5	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	24	h	UO	http://purl.obolibrary.org/obo/UO_0000032	µmol m-2 s-1	Whitelux Plus metal halide lamps (Venture Lighting Europe Ltd., Rickmansworth, Hertfordshire, England)	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	16	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	14	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	20	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	18	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	initial watering before germination from bottom, then top irrigation	water	1135FA-2	plant	covered	rotating
Plant_3	Arabidopsis thaliana	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3702	Arabidopsis	NCBITaxon	http://purl.obolibrary.org/obo/NCBITaxon_3701	thaliana	NA	51.82772	11.27778	http://eurisco.ipk-gatersleben.de/apex/f?p=103:16:::NO::P16_EURISCO_ACC_ID:1668187	http://arabidopsis.info/StockInfo?NASC_id=22680	small LemnaTec phytochamber	51.82772	11.27778	Growth	85% (v) red substrate 1 (Klasmann-Deilmann GmbH, Geeste, Germany) / 15% (v) sand)	pot	AGRO	http://purl.obolibrary.org/obo/AGRO_00000309	0.43	litre	UO	http://purl.obolibrary.org/obo/UO_0000099	0.08	m	UO	http://purl.obolibrary.org/obo/UO_0000008	1	5.5	5	s	UO	http://purl.obolibrary.org/obo/UO_0000027	5	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	24	h	UO	http://purl.obolibrary.org/obo/UO_0000032	µmol m-2 s-1	Whitelux Plus metal halide lamps (Venture Lighting Europe Ltd., Rickmansworth, Hertfordshire, England)	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	90	%	UO	http://purl.obolibrary.org/obo/UO_0000187	16	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	14	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	75	%	UO	http://purl.obolibrary.org/obo/UO_0000187	20	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	18	°C	UO	http://purl.obolibrary.org/obo/UO_0000027	16	h	UO	http://purl.obolibrary.org/obo/UO_0000032	140	µmol m-2 s-1	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	60	%	UO	http://purl.obolibrary.org/obo/UO_0000187	initial watering before germination from bottom, then top irrigation	water	1135FA-3	plant	uncovered	stationary

Assay File:

Sample Name	Protocol REF	Parameter Value[Imaging Time]	Parameter Value[Camera Configuration]	Parameter Value[Camera Sensor]	Parameter Value[Camera View]	Parameter Value[Imaging Angle]	Unit	Term Source REF	Term Accession Number	Derived Data File	Comment[Image analysis tool]
1135FA-0	Phenotyping	28.09.2011 12:34:37	A_Fluo_Side_Big_Plant	FLUO	side	90	degree	UO	http://purl.obolibrary.org/obo/UO_0000185	derived_data_files/das_0.txt	IAP
1135FA-1	Phenotyping	28.09.2011 12:34:37	A_Fluo_Side_Big_Plant	FLUO	side	90	degree	UO	http://purl.obolibrary.org/obo/UO_0000185	derived_data_files/das_1.txt	IAP
1135FA-2	Phenotyping	28.09.2011 12:34:37	A_Fluo_Side_Big_Plant	FLUO	side	90	degree	UO	http://purl.obolibrary.org/obo/UO_0000185	derived_data_files/das_2.txt	IAP
1135FA-3	Phenotyping	28.09.2011 12:34:37	A_Fluo_Side_Big_Plant	FLUO	side	90	degree	UO	http://purl.obolibrary.org/obo/UO_0000185	derived_data_files/das_3.txt	IAP

Results

The raw results can be found here if you want to perform your own analyses.

data = read.csv("performance_data.csv")
data[data$memory.usage.in.mb == -1,]$memory.usage.in.mb = NA # Where RAM usage was not measured it was set to -1
data$time.in.ns.log = log(data$time.in.ns/1e+9, 10)
data$n.rows.log     = log(data$n.rows, 10)
data$memory.usage.in.kb.log = log(data$memory.usage.in.mb*1024, 10) # convert to KB so all transformed values are above 0
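
To see what the transformation does, the same steps can be applied to a few made-up rows. The values below are invented for illustration and are not taken from `performance_data.csv`. A time of 10^9 ns maps to 0 on the log scale (1 s), and converting to KB before taking the logarithm keeps a sub-megabyte measurement above 0, whereas `log10(0.5)` on the MB scale would be negative.

```r
# Made-up example rows with the same column names as performance_data.csv.
toy <- data.frame(
  n.rows             = c(1, 100, 25000),
  time.in.ns         = c(1e6, 1e9, 1e11),
  memory.usage.in.mb = c(0.5, -1, 512)
)

# -1 marks runs without a memory measurement.
toy$memory.usage.in.mb[toy$memory.usage.in.mb == -1] <- NA

toy$time.in.ns.log         <- log10(toy$time.in.ns / 1e9)           # seconds on a log10 scale
toy$n.rows.log             <- log10(toy$n.rows)
toy$memory.usage.in.kb.log <- log10(toy$memory.usage.in.mb * 1024)  # KB on a log10 scale

print(toy)
```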

This is the visualization that is also part of the paper:

data$color = "black"
data[data$row.complexity == "real_world",]$color = "#e69f00"
data[data$row.complexity == "reduced",]$color = "#0072b2"
data[data$row.complexity == "minimal",]$color = "#61BEF3" #56B4E9
col.gray = "gray52"
col.green.dark = "#B8CDC8" #DBF3ED 
col.green.light = "#E7F1EF" #EEF8F6 

#pdf("figure.pdf", 6.92913, 3.4, colormodel="srgb")
par(family="serif", cex=0.7, mar=c(4.5,3.8,0,0), fig=c(0,1,0.2,1))
xlim = c(0, 6.4)
plot(data$time.in.ns.log ~ data$n.rows.log, xlim=xlim, col=data$color, axes=F, xlab=expression("Number of Rows (log"[10]~"Scale)"), ylab="", col.lab=col.gray)
axis(1, col=F, col.tick=col.gray, at=log(c(1,3,5,10,25,50,100,250,500,1000,2500,5000,10000,25000,50000,100000,250000,500000,1000000), 10), labels=c(1,3,5,10,25,50,100,250,500,"1k","2.5k","5k","10k","25k","50k","100k","250k","500k","1 Mio"), col.axis=col.gray)
axis(2, las=2, at=c(seq(-3,2), log(600,10), log(3600,10), log(28800,10)), labels=c("1 ms","10 ms", "100 ms", "1 s", "10 s", "100 s", "10 m", "1 h", "8 h"), col=F, col.axis=col.gray)

text(0, 4, expression("CPU Execution Time (log"[10]~"Scale)"), pos=4, cex=1.5, family="sans")
mtext("isatools", side=2, at=-0.7, line=-1, cex=0.7)
mtext("isa4j", side=2, at=-2.5, line=-1, cex=0.7)
mtext("|", side=2, at=max(data[data$platform == "isatools" & data$row.complexity == "real_world",]$time.in.ns.log), col="#e69f00", cex=0.5)
mtext("|", side=2, at=max(data[data$platform == "isatools" & data$row.complexity == "reduced",]$time.in.ns.log), col="#0072b2", cex=0.5)
mtext("|", side=2, at=max(data[data$platform == "isatools" & data$row.complexity == "minimal",]$time.in.ns.log), col="#61BEF3", cex=0.5)
mtext("|", side=2, at=max(data[data$platform == "isa4J" & data$row.complexity == "real_world",]$time.in.ns.log), col="#e69f00", cex=0.5)
mtext("|", side=2, at=max(data[data$platform == "isa4J" & data$row.complexity == "reduced",]$time.in.ns.log), col="#0072b2", cex=0.5)
mtext("|", side=2, at=max(data[data$platform == "isa4J" & data$row.complexity == "minimal",]$time.in.ns.log), col="#61BEF3", cex=0.5)

sub = data[data$row.complexity == "real_world" & data$platform == "isatools",]
t = tapply(sub$time.in.ns.log, sub$n.rows.log, FUN=median)
lines(as.numeric(names(t)), t, col="#e69f00", type="b")
text(max(sub$n.rows.log), max(sub$time.in.ns.log), "Real World", pos=4, col="#e69f00", cex=0.7)
text(max(sub$n.rows.log), max(sub$time.in.ns.log)-0.23, "Complexity", pos=4, col="#e69f00", cex=0.7)

sub = data[data$row.complexity == "reduced" & data$platform == "isatools",]
t = tapply(sub$time.in.ns.log, sub$n.rows.log, FUN=median)
lines(as.numeric(names(t)), t, col="#0072b2", type="b")
text(max(sub$n.rows.log), max(sub$time.in.ns.log), "Reduced", pos=4, col="#0072b2", cex=0.7)

sub = data[data$row.complexity == "minimal" & data$platform == "isatools",]
t = tapply(sub$time.in.ns.log, sub$n.rows.log, FUN=median)
lines(as.numeric(names(t)), t, col="#61BEF3", type="b")
text(max(sub$n.rows.log), max(sub$time.in.ns.log), "Minimal", pos=4, col="#61BEF3", cex=0.7)

sub = data[data$row.complexity == "real_world" & data$platform == "isa4J",]
t = tapply(sub$time.in.ns.log, sub$n.rows.log, FUN=median)
lines(as.numeric(names(t)), t, col="#e69f00", type="b")

sub = data[data$row.complexity == "reduced" & data$platform == "isa4J",]
t = tapply(sub$time.in.ns.log, sub$n.rows.log, FUN=median)
lines(as.numeric(names(t)), t, col="#0072b2", type="b")

sub = data[data$row.complexity == "minimal" & data$platform == "isa4J",]
t = tapply(sub$time.in.ns.log, sub$n.rows.log, FUN=median)
lines(as.numeric(names(t)), t, col="#61BEF3", type="b")

# Memory Plot
par(fig=c(0,1,0,0.2), mar=c(0.2,3.8,0,0), new=T)
memSub = data[data$row.complexity == "real_world",]

memSub.isatools = memSub[memSub$platform == "isatools",]
memSub.isa4J = memSub[memSub$platform == "isa4J",]

memSub.isatools.medians = aggregate(memSub.isatools$memory.usage.in.kb.log, by=list(memSub.isatools$n.rows), FUN=median)
memSub.isa4J.medians = aggregate(memSub.isa4J$memory.usage.in.kb.log, by=list(memSub.isa4J$n.rows), FUN=median)

plot(-memSub$memory.usage.in.kb.log ~ memSub$n.rows.log, type="n", axes=F, xlim=xlim, xlab="", ylab="")


polygon(
  c(log(memSub.isatools.medians$Group.1, 10), max(log(memSub.isatools.medians$Group.1, 10)), min(log(memSub.isatools.medians$Group.1, 10))),
  -c(memSub.isatools.medians$x, min(memSub$memory.usage.in.kb.log), min(memSub$memory.usage.in.kb.log) ),
  col=col.green.light, border=NA) #DBF3ED

polygon(
  c(log(memSub.isa4J.medians$Group.1, 10), max(log(memSub.isa4J.medians$Group.1, 10)), min(log(memSub.isa4J.medians$Group.1, 10))),
  -c(memSub.isa4J.medians$x, min(memSub$memory.usage.in.kb.log), min(memSub$memory.usage.in.kb.log) ),
  col=col.green.dark, border=NA) #A1D7CA

text(0.01, -5.3, expression("Memory Usage for Real World Complexity (log"[10]~"Scale)"), pos=4, col=col.gray)

#text(0, 0.4, "isa4J", pos=2, xpd=NA, col=col.gray, cex=0.8)
#text(0, -1.6, "isatools", pos=2, xpd=NA, col=col.gray, cex=0.8)

text(log(1000000,10), -2.5, paste("isa4j \n  ",round(10^min(memSub.isa4J.medians$x)/1024, 1), "-", round(10^max(memSub.isa4J.medians$x)/1024, 1),"MB"), pos=4, xpd=NA, col=col.gray, cex=0.6)

text(log(1000000,10), -5.3, paste("isatools \n  ", round(10^min(memSub.isatools.medians$x)/1024, 1), " MB -\n  ", round(10^max(memSub.isatools.medians$x)/1024/1024, 1),"GB"), pos=4, xpd=NA, col=col.gray, cex=0.6)

#dev.off()

Regression Models

To make quantitative statements about scalability it can be helpful to fit some regression models.

python isatools

The Python isatools curves all appear to become linear beyond 100 rows and to run parallel to one another, so we can fit a simple regression model without an interaction term.

sub = data[data$platform == "isatools" & data$n.rows >= 100,]
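# factor() maps the complexity levels to palette colours even when read.csv() returns strings (R >= 4.0)
plot(sub$time.in.ns.log ~ sub$n.rows.log, col=factor(sub$row.complexity), xlab=expression("log"[10]("Number of Rows")), ylab=expression("log"[10]("CPU Execution time [s]")))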
model.isatools = lm(time.in.ns.log ~ n.rows.log + row.complexity, data=sub)
abline(model.isatools)
abline(model.isatools$coefficients[1]+model.isatools$coefficients[4], model.isatools$coefficients[2], col="green")
abline(model.isatools$coefficients[1]+model.isatools$coefficients[3], model.isatools$coefficients[2], col="red")

summary(model.isatools)

## 
## Call:
## lm(formula = time.in.ns.log ~ n.rows.log + row.complexity, data = sub)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.018671 -0.008867 -0.002369  0.006046  0.149223 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -2.3695999  0.0042764  -554.1   <2e-16 ***
## n.rows.log                0.9850514  0.0009361  1052.3   <2e-16 ***
## row.complexityreal_world  0.9412709  0.0028597   329.2   <2e-16 ***
## row.complexityreduced     0.3942198  0.0028485   138.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01624 on 190 degrees of freedom
## Multiple R-squared:  0.9998,	Adjusted R-squared:  0.9998 
## F-statistic: 4.038e+05 on 3 and 190 DF,  p-value: < 2.2e-16

Looks pretty good! What can we learn from it?

Increasing the number of rows 10-fold increases the required CPU execution time $10^{0.9850514} \approx 9.6616523$-fold.
Increasing the complexity from minimal to reduced increases execution time $10^{0.3942198} \approx 2.478676$-fold, and increasing the complexity from minimal to real world increases it $10^{0.9412709} \approx 8.7351616$-fold.
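
These factors follow directly from the model form: because both the response and the row count are on a log10 scale, an additive coefficient b corresponds to a multiplicative factor of 10^b. A small sketch that reproduces the numbers from the coefficient table above; the vector `coef.isatools` is typed in by hand here so that the snippet runs without the data file.

```r
# Point estimates copied from summary(model.isatools).
coef.isatools <- c(
  "(Intercept)"              = -2.3695999,
  "n.rows.log"               =  0.9850514,
  "row.complexityreal_world" =  0.9412709,
  "row.complexityreduced"    =  0.3942198
)

# Multiplicative effect on CPU time of a one-unit step in each predictor
# (one unit of n.rows.log is a 10-fold increase in rows).
fold.change <- 10^coef.isatools[-1]
print(round(fold.change, 4))
```

With the fitted model in the workspace, `10^coef(model.isatools)[-1]` gives the same values.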

isa4j

Now let’s repeat the same analyses for the isa4j performance data. We will again assume linearity and parallel lines for more than 100 rows.

sub = data[data$platform == "isa4J" & data$n.rows >= 100,]
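# factor() maps the complexity levels to palette colours even when read.csv() returns strings (R >= 4.0)
plot(sub$time.in.ns.log ~ sub$n.rows.log, col=factor(sub$row.complexity), xlab=expression("log"[10]("Number of Rows")), ylab=expression("log"[10]("CPU Execution time [s]")))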
model.isa4J = lm(time.in.ns.log ~ n.rows.log + row.complexity, data=sub)
abline(model.isa4J)
abline(model.isa4J$coefficients[1]+model.isa4J$coefficients[4], model.isa4J$coefficients[2], col="green")
abline(model.isa4J$coefficients[1]+model.isa4J$coefficients[3], model.isa4J$coefficients[2], col="red")

summary(model.isa4J)

## 
## Call:
## lm(formula = time.in.ns.log ~ n.rows.log + row.complexity, data = sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.17913 -0.02842  0.00477  0.03270  0.32036 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -4.133611   0.009329 -443.08   <2e-16 ***
## n.rows.log                0.811649   0.002041  397.72   <2e-16 ***
## row.complexityreal_world  0.802811   0.006229  128.88   <2e-16 ***
## row.complexityreduced     0.169827   0.006229   27.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06151 on 581 degrees of freedom
## Multiple R-squared:  0.9967,	Adjusted R-squared:  0.9967 
## F-statistic: 5.888e+04 on 3 and 581 DF,  p-value: < 2.2e-16

This model does not fit as well as the isatools one because there is a lot more variation in the data and the curve is not perfectly linear at some points (for example, the JVM compiles bytecode into native machine code after a certain number of repetitions). For simplicity’s sake we will accept the model though and assume it is good enough for our purposes.

So, the same calculations as above:

Increasing the number of rows 10-fold increases the required CPU execution time $10^{0.8116486} \approx 6.4810982$-fold.
Increasing the complexity from minimal to reduced increases execution time $10^{0.1698272} \approx 1.47852$-fold, and increasing the complexity from minimal to real world increases it $10^{0.8028115} \approx 6.3505519$-fold.

We can see that isa4j scales better with the number of rows (roughly 6.5-fold versus 9.7-fold per 10-fold increase in rows) and also copes better with increasing complexity of rows.
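
To put the two fits side by side, the point estimates can be turned into predicted CPU times. The helper below is a sketch: `predict_seconds` is a name introduced here, the coefficients are copied by hand from the two summaries above, and the predictions ignore the uncertainty of the fits and are only meaningful in the fitted range of 100 to 25,000 rows.

```r
# Point estimates copied from the two model summaries (response in log10 seconds).
coefs <- list(
  isatools = c(intercept = -2.3695999, slope = 0.9850514, real_world = 0.9412709, reduced = 0.3942198),
  isa4j    = c(intercept = -4.133611,  slope = 0.811649,  real_world = 0.802811,  reduced = 0.169827)
)

# Predicted CPU time in seconds for writing n rows at a given complexity level.
predict_seconds <- function(platform, n, complexity = c("minimal", "reduced", "real_world")) {
  complexity <- match.arg(complexity)
  b <- coefs[[platform]]
  offset <- if (complexity == "minimal") 0 else b[[complexity]]  # minimal is the baseline level
  10^(b[["intercept"]] + b[["slope"]] * log10(n) + offset)
}

for (n in c(100, 25000)) {
  cat(n, "rows, real world:",
      signif(predict_seconds("isatools", n, "real_world"), 3), "s (isatools) vs",
      signif(predict_seconds("isa4j", n, "real_world"), 3), "s (isa4j)\n")
}
```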

Direct Comparison

Now let’s try a direct comparison of both libraries for real world complexity. The slopes are not the same so we need an interaction term here.

sub = data[data$row.complexity == "real_world" & data$n.rows >= 100,]
# colour by platform: every row in this subset has the same complexity level
plot(sub$time.in.ns.log ~ sub$n.rows.log, col=factor(sub$platform), xlab=expression("log"[10]("Number of Rows")), ylab=expression("log"[10]("CPU Execution time [s]")))
model.both = lm(time.in.ns.log ~ n.rows.log * platform, data=sub)
abline(model.both)
abline(model.both$coefficients[1]+model.both$coefficients[3], model.both$coefficients[2]+model.both$coefficients[4], col="red")

summary(model.both)

## 
## Call:
## lm(formula = time.in.ns.log ~ n.rows.log * platform, data = sub)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.072842 -0.023789 -0.002461  0.021472  0.080930 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 -3.363660   0.008204 -410.01   <2e-16 ***
## n.rows.log                   0.819803   0.001945  421.51   <2e-16 ***
## platformisatools             1.914702   0.016463  116.30   <2e-16 ***
## n.rows.log:platformisatools  0.170395   0.003918   43.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03385 on 255 degrees of freedom
## Multiple R-squared:  0.9995,	Adjusted R-squared:  0.9995 
## F-statistic: 1.808e+05 on 3 and 255 DF,  p-value: < 2.2e-16

OK, the models look good enough, now we can make actual comparisons. Since the slopes of the lines are different, isa4j is going to become relatively faster the more rows we write:

When writing 100 lines, isa4j is $10^{1.9147021 + 0.1703948 \cdot \log_{10}(100)} \approx 180.0909175$ times faster.
When writing 25,000 lines, isa4j is $10^{1.9147021 + 0.1703948 \cdot \log_{10}(25000)} \approx 461.4115139$ times faster.
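
Both numbers come from the same expression. In the interaction model the isa4J line is the baseline, so the difference between the isatools and isa4J predictions on the log10 scale is the `platformisatools` offset plus the `n.rows.log:platformisatools` slope difference times log10(n). Raising 10 to that difference gives the ratio of CPU times. A minimal sketch with the two coefficients typed in by hand (`speedup` is a name introduced here for illustration):

```r
b.platform    <- 1.9147021  # platformisatools: intercept difference (log10 seconds)
b.interaction <- 0.1703948  # n.rows.log:platformisatools: slope difference

# Ratio of predicted isatools CPU time to predicted isa4j CPU time for n rows.
speedup <- function(n) 10^(b.platform + b.interaction * log10(n))

print(round(speedup(c(100, 1000, 25000))))
```

Like the model itself, the ratio should only be read within the fitted range of 100 to 25,000 rows.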

Conclusion

There are two take-aways from this:

isa4j scales considerably better when the complexity of rows increases (1.47852-fold and 6.3505519-fold increases for isa4j, compared to 2.478676-fold and 8.7351616-fold for isatools).
The more lines are written, the faster isa4j becomes relative to isatools (about 180 times faster for 100 lines, about 461 times faster for 25,000 lines).
