Comparison of running time and memory usage with other methods.

To verify that DESC scales to large datasets, we analyzed a dataset of 1.3 million mouse brain cells generated by 10X Genomics. With the aid of an NVIDIA TITAN Xp GPU, DESC finished the analysis in about 3.5 hours using less than 10 GB of memory (Supplementary Figs. 16,17 and Supplementary Note 7).

To evaluate how running time and memory usage change as the number of cells and the number of batches vary, we randomly selected different numbers of cells from datasets with two, four, and 30 batches. For the two-batch scenario, we used the data from Kang et al.¹ with CTRL and STIM defined as two different batches, and we randomly selected 1,000, 2,000, 5,000, 8,000, 10,000, and 20,000 cells from the original 24,679 cells to evaluate the impact of the number of cells. For the four-batch and 30-batch scenarios, we used the data from Peng et al.², which contain four batches if macaque id is taken as the batch definition and 30 batches if sample id is taken as the batch definition. For both scenarios, we randomly selected 1,000, 2,000, 5,000, 8,000, 10,000, 20,000, and 30,000 cells from the original 30,302 cells to evaluate the impact of the number of cells.
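
The subsampling design for the two-batch scenario can be sketched in R as follows. This is only an illustration under stated assumptions: the seed, the variable names (`n_total`, `sizes`, `subsamples`), and the choice to draw each subsample independently rather than nested are not specified in the text.

```r
# Illustrative sketch only; the seed and variable names are assumptions.
set.seed(2020)
n_total <- 24679                                    # cells in the Kang et al. dataset
sizes   <- c(1000, 2000, 5000, 8000, 10000, 20000)  # subsample sizes from the text

# Draw each subsample independently, without replacement, as cell indices.
subsamples <- lapply(sizes, function(n) sample.int(n_total, size = n))
names(subsamples) <- as.character(sizes)

# Number of selected cells per subsample.
vapply(subsamples, length, integer(1))
```

Each element of `subsamples` would then be used to index the cells of the expression matrix before running a method.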

Our results indicate that the running times of CCA, Seurat 3.0 and MNN increase exponentially as the number of cells and the number of batches increase (Fig. 9). Moreover, Seurat 3.0 and BERMUDA threw an error and stopped running when the number of cells within a batch was small (Fig. 9h,i). In contrast, the memory usage and running time of DESC are not affected by the number of batches and increase linearly with the number of cells. It is also worth noting that the computing time of scVI and BERMUDA increases substantially when the number of batches increases from two to 30, even when the number of cells remains the same. More specifically, when the 30 samples were taken as the batch definition in the macaque retina dataset (i.e., 30 batches), the computing time for the analysis of 30,000 cells is about 0.3 hours for DESC, 9.5 hours for MNN, 10.4 hours for scVI, 11.2 hours for BERMUDA, and 0.4 hours for scanorama. Furthermore, BERMUDA requires more than 32 GB of memory even for a dataset of only 30,000 cells when the number of batches is 30. Methods such as CCA and Seurat 3.0 failed due to memory issues and produced no results when the number of batches is 30. These computing constraints will limit the practical utility of these methods in large-scale single-cell studies, especially those with many batches.
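
One way to check a scaling claim such as "increases linearly" is to fit a line on the log-log scale: a slope near 1 suggests linear growth in the number of cells, and a larger slope suggests polynomial growth of higher order. The sketch below uses synthetic timings generated from known formulas, not measurements from this study, and `scaling_exponent` is a hypothetical helper.

```r
# Hypothetical helper: slope of log(time) against log(number of cells).
scaling_exponent <- function(n_cells, seconds) {
  fit <- lm(log(seconds) ~ log(n_cells))
  unname(coef(fit)[2])
}

n_cells <- c(1000, 2000, 5000, 8000, 10000, 20000, 30000)

# Synthetic timings (NOT measured values): one linear, one quadratic in n.
t_linear    <- 0.04 * n_cells
t_quadratic <- 1e-5 * n_cells^2

scaling_exponent(n_cells, t_linear)     # close to 1
scaling_exponent(n_cells, t_quadratic)  # close to 2
```

Truly exponential growth would not give a constant log-log slope; it would instead appear as a straight line when `log(seconds)` is plotted against `n_cells` itself.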

.libPaths(c("/home/xiaoxiang/R/x86_64-pc-linux-gnu-library/3.5",.libPaths()))
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(RColorBrewer))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(gridExtra))
suppressPackageStartupMessages(library(grid))
fig_path="/home/xiaoxiang/Documents/DESC_paper_prepare/DESC_paper_final/formal_revised/figures_sep/"

## set the color palette we may use
gg_color_hue <- function(n) {
  hues = seq(15, 375, length = n + 1)
  hcl(h = hues, l = 65, c = 100)[1:n]
}
op=par(mar=c(5,4,6,4))
color_celltype=c('#B903AA','#1CE6FF','#FF34FF','#FF4A46','#008941','#006FA6','#7A4900','#63FFAC','#B79762','#004D43','#8FB0FF','#997D87','#5A0007','#809693','#FEFFE6','#1B4400','#4FC601','#3B5DFF','#4A3B53','#FF2F80','#61615A','#BA0900','#6B7900','#00C2A0','#FFAA92','#FF90C9','#B903AA')
image(1:length(color_celltype),1, as.matrix(1:length(color_celltype)),col=color_celltype,ylab="",xlab="",axes=F)
axis(3,at=seq(1:length(color_celltype)),labels=color_celltype,las=2,lwd=0)

par(op)

old=theme_set(theme_bw()+theme(strip.background = element_rect(fill="white"),
                               axis.line.x = element_line(size=0.6),
                               axis.line.y=element_line(size=0.6),
                               panel.border = element_rect(fill=NA,color="white"),
                                         panel.grid =element_blank()))
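
The plotting chunks in this report repeatedly pair the first few palette entries with the levels of a method factor. A compact sketch of that pattern, assuming a shortened stand-in palette and an illustrative set of method names (`palette_demo`, `methods_demo` and `colors_demo` are hypothetical), is:

```r
# Shortened stand-in for color_celltype; method names are illustrative.
palette_demo <- c("#B903AA", "#1CE6FF", "#FF34FF", "#FF4A46")
methods_demo <- factor(c("scVI", "DESC", "scVI", "Seurat3.0"),
                       levels = c("DESC", "scVI", "Seurat3.0"))

# Take one color per factor level and name it by that level.
colors_demo <- setNames(palette_demo[seq_along(levels(methods_demo))],
                        levels(methods_demo))
colors_demo
```

Passing a named vector to `scale_fill_manual(values = ...)` lets ggplot2 match colors to levels by name rather than by position.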

Memory usage and running time for the 1.3 million mouse brain cell dataset

load("time_memory_new.RData")
colnames(time_final)[1]="time_consume"

time_final$methods=factor(time_final$methods,levels=c("DESC","DESC_GPU","DESC_multicpu","scVI","scVI_GPU","scVI_multicpu","Seurat3.0"),
                         labels=c("DESC","DESC_GPU","DESC_multicpu","scVI","scVI_GPU","scVI_multicpu","Seurat3.0"))
df_time=time_final
tmp=as.character(df_time$samplesize)
tmp[tmp=="all"]=1.3e6
df_time$samplesize=as.integer(tmp)
#df_time=subset(df_time,select=names(table(df_time$methods))[names(table(df_time$methods))%in%])

df0=aggregate(mem_final,by = list(paste0(as.character(mem_final$methods),mem_final$samplesize)),FUN=max)
tmp=as.character(df0$samplesize)
tmp[tmp=="all"]=1.3e6
df0$samplesize=as.integer(tmp)

dfnew_time=subset(df_time,select = c("time_consume","methods","samplesize"))
dfnew_time$time_consume=as.numeric(dfnew_time$time_consume)
tmp=dfnew_time$methods
dfnew_time$methods=factor(tmp)
dfnew_time$samplesize.norm=dfnew_time$samplesize
#prettyNum(c(123,1234),big.mark=",", preserve.width="none")# preserve width
dfnew_time$samplesize=factor(format(as.numeric(as.character(dfnew_time$samplesize.norm)),big.mark=",", preserve.width="none"))
colors_use=color_celltype[1:length(levels(dfnew_time$methods))]
names(colors_use)=levels(dfnew_time$methods)

p=ggplot(data=dfnew_time,aes(x=factor(samplesize),y=time_consume,fill=methods,width=0.85))+
  geom_bar(position= position_dodge(width = 0.85,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Time (seconds)")+
  scale_y_continuous(expand = c(0,0,0.05,0),labels=comma)+
  theme(axis.text.x = element_text(angle=30,hjust=1))
p=p+ggtitle("")+scale_fill_manual(values=colors_use)+
  theme(legend.key.size = unit(0.6,"cm"))
p_time=p+theme(legend.position = c(0.35,0.78))
#p_time
#save_plot(filename = "~/figure3/fig3h_horizontal.pdf",p+theme_mem+theme(legend.key.size = unit(1.2,"cm"))+theme(legend.position ="none"),base_height = 11,base_width = 18)

df0=aggregate(mem_final,by = list(paste0(mem_final$methods,mem_final$samplesize)),FUN=max)
df0=subset(df0,select = c("memory_use","methods","samplesize"))
df0$methods=factor(df0$methods,levels=c("DESC","DESC_GPU","DESC_multicpu","scVI","scVI_GPU","scVI_multicpu","Seurat3.0"),
                         labels=c("DESC","DESC_GPU","DESC_multicpu","scVI","scVI_GPU","scVI_multicpu","Seurat3.0"))
tmp=as.character(df0$samplesize)
tmp[tmp=="all"]=1.3e6
df0$samplesize=as.integer(tmp)

df0$samplesize.norm=df0$samplesize
df0$samplesize=factor(format(as.numeric(as.character(df0$samplesize.norm)),big.mark=",", preserve.width="none"))
colors_use=color_celltype[1:length(levels(df0$methods))]
names(colors_use)=levels(df0$methods)

p=ggplot(data=df0,aes(x=factor(samplesize),y=memory_use,fill=methods,width=0.85))+
  geom_bar(position= position_dodge(width = 0.9,preserve = "single"),stat="identity")+xlab("Number of cells")+ylab("Memory (GB)")+scale_y_continuous(expand = c(0,0,0.05,0),labels=comma)+theme(axis.text.x = element_text(angle=30,hjust=1))
p=p+ggtitle("")+scale_fill_manual(values=colors_use)+theme(legend.key.size = unit(0.6,"cm"))
p_mem=p+theme(legend.position = c(0.65,0.78))
#p_mem
#save_plot(filename = "~/figure3/fig3I_horizontal.pdf",p+theme_mem+theme(legend.key.size = unit(1.2,"cm"))+theme(legend.position ="none"),base_height = 11,base_width = 18)

p=egg::ggarrange(p_time+scale_x_discrete(expand = c(0.02,0.02))+theme(axis.text = element_text(size=10))+
                   theme(legend.position = "none",plot.margin =unit(c(0,0,0,0),"cm"))+xlab(""),
                 p_mem+scale_x_discrete(expand = c(0.02,0.02))+theme(axis.text = element_text(size=10))+
                   theme(legend.position = "none",plot.margin =unit(c(0,0,0,0),"cm")),ncol=1,draw=F,
                 labels = c("a","b"),
                       label.args = list(gp = grid::gpar(cex =3,col="black",fontface="bold")))

p_legend=get_legend(p_time+theme(legend.position = "top")+
                      guides(fill = guide_legend(override.aes = list(alpha = 1, size=8),ncol=1,title.position="top")))
p0=grid.arrange(p,p_legend,ncol=2,widths=c(5,1),newpage = F)

#ggsave("~/Documents/DESC_paper_prepare/DESC_paper_v3/figures/figure_main/memory_time.tiff",p0,width=12,height = 7,dpi=120)
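
The chunk above turns sample sizes into axis labels by formatting them as strings and wrapping the result in `factor()`. Factor levels built this way are sorted as text, which may not match numeric order in every locale. A hedged sketch of a helper that sets the level order explicitly is given below; `samplesize_factor` and its `all_value` argument are hypothetical names, not part of the original script.

```r
# Hypothetical helper: convert sample sizes (possibly containing "all")
# into a factor whose levels follow numeric order.
samplesize_factor <- function(samplesize, all_value = 1.3e6) {
  x <- as.character(samplesize)
  is_all <- x == "all"
  n <- numeric(length(x))
  n[is_all]  <- all_value              # "all" stands for the full dataset
  n[!is_all] <- as.numeric(x[!is_all])
  lev <- sort(unique(n))
  factor(n, levels = lev,
         labels = format(lev, big.mark = ",", scientific = FALSE, trim = TRUE))
}

f <- samplesize_factor(c("10000", "all", "2000", "10000"))
levels(f)
```

Here `scientific = FALSE` keeps large values out of exponent notation and `trim = TRUE` drops the padding that `format()` would otherwise add.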

Supplementary Figure 17

##Supplementary Figure 17
ggsave(file.path(fig_path,"Figure.S17.tiff"),plot_grid(p0),width=12,height = 7,dpi=300,compression="lzw")
ggsave(file.path(fig_path,"Figure.S17.pdf"),plot_grid(p0),width=12,height = 7,dpi=300)
plot_grid(p0)

Supplementary Fig. 17. Comparison of running time (a) and memory usage (b) of three clustering methods on datasets with various numbers of cells, randomly sampled from the 1.3 million mouse brain cell dataset. DESC: used a single CPU; DESC_GPU: used a single GPU; DESC_multicpu: used 10 CPUs; scVI: used a single CPU; scVI_GPU: used a single GPU; scVI_multicpu: used 10 CPUs. DESC used the desc.train function in the Python module desc, version 1.0.0.5. Seurat used the FindClusters function in the R package Seurat, version 2.3.4. Seurat3.0 used the FindClusters function in the R package Seurat, version 3.0.0. scVI used the UnsupervisedTrainer function in the Python module scvi, version 0.3.0. Remark: we analyzed this dataset on Ubuntu 16.04.4 LTS with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 128GB of memory.

All data analyses reported in this paper were conducted on Ubuntu 18.04.1 LTS with an Intel® Core(TM) i7-8700K CPU @ 3.70GHz and 64GB of memory, except for the 1.3 million cell mouse brain dataset, which was analyzed on Ubuntu 16.04.4 LTS with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz and 128GB of memory in total.

Running time and memory usage for different numbers of batches

The number of batches is 2

env=new.env()
load("../memory_compare_DESC/time_memory_new.RData",envir = env)
mem_final=env$mem_final
time_final=env$time_final
colnames(time_final)[1]="time_consume"

df_time=time_final
df0=aggregate(mem_final,by = list(paste0(mem_final$methods,mem_final$samplesize)),FUN=max)
df0$Method=plyr::mapvalues(df0$methods,from=c("desc","cca3.0","cca","mnn","scVI","bermuda","scanorama"),
                           to=c("DESC","Seurat3.0","CCA","MNN","scVI","BERMUDA","scanorama"))
df0$Method=factor(df0$Method,levels=c("DESC","scVI","scanorama","MNN","BERMUDA","Seurat3.0","CCA"))
colors_use=color_celltype[1:length(levels(df0$Method))]
names(colors_use)=levels(df0$Method)
#df0$samplesize=factor(df0$samplesize,levels =sort(as.numeric(unique(df0$samplesize))))
df0$samplesize=factor(format(as.numeric(as.character(df0$samplesize)),big.mark=",", preserve.width="none"))
p=ggplot(data=df0,aes(x=factor(samplesize),y=memory_use,fill=Method,width=0.85))+
  geom_bar(position= position_dodge(width = 0.9,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Memory (GB)")+
  scale_y_continuous(expand = c(0,0,0.05,0),labels=scales::comma)+
  theme(axis.text.x = element_text(angle=60,hjust=1))+
  ggtitle("")+
  scale_fill_manual(values=colors_use)+
  theme(legend.key.size = unit(0.6,"cm"))
#p_mem=p+theme(legend.position = c(0.65,0.78))
p_mem_batch2=p+theme(legend.position="right")
p_mem_batch2
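
The memory panels plot, for each method and sample size, the maximum of the recorded memory values. A minimal sketch of that aggregation, using the formula interface of `aggregate`, is shown here on a made-up data frame. The column names follow the script, but the numbers are invented for illustration and the layout of the real `mem_final` is an assumption.

```r
# Toy stand-in for mem_final (values are invented, not measurements).
mem_demo <- data.frame(
  methods    = c("desc", "desc", "desc", "mnn", "mnn"),
  samplesize = c(1000, 1000, 2000, 1000, 1000),
  memory_use = c(0.8, 1.1, 1.4, 2.0, 3.5)
)

# Peak memory per (method, sample size) combination.
peak_demo <- aggregate(memory_use ~ methods + samplesize, data = mem_demo, FUN = max)
peak_demo
```

Grouping by the two columns directly keeps `methods` and `samplesize` as separate columns in the result, so no pasted key needs to be parsed afterwards.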

Figure 10 (right)

df_time=time_final
df0=aggregate(mem_final,by = list(paste0(mem_final$methods,mem_final$samplesize)),FUN=max)
df0$Method=plyr::mapvalues(df0$methods,from=c("desc","cca3.0","cca","mnn","scVI","bermuda","scanorama"),
                           to=c("DESC","Seurat3.0","CCA","MNN","scVI","BERMUDA","scanorama"))
df0$Method=factor(df0$Method,levels=c("DESC","scVI","scanorama","MNN","BERMUDA","Seurat3.0","CCA"))
colors_use=color_celltype[1:length(levels(df0$Method))]
names(colors_use)=levels(df0$Method)
#df0$samplesize=factor(df0$samplesize,levels =sort(as.numeric(unique(df0$samplesize))))
df0$samplesize=factor(format(as.numeric(as.character(df0$samplesize)),big.mark=",", preserve.width="none"))
p=ggplot()+
  geom_bar(data=df0,aes(x=Method,y=memory_use,group=factor(samplesize),fill=Method),
           width=0.7,
           position= position_dodge(width = 0.8,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Memory (GB)")+
  scale_y_continuous(expand = c(0,0,0.05,0),labels=scales::comma,position = "left",breaks = c(0,10,20,30,40),limits = c(0,40))+
  scale_x_discrete(expand=c(0,0,0,0))+
  theme(axis.text.x = element_text(angle=60,hjust=1))+
  theme(axis.title.y = element_text(angle=-90,hjust=0.5))+
  geom_text(aes(x=1,y=c(10,20,30,40)-1,label=c("10GB","20GB","30GB","40GB")))+
  ggtitle("")+
  scale_fill_manual(values=colors_use,guide="none")+
  theme(legend.key.size = unit(0.6,"cm"))
#p_mem=p+theme(legend.position = c(0.65,0.78))
pp=p+theme(legend.position="none")+xlab("")+theme_void()+
  theme( panel.grid.major.y = element_line(size=0.3,linetype = "dotted"))

# Figure 10 (right)
ggsave(file.path(fig_path,"Figure.10.right.pdf"),pp,width = 8,height = 3.5,dpi=300)
ggsave(file.path(fig_path,"Figure.10.right.tiff"),pp,width = 8,height = 3.5,dpi=300,compression="lzw")
pp

dfnew_time=subset(df_time,select = c("time_consume","methods","samplesize"))
dfnew_time$time_consume=as.numeric(dfnew_time$time_consume)
dfnew_time$samplesize.norm=dfnew_time$samplesize
dfnew_time$Method=plyr::mapvalues(dfnew_time$methods,from=c("desc","cca3.0","cca","mnn","scVI","bermuda","scanorama"),
                                  to=c("DESC","Seurat3.0","CCA","MNN","scVI","BERMUDA","scanorama"))
dfnew_time$Method=factor(dfnew_time$Method,levels=c("DESC","scVI","scanorama","MNN","BERMUDA","Seurat3.0","CCA"))
dfnew_time0=subset(dfnew_time,dfnew_time$Method%in%c("DESC","scVI","BERMUDA"))
#prettyNum(c(123,1234),big.mark=",", preserve.width="none")# preserve width
dfnew_time0$samplesize=factor(format(as.numeric(as.character(dfnew_time0$samplesize.norm)),big.mark=",", preserve.width="none"))
colors_use=color_celltype[c(1:length(levels(df0$Method)))]
names(colors_use)=levels(dfnew_time0$Method)

p_descscvi_2=ggplot(data=dfnew_time0,aes(x=factor(samplesize),y=time_consume,fill=Method,width=0.85))+
  geom_bar(position= position_dodge(width = 0.9,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Time (seconds)")+
  scale_y_continuous(expand = c(0,0,0.05,0.),labels=scales::comma)+
  theme(axis.text.x = element_text(hjust=1,angle=60))+
  ggtitle("")+
  scale_fill_manual(values=colors_use)+
  theme(legend.key.size = unit(0.6,"cm"))+
  theme(legend.position = "right")
p_descscvi_2

dfnew_time=subset(df_time,select = c("time_consume","methods","samplesize"))
dfnew_time$time_consume=as.numeric(dfnew_time$time_consume)
dfnew_time$samplesize.norm=dfnew_time$samplesize
dfnew_time$Method=plyr::mapvalues(dfnew_time$methods,from=c("desc","cca3.0","cca","mnn","scVI","bermuda","scanorama"),
                                  to=c("DESC","Seurat3.0","CCA","MNN","scVI","BERMUDA","scanorama"))
dfnew_time$Method=factor(dfnew_time$Method,levels=c("DESC","scVI","scanorama","MNN","BERMUDA","Seurat3.0","CCA"))
dfnew_time0=subset(dfnew_time,!dfnew_time$Method%in%c("DESC","scVI","BERMUDA"))


#prettyNum(c(123,1234),big.mark=",", preserve.width="none")# preserve width
dfnew_time0$samplesize=factor(format(as.numeric(as.character(dfnew_time0$samplesize.norm)),big.mark=",", preserve.width="none"))
colors_use=color_celltype[c(1:length(levels(df0$Method)))]
names(colors_use)=levels(dfnew_time0$Method)

p_other_2=ggplot(data=dfnew_time0,aes(x=factor(samplesize),y=time_consume,fill=Method,width=0.85))+
  geom_bar(position= position_dodge(width = 0.9,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Time (seconds)")+
  scale_y_continuous(expand = c(0,0,0.05,0),labels=scales::comma)+
  theme(axis.text.x = element_text(hjust=1,angle=60))+
  ggtitle("")+
  scale_fill_manual(values=colors_use)+
  theme(legend.key.size = unit(0.6,"cm"))+
  theme(legend.position = "right")
p_other_2
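
The recoding of raw method identifiers to display names is repeated in each chunk with `plyr::mapvalues`. An equivalent base-R sketch using a named lookup vector is given below; `method_labels` and `raw` are hypothetical names, and the input values are illustrative.

```r
# Lookup table: raw identifier -> display name (same pairs as in the script).
method_labels <- c(desc = "DESC", "cca3.0" = "Seurat3.0", cca = "CCA", mnn = "MNN",
                   scVI = "scVI", bermuda = "BERMUDA", scanorama = "scanorama")

raw <- c("mnn", "desc", "cca3.0", "bermuda")   # illustrative input

# Recode, then fix the display order used in the figures.
Method <- factor(unname(method_labels[raw]),
                 levels = c("DESC", "scVI", "scanorama", "MNN", "BERMUDA", "Seurat3.0", "CCA"))
Method
```

Unlike `plyr::mapvalues`, which leaves unmatched values unchanged, this lookup turns an identifier that is missing from the table into `NA`, which makes unexpected method names easier to spot.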

The number of batches is 4

env2=new.env()
load("../memory_compare_DESC2/time_memory_new.RData",envir = env2)
time_final=env2$time_final
mem_final=env2$mem_final
colnames(time_final)[1]="time_consume"

df_time=time_final
df0=aggregate(mem_final,by = list(paste0(mem_final$methods,mem_final$samplesize)),FUN=max)
df0$Method=plyr::mapvalues(df0$methods,from=c("desc","cca3.0","cca","mnn","scVI","bermuda","scanorama"),
                           to=c("DESC","Seurat3.0","CCA","MNN","scVI","BERMUDA","scanorama"))
df0$Method=factor(df0$Method,levels=c("DESC","scVI","scanorama","MNN","BERMUDA","Seurat3.0","CCA"))

colors_use=color_celltype[1:length(levels(df0$Method))]
names(colors_use)=levels(df0$Method)
#df0$samplesize=factor(df0$samplesize,levels =sort(as.numeric(unique(df0$samplesize))))
df0$samplesize=factor(format(as.numeric(as.character(df0$samplesize)),big.mark=",", preserve.width="none"))
p=ggplot(width=0.85)+
  geom_bar(data=df0,aes(x=factor(samplesize),y=memory_use,fill=Method),
                        position= position_dodge(width = 0.9,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Memory (GB)")+
  scale_y_continuous(expand = c(0,0),labels=scales::comma,limits = c(0,1.06*max(df0$memory_use)))+
  theme(axis.text.x = element_text(angle=60,hjust=1))+
  ggtitle("")+
  scale_fill_manual(values=colors_use)+
  theme(legend.key.size = unit(0.6,"cm"))+
  geom_point(aes(x=7.4,y=df0$memory_use[grep(x=df0$Group.1,pattern = "cca30000")]*1.02),shape=8,size=2)+
  geom_point(aes(x=7.25,y=df0$memory_use[grep(x=df0$Group.1,pattern = "cca3.030000")]*1.02),shape=8,size=2)
#p_mem=p+theme(legend.position = c(0.65,0.78))
p_mem_batch4=p+theme(legend.position="right")
p_mem_batch4

dfnew_time=subset(df_time,select = c("time_consume","methods","samplesize"))
dfnew_time$time_consume=as.numeric(dfnew_time$time_consume)
dfnew_time$samplesize.norm=dfnew_time$samplesize
dfnew_time$Method=plyr::mapvalues(dfnew_time$methods,from=c("desc","cca3.0","cca","mnn","scVI","bermuda","scanorama"),
                                  to=c("DESC","Seurat3.0","CCA","MNN","scVI","BERMUDA","scanorama"))
dfnew_time$Method=factor(dfnew_time$Method,levels=c("DESC","scVI","scanorama","MNN","BERMUDA","Seurat3.0","CCA"))
dfnew_time0=subset(dfnew_time,dfnew_time$Method%in%c("DESC","scVI","BERMUDA"))

#prettyNum(c(123,1234),big.mark=",", preserve.width="none")# preserve width
dfnew_time0$samplesize=factor(format(as.numeric(as.character(dfnew_time0$samplesize.norm)),big.mark=",", preserve.width="none"))
colors_use=color_celltype[1:length(levels(dfnew_time0$Method))]
names(colors_use)=levels(dfnew_time0$Method)

p_descscvi_4=ggplot(data=dfnew_time0,aes(x=factor(samplesize),y=time_consume,fill=Method,width=0.85))+
  geom_bar(position= position_dodge(width = 0.9,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Time (seconds)")+
  scale_y_continuous(expand = c(0,0,0.05,0),labels=scales::comma)+
  theme(axis.text.x = element_text(hjust=1,angle=60))+
  ggtitle("")+
  scale_fill_manual(values=colors_use)+
  theme(legend.key.size = unit(0.6,"cm"))+
  theme(legend.position = "right")
p_descscvi_4

dfnew_time=subset(df_time,select = c("time_consume","methods","samplesize"))
dfnew_time$time_consume=as.numeric(dfnew_time$time_consume)
dfnew_time$samplesize.norm=dfnew_time$samplesize
dfnew_time$Method=plyr::mapvalues(dfnew_time$methods,from=c("desc","cca3.0","cca","mnn","scVI","bermuda","scanorama"),
                                  to=c("DESC","Seurat3.0","CCA","MNN","scVI","BERMUDA","scanorama"))
dfnew_time$Method=factor(dfnew_time$Method,levels=c("DESC","scVI","scanorama","MNN","BERMUDA","Seurat3.0","CCA"))
dfnew_time0=subset(dfnew_time,!dfnew_time$Method%in%c("DESC","scVI","BERMUDA"))

dfnew_time0$samplesize=factor(format(as.numeric(as.character(dfnew_time0$samplesize.norm)),big.mark=",", preserve.width="none"))
colors_use=color_celltype[1:length(levels(dfnew_time0$Method))]
names(colors_use)=levels(dfnew_time0$Method)

p_other_4=ggplot(width=0.85)+
  geom_bar(data=dfnew_time0,aes(x=factor(samplesize),y=time_consume,fill=Method),
                                position= position_dodge(width = 0.9,preserve = "single"),stat="identity")+
  xlab("Number of cells")+
  ylab("Time (seconds)")+
  scale_y_continuous(expand = c(0,0),labels=scales::comma,limits=c(0,1.03*max(dfnew_time0$time_consume)))+
  theme(axis.text.x = element_text(hjust=1,angle=60))+
  ggtitle("")+
  scale_fill_manual(values=colors_use)+
  theme(legend.key.size = unit(0.6,"cm"))+
  theme(legend.position = "right")+
  geom_point(aes(x=7.34,y=dfnew_time0$time_consume[grep(x=rownames(dfnew_time0),pattern = "cca_30000")]*1.02),shape=8,size=2)+
  geom_point(aes(x=7.1,y=dfnew_time0$time_consume[grep(x=rownames(dfnew_time0),pattern = "cca3.0_30000")]*1.02),shape=8,size=2)
p_other_4=p_other_4+theme(legend.position = "right")

p_other_4

#tiff("~/Documents/DESC_paper_prepare/DESC_paper_v2/figures/figure_supp/memory_time_other_batch4.tiff",width=0.7*800,height = 0.67*400)
#p_other_4
#dev.off()
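
The star markers above are positioned by looking up rows with `grep`. Because `.` is a regular-expression wildcard, a pattern such as `cca3.0_30000` can match more than the literal string. A hedged sketch of a literal lookup with `fixed = TRUE`, on made-up row names, is:

```r
# Made-up row names in the "<method>_<samplesize>" style used above.
rn <- c("cca_30000", "cca3.0_30000", "cca3x0_30000", "mnn_30000")

grep("cca3.0_30000", rn)                # regex: "." also matches the "x"
grep("cca3.0_30000", rn, fixed = TRUE)  # literal substring match only
which(rn == "cca3.0_30000")             # exact equality is stricter still
```

With the row names actually present in the data the regex form may well return the intended row; the literal forms simply remove the dependence on that.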

The number of batches is 30

env2=new.env()
load("../memory_compare_DESC3/time_memory_new.RData",envir = env2)
time_final=env2$time_final
mem_final=env2$mem_final
colnames(time_final)[1]="time_consume"

p1=p_descscvi_30+coord_cartesian(ylim = c(0,49000))+theme(plot.margin = unit(c(0,0,0,0),"cm"))+
  theme(axis.title.y = element_text(hjust=1))
p2=p_descscvi_30+coord_cartesian(ylim = c(50000,150000))+
  theme(axis.text.x = element_blank(),
        plot.margin = unit(c(0,0,-0.6,0),"cm"),
        axis.ticks.x =element_blank(),
        axis.title.x = element_blank(),
        legend.position = "none",
        axis.line.x=element_blank())+ylab("")+xlab("")+scale_y_continuous(breaks = c(50000,100000,150000),labels =scales::comma)

## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.

p=egg::ggarrange(p2,p1,ncol=1,heights=c(1,3),draw = F)

#ggsave("~/Documents/DESC_paper_prepare/DESC_paper_v3/figures/figure_main/memory_time_add_2.tiff",p,width=5,height=3,dpi=300)
p

p_final=egg::ggarrange(p_mem_batch2+theme(axis.text = element_text(size=10)),
                       p_descscvi_2+theme(axis.text = element_text(size=10)),
                       p_other_2+theme(axis.text = element_text(size=10)),
                       p_mem_batch4+theme(axis.text = element_text(size=10)),
                       p_descscvi_4+theme(axis.text = element_text(size=10)),
                       p_other_4+theme(axis.text = element_text(size=10)),
                       p_mem_batch30+theme(axis.text = element_text(size=10))+theme(plot.margin = unit(c(2,0,0,0),"cm")),
                       #p_descscvi_30+theme(axis.text = element_text(size=10)),
                       ggplot()+theme_void(),
                       p_other_30+theme(axis.text = element_text(size=10)),
                       ncol=3,draw=F)
                       #labels = c("a","","","b","","","c","",""),
                       #label.args = list(gp = grid::gpar(font = 4, cex =2.5,color="darkblue",face="bold")))

#ggsave("~/Documents/DESC_paper_prepare/DESC_paper_v3/figures/figure_main/memory_time_add.tiff",p_final,width=18,height=11,dpi=300)
#ggsave("~/Documents/DESC_paper_prepare/DESC_paper_v3/figures/figure_main/memory_time_add.pdf",p_final,width=18,height=11)

p_final=egg::ggarrange(p_mem_batch2+theme(axis.text = element_text(size=10)),
                       p_descscvi_2+theme(axis.text = element_text(size=10)),
                       p_other_2+theme(axis.text = element_text(size=10)),
                       p_mem_batch4+theme(axis.text = element_text(size=10)),
                       p_descscvi_4+theme(axis.text = element_text(size=10)),
                       p_other_4+theme(axis.text = element_text(size=10)),
                       p_mem_batch30+theme(axis.text = element_text(size=10))+theme(plot.margin = unit(c(2,0,0,0),"cm")),
                       #p_descscvi_30+theme(axis.text = element_text(size=10)),
                       ggplot()+theme_void(),
                       p_other_30+theme(axis.text = element_text(size=10)),
                       ncol=3,draw=F,
                       labels = c("a","b","c","d","e","f","g","h","i"),
                       label.args = list(gp = grid::gpar( cex =3.5,col="black",fontface="bold")))

p_final

options(device = cairo_pdf)
pp=ggdraw() +
  draw_plot(p_final, x=0, y=0, width=1, height = 1)+
  draw_plot(p, x=0.33, y=0, width=0.33, height = 0.31)

Figure 9

## Figure 9
ggsave(file.path(fig_path,"Figure.9.tiff"),pp,width=18,height = 11,dpi=300,compression="lzw")
ggsave(file.path(fig_path,"Figure.9.pdf"),pp,width=18,height = 11,dpi=300)
pp

Figure 9. Comparison of memory usage (first column) and running time (second and third columns). (a-c) The number of batches for analyzed samples is 2. Analyzed data were from Kang et al. (d-f) The number of batches for analyzed samples is 4. (g-i) The number of batches for analyzed samples is 30. For (d-i), the analyzed data were from Peng et al., in which there are 4 batches when taking macaque id as the batch definition and 30 batches when taking sample id as the batch definition. Because DESC, scVI, and BERMUDA are deep learning based methods, we put them together for ease of comparison. Remark: The running time of DESC is smaller for batch=30 than for batch=4 because, when the data were standardized by sample id (i.e., when batch=30), the algorithm converged quickly, before reaching the maximum number of epochs (300). The ‘Error’ in the bar plot in (g,i) indicates that there was an error when using Seurat 3.0. This is because the numbers of cells in some batches are very small. The ‘*’ above the bar plot in (d,f,g,i) indicates that the corresponding method failed due to a memory issue (i.e., cannot allocate memory). Therefore, the recorded time is the computing time until the method failed. When the number of batches is 30, BERMUDA always throws an error when the number of cells is less than 8,000, so we only report BERMUDA when the number of cells is >=10,000. Additionally, the reported running time and memory usage include only the clustering procedure and not the computation of t-SNE or UMAP. All reported time and memory usage related to this figure were measured on our workstation running Ubuntu 18.04.1 LTS with an Intel® Core(TM) i7-8700K CPU @ 3.70GHz and 64GB of memory.

1. Kang, H.M., Subramaniam, M., Targ, S., Nguyen, M., Maliskova, L., McCarthy, E., Wan, E., Wong, S., Byrnes, L., Lanata, C.M., et al. (2018). Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat Biotechnol 36, 89-94.
2. Peng, Y.R., Shekhar, K., Yan, W., Herrmann, D., Sappington, A., Bryman, G.S., van Zyl, T., Do, M.T.H., Regev, A., and Sanes, J.R. (2019). Molecular Classification and Comparative Taxonomics of Foveal and Peripheral Cells in Primate Retina. Cell 176, 1222-1237 e1222.

Deep learning enables accurate clustering with batch effect removal in single-cell RNA-seq analysis

The figures related to running time and memory usage

Xiangjie Li, Kui Wang, Yafei Lyu, Huize Pan, Jingxiao Zhang, Dwight Stambolian, Katalin Susztak, Muredach P. Reilly, Gang Hu, Mingyao Li

03/10/2020