PhytoCell: An ensemble learning framework for identifying cell states in plant scRNA-seq data
Published 14 April, 2026
With the rapid development of single-cell RNA sequencing (scRNA-seq), researchers can now examine gene activity in individual plant cells at unprecedented resolution, opening new opportunities to study cell differentiation, tissue development, and stress responses. However, scRNA-seq datasets span thousands of cells and are characterized by high dimensionality, extreme sparsity, and substantial technical noise. Notably, most of the genes expressed in a given cell are expressed in every type of cell; only a relatively small number of genes, so-called marker genes, are specific to each cell type. Consequently, assigning roles to individual cells relies heavily on prior knowledge of the biological context and of which genes are highly expressed in each cell type, which makes it difficult to identify marker genes and assign cell types accurately.
Against this backdrop, a research team led by Dr. Qiang He and Dr. Aiguo Yang from the Chinese Academy of Agricultural Sciences developed PhytoCell, an ensemble learning framework for plant single-cell RNA sequencing data analysis (Fig. 1). Their study, made available online on March 27, 2026, in The Crop Journal, aims to enable robust cell biomarker identification and cell subpopulation classification.
The research team's PhytoCell framework integrates four machine learning models grouped together as a computational stack; this "ensemble" of models harnesses a powerful learning strategy that uses some of the data to identify biomarkers, which are then used to analyze the remaining data. Using four models instead of one also improves the predictive stability and generalization of the framework. The framework ranks genes based on their importance to the overall data structure by calculating their maximal information coefficient, and then iteratively selects marker genes for model training.
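To make the rank-then-select idea concrete, here is a minimal Python sketch. It is not the PhytoCell implementation. It assumes a simple binned mutual information score as a stand-in for the maximal information coefficient, a nearest-centroid classifier as a stand-in for the four-model stack, and synthetic Gaussian "expression" values in place of real scRNA-seq counts. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def binned_mutual_information(values, labels, n_bins=4):
    """Mutual information (nats) between equal-width-binned values and labels.

    Assumption: a simple stand-in for the maximal information coefficient.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant genes
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    n = len(values)
    joint, bin_counts, label_counts = Counter(zip(bins, labels)), Counter(bins), Counter(labels)
    return sum(
        (c / n) * math.log(c * n / (bin_counts[b] * label_counts[l]))
        for (b, l), c in joint.items()
    )


def centroid_accuracy(fit_X, fit_y, val_X, val_y, genes):
    """Accuracy of a nearest-centroid classifier restricted to the given genes."""
    centroids = {}
    for label in set(fit_y):
        rows = [x for x, y in zip(fit_X, fit_y) if y == label]
        centroids[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    correct = 0
    for x, y in zip(val_X, val_y):
        point = [x[g] for g in genes]
        pred = min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(point, centroids[lab])),
        )
        correct += pred == y
    return correct / len(val_y)


def select_markers(fit_X, fit_y, val_X, val_y, max_genes=10):
    """Rank genes on the fitting cells, then keep a gene only if it improves
    accuracy on the held-out cells."""
    n_genes = len(fit_X[0])
    scores = [
        binned_mutual_information([row[g] for row in fit_X], fit_y)
        for g in range(n_genes)
    ]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    selected, best = [], 0.0
    for g in ranked[:max_genes]:
        acc = centroid_accuracy(fit_X, fit_y, val_X, val_y, selected + [g])
        if acc > best:
            selected, best = selected + [g], acc
    return selected, best


def make_toy_cells(n_cells=200, n_genes=20, seed=0):
    """Synthetic data: genes 0 and 1 separate two cell states, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_cells):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        row[0] += 6.0 * label        # marker that is high in state 1
        row[1] += 6.0 * (1 - label)  # marker that is high in state 0
        X.append(row)
        y.append(label)
    return X, y


X, y = make_toy_cells()
# Part of the cells is used to rank genes; the remainder checks each candidate.
selected, accuracy = select_markers(X[:140], y[:140], X[140:], y[140:])
print("selected genes:", selected, "held-out accuracy:", accuracy)
```

In this toy setting the loop stops adding genes as soon as held-out accuracy no longer improves, which mirrors the goal of keeping only a small core set of marker genes.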
To evaluate the performance of PhytoCell, the researchers used scRNA-seq datasets from the corolla tissues of Nicotiana attenuata (coyote tobacco) collected at three time points. They found that the framework successfully identified key marker genes and accurately sorted the cells into different cell states and subpopulations.
As an independent validation of PhytoCell performance, the researchers then turned to a large-scale scRNA-seq atlas generated in rice spanning multiple tissues, comprising ~120,000 individual cells. Again, they found that the framework identified sets of biomarkers that unambiguously assigned cell states and grouped similar cells together. The results showed that PhytoCell is robust and has broad applicability across plant species, effectively removing redundant noise, selecting core marker genes, and achieving precise annotation of cell subpopulations.
Unlike conventional methods that rely on prior biological expertise, the team's PhytoCell framework adopts a purely data-driven strategy to identify marker genes, preserving the original structure of the biological data even when using a minimal set of biomarker genes. The team found that, in addition to some marker genes identified by conventional methods, the framework also identified additional genes that had clearly been overlooked by other methods.
"These new biomarkers are valuable candidate gene resources for crop improvement and studies of cellular mechanisms," says senior author Qiang He. "We've made PhytoCell available on a user-friendly web server for marker gene exploration and cell-type annotation."
The platform is freely available at https://cgris.net/phyto.
"Overall, PhytoCell offers a robust and scalable machine learning–powered solution for annotating cell subpopulations and identifying marker genes in plant scRNA-seq datasets," says senior and co-corresponding author Aiguo Yang. "It further advances the integration of machine learning into plant genomics."
Contact Author:
Qiang He
Email address: heqiang@caas.cn
Conflicts of Interest Statement:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
See the Article:
PhytoCell: An ensemble learning framework for identifying cell states in plant scRNA-seq data, The Crop Journal, https://doi.org/10.1016/j.cj.2026.02.021