Journal of Machine and Computing


An Evaluation of Supervised Dimensionality Reduction For Large Scale Data



Received On : 30 April 2021

Revised On : 25 July 2021

Accepted On : 18 October 2021

Published On : 05 January 2022

Volume 02, Issue 01

Pages : 017-025


Abstract


Experimenters today routinely quantify millions or even billions of features (measurements) per sample to address critical biological questions, in the hope that machine learning tools will be able to make accurate data-driven judgments. Efficient analysis requires a low-dimensional representation that preserves the discriminating features of the data (e.g., whether a certain ailment is present in a person's body), even when the sample size and dimensionality differ by orders of magnitude. While several methods scale to millions of variables and offer strong empirical and theoretical guarantees, few are easily interpretable. This research presents an evaluation of supervised dimensionality reduction for large-scale data. We provide a methodology for extending Principal Component Analysis (PCA) by incorporating class-conditional moment estimates into low-dimensional projections. Linear Optimum Low-Rank (LOLR) projection, the cheapest variant, incorporates the class-conditional means. Using both real and simulated benchmark data, we show that LOLR projections and their extensions improve representations of data for subsequent classification while retaining computational flexibility and reliability. In terms of accuracy, LOLR outperforms other modular linear dimensionality-reduction methods that require much longer computation times on conventional computers. We apply LOLR to brain-imaging datasets with more than 150 million attributes and to genome-sequencing datasets with more than half a million attributes.
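The projection described in the abstract can be illustrated with a short sketch. This is not the authors' implementation; it is a minimal NumPy illustration of the general idea, assuming the LOLR basis is formed by stacking class-conditional mean differences with the top principal directions of the class-centered data and orthonormalizing the result. The function name `lolr_project` and the target dimension `d` are illustrative choices, not names from the paper.

```python
import numpy as np

def lolr_project(X, y, d):
    """Illustrative LOLR-style supervised projection: combine the
    direction(s) separating the class-conditional means with the top
    principal directions of the class-centered data."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Directions separating the class means (one per class beyond the first)
    mean_dirs = means[1:] - means[0]
    # Center each sample by its class mean, then take top PCA directions
    Xc = X - means[np.searchsorted(classes, y)]
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = d - mean_dirs.shape[0]
    A = np.vstack([mean_dirs, Vt[:k]])  # stacked basis vectors (d x p)
    Q, _ = np.linalg.qr(A.T)            # orthonormalize the columns
    return X @ Q                        # n x d embedding

# Usage: two Gaussian classes in 50 dimensions, projected to d = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 50)),
               rng.normal(0.5, 1.0, (100, 50))])
y = np.array([0] * 100 + [1] * 100)
Z = lolr_project(X, y, 3)
print(Z.shape)  # (200, 3)
```

Because the class-mean direction is included explicitly, the embedding retains class-discriminating structure that unsupervised PCA can discard when the mean difference lies in a low-variance direction.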


Keywords


Linear Optimum Low-Rank (LOLR), Linear Discriminant Analysis (LDA), Canonical Correlations Analyses (CCA), Principal Component Analysis (PCA), Partial Least Squares (PLS).



Acknowledgements


The author thanks the Norwegian University of Science and Technology for research lab and equipment support.


Funding


No funding was received to assist with the preparation of this manuscript.


Ethics declarations


Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.


Availability of data and materials


No data are available for the above study.


Author information


Contributions

All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.


Corresponding author


Rights and permissions


Open Access. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 license (CC BY-NC-ND 4.0), which permits copying and redistribution of the material in any medium or format for non-commercial purposes, provided the original work is credited and no modifications or derivatives are made. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/


Cite this article


Nancy Jan Sliper, “An Evaluation of Supervised Dimensionality Reduction For Large Scale Data”, Journal of Machine and Computing, vol. 2, no. 1, pp. 017-025, January 2022. doi: 10.53759/7669/jmc202202003.


Copyright


© 2022 Nancy Jan Sliper. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.