This study explores the importance of data standardization from the viewpoint of various enterprises. It traces how data standards have evolved over time and how application programming interfaces (APIs) have become the de facto norm. The benefits of data standardization highlighted in the paper include improved system-to-system interoperability, fewer translation hurdles, and the elimination of missing-data issues. The research also investigates multiple approaches to data normalization, including maximum scores, logarithms, and z-scores. It compares three standardization approaches for cluster analysis, examining their effects and weighing the advantages and disadvantages of each. Also included are the results of tests conducted with two databases and a study of data standardization as it pertains to Marajoara ceramics. The findings show that certain standardization methods work well when looking for correlations between different variables. Overall, the work examines data standardization and its implications in many academic and corporate contexts.
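For orientation, the three normalization approaches named above (maximum scores, logarithms, and z-scores) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: "maximum score" is taken here to mean division by the largest observed value, the logarithm is the natural log of strictly positive data, and the z-score uses the population standard deviation. The function names and the sample measurements are hypothetical.

```python
import math
from statistics import mean, pstdev


def max_score(values):
    """Scale each value by the largest value (assumes a positive maximum)."""
    peak = max(values)
    if peak <= 0:
        raise ValueError("maximum-score scaling assumes a positive maximum")
    return [v / peak for v in values]


def log_transform(values):
    """Natural-log transform (assumes every value is strictly positive)."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population, not sample, deviation
    if sigma == 0:
        raise ValueError("z-score is undefined for constant data")
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Hypothetical measurements (e.g. vessel heights in cm), for illustration only.
    heights = [10.0, 20.0, 40.0]
    print(max_score(heights))
    print(log_transform(heights))
    print(z_score(heights))
```

Maximum-score scaling preserves ratios between observations, the log transform compresses large values, and the z-score puts variables with different units on a common scale, which is why the choice among them can change the distances that a clustering algorithm sees.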
Acknowledgements
We would like to thank the reviewers for taking the time and effort necessary to review the manuscript. We sincerely appreciate all of their valuable comments and suggestions, which helped us improve the quality of the manuscript.
Funding
No funding was received to assist with the preparation of this manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
Availability of data and materials
No data are available for this study.
Author information
Contributions
All authors contributed equally to the paper, and all authors have read and agreed to the published version of the manuscript.
Corresponding author
Amirah Muhammad
Universiti Putra Malaysia, 43400 Serdang, Selangor, Malaysia.
Open Access: This article is licensed under a Creative Commons Attribution NoDerivs license, a more restrictive license that allows the material to be redistributed commercially or non-commercially, provided that no changes whatsoever are made to the original, i.e., no derivatives of the original work. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/
Cite this article
Amirah Muhammad, “The Influence of Data Standardization on Cluster Analysis: A Case Study of Marajoara Ceramics,” Journal of Computational Intelligence in Materials Science, vol. 2, pp. 059-067, 2024. doi: 10.53759/832X/JCIMS202402006.