Publication details - A survey on missing data in machine learning

A survey on missing data in machine learning

© The Author(s) 2021.

Machine learning has been the cornerstone of analysing and extracting information from data, and missing values are a problem it often encounters. Missing values arise under different mechanisms: missing completely at random, missing at random, or missing not at random. They may result from system malfunction during data collection or human error during data pre-processing. It is important to deal with missing values before analysing data, since ignoring or omitting them may result in biased or misinformed analysis. The literature contains several proposals for handling missing values. In this paper, we aggregate some of the literature on missing data, focusing particularly on machine learning techniques. We also give insight into how the machine learning approaches work by highlighting the key features of missing-value imputation techniques, how they perform, their limitations, and the kind of data they are most suitable for. We propose and evaluate two methods: k-nearest neighbor and an iterative imputation method (missForest) based on the random forest algorithm. Evaluation is performed on the Iris data and novel power plant fan data with induced missing values at missingness rates of 5% to 20%. We show that both missForest and k-nearest neighbor can successfully handle missing values, and we offer some possible future research directions.
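
The evaluation setup named in the abstract (inducing missing values at a fixed rate, then imputing with k-nearest neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `induce_mcar`, `masked_distance` and `knn_impute`, the use of `None` as the missing-value marker, the completely-at-random masking, and the rescaled Euclidean distance over jointly observed features are all assumptions made for illustration.

```python
import math
import random


def induce_mcar(rows, rate, seed=0):
    """Return a copy of `rows` in which each entry is independently
    replaced by None with probability `rate` (an MCAR assumption)."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def masked_distance(a, b):
    """Euclidean distance over features observed in both rows, rescaled
    by the share of usable features; infinite if no feature is shared."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=3):
    """Fill each missing entry with the mean of that feature over the k
    nearest rows that observe it. Distances use the original, unimputed
    rows. Falls back to the column mean when no neighbor is comparable."""
    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row that observes feature j.
            candidates = sorted(
                (masked_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates if d < math.inf][:k]
            if not nearest:
                # Fallback: mean of all observed values in this column.
                nearest = [r[j] for r in rows if r[j] is not None]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed
```

Under these assumptions, a call such as `knn_impute(induce_mcar(data, 0.10), k=3)` mirrors a 10% missingness run; comparing the imputed entries with the held-back originals (for example by root mean squared error) would give one possible accuracy measure.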

Media type:	E-article

Year of publication:	2021
Published:	2021

Contained in:	To the parent record - volume:8
Contained in:	Journal of big data - 8(2021), 1 of: 25., page 140

Language:	English

Contributors:	Emmanuel, Tlamelo [author]; Maupong, Thabiso [author]; Mpoeleng, Dimane [author]; Semong, Thabo [author]; Mphago, Banyatsang [author]; Tabona, Oteng [author]

Links:	Full text

Subjects:	Imputation; Journal Article; Machine learning; Missing data

Notes:	Date Revised 20.02.2023; published: Print-Electronic; Citation Status PubMed-not-MEDLINE

doi:	10.1186/s40537-021-00516-9


PPN (catalog ID):	NLM332603598

Internal format


LEADER	01000naa a22002652 4500
001	NLM332603598
003	DE-627
005	20231225215916.0
007	cr uuu---uuuuu
008	231225s2021 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.1186/s40537-021-00516-9 \|2 doi
028	5	2	\|a pubmed24n1108.xml
035			\|a (DE-627)NLM332603598
035			\|a (NLM)34722113
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Emmanuel, Tlamelo \|e verfasserin \|4 aut
245	1	2	\|a A survey on missing data in machine learning
264		1	\|c 2021
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Revised 20.02.2023
500			\|a published: Print-Electronic
500			\|a Citation Status PubMed-not-MEDLINE
520			\|a © The Author(s) 2021.
650		4	\|a Journal Article
650		4	\|a Imputation
650		4	\|a Machine learning
650		4	\|a Missing data
700	1		\|a Maupong, Thabiso \|e verfasserin \|4 aut
700	1		\|a Mpoeleng, Dimane \|e verfasserin \|4 aut
700	1		\|a Semong, Thabo \|e verfasserin \|4 aut
700	1		\|a Mphago, Banyatsang \|e verfasserin \|4 aut
700	1		\|a Tabona, Oteng \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t Journal of big data \|d 2015 \|g 8(2021), 1 vom: 25., Seite 140 \|w (DE-627)NLM249913828 \|x 2196-1115 \|7 nnns
773	1	8	\|g volume:8 \|g year:2021 \|g number:1 \|g day:25 \|g pages:140
856	4	0	\|u http://dx.doi.org/10.1186/s40537-021-00516-9 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a GBV_NLM
951			\|a AR
952			\|d 8 \|j 2021 \|e 1 \|b 25 \|h 140
