Publication details - Generating Contextual Variables From Web-Based Data for Health Research

Generating Contextual Variables From Web-Based Data for Health Research : Tutorial on Web Scraping, Text Mining, and Spatial Overlay Analysis

©Pablo Galvez-Hernandez, Angelina Gonzalez-Viana, Luis Gonzalez-de Paz, Ketan Shankardass, Carles Muntaner. Originally published in JMIR Public Health and Surveillance (https://publichealth.jmir.org), 08.01.2024.

BACKGROUND: Contextual variables that capture the characteristics of delimited geographic or jurisdictional areas are vital for health and social research. However, obtaining data sets with contextual-level data can be challenging in the absence of monitoring systems or public census data.

OBJECTIVE: We describe and implement an 8-step method that combines web scraping, text mining, and spatial overlay analysis (WeTMS) to transform extensive text data from government websites into analyzable data sets containing contextual data for jurisdictional areas.

METHODS: This tutorial describes the method and provides resources for its application by health and social researchers. We used this method to create data sets of health assets aimed at enhancing older adults' social connections (eg, activities and resources such as walking groups and senior clubs) across the 374 health jurisdictions in Catalonia from 2015 to 2022. These assets are registered on a web-based government platform by local stakeholders from various health and nonhealth organizations as part of a national public health program. Steps 1 to 3 involved defining the variables of interest, identifying data sources, and using Python to extract information from 50,000 websites linked to the platform. Steps 4 to 6 comprised preprocessing the scraped text, defining new variables to classify health assets based on social connection constructs, analyzing word frequencies in titles and descriptions of the assets, creating topic-specific dictionaries, implementing a rule-based classifier in R, and verifying the results. Steps 7 and 8 integrate the spatial overlay analysis to determine the geographic location of each asset. We conducted a descriptive analysis of the data sets to report the characteristics of the assets identified and the patterns of asset registrations across areas.
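Steps 4 to 8 can be illustrated with a minimal Python sketch, assuming a dictionary-based rule classifier followed by a point-in-polygon overlay. The authors implemented their classifier in R and the abstract does not publish their dictionaries or boundaries, so the topic names, word lists, area names, and coordinates below are hypothetical placeholders, and the ray-casting test stands in for whatever spatial tooling the tutorial actually uses.

```python
import re

# Hypothetical topic dictionaries (not the study's real ones).
DICTIONARIES = {
    "physical_activity": {"walking", "gymnastics", "hiking"},
    "leisure_culture": {"club", "choir", "workshop"},
}

# Hypothetical jurisdiction boundaries as lists of (x, y) vertices.
AREAS = {
    "area_A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "area_B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(title, description):
    """Flag a topic when any of its dictionary words appears in the text."""
    tokens = set(tokenize(title) + tokenize(description))
    return {topic: bool(tokens & words) for topic, words in DICTIONARIES.items()}


def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' each time a rightward ray crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate(x, y, areas):
    """Return the first area whose polygon contains the point, else None."""
    for name, polygon in areas.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None


# Hypothetical scraped asset with projected coordinates.
asset = {
    "title": "Senior walking group",
    "description": "Weekly walks for older adults",
    "x": 3.0,
    "y": 1.0,
}
labels = classify(asset["title"], asset["description"])
area = locate(asset["x"], asset["y"], AREAS)
print(labels, area)
```

Aggregating the classified assets by the returned area name would then yield one contextual variable per jurisdiction, for example a count of activities per area.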

RESULTS: We identified and extracted data from 17,305 websites describing health assets. The titles and descriptions of the activities and resources contained 12,560 and 7301 unique words, respectively. After applying our classifier and spatial analysis algorithm, we generated 2 data sets containing 9546 health assets (5022 activities and 4524 resources) with the potential to enhance social connections among older adults. Stakeholders from 318 health jurisdictions registered identified assets on the platform between July 2015 and December 2022. The agreement rate between the classification algorithm and verified data sets ranged from 62.02% to 99.47% across variables. Leisure and skill development activities were the most prevalent (1844/5022, 36.72%). Leisure and cultural associations, such as social clubs for older adults, were the most common resources (878/4524, 19.41%). Health asset registration varied across areas, ranging between 0 and 263 activities and 0 and 265 resources.
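The reported shares can be recomputed from the counts above, and an agreement rate can be sketched as simple percent agreement. The abstract does not state how agreement was calculated, so the `agreement_rate` function and its example labels are assumptions for illustration only.

```python
def percent(part, whole):
    """Share of `part` in `whole`, as a percentage rounded to 2 decimals."""
    return round(100 * part / whole, 2)


activities, resources = 5022, 4524
total_assets = activities + resources

leisure_share = percent(1844, activities)  # leisure and skill development activities
culture_share = percent(878, resources)    # leisure and cultural associations


def agreement_rate(predicted, verified):
    """Percent of items where the classifier label equals the verified label."""
    matches = sum(p == v for p, v in zip(predicted, verified))
    return round(100 * matches / len(verified), 2)


# Hypothetical labels for four assets on one binary variable.
example_rate = agreement_rate([1, 0, 1, 1], [1, 0, 0, 1])
print(total_assets, leisure_share, culture_share, example_rate)
```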

CONCLUSIONS: The sequential use of WeTMS offers a robust method for generating data sets containing contextual-level variables from internet text data. This study can guide health and social researchers in efficiently generating ready-to-analyze data sets containing contextual variables.

Media type:	E-article

Year of publication:	2024
Published:	2024

Contained in:	Parent record - volume:10
Contained in:	JMIR public health and surveillance - 10(2024), 08 Jan., page e50379

Language:	English

Contributors:	Galvez-Hernandez, Pablo [author]; Gonzalez-Viana, Angelina [author]; Gonzalez-de Paz, Luis [author]; Shankardass, Ketan [author]; Muntaner, Carles [author]

Links:	Full text

Subjects:	Contextual variables; Health assets; Health services research; Journal Article; Multilevel analysis; Program evaluation; Social connection; Social environment; Spatial overlay analysis; Text mining; Web scraping

Notes:	Date Completed 09.01.2024; Date Revised 25.01.2024; published: Electronic; Citation Status MEDLINE

doi:	10.2196/50379


PPN (catalog ID):	NLM366808680

Internal format


LEADER	01000caa a22002652 4500
001	NLM366808680
003	DE-627
005	20240125232040.0
007	cr uuu---uuuuu
008	240114s2024 xx \|\|\|\|\|o 00\| \|\|eng c
024	7		\|a 10.2196/50379 \|2 doi
028	5	2	\|a pubmed24n1270.xml
035			\|a (DE-627)NLM366808680
035			\|a (NLM)38190245
040			\|a DE-627 \|b ger \|c DE-627 \|e rakwb
041			\|a eng
100	1		\|a Galvez-Hernandez, Pablo \|e verfasserin \|4 aut
245	1	0	\|a Generating Contextual Variables From Web-Based Data for Health Research \|b Tutorial on Web Scraping, Text Mining, and Spatial Overlay Analysis
264		1	\|c 2024
336			\|a Text \|b txt \|2 rdacontent
337			\|a Computermedien \|b c \|2 rdamedia
338			\|a Online-Ressource \|b cr \|2 rdacarrier
500			\|a Date Completed 09.01.2024
500			\|a Date Revised 25.01.2024
500			\|a published: Electronic
500			\|a Citation Status MEDLINE
650		4	\|a Journal Article
650		4	\|a contextual variables
650		4	\|a health assets
650		4	\|a health services research
650		4	\|a multilevel analysis
650		4	\|a program evaluation
650		4	\|a social connection
650		4	\|a social environment
650		4	\|a spatial overlay analysis
650		4	\|a text mining
650		4	\|a web scraping
700	1		\|a Gonzalez-Viana, Angelina \|e verfasserin \|4 aut
700	1		\|a Gonzalez-de Paz, Luis \|e verfasserin \|4 aut
700	1		\|a Shankardass, Ketan \|e verfasserin \|4 aut
700	1		\|a Muntaner, Carles \|e verfasserin \|4 aut
773	0	8	\|i Enthalten in \|t JMIR public health and surveillance \|d 2015 \|g 10(2024) vom: 08. Jan., Seite e50379 \|w (DE-627)NLM257939679 \|x 2369-2960 \|7 nnns
773	1	8	\|g volume:10 \|g year:2024 \|g day:08 \|g month:01 \|g pages:e50379
856	4	0	\|u http://dx.doi.org/10.2196/50379 \|3 Volltext
912			\|a GBV_USEFLAG_A
912			\|a GBV_NLM
951			\|a AR
952			\|d 10 \|j 2024 \|b 08 \|c 01 \|h e50379
