Chapter 1

Large, opportunistic datasets and “big data”

The amount of data worldwide is increasing exponentially1. Advancements in technology have made collecting and storing enormous amounts of data easier than ever before, and the volume of data collected and stored is expected to grow even further in the coming years2. Data is collected through every aspect of our daily activities: mobile (wearable) devices, such as smartphones or smartwatches, provide a constant data stream, monitoring our geographical location or measuring vital signs like heart rate3–5. Whenever we use Google to look up information, the query is stored and can be analysed using Google search trends6. It is not just tech companies that have expanded their data collection; the way data is collected in health care has also changed drastically in the past decades4,7. Paper health records have largely been replaced with electronic health records (EHR)4,8, and tremendous amounts of data are generated and stored daily in hospitals across the globe, although organising and utilising these data is not easy3. This routinely collected or generated data is sometimes used for clinical research, but manually extracting data from EHR is labour intensive. Automated extraction of structured data from EHR creates opportunities to rapidly build large datasets with relatively minimal effort.

Large, observational datasets containing data on many variables for many patients are often referred to as “big data”9. An important aspect of big data is that its size is large relative to what is considered a typical dataset in the field7. Other defining features of datasets considered big data are volume (the amount of data), velocity (the speed of collection) and variety (the types of data, such as structured and unstructured data), together known as the ‘three Vs’. In these large datasets, data collection is not always done with a specific research question in mind.
This leads to an obvious disadvantage: when a dataset is not designed for a specific research question, it will often be suboptimal for answering that question. Epidemiological knowledge is frequently needed to avoid drawing invalid conclusions, for example due to differences between groups leading to confounding. Specific care is needed to ensure that the study population of interest is represented and included in the database. Patients with mild disease, for example, whose disease is managed by their general practitioner, may not have an EHR in the hospital system. Moreover, data should be of sufficient quality. For example, there are several diagnostic approaches for diseases like heart failure, and for most research questions it will be important to ensure that the same diagnostic criteria are used in the entire study population10. In addition, routinely collected datasets often lack quality control: large datasets can contain incomplete or inaccurate information that is difficult to identify, and there are not always standard procedures to check the quality of the data. However, there are also many opportunities and advantages to using these datasets for health care research, if the aforementioned obstacles are overcome. First of all, real world data reflect the heterogeneity of the population and clinical practice,