Statistical work cannot be performed in a vacuum. Before all else, such work requires the acquisition of a crucial type of raw material: information relevant to the subject matter under study. Before proceeding with data collection, you must become familiar with several key concepts that you will employ.
Internal vs. External Data Sources
Sometimes relevant information already exists somewhere; in that case, an investigator need only find it. A business administrator, for example, might simply search the firm’s internal records for material that already resides in filing cabinets or computer memories. Thus, customer records would provide names, addresses, telephone numbers, data on amounts purchased, credit limits, and more. Employee records would provide names, addresses, job titles, years of service, salaries, social security numbers, and even numbers of sick days used. Production records would contain lists of products, part numbers, and quantities produced, along with associated labor costs, raw material consumption, and equipment usage. A government economist would, similarly, have access to a vast database held by the Bureau of the Census, the Department of Labor, the Federal Reserve Board, and the Office of Management and Budget, to name but a few. All the sources just mentioned are internal sources.
In addition to scouring internal sources of information, our business administrator or government economist could also look for external depositories of already existing data and persuade their owners to share information. Indeed, all kinds of organizations routinely gather data and sell them to would-be users in the private sector and in government agencies alike.
From the point of view of the professional statistician, the matter of collecting data is much more complicated than checking out sources of already existing data. It concerns the question of how valid data can be generated in the first place.
Elementary Units and Variables
A statistical investigation invariably focuses on people or things with characteristics in which someone is interested. The persons or objects possessing the characteristics that interest the statistician are called elementary units. A complete listing of all elementary units relevant to a statistical investigation is called a frame. Any single observation about a specified characteristic of interest is called a datum; it is the basic unit of the statistician's raw material. Any collection of observations about one or more characteristics of interest, for one or more elementary units, is called a data set. A data set is univariate, bivariate, or multivariate depending on whether it contains information on one variable only, on two variables, or on more than two. The table shown below contains a multivariate data set.
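The vocabulary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the employee identifiers and every value are invented, and the helper name `describe_dimensionality` is hypothetical. It is not a reproduction of the table in the source.

```python
# Hypothetical employee records; all identifiers and values are invented.

# The frame: a complete listing of the elementary units under study.
frame = ["E001", "E002", "E003"]

# The data set: observations on several variables for each elementary unit.
data_set = {
    "E001": {"job_title": "Clerk", "years_of_service": 4, "annual_salary": 31000.0},
    "E002": {"job_title": "Analyst", "years_of_service": 9, "annual_salary": 52000.0},
    "E003": {"job_title": "Manager", "years_of_service": 15, "annual_salary": 78000.0},
}

# A datum: one observation about one characteristic of one elementary unit.
datum = data_set["E002"]["annual_salary"]


def describe_dimensionality(data_set):
    """Label a data set by the number of distinct variables it records."""
    variables = set()
    for observations in data_set.values():
        variables.update(observations)  # collects the variable names (dict keys)
    if not variables:
        raise ValueError("the data set records no variables")
    if len(variables) == 1:
        return "univariate"
    if len(variables) == 2:
        return "bivariate"
    return "multivariate"


print(describe_dimensionality(data_set))
```

Because three variables are recorded for each unit, the helper labels this data set multivariate; dropping all but one variable would make it univariate.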
Population vs. Sample
There are two important concepts we must consider: (1) The set of all possible observations about a specified characteristic of interest is called a statistical population. (2) A subset of a statistical population, or of the frame from which it is derived, is called a sample.
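The distinction can be made concrete with a short sketch. It assumes a hypothetical frame of ten employees with invented years-of-service values, and it uses simple random sampling from Python's standard `random` module as one possible way of selecting a subset; the source does not prescribe a selection method.

```python
import random

# Hypothetical frame of ten elementary units (employee identifiers are invented).
frame = [f"E{n:03d}" for n in range(1, 11)]

# Invented observations of one characteristic: years of service per employee.
years_of_service = {unit: (i * 3) % 11 for i, unit in enumerate(frame)}

# Statistical population: ALL possible observations of the characteristic.
population = [years_of_service[unit] for unit in frame]

# Sample: observations for a subset of the frame. Simple random sampling
# without replacement is an assumption made for this sketch.
rng = random.Random(42)  # fixed seed so the draw is repeatable
sampled_units = rng.sample(frame, k=4)
sample = [years_of_service[unit] for unit in sampled_units]

print(len(population), len(sample))
```

Note that the population here is the set of observations (the years-of-service values), not the employees themselves; the employees are the elementary units listed in the frame.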
Qualitative vs. Quantitative Variables
Any given characteristic of interest to the statistician can differ in kind or in degree among various elementary units. A variable that is normally described in words rather than numerically is called a qualitative variable. As shown in the table above, examples of qualitative variables are: race, sex, and job title. Qualitative variables can, in turn, be binomial or multinomial. Observations about a binomial qualitative variable can be made in only two categories: for example, male or female, employed or unemployed, correct or incorrect, defective or satisfactory, elected or defeated, absent or present. Observations about a multinomial qualitative variable can be made in more than two categories; consider job titles, colors, languages, religions, or types of businesses.
On the other hand, a variable that is normally expressed numerically (because it differs in degree rather than kind among the elementary units under study) is called a quantitative variable. Examples of quantitative variables, as shown in the table above, include years of service and annual salary. Quantitative variables can, in turn, be discrete or continuous. Observations about a discrete quantitative variable can assume values only at specific points on a scale of values, with gaps between them. Examples include the number of children in families, of employees in firms, of students in classes, of rooms in houses, of cars in stock, and of cows in pastures. Observations about a continuous quantitative variable can, in contrast, assume values at all points on a scale of values, with no breaks between possible values. Consider height, temperature, time, volume, or weight.
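The four-way classification described above can be sketched as a small function. The function name `classify_variable` and its rules are assumptions made for illustration: it guesses the type from how the observations happen to be recorded (text, whole numbers, or decimals) and from the categories actually observed. A continuous variable recorded in rounded whole numbers, or a multinomial variable for which only two categories happen to appear, would therefore be mislabeled.

```python
def classify_variable(values):
    """Heuristically label a variable from its recorded observations."""
    distinct = set(values)
    if not distinct:
        raise ValueError("need at least one observation")

    # Described in words: qualitative, split by number of observed categories.
    if all(isinstance(v, str) for v in distinct):
        kind = "binomial" if len(distinct) <= 2 else "multinomial"
        return f"qualitative ({kind})"

    # Anything that is not a plain number is outside this sketch.
    if any(isinstance(v, bool) or not isinstance(v, (int, float)) for v in distinct):
        raise ValueError("mixed or unsupported observation types")

    # Whole-number counts: values at specific points with gaps between them.
    if all(isinstance(v, int) for v in distinct):
        return "quantitative (discrete)"

    # Otherwise treat as measurable at any point on the scale.
    return "quantitative (continuous)"


print(classify_variable(["employed", "unemployed", "employed"]))
print(classify_variable(["Clerk", "Analyst", "Manager"]))
print(classify_variable([0, 2, 3, 1]))       # e.g. children per family
print(classify_variable([1.62, 1.75, 1.8]))  # e.g. height in metres
```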
Surveys vs. Experiments
The collection of data from elementary units without exercising any particular control over factors that make these units different from one another and that may, therefore, affect the characteristic of interest being observed is called an observational study or survey.
On the other hand, the collection of data from elementary units while exercising control over some or all factors that may make these units different from one another and that may, therefore, affect the characteristic of interest being observed is called an experiment.
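The contrast between the two modes of collection can be sketched as follows. The unit labels, the factor levels, and both function names are hypothetical. Balanced random assignment is used here as one common way for an investigator to exercise control over a factor; the text above does not name a specific method.

```python
import random

# Hypothetical elementary units and the factor level each one already has.
units = [f"U{n}" for n in range(1, 9)]
existing_training = {unit: ("trained" if n % 3 == 0 else "untrained")
                     for n, unit in enumerate(units, start=1)}


def observe(units, existing_factor):
    """Survey: record each unit's factor level as found, with no control."""
    return {unit: existing_factor[unit] for unit in units}


def assign(units, levels, rng):
    """Experiment: the investigator decides each unit's factor level.

    Units are shuffled and dealt out to the levels in turn, which gives
    groups of (nearly) equal size. This is an assumed design for the sketch.
    """
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: levels[i % len(levels)] for i, unit in enumerate(shuffled)}


survey_factor = observe(units, existing_training)
experiment_factor = assign(units, ["trained", "untrained"], random.Random(0))

print(survey_factor)
print(experiment_factor)
```

In the survey the group sizes are whatever the units happen to present; in the experiment the investigator fixes them in advance, here at four units per level.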
Census Taking vs. Sampling
A census is a complete survey in which observations about one or more characteristics of interest are made for every elementary unit that exists.
A sample survey is a partial survey in which observations about one or more characteristics are made for only a subset of all existing elementary units.
Kohler, H., 1994. Statistics for Business and Economics. 3rd ed. New York: HarperCollins College Publishers, pp. 5-10.