Statistical work cannot be performed in a vacuum. Before all else, such work requires the acquisition of a crucial type of raw material: information relevant to the subject matter under study. Before beginning data collection, you should become familiar with several key concepts that you will use throughout.

# Internal vs. External Data Sources

Sometimes relevant information already exists somewhere; in that case, an investigator need only find it. A business administrator, for example, might simply search the firm’s internal records for material that already resides in filing cabinets or computer memories. Thus, customer records would provide names, addresses, telephone numbers, data on amounts purchased, credit limits, and more. Employee records would provide names, addresses, job titles, years of service, salaries, social security numbers, and even numbers of sick days used. Production records would contain lists of products, part numbers, and quantities produced, along with associated labor costs, raw material consumption, and equipment usage. A government economist would, similarly, have access to a vast database held by the Bureau of the Census, the Department of Labor, the Federal Reserve Board, and the Office of Management and Budget, to name but a few. All the sources just mentioned are **internal sources**.

In addition to scouring **internal sources** of information, our business administrator or government economist could also look for **external** depositories of already existing data and persuade their owners to share information. Indeed, all kinds of organizations routinely gather **data** and sell them to would-be users in the private sector and in government agencies alike.

From the point of view of the professional **statistician**, the matter of **collecting data** is much more complicated than checking sources of already existing **data**. It concerns the question of how **valid data** can be generated in the first place.

# Elementary Units and Variables

A **statistical** investigation invariably focuses on people or things with characteristics in which someone is interested. The persons or objects possessing the characteristics that interest the **statistician** are called **elementary units**. A complete listing of all **elementary units** relevant to a **statistical** investigation is called a **frame**. Any characteristic of interest that can differ among **elementary units** is called a **variable**. Any single **observation** about a specified characteristic of interest is called a **datum**; it is the basic unit of the **statistician**’s raw material. Any collection of observations about one or more characteristics of interest, for one or more elementary units, is called a **data set**. A data set is univariate, bivariate, or multivariate depending on whether it contains information on one variable only, on two variables, or on more than two. The table shown below contains a multivariate data set.
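As a separate illustration of these terms (not the table referred to above), here is a minimal Python sketch. The unit IDs, job titles, and figures are invented, and the helper names `build_data_set` and `describe` are hypothetical.

```python
# Hypothetical employee records, keyed by an invented unit ID.
# Each key identifies one elementary unit (here, an employee).
records = {
    "E001": {"job_title": "Clerk", "years_of_service": 3, "annual_salary": 41000.0},
    "E002": {"job_title": "Engineer", "years_of_service": 12, "annual_salary": 78500.0},
    "E003": {"job_title": "Manager", "years_of_service": 7, "annual_salary": 91250.0},
}

# The frame: a complete listing of all elementary units.
frame = sorted(records)

# A datum: one observation about one characteristic of one unit.
datum = records["E001"]["annual_salary"]


def build_data_set(variables):
    """Collect observations on the chosen variables for every unit in the frame."""
    return {unit: {v: records[unit][v] for v in variables} for unit in frame}


def describe(data_set):
    """Label a data set by the number of variables it contains."""
    n_variables = len(next(iter(data_set.values())))
    if n_variables == 1:
        return "univariate"
    if n_variables == 2:
        return "bivariate"
    return "multivariate"


print(describe(build_data_set(["annual_salary"])))
print(describe(build_data_set(["years_of_service", "annual_salary"])))
print(describe(build_data_set(["job_title", "years_of_service", "annual_salary"])))
```

The three calls build a univariate, a bivariate, and a multivariate data set from the same frame, which shows that the label depends only on how many variables are recorded, not on how many units are observed.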

# Population vs. Sample

There are two important concepts we must consider: (1) The set of all possible **observations** about a specified characteristic of interest is called a **statistical population**. (2) A subset of a **statistical population**, or of the **frame** from which it is derived, is called a **sample**.
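The relationship between frame, population, and sample can be sketched in a few lines of Python. The salaries and unit IDs below are invented, and random selection is only one possible way to pick a subset; the definition above does not require it.

```python
import random

# Hypothetical observations of one characteristic (annual salary),
# one per elementary unit in a six-employee firm.
salary_by_unit = {
    "E001": 41000, "E002": 52000, "E003": 38500,
    "E004": 61000, "E005": 47250, "E006": 55000,
}

frame = list(salary_by_unit)               # listing of all elementary units
population = list(salary_by_unit.values()) # all possible observations of salary

# Draw a sample by choosing units from the frame (seed fixed for repeatability),
# then reading off the observation for each chosen unit.
rng = random.Random(42)
sampled_units = rng.sample(frame, k=3)
sample = [salary_by_unit[unit] for unit in sampled_units]

print(sampled_units, sample)
```

Note that the population here is the set of salary observations, not the set of employees; the employees make up the frame.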

# Qualitative vs. Quantitative Variables

Any given characteristic of interest to the **statistician** can differ in kind or in degree among various **elementary units**. A variable that is normally described in words rather than numerically is called a **qualitative variable**. As shown in the table above, examples of **qualitative variables** are: race, sex, and job title. **Qualitative variables** can, in turn, be **binomial** or **multinomial**. **Observations** about a **binomial qualitative variable** can be made in only two categories: for example, male or female, employed or unemployed, correct or incorrect, defective or satisfactory, elected or defeated, absent or present. **Observations** about a **multinomial qualitative variable** can be made in more than two categories; consider job titles, colors, languages, religions, or types of businesses.

On the other hand, a variable that is normally expressed **numerically** (because it differs in degree rather than kind among the **elementary units** under study) is called a **quantitative variable**. Examples of **quantitative variables**, as shown in the table above, include years of service and annual salary. **Quantitative variables** can, in turn, be **discrete** or **continuous**. **Observations** about a **discrete quantitative variable** can assume values only at specific points on a **scale** of values, with gaps between them. Examples include the number of children in families, of employees in firms, of students in classes, of rooms in houses, of cars in stock, or of cows in pastures. **Observations** about a **continuous quantitative variable** can, in contrast, assume values at all points on a **scale** of values, with no breaks between possible values. Consider height, temperature, time, volume, or weight.
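The classification described in this section can be approximated in code. The sketch below is a rough heuristic, not a definitive rule: it assumes qualitative observations are stored as strings, counts as integers, and measurements as floats, and it judges binomial versus multinomial from the categories actually observed. The function name `classify_variable` is hypothetical.

```python
def classify_variable(values):
    """Guess a variable's type from its observed values (heuristic only)."""
    # Words rather than numbers: qualitative.
    if all(isinstance(v, str) for v in values):
        kind = "binomial" if len(set(values)) <= 2 else "multinomial"
        return ("qualitative", kind)
    # Whole-number counts: values sit at specific points with gaps between them.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return ("quantitative", "discrete")
    # Any other purely numeric data is treated as measured on a continuous scale.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return ("quantitative", "continuous")
    raise TypeError("mixed or unsupported value types")


print(classify_variable(["employed", "unemployed", "employed"]))
print(classify_variable(["Clerk", "Engineer", "Manager"]))
print(classify_variable([2, 0, 3, 1]))
print(classify_variable([170.2, 181.0, 165.5]))
```

In practice the type of a variable is a property of what is being measured, not of how it happens to be stored, so a salary recorded in whole dollars would be labeled discrete by this heuristic even if the investigator treats it as continuous.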

# Surveys vs. Experiments

The **collection** of **data** from **elementary units** without exercising any particular **control** over **factors** that make these **units** different from one another and that may, therefore, affect the characteristic of interest being observed is called an **observational study** or **survey**.

On the other hand, the **collection** of **data** from **elementary units** while exercising **control** over some or all **factors** that may make these **units** different from one another and that may, therefore, affect the characteristic of interest being observed is called an **experiment**.
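The difference comes down to who determines the value of a factor. In the sketch below, the factor (whether a unit received training) is simply recorded as found in the survey case, while in the experiment case the investigator sets it. Random assignment is used here as one common way of exercising control; that choice, the unit IDs, the factor, and the helper `assign_groups` are all assumptions made for illustration.

```python
import random

units = ["U1", "U2", "U3", "U4", "U5", "U6"]

# Observational study: the factor is recorded as it already is.
# The investigator has no say in which units are trained (invented values).
found_training = {
    "U1": "trained", "U2": "untrained", "U3": "untrained",
    "U4": "trained", "U5": "untrained", "U6": "untrained",
}


def assign_groups(units, seed=0):
    """Experiment: the investigator decides which units get which factor level."""
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # The first half of the shuffled units is trained, the rest is not.
    return {
        unit: ("trained" if position < half else "untrained")
        for position, unit in enumerate(shuffled)
    }


assigned_training = assign_groups(units)
print(found_training)
print(assigned_training)
```

Both dictionaries have the same shape; what differs is how the values came about, which is exactly the distinction between a survey and an experiment.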

# Census Taking vs. Sampling

A **census** is a complete **survey** in which **observations** about one or more characteristics of interest are made for every **elementary unit** that exists.

A **sample survey** is a partial **survey** in which **observations** about one or more characteristics are made for only a subset of all existing **elementary units**.
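A short sketch makes the contrast concrete: the same survey procedure is a census when applied to every unit in the frame and a sample survey when applied to a subset. The data are invented, the helper `survey` is hypothetical, and taking every second unit is an arbitrary selection rule chosen only to keep the example short.

```python
# Hypothetical years of service for every employee in a small firm.
years_of_service = {
    "E001": 3, "E002": 12, "E003": 7,
    "E004": 1, "E005": 20, "E006": 5,
}
frame = sorted(years_of_service)


def survey(units):
    """Record the characteristic of interest for each of the given units."""
    return {unit: years_of_service[unit] for unit in units}


census = survey(frame)              # every elementary unit is observed
sample_survey = survey(frame[::2])  # only every second unit is observed

coverage = len(sample_survey) / len(census)
print(len(census), len(sample_survey), coverage)
```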

# References

*Kohler, H., 1994. Statistics For Business And Economics. 3rd ed. New York: HarperCollins College Publishers, pp.5-10.*