Small drug use data analysis

Posted on Вт 06 Сентябрь 2016 in data analysis

I was looking for health databases for my first reseach. While searching I found the National Survey on Drug Use and Health. Data were collected and prepared for release by Research Triangle Institute, Research Triangle Park, North Carolina. Since these data are available for the general public, I chose this database. I downloaded the data file and the codebook for transcription. Reading the codebook (at least the introduction) is important for better data understanding.

For each variable in the 2012 data file, the codebook provides a variable name, a description of the variable, value codes and their meanings, and an unweighted univariate frequency distribution. Most of the variables originated directly as interview items. The variable names in this codebook are the same as variable names used in the dataset.

So, now let's take a look at the data:

In [1]:

import pandas as pd
import matplotlib.pyplot as plt

file_name = "34933-0001-Data.tsv"
data = pd.read_csv("data/DrugUse/ICPSR_34933/DS0001/{0}".format(file_name), sep="\t")
data.head()

Out[1]:

	CASEID	QUESTID2	CIGEVER	CIGOFRSM	CIGWILYR	CIGTRY	CIGYFU	CIGMFU	CIGREC	CIG30USE	...	IIEMPSTY	II2EMSTY	EMPSTAT4	IIEMPST4	II2EMST4	PDEN00	COUTYP2	ANALWT_C	VESTR	VEREP
0	1	50886467	2	4	4	991	9991	91	91	91	...	1	1	99	9	9	2	2	1275.597449	30054	2
1	2	13766883	2	99	99	991	9991	91	91	91	...	1	1	1	1	1	2	2	5191.071173	30031	1
2	3	17772877	2	99	99	991	9991	91	91	91	...	1	1	1	1	1	3	3	419.742011	30056	2
3	4	45622817	1	99	99	13	9999	99	2	93	...	1	1	2	1	1	2	2	1449.303889	30054	1
4	5	17239390	1	99	99	11	9999	99	4	93	...	1	1	1	1	1	1	1	15344.293577	30012	2

5 rows × 3120 columns

We should clean the data for a better usability and to increase the speed of analysis (I added some unfamiliar variables for future):

In [2]:

needed_fields = ["QUESTID2", "LSD", "LSDAGE", "LSDAGLST", 
                 "CIGTRY", "CIG30AV", "CIGAGE", "ALCYRTOT", 
                 "COCEVER", "HEREVER", "CRKEVER", "METHDES", 
                 "DIETPILS", "MJEVER"]
data = data.loc[:, needed_fields]

I decided to start with LSD interview item. I found these variables in the codebook: LSD (ever used LSD ("acid")), LSDAGE (age when first used LSD), LSDAGLST (how old were you the last time used LSD).

Now I want to know how many people ever tried LSD:

In [6]:

lsd = data.groupby('LSD').agg({"QUESTID2": 'count'})
lsd.columns = ['COUNT']
lsd_prc = lsd.apply(lambda x: 100*x/float(x.sum()))
lsd_prc.columns = ['PERCENT']
lsd = lsd.join(lsd_prc)
lsd.head(6)

Out[6]:

	COUNT	PERCENT
LSD
1	3647	6.598755
2	3759	6.801404
3	2	0.003619
91	47816	86.516610
94	39	0.070565
97	5	0.009047

We need to transcript this and analyse what we get. 1 means "yes", 2 - "no", 3 means "yes", but logicaly assigned, 91 means "never used hallucinogens", 94 - "don't know" and 97 - "refused".

In [10]:

lsd["ANSWER"] = ["Yes", "No", "Yes (logicaly assigned)", 
                 "Never", "Don't know", "Refused"]
lsd.head(6)

Out[10]:

	COUNT	PERCENT	ANSWER
LSD
1	3647	6.598755	Yes
2	3759	6.801404	No
3	2	0.003619	Yes (logicaly assigned)
91	47816	86.516610	Never
94	39	0.070565	Don't know
97	5	0.009047	Refused

With these data we constructed a pie chart. For well-looking it would be better to put down items with percentage lower than 1.

In [7]:

%matplotlib inline
lsd[lsd['PERCENT'] >= 1]['PERCENT'].plot.pie(labels = ['Yes', 'No', 'Never'],
                             colors = ['yellowgreen', 'gold', 'lightskyblue'],
                            explode = (0.1, 0, 0),
                            autopct = '%1.1f%%', shadow = True)
plt.suptitle('Have you ever, even once, used LSD, also called "acid"?', fontsize=16)
plt.ylabel('')
plt.axis('equal')
plt.show()

So, it's only 6.6% ever tried LSD. Does it means hippies are still alive? Still, we have another interesting variable: age when first used lsd. For this data type it is nice to construct histogram for better visualiztion. This time I defined a function for histogram, because we will need it in future.

In [8]:

def hist(data, field, max = 1000, title = '', xlabel = '', xsteps = 1, color = 'green'):
    lsd_age = data[data[field] < max][field]
    lsd_age.plot.hist(color = color, bins = max - 1)
    plt.grid(b=True)
    plt.suptitle(title, fontsize=16)
    plt.xlabel(xlabel)
    plt.xticks(range(1, max, xsteps))
    
hist(data, 'LSDAGE', 68, 
     'How old were you the first time you used LSD?', 
     'Ages of first use LSD', 5, 'pink')

What a beatiful colorful childhood it was! I guess, we shouldn't care about some obviously false answers like "I was 1 (2/3/..) when i first time used LSD". For now, let's take a look at histogram with LAST time used LSD.

In [24]:

hist(data, 'LSDAGLST', 69, 
     'How old were you the last time you used LSD?', 
     'Ages of last use LSD', 5, 'red')

Interesting, some (and many) people said they used LSD last time at the age of 18. Should we trust them? I don't think so. I think it's time to take a rest from hallucinogens and construct same histograms for smoking people.

In [15]:

hist(data, 'CIGTRY', 69, 
     'How old were you the first time you smoked part or all of a cigarette?', 
     'Ages of first smoked a cigarette', 5, 'burlywood')

In [9]:

hist(data, 'CIGAGE', 69, 
     'How old were you when you first started smoking cigarettes every day?', 
     'Ages of first started smoking cigarettes', 5, 'brown')

We might compare these histograms for visual clarity and look at them a little closer:

In [24]:

hist(data, 'CIGTRY', 50, 
     'How old were you the first time you smoked part or all of a cigarette?', 
     'Ages of first smoked a cigarette', 2, 'burlywood')
hist(data, 'CIGAGE', 50, 
     'How old were you when you first started smoking cigarettes every day?', 
     'Ages of first started smoking cigarettes', 2, 'brown')

So many teenagers tried smoking so early, but only 18 is the "best" age to start smoking everyday. Dear parents, I think you should take it into consideration.

Now, let's take a look how LSD usage (ever once) correlate with the usage of other drugs. I constructed bar chart with yes-no answers to illustrate this correlation.

In [30]:

def lsd_something_prc(data, field, title):
    lsd_try = data[data["LSD"] == 1]
    no_lsd = data[data["LSD"] == 2]
    lsd_try = lsd_try[lsd_try[field] <= 2].groupby(field).agg({"QUESTID2": 'count'})
    no_lsd = no_lsd[no_lsd[field] <= 2].groupby(field).agg({"QUESTID2": 'count'})
    lsd_try.columns = ['COUNT']
    lsd_try = lsd_try.apply(lambda x: 100 * x / float(x.sum()))
    lsd_try.columns = ['Ever used lsd']
    no_lsd.columns = ['COUNT']
    no_lsd = no_lsd.apply(lambda x: 100 * x / float(x.sum()))
    no_lsd.columns = ['Never used lsd']
    lsd_coc = lsd_try.join(no_lsd)
    lsd_coc.plot.bar(color=['pink', 'blue'])
    plt.grid(b=True)
    plt.suptitle(title, fontsize=16)
    plt.xlabel('')
    plt.xticks([0, 1], ["Yes", "No"], rotation='horizontal')
    plt.ylabel('Percent')
    plt.ylim(ymax=100)

lsd_something_prc(data, "MJEVER", 
                  "Have you ever, even once, used marijuana or hashish?")
lsd_something_prc(data, "COCEVER", 
                  "Have you ever, even once, used any form of cocaine?")
lsd_something_prc(data, "HEREVER", 
                  "Have you ever, even once, used heroin?")
lsd_something_prc(data, "CRKEVER", 
                  "Have you ever, even once, used crack?")
lsd_something_prc(data, "DIETPILS", 
                  "Have you ever, even once, used prescription diet pills, such as Amphetamines (etc)?")
lsd_something_prc(data, "METHDES", 
                  "Have you ever, even once, used Methamphetamine, Desoxyn, or Methedrine?")

What does it mean? Maybe people who ever once tried any drugs are more open for experiments. But it's obvious that marijuana is the most popular drug and heroin is the most unpopular (thank God).

At this interesting point, I want to finish my first data analysis. There are a lot of data in this database, so analysis might become endless. But for now I want to say goodbye and wish you a good health.