Small drug use data analysis

Posted on Вт 06 Сентябрь 2016 in data analysis

I was looking for health databases for my first reseach. While searching I found the National Survey on Drug Use and Health. Data were collected and prepared for release by Research Triangle Institute, Research Triangle Park, North Carolina. Since these data are available for the general public, I chose this database. I downloaded the data file and the codebook for transcription. Reading the codebook (at least the introduction) is important for better data understanding.

For each variable in the 2012 data file, the codebook provides a variable name, a description of the variable, value codes and their meanings, and an unweighted univariate frequency distribution. Most of the variables originated directly as interview items. The variable names in this codebook are the same as variable names used in the dataset.

So, now let's take a look at the data:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

file_name = "34933-0001-Data.tsv"
data = pd.read_csv("data/DrugUse/ICPSR_34933/DS0001/{0}".format(file_name), sep="\t")
data.head()
Out[1]:
CASEID QUESTID2 CIGEVER CIGOFRSM CIGWILYR CIGTRY CIGYFU CIGMFU CIGREC CIG30USE ... IIEMPSTY II2EMSTY EMPSTAT4 IIEMPST4 II2EMST4 PDEN00 COUTYP2 ANALWT_C VESTR VEREP
0 1 50886467 2 4 4 991 9991 91 91 91 ... 1 1 99 9 9 2 2 1275.597449 30054 2
1 2 13766883 2 99 99 991 9991 91 91 91 ... 1 1 1 1 1 2 2 5191.071173 30031 1
2 3 17772877 2 99 99 991 9991 91 91 91 ... 1 1 1 1 1 3 3 419.742011 30056 2
3 4 45622817 1 99 99 13 9999 99 2 93 ... 1 1 2 1 1 2 2 1449.303889 30054 1
4 5 17239390 1 99 99 11 9999 99 4 93 ... 1 1 1 1 1 1 1 15344.293577 30012 2

5 rows × 3120 columns

We should clean the data for a better usability and to increase the speed of analysis (I added some unfamiliar variables for future):

In [2]:
needed_fields = ["QUESTID2", "LSD", "LSDAGE", "LSDAGLST", 
                 "CIGTRY", "CIG30AV", "CIGAGE", "ALCYRTOT", 
                 "COCEVER", "HEREVER", "CRKEVER", "METHDES", 
                 "DIETPILS", "MJEVER"]
data = data.loc[:, needed_fields]

I decided to start with LSD interview item. I found these variables in the codebook: LSD (ever used LSD ("acid")), LSDAGE (age when first used LSD), LSDAGLST (how old were you the last time used LSD).

Now I want to know how many people ever tried LSD:

In [6]:
lsd = data.groupby('LSD').agg({"QUESTID2": 'count'})
lsd.columns = ['COUNT']
lsd_prc = lsd.apply(lambda x: 100*x/float(x.sum()))
lsd_prc.columns = ['PERCENT']
lsd = lsd.join(lsd_prc)
lsd.head(6)
Out[6]:
COUNT PERCENT
LSD
1 3647 6.598755
2 3759 6.801404
3 2 0.003619
91 47816 86.516610
94 39 0.070565
97 5 0.009047

We need to transcript this and analyse what we get. 1 means "yes", 2 - "no", 3 means "yes", but logicaly assigned, 91 means "never used hallucinogens", 94 - "don't know" and 97 - "refused".

In [10]:
lsd["ANSWER"] = ["Yes", "No", "Yes (logicaly assigned)", 
                 "Never", "Don't know", "Refused"]
lsd.head(6)
Out[10]:
COUNT PERCENT ANSWER
LSD
1 3647 6.598755 Yes
2 3759 6.801404 No
3 2 0.003619 Yes (logicaly assigned)
91 47816 86.516610 Never
94 39 0.070565 Don't know
97 5 0.009047 Refused

With these data we constructed a pie chart. For well-looking it would be better to put down items with percentage lower than 1.

In [7]:
%matplotlib inline
lsd[lsd['PERCENT'] >= 1]['PERCENT'].plot.pie(labels = ['Yes', 'No', 'Never'],
                             colors = ['yellowgreen', 'gold', 'lightskyblue'],
                            explode = (0.1, 0, 0),
                            autopct = '%1.1f%%', shadow = True)
plt.suptitle('Have you ever, even once, used LSD, also called "acid"?', fontsize=16)
plt.ylabel('')
plt.axis('equal')
plt.show()

So, it's only 6.6% ever tried LSD. Does it means hippies are still alive? Still, we have another interesting variable: age when first used lsd. For this data type it is nice to construct histogram for better visualiztion. This time I defined a function for histogram, because we will need it in future.

In [8]:
def hist(data, field, max = 1000, title = '', xlabel = '', xsteps = 1, color = 'green'):
    lsd_age = data[data[field] < max][field]
    lsd_age.plot.hist(color = color, bins = max - 1)
    plt.grid(b=True)
    plt.suptitle(title, fontsize=16)
    plt.xlabel(xlabel)
    plt.xticks(range(1, max, xsteps))
    
hist(data, 'LSDAGE', 68, 
     'How old were you the first time you used LSD?', 
     'Ages of first use LSD', 5, 'pink')

What a beatiful colorful childhood it was! I guess, we shouldn't care about some obviously false answers like "I was 1 (2/3/..) when i first time used LSD". For now, let's take a look at histogram with LAST time used LSD.

In [24]:
hist(data, 'LSDAGLST', 69, 
     'How old were you the last time you used LSD?', 
     'Ages of last use LSD', 5, 'red')

Interesting, some (and many) people said they used LSD last time at the age of 18. Should we trust them? I don't think so. I think it's time to take a rest from hallucinogens and construct same histograms for smoking people.

In [15]:
hist(data, 'CIGTRY', 69, 
     'How old were you the first time you smoked part or all of a cigarette?', 
     'Ages of first smoked a cigarette', 5, 'burlywood')
In [9]:
hist(data, 'CIGAGE', 69, 
     'How old were you when you first started smoking cigarettes every day?', 
     'Ages of first started smoking cigarettes', 5, 'brown')

We might compare these histograms for visual clarity and look at them a little closer:

In [24]:
hist(data, 'CIGTRY', 50, 
     'How old were you the first time you smoked part or all of a cigarette?', 
     'Ages of first smoked a cigarette', 2, 'burlywood')
hist(data, 'CIGAGE', 50, 
     'How old were you when you first started smoking cigarettes every day?', 
     'Ages of first started smoking cigarettes', 2, 'brown')

So many teenagers tried smoking so early, but only 18 is the "best" age to start smoking everyday. Dear parents, I think you should take it into consideration.

Now, let's take a look how LSD usage (ever once) correlate with the usage of other drugs. I constructed bar chart with yes-no answers to illustrate this correlation.

In [30]:
def lsd_something_prc(data, field, title):
    lsd_try = data[data["LSD"] == 1]
    no_lsd = data[data["LSD"] == 2]
    lsd_try = lsd_try[lsd_try[field] <= 2].groupby(field).agg({"QUESTID2": 'count'})
    no_lsd = no_lsd[no_lsd[field] <= 2].groupby(field).agg({"QUESTID2": 'count'})
    lsd_try.columns = ['COUNT']
    lsd_try = lsd_try.apply(lambda x: 100 * x / float(x.sum()))
    lsd_try.columns = ['Ever used lsd']
    no_lsd.columns = ['COUNT']
    no_lsd = no_lsd.apply(lambda x: 100 * x / float(x.sum()))
    no_lsd.columns = ['Never used lsd']
    lsd_coc = lsd_try.join(no_lsd)
    lsd_coc.plot.bar(color=['pink', 'blue'])
    plt.grid(b=True)
    plt.suptitle(title, fontsize=16)
    plt.xlabel('')
    plt.xticks([0, 1], ["Yes", "No"], rotation='horizontal')
    plt.ylabel('Percent')
    plt.ylim(ymax=100)

lsd_something_prc(data, "MJEVER", 
                  "Have you ever, even once, used marijuana or hashish?")
lsd_something_prc(data, "COCEVER", 
                  "Have you ever, even once, used any form of cocaine?")
lsd_something_prc(data, "HEREVER", 
                  "Have you ever, even once, used heroin?")
lsd_something_prc(data, "CRKEVER", 
                  "Have you ever, even once, used crack?")
lsd_something_prc(data, "DIETPILS", 
                  "Have you ever, even once, used prescription diet pills, such as Amphetamines (etc)?")
lsd_something_prc(data, "METHDES", 
                  "Have you ever, even once, used Methamphetamine, Desoxyn, or Methedrine?")

What does it mean? Maybe people who ever once tried any drugs are more open for experiments. But it's obvious that marijuana is the most popular drug and heroin is the most unpopular (thank God).

At this interesting point, I want to finish my first data analysis. There are a lot of data in this database, so analysis might become endless. But for now I want to say goodbye and wish you a good health.