Introduction

I’m currently spending my time in the USA, and it’s (pre-)election time! Having a membership at a 24h gym exposes me to a massive amount of politics on TV while I’m doing my time on the crosstrainer or even when I’m just changing in the dressing room. It’s February, the elections are in November, but there is a near-constant broadcast of candidates battling each other on CNN.

I’m impressed and surprised by this omnipresence of the elections on TV, already so early on. And I’m concerned about the rudeness that these debates often display. Politics in Europe are different. They’re not golden, nope. There are insults and dirty tricks there, too. But here in the US there’s a level of “attack”, “aggression” and “god” in the talks that I am thoroughly not used to. :/

One time in the dressing room, I got stuck watching the TV talk about Dollars per Vote, and found this comparison very interesting. Some candidates spend very little money in absolute terms compared to others, and yet some of them still end up spending more money per vote received.

Now, my analysis will not deal with this comparison, since the ballots will not be cast for a few more months. Instead, I will make a preliminary analysis of the money that the presidential candidates received from individual contributors in the run-up to the election.


Pt. 0: The Data

The data for the 2016 Presidential Campaign Contributions can be found here: http://fec.gov/disclosurep/PDownload.do
Since I am currently in California, I downloaded the dataset for that state, CA.zip.

Here’s also a peek at the structure of the data that the .csv contains. It’s an excerpt from this file: ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt
I have also indicated which of the columns I ended up dropping.

The text file is comma delimited and uses double-quotation marks as the text qualifier.

------------------------------------------------------------------------ 

CMTE_ID                 COMMITTEE ID                                      S <- dropped
CAND_ID                 CANDIDATE ID                                      S
CAND_NM                 CANDIDATE NAME                                    S
CONTBR_NM               CONTRIBUTOR NAME                                  S
CONTBR_CITY             CONTRIBUTOR CITY                                  S
CONTBR_ST               CONTRIBUTOR STATE                                 S <- dropped
CONTBR_ZIP              CONTRIBUTOR ZIP CODE                              S
CONTBR_EMPLOYER         CONTRIBUTOR EMPLOYER                              S
CONTBR_OCCUPATION       CONTRIBUTOR OCCUPATION                            S
CONTB_RECEIPT_AMT       CONTRIBUTION RECEIPT AMOUNT                       N
CONTB_RECEIPT_DT        CONTRIBUTION RECEIPT DATE                         D 
RECEIPT_DESC            RECEIPT DESCRIPTION                               S <- dropped
MEMO_CD                 MEMO CODE                                         S <- dropped
MEMO_TEXT               MEMO TEXT                                         S <- dropped
FORM_TP                 FORM TYPE                                         S <- dropped
FILE_NUM                FILE NUMBER                                       N <- dropped
TRAN_ID                 TRANSACTION ID                                    S <- dropped
ELECTION_TP             ELECTION TYPE/PRIMARY GENERAL INDICATOR           S <- dropped


Data Type:  S = string (alpha or alpha-numeric); D = date; N = numeric  

------------------------------------------------------------------------ 

One thing that becomes very obvious here, and that strongly influenced my exploration, is that among the columns I kept there is only one numerical variable: contb_receipt_amt.

This is the column that holds the amounts of the contributions in $, and it is therefore also one of the most interesting columns. A lot of comparisons were anchored to it.

However, there is also another “hidden” numerical value associated with the dataset: the number of rows, which allows many graphs to be created in combination with the categorical values of the other columns.


Ideas and Interests that I will explore

  • Who got the most contributions? Which person, which party, which gender?

  • Which city contributed the most, and to whom (also: in relation to the number of citizens)?

  • How many “NOT EMPLOYED” people contributed to the campaigns as compared to people with employment; Also: whom did the “NOT EMPLOYED” support the most?


Pt. 1: Mini-Wrangling

In order to be able to load the data, I had to add a comma at the end of the header row in my .csv file.
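
A minimal sketch of why the extra comma can help, assuming the data rows end with a trailing comma while the header row does not (so the header has one field fewer than the rows). The two sample rows are invented for illustration; only the column names come from the data dictionary.

```r
# Toy stand-in for the FEC file: data rows end with a comma, the header does not.
# (The two rows are made up for illustration.)
raw <- c(
  'cand_nm,contb_receipt_amt',
  '"Sanders, Bernard",27,',
  '"Clinton, Hillary Rodham",2700,'
)

# Add the missing comma to the header so it has as many fields as the rows.
raw[1] <- paste0(raw[1], ",")

df <- read.csv(text = paste(raw, collapse = "\n"), stringsAsFactors = FALSE)

# The trailing commas produce one empty extra column at the end; drop it.
df <- df[, -ncol(df)]
str(df)
```

Without the fix, read.csv() would see one more field in the rows than in the header and treat the first column as row names, which shifts every column by one.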


Adding new columns: cand_gender and cand_party

For some of my questions, I am interested in looking at the data by gender and party affiliation of the different candidates. Since these columns are not present in the dataset, I gathered the required information online and added it as two new columns to my data.frame df.

First I’ll be mapping Candidate names to Candidate IDs. This I’ll do so that I will be able to refer to a cand_id when passing a function, which is easier than writing out the full name.

Now I can add these new columns to the data.frame. There is one member of the Green party, who gets the value ‘G’; the Republicans get the value ‘R’ and the Democrats the value ‘D’.
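
One possible way to add such columns is a named-vector lookup. This is only a sketch: it keys the lookup by cand_nm rather than by cand_id, covers just four candidates, and uses a tiny stand-in data.frame.

```r
# Small stand-in for the real data.frame; only cand_nm is needed here.
df <- data.frame(
  cand_nm = c("Sanders, Bernard", "Clinton, Hillary Rodham",
              "Stein, Jill", "Trump, Donald J.", "Sanders, Bernard"),
  stringsAsFactors = FALSE
)

# Lookup tables keyed by candidate name (only four candidates shown).
party_lookup <- c(
  "Sanders, Bernard"        = "D",
  "Clinton, Hillary Rodham" = "D",
  "Stein, Jill"             = "G",
  "Trump, Donald J."        = "R"
)
gender_lookup <- c(
  "Sanders, Bernard"        = "M",
  "Clinton, Hillary Rodham" = "F",
  "Stein, Jill"             = "F",
  "Trump, Donald J."        = "M"
)

# Indexing a named vector by the name column yields one value per row.
df$cand_party  <- unname(party_lookup[df$cand_nm])
df$cand_gender <- unname(gender_lookup[df$cand_nm])
df
```

Any candidate missing from a lookup vector would end up as NA, which makes gaps in the mapping easy to spot.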


Removing obsolete columns

There is quite a lot of data that I do not intend to use for my exploration. This includes e.g. contbr_st, which is in my case always CA, or file_num, a unique number used to link the transactions to the reports (which I’ll also drop).
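
Dropping the columns marked as dropped in the data dictionary could be sketched like this. The column names are assumed to be the lower-case versions of the dictionary names, and the two rows are made up.

```r
# Tiny stand-in with two columns to keep and two to drop (values are invented).
df <- data.frame(
  cand_nm           = c("Sanders, Bernard", "Clinton, Hillary Rodham"),
  contbr_st         = c("CA", "CA"),
  file_num          = c(1, 2),
  contb_receipt_amt = c(50, 100),
  stringsAsFactors  = FALSE
)

# Columns marked "<- dropped" in the data dictionary.
drop_cols <- c("cmte_id", "contbr_st", "receipt_desc", "memo_cd", "memo_text",
               "form_tp", "file_num", "tran_id", "election_tp")

# Keep every column whose name is not in the drop list.
df <- df[, !(names(df) %in% drop_cols), drop = FALSE]
names(df)
```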


Pt. 2: Exploration

After having cleaned and adapted the dataset to my wishes, I can start exploring.

For example, I can query how many individual contributions each candidate received, e.g. Hillary Clinton:

## [1] 42063
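
A count like this can be obtained by summing a logical comparison. The sketch below uses a made-up five-row data.frame, so its numbers are not the ones from the real data.

```r
# Made-up rows; only the candidate name matters for counting.
df <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Sanders, Bernard", "Sanders, Bernard",
              "Clinton, Hillary Rodham", "Sanders, Bernard"),
  stringsAsFactors = FALSE
)

# TRUE counts as 1, so the sum is the number of matching rows.
n_clinton <- sum(df$cand_nm == "Clinton, Hillary Rodham")

# table() gives the same kind of count for every candidate at once.
per_candidate <- table(df$cand_nm)
per_candidate
```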

QUESTION 1: Who got the most money through contributions?

In this section I’ll take a look at money and amounts. I’m wondering which candidate got the most contributions, and is it the same person who got the most money through contributions? How is the size of the contributions distributed in regards to the candidates? Which gender received more contributions, or more money? Which political party?


PART 1: the mean and distributions

Constructing a frequency plot shows me something about Sanders, Bernard (who has a big dot low down). Some candidates have a narrow distribution and some a wider one. There is another person with a big dot at the bottom, yet also quite a big one higher up (Clinton, Hillary), and one person whose contributions spread widely up and down, but who also seems to have a strong base (Cruz, Ted). Carson, Benjamin also shows a large number of contributions in the lower amounts.

This is an experiment in visualizing statistical variables, such as the mean or the median, on top of the data.

I chose to display the mean and the median contribution for each candidate in relation to each other.

Cleaned up the graph a bit and reordered ascending! This plot gives a nice overview of the mean contributions per candidate! Maybe it’s a keeper ; )
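
The numbers behind such a plot, mean and median contribution per candidate sorted ascending by mean, could be computed roughly as follows. The amounts are invented, and base R is used here instead of whatever the plot itself was built from.

```r
# Invented amounts for three candidates.
df <- data.frame(
  cand_nm = c("Sanders, Bernard", "Sanders, Bernard", "Sanders, Bernard",
              "Clinton, Hillary Rodham", "Clinton, Hillary Rodham",
              "Cruz, Rafael Edward 'Ted'"),
  contb_receipt_amt = c(10, 20, 60, 100, 2700, 500),
  stringsAsFactors = FALSE
)

# One value per candidate; both results come back in the same (alphabetical) order.
means   <- tapply(df$contb_receipt_amt, df$cand_nm, mean)
medians <- tapply(df$contb_receipt_amt, df$cand_nm, median)

stats <- data.frame(
  cand_nm = names(means),
  mean    = as.vector(means),
  median  = as.vector(medians),
  stringsAsFactors = FALSE
)

# Reorder ascending by mean contribution.
stats <- stats[order(stats$mean), ]
stats
```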


PART 2: amounts

Most of the contributions are below 1000 $!


PART 2.1: refunds

The previous graph gives me negative values for amounts, which is at first very confusing. So I construct a vector holding only the refunds.

## [1] 1498

Seems there are nearly 1500 “contributors” who actually got more money refunded than what they gave.

And some gave some, but also got some back (or got some back multiple times?):

## [1] 2102
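
One way to arrive at counts of this kind is sketched below, under my own reading of the two questions: contributors whose net total is negative, and contributors who have both positive and negative entries. The names and amounts are made up, and contributors are identified by contbr_nm alone, which is an assumption.

```r
# Made-up contributors: positive amounts are contributions, negative ones refunds.
df <- data.frame(
  contbr_nm = c("DOE, JANE", "DOE, JANE", "ROE, RICHARD", "ROE, RICHARD", "POE, ANN"),
  contb_receipt_amt = c(100, -250, 500, -200, 50),
  stringsAsFactors = FALSE
)

# All refund rows.
refunds <- df[df$contb_receipt_amt < 0, ]

# Net amount per contributor; a negative net means more refunded than given.
net <- tapply(df$contb_receipt_amt, df$contbr_nm, sum)
n_net_negative <- sum(net < 0)

# Contributors who appear with both a contribution and a refund.
gave     <- unique(df$contbr_nm[df$contb_receipt_amt > 0])
got_back <- unique(df$contbr_nm[df$contb_receipt_amt < 0])
n_both   <- length(intersect(gave, got_back))

c(refund_rows = nrow(refunds), net_negative = n_net_negative, both = n_both)
```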

So I tried a plot for these refunds (which was too big, so I’ll only look at the largest dollar amounts)

This is a lot of money to get back. I wonder why and how. But I must admit I don’t really understand US politics and how the candidates are funded. Here’s some info: https://ballotpedia.org/California_Proposition_34,_Limits_on_Campaign_Contributions_(2000)


PART 2.2: contributions

Well, so let’s see who gave the most:

## [1] 24

There are 24 people that all gave the same amount, so I suspect that there is an upper limit around 10,000$ (maybe without taxes, or such? They should already be PACs?)

I also spotted potentially erroneous data, with the contributor DE GROOTE, DOUG MR. being listed, as well as a DE GROOTE, DOUG, both with the same amount, which makes me believe that it is a mistake in the data (or a rather badly executed way of increasing one’s contribution limit…)

Maybe I can check :)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
## [23]  TRUE  TRUE

Yep, it seems that some of those transactions are listed more than once in the dataset. Actually quite a lot of them! 7 TRUE!
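
A logical vector like this is what duplicated() returns. A small sketch with invented rows (the names, the date format and the amount are placeholders):

```r
# Invented rows; the third is an exact copy of the first.
top <- data.frame(
  contbr_nm         = c("DOE, JANE", "ROE, RICHARD", "DOE, JANE"),
  contb_receipt_amt = c(10800, 10800, 10800),
  contb_receipt_dt  = c("01-FEB-16", "01-FEB-16", "01-FEB-16"),
  stringsAsFactors  = FALSE
)

# TRUE marks a row that repeats an earlier row in every column.
dup <- duplicated(top)
dup

# Number of repeated rows.
sum(dup)
```

Note that duplicated() only flags the second and later occurrences, so the first copy of each repeated transaction stays FALSE.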

It seems that DE GROOTE could be also a company, because it’s listed also among the contbr_employer column. Well. But I’m not gonna go hunt down these individuals :)


PART 3: gender and money

Number of contributions per gender of candidate

Total amount of money through contributions per gender of candidate

Whoa! It seems that female candidates received nearly as much money in contributions as male candidates did, even though there are only 3 female vs. 19 male candidates, and there were far fewer contributions to female candidates than to male candidates!


PART 3.1: binning

It could be interesting to take a look at the mean contribution per F/M candidate.

So here are the summaries for the contributions for the female candidates, then for the male candidates, and finally for the whole dataset combined:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -5400.0    25.0   100.0   501.7   250.0  5400.0
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10000.0     25.0     50.0    179.2    100.0  10800.0
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10000.0     25.0     50.0    262.4    100.0  10800.0
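
Summaries of this kind come from summary(). A sketch of how they could be produced per gender and for the whole dataset, using a handful of invented amounts:

```r
# Invented amounts; cand_gender is the column added during wrangling.
df <- data.frame(
  cand_gender       = c("F", "F", "M", "M", "M"),
  contb_receipt_amt = c(25, 250, -100, 50, 100),
  stringsAsFactors  = FALSE
)

# One six-number summary per gender, returned as a named list.
by_gender <- lapply(split(df$contb_receipt_amt, df$cand_gender), summary)

# Summary over all rows.
overall <- summary(df$contb_receipt_amt)

by_gender
overall
```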

I’ll start with a generous binwidth of 1000:

This graphic is pretty useless - nearly every contribution falls within the first bin, and very few are above 1000. So I can adjust the binning:

Whoa! Most contributions are actually between 0-100$!! So there are many many small contributions that were made in CA. Let’s look at this data in a table:

## 
##         (0,100]       (100,200]       (200,300]       (300,400] 
##          135435            7100           12498             493 
##       (400,500]       (500,600]       (600,700]       (700,800] 
##            5781             140             214             144 
##       (800,900]     (900,1e+03]   (1e+03,2e+03]   (2e+03,3e+03] 
##              47            4777            1448            9823 
## (3e+03,1.1e+04] 
##             445
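
A table with these bins can be built with cut() and table(). In the sketch below the break points are read off the bin labels above, while the amounts are invented; the assumption is that refunds (negative amounts) simply fall outside all bins.

```r
# Invented amounts, including one refund.
amounts <- c(5, 27, 100, 150, 250, 1000, 2700, 5400, -50)

# Steps of 100 up to 1000, then wider bins up to 11000.
breaks <- c(seq(0, 1000, by = 100), 2000, 3000, 11000)

# Each amount is assigned to a right-closed interval such as (0,100].
bins <- cut(amounts, breaks = breaks)

# Values outside the breaks (here: the refund) become NA and are not counted.
counts <- table(bins)
counts
```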

Interesting, because I already saw that most low contributions went to one candidate: Sanders, Bernard.


PART 3.2: proportions

I’ll go forward displaying the proportions of female/male candidates in the respective parties.

The proportion of candidates of each gender running in the presidential elections is very different between the three parties.

Let’s see a graph plotting the percentage of money per candidate per gender.

But of course that’s VERY misleading!… Clinton, Hillary got so much of the amounts contributed to female candidates, that we should make this more clear:

As we can see, Hillary Clinton received by far the largest amount of money through contributions. She nearly single-handedly takes half of all the contributions made.


PART 4: party money

I can also see that Clinton and Sanders make up the two biggest single sections, and they are both with the Democrats. Therefore it would be interesting to plot money per party instead of by gender.

Here we can see that the Democrats received way more money through contributions than the Republicans did, and that Hillary Clinton alone received more money than all the Republican candidates combined. The contributions to the Green party, in contrast, are so tiny that they become invisible in this visualization.


PART 5: candidate money

Number of contributions per candidate:

## 
##                 Bush, Jeb       Carson, Benjamin S. 
##                      2762                     21045 
##  Christie, Christopher J.   Clinton, Hillary Rodham 
##                       316                     42063 
## Cruz, Rafael Edward 'Ted'            Fiorina, Carly 
##                     21645                      4426 
##        Graham, Lindsey O.            Huckabee, Mike 
##                       331                       447 
##             Jindal, Bobby           Kasich, John R. 
##                        31                       701 
##          Lessig, Lawrence   O'Malley, Martin Joseph 
##                       372                       383 
##         Pataki, George E.                Paul, Rand 
##                        20                      4117 
##    Perry, James R. (Rick)              Rubio, Marco 
##                       116                      7994 
##          Sanders, Bernard      Santorum, Richard J. 
##                     72179                        79 
##               Stein, Jill          Trump, Donald J. 
##                        85                       590 
##             Walker, Scott     Webb, James Henry Jr. 
##                       670                       106

While most are vanishingly small in comparison, four candidates stick out in terms of number of contributions.

Let’s look at the money amassed through the contributions:

These plots are not very interesting and don’t tell me too much new, so enough of this topic for now.


QUESTION 2: Which city contributed the most, and to whom?

The next one is actually quite interesting, because it shows how people often give full-number-amounts of money (with the distinct lines going vertical).

Here I’ll apply geom_jitter() only to the actual contributions, leaving out the refunds.

It nicely shows how much most people donate (first big bar to the right of 0), which cities had the most contributions (horizontal black lines), and where there are common discrete jumps, maybe related to regulations such as donation limits (vertical lines). This is a nice graph :)

Trying to remove cities and keep only the ones with high contribution amounts:

Hm… this is not so interesting, so maybe I should rather right away put it in relation with the number of inhabitants.


Getting Population estimates for CA

I found some here: https://www.census.gov/popest/data/cities/totals/2012/SUB-EST2012.html
The data is for 2012, but it’s the most recent data listing cities that I could find there. It’s not perfect, but for now I just want to take a look :)

Could be interesting to display this with a facet_wrap(). However, there are too many cities. So I should maybe try this instead:

  1. calculating the ratios for contributions/inhabitant
  2. subsetting into categories of lower quartile, mean, upper quartile
  3. showing one plot for maybe the mean of each group

Calculating the Ratios

##    contbr_city contb_receipt_amt POPESTIMATE2012 number_of_contb
## 1     ADELANTO           1000.00           31239               2
## 2 AGOURA HILLS          72089.48           20657             346
## 3      ALAMEDA         252246.21         1554720            2064
## 4      ALAMEDA         252246.21           75641            2064
## 5       ALBANY          55625.94           18969             594
## 6     ALHAMBRA         117505.04           84322             362

Oops, there is also data for “counties” in here, which now have the same name as some of the cities… so I’ll have to remove the rows with the higher population values.

##    contbr_city contb_receipt_amt POPESTIMATE2012 number_of_contb
## 1     ADELANTO           1000.00           31239               2
## 2 AGOURA HILLS          72089.48           20657             346
## 4      ALAMEDA         252246.21           75641            2064
## 5       ALBANY          55625.94           18969             594
## 6     ALHAMBRA         117505.04           84322             362
## 7  ALISO VIEJO          70309.58           49493             282

That’s better :)
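
A sketch of one way to drop the county rows and then compute the per-inhabitant ratios, using the first rows of the table shown above. It assumes that, whenever a name appears twice, the row with the smaller population is the city. The names of the two ratio columns are my own.

```r
# First rows of the merged table, including the duplicated ALAMEDA entry.
cities <- data.frame(
  contbr_city       = c("ADELANTO", "AGOURA HILLS", "ALAMEDA", "ALAMEDA", "ALBANY"),
  contb_receipt_amt = c(1000.00, 72089.48, 252246.21, 252246.21, 55625.94),
  POPESTIMATE2012   = c(31239, 20657, 1554720, 75641, 18969),
  number_of_contb   = c(2, 346, 2064, 2064, 594),
  stringsAsFactors  = FALSE
)

# Sort so that, within each name, the smaller population comes first ...
cities <- cities[order(cities$contbr_city, cities$POPESTIMATE2012), ]
# ... then keep only the first row per name, which drops the county row.
cities <- cities[!duplicated(cities$contbr_city), ]

# Ratios per inhabitant (hypothetical column names).
cities$amt_per_capita   <- cities$contb_receipt_amt / cities$POPESTIMATE2012
cities$contb_per_capita <- cities$number_of_contb / cities$POPESTIMATE2012
cities
```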

Here comes a bit more playing around while extracting the cities with the most contributions, and trying to use melt().

I’m looking to narrow down to the cities with the most contributions. Trying different values, I found that using 30000 limits the number of cities to

## [1] 5
## [1] "LOS ANGELES"   "OAKLAND"       "SAN DIEGO"     "SAN FRANCISCO"
## [5] "SAN JOSE"

which is something I can work with! :)

Here’s an interesting little fact: more than half of all contributions come from only 5 cities! Percentage:

## [1] 56.87286
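
The share of contributions coming from the top cities can be computed along these lines. This is a sketch on invented rows; the cutoff of 2 is a placeholder for the toy data and is applied to the number of contributions per city, which is an assumption about how the selection works.

```r
# Invented rows: one row per contribution.
df <- data.frame(
  contbr_city = c(rep("LOS ANGELES", 4), rep("SAN FRANCISCO", 3),
                  "ALBANY", "ADELANTO"),
  stringsAsFactors = FALSE
)

# Placeholder cutoff for this toy data.
threshold <- 2

# Number of contributions per city.
city_counts <- table(df$contbr_city)

# Cities above the cutoff.
top_cities <- names(city_counts[city_counts > threshold])

# Percentage of all contributions that come from those cities.
pct <- 100 * sum(city_counts[top_cities]) / sum(city_counts)

length(top_cities)
top_cities
pct
```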

Well… but somehow I’m more of a country-boy myself. For now I have enough of these cities, so I’ll now move forward to investigate another question. :)


QUESTION 3: How much and to whom did NOT EMPLOYED give

I am very perplexed by and concerned with the topic of homelessness in the US, so anything that goes in that direction rings a bell with me. Here I’m trying to investigate a little which political direction homeless people might lean towards.

However, I understand that this is highly hypothetical, because I only have data on monetary contributions that “NOT EMPLOYED” people gave to the presidential campaigns. Of course giving money != political orientation. It might be a good proxy, but it’s not an exhaustive factor: many people have a political orientation without contributing to a campaign monetarily. This might be especially true for homeless people, who are very likely to have very little money on hand. Further, NOT EMPLOYED != homeless. There are quite a few people in the US who are employed but homeless; if they gave a contribution, they would fall into a different category. NOT EMPLOYED could also be people who are simply not working in the official job market at the moment. Stay-at-home mums and dads would qualify for this section of society, and they might make up a rather large number of contributors. These factors have to be considered when looking at the results further down.

The following graphs are not going to show much about how much money was flowing; they are rather intended as a proxy for where a certain section of society leans politically.

Who gave how much?

Okay, this is not so exciting :) Wait…

## 
##     employed not employed 
##       159985        20446

There are very few NOT EMPLOYED compared to those with employment, so I’ll need ratios.

Displaying proportions of which party did people with/without employment give contributions to:

Interesting: it seems that a much higher percentage of people without employment contribute to the Democrats.
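
The ratios behind such a comparison can be sketched with prop.table(). The rows below are invented, and the rule used to derive contb_emp_status (everything except the literal occupation NOT EMPLOYED counts as employed) is an assumption.

```r
# Invented rows: occupation of the contributor and party of the candidate.
df <- data.frame(
  contbr_occupation = c("NOT EMPLOYED", "ENGINEER", "TEACHER",
                        "NOT EMPLOYED", "NURSE", "NOT EMPLOYED"),
  cand_party        = c("D", "R", "D", "D", "R", "R"),
  stringsAsFactors  = FALSE
)

# Reduce the occupation to a two-level employment status.
df$contb_emp_status <- ifelse(df$contbr_occupation == "NOT EMPLOYED",
                              "not employed", "employed")

# Counts of contributions by status (rows) and party (columns).
counts <- table(df$contb_emp_status, df$cand_party)

# margin = 1 turns each row into proportions that sum to 1,
# so the two groups can be compared despite their different sizes.
shares <- prop.table(counts, margin = 1)
shares
```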

NOTE: I boosted the max_size variable in order to make it clearer how vanishingly small the percentage of NOT EMPLOYED people that contributed to the Republican party is.


So, let’s see for whom

Sanders, Bernard gets percentage-wise the most contributions from the NOT EMPLOYED! And it seems that apart from Trump, Donald J. there is no Republican currently left on the ballot who got contributions from NOT EMPLOYED people.

When increasing the max_size here as well, it becomes even clearer how great the difference between Sanders, Bernard and the other candidates is in this aspect.

NOTE: NOT EMPLOYED probably also includes college students! Which makes sense because Sanders wants to take away college debt, so he has a big bunch of the students on his side (I heard of approximately 80% somewhere).

Let’s look at this in another way:

This sounds like a pie chart, haha, also with the implications of who gets the biggest piece :)


Thinking about occupation-voter distribution, now I’m wondering which party IT people lean towards. :)

So I collected all the unique jobs present in the data.frame.

## [1] 8958

Haha, found one row listing contbr_occupation : GRANDPA !! :)

But well, these are too many, and seems that people just put what they wanted. It hasn’t been scanned through and grouped it seems. I don’t want to get into this. So let’s rather wrap it up :)


Pt. 3: Discussion of the Analysis

Structure of the dataset

The dataset consists mainly of categorical variables, such as cand_nm, contbr_occupation or contbr_zip. All these are interesting to put in context with the one continuous variable contb_receipt_amt, and of course the number of rows.

A more thorough overview of the data can be taken from the official site: ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt

Main feature(s) of interest in the dataset

The main question is how the number of individual contributions and the amount of money gained through them are spread out among the different candidates. At a further level of abstraction it is interesting to look at the same after aggregating or subsetting the dataset (e.g. by gender of the candidate, or by origin city of the contributor).

Other features that help to support the investigation into the features of interest

I found it interesting to add the population count of the cities in CA, to be able to calculate how many contributions and how much money in total, respectively, were given per city in relation to the number of inhabitants.

New variables created (indirectly) from existing variables

I created some new columns containing new variables:

  • cand_gender, defining the gender of the presidential candidate ('F' for “female” and 'M' for “male”)
  • cand_party, defining the party affiliation of the candidate ('D' for “Democrats”, 'R' for “Republicans”, and 'G' for “Greens”)
  • POPESTIMATE2012, containing the population estimates of the CA cities for 2012, taken from a different governmental source online
  • contb_emp_status, where I reduced the contbr_occupation to show only whether a contributor was employed or not

Strongest relationship discovered

In my opinion the strongest relationship, which I kept discovering across different plots, was that the number of contributions received was very clearly “won” by one candidate: Sanders, Bernard.


Pt. 4: Final Plots and Summary

Plot One

Description One

This graph displays how much money each party and each candidate received through individual contributions for the presidential elections in 2016.

It shows values only for those (main) candidates that are currently still running for the presidency in the primaries (as of 2016-03-02).

We can see the absolute amounts of money in USD that each of the two major political parties in the US received through individual contributions.

It is immediately obvious that the Democratic party obtained more than twice the amount of contributions that the Republican party managed to raise.

The contributions are further subdivided to display the amounts that each individual candidate received.

Here Hillary Clinton sticks out with an amazingly big sum of contribution-money received. She alone got more Dollars than all the other candidates combined.


Plot Two

Description Two

The graph shows all contributions to the candidates of the 2016 presidential elections, arranged by cities in California. The 5 cities where the highest number of contributions were given are labeled on the y-axis.

While most of these cities are also among the 5 most populous cities in CA, Oakland holds only the 8th place in that ranking. This means that it has an unusually high ratio of contributions per capita.

Another interesting feature of this graph is the clear vertical lines. These lines most likely denote full-number-amounts and/or contribution-limits defined by law.


Plot Three

Description Three

This graph gives some insight into which candidates of the 2016 presidential elections people without employment contributed to.

While most individual contributions come from people with employment, only three candidates received a significant amount of money from people who declared themselves as NOT EMPLOYED. These are:

  • Sanders, Bernard
  • Clinton, Hillary Rodham
  • Trump, Donald J.

It is interesting to see that most of the contributions of this sector of society went to the Democrats, and yet again the largest part to Sanders, Bernard.

Many college and university students are most likely counted among the NOT EMPLOYED. Sanders, Bernard is advocating free tuition for public colleges and universities, and enjoys high popularity among students. This might explain the large number of individual contributions received from people that are “unemployed”.


Pt. 5: Reflection

For me this was a very interesting project to work on. Being in the USA, I get constantly exposed to the upcoming (and long-lasting) process of narrowing down the presidential candidates.

Being able to dive into the topic by doing data analysis gave me an interesting new way of approaching it, and I believe I managed to get a much more complete picture of what is currently going on. Working on the project made me investigate some auxiliary topics regarding US politics that were (more or less) necessary to perform the analyses that I chose to do. Further, having raw data at hand and R as a tool to take it apart and put it together allowed me an exciting venture into Exploratory Data Analysis.

I truly felt it as what its name implies :)

I had not used R before, and the language seems to be structured differently from Python, which I am more familiar with. I repeatedly noticed how I was trying to solve tasks by looping over the dataframe, instead of e.g. subsetting it. However, I believe that through working on the project I managed to get a good beginner’s grasp on the language, and especially on working with ggplot. I thoroughly enjoy the layered structure that ggplot uses to build graphs and found it to be rather intuitive and powerful.

I’ve spent a rather large amount of time figuring out some details about how to tweak these layers to display the graphs in specific ways. The ability to format and reduce information contained within a graph can make a huge difference regarding how accessible it will feel and how easily it can be read.

Digesting the discovered insights seems to me a crucial part of communicating the results of the analysis, and the formatting options of ggplot provide ample opportunity for this.


Certain decisions during my analysis were also influencing the final outcomes. For example regarding the Second Plot, I originally wanted to display both aspects that the original graph that utilizes all of the data shows - the horizontal and the vertical lines. However, I did not manage to achieve a certain aspect of the final visual formatting (handling the large amount of categorical values = city names on the y-axis satisfactorily), so I ended up deciding to instead reduce the information I will present to a smaller subset of the data.

Through this decision I effectively lost some part of the story that the data could tell; however, the focus allows another aspect to be more present. I believe that these kinds of choices are often necessary to make. Big data can tell an infinite number of stories, and it always depends on which parts I focus on, which variables I put into context, and what type of visualization I choose for it.

It’s a sea of data out there, and often it can be interesting to just take a look at one aspect of its shore.


Throughout my analysis I kept encountering issues that required me to restructure the data I was using, or become more specific in which part of it to access.

Some findings led me forward, and even prompted me to look for additional data outside of the dataset provided (gender of the candidates, which candidates are currently still in the Primaries, but also population estimates for all CA cities).

Many investigations died down at a certain point, when I started to lose interest in continuing to explore that path. However, I am aware that there are tons of directions to take and questions to explore further. :)

E.g. I eventually abandoned comparing the ratios of population information of the CA cities to the number or summed amounts of contributions given in those cities. However, this would be a very interesting aspect of the data to explore further.

Also, I’ve narrowed down the question of which candidates people of certain employment groups are more likely to contribute to, by only investigating unemployed contributors. It would be equally interesting to select a certain occupation-group (or a cluster of specific jobs) and analyze how high their mean contributions are and into which party’s pocket they are more likely to flow.


All in all I really enjoyed working on this project, learning to work with R, and educating myself more on political structures and current events in the area that I am currently spending my time in. Happily learning! :)

