I’m currently spending my time in the USA, and it’s (pre-)election time! Having a membership at a 24h gym exposes me to a massive amount of politics on TV while I’m doing my time on the crosstrainer or even when I’m just changing in the dressing room. It’s February, the elections are in November, but there is a near-constant broadcast of candidates battling each other on CNN.

I’m impressed and surprised by how omnipresent the elections already are on TV, this early on. And I’m concerned about the rudeness that these debates often display. Politics in Europe are different. They’re not golden, nope, and insults and dirty tricks are played there too. But here in the US there’s a level of “attack”, “aggression” and “god” in the talks that I am thoroughly not used to. :/

One time in the dressing room, I got stuck watching the TV talk about Dollars per Vote, and found the comparison very interesting. Some candidates spend very little money in absolute terms compared to others, and yet some of them still end up spending more money per vote received.

Now, my analysis will not deal with this comparison, since the ballots will not be cast for a few months yet. Instead, I will make a preliminary analysis of the money that the presidential candidates received from individual contributors in the run-up to the election.

Pt. 0: The Data

The data for the 2016 Presidential Campaign Contributions can be found here: Since I am currently in California, I chose and downloaded the dataset for

Here’s also a peek at the structure of the data that the .csv contains. It’s an excerpt from this file: I have also indicated which of the columns I ended up dropping.

The text file is comma delimited and uses double-quotation marks as the text qualifier.


CMTE_ID                 COMMITTEE ID                                      S <- dropped
CAND_ID                 CANDIDATE ID                                      S
CAND_NM                 CANDIDATE NAME                                    S
CONTBR_NM               CONTRIBUTOR NAME                                  S
CONTBR_CITY             CONTRIBUTOR CITY                                  S
CONTBR_ST               CONTRIBUTOR STATE                                 S <- dropped
CONTBR_ZIP              CONTRIBUTOR ZIP CODE                              S
CONTBR_EMPLOYER         CONTRIBUTOR EMPLOYER                              S
CONTB_RECEIPT_DT        CONTRIBUTION RECEIPT DATE                         D 
RECEIPT_DESC            RECEIPT DESCRIPTION                               S <- dropped
MEMO_CD                 MEMO CODE                                         S <- dropped
MEMO_TEXT               MEMO TEXT                                         S <- dropped
FORM_TP                 FORM TYPE                                         S <- dropped
FILE_NUM                FILE NUMBER                                       N <- dropped
TRAN_ID                 TRANSACTION ID                                    S <- dropped

Data Type:  S = string (alpha or alpha-numeric); D = date; N = numeric  


One thing that becomes very obvious here, and that strongly influenced my exploration, is that there is only one column holding a numerical quantity worth analyzing: contb_receipt_amt. (file_num is numeric too, but it is just an identifier.)

This is the column that holds the amounts of the contributions in $, and it is therefore also one of the most interesting columns. A lot of comparisons were anchored to it.

However, there is another “hidden” numerical value in the dataset: the number of rows, i.e. the number of contributions. Counting rows per category of the other columns allows many more graphs to be created.

Ideas and Interests that I will explore

  • Who got the most contributions? Which person, which party, which gender?

  • Which city contributed the most, and to whom (also in relation to the number of citizens)?

  • How many “NOT EMPLOYED” people contributed to the campaigns as compared to people with employment? Also: whom did the “NOT EMPLOYED” support the most?

Pt. 1: Mini-Wrangling

In order to be able to load the data, I had to add a comma at the end of the header row in my .csv file.
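As a rough sketch of why that extra comma matters: I'm assuming here that every data row in the file ends with a trailing comma, so it has one more field than the header. The two rows below are made up, and the header is patched in memory rather than by editing the file.

```r
# Made-up rows shaped like the assumed file layout: each data row ends
# with a trailing comma, so it has one more field than the header row.
raw_lines <- c(
  "cand_nm,contb_receipt_amt",
  "\"Doe, Jane\",25,",
  "\"Roe, Richard\",100,"
)

# Header left as-is: R finds one fewer header field than data fields,
# uses the first column as row names, and shifts the rest to the left.
shifted <- read.csv(text = paste(raw_lines, collapse = "\n"), quote = "\"")

# Header with a trailing comma added: the names line up again and the
# extra empty column can be dropped by position.
raw_lines[1] <- paste0(raw_lines[1], ",")
fixed <- read.csv(text = paste(raw_lines, collapse = "\n"),
                  quote = "\"", stringsAsFactors = FALSE)
fixed <- fixed[-ncol(fixed)]

print(shifted)
print(fixed)
```

In the unfixed version the amounts end up under cand_nm and contb_receipt_amt is all NA, which is easy to miss on a large file.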

Adding new columns: cand_gender and cand_party

For some of my questions, I want to look at the data by the gender and the party affiliation of the different candidates. Since these columns are not present in the dataset, I gathered the required information online and added it as two new columns to my data.frame df.

First I’ll map candidate names to candidate IDs, so that I can refer to a cand_id when calling a function, which is easier than writing out the full name.

Now I can add the new cand_party column to the data.frame. The one member of the Green Party gets the value ‘G’, the Republicans get ‘R’ and the Democrats ‘D’.
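One way to add such columns is a named-vector lookup. This is only a sketch: it keys on candidate names instead of cand_id, the amounts are invented, the names are written the short way this post writes them, and the ‘F’/‘M’ gender labels are my assumption. The Green Party entry is left out.

```r
# Toy stand-in for df; the amounts are invented for illustration.
df <- data.frame(
  cand_nm = c("Clinton, Hillary", "Sanders, Bernard", "Cruz, Ted",
              "Carson, Benjamin", "Sanders, Bernard"),
  contb_receipt_amt = c(100, 27, 250, 50, 15),
  stringsAsFactors = FALSE
)

# Named vectors act as lookup tables keyed by candidate name.
party_lookup <- c("Clinton, Hillary" = "D", "Sanders, Bernard" = "D",
                  "Cruz, Ted" = "R", "Carson, Benjamin" = "R")
gender_lookup <- c("Clinton, Hillary" = "F", "Sanders, Bernard" = "M",
                   "Cruz, Ted" = "M", "Carson, Benjamin" = "M")

# Indexing a lookup by the name column yields one value per row.
df$cand_party  <- unname(party_lookup[df$cand_nm])
df$cand_gender <- unname(gender_lookup[df$cand_nm])

# A candidate missing from a lookup would show up as NA here.
stopifnot(!anyNA(df$cand_party), !anyNA(df$cand_gender))
print(df)
```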

Removing obsolete columns

There is quite a lot of data that I do not intend to use for my exploration. This includes e.g. contbr_st, which in my case is always CA, and file_num, a unique number used to link the transactions to the reports.
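Dropping columns by name could look roughly like this. The drop list follows the columns marked "dropped" in the table above, written in lower case on the assumption that the loaded column names are lower case; the single row of data is a made-up placeholder.

```r
# One made-up row with a few of the columns from the table above.
df <- data.frame(
  cmte_id = "C0", cand_nm = "Clinton, Hillary", contbr_city = "OAKLAND",
  contbr_st = "CA", contb_receipt_amt = 100, file_num = 1, tran_id = "T1",
  stringsAsFactors = FALSE
)

# Columns marked as dropped in the table (lower-case names assumed).
drop_cols <- c("cmte_id", "contbr_st", "receipt_desc", "memo_cd",
               "memo_text", "form_tp", "file_num", "tran_id")

# Keep every column whose name is not in the drop list; names in the
# list that are absent from df are simply ignored.
df <- df[!(names(df) %in% drop_cols)]
print(names(df))
```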

Pt. 2: Exploration

Having cleaned and adapted the dataset to my wishes, I can start my exploration.

For example, I can query how many individual contributions each candidate received, e.g. Hillary Clinton:

## [1] 42063
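A count like this can be obtained by counting matching rows, since each row is one contribution. The sketch below uses placeholder candidate names and a hypothetical helper, count_contributions, that matches on cand_nm rather than cand_id.

```r
# Toy stand-in for df with placeholder candidate names.
df <- data.frame(
  cand_nm = c("Candidate A", "Candidate B", "Candidate A", "Candidate C"),
  stringsAsFactors = FALSE
)

# count_contributions is a hypothetical helper: one row is one
# contribution, so counting matching rows counts contributions.
count_contributions <- function(data, name) {
  sum(data$cand_nm == name)
}

n_a <- count_contributions(df, "Candidate A")
print(n_a)

# table() gives the same count for every candidate at once.
per_candidate <- table(df$cand_nm)
print(per_candidate)
```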

QUESTION 1: Who got the most money through contributions?

In this section I’ll take a look at money and amounts. I’m wondering which candidate got the most contributions, and whether it is the same person who got the most money through contributions. How are the contribution sizes distributed for each candidate? Which gender received more contributions, or more money? Which political party?
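The difference between "most contributions" and "most money" comes down to counting rows versus summing amounts per group. Here is a small sketch with placeholder candidates, placeholder party labels, and invented amounts, chosen so that the two answers differ.

```r
# Invented data: Candidate A gets many small contributions,
# Candidate B gets fewer but larger ones.
df <- data.frame(
  cand_nm = c("Candidate A", "Candidate A", "Candidate A", "Candidate A",
              "Candidate B", "Candidate B", "Candidate C"),
  cand_party = c("D", "D", "D", "D", "D", "D", "R"),
  contb_receipt_amt = c(27, 15, 27, 10, 100, 500, 250),
  stringsAsFactors = FALSE
)

# Number of contributions and total money per candidate.
n_contrib <- tapply(df$contb_receipt_amt, df$cand_nm, length)
total_amt <- tapply(df$contb_receipt_amt, df$cand_nm, sum)

most_contributions <- names(which.max(n_contrib))
most_money <- names(which.max(total_amt))
print(most_contributions)
print(most_money)

# The same sum, grouped by party instead of by candidate.
by_party <- aggregate(contb_receipt_amt ~ cand_party, data = df, FUN = sum)
print(by_party)
```

Swapping cand_party for cand_gender in the formula would give the split by gender.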

PART 1: the mean and distributions

Constructing a frequency plot shows me something about Sanders, Bernard, who has a big dot low down. Some candidates have a narrow distribution and some a wider one. There is another person with a big dot at the bottom, yet also a fairly big one higher up (Clinton, Hillary), and one person whose contributions spread widely up and down, but who also seems to have a strong base (Cruz, Ted). Carson, Benjamin also shows a large number of contributions in the lower amounts.
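The numbers behind such a plot can be sketched without any plotting library. I'm assuming here that a dot's size encodes how many contributions share the same amount for a candidate. The data is invented and uses placeholder names.

```r
# Invented data: Candidate A has many identical small contributions,
# Candidate B has a wide spread with one large contribution.
df <- data.frame(
  cand_nm = c("Candidate A", "Candidate A", "Candidate A", "Candidate A",
              "Candidate B", "Candidate B", "Candidate B"),
  contb_receipt_amt = c(27, 27, 27, 100, 50, 2700, 50),
  stringsAsFactors = FALSE
)

# Mean and median per candidate; a mean far above the median hints at
# a few large contributions stretching the distribution upward.
means <- tapply(df$contb_receipt_amt, df$cand_nm, mean)
medians <- tapply(df$contb_receipt_amt, df$cand_nm, median)
print(means)
print(medians)

# Count how often each (candidate, amount) pair occurs; these counts
# are what the dot sizes are assumed to represent.
pair_counts <- as.data.frame(
  table(cand_nm = df$cand_nm, amount = df$contb_receipt_amt),
  stringsAsFactors = FALSE
)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
print(pair_counts)
```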