I’m currently spending my time in the USA, and it’s (pre-)election time! Having a membership at a 24-hour gym exposes me to a massive amount of politics on TV, whether I’m doing my time on the cross-trainer or just changing in the dressing room. It’s February, the elections are in November, but there is a near-constant broadcast of candidates battling each other on CNN.
I’m impressed and surprised by how omnipresent the elections already are on TV, this early on. And I’m concerned about the rudeness that these debates often display. Politics in Europe are different. They’re not golden, nope, and there are insults and dirty tricks there too. But here in the US there’s a level of “attack”, “aggression” and “god” in the talks that I am thoroughly not used to. :/
One time in the dressing room, I got stuck watching a TV segment about dollars per vote, and found the comparison very interesting. Some candidates spend very little money in absolute terms compared to others, and yet end up spending more money per vote received.
Now, my analysis will not deal with this comparison, since the ballots will not be cast for a few more months, but I will make a preliminary analysis of the money that the presidential candidates received from individual contributors in the run-up to the election.
The data for the 2016 Presidential Campaign Contributions can be found here: http://fec.gov/disclosurep/PDownload.do Since I am currently in California, I chose and downloaded the dataset for CA.zip.
Here’s also a peek at the structure of the data that the .csv contains. It’s an excerpt from this file: ftp://ftp.fec.gov/FEC/Presidential_Map/2016/DATA_DICTIONARIES/CONTRIBUTOR_FORMAT.txt I have also indicated which of the columns I ended up dropping.
The text file is comma delimited and uses double-quotation marks as the text qualifier.
------------------------------------------------------------------------
CMTE_ID            COMMITTEE ID                             S  <- dropped
CAND_ID            CANDIDATE ID                             S
CAND_NM            CANDIDATE NAME                           S
CONTBR_NM          CONTRIBUTOR NAME                         S
CONTBR_CITY        CONTRIBUTOR CITY                         S
CONTBR_ST          CONTRIBUTOR STATE                        S  <- dropped
CONTBR_ZIP         CONTRIBUTOR ZIP CODE                     S
CONTBR_EMPLOYER    CONTRIBUTOR EMPLOYER                     S
CONTBR_OCCUPATION  CONTRIBUTOR OCCUPATION                   S
CONTB_RECEIPT_AMT  CONTRIBUTION RECEIPT AMOUNT              N
CONTB_RECEIPT_DT   CONTRIBUTION RECEIPT DATE                D
RECEIPT_DESC       RECEIPT DESCRIPTION                      S  <- dropped
MEMO_CD            MEMO CODE                                S  <- dropped
MEMO_TEXT          MEMO TEXT                                S  <- dropped
FORM_TP            FORM TYPE                                S  <- dropped
FILE_NUM           FILE NUMBER                              N  <- dropped
TRAN_ID            TRANSACTION ID                           S  <- dropped
ELECTION_TP        ELECTION TYPE/PRIMARY GENERAL INDICATOR  S  <- dropped
Data Type: S = string (alpha or alpha-numeric); D = date; N = numeric
------------------------------------------------------------------------
One thing that becomes very obvious here, and that strongly influenced my exploration, is that there is only one column containing a numerical variable: contb_receipt_amt.
This is the column that holds the amounts of the contributions in $, and it is therefore also one of the most interesting columns. A lot of comparisons were anchored to it.
However, there is also another “hidden” numerical value associated with the dataset: the number of rows. Counting rows in combination with the categorical values of the other columns allows many graphs to be created.
Who got the most contributions? Which person, which party, which gender?
Which city contributed the most, and to whom (also in relation to its number of citizens)?
How many “NOT EMPLOYED” people contributed to the campaigns compared to people with employment? And whom did the “NOT EMPLOYED” support the most?
In order to be able to load the data, I had to add a comma at the end of the header row in my .csv file.
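A minimal sketch of that workaround in base R, assuming (as the fix implies) that the data rows end in a trailing comma while the header row does not. The three sample lines are made up for illustration; for the real file, the lines would come from something like `readLines()` on the downloaded .csv.

```r
# Made-up stand-in for the FEC file: data rows end in a trailing comma,
# the header row does not. IDs and amounts are placeholders.
raw_lines <- c(
  'cand_id,cand_nm,contb_receipt_amt',
  '"ID_A","Clinton, Hillary",50,',
  '"ID_B","Sanders, Bernard",27,'
)

# Append a comma to the header so it has as many fields as the data rows.
raw_lines[1] <- paste0(raw_lines[1], ",")

# Comma-delimited, double quotes as text qualifier (the read.csv defaults).
df <- read.csv(text = raw_lines, header = TRUE, stringsAsFactors = FALSE)

# The trailing comma yields an empty last column; drop it by position.
df <- df[, -ncol(df), drop = FALSE]

str(df)
```

Editing the header in code rather than by hand keeps the downloaded file untouched, which makes the step easy to repeat for another state's file.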
For some of my questions, I am interested in looking at the data by the gender and the party affiliation of the candidates. Since these columns are not present in the dataset, I gathered the required information online and added it as two new columns to my data.frame df.
First I’ll map candidate names to candidate IDs, so that I can refer to a cand_id when calling a function, which is easier than writing out the full name.
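One way such a mapping could look is a named vector built from the unique name/ID pairs. This is only a sketch: the small data.frame and its IDs are placeholders (not real FEC candidate IDs), and `name_to_id` is a name chosen here for illustration.

```r
# Made-up stand-in for df; the IDs are placeholders, not real FEC IDs.
df <- data.frame(
  cand_id = c("ID_A", "ID_B", "ID_A"),
  cand_nm = c("Clinton, Hillary", "Sanders, Bernard", "Clinton, Hillary"),
  stringsAsFactors = FALSE
)

# Keep one row per candidate, then build a named vector:
# the names are candidate names, the values are candidate IDs.
candidates <- unique(df[, c("cand_id", "cand_nm")])
name_to_id <- setNames(candidates$cand_id, candidates$cand_nm)

# Look up the ID for one candidate by name.
name_to_id[["Sanders, Bernard"]]
```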
Now I can add this new column to the data.frame. There is one member of the Green Party, who gets the value ‘G’; the Republicans get the value ‘R’ and the Democrats the value ‘D’.
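A sketch of how the party and gender columns could be added with lookup vectors keyed by candidate name. The sample rows and the exact spelling of the names are assumptions for illustration; the real dataset contains more candidates and may spell the names differently.

```r
# Made-up sample of candidate names, one row per contribution.
df <- data.frame(
  cand_nm = c("Clinton, Hillary", "Sanders, Bernard", "Cruz, Ted",
              "Stein, Jill", "Clinton, Hillary"),
  stringsAsFactors = FALSE
)

# Lookup vectors typed in by hand: 'D' = Democrat, 'R' = Republican, 'G' = Green.
party_by_name <- c("Clinton, Hillary" = "D", "Sanders, Bernard" = "D",
                   "Cruz, Ted" = "R", "Stein, Jill" = "G")
gender_by_name <- c("Clinton, Hillary" = "F", "Sanders, Bernard" = "M",
                    "Cruz, Ted" = "M", "Stein, Jill" = "F")

# Indexing a named vector by the name column gives one value per row.
df$party  <- unname(party_by_name[df$cand_nm])
df$gender <- unname(gender_by_name[df$cand_nm])

df
```

A candidate missing from a lookup vector would get NA in the new column, which makes gaps in the hand-typed information easy to spot.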
There is quite a large amount of data that I do not intend to use for my exploration. This includes e.g. contbr_st, which in my case is always CA, or file_num, a unique number used to link the transactions to the reports (which I’ll also drop).
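Dropping the unused columns could be sketched as below. The list of names follows the columns marked as dropped in the data dictionary, written in lowercase on the assumption that the column names in df are lowercase; the one-row data.frame is a made-up stand-in.

```r
# Made-up one-row stand-in for df with a few of the original columns.
df <- data.frame(
  cand_id = "ID_A",
  contbr_st = "CA",
  file_num = 1,
  contb_receipt_amt = 50,
  stringsAsFactors = FALSE
)

# Columns marked "<- dropped" in the data dictionary.
dropped <- c("cmte_id", "contbr_st", "receipt_desc", "memo_cd", "memo_text",
             "form_tp", "file_num", "tran_id", "election_tp")

# Keep every column whose name is not in the drop list.
df <- df[, !(names(df) %in% dropped), drop = FALSE]

names(df)
```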
After having cleaned and adapted the dataset to my wishes, I can start my exploration.
For example, I can query how many individual contributions each candidate received, e.g. Hillary Clinton:
## [1] 42063
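A count like the one above could be obtained by counting the rows that belong to one candidate. In this sketch, `count_contributions` is a hypothetical helper and the data is a tiny made-up sample, so it does not reproduce the number from the real dataset.

```r
# Made-up sample: one row per individual contribution.
df <- data.frame(
  cand_nm = c("Clinton, Hillary", "Sanders, Bernard", "Clinton, Hillary"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of rows (contributions) for one candidate.
count_contributions <- function(data, name) {
  sum(data$cand_nm == name)
}

count_contributions(df, "Clinton, Hillary")

# The same idea for all candidates at once.
table(df$cand_nm)
```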
In this section I’ll take a look at money and amounts. I’m wondering which candidate got the most contributions, and whether it is the same person who got the most money through contributions. How is the size of the contributions distributed across the candidates? Which gender received more contributions, or more money? Which political party?
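The difference between "most contributions" and "most money" can be sketched by computing both per candidate. The candidates and amounts below are made up purely to show that the two answers need not agree.

```r
# Made-up sample: Candidate A gets many small contributions,
# Candidate B gets a single large one.
df <- data.frame(
  cand_nm = c("Candidate A", "Candidate A", "Candidate A", "Candidate B"),
  contb_receipt_amt = c(10, 20, 30, 500),
  stringsAsFactors = FALSE
)

# Split the amounts by candidate, then count and sum each group.
amounts_by_cand <- split(df$contb_receipt_amt, df$cand_nm)
n_contrib <- sapply(amounts_by_cand, length)
total_amt <- sapply(amounts_by_cand, sum)

# Who received the most contributions, and who received the most money?
names(which.max(n_contrib))
names(which.max(total_amt))
```

Swapping `df$cand_nm` for a party or gender column in the `split()` call would answer the same two questions per party or per gender.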
Constructing a frequency plot shows me something about Sanders, Bernard, who has a big dot low down. Some candidates have a narrow distribution and some a wider one. Another person has a big dot at the bottom, yet also a fairly big one higher up (Clinton, Hillary), and one person has a distribution that stretches far up and down but also seems to have a strong base (Cruz, Ted). Carson, Benjamin also shows a large number of contributions in the lower amounts.