(Posch, Nath, & Ziegler 2019) The Limits of Interest

This is the syntax which was used to process the raw data file from the Commodity Futures Trading Commission (CFTC) through to the final results published in:

Posch, Konrad, Nath, Thomas and Ziegler, J. Nicholas.  “The Limits of Interest: Capture, Financialization, or Contestation in the Politics of Rule-Making for Derivatives”

Narrative online Appendices: (Posch, Nath, & Ziegler 2019-11-24) The Limits of Interest – Watson Working Paper – Online Appendices

The syntax appears below:

While the dataset was produced from internal CFTC servers (because the CFTC did not participate in Regulations.gov in 2014), the agency explicitly provided only publicly available data. Thus, the full dataset of 37,703 comments will be available from Dryad (https://doi.org/10.6078/D1610G) upon publication, along with the codebook provided by the CFTC, amended with our additional classification variables as used below and described in the manuscript and online supporting information C.

It should be noted that the CFTC does have additional meta-data which they declined to provide (i.e. IP addresses for commenters and other personally identifiable information) which they noted could be obtained through a Freedom of Information Act (FOIA) request. We did not pursue this avenue, but future researchers interested in, for example, the geographic distribution of commenters could request such data.

The Latent Dirichlet Allocation (LDA) used here is executed with Mallet, a popular topic modelling software application. Mallet is written in Java (which means it’s fast) and R has a Mallet wrapper package (which means it’s more user-friendly).

However, Mallet is a very resource-heavy app (because Java loves to use resources), so optimizing the number of working threads is key to running the code in a reasonable amount of time. Details are provided in the Setup Environment syntax, but the gist is that you want to experiment with the number of threads based on your configuration to get the fastest runtime possible. The 12,000-iteration topic model used to generate results for publication ran in about 36 minutes on a Ryzen 3600 CPU (6 cores, 12 logical processors, 4.1 GHz, 12 threads). The old Phenom II 720 processor (3 cores, 3 logical processors, 3.5 GHz, 15 threads) took about 3.5 hours. Configuration and processor power matter a great deal to how this code runs (although anything can run it eventually).

NOTE: The number of threads should NOT affect the stability of results, but apparently it does. There is an element of randomness to topic modelling, so while topics created from a corpus run with different random seeds will be substantively similar, they may appear in different orders and with slightly different topic proportions. To create reproducible results, this syntax sets the random seed (see section 1.2.1). However, the number of threads also appears to affect the output, so if you would like to reproduce the results in our manuscript exactly, you need to leave the “number of logical processors” value (numLogicalProcessors) in the Setup Environment section set to 15.

Setup Environment

## Mallet is a massive pig when it comes to processing (roughly 11 seconds/10 iterations).  Enter the number of logical processors below to enable multi-threading
numLogicalProcessors = 15

# Set the number of iterations. 1000 for testing, 12000 for analysis
iterations <- 12000

## desktop wd
wd <- "D:/Dropbox/The Limits of Interest"

## Laptop wd
#wd <- "C:/Dropbox/The Limits of Interest"


## install REQUIRED non-default packages if not yet installed

##                                                                                   ##
##  NOTE: For mallet to work, you must install the Java Development Kit (JDK)        ##
##       not just the runtime environment (JRE).  This is free on the java website,  ##
##       but must be installed on your system before you can call the mallet         ##
##       library below.                                                              ##
##                                                                                   ##

options(java.parameters = "-Xmx8000m") # this is the magic solution to the out-of-memory error: Mallet limits RAM usage to 1 GB, which is much too small for our dataset. Set this BEFORE loading rJava so the Java virtual machine picks it up.

# load REQUIRED libraries
library(rJava) # the interface between R and Java, needed for mallet
library(mallet) # a wrapper around the Java machine learning tool MALLET
library(SnowballC) # for stemming
library(tm) # Framework for text mining
#library(RTextTools) # a machine learning package for text classification written in R
library(dplyr) # Data preparation and pipes %>%
library(ggplot2) # for plotting word frequencies
library(reshape2) #visualization
library(xtable) # pretty tables
library(knitr) # final output and file handling
library(viridis) #Inclusive design palettes for stacked bar charts because inclusive design is good design

0. Prepare Corpus

This module reads in the raw data and prepares it for topic modeling.

0.0 Combine the CFTC raw data file with the organization codings in this file from (2017-05-11) syntax file

The CFTC provided all comments submitted during the public comment periods on all rules they have written to implement the Dodd-Frank Act.

It is INCREDIBLY important that this combination happen in R and not Excel, as Excel has a 32,765 (32,767-2 text qualifier “) character limit for its cells and will truncate any data placed in them. A number of comment letters are longer than this limit (e.g. 8895, 26166, 26171), so all matching and processing MUST happen in R.
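A quick way to see which letters would be affected is to count characters in R before any spreadsheet touches the data. The sketch below uses a small made-up data frame; the column names mirror those used in this syntax, but the rows and the helper name `findOverlongComments` are illustrative assumptions.

```r
# Excel's effective per-cell limit as described above (32,767 minus 2 text qualifiers)
excelCellLimit <- 32765

# Hypothetical helper: return the IDs of comments too long for one Excel cell
findOverlongComments <- function(commentData, limit = excelCellLimit) {
  tooLong <- nchar(commentData$CommentText, type = "chars") > limit
  commentData$ControlNumberID[tooLong]
}

# Toy data: one short comment and one 40,000-character comment
toyComments <- data.frame(
  ControlNumberID = c(1, 2),
  CommentText = c("short letter", strrep("x", 40000)),
  stringsAsFactors = FALSE
)

overlongIDs <- findOverlongComments(toyComments)
print(overlongIDs)
```

Running the same check on the real CommentText and ExtractedText columns would list the letters that cannot safely pass through Excel.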

Note: the original data file had to be changed to UTF-8 from unknown (possibly Unicode) format using Notepad.exe because otherwise nothing showed up in R.

# load data
cftc.Comments <- read.delim("D:/Dropbox/The Limits of Interest/DoddFrankCommentsAll(UTF-8).txt",
                            stringsAsFactors = FALSE,
                            encoding ="UTF-8",
                            strip.white = TRUE,
                            quote = "" # this is an unbelievably vital inclusion; otherwise you lose cases!
                            ) ##These are the comments on proposed regulations (NPRMs)
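
To illustrate why `quote = ""` matters, the following sketch writes a tiny tab-delimited file containing stray quotation marks and reads it back. The file contents and object names are made up for illustration; only `read.delim()` and its `quote` argument come from the code above.

```r
# Toy tab-delimited file in which comment text contains unbalanced quote marks
toyFile <- tempfile(fileext = ".txt")
writeLines(c(
  "ControlNumberID\tCommentText",
  "1\tHe said \"stop the speculation",
  "2\tA plain comment",
  "3\tAnother \"quoted\" remark"
), toyFile)

# With quoting disabled, every line is one case and quote marks stay literal
noQuoting <- read.delim(toyFile, stringsAsFactors = FALSE, quote = "")
print(nrow(noQuoting))

# With the default quote character, an unbalanced mark can swallow later lines.
# The outcome is captured defensively because the call may warn or fail.
defaultRows <- tryCatch(
  nrow(read.delim(toyFile, stringsAsFactors = FALSE)),
  warning = function(w) NA_integer_,
  error = function(e) NA_integer_
)
print(defaultRows)
```

Comparing the two row counts on the real file is a cheap sanity check that no cases were merged on import.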

####Check to see if we have unexpected blank data on import, esp. in CommentText
#load in the classified Organization values and meta-data
cftc.Comments.metaData<-read.csv("D:/Dropbox/The Limits of Interest/(2019-06-12) Final Classifications for R_noText.csv", stringsAsFactors = FALSE)


#check the classifications for Final.Classification
## Write this prepared corpus to a data file
write.table(cftc.Comments.withOrgAndMeta,file="DoddFrankCommentsWithOrgValueAndMetaData.txt",sep="|",row.names = FALSE,quote=FALSE)
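
The merge that produces `cftc.Comments.withOrgAndMeta` is not shown above. As a minimal sketch of that kind of step, the example below joins two toy data frames on a shared ID and round-trips the result through a `|`-delimited file. Using `ControlNumberID` as the join key is an assumption, and all rows and object names are invented for illustration.

```r
# Toy stand-ins for the raw comments and the classification meta-data
toyComments <- data.frame(
  ControlNumberID = c(101, 102, 103),
  CommentText = c("first letter", "second letter", "third letter"),
  stringsAsFactors = FALSE
)
toyMetaData <- data.frame(
  ControlNumberID = c(101, 102, 103),
  Final.Classification = c("Government", "Un-Coded", "Non-Financial Firm"),
  stringsAsFactors = FALSE
)

# Left join on the shared ID (the join key is an assumption)
toyMerged <- merge(toyComments, toyMetaData, by = "ControlNumberID", all.x = TRUE)

# Write a |-delimited file without quoting, then read it back the same way
toyOut <- tempfile(fileext = ".txt")
write.table(toyMerged, file = toyOut, sep = "|", row.names = FALSE, quote = FALSE)
toyRoundTrip <- read.delim(toyOut, sep = "|", quote = "", stringsAsFactors = FALSE)
print(toyRoundTrip)
```

Note that an unquoted `|`-delimited file only round-trips cleanly if no text field itself contains a `|` character.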

0.1 Load the Data

First let’s read in our data. The corpus we’ll be using today is a database of CFTC comments related to Dodd-Frank Financial Reform Implementation which was created as described in step 0.0 above.

#read in text, | delimited file
documents <- read.delim("DoddFrankCommentsWithOrgValueAndMetaData.txt",
                            sep = "|", # match the | delimiter used by write.table in step 0.0
                            stringsAsFactors = FALSE,
                            # This is an unbelievably vital inclusion; otherwise you lose cases!
                            # quote = "" turns off quote handling, which is important for the
                            # full text because otherwise cases get wrapped into previous cases.
                            # It causes all the data read in to keep the quotes wrapped around it
                            # and blank values to be stored as ""
                            quote = ""
                            ) ##These are the comments on proposed rulemaking

# examine the variable names in our dataset
# Let's see where we have some missing data in text variables (such as ExtractedText)

0.1.1 Drop Ex Parte Meetings and Comments with “None” type values in Organization

The CFTC included 286 ex parte meetings in their dataset. Since the data generating process for those comments is not the same as that for the other comments, we need to drop them from the analysis. These cases can be identified based on the value "Ex Parte" in the FirstName variable.

There are also 20 cases with uninterpretable organization values. These must be dropped as well.

#Check the number of ex parte meetings
sum(1*(documents$FirstName=="Ex Parte")) #should be 286
## [1] 286
# Subset the data to include only those cases which are not ex parte meetings
documents <- subset(documents, FirstName!="Ex Parte")

# verify that the correct number of comments remain
nrow(documents) #should be 8282
## [1] 8282
# create the list of junk org values
junkOrgValues <- c("(NONE)",".","/","[none]","-none-","None","None ","None.","none","none ","none.")

##  [1] "(NONE)" "."      "/"      "[none]" "-none-" "None"   "None " 
##  [8] "None."  "none"   "none "  "none."
#check number of comments with junk org values
sum(1*(documents$Organization %in% junkOrgValues)) #should be 20
## [1] 20
# Subset the data to include only those cases which do not have junk organization values
documents <- subset(documents, !(Organization %in% junkOrgValues))

# verify that the correct number of comments remain
nrow(documents) #should be 8262
## [1] 8262

0.1.2 Get Unique Organization Values from the Corpus for Data Coding Purposes

In order to code comments to our typology, we need a list of all unique values in the Organization variable.
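
As a minimal sketch of how such a frequency list can be built, the example below tabulates a made-up vector of organization names. The object names `toyOrgs` and `toyOrgCounts` are illustrative assumptions; the real code operates on documents$Organization.

```r
# Toy organization values, including near-duplicates that differ by punctuation
toyOrgs <- c("None", "ACME Corp", "ACME Corp.", "ACME Corp", "Jane Doe")

# Tabulate unique values; naming the table dimension names the resulting column
toyOrgCounts <- as.data.frame(table(Organization = toyOrgs), stringsAsFactors = FALSE)

# Most frequent values first
toyOrgCounts <- toyOrgCounts[order(toyOrgCounts$Freq, decreasing = TRUE), ]
print(head(toyOrgCounts, 30))

# Alphabetical order, convenient for manual coding
toyOrgCountsAlpha <- toyOrgCounts[order(toyOrgCounts$Organization), ]
print(toyOrgCountsAlpha)
```

Note how "ACME Corp" and "ACME Corp." are counted separately, which is the reason manual conceptual coding is needed.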

## Sort in descending order of frequency to see top values
validOrgValues <- validOrgValues[order(validOrgValues$Freq, decreasing = TRUE),]

## examine the top 30 most common raw values, notice seemingly identical values which are actually slightly different necessitating manual conceptual coding by analysts
head(validOrgValues, 30)
## For easy coding, sort validOrgValues alphabetically by Organization column

0.2 Create a unified text column to solve the gaps in CommentText and ExtractedText

As seen above in section 0.1, there are 5 blank values in ExtractedText. We need to create a unified text column called UnifiedText which includes a value for all comments by filling the 5 blanks with the values from CommentText.

## View the 5 comments which have missing ExtractedText values and confirm that they have CommentText Values
#   Truncate the printout so we don't get a mess

stringr::str_trunc(documents[documents$ExtractedText=="",]$CommentText,300) #str_trunc comes from stringr, which is not loaded above, hence the stringr:: prefix. Only 5 results expected; confirm that all [index] numbers are followed by non-NULL values that look like comment letter text
## [1] "Dear Sirs:®®There has been way too much interference with the silver market by certain big players, artificially keeping prices totally out of realistic market ranges.  The CFTC and the Comex were never designed to facilitate market manipulation, but with the existing rules, that has become the c..."
## [2] "Futures are used in the real world to control costs, protect profits, reduce liability.   Real people, producing real commodities, producing real good that people use and consume.®Of they are also used as a trading vehicle, of which I consider some trading to be useful as to a price discovery.®®B..."
## [3] "You must not allow more than 1,500 contracts as the appropriate concentration limit for the COMEX Silver market.  Our liberty is directly related to this issue."                                                                                                                                            
## [4] "I urge you to please curb excessive gambling in commodities markets like food and oil. ®®While many factors contribute to today’s highly volatile commodity prices, it is clear that excessive speculation is significantly responsible, as shown in dozens of studies by several respected institutions ..."
## [5] "Title: End-User Exception to Mandatory Clearing of Swaps®FR Document Number: 2010-31578®Legacy Document ID: ®RIN: 3038-AD10®Publish Date: 12/23/2010 12:00:00 AM®®Submitter Info:®First Name:  Fred®Last Name:  Nadelman, LMSW®Mailing Address:  1825 East Gwinnett Street®City:  Savannah®Country:  Unit..."
## Create a new combined text variable called "UnifiedText" which will be used for analysis, and populate it with ExtractedText for all 8257 comments (total comments minus 5) which have a value for ExtractedText.
documents$UnifiedText <- documents$ExtractedText

## Fill the 5 missing values in ExtractedText using CommentText and save to new variable UnifiedText
documents[documents$ExtractedText=="",]$UnifiedText <- documents[documents$ExtractedText=="",]$CommentText

## Confirm that UnifiedText has no empty values
sum(1*(documents$UnifiedText=="")) #should be 0
## [1] 0
## Confirm that the 5 comments which had missing ExtractedText values have CommentText values in UnifiedText
documents[documents$ExtractedText=="",]$UnifiedText == documents[documents$ExtractedText=="",]$CommentText #should be 5 TRUE values

0.3 Make the 18-part typology a factor

Now, for representativeness, we need to convert the 18-part typology (17 substantive categories + 1 “Un-Coded”) into a factor, because the typology was constructed to span a spectrum running roughly from industry insider (top) to general public (bottom), with the residual category “Un-Coded” at the end.

## confirm that Final.Classification has no erroneous spellings (should have 18 unique values)
unique(documents$Final.Classification)
##  [1] "Un-Coded"                                      
##  [2] "Private Asset Manager"                         
##  [3] "Market Infrastructure Firm"                    
##  [4] "Law Firms, Consultants, and Related Advisors"  
##  [5] "Unaffiliated Individual"                       
##  [6] "Non-Financial Firm"                            
##  [7] "Academic or Other Expert"                      
##  [8] "Non-Financial Private-Sector Association"      
##  [9] "Government"                                    
## [10] "Financial Sector Association"                  
## [11] "Consumer Advocacy or other Citizens Group"     
## [12] "Core Financial Service Trade Association"      
## [13] "U.S. Chamber of Commerce or Affiliate"         
## [14] "Major Wall Street Sell-Side Bank"              
## [15] "Public Asset Manager"                          
## [16] "Other Sell-Side Bank"                          
## [17] "Trade Union or other Formal Labor Organization"
## [18] "Market Advocacy or other Anti-Regulation Group"
## c() creates a vector of the typology levels in substantive order
documents$Final.Classification <- factor(documents$Final.Classification, levels = c(
  "Major Wall Street Sell-Side Bank",
  "Other Sell-Side Bank",
  "Core Financial Service Trade Association",
  "Financial Sector Association",
  "Public Asset Manager",
  "Private Asset Manager",
  "U.S. Chamber of Commerce or Affiliate",
  "Market Infrastructure Firm",
  "Law Firms, Consultants, and Related Advisors",
  "Non-Financial Private-Sector Association",
  "Non-Financial Firm",
  "Government",
  "Academic or Other Expert",
  "Consumer Advocacy or other Citizens Group",
  "Trade Union or other Formal Labor Organization",
  "Market Advocacy or other Anti-Regulation Group",
  "Unaffiliated Individual",
  "Un-Coded"))

# now, verify that we indeed have an ORDERED 18-level typology as listed above (look at the [indices] and confirm they go in the order listed in the creation statement above):
levels(documents$Final.Classification)

##  [1] "Major Wall Street Sell-Side Bank"              
##  [2] "Other Sell-Side Bank"                          
##  [3] "Core Financial Service Trade Association"      
##  [4] "Financial Sector Association"                  
##  [5] "Public Asset Manager"                          
##  [6] "Private Asset Manager"                         
##  [7] "U.S. Chamber of Commerce or Affiliate"         
##  [8] "Market Infrastructure Firm"                    
##  [9] "Law Firms, Consultants, and Related Advisors"  
## [10] "Non-Financial Private-Sector Association"      
## [11] "Non-Financial Firm"                            
## [12] "Government"                                    
## [13] "Academic or Other Expert"                      
## [14] "Consumer Advocacy or other Citizens Group"     
## [15] "Trade Union or other Formal Labor Organization"
## [16] "Market Advocacy or other Anti-Regulation Group"
## [17] "Unaffiliated Individual"                       
## [18] "Un-Coded"

1. Estimate Mallet Topics

Now that we have a corpus of comments, we need to train a topic model

1.1 Clean the comment database

This preprocessing includes stripping whitespace, removing punctuation and stopwords, and stemming the documents. It then saves a backup of the database in this state to save processing time when carrying out exploratory analysis. For the same reason, each function which takes significant processing time is wrapped in system.time(), which reports the processing time in seconds.
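
As a rough illustration of the stopword step and the system.time() wrapper, the sketch below uses only base R on two made-up sentences. It is a stand-in rather than the tm and SnowballC calls used in this syntax: it does no stemming, and the helper `removeStopwords` and the short stopword list are assumptions for illustration.

```r
# Two toy comments and a short, made-up stopword list
toyTexts <- c("Please make any rule that limits the market", "The CFTC must act")
toyStopwords <- c("the", "any", "that", "must", "please", "make", "rule",
                  "market", "act", "cftc")

# Hypothetical helper: lowercase, drop whole-word stopwords, tidy whitespace
removeStopwords <- function(text, stopwordList) {
  pattern <- paste0("\\b(", paste(stopwordList, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", tolower(text), perl = TRUE)
  trimws(gsub("\\s+", " ", cleaned))
}

# Wrap the call in system.time() to report how long the step takes (in seconds)
timing <- system.time(toyPrepped <- removeStopwords(toyTexts, toyStopwords))
print(toyPrepped)
print(timing["elapsed"])
```

Because "any" is removed here before any stemming would happen, the stemmed form "ani" never reaches the corpus, which is the ordering the cleaning chunk relies on.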

1.1.1 Clean the database

# remove extraneous whitespace
documents$PreppedText <- stripWhitespace(documents$UnifiedText)

# remove punctuation
documents$PreppedText <- gsub(pattern="[[:punct:]]",replacement="",documents$PreppedText)

# convert all to lowercase
documents$PreppedText <- tolower(documents$PreppedText)

# Create Stopwords list based on standard english and context specific words listed below 
mystopwords <-c(stopwords("english"), "market", "trade", "propos", "rule", "please", "financi", "make", "act", "cftc")

# remove common words (stop words) before stemming to avoid the "ani" problem, where "any" is changed to "ani" by the stemmer and then not removed by mallet's stopword function.

1.1.2 Save a Clean Copy of the Database

Since data analysis often involves exploration and experimentation, this chunk creates a clean backup of the data to avoid the processing time which chunk 1.1.1 takes, especially on laptop processors.

#create a backup version of the database at this state to save processing time if we alter the corpus below in 1.2.1.
documentsBKUP_1.1 <- documents

1.2 Generate the mallet topic model

In this section, we generate the topic model. For this project, this will include all of the comments so that we get topics which represent the entire corpus and then we subset the results to see variation among commenter types.

1.2.1 Run the topic model

This chunk imports the data into Mallet, sets the number of topics, creates the topic trainer, sets the random seed (for reproducible results), and only THEN loads the instances into the topic model. Following that, we set the hyperparameters as well as the number of iterations, which are the primary parameters for LDA.

Note: Mallet is powerful and much faster than non-Java LDA, but it is still a beast on processing power. Added to that, a single thread does NOT max out even a single core of a modern processor. Thus, you must tweak the code to optimise for your computer. I have found that 5 threads per logical processor works best for me on an ancient Phenom II 720 (3-core processor circa 2009, OC to 3.5 GHz) but 1 thread per logical processor on my new computer (Ryzen 5 3600, 6 cores, 12 logical processors, 4.1 GHz stock clock). You’ll see my log of testing runs below in the code. I suggest that anyone adopting this technique do the same debugging BEFORE settling in for an interpretation-grade run (for us, 12,000 iterations), as the runtime can vary 50-100% based on improper thread number choices and the runtime is on the order of hours. Do you want to wait 2 hours or 4?
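
One way to organize that kind of testing is a small timing harness that runs a short job once per candidate thread count and records the elapsed time. The sketch below is only an illustration under stated assumptions: `timeThreadCounts` is a hypothetical helper, and `dummyRun` is a placeholder that sleeps instead of training a Mallet model. In practice you would substitute a function that performs a short (e.g. 500-iteration) training run with the given number of threads.

```r
# Hypothetical helper: time one run per candidate thread count
timeThreadCounts <- function(threadCounts, runModel) {
  elapsed <- vapply(threadCounts, function(n) {
    unname(system.time(runModel(n))["elapsed"])
  }, numeric(1))
  data.frame(threads = threadCounts, seconds = elapsed)
}

# Placeholder for a short training run (NOT a Mallet call)
dummyRun <- function(nThreads) Sys.sleep(0.01 / nThreads)

timingLog <- timeThreadCounts(c(1, 2, 4), dummyRun)
print(timingLog)

# Candidate with the shortest elapsed time in this test
bestThreads <- timingLog$threads[which.min(timingLog$seconds)]
print(bestThreads)
```

Keeping the resulting table alongside the code gives a record of which thread count was chosen for a given machine and why.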

But, be careful about what order you call setNumThreads in.

It must be AFTER the instances have been loaded or you will mess up the random seed and produce inconsistent results.

The order must be: (1) set seed (2) load instances (3) set threads (4) run model

#remove output from this chunk just in case to prevent any messy double-use when experimenting with other analyses
documents <- documentsBKUP_1.1

# load data into mallet
mallet.instances <- mallet.import(as.character(documents$ControlNumberID), as.character(documents$PreppedText), "stoplist.csv", FALSE, token.regexp="[\\p{L}']+")

# Decide what number of topics to model #
n.topics = 14

## Create a topic trainer object.
topic.model <- MalletLDA(n.topics)

##For reproducible results, we need to set the seed the same for the final run.   ##
## This MUST be set BEFORE LOADING THE INSTANCES, odd behavior if it is not.      ##


##                                                                                      ##
##  Load our documents.  This MUST happen after setRandomSeed and BEFORE setNumThreads  ##
##                                                                                      ##


#                                                                                   #
#   Set Topic Model to use Multiple Threads across multiple cores                   #
#                                                                                   #
#   Based on armchair optimization, I found that diminishing returns happen around  #
#   5x number of logical processors.  It would behoove you to test on your machine  #
#   with some 500 iteration runs to find the sweet spot to avoid spending an extra  #
#   couple hours on the 12k final run for analysis and interpretation.              #
#                                                                                   #


## Get the vocabulary, and some statistics about word frequencies.
##  These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

# examine some of the vocabulary, first sorting it by word frequency and then document frequency
word.freqs <- word.freqs[order(-word.freqs$term.freq, -word.freqs$doc.freq),]

2. Substantive Analyses of Groups of Comments (sub-Corpora)

Now that we have a topic model, the interesting substantive comparisons for understanding Dodd-Frank financial reform at the CFTC involve the different topics of discussion raised by different groups of commenters.

These are the analyses which we run after generating a new topic model

2.1 Initial Topic Model Descriptives

Generate some quick descriptive data about the topic.model

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities, 
##  so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
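
To make "normalized" and "smoothed" concrete, here is a minimal base R sketch on a made-up matrix of per-document topic counts. The smoothing constant 0.1 is an assumption chosen for illustration, not the value Mallet applies internally.

```r
# Toy counts: 2 documents (rows) by 3 topics (columns)
toyTopicCounts <- matrix(c(8, 2, 0,
                           0, 5, 5), nrow = 2, byrow = TRUE)

# Smoothing: add a small constant so no topic has exactly zero weight
smoothingConstant <- 0.1  # hypothetical value
toySmoothed <- toyTopicCounts + smoothingConstant

# Normalizing: divide each row by its total so each document sums to 1
toyDocTopics <- toySmoothed / rowSums(toySmoothed)
print(round(toyDocTopics, 3))
```

Each row of the result is a probability distribution over topics, which is the form the later proportion comparisons work with.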

## What are the top words in topic 7?
## Get a vector containing short names for the topics
topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
## Generate a quick list of the top N words for each topic
## Display that list
## Reorder topics to substantive order from arbitrary order. This is based on analysis and case knowledge and added to the code after careful human analysis of output of topic model.

newTopicOrder <- c(3,   8,  6,  11, 12, 5,  7,  13, 9,  4,  2,  10, 1,  14) #Note that the order is based on what the new roman numeral will be (the new substantive order) rather than the order of the topic rows, since we are assigning a value and then sorting by that value.  This will be different below in 2.2 when we rename and reorder columns for figures.  Note also that the list of names for the topics in 2.3.2 and 2.4.2 for the figures must also be updated when this is changed

topics.list.newOrder <- cbind(data.frame(topics.list),newTopicOrder)

topics.list.newOrder <- topics.list.newOrder[order(topics.list.newOrder$newTopicOrder, decreasing = FALSE),]


## Put the new order number in front for printing
topics.list.newOrder <- topics.list.newOrder[,c(2,1)]

## Write that list to a | delimited file for easy copy into excel
dir.create(paste(wd,"/",iterations," iterations/", sep=""))
## Warning in dir.create(paste(wd, "/", iterations, " iterations/", sep =
## "")): 'D:\Dropbox\The Limits of Interest\12000 iterations' already exists
filename = paste(iterations," iterations/", "Top_40_words_in_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(topics.list.newOrder, row.names=FALSE, col.names=FALSE, file = filename, sep = "|")

2.2 Create a dataframe with the documents, meta-data, and topic Proportions

This step essentially creates one large dataframe with all the stuff we want to use, so that it can be quickly analyzed in one step. It also creates a typology factor which we can then use to do the various subsets which are of interest later (e.g. Industry vs. Nonindustry, 4 super types, 5 super types)

# add back in the ControlNumberID for each document
doc.topics.IDnumbered <- data.frame(documents$ControlNumberID,doc.topics)

## Create a vector containing short names for the topics to use as variable names
topics.var.labels <- rep("", n.topics)
for (topic in 1:n.topics) topics.var.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=3)$words, collapse=".")

# have a look at these var names for each topic, note that they are in the non-substantive order
# label the variables of this new dataframe based on a short version of the topic names
# check the initial (raw) order and first couple values
############# To be updated after 2019-06-12 changes ##############
# re-order topics into substantive order. Note that the order is based on the order of the raw topic columns +1 (because of ControlNumberID) rather than what the new roman numeral will be (the new substantive order), since we are reorganizing the columns by column index.  This is different from 2.1 above, where we assigned new row numbers and then sorted by them.

doc.topics.IDnumbered <- doc.topics.IDnumbered[c(1, 14, 12, 2,  11, 7,  4,  8,  3,  10, 13, 5,  6,  9,  15)]
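
The two reorderings (ranks in 2.1, column indices here) are two views of the same permutation. As a check, the sketch below derives the column index from the rank vector with order() and compares it with the hard-coded vector; the object names `columnIndex` and `hardCodedIndex` are introduced here for illustration only.

```r
# Rank assigned to each raw topic in section 2.1 (new substantive position)
newTopicOrder <- c(3, 8, 6, 11, 12, 5, 7, 13, 9, 4, 2, 10, 1, 14)

# order() inverts the ranks: which raw topic belongs in position 1, 2, ...
# Add 1 to skip the ControlNumberID column, which stays first.
columnIndex <- c(1, order(newTopicOrder) + 1)

# Column index vector used in the reordering statement above
hardCodedIndex <- c(1, 14, 12, 2, 11, 7, 4, 8, 3, 10, 13, 5, 6, 9, 15)

print(all(columnIndex == hardCodedIndex))
```

If the substantive order is ever revised, deriving the index this way keeps the two sections from drifting apart.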


#check that this worked (first should be ControlNumberID, second is swap.dealer.foreign, last should be specul.commod.price) [2019-08-07 consistent]
# merge the doc.topics.IDNumbered and documents dataframe to create an analytical base for all the comparisons
# check it out real quick
#create a factor for the typology (levels kept in substantive order)
typologyLevels <- c(
  "Major Wall Street Sell-Side Bank",
  "Other Sell-Side Bank",
  "Core Financial Service Trade Association",
  "Financial Sector Association",
  "Public Asset Manager",
  "Private Asset Manager",
  "U.S. Chamber of Commerce or Affiliate",
  "Market Infrastructure Firm",
  "Law Firms, Consultants, and Related Advisors",
  "Non-Financial Private-Sector Association",
  "Non-Financial Firm",
  "Government",
  "Academic or Other Expert",
  "Consumer Advocacy or other Citizens Group",
  "Trade Union or other Formal Labor Organization",
  "Market Advocacy or other Anti-Regulation Group",
  "Unaffiliated Individual",
  "Un-Coded")
typology <- factor(typologyLevels, levels = typologyLevels)

2.2.1 Deal with Comment Letters Signed by Multiple Types of Commenters

A number of comment letters had multiple organization types listed in the Organization value (161 with two, 27 with three, 1 with four). We have coded each of these types so that each letter counts as one contribution to each type of commenter. This means that the total number of comments in the topic proportions below will exceed the total number of actual comment letters (n = 8,262).

We chose this approach because we are interested in who is saying what and how that differs between different types of commenters. This means it is less important to maintain exact numerical equivalency, since we are comparing proportions.

We chose 1 contribution to each category (rather than fractional based on some arbitrary formula) for the same reason that every signatory to the Declaration of Independence was equally guilty of treason even though only some (e.g. Thomas Jefferson) did the majority of writing.

If you signed a letter, you are jointly, separably, wholly, and equally liable for its contents, even if you were (hypothetically) less involved in the writing process.
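
The duplication logic can be sketched on a two-row toy data frame. The column names follow those used in this syntax, while the rows and the object names `toyLetters`, `toySecond`, and `toyExpanded` are made up for illustration.

```r
# Toy letters: letter 101 is signed by two commenter types, letter 102 by one
toyLetters <- data.frame(
  ControlNumberID = c("101", "102"),
  Final.Classification = c("Non-Financial Firm", "Government"),
  Final.Classification.Multivalue.2 = c("Academic or Other Expert", ""),
  stringsAsFactors = FALSE
)

# Copy out the letters with a second classification and give each copy a new ID
toySecond <- toyLetters[toyLetters$Final.Classification.Multivalue.2 != "", ]
toySecond$Classifications.with.Multivalue <- toySecond$Final.Classification.Multivalue.2
toySecond$ControlNumberID <- paste(toySecond$ControlNumberID, "2", sep = "_")

# The original rows keep their primary classification
toyLetters$Classifications.with.Multivalue <- toyLetters$Final.Classification

# Stack originals and copies: one row per letter-type combination
toyExpanded <- rbind(toyLetters, toySecond)
print(toyExpanded[, c("ControlNumberID", "Classifications.with.Multivalue")])
```

Letter 101 now contributes once to "Non-Financial Firm" and once to "Academic or Other Expert", and the suffixed ID keeps every row uniquely identifiable.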

## Add new variable for Unified Classifications with multivalues
documents.withTopics$Classifications.with.Multivalue <- NA

## Copy out all comments with non-empty multivalue columns into new dataframes with rows that are only those comments with that number of multivalues and all columns.  Then, assign their respective multivalue to their value for Classifications.with.Multivalue.  Finally, change their ControlNumberID to be OriginalID_<MultivalueIndex> so that ControlNumberID remains a Globally Unique Identifier (GUID).

# Multivalue 2, 161 rows expected
multivalue2.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.2!="",]

multivalue2.documents$Classifications.with.Multivalue <- multivalue2.documents$Final.Classification.Multivalue.2

multivalue2.documents <- multivalue2.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"2", sep = "_"))

# Multivalue 3, 27 rows expected
multivalue3.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.3!="",]

multivalue3.documents$Classifications.with.Multivalue <- multivalue3.documents$Final.Classification.Multivalue.3

multivalue3.documents <- multivalue3.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"3", sep = "_"))

# Multivalue 4, 1 row expected
multivalue4.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.4!="",]

multivalue4.documents$Classifications.with.Multivalue <- multivalue4.documents$Final.Classification.Multivalue.4

multivalue4.documents <- multivalue4.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"4", sep = "_"))

# Multivalue 5, 0 rows expected
multivalue5.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.5!="",]

multivalue5.documents$Classifications.with.Multivalue <- multivalue5.documents$Final.Classification.Multivalue.5

multivalue5.documents <- multivalue5.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"5", sep = "_"))

## Assign all original cases their Final.Classification value in the new Classifications.with.Multivalue.  This needed to come after the copy out in order to prevent possible inaccuracy in the data
documents.withTopics$Classifications.with.Multivalue <- documents.withTopics$Final.Classification

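## Combine the rows from all of the dataframes to a new documents.withTopicsAndMultivalues
documents.withTopicsAndMultivalues <- rbind(documents.withTopics, multivalue2.documents, multivalue3.documents, multivalue4.documents, multivalue5.documents)

## Verify that there are now the expected number of rows, which is 8262 + 161 + 27 + 1 + 0 = 8451
nrow(documents.withTopicsAndMultivalues)
## [1] 8451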
nrow(documents.withTopicsAndMultivalues) == 8451
## [1] TRUE
## Confirm that all cases have a Classifications.with.Multivalues value
sum(1*(is.na(documents.withTopicsAndMultivalues$Classifications.with.Multivalue))) #should be 0
## [1] 0
#check that columns still named correctly (first should be ControlNumberID, second is posit.limit.commod, last should be loan.bank.cooper)
##  [1] "ControlNumberID"

2.2.2 Backup the Analysis Dataframe

Because the topic model takes several hours to complete, create a copy of the analysis working dataframe, just in case, so that each analysis starts with guaranteed fresh data.

#create a backup version of the database at this state to save processing time if we alter the corpus below in any of the analyses.
documents.withTopicsAndMultivaluesBKUP_2.2 <- documents.withTopicsAndMultivalues

2.3 Generate and Display Topic Proportions for each of the 18 groups

This section first generates and then displays different representations of the topic proportions for each of the 18 groups in the commenter typology.

2.3.1 Calculate the 18 Type Typology Topic Proportions

To calculate this, R must first calculate the individual topic proportions for each document and then aggregate them over the 18 categories of the typology. The result is stored in typologyTopicProportions.
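
As a minimal sketch of that aggregation, the example below averages two toy topic columns within each commenter type using base R's aggregate(). This swaps in a base R function for the ddply() call used in the syntax (ddply() comes from the plyr package), and the data, the topic column names, and the choice of the mean as the summary are assumptions for illustration.

```r
# Toy per-document topic proportions with a commenter type for each row
toyDocs <- data.frame(
  Classifications.with.Multivalue = c("Government", "Government", "Non-Financial Firm"),
  topic.a = c(0.2, 0.4, 0.9),
  topic.b = c(0.8, 0.6, 0.1),
  stringsAsFactors = FALSE
)

# Average every topic column within each commenter type
toyTypologyProportions <- aggregate(
  cbind(topic.a, topic.b) ~ Classifications.with.Multivalue,
  data = toyDocs,
  FUN = mean
)
print(toyTypologyProportions)
```

Each row of the result describes one commenter type, so rows can be compared directly even though the types contain very different numbers of letters.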

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# variables in documents.withTopicsAndMultivalues
##  [1] "ControlNumberID"                  
# average all the documents down to the 18 part typology in Classifications.with.Multivalue

typologyTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[2:length(names(documents.withTopicsAndMultivalues))] ### omit ID

#to count the number of comments in each type:
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Major Wall Street Sell-Side Bank"))
## [1] 124
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Core Financial Service Trade Association"))
## [1] 278
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Other Sell-Side Bank"))
## [1] 159
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Public Asset Manager"))
## [1] 23
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Private Asset Manager"))
## [1] 680
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="U.S. Chamber of Commerce or Affiliate"))
## [1] 59
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Market Infrastructure Firm"))
## [1] 659
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Law Firms, Consultants, and Related Advisors"))
## [1] 316
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Non-Financial Firm"))
## [1] 931
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Financial Sector Association"))
## [1] 576
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Non-Financial Private-Sector Association"))
## [1] 677
## [1] 317
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Academic or Other Expert"))
## [1] 127
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Consumer Advocacy or other Citizens Group"))
## [1] 442
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Trade Union or other Formal Labor Organization"))
## [1] 38
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Market Advocacy or other Anti-Regulation Group"))
## [1] 29
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Unaffiliated Individual"))
## [1] 2916
## [1] 100
# report the dataframe where each row is one of the commenter types from Classifications.with.Multivalue, and each column is a topic.  Each cell is then the average topic proportion for that commenter type and topic.  (Each row adds up to 1 in theory but may be slightly off due to rounding in the output below)

##                   Classifications.with.Multivalue swap.dealer.foreign
## Write the table to a | delimited file for easy copy into excel
filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(typologyTopicProportions, row.names=FALSE, col.names=TRUE, file = filename, sep = "|")
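The per-type averaging and the per-type counts can be sketched in base R. The original call uses ddply; the version below swaps in base R's aggregate() and table() purely for illustration, on an invented toy dataframe (`docs`, `topicA`, `topicB` are hypothetical names) with two topics instead of 14.

```r
# Toy data (hypothetical): four documents, two commenter types, two topics.
docs <- data.frame(
  ControlNumberID = c("1", "2", "3", "4"),
  Classifications.with.Multivalue = c("Government", "Government",
                                      "Non-Financial Firm", "Non-Financial Firm"),
  topicA = c(0.2, 0.4, 0.9, 0.7),
  topicB = c(0.8, 0.6, 0.1, 0.3),
  stringsAsFactors = FALSE
)

# Average every topic column within each commenter type (the ID is left out)
toyProportions <- aggregate(
  cbind(topicA, topicB) ~ Classifications.with.Multivalue,
  data = docs, FUN = mean
)
toyProportions

# Each row of averaged proportions should still sum to 1 (up to rounding)
rowSums(toyProportions[, c("topicA", "topicB")])

# Count the comments in every type at once instead of one sum() per type
table(docs$Classifications.with.Multivalue)
```

A single table() call returns all type counts together, which makes it easy to confirm that they add up to the number of rows in the dataframe.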

2.3.2 Create a Table of Aggregated Topic Proportions for Plotting

This dataframe is used in 2.3.3 and 2.3.4.

# This creates a 252 obs x 3 var table (18 types x 14 topics) where each row is the average topic proportion
#   for a topic-classification pair

tempPlot <- melt(typologyTopicProportions, id.vars="Classifications.with.Multivalue")

## give the columns the proper descriptive names
colnames(tempPlot)<- c("Classifications.with.Multivalue","topic","Proportions")

## look at the topic names
##  [1] swap.dealer.foreign      fund.bank.entiti        
## Rename the topics from variable names to descriptive ones with roman numeral indexes
tempPlot$topic <- mapvalues(tempPlot$topic, from = levels(tempPlot$topic), to = c(
                              "Cross-Border Transactions (I)",
                              "Volcker Rule (II)",
                              "End-User Rule (III)",
                              "Compliance Rules (IV)",
                              "Swap Clearing Rule (V)",
                              "Swap Reporting Rules (VI)",
                              "Macro Risk Reporting & Monitoring (VII)",
                              "Addresses & Names (VIII)",
                              "Firm-Level Risk Models (IX)",
                              "Community Banks & Derivative Users  (X)",
                              "End Users & Public Utilities (XI)",
                              "Speculation in Precious Metals (XII)",
                              "Speculation in Agricultural & Energy Commodities (XIII)",
                              "Speculation in Household Commodities (XIV)"
                              ))

## look at the topic names
##  [1] Cross-Border Transactions (I)                          
##  [2] Volcker Rule (II)                                      
##  [3] End-User Rule (III)                                    
##  [4] Compliance Rules (IV)                                  
##  [5] Swap Clearing Rule (V)                                 
##  [6] Swap Reporting Rules (VI)                              
##  [7] Macro Risk Reporting & Monitoring (VII)                
##  [8] Addresses & Names (VIII)                               
##  [9] Firm-Level Risk Models (IX)                            
## [10] Community Banks & Derivative Users  (X)                
## [11] End Users & Public Utilities (XI)                      
## [12] Speculation in Precious Metals (XII)                   
## [13] Speculation in Agricultural & Energy Commodities (XIII)
## [14] Speculation in Household Commodities (XIV)             
## 14 Levels: Cross-Border Transactions (I) ... Speculation in Household Commodities (XIV)
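The wide-to-long reshaping and the relabeling of topic levels can be sketched without any packages. The original relies on melt() and mapvalues(); the base R version below is only an illustration on an invented three-type, two-topic table, and the column names `topic.one` and `topic.two` and the descriptive labels are hypothetical.

```r
# Toy wide table (hypothetical): one row per commenter type, one column per topic.
wide <- data.frame(
  Classifications.with.Multivalue = c("Government", "Non-Financial Firm",
                                      "Unaffiliated Individual"),
  topic.one = c(0.2, 0.5, 0.7),
  topic.two = c(0.8, 0.5, 0.3),
  stringsAsFactors = FALSE
)
topicCols <- setdiff(names(wide), "Classifications.with.Multivalue")

# Build the "tall" table: one row per type-topic pair
long <- data.frame(
  Classifications.with.Multivalue =
    rep(wide$Classifications.with.Multivalue, times = length(topicCols)),
  topic = factor(rep(topicCols, each = nrow(wide)), levels = topicCols),
  Proportions = unlist(wide[topicCols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# The row count is types x topics (3 x 2 here, 18 x 14 = 252 in the analysis)
nrow(long) == nrow(wide) * length(topicCols)

# Replace variable-style level names with descriptive labels, in level order
levels(long$topic) <- c("Hypothetical Topic One (I)", "Hypothetical Topic Two (II)")
levels(long$topic)
```

Because the labels are assigned by position, the order of the new names must match the order of the existing factor levels exactly.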

2.3.3 Stacked Bar Chart of Proportions for Visual Comparison

This generates a stacked bar chart which compares the proportions of topics across all 18 commenter types. Note that they are all scaled to [0,1] which means that they do not indicate the relative number of commenters in each category.

The output is saved to a png file with a descriptive name.

## Stacked Bar Chart of Proportions for Visual Comparison
#Consider using scales library to make the colors more visually pleasing

chartTitle = "Topic Proportions in Comments from 18 types of Commenters"
filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

#creation of png file
png(filename = paste("",filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

stackedBar <- ggplot(tempPlot, aes(Classifications.with.Multivalue, Proportions, fill=topic)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x=element_text(angle=-35,hjust=0,vjust=1)) + 
  scale_fill_viridis(name="Topic", discrete=TRUE) +
  xlab("Commenter Types") + ylab("Average Topic Proportions") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(documents.withTopicsAndMultivalues),")", sep=""))


## png 
##   2
## Attached image to output as well in order to enrich Rmd online appendix



2.3.4 Generate a Heat-Map Table of Topic Proportions

The heat map makes patterns in the topic proportions table much easier to notice.

The output is saved to a png file with a descriptive name.

Color Landscape Heatmap

## make a nice heatmap of the proportions
# this is the table-like plot where cells are colored from blue (low) through white (mean) to gold (high).  It is useful for noticing patterns in topics

filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_heatmap_",n.topics,"_topics_","landscape_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# Note that the default resolution (res) of the png() device is 72 ppi. So, to reproduce the debug dimensions exactly but at an arbitrarily higher resolution, we simply divide the px size by 72, change the units to inches, and set res.
png(filename = paste("",filename,sep=""), width = 1100/72, height = 578/72, units = "in", res = 600)

heatmapLandscape <- ggplot(tempPlot, aes(Classifications.with.Multivalue, topic)) + 
  geom_tile(aes(fill = Proportions), color= "white") + 
  theme_bw() + 
  xlab("") + #blank to avoid conflict with labels below
  ylab("") + #blank to avoid conflict with labels below
  scale_fill_gradient2(low = "#1A3C71", #cal Poster Template Blue
                       mid = "white",
                       high = "#FFB800", #Cal Poster Template Gold
                       midpoint = mean(tempPlot$Proportions)) + # set the scale mean at the data mean
  theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1)) + 
  geom_text(aes(label = round(Proportions, 4)), size=4) + 
  scale_y_discrete(limits = rev(levels(tempPlot$topic)))


## png 
##   2
## Attached image to output as well in order to enrich Rmd online appendix
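The sizing rule described in the comment above (divide the pixel size by 72, switch to inches, set res) works out as follows. This is a small arithmetic sketch; `pxToInches` is a hypothetical helper that does not appear in the original syntax, and the resulting pixel counts are approximate because the device rounds to whole pixels.

```r
# Hypothetical helper (not part of the original syntax): convert a pixel size
# that was tuned at the 72 ppi default into inches.
pxToInches <- function(px, basePpi = 72) px / basePpi

widthIn  <- pxToInches(1100)  # landscape heatmap width used above
heightIn <- pxToInches(578)   # height used for all charts above
res <- 600

# Approximate pixel dimensions of a file written with units = "in", res = 600
outputPixels <- round(c(width = widthIn * res, height = heightIn * res))
outputPixels

# Linear scale factor relative to the 72 ppi debug rendering
res / 72
```

The layout stays the same as the 72 ppi debug version because the size in inches is unchanged; only the number of pixels per inch increases.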


[Figure: Topic_Proportions_by_Commenter_Type_heatmap_14_topics_landscape_2019-12-01_13.35.19]

Color Portrait Heatmap

## make a nice heatmap of the proportions
# this is the table-like plot where cells are colored from blue (low) through white (mean) to gold (high).  It is useful for noticing patterns in topics

filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_heatmap_",n.topics,"_topics_","portrait_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# Note that the default resolution (res) of the png() device is 72 ppi. So, to reproduce the debug dimensions exactly but at an arbitrarily higher resolution, we simply divide the px size by 72, change the units to inches, and set res.
png(filename = paste("",filename,sep=""), width = 1000/72, height = 578/72, units = "in", res = 600)

heatmapPortrait <- ggplot(tempPlot, aes(Classifications.with.Multivalue, topic)) + 
  geom_tile(aes(fill = Proportions), color= "white") + 
  theme_bw() + 
  xlab("") + #blank to avoid conflict with labels below
  ylab("") + #blank to avoid conflict with labels below
  scale_fill_gradient2(low = "#1A3C71", #cal Poster Template Blue
                       mid = "white",
                       high = "#FFB800", #Cal Poster Template Gold
                       midpoint = mean(tempPlot$Proportions) # set the scale mean at the data mean
                       ) + 
  theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1)) + 
  geom_text(aes(label = round(Proportions, 4)), size=4) + 
  scale_x_discrete(limits = rev(levels(tempPlot$Classifications.with.Multivalue))) +


## png 
##   2
## Attached image to output as well in order to enrich Rmd online appendix



2.4 Sample of Representative Comment Letters from Each Topic

To supplement and verify the computational topic model’s analysis of the comment corpus, this section selects comment letters which exemplify each of the topics. Several strategies of selecting “exemplary” comments are used in the sections below.

2.4.1 Select ALL Comment Letters which are {1,2} standard deviations above the mean topic proportions

This first section selects all comment letters which are at least N standard deviations above the mean topic proportion for each topic, where N is 1 or 2. These letters should contain the strongest representation of each topic and so display what that topic looks like in its purest form.

In testing, 1 and 2 standard deviations yield a very large number of letters (hundreds), and 2 creates too big a set with duplicates. numDeviationsAbove = 5 yields about 100 letters, which is still more than we are interested in. This is based on testing with the topic swap.foreign.regul, and the pattern may differ for other topics. However, 5 is too high for some topics, since they have no comment that far above the mean.

With this in mind, I will be deprecating this approach and shifting to the “top 5 approach” in 2.4.2.

# set the number of standard deviations above mean which will be used
numDeviationsAbove = 2

# create output dataframe
sampleCommentLetters_DeviationsAbove <- documents.withTopicsAndMultivalues[FALSE,]

# get the list of topic variables names in substantive (and thus df consistent) order from section 2.2
topicNamesOrdered <- names(doc.topics.IDnumbered)[2:length(names(doc.topics.IDnumbered))] #we drop the first variable name because it is ControlNumberID which is not relevant here

#### Start Loop Here ####
for (currentTopicName in topicNamesOrdered){ # wrap the entire algorithm in a loop to execute through all of the topics

#get the mean and standard deviation
meanOfTopicProportion <- mean(documents.withTopicsAndMultivalues[,currentTopicName])
stdevOfTopicProportion <- sd(documents.withTopicsAndMultivalues[,currentTopicName])

# Start by creating a new dataframe which selects out docs which are numDeviationsAbove standard deviations above the mean
currentTopicSample <- documents.withTopicsAndMultivalues[documents.withTopicsAndMultivalues[,currentTopicName]>=(meanOfTopicProportion+(numDeviationsAbove*stdevOfTopicProportion)),]

# Add a variable to CurrentTopicSample to capture which topic the comment is selected for
currentTopicSample$relevantTopic <- currentTopicName

## Add these selected documents to the sampling dataframe
sampleCommentLetters_DeviationsAbove <- 
  rbind(sampleCommentLetters_DeviationsAbove, currentTopicSample)

}
## END OF LOOP ###

## Check if Duplicates have been added

## Write the table to a | delimited file for easy copy into excel
filename = paste(iterations," iterations/","Sample_Comment_Letters_",numDeviationsAbove,"_Deviations_Above_Mean_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(sampleCommentLetters_DeviationsAbove, row.names=FALSE, col.names=TRUE, file = filename, sep = "|")
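The threshold selection in this section can be sketched on a small invented dataset. The names `toyDocs`, `topicA`, `topicB`, and `selected` are hypothetical, and lapply() stands in for the for loop used above; the sketch also includes a duplicate check, which is an assumption about what the original check would look like.

```r
# Toy data (hypothetical): ten documents with proportions for two topics.
toyDocs <- data.frame(
  ControlNumberID = as.character(1:10),
  topicA = c(0.05, 0.10, 0.10, 0.15, 0.10, 0.05, 0.10, 0.95, 0.10, 0.10),
  topicB = c(0.95, 0.90, 0.90, 0.85, 0.90, 0.95, 0.90, 0.05, 0.90, 0.90),
  stringsAsFactors = FALSE
)
numDeviationsAbove <- 2
topicNames <- c("topicA", "topicB")

# Apply the threshold to each topic and stack the selections
selected <- do.call(rbind, lapply(topicNames, function(currentTopicName) {
  values <- toyDocs[[currentTopicName]]
  cutoff <- mean(values) + numDeviationsAbove * sd(values)
  hits <- toyDocs[values >= cutoff, ]
  # rep() keeps this assignment valid even when no document passes the cutoff
  hits$relevantTopic <- rep(currentTopicName, nrow(hits))
  hits
}))
selected

# TRUE would mean that at least one letter was selected for more than one topic
anyDuplicated(selected$ControlNumberID) > 0
```

In this toy example only one document clears the cutoff for topicA and none does for topicB, which mirrors the observation above that a high threshold can leave some topics without any selected letter.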

2.4.2 Select Top XX highest topic proportion Comment Letters

Compared to 2.4.1, this is an even more extreme selection criterion, although it may be susceptible to highly clustered topic proportions since it does not check how extreme the selected values are.

This was the analysis that ultimately was most useful for understanding what letters were in topics. However, 2.4.1 and 2.4.3 are retained in specification and partial implementation for posterity in case a future analyst wishes to explore alternative meanings of “typical” or “representative” when evaluating topics.

# set the number of comments to be selected from the top of the distribution
topNumSelected = 30

#create working copy of documents.withTopics dropping unneeded metadata
#   Note: we do NOT need to use multivalues for this strict "top ranking" selection criterion since we are not selecting in comparison to the entire corpus as we would be with the stdev or proportion models.
##  [1] "ControlNumberID"                  
documents.ToSample <- documents.withTopics[,c(

# create output dataframe; on re-runs, note that this also blanks out previous sample data, which is desirable behavior to prevent accidents.
sampleCommentLetters_topNumber <- documents.withTopics[FALSE,]

# get the list of topic variables names in substantive (and thus df consistent) order from section 2.2
topicNamesOrdered <- names(doc.topics.IDnumbered)[2:length(names(doc.topics.IDnumbered))] #we drop the first variable name because it is ControlNumberID which is not relevant here

#### Start Loop Here ####
for (currentTopicName in topicNamesOrdered){ # wrap the entire algorithm in a loop to execute through all of the topics

  #create temp working dataframe from topics without multivalues
  tempCommentsByTopics <- documents.ToSample[order(-documents.ToSample[,currentTopicName]),]
  # Create a new dataframe which selects out docs which are in the topNumSelected
  currentTopicSample <- tempCommentsByTopics[1:topNumSelected,]
  # Add a variable to CurrentTopicSample to capture which topic the comment is selected for
  currentTopicSample$relevantTopic <- currentTopicName
  ## Add these selected documents to the sampling dataframe
  sampleCommentLetters_topNumber <- 
    rbind(sampleCommentLetters_topNumber, currentTopicSample)

}
## END OF LOOP ###

## Are there duplicate Comment Letters? (False means no because we are comparing the number of unique comment IDs to the total number of comments)
length(unique(sampleCommentLetters_topNumber$ControlNumberID)) != nrow(sampleCommentLetters_topNumber)
## [1] FALSE
## Change relevantTopic to factor and rename levels to descriptive names
sampleCommentLetters_topNumber$relevantTopic <- factor(sampleCommentLetters_topNumber$relevantTopic)

sampleCommentLetters_topNumber$relevantTopic <- mapvalues(sampleCommentLetters_topNumber$relevantTopic, 
                                                          from = topicNamesOrdered, #
                                                          to = c(
                              "Cross-Border Transactions (I)",
                              "Volcker Rule (II)",
                              "End-User Rule (III)",
                              "Compliance Rules (IV)",
                              "Swap Clearing Rule (V)",
                              "Swap Reporting Rules (VI)",
                              "Macro Risk Reporting & Monitoring (VII)",
                              "Addresses & Names (VIII)",
                              "Firm-Level Risk Models (IX)",
                              "Community Banks & Derivative Users  (X)",
                              "End Users & Public Utilities (XI)",
                              "Speculation in Precious Metals (XII)",
                              "Speculation in Agricultural & Energy Commodities (XIII)",
                              "Speculation in Household Commodities (XIV)"
                              ))

## Put the relevant topic variables at the front and relevant comment letter text at the end
sampleCommentLetters_topNumber <- sampleCommentLetters_topNumber[,c(23,22,1:6,8:21,7)]

## Create a directory to hold this large number of letters
letterDirectory <- paste(wd,"/",iterations," iterations/Letters/", sep="")

## Warning in dir.create(letterDirectory): 'D:\Dropbox\The Limits of
## Interest\12000 iterations\Letters' already exists
## Write each selected letter to a SEPARATE file with its metadata

for ( commentNum in 1:nrow(sampleCommentLetters_topNumber)){
  currentLetter <- sampleCommentLetters_topNumber[commentNum,]
  if (commentNum%%topNumSelected == 0) {
    currentRank <- topNumSelected
  } else {
    currentRank <- commentNum%%topNumSelected
  }
  filename = paste(letterDirectory, "/",
                   format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),
  for ( varNum in 1:ncol(currentLetter) ){
    write(c(paste(as.character(colnames(currentLetter)[varNum]), ## This is the variable name
                 as.character(currentLetter[1,varNum]), ## This is the variable value
                 sep = " : ")
          file = filename,
          append = TRUE

#NOTE: Due to the excel cell character limit, example letters need to be outputted as individual files 
#       and then copied manually into word files for full text.  But, a tabular comparison file is useful
## Write the table to a | delimited file for easy copy into excel
filename = paste(letterDirectory,"/","Sample_Comment_Letters_TRUNCATED_Top_",topNumSelected,"_by_Proportions_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(sampleCommentLetters_topNumber, row.names=FALSE, col.names=TRUE, file = filename, sep = "|")
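The top-N selection and the within-topic rank can be sketched on toy data. The names `toyDocs`, `sampled`, `topicA`, and `topicB` are invented, lapply() stands in for the for loop, and the one-line rank formula is offered as an equivalent of the if/else on commentNum %% topNumSelected rather than as the original implementation.

```r
# Toy data (hypothetical): five documents with proportions for two topics.
toyDocs <- data.frame(
  ControlNumberID = c("a", "b", "c", "d", "e"),
  topicA = c(0.10, 0.80, 0.30, 0.60, 0.20),
  topicB = c(0.90, 0.20, 0.70, 0.40, 0.80),
  stringsAsFactors = FALSE
)
topNumSelected <- 2
topicNames <- c("topicA", "topicB")

# For each topic, sort descending by that topic and keep the first topNumSelected rows
sampled <- do.call(rbind, lapply(topicNames, function(currentTopicName) {
  ranked <- toyDocs[order(-toyDocs[[currentTopicName]]), ]
  top <- head(ranked, topNumSelected)
  top$relevantTopic <- currentTopicName
  top
}))

# Rows are stored in blocks of topNumSelected per topic, so the rank within a
# topic can be recovered from the row position alone.
sampled$rank <- ((seq_len(nrow(sampled)) - 1) %% topNumSelected) + 1
sampled[, c("relevantTopic", "rank", "ControlNumberID")]

# Are there duplicate comment letters? (FALSE means no)
length(unique(sampled$ControlNumberID)) != nrow(sampled)
```

The rank formula relies on every topic contributing exactly topNumSelected rows, so it would need adjusting if a topic had fewer documents than that.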

2.4.3 Select Median 5 Comment letters closest to the median Topic Proportion

This selection emphasizes the “typical” usage of the topic in the corpus but, for that reason, will likely make it more difficult to isolate the essence of the specific topic. However, it may be useful should there be concerns that the extreme-proportion selections in 2.4.1 and 2.4.2 are unrepresentative.
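No implementation of this selection is given here, so the following is only one possible sketch of the idea, assuming that "closest to the median" means the smallest absolute difference from the median topic proportion. All names (`toyDocs`, `topicA`, `numSelected`, `medianSample`) are hypothetical, and the toy example selects three letters rather than five.

```r
# Toy data (hypothetical): seven documents with proportions for one topic.
toyDocs <- data.frame(
  ControlNumberID = c("a", "b", "c", "d", "e", "f", "g"),
  topicA = c(0.02, 0.10, 0.12, 0.15, 0.22, 0.55, 0.90),
  stringsAsFactors = FALSE
)
numSelected <- 3

# Distance of every document from the median proportion of the topic
medianProportion <- median(toyDocs$topicA)
distance <- abs(toyDocs$topicA - medianProportion)

# Keep the numSelected documents with the smallest distance
medianSample <- toyDocs[order(distance), ][seq_len(numSelected), ]
medianSample$relevantTopic <- "topicA"
medianSample
```

Ties in distance are broken by row order here; a real analysis might want an explicit tie-breaking rule.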

3. Who Commented on the 10 most commented rules in the Coded (8k) dataset

Now we want to see who was commenting on the 10 most commented-on rules. These data are primarily discussed in Supporting Information B and E.

3.1 The top 10 Rules by number of comments

First, we need that top 10 list. Note: This will be based on the coded comments only, which is the set of 8568 comments, not the set of 37,232 comments.

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# get the unique list of rules with counts of relevant comments
cftc.Comments.coded.counts <- as.data.frame(table(documents.withTopicsAndMultivalues$UniqueName))

colnames(cftc.Comments.coded.counts)[1]<- "UniqueName"
colnames(cftc.Comments.coded.counts)[2]<- "Comments"

## Sort in descending order of frequency to see top values
cftc.Comments.coded.counts <- cftc.Comments.coded.counts[order(cftc.Comments.coded.counts$Comments, decreasing = TRUE),]

# examine the top 15 rules to get a sense for patterns
head(cftc.Comments.coded.counts, 15)
##      UniqueName Comments
## \end{table}

3.2 Getting the comments from the coded dataset

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# create the top 10 rule comment dataset
cftc.Comments.coded.top10 <- subset(documents.withTopicsAndMultivalues, UniqueName %in% top.commented.rules$UniqueName)

# create the inverse as well
cftc.Comments.coded.11toEnd <- subset(documents.withTopicsAndMultivalues, !(UniqueName %in% top.commented.rules$UniqueName))

# check that the number of comments pulled is the sum total of the number in the top 10 list
## [1] 5812
## [1] 0
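The count, rank, and subset steps of 3.1 and 3.2 can be sketched end to end on toy data. The rule names, `toyComments`, `topRules`, and `numTopRules` are invented for the illustration, and taking the first rows of the sorted counts with head() is an assumption about how a top list such as top.commented.rules could be built.

```r
# Toy data (hypothetical): eight comments spread over four rules.
toyComments <- data.frame(
  ControlNumberID = as.character(1:8),
  UniqueName = c("Rule A", "Rule A", "Rule A", "Rule B",
                 "Rule B", "Rule C", "Rule A", "Rule D"),
  stringsAsFactors = FALSE
)
numTopRules <- 2

# Count comments per rule and sort in descending order of frequency
ruleCounts <- as.data.frame(table(toyComments$UniqueName), stringsAsFactors = FALSE)
colnames(ruleCounts) <- c("UniqueName", "Comments")
ruleCounts <- ruleCounts[order(ruleCounts$Comments, decreasing = TRUE), ]

# Keep the most commented rules, then split the comments into two sets
topRules <- head(ruleCounts, numTopRules)
topSet  <- subset(toyComments, UniqueName %in% topRules$UniqueName)
restSet <- subset(toyComments, !(UniqueName %in% topRules$UniqueName))

# The top set should contain exactly the comments counted in the top list
nrow(topSet) == sum(topRules$Comments)

# The two sets together should account for every comment
nrow(topSet) + nrow(restSet) == nrow(toyComments)
```

Checking both conditions catches a mismatch in rule names between the count table and the comment dataframe, which would otherwise silently drop rows.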

3.3 Stacked Bar Charts of Commenter type

Now that we have the comments pulled out, we’ll use this dataset to generate some stacked bar charts.

3.3.1 Top 10 Rules Combined

First, the pool of comments for all 10 rules combined.

all.comments = x=factor("All Rules") ##dummy factor to produce a stacked bar chart with only one bar.

chartTitle = "Most Frequently Commented Ten CFTC Proposed Rules:\nComments by Type of Organization"
filename=paste(iterations," iterations/","Comments_by_OrgType_Top10Rules_n",nrow(cftc.Comments.coded.top10),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

topTenCombinedBar <- ggplot(cftc.Comments.coded.top10,
           order=-as.numeric(Classifications.with.Multivalue))) + 
  geom_bar() + 
  scale_fill_viridis(name="Commenter Type", discrete = TRUE,
                                   " (",
                                   ")",sep=""))  +
  xlab("") + 
  ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.top10),")", sep=""))


## png 
##   2
## Attach image to output as well in order to enrich Rmd online appendix



3.3.2 11 to N Rules Combined for Comparison

For comparison, let’s also look at the inverse: the 11th through the last rule.

all.comments = x=factor("All Rules") ##dummy factor to produce a stacked bar chart with only one bar.

chartTitle = "Less Frequently Commented (11 to 114) CFTC Proposed Rules:\nComments by Type of Organization"
filename=paste(iterations," iterations/","Comments_by_OrgType_11toEnd_n",nrow(cftc.Comments.coded.11toEnd),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

elevenToEndCombinedBar <- ggplot(cftc.Comments.coded.11toEnd, 
                                     order=-as.numeric(Classifications.with.Multivalue))) +
  geom_bar() + 
  scale_fill_viridis(name="Commenter Type", discrete = TRUE,
                                   " (",
                                   ")",sep="")) + 
  xlab("") + ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.11toEnd),")", sep=""))


## png 
##   2
## Attach image to output as well in order to enrich Rmd online appendix



3.4 Top 10 Commented Rules split out by Rule

Then, we also want to see each of the 10 rules split out.

3.4.1 Top 1 to 10 Rules Split Out

chartTitle = "Most Frequently Commented Ten CFTC Proposed Rules:\nComments by Type of Organization per Rule"
filename=paste(iterations," iterations/","Comments_by_OrgType_Top10Rules_splitout_n",nrow(cftc.Comments.coded.top10),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# NOTE: in order to get the rules sorted by descending order of volume of comments, we on-the-fly cast  UniqueName into a factor with levels from the top.commented.rules table created earlier which has the names of the rules sorted in descending order of comment volume.  This happens in the aes argument of the ggplot function

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

topTenSplitBar <- ggplot(cftc.Comments.coded.top10,
                         aes(factor(UniqueName,levels = top.commented.rules$UniqueName),
                             order=-as.numeric(Classifications.with.Multivalue))) + 
                                 vjust=1)) + 
  scale_fill_viridis(name="Commenter Type", discrete=TRUE) + 
  xlab("Top 10 Commented Rules") + 
  ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.top10),")", sep=""))


## png 
##   2
## Attach image to output as well in order to enrich Rmd online appendix



### This lets you try it facetted instead, but the results are poor.  Maintained for archival purposes
# png(filename = "../Results/Comments_top10_by_Rule_facets.png", width = 753, height = 578, units = "px")
# ggplot(cftc.Comments.coded.top10, aes(all.comments, fill=Classifications.with.Multivalue, order=-as.numeric(Classifications.with.Multivalue))) + geom_bar() + scale_fill_discrete()  + xlab("") + ylab("Number of Comments") +labs(title=paste("All Comments on the Top 10 Commented CFTC Proposed Rules to Implement the Dodd-Frank Act\n (n =",nrow(cftc.Comments.coded.top10),")")) + facet_wrap(~UniqueName)
# dev.off()

3.4.2 Top 2 to 10 Rules (For easier comparison)

Now, let’s drop the first rule (76 FR 4752) since it has so many more comments than the other rules. Then, we again split out the remaining nine rules.

# create the top 2to10 rule comment dataset
cftc.Comments.coded.top2thru10 <- subset(cftc.Comments.coded.top10, UniqueName !="76 FR 4752")

chartTitle = "(2nd to 10th) Most Frequently Commented CFTC Proposed Rules:\nComments by Type of Organization per Rule"
filename=paste(iterations," iterations/","Comments_by_OrgType_Top2thru10Rules_splitout_n",nrow(cftc.Comments.coded.top2thru10),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# NOTE: in order to get the rules sorted by descending order of volume of comments, we on-the-fly cast  UniqueName into a factor with levels from the top.commented.rules table created earlier which has the names of the rules sorted in descending order of comment volume.  This happens in the aes argument of the ggplot function

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

twoToTenSplitBar <- ggplot(cftc.Comments.coded.top2thru10, 
                                      levels = top.commented.rules$UniqueName),
                               order=-as.numeric(Classifications.with.Multivalue))) + 
                                 vjust=1)) + 
  scale_fill_viridis(name="Commenter Type", discrete=TRUE) + 
  xlab("Top 2 to 10 Commented Rules") + 
  ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.top2thru10),")", sep=""))


## png 
##   2
## Attach image to output as well in order to enrich Rmd online appendix



3.5 Create a Table of the number of comments by type for each of the top 10 rules

Because we cannot include the totals in a clear way in the legend of the 10-bar stacked chart, we need a quick cross-tabs table to give us the totals for each rule. We’ll use the table function to accomplish this.

cftc.Comments.coded.top10.counts <- table(cftc.Comments.coded.top10$Classifications.with.Multivalue,factor(cftc.Comments.coded.top10$UniqueName,levels = top.commented.rules$UniqueName))

#Quick print to console, but this is not useful for copy/paste because window is too narrow
##                                                  76 FR 4752 75 FR 80747
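The cross-tabulation can be sketched on toy data, including one way to get the per-rule totals and a copy-friendly file. `toyTop`, `ruleOrder`, and the rule names are invented; addmargins() and the | delimited export are suggestions for the illustration, not steps confirmed by the original syntax.

```r
# Toy data (hypothetical): five comments on two rules from two commenter types.
toyTop <- data.frame(
  UniqueName = c("Rule B", "Rule A", "Rule A", "Rule B", "Rule A"),
  Classifications.with.Multivalue = c("Government", "Government",
                                      "Non-Financial Firm", "Non-Financial Firm",
                                      "Non-Financial Firm"),
  stringsAsFactors = FALSE
)
ruleOrder <- c("Rule A", "Rule B") # rules in descending order of comment volume

# Cross-tabulate type by rule; the factor levels fix the column order
toyCounts <- table(
  toyTop$Classifications.with.Multivalue,
  factor(toyTop$UniqueName, levels = ruleOrder)
)

# Add row and column totals for a quick look at the console
addmargins(toyCounts)

# Write a | delimited copy that is easier to paste into a spreadsheet
outFile <- tempfile(fileext = ".txt")
write.table(as.data.frame.matrix(toyCounts), file = outFile, sep = "|", col.names = NA)
```

Setting the factor levels explicitly keeps the columns in comment-volume order instead of the alphabetical order that table() would otherwise use.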
4. Deprecated Additional Analyses

In our initial exploration of the topic models and topic proportions, these analyses were helpful to characterize the data. They are not used in the paper as they ultimately did not provide the insights which answered our research questions. They are retained here for future analysts looking to explore and characterize topic model data based on our approach.

4.1 Easily adjustable two-type comparison

Based on the 17 part proportions in 2.3, this section allows us to easily explore different commenter type combinations. It works by setting two types at the beginning of the code block. To set the types, you must copy the exact name from the list below into the typeA and typeB variables.

“Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”, “Public Asset Manager”, “Private Asset Manager”, “U.S. Chamber of Commerce or Affiliate”, “Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”, “Non-Financial Firm”, “Private-Sector Association”, “Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”, “Un-Coded”

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

## Set the commenter types you want to compare below
groupTypes = c(

# average all the documents down to the 17 part typology in Classifications.with.Multivalue

typologyTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[2:length(names(documents.withTopicsAndMultivalues))] ### omit ID

# report the dataframe where each row is one of the commenter types from VA.KSC, and each column is a topic.  Each cell will then be the average Proportions of the commenter type to mention a particular topic.  (Does this add up to 1 always?, probably not due to averaging)


# now, drop all lines from typologyTopicProportions which are not typeA and typeB
typeABtopicProportions <- typologyTopicProportions[typologyTopicProportions$Classifications.with.Multivalue %in% groupTypes,]

# confirm that the correct types were retained

#   creation of png file  #
#   (be sure to turn on   #
#   dev.off() line after  #
#    the chart)           #
#filename = paste("Topic_Proportions_",paste(groupTypes, collapse ="&"),"_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
#png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")

# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = paste("Topic Proportions in Comments from ",paste(groupTypes, collapse =" & ")," type Commenters", sep="")

#create a "tall" file which lists each of the variables vertically rather than a wide file where each variable is a column.
tempPlot <- melt(typeABtopicProportions, id.vars="Classifications.with.Multivalue")
colnames(tempPlot)<- c("Classifications.with.Multivalue","topic","Proportions") #make the names descriptive

ggplot(tempPlot, aes(Classifications.with.Multivalue, Proportions, fill=topic)) +geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=0,hjust=.5,vjust=1)) + xlab("Commenter Types") + ylab("Average Topic Proportions") +labs(title=paste(chartTitle,"\n (#topics = ",n.topics,")", sep=""))


## make a nice pseudo-table of the Proportions
# this is the weird green circle table-type thing that we came up with since R doesn't do tables well.

#filename = paste("Topic_Proportions_",paste(groupTypes, collapse ="&"),"_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
#png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")

#g <- ggplot(tempPlot, aes(Classifications.with.Multivalue, topic)) + geom_point(aes(size = Proportions), color= "green") + theme_bw() + xlab("") + ylab("")
#g + scale_size_continuous(range=c(1,10)) + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1))  + geom_text(aes(label = round(Proportions, 4)), size=4)
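
The wide-to-tall reshaping used above can be illustrated on a toy data frame. This is only a sketch: it uses base R's stack() in place of melt() so that it runs without extra packages, and the object names wideSketch and tallSketch, as well as all of the values, are made up for illustration.

```r
# Toy wide table: one row per commenter type, one column per topic.
# All values are invented for illustration only.
wideSketch <- data.frame(
  Classifications.with.Multivalue = c("Government", "Public Asset Manager"),
  topic1 = c(0.2, 0.5),
  topic2 = c(0.3, 0.1),
  topic3 = c(0.5, 0.4),
  stringsAsFactors = FALSE
)

topicCols <- setdiff(names(wideSketch), "Classifications.with.Multivalue")

# stack() places all topic columns in one "values" column and records the
# source column name in "ind", column by column.
stacked <- stack(wideSketch[topicCols])

# Rebuild the tall layout: the id column is repeated once per topic column.
tallSketch <- data.frame(
  Classifications.with.Multivalue =
    rep(wideSketch$Classifications.with.Multivalue, times = length(topicCols)),
  topic = as.character(stacked$ind),
  Proportions = stacked$values,
  stringsAsFactors = FALSE
)

print(tallSketch)
```

The tall layout has one row per (commenter type, topic) pair, which is the shape ggplot expects when the topic is mapped to the fill colour.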


4.2 Basic Binary Industry vs. Non-Industry comparison

Following the data presentation in (Levy & Franklin 2014), we begin by presenting a comparison between two halves of the data: those commenter types that can be considered Industry and those that are non-Industry organizations, academics, and private individuals.

Based on the 17-part typology:

Industry

“Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”, “Public Asset Manager”, “Private Asset Manager”, “U.S. Chamber of Commerce or Affiliate”, “Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”, “Non-Financial Firm”, “Financial Sector Association”, “Non-Financial Private-Sector Association”

Non-Industry

“Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”

“Un-Coded” (dropped from the analysis)
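
The binary grouping listed above can be sketched as a lookup table that is merged onto the documents. This is a minimal illustration rather than the original code: the type names are taken from the list above, but the objects lookupSketch and docsSketch, the toy topic values, and the use of all.x = TRUE (so that unmatched types surface as NA) are assumptions made for the example.

```r
# Type names copied from the listing above.
industryTypes <- c(
  "Major Wall Street Sell-Side Bank",
  "Core Financial Service Trade Association",
  "Other Sell-Side Bank",
  "Public Asset Manager",
  "Private Asset Manager",
  "U.S. Chamber of Commerce or Affiliate",
  "Market Infrastructure Firm",
  "Law Firms, Consultants, and Related Advisors",
  "Non-Financial Firm",
  "Financial Sector Association",
  "Non-Financial Private-Sector Association"
)
nonIndustryTypes <- c(
  "Government",
  "Academic or Other Expert",
  "Consumer Advocacy or other Citizens Group",
  "Trade Union or other Formal Labor Organization",
  "Market Advocacy or other Anti-Regulation Group",
  "Unaffiliated Individual"
)

# Hypothetical lookup table: one row per commenter type.
lookupSketch <- data.frame(
  typology = c(industryTypes, nonIndustryTypes),
  industryClassification = c(rep("Industry", length(industryTypes)),
                             rep("Non-Industry", length(nonIndustryTypes))),
  stringsAsFactors = FALSE
)

# Toy documents with invented topic proportions.
docsSketch <- data.frame(
  Classifications.with.Multivalue = c("Government", "Other Sell-Side Bank", "Un-Coded"),
  topic1 = c(0.7, 0.4, 0.5),
  topic2 = c(0.3, 0.6, 0.5),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps documents whose type has no match in the lookup;
# their industryClassification is NA and can be filtered out explicitly.
mergedSketch <- merge(docsSketch, lookupSketch,
                      by.x = "Classifications.with.Multivalue",
                      by.y = "typology", all.x = TRUE)
print(mergedSketch)
```

Keeping unmatched rows visible makes it easy to check how many documents fall outside the two groups before they are dropped.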

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# add in the Industry binary as a variable to our documents.withTopic dataframe

### The last classification is blank and represents the Uncoded category.
### Omit the blank line from further analysis

indusClassificationDF <- data.frame(typology,industryClassification)


temp <- merge(documents.withTopicsAndMultivalues, indusClassificationDF, by.x = "Classifications.with.Multivalue", by.y="typology") 

# average documents to two rows (Industry and Non-Industry) and N topic columns based on the list above
industryBinaryTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[3:length(names(documents.withTopicsAndMultivalues))] ### omit ID

# report the dataframe where each row is an industry classification and each column is a topic.


#to count the number of comments in each super type:

#drop the uncoded
industryBinaryTopicProportions <- industryBinaryTopicProportions[2:nrow(industryBinaryTopicProportions),]

# report as a table where each cell is the average proportion of topic [column] in the comments of group [row], i.e., of the commenter super-type.  (Does each row always add up to 1? If every document's topic proportions sum to 1, the group averages do too, up to floating-point rounding.)

# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = "Topic Proportions in Comments from Industry and Non-Industry"
filename = paste("Topic_Proportions_by_Industry_NonIndustry_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

tempPlot <- melt(industryBinaryTopicProportions, id.vars="industryClassification")
colnames(tempPlot)<- c("industryClassification","topic","Proportions")

png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")

ggplot(tempPlot, aes(industryClassification, Proportions, fill = topic)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1)) +
  xlab("Commenter Types") +
  ylab("Average Topic Proportions") +
  labs(title = paste(chartTitle, "\n (n = ", nrow(documents.withTopicsAndMultivalues), ")", sep = ""))

dev.off() # close the png device so the file is written to disk


# make a nice pseudo-table of the Proportions

filename = paste("Topic_Proportions_by_Industry_NonIndustry_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")
g <- ggplot(tempPlot, aes(industryClassification, topic)) +
  geom_point(aes(size = Proportions), color = "green") +
  theme_bw() +
  xlab("") +
  ylab("")
g + scale_size_continuous(range = c(1, 10)) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1)) +
  geom_text(aes(label = round(Proportions, 4)), size = 4)

dev.off() # close the png device so the file is written to disk
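
The comments above ask whether the averaged proportions in each row sum to 1. Because the mean is linear, the sum of the per-topic group means equals the mean of the per-document sums, so rows that each sum to 1 average to a row that sums to 1, apart from floating-point rounding. The sketch below checks this on invented data. It uses base R's aggregate() in place of ddply() so that it runs without extra packages, and the objects docsSketch and groupMeans are hypothetical.

```r
# Toy documents: every row of topic proportions sums to exactly 1.
# All values are invented for illustration only.
docsSketch <- data.frame(
  industryClassification = c("Industry", "Industry",
                             "Non-Industry", "Non-Industry", "Non-Industry"),
  topic1 = c(0.50, 0.25, 0.125, 0.25, 0.375),
  topic2 = c(0.25, 0.25, 0.500, 0.25, 0.125),
  topic3 = c(0.25, 0.50, 0.375, 0.50, 0.500)
)

# Average every topic column within each group (one row per group).
groupMeans <- aggregate(. ~ industryClassification, data = docsSketch, FUN = mean)
print(groupMeans)

# Row sums of the averaged proportions, excluding the grouping column.
groupRowSums <- rowSums(groupMeans[, -1])
print(groupRowSums)
```

If the row sums in the real data deviate from 1 by more than rounding error, that would suggest the per-document proportions themselves do not sum to 1 (for example, because some columns were dropped before averaging).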

4.3 Four part comparison of Supertypes

Based on the 17 part typology

Industry Sell-Side

“Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”,

Industry Buy-Side (Derivative and Commodity Users)

“Public Asset Manager”, “Private Asset Manager”, “Non-Financial Firm”, “Financial Sector Association”, “Non-Financial Private-Sector Association”,

Industry Joint Sell-Side and Buy-Side

“U.S. Chamber of Commerce or Affiliate”, “Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”,

Non-Industry Organizations and Individuals

“Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”,

This group is dropped as it is categorically not categorizable

“Un-Coded”

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# add in the four-part super type as a variable to our documents.withTopic dataframe

   "Industry Sell-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Sell-Side"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"

superTypeDF <- data.frame(typology,superType)


temp <- merge(documents.withTopicsAndMultivalues, superTypeDF, by.x = "Classifications.with.Multivalue", by.y="typology" ) 

## Note: there are 96 blanks due to the "Uncoded" classification

#to count the number of comments in each super type:
sum(temp$superType == "Industry Sell-Side")
sum(temp$superType == "Industry Joint Sell-Side and Buy-Side")
sum(temp$superType == "Industry Buy-Side (Derivative and Commodity Users)")

# average documents to one row per super type and N topic columns based on the list above
superTypeTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[3:length(names(documents.withTopicsAndMultivalues))] ### omit ID

# report the dataframe where each row is one of the commenter superTypes and each column is a topic.  


#drop the uncoded
superTypeTopicProportions <- superTypeTopicProportions[2:nrow(superTypeTopicProportions),]

# report as a table where each cell is the average proportion of topic [column] in the comments of group [row], i.e., of the commenter super-type.  (Does each row always add up to 1? If every document's topic proportions sum to 1, the group averages do too, up to floating-point rounding.)

# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = "Topic Proportions in Comments from 4 Super Types"
filename = paste("Topic_Proportions_by_SuperType_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

tempPlot <- melt(superTypeTopicProportions, id.vars="superType")
colnames(tempPlot)<- c("superType","topic","Proportions")

png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")

ggplot(tempPlot, aes(superType, Proportions, fill = topic)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1)) +
  xlab("Commenter Types") +
  ylab("Average Topic Proportions") +
  labs(title = paste(chartTitle, "\n (n = ", nrow(documents.withTopicsAndMultivalues), ")", sep = ""))

dev.off() # close the png device so the file is written to disk


# make a nice pseudo-table of the Proportions

filename = paste("Topic_Proportions_by_SuperType_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")
g <- ggplot(tempPlot, aes(superType, topic)) +
  geom_point(aes(size = Proportions), color = "green") +
  theme_bw() +
  xlab("") +
  ylab("")
g + scale_size_continuous(range = c(1, 10)) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1)) +
  geom_text(aes(label = round(Proportions, 4)), size = 5)

dev.off() # close the png device so the file is written to disk
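
The per-type counts and the removal of the blank (uncoded) row in this section can also be done by value rather than by position. The following is a hedged sketch on invented data: table() gives all counts in one call, and filtering on the blank label does not depend on the blank row sorting first. The objects tempSketch, typeCounts, avgSketch and avgCoded are hypothetical, and base R's aggregate() stands in for ddply().

```r
# Toy merged data: a blank super type marks uncoded documents.
# All values are invented for illustration only.
tempSketch <- data.frame(
  superType = c("", "Industry Sell-Side", "Industry Sell-Side",
                "Non-Industry Organizations and Individuals", ""),
  topic1 = c(0.5, 0.25, 0.75, 0.5, 0.5),
  topic2 = c(0.5, 0.75, 0.25, 0.5, 0.5),
  stringsAsFactors = FALSE
)

# Count the documents in every super type at once (blank included).
typeCounts <- table(tempSketch$superType)
print(typeCounts)

# Average the topic columns within each super type.
avgSketch <- aggregate(. ~ superType, data = tempSketch, FUN = mean)

# Drop the uncoded group by its label instead of by row position.
avgCoded <- avgSketch[avgSketch$superType != "", ]
print(avgCoded)
```

Filtering by label gives the same result as dropping the first row whenever the blank group sorts first, but it keeps working if the row order changes.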

4.4 Five part comparison of Supertypes

Based on the 17 part typology

Industry Sell-Side

“Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”,

Industry Buy-Side (Derivative and Commodity Users)

“Public Asset Manager”, “Private Asset Manager”, “Non-Financial Firm”, “Financial Sector Association”, “Non-Financial Private-Sector Association”,

Industry Joint Sell-Side and Buy-Side

“U.S. Chamber of Commerce or Affiliate”,

Infrastructure and Market Services

“Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”,

Non-Industry Organizations and Individuals

“Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”,

This group is dropped as it is categorically not categorizable

“Un-Coded”
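
One way to keep a grouping like the one listed above readable is to write it as a named list (one entry per super type) and flatten it into a lookup table. The sketch below follows the prose listing only; it is not the original code, and the objects fiveTypeGroups and fiveTypeLookup are hypothetical names chosen for the example.

```r
# Super types and their member types, copied from the listing above.
fiveTypeGroups <- list(
  "Industry Sell-Side" = c(
    "Major Wall Street Sell-Side Bank",
    "Core Financial Service Trade Association",
    "Other Sell-Side Bank"),
  "Industry Buy-Side (Derivative and Commodity Users)" = c(
    "Public Asset Manager",
    "Private Asset Manager",
    "Non-Financial Firm",
    "Financial Sector Association",
    "Non-Financial Private-Sector Association"),
  "Industry Joint Sell-Side and Buy-Side" = c(
    "U.S. Chamber of Commerce or Affiliate"),
  "Infrastructure and Market Services" = c(
    "Market Infrastructure Firm",
    "Law Firms, Consultants, and Related Advisors"),
  "Non-Industry Organizations and Individuals" = c(
    "Government",
    "Academic or Other Expert",
    "Consumer Advocacy or other Citizens Group",
    "Trade Union or other Formal Labor Organization",
    "Market Advocacy or other Anti-Regulation Group",
    "Unaffiliated Individual")
)

# stack() flattens the list: "values" holds the member types and
# "ind" holds the name of the list entry each one came from.
fiveTypeLookup <- stack(fiveTypeGroups)
names(fiveTypeLookup) <- c("typology", "FiveType")
fiveTypeLookup$typology <- as.character(fiveTypeLookup$typology)
fiveTypeLookup$FiveType <- as.character(fiveTypeLookup$FiveType)

# Sanity checks: 17 types in total, each assigned to exactly one super type.
print(nrow(fiveTypeLookup))
print(anyDuplicated(fiveTypeLookup$typology))
```

A lookup built this way can be passed to merge() with by.y = "typology", and the type-to-group pairing stays visible in one place instead of depending on the order of two parallel vectors.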

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# add in the five-part super type as a variable to our documents.withTopic dataframe

   "Industry Sell-Side"
  ,"Industry Sell-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Infrastructure and Market Services"
  ,"Infrastructure and Market Services"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"

FiveTypeDF <- data.frame(typology,FiveType)


temp <- merge(documents.withTopicsAndMultivalues, FiveTypeDF, by.x = "Classifications.with.Multivalue", by.y="typology" ) 

## Note: there are 96 blanks due to the "Uncoded" classification
##counts number of documents with a blank

sum(temp$FiveType == "Industry Sell-Side")
sum(temp$FiveType == "Industry Joint Sell-Side and Buy-Side")
sum(temp$FiveType == "Industry Buy-Side (Derivative and Commodity Users)")
sum(temp$FiveType == "Infrastructure and Market Services")

# average documents to one row per super type and N topic columns based on the list above
fiveTypeTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[3:length(names(documents.withTopicsAndMultivalues))] ### omit ID

# report the dataframe where each row is one of the commenter superTypes and each column is a topic.  


#drop the uncoded
fiveTypeTopicProportions <- fiveTypeTopicProportions[2:nrow(fiveTypeTopicProportions),]

# report as a table where each cell is the average proportion of topic [column] in the comments of group [row], i.e., of the commenter super-type.  (Does each row always add up to 1? If every document's topic proportions sum to 1, the group averages do too, up to floating-point rounding.)

# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = "Topic Proportions in Comments In Five Super Types"
filename = paste("Topic_Proportions_by_Five_Types_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

tempPlot <- melt(fiveTypeTopicProportions, id.vars="FiveType")
colnames(tempPlot)<- c("FiveType","topic","Proportions")

png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")

ggplot(tempPlot, aes(FiveType, Proportions, fill = topic)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1)) +
  xlab("Commenter Types") +
  ylab("Average Topic Proportions") +
  labs(title = paste(chartTitle, "\n (n = ", nrow(documents.withTopicsAndMultivalues), ")", sep = ""))

dev.off() # close the png device so the file is written to disk


# make a nice pseudo-table of the Proportions
filename = paste("Topic_Proportions_by_Five_Types_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")
g <- ggplot(tempPlot, aes(FiveType, topic)) +
  geom_point(aes(size = Proportions), color = "green") +
  theme_bw() +
  xlab("") +
  ylab("")
g + scale_size_continuous(range = c(1, 10)) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1, vjust = 1)) +
  geom_text(aes(label = round(Proportions, 4)), size = 5)

dev.off() # close the png device so the file is written to disk