This is the syntax which was used to process the raw data file from the Commodity Futures Trading Commission (CFTC) through to the final results published in:

Posch, Konrad, Nath, Thomas and Ziegler, J. Nicholas. “The Limits of Interest: Capture, Financialization, or Contestation in the Politics of Rule-Making for Derivatives”

Narrative online Appendices: (Posch, Nath, & Ziegler 2019-11-24) The Limits of Interest – Watson Working Paper – Online Appendices

Syntax appear below:

Setup Environment
0. Prepare Corpus
1. Estimate Mallet Topics
- 1.1 Clean the comment database
  - 1.1.1 Clean the database
  - 1.1.2 Save a Clean Copy of the Database
- 1.2 Generate the mallet topic model
  - 1.2.1 Run the topic model
2. Substantive Analyses of Groups of Comments (sub-Corpora)
3. Who Commented on the 10 most commented rules in the Coded (8k) dataset
4. Deprecated Additional Analyses

While the dataset was produced from internal CFTC servers (because the CFTC did not participate in Regulations.gov in 2014), they explicitly only provided publicly available data. Thus, the full dataset of 37,703 comments will be available from Dryad (https://doi.org/10.6078/D1610G) upon publication along with the codebook provided by the CFTC and amended with our additional classification variables as used below and described in the manuscript and online supporting information C.

It should be noted that the CFTC does have additional meta-data which they declined to provide (i.e. IP addresses for commenters and other personally identifiable information) which they noted could be obtained through a Freedom of Information Act (FOIA) request. We did not pursue this avenue, but future researchers interested in, for example, the geographic distribution of commenters could request such data.

The Latent Dirichlet Analysis (LDA) used here is executed with Mallet. Mallet is a popular topic modelling software application. Mallet is written in java – which means it’s fast – and R has a Mallet wrapper package – which means it’s more userfriendly.

However, Mallet is a very resource-heavy app (because Java loves to use resources) so optimizing the number of working threads is key to running the code in a reasonable amount of time. Details are provided in the Setup Environment syntax, but the gist is you want to experiment with the number of threads based on your configuration to get the fastest runtime possible. Runtime for the 12,000 iteration topic model used to generate results for publication took on the order of 36 minutes using a Ryzen 3600 CPU (6 core, 12 logical processors, 4.1Ghz, 12 threads). The old Phenom II 720 processor (3 core, 3 logical processors, 3.5GHz, 15 threads) took on the order of 3.5 hours. Configuration and processor power matter a great deal to how this code runs (although anything can run it eventually).

NOTE: While it should NOT affect the stability of results, apparently the number of threads does. There is an element of randomness to topic modelling so while topics created from a corpus run with different random seeds will be substantively similar, they may appear in different orders and with slightly different topic proportions. To create reproducible results, this syntax sets the random seed (see section 1.2.1). However, it also appears that the number of threads is affecting output as well, so if you would like to exactly reproduce our results in the manuscript, you need to leave the “number of logical processors” value (numLogicalProcessors) in Section 0 set to 15.

Setup Environment

##clears R memory
rm(list=ls())

## Mallet is a massive pig when it comes to processing (roughly 11 seconds/10 iterations).  Enter the number of logical processors below to enable multi-threading
numLogicalProcessors = 15

# Set the number of iterations. 1000 for testing, 12000 for analysis
iterations <- 12000


## desktop wd
wd <- "D:/Dropbox/The Limits of Interest"

## Laptop wd
#wd <- "C:/Dropbox/The Limits of Interest"

setwd(wd)


## install REQUIRED non-default packages if not yet installed
#install.packages(c("mallet","rJava","RTextTools","SnowballC","plyr","dplyr","ggplot2","reshape2","xtable","qdap","knitr","viridis"))

#######################################################################################
##                                                                                   ##
##  NOTE: For mallet to work, you must install the Java Development Kit (JDK)        ##
##       not just the runtime environment (JRE).  This is free on the java website,  ##
##       but must be installed on your system before you can call the mallet         ##
##       library below.                                                              ##
##                                                                                   ##
#######################################################################################


options(java.parameters = "-Xmx8000m") #this is the magic solution to the out of memory error as Mallet limits RAM usage to 1gb which is much too small for our dataset

# load REQUIRED libraries
library(rJava) # the interface between R and Java, needed for mallet
library(mallet) # a wrapper around the Java machine learning tool MALLET
library(SnowballC) # for stemming
library(tm) # Framework for text mining
#library(RTextTools) # a machine learning package for text classification written in R
library(plyr)
library(dplyr) # Data preparation and pipes $>$
library(ggplot2) # for plotting word frequencies
library(scales)
library(stringr)
library(reshape2) #visualization
library(xtable) # pretty tables
library(knitr) # final output and file handling
library(viridis) #Inclusive design palettes for stacked bar charts because inclusive design is good design

0. Prepare Corpus

This module reads in the raw data and prepares it for topic modeling.

0.0 Combine the CFTC raw data file with the organization codings in this file from (2017-05-11) syntax file

The CFTC provided all comments submitted during the public comment periods on all rules they have written to implement the Dodd-Frank Act.

It is INCREDIBLY important that this combination happen in R and not Excel as Excel has a 32,765 (32,767-2 text qualifier “) character limit for it’s cells and it will truncate any data placed in there. There are a number of comment letters which are longer than this limit (ex. 8895, 26166, 26171) so all matching and processing MUST happen in R.

Note: the original data file had to be changed to UTF-8 from unknown (possibly Unicode) format using Notepad.exe because otherwise nothing showed up in R.

# load data
cftc.Comments <- read.delim("D:/Dropbox/The Limits of Interest/DoddFrankCommentsAll(UTF-8).txt",
                            sep="|",
                            header=TRUE,
                            stringsAsFactors = FALSE,
                            encoding ="UTF-8",
                            strip.white = TRUE,
                            quote = "" #this is a unbelieveably vital inclusion otherwise you lose cases!
                            ) ##These are the comments on proposed regulations (NPRMs)

####Check to see if we have unexpected blank data on import, esp. in CommentText
sum(1*(cftc.Comments$CommentText=="")) # 1036 Expected, although will be only 86 for cases we analyze, see 0.1 below

## [1] 1036

#CHeck the variable names
names(cftc.Comments)

## [1] "X.U.FEFF.ControlNumberID" "SubmitDate"              
## [3] "UniqueName"               "FirstName"               
## [5] "LastName"                 "Organization"            
## [7] "Title"                    "CommentText"             
## [9] "ExtractedText"

#Fix the oddly named first variable
colnames(cftc.Comments)[1] <- "ControlNumberID"
names(cftc.Comments)

## [1] "ControlNumberID" "SubmitDate"      "UniqueName"      "FirstName"      
## [5] "LastName"        "Organization"    "Title"           "CommentText"    
## [9] "ExtractedText"

#load in the classified Organization values and meta-data
cftc.Comments.metaData<-read.csv("D:/Dropbox/The Limits of Interest/(2019-06-12) Final Classifications for R_noText.csv", stringsAsFactors = FALSE)

####################################################################

#CHeck the variable names
names(cftc.Comments.metaData)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "In.NZ.and.LE.memos"               
##  [9] "Rule.Super.set..from.Memo.Title." 
## [10] "Vlookup.Key"                      
## [11] "Final.Classification"             
## [12] "Final.Classification.Multivalue.2"
## [13] "Final.Classification.Multivalue.3"
## [14] "Final.Classification.Multivalue.4"
## [15] "Final.Classification.Multivalue.5"
## [16] "VA.KSC.Classification"            
## [17] "VA.KSC.Multivalue.2"              
## [18] "VA.KSC.Multivalue.3"              
## [19] "VA.KSC.Multivalue.4"              
## [20] "VA.KSC.Multivalue.5"              
## [21] "VA.KSC.Non.US...Non.US.or.Blank." 
## [22] "LE.Classification"                
## [23] "LE.Multivalue.2"                  
## [24] "LE.Multivalue.3"                  
## [25] "LE.Multivalue.4"                  
## [26] "LE.Non.US...Non.US.or.Blank."

#Fix the oddly named first variable
colnames(cftc.Comments.metaData)[1] <- "ControlNumberID"
names(cftc.Comments.metaData)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "In.NZ.and.LE.memos"               
##  [9] "Rule.Super.set..from.Memo.Title." 
## [10] "Vlookup.Key"                      
## [11] "Final.Classification"             
## [12] "Final.Classification.Multivalue.2"
## [13] "Final.Classification.Multivalue.3"
## [14] "Final.Classification.Multivalue.4"
## [15] "Final.Classification.Multivalue.5"
## [16] "VA.KSC.Classification"            
## [17] "VA.KSC.Multivalue.2"              
## [18] "VA.KSC.Multivalue.3"              
## [19] "VA.KSC.Multivalue.4"              
## [20] "VA.KSC.Multivalue.5"              
## [21] "VA.KSC.Non.US...Non.US.or.Blank." 
## [22] "LE.Classification"                
## [23] "LE.Multivalue.2"                  
## [24] "LE.Multivalue.3"                  
## [25] "LE.Multivalue.4"                  
## [26] "LE.Non.US...Non.US.or.Blank."

#check the classifications for Final.Classification
unique(cftc.Comments.metaData$Final.Classification)

##  [1] "Un-Coded"                                      
##  [2] "Private Asset Manager"                         
##  [3] "Market Infrastructure Firm"                    
##  [4] "Law Firms, Consultants, and Related Advisors"  
##  [5] "Unaffiliated Individual"                       
##  [6] "Non-Financial Firm"                            
##  [7] "Academic or Other Expert"                      
##  [8] "Non-Financial Private-Sector Association"      
##  [9] "Government"                                    
## [10] "Financial Sector Association"                  
## [11] "Consumer Advocacy or other Citizens Group"     
## [12] "Core Financial Service Trade Association"      
## [13] "U.S. Chamber of Commerce or Affiliate"         
## [14] "Major Wall Street Sell-Side Bank"              
## [15] "Public Asset Manager"                          
## [16] "Other Sell-Side Bank"                          
## [17] "Trade Union or other Formal Labor Organization"
## [18] "Market Advocacy or other Anti-Regulation Group"

Now there are only some of the comments with an organization val, 8568 to be precise. This chunk produces 9450 comments but that includes NA values which are seen as non-empty. Final cull to 8568 happens when matched to Final.Classification

For our analysis, those are the only comments we need to topic model, so let’s cut the database down to size.

# Let's subset out the comments with organization values which are the ones we want
cftc.Comments.withOrg <- subset(cftc.Comments, Organization!="")

# lets count the total number of unique org values
length(unique(cftc.Comments.withOrg$Organization))

## [1] 4752

# lets see a few of those
head(unique(cftc.Comments.withOrg$Organization),20)

##  [1] "NOYB"                         "Home Communications"         
##  [3] "khterFX DBA"                  "FXP Intl"                    
##  [5] "Gain Capital/Forex"           "STG Holdings LC"             
##  [7] "Very"                         "FX Bridge Technologies"      
##  [9] "Black Bay FX Services LLC"    "Forex Power Net"             
## [11] "Forex Trader"                 "Bay View Trading Corporation"
## [13] "Back Bay FX Services LLC"     "ProAct Traders LLC"          
## [15] "Forex Community"              "Silicon Valley Forex"        
## [17] "Partick Smith Realty"         "Gami/Tati"                   
## [19] "Yoder & Associates Inc"       "Kelly Industries"

# see a truncated version of this now smaller data, truncdf seems to be deprecated after R 2.3.2
#head(truncdf(cftc.Comments.withOrg,10))

Finally, we need to load in the meta-data related to classification and super group

#First, drop the duplicate metadata
cftc.Comments.metaData <- subset(cftc.Comments.metaData, select=c(
ControlNumberID
,In.NZ.and.LE.memos
,Rule.Super.set..from.Memo.Title.
,Final.Classification
,Final.Classification.Multivalue.2
,Final.Classification.Multivalue.3
,Final.Classification.Multivalue.4
,Final.Classification.Multivalue.5
,VA.KSC.Classification
,VA.KSC.Multivalue.2
,VA.KSC.Multivalue.3
,VA.KSC.Multivalue.4
,VA.KSC.Multivalue.5
,VA.KSC.Non.US...Non.US.or.Blank.
,LE.Non.US...Non.US.or.Blank.
,LE.Classification
,LE.Multivalue.2
,LE.Multivalue.3
,LE.Multivalue.4)
)

#merge the meta-data dataframe with the comment dataframe, NOTEL this drops 889 cases (9457->8568 which are some version of none or N/A as entered in the original dataset by the commenter)
cftc.Comments.withOrgAndMeta<- merge(cftc.Comments.withOrg, cftc.Comments.metaData, by="ControlNumberID")

#Lets see where we have some missing data in text variables (such as ExtractedText)
sum(1*(cftc.Comments.withOrgAndMeta$ExtractedText=="")) #should be 5, some of the original file had no text here

## [1] 5

sum(1*(cftc.Comments.withOrgAndMeta$CommentText=="")) #should be 86

## [1] 86

sum(1*(cftc.Comments.withOrgAndMeta$Final.Classification=="")) #should be 0

## [1] 0

#Reorder the columns so the massive and messy variables of CommentText and Extracted Text appear at the end of the file for cleanliness
names(cftc.Comments.withOrgAndMeta)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "CommentText"                      
##  [9] "ExtractedText"                    
## [10] "In.NZ.and.LE.memos"               
## [11] "Rule.Super.set..from.Memo.Title." 
## [12] "Final.Classification"             
## [13] "Final.Classification.Multivalue.2"
## [14] "Final.Classification.Multivalue.3"
## [15] "Final.Classification.Multivalue.4"
## [16] "Final.Classification.Multivalue.5"
## [17] "VA.KSC.Classification"            
## [18] "VA.KSC.Multivalue.2"              
## [19] "VA.KSC.Multivalue.3"              
## [20] "VA.KSC.Multivalue.4"              
## [21] "VA.KSC.Multivalue.5"              
## [22] "VA.KSC.Non.US...Non.US.or.Blank." 
## [23] "LE.Non.US...Non.US.or.Blank."     
## [24] "LE.Classification"                
## [25] "LE.Multivalue.2"                  
## [26] "LE.Multivalue.3"                  
## [27] "LE.Multivalue.4"

cftc.Comments.withOrgAndMeta<-cftc.Comments.withOrgAndMeta[c(1:7,10:24,8,9)]

## Write this prepared corpus to a data file
write.table(cftc.Comments.withOrgAndMeta,file="DoddFrankCommentsWithOrgValueAndMetaData.txt",sep="|",row.names = FALSE,quote=FALSE)

0.1 Load the Data

First let’s read in our data. The corpus we’ll be using today is a database of CFTC comments related to Dodd-Frank Financial Reform Implementation which was created as described in step 0.0 above.

#read in text, | delimited file
documents <- read.delim("DoddFrankCommentsWithOrgValueAndMetaData.txt",
                            sep="|",
                            header=TRUE,
                            stringsAsFactors = FALSE,
                            quote = "" #this is a unbelieveably vital inclusion otherwise you                                                        lose cases!  This turns off quotes which is important for the full text                                        otherwise cases get wrapped into previous cases.  It is causing all the                                        data read in to have quotes wrapped around it and to have blank values                                        stored as ""
                            ) ##These are the comments on proposed rulemaking

# examine the variables names in our dataset
names(documents)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "In.NZ.and.LE.memos"               
##  [9] "Rule.Super.set..from.Memo.Title." 
## [10] "Final.Classification"             
## [11] "Final.Classification.Multivalue.2"
## [12] "Final.Classification.Multivalue.3"
## [13] "Final.Classification.Multivalue.4"
## [14] "Final.Classification.Multivalue.5"
## [15] "VA.KSC.Classification"            
## [16] "VA.KSC.Multivalue.2"              
## [17] "VA.KSC.Multivalue.3"              
## [18] "VA.KSC.Multivalue.4"              
## [19] "VA.KSC.Multivalue.5"              
## [20] "VA.KSC.Non.US...Non.US.or.Blank." 
## [21] "LE.Non.US...Non.US.or.Blank."     
## [22] "LE.Classification"                
## [23] "CommentText"                      
## [24] "ExtractedText"

#Lets see where we have some missing data in text variables (such as ExtractedText)
sum(1*(documents$ExtractedText=="")) #should be 5

## [1] 5

sum(1*(documents$CommentText=="")) #should be 86

## [1] 86

#confirm that every comment has a classification
sum(1*(documents$Final.Classification=="")) #should be 0

## [1] 0

0.1.1 Drop Ex Parte Meetings and Comments “None” type values in Organization

The CFTC included 286 ex parte meetings in their dataset. Since the data generating process for those comments is not the same as that for the other comments, we need to drop them from the analysis. These cases can be identified based on the value "Ex Parte" in the FirstName variable.

There are also 20 cases with organization values which are uninterpretable. They must be dropped

#Check the number of ex parte meetings
sum(1*(documents$FirstName=="Ex Parte")) #should be 286

## [1] 286

# Subset the data to include only those cases which are not ex parte meetings
documents <- subset(documents, FirstName!="Ex Parte")

# verify that the correct number of comments remain
nrow(documents) #should be 8282

## [1] 8282

# create the list of junk org values
junkOrgValues <- c("(NONE)",".","/","[none]","-none-","None","None ","None.","none","none ","none.")

junkOrgValues

##  [1] "(NONE)" "."      "/"      "[none]" "-none-" "None"   "None " 
##  [8] "None."  "none"   "none "  "none."

#check number of comments with junk org values
sum(1*(documents$Organization %in% junkOrgValues)) #should be 20

## [1] 20

# Subset the data to include only those cases which do not have junk organization values
documents <- subset(documents, !(Organization %in% junkOrgValues))

# verify that the correct number of comments remain
nrow(documents) #should be 8262

## [1] 8262

0.1.2 Get Unique Organization Values from the Corpus for Data Coding Purposes

In order to code comments to our typology, we need a list of all unique values in the Organization variable.

# count the number of unique values in Organization
length(unique(documents$Organization)) #should be 4671

## [1] 4603

# get the unique list of Organization values with counts of relevant comments
validOrgValues <- as.data.frame(table(documents$Organization))

colnames(validOrgValues)[1]<- "Organization"

## look at the first 30 and notice that we have a number that are conceptually identical but technically distinct due to trival variations such as punctuation and spacing.  These are solved by conceptual coding to avoid accidental loss of data
head(validOrgValues, 30)

##                                                                                                                                                                                                Organization
## 1                                                                                 - National Rural Electric Cooperative Association;®- American Public Power Association; and®- Large Public Power Council.
## 2                                                                                                                                                                                    (a private individual)
## 3                                                                                                                                                                                              (individual)
## 4                                                                                                                                      (none; private investor, retired from career in banking and finance)
## 5                                                                                                                                                                                        (Private Investor)
## 6                                                                                                                                                                              ®  ®Eden Earth Resources LLC
## 7                                                                                                                         ®Free citizen, tax payer, and in 10 years been promised a social security payback
## 8                                                                                                                                                                                                     ®self
## 9                                                                                                                                                                                                        00
## 10 1,500 contracts is a fair and appropriate position limit in silver. 1,500 should be the maximum (or 7 million ounces).  The proposed limit of 5,000 + will not solve the manipulation problem in silver.
## 11                                                                                                                                                                               17 CFR  (SDR)®79 FR  16672
## 12                                                                                                                                                                          19 Community and Regional Banks
## 13                                                                                                                                                           1st Infantry Division, Combat Aviation Brigade
## 14                                                                                                                                                                                            25th Aviation
## 15                                                                                                                                                                                                 36 South
## 16                                                                                                                                                                                     3Degrees Group, Inc.
## 17                                                                                                                                                                                             3F Forecasts
## 18                                                                                                                                                                                               401K owner
## 19                                                                                            56 National coalitions and organizations and ®28 International coalitions and organizations from 16 countries
## 20                                                                                             56 National coalitions and organizations and®28 International coalitions and organizations from 16 countries
## 21                                                                                                                                                                                      79 FR 1347 (17 CFR)
## 22                                                                                                                                                                                                      965
## 23                                                                                                                                                                                             A Bonvouloir
## 24                                                                                                                                                                   A citizen for a fair and just society.
## 25                                                                                                                                                          a concerned citizen of united states of America
## 26                                                                                                                                                                                                 A dad!!!
## 27                                                                                                                                                                                          A regional bank
## 28                                                                                             A screwed American due to the Corruption that is so widespread throughout the Government including the CFTC.
## 29                                                                                                                                                                           A slave to the corrupt system.
## 30                                                                                                                                                                                   A Slave to the system.
##    Freq
## 1     1
## 2     1
## 3     1
## 4     1
## 5     1
## 6     1
## 7     1
## 8     1
## 9     1
## 10    1
## 11    1
## 12    2
## 13    1
## 14    1
## 15    1
## 16    1
## 17    1
## 18    1
## 19    1
## 20    1
## 21    1
## 22    1
## 23    2
## 24    1
## 25    1
## 26    1
## 27    3
## 28    1
## 29    1
## 30    1

## Sort in decending order of frequency to see top values
validOrgValues <- validOrgValues[order(validOrgValues$Freq, decreasing = TRUE),]

## examine the top 30 most common raw values, notice seemingly identical values which are actually slightly different necessitating manual conceptual coding by analysts
head(validOrgValues, 30)

##                                                                       Organization
## 3702                                                                          self
## 3703                                                                          Self
## 251                                                 Americans for Financial Reform
## 3306                                                              Private Investor
## 1848                                                                    Individual
## 477                                                           Better Markets, Inc.
## 1982                                                                      Investor
## 2416                                                     Managed Funds Association
## 1861                                                           Individual Investor
## 3303                                                              private investor
## 3305                                                              Private investor
## 1847                                                                    individual
## 3259                                                               private citizen
## 3261                                                               Private Citizen
## 3466                                                                       Retired
## 3253                                                                       Private
## 3252                                                                       private
## 845                                  Coalition of Physical Energy Companies (COPE)
## 1752                                                         Hunton & Williams LLP
## 3260                                                               Private citizen
## 506                                                                BlackRock, Inc.
## 1387                                                 Financial Services Roundtable
## 2451                                                                    MarkitSERV
## 777                                                                        Citizen
## 1860                                                           Individual investor
## 4338                                                          United States Senate
## 2009                                                                          ISDA
## 3465                                                                       retired
## 344  Asset Management Group, Securities Industry and Financial Markets Association
## 4090                                   The Depository Trust & Clearing Corporation
##      Freq
## 3702   91
## 3703   79
## 251    76
## 3306   76
## 1848   73
## 477    60
## 1982   59
## 2416   55
## 1861   50
## 3303   50
## 3305   47
## 1847   44
## 3259   44
## 3261   39
## 3466   38
## 3253   36
## 3252   35
## 845    33
## 1752   33
## 3260   33
## 506    30
## 1387   30
## 2451   30
## 777    29
## 1860   29
## 4338   29
## 2009   28
## 3465   28
## 344    26
## 4090   26

## For easy coding, sort validOrgValues alphabetically by Organization column
validOrgValues <- validOrgValues[order(validOrgValues$Organization, decreasing = FALSE),]

## Write that list to a | delimited file for easy copy into excel and sending to coders
filename = paste("Unique_Valid_Organization_Values_withCounts_",nrow(documents),"_comments_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(validOrgValues, row.names=FALSE, col.names=TRUE, file = filename, sep = "|")

filename

## [1] "Unique_Valid_Organization_Values_withCounts_8262_comments_2019-12-01_11.57.15.txt"

0.2 Created a unified text column to solve the gaps in CommentText and ExtractedText

As seen above in section 0.1, there are 5 blank values in ExtractedText. We need to create a unified text column called UnifiedText which includes a value for all comments by filling the 5 blanks with the values from CommentText.

## View the 5 comments which have missing ExtractedText values and confirm that they have CommentText Values
#   Truncate the printout so we don't get a mess

str_trunc(documents[documents$ExtractedText=="",]$CommentText,300) #only 5 results expected, confirm that all [index] numbers are followed by non-NULL values that look like comment letter text

## [1] "Dear Sirs:®®There has been way too much interference with the silver market by certain big players, artificially keeping prices totally out of realistic market ranges.  The CFTC and the Comex were never designed to facilitate market manipulation, but with the existing rules, that has become the c..."
## [2] "Futures are used in the real world to control costs, protect profits, reduce liability.   Real people, producing real commodities, producing real good that people use and consume.®Of they are also used as a trading vehicle, of which I consider some trading to be useful as to a price discovery.®®B..."
## [3] "You must not allow more than 1,500 contracts as the appropriate concentration limit for the COMEX Silver market.  Our liberty is directly related to this issue."                                                                                                                                            
## [4] "I urge you to please curb excessive gambling in commodities markets like food and oil. ®®While many factors contribute to today’s highly volatile commodity prices, it is clear that excessive speculation is significantly responsible, as shown in dozens of studies by several respected institutions ..."
## [5] "Title: End-User Exception to Mandatory Clearing of Swaps®FR Document Number: 2010-31578®Legacy Document ID: ®RIN: 3038-AD10®Publish Date: 12/23/2010 12:00:00 AM®®Submitter Info:®First Name:  Fred®Last Name:  Nadelman, LMSW®Mailing Address:  1825 East Gwinnett Street®City:  Savannah®Country:  Unit..."

## Create a new combined text variable which will be used for anlysis called "UnifiedText" and populate it with ExtractedText for all 8257 comments (total comments minus 5) which have a value for ExtractedText.
documents$UnifiedText <- documents$ExtractedText

## Fill the 5 missing values in ExtractedText using CommentText and save to new variable UnifiedText
documents[documents$ExtractedText=="",]$UnifiedText <- documents[documents$ExtractedText=="",]$CommentText

## Confirm that UnifiedText has no empty values
sum(1*(documents$UnifiedText=="")) #should be 0

## [1] 0

## Confirm that the 5 comments which had missing ExtractedText values have CommentText values in UnifiedText
documents[documents$ExtractedText=="",]$UnifiedText == documents[documents$ExtractedText=="",]$CommentText #should be 5 TRUE values

## [1] TRUE TRUE TRUE TRUE TRUE

0.3 Make the 18-part typology a factor

Now, for representativeness, we need to convert the 18 part typology (17 substantive categories + 1 “Un-coded”) into a factor because the typology was constructed to span a spectrum roughly from industry insider (top) to general public (bottom) with the residual category of “Un-Coded” at the end.

## confirm that Final.Classification has no erroneous spellings (should have 18 unique values)
unique(documents$Final.Classification)

##  [1] "Un-Coded"                                      
##  [2] "Private Asset Manager"                         
##  [3] "Market Infrastructure Firm"                    
##  [4] "Law Firms, Consultants, and Related Advisors"  
##  [5] "Unaffiliated Individual"                       
##  [6] "Non-Financial Firm"                            
##  [7] "Academic or Other Expert"                      
##  [8] "Non-Financial Private-Sector Association"      
##  [9] "Government"                                    
## [10] "Financial Sector Association"                  
## [11] "Consumer Advocacy or other Citizens Group"     
## [12] "Core Financial Service Trade Association"      
## [13] "U.S. Chamber of Commerce or Affiliate"         
## [14] "Major Wall Street Sell-Side Bank"              
## [15] "Public Asset Manager"                          
## [16] "Other Sell-Side Bank"                          
## [17] "Trade Union or other Formal Labor Organization"
## [18] "Market Advocacy or other Anti-Regulation Group"

##c creates a vector of the different typologies
documents$Final.Classification <- factor(documents$Final.Classification, levels = c(
  "Major Wall Street Sell-Side Bank",
  "Other Sell-Side Bank",
  "Core Financial Service Trade Association",
  "Financial Sector Association",
  "Public Asset Manager",
  "Private Asset Manager",
  "U.S. Chamber of Commerce or Affiliate",
  "Market Infrastructure Firm",
  "Law Firms, Consultants, and Related Advisors",
  "Non-Financial Private-Sector Association",
  "Non-Financial Firm",
  "Government",
  "Academic or Other Expert",
  "Consumer Advocacy or other Citizens Group",
  "Trade Union or other Formal Labor Organization",
  "Market Advocacy or other Anti-Regulation Group",
  "Unaffiliated Individual",
  "Un-Coded")
  )

# now, verify that we indeed have an ORDERED 17 level typology as listed above (look at [indecies] and confirm it goes in the order listed in the above creation statement):

levels(documents$Final.Classification)

##  [1] "Major Wall Street Sell-Side Bank"              
##  [2] "Other Sell-Side Bank"                          
##  [3] "Core Financial Service Trade Association"      
##  [4] "Financial Sector Association"                  
##  [5] "Public Asset Manager"                          
##  [6] "Private Asset Manager"                         
##  [7] "U.S. Chamber of Commerce or Affiliate"         
##  [8] "Market Infrastructure Firm"                    
##  [9] "Law Firms, Consultants, and Related Advisors"  
## [10] "Non-Financial Private-Sector Association"      
## [11] "Non-Financial Firm"                            
## [12] "Government"                                    
## [13] "Academic or Other Expert"                      
## [14] "Consumer Advocacy or other Citizens Group"     
## [15] "Trade Union or other Formal Labor Organization"
## [16] "Market Advocacy or other Anti-Regulation Group"
## [17] "Unaffiliated Individual"                       
## [18] "Un-Coded"

1. Estimate Mallet Topics

Now that we have a corpus of comments, we need to train a topic model

1.1 Clean the comment database

This is preprocessing that includes stripping whitespace, punctuation, stopwords, and stemming the document. It then saves a backup of the database in this state to save processing time when carrying out exploratory analysis. For the same reason, each function which takes significant processing time is wrapped in system.time() to show how long processing takes displayed in seconds.

1.1.1 Clean the database

# remove extraneous whitespace
documents$PreppedText <- stripWhitespace(documents$UnifiedText)

# remove punctuation
documents$PreppedText <- gsub(pattern="[[:punct:]]",replacement="",documents$PreppedText)

# convert all to lowercase
documents$PreppedText <- tolower(documents$PreppedText)


# Create Stopwords list based on standard english and context specific words listed below 
mystopwords <-c(stopwords("english"), "market", "trade", "propos", "rule", "please", "financi", "make", "act", "cftc")

# remove common words (stop words) before stemming to avoid the "ani" problem where any is change to ani and then not removed by mallet's stopword function.
Sys.time()

## [1] "2019-12-01 11:57:25 PST"

system.time(
  documents$PreppedText<-removeWords(documents$PreppedText, mystopwords)
  ) #this takes about 19 sec on 4.1 GHz desktop processor

##    user  system elapsed 
##   18.66    0.00   18.66

sort(mystopwords) # check out what was removed

##   [1] "a"          "about"      "above"      "act"        "after"     
##   [6] "again"      "against"    "all"        "am"         "an"        
##  [11] "and"        "any"        "are"        "aren't"     "as"        
##  [16] "at"         "be"         "because"    "been"       "before"    
##  [21] "being"      "below"      "between"    "both"       "but"       
##  [26] "by"         "can't"      "cannot"     "cftc"       "could"     
##  [31] "couldn't"   "did"        "didn't"     "do"         "does"      
##  [36] "doesn't"    "doing"      "don't"      "down"       "during"    
##  [41] "each"       "few"        "financi"    "for"        "from"      
##  [46] "further"    "had"        "hadn't"     "has"        "hasn't"    
##  [51] "have"       "haven't"    "having"     "he"         "he'd"      
##  [56] "he'll"      "he's"       "her"        "here"       "here's"    
##  [61] "hers"       "herself"    "him"        "himself"    "his"       
##  [66] "how"        "how's"      "i"          "i'd"        "i'll"      
##  [71] "i'm"        "i've"       "if"         "in"         "into"      
##  [76] "is"         "isn't"      "it"         "it's"       "its"       
##  [81] "itself"     "let's"      "make"       "market"     "me"        
##  [86] "more"       "most"       "mustn't"    "my"         "myself"    
##  [91] "no"         "nor"        "not"        "of"         "off"       
##  [96] "on"         "once"       "only"       "or"         "other"     
## [101] "ought"      "our"        "ours"       "ourselves"  "out"       
## [106] "over"       "own"        "please"     "propos"     "rule"      
## [111] "same"       "shan't"     "she"        "she'd"      "she'll"    
## [116] "she's"      "should"     "shouldn't"  "so"         "some"      
## [121] "such"       "than"       "that"       "that's"     "the"       
## [126] "their"      "theirs"     "them"       "themselves" "then"      
## [131] "there"      "there's"    "these"      "they"       "they'd"    
## [136] "they'll"    "they're"    "they've"    "this"       "those"     
## [141] "through"    "to"         "too"        "trade"      "under"     
## [146] "until"      "up"         "very"       "was"        "wasn't"    
## [151] "we"         "we'd"       "we'll"      "we're"      "we've"     
## [156] "were"       "weren't"    "what"       "what's"     "when"      
## [161] "when's"     "where"      "where's"    "which"      "while"     
## [166] "who"        "who's"      "whom"       "why"        "why's"     
## [171] "with"       "won't"      "would"      "wouldn't"   "you"       
## [176] "you'd"      "you'll"     "you're"     "you've"     "your"      
## [181] "yours"      "yourself"   "yourselves"

# stem the document
documents$PreppedText <- stemDocument(documents$PreppedText)

# remove words again to deal with context specific stopwards which are in stem form.
Sys.time()

## [1] "2019-12-01 11:57:51 PST"

system.time(
  documents$PreppedText<-removeWords(documents$PreppedText, mystopwords)
  ) #this takes about 19 sec on 4.1 GHz desktop processor

##    user  system elapsed 
##   19.21    0.00   19.20

1.1.2 Save a Clean Copy of the Database

Since data analysis often involves exploration and experimentation, this chunk creates a clean backup of the data to avoid the processing time which chunk 1.1.1 takes, especially on laptop processors.

#create a backup version of the database at this state to save processing time if we alter the corpus below in 1.2.1.
documentsBKUP_1.1 <- documents

1.2 Generate the mallet topic model

In this section, we generate the topic model. For this project, this will include all of the comments so that we get topics which represent the entire corpus and then we subset the results to see variation among commenter types.

1.2.1 Run the topic model

This chunk loads data into mallet, sets number of topics, trains the model, sets the random seed (for reproducible results), THEN inputs the data into the topic model. Following that, we set the hyperparameters as well as the number of iterations, which are the primary parameters for LDA.

Note: Mallet is powerful and much faster than non-Java LDA, but it is still a beast on processing power. Added to that fact is that a single thread does NOT max out even a single core of a modern processor. Thus, you must tweak the code to optimise for your computer. I have found that 5 threads per logical processor works best for me on an ancient Phenom II 720 (3-core processor circa 2009, OC to 3.5 GHz) but 1 thread per logical processor on my new computer (Ryzen 5 3600, 6-core, 12 logical processor, 4.1 GHz stock clock). You’ll see my log of testing runs below in the code. I suggest anyone adopting this technique does the same debugging BEFORE you settle in for an interpretation-grade run (for us, 12,000 iterations) as the runtime can vary 50-100% based on improper thread number choices and the runtime is on the order of hours. Do you want to wait 2 hours or 4?

But, be careful about what order you call setNumThreads in.

It must be AFTER the instances have been loaded or you will mess up the random seed and produce inconsistent results.

The order must be: (1) set seed (2) load instances (3) set threads (4) run model

#remove output from this chunk just in case to prevent any messy double-use when experimenting with other analyses
rm(mallet.instances,topic.model,vocabulary,topic,n.topics,topics.labels,doc.topics,topic.words,word.freqs)

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'mallet.instances' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'topic.model' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'vocabulary' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'topic' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'n.topics' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'topics.labels' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'doc.topics' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'topic.words' not found

## Warning in rm(mallet.instances, topic.model, vocabulary, topic, n.topics, :
## object 'word.freqs' not found

#reset the working documents corpus, "documents", to the original full loaded database as it was at the end of step 1.1 to prevent any messiness from experimenting with other analyses
documents <- documentsBKUP_1.1

# load data into mallet
mallet.instances <- mallet.import(as.character(documents$ControlNumberID), as.character(documents$PreppedText), "stoplist.csv", FALSE, token.regexp="[\\p{L}']+")


#########################################
# Decide what number of topics to model #
#########################################
n.topics = 14

## Create a topic trainer object.
topic.model <- MalletLDA(n.topics)



####################################################################################
##For reproducible results, we need to set the seed the same for the final run.   ##
## This MUST be set BEFORE LOADING THE INSTANCES, odd behavior if it is not.      ##
####################################################################################

topic.model$model$setRandomSeed(as.integer(1))

##########################################################################################
##                                                                                      ##
##  Load our documents.  THis MUST happen after setRandomSeed and BEFORE setNumThreads  ##
##                                                                                      ##
##########################################################################################

topic.model$loadDocuments(mallet.instances)

#####################################################################################
#                                                                                   #
#   Set Topic Model to use Multiple Threads across multiple cores                   #
#                                                                                   #
#   Based on armchair optimization, I found that diminishing returns happen around  #
#   5x number of logical processors.  It would behoove you to test on your machine  #
#   with some 500 iteration runs to find the sweet spot to avoid spending an extra  #
#   couple hours on the 12k final run for analyis and interpretation.               #
#                                                                                   #
#####################################################################################

topic.model$model$setNumThreads(as.integer(numLogicalProcessors*1))

## Get the vocabulary, and some statistics about word frequencies.
##  These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)

# examine some of the vocabulary, first sorting it by word frequency and then document frequency
word.freqs <- word.freqs[order(-word.freqs$term.freq, -word.freqs$doc.freq),]
word.freqs[1:50,]

##             words term.freq doc.freq
## 2064         swap    150826     3764
## 55        commiss     95386     4215
## 162        requir     93405     4193
## 801          bank     58507     3226
## 174          risk     58271     3749
## 966        entiti     55485     2805
## 403          fund     54465     2367
## 1502        clear     52832     4240
## 12          regul     50925     3832
## 688      transact     49941     3065
## 120         limit     46849     5754
## 53         commod     46074     5395
## 856         posit     45038     5187
## 534        provid     42898     3793
## 578      particip     42472     3270
## 2505       report     37749     2362
## 1307       dealer     37018     2372
## 812          hedg     36707     3279
## 1620     contract     36085     4153
## 397         price     34328     4136
## 938         activ     32919     2801
## 389        includ     32270     4328
## 250        invest     32074     2362
## 2192      section     32042     2539
## 54          futur     31540     4003
## 77          manag     30146     3198
## 550       definit     29743     2456
## 2084       exempt     28255     3812
## 2057    doddfrank     27625     3758
## 470         deriv     26438     4206
## 157         state     25760     3511
## 119        custom     25103     2369
## 469         secur     25073     2754
## 124        believ     24799     3437
## 1901 counterparti     23926     2240
## 1463     interest     23781     3049
## 67           time     23307     3314
## 684       compani     23045     3518
## 29         effect     22777     3950
## 134          busi     22421     3902
## 942        system     22262     3449
## 955     implement     21873     3504
## 415        execut     21657     2605
## 161        margin     21532     2191
## 376       exchang     21458     3721
## 364    regulatori     21151     3040
## 1233       liquid     20887     2355
## 791          data     20837     1744
## 1061       member     20230     2556
## 365         agenc     19673     1627

## Optimize hyperparameters every 20 iterations, 
##  after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)


## Now train a model. Note that hyperparameter optimization is on, by default.
##  We can specify the number of iterations. Here we'll use a large-ish round number.
#####################################################################################
### THIS TAKES TIME!!! (~2.5 min for 8568 comments/50 topics, 100 iterations)     ###
###   Faster for fewer topics (~1.5min on 8568 com/10 topics, 100 iterations)     ###
###                                                                               ###
###              --<< Old Computer, 3.5 Ghz 3 Core Phenom II 720 >>--             ###
###                                                                               ###
###   (  2 min 26 sec on 8568 com/15 topics,   100 iterations, 1? thread)         ###
###   ( 24 min 13 sec on 8568 com/15 topics,  1000 iterations, 1? thread)         ###
###   (  2 min 39 sec on 8568 com/14 topics,   100 iterations, 1? thread)         ###
###   ( 13 min  0 sec on 8568 com/14 topics,   500 iterations,  3 thread)         ###
###   (  8 min 56 sec on 8568 com/14 topics,   500 iterations,  6 thread)         ###
###   (  8 min  0 sec on 8568 com/14 topics,   500 iterations,  9 thread)         ###
###   (  7 min 25 sec on 8568 com/14 topics,   500 iterations, 12 thread)         ###
###   (  8 min 17 sec on 8568 com/14 topics,   500 iterations, 30 thread)         ###
###   (  7 min 18 sec on 8568 com/14 topics,   500 iterations, 15 thread)         ###
###   (172 min  8 sec on 8568 com/14 topics, 12000 iterations, 15 thread)         ###
###   (  8 min 50 sec on 8568 com/14 topics,  1000 iterations, 15 thread, reboot) ###
###   (  9 min 55 sec on 8568 com/14 topics,  1000 iterations, 30 thread, reboot) ###
###   (  9 min 44 sec on 8568 com/14 topics,  1000 iterations, 18 thread, reboot) ###
###   (  9 min 44 sec on 8568 com/14 topics,  1000 iterations, 15 thread, reboot) ###
###   (116 min 11 sec on 8568 com/14 topics, 12000 iterations, 15 thread, reboot) ###
###                                                                               ###
###              --<< New Computer, 4.1 Ghz 6 Core Ryzen 3600 >>--                ###
###                                                                               ###
###   (  2 min  3 sec on 8568 com/14 topics,   500 iterations, 60 thread)         ###
###   (  3 min  4 sec on 8568 com/14 topics,   500 iterations,120 thread)         ###
###   (  2 min  5 sec on 8568 com/14 topics,   500 iterations, 60 thread)         ###
###   (  1 min 37 sec on 8568 com/14 topics,   500 iterations, 12 thread)         ###
###   (  1 min 30 sec on 8568 com/14 topics,   500 iterations, 12 thread)         ###
###   ( 35 min 28 sec on 8568 com/14 topics, 12000 iterations, 12 thread)         ###
###                                                                               ###
###   Use 500 iterations for testing syntax and 12000 iterations for analysis     ###
#####################################################################################

# print the number of iterations to console
iterations

## [1] 12000

Sys.time()

## [1] "2019-12-01 11:58:14 PST"

system.time(
  topic.model$train(iterations) #use 12,000 for final
  ) #this takes about ___ sec on 3.5GHz desktop processor locked in P0

##     user   system  elapsed 
## 11748.74     2.75  2717.05

### Useful for debuging on short runs
# 
# topic.words.test <- mallet.topic.words(topic.model, smoothed=T, normalized=T)
# 
# ### Get a vector containing short names for the topics
# topics.labels.test <- rep("", n.topics)
# for (topic in 1:n.topics) 
#   topics.labels.test[topic] <- paste(mallet.top.words(topic.model, 
#                                                       topic.words.test[topic,],
#                                                       num.top.words=10)$words, 
#                                       collapse=" ")
# 
# ### have a look at keywords for each topic
# topics.labels.test
# 
# ## Write that list to a | delimited file for easy copy into excel
# dir.create(paste(wd,"/",iterations," iterations/", sep=""))
#
# filename = filename = paste(iterations," iterations/","Test_Top_10_words_in_",n.topics,"_topics_",iterations,"_iter_",
#                   format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep=""
#                  )
# 
# write.table(topics.labels.test, row.names=TRUE, col.names=FALSE, file = filename, sep = "|")

2. Substantive Analyses of Groups of Comments (sub-Corpora)

Now that we have a topic model, the interesting substantive comparisons for understanding Dodd-Frank financial reform at the CFTC involves understanding the different topics of discussion by different groups of commenters.

These are the analyses which we run after generating a new topic model

2.1 Initial Topic Model Descriptives

Generate some quick descriptive data about the topic.model

## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities, 
##  so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

## What are the top words in topic 7?
##  Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[6,], 40)

##        words     weights
## 1      clear 0.036884494
## 2     requir 0.018593748
## 3     custom 0.015830499
## 4    commiss 0.015615892
## 5        sef 0.015160026
## 6       swap 0.013349186
## 7     member 0.011811865
## 8     execut 0.011054426
## 9       risk 0.009526925
## 10       dco 0.009388061
## 11  particip 0.009212728
## 12     futur 0.006894123
## 13      dcos 0.006570108
## 14   exchang 0.006229260
## 15    provid 0.005962754
## 16     limit 0.005779005
## 17     manag 0.005314723
## 18     deriv 0.005268435
## 19    believ 0.005233369
## 20  interest 0.005084686
## 21  contract 0.004875689
## 22    margin 0.004589546
## 23     facil 0.004489956
## 24  principl 0.004445071
## 25       dcm 0.004432447
## 26       fcm 0.004330053
## 27   comment 0.004144901
## 28    system 0.004136485
## 29      dcms 0.004087392
## 30     model 0.003947125
## 31     regul 0.003899435
## 32      core 0.003858757
## 33    liquid 0.003819483
## 34   account 0.003745141
## 35   stawick 0.003697451
## 36    access 0.003682022
## 37 doddfrank 0.003679216
## 38   default 0.003667995
## 39    includ 0.003639942
## 40     order 0.003635734

## Get a vector containing short names for the topics
topics.labels <- rep("", n.topics)
for (topic in 1:n.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")

# have a look at keywords for each topic
topics.labels

##  [1] "swap requir risk margin dealer"       
##  [2] "ca ny fl wa gmailcom"                 
##  [3] "swap report requir commiss data"      
##  [4] "swap energi commiss transact electr"  
##  [5] "silver limit contract posit ounc"     
##  [6] "clear requir custom commiss sef"      
##  [7] "risk liquid price cost deriv"         
##  [8] "posit limit commod price contract"    
##  [9] "cio occ jpmorgan risk chase"          
## [10] "commiss requir complianc commod regul"
## [11] "fund bank entiti invest activ"        
## [12] "bank doddfrank state requir clear"    
## [13] "swap dealer foreign regul requir"     
## [14] "specul commod price limit posit"

## Generate a quick list of the top N words for each topic
numberOfWords = 40

topics.list <- rep("", n.topics)

for (topic in 1:n.topics) topics.list[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=numberOfWords)$words, collapse=" ")

## Display that list
topics.list

##  [1] "swap requir risk margin dealer particip entiti counterparti commiss transact insur regul major definit plan hedg doddfrank clear provid valu endus credit contract enter collater includ capit manag exposur deriv busi initi believ section asset compani parti subject posit system"           
##  [2] "ca ny fl wa gmailcom coop pa john yahoocom state il titl ma robert tx david san signatur mi michael jame wi william mari richard nj az va ii nc ri md mn ct barbara paul thoma ann susan citi"                                                                                                   
##  [3] "swap report requir commiss data transact particip implement time clear execut provid block counterparti believ regul public inform final process confirm sef system parti realtim dealer comment regulatori doddfrank liquid sdr size price determin servic term avail sdrs includ deriv"        
##  [4] "swap energi commiss transact electr commod commerci contract entiti user end option comment physic gas definit regul power risk member requir associ regulatori provid cooper natur public util particip forward nonfinanci servic agreement state deliveri includ respect cost nfp busi"        
##  [5] "silver limit contract posit ounc manipul million specul amount current world year consum tm produc formula greater institut commission price concentr comex commod chairman gensler prevent hold number high exempt urg power result compani respect deriv mine fellow fair approv"              
##  [6] "clear requir custom commiss sef swap member execut risk dco particip futur dcos exchang provid limit manag deriv believ interest contract margin facil principl dcm fcm comment system dcms model regul core liquid account stawick access doddfrank default includ order"                       
##  [7] "risk liquid price cost deriv capit increas time rate valu year regul report larg product interest standard asset secur provid inform high level system requir effect futur instrument bank impact firm base measur chang exchang hedg money relat posit avail"                                   
##  [8] "posit limit commod price contract futur hedg commiss specul index risk month agricultur aggreg exempt trader physic bona oil fide particip commerci swap fund spot suppli option manag provid increas level includ interest chang deriv liquid econom open referenc excess"                      
##  [9] "cio occ jpmorgan risk chase bank credit mr subcommitte scp loss report var email posit portfolio limit model interview manag book drew deriv million told offic synthet iksil hedg april trader day price januari invest breach billion march task forc"                                         
## [10] "commiss requir complianc commod regul section entiti inform provid futur particip comment doddfrank order final exempt report effect regist manag invest person includ registr compani issu regulatori pool believ provis applic associ firm rulemak definit stawick offic cea sec request"      
## [11] "fund bank entiti invest activ cover agenc secur volcker hedg exempt section requir compani manag definit asset provid state transact includ permit proprietari interest prohibit regul risk unit equiti limit privat custom account restrict foreign investor believ feder liquid insur"         
## [12] "bank doddfrank state requir clear forex institut compani import deriv transpar commerci regul busi street retail chairman communiti system endus reform leverag hous american comment nation exchang account rin chang unit secretari organ shadow legisl includ citi big exempt implement"      
## [13] "swap dealer foreign regul requir bank commiss person nonus activ transact entiti affili titl jurisdict vii regist counterparti state guidanc definit intern unit applic registr regulatori branch appli subject section countri complianc doddfrank feder clear capit respect busi deal determin"
## [14] "specul commod price limit posit excess food gas famili institut oil mani volatil urg street impos wall effect exempt high studi industri consum respons commiss affect tm contribut adopt energi dozen senat hedg undermin put univers place rise especi clear"

## Reorder topics to substantive order from arbitrary order. This is based on analysis and case knowledge and added to the code after careful human analysis of output of topic model.

newTopicOrder <- c(3,   8,  6,  11, 12, 5,  7,  13, 9,  4,  2,  10, 1,  14) #Note that the order is based on what the new roman numeral will be (the new substantive order) rather than the order of the topic rows since we are assigning a value than sorting by that value.  This will be different below in 2.2 when we rename and reorder columns for figures.  Note also that the list of names for the topics in 2.3.2 and 2.4.2 for the figures is also updated when this is changed

topics.list.newOrder <- cbind(data.frame(topics.list),newTopicOrder)

topics.list.newOrder <- topics.list.newOrder[order(topics.list.newOrder$newTopicOrder, decreasing = FALSE),]

###################################################################

## Put the new order number in front for printing
topics.list.newOrder <- topics.list.newOrder[,c(2,1)]

## Write that list to a | delimited file for easy copy into excel
dir.create(paste(wd,"/",iterations," iterations/", sep=""))

## Warning in dir.create(paste(wd, "/", iterations, " iterations/", sep =
## "")): 'D:\Dropbox\The Limits of Interest\12000 iterations' already exists

filename = paste(iterations," iterations/", "Top_40_words_in_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(topics.list.newOrder, row.names=FALSE, col.names=FALSE, file = filename, sep = "|")

2.2 Create a dataframe with the documents, meta-data, and topic Proportions

This step essentially creates one large dataframe with all the stuff we want to use, so that it can be quickly analyzed in one step. It also creates a typology factor which we can then use to do the various subsets which are of interest later (e.g. Industry vs. Nonindustry, 4 super types, 5 super types)

# add back in the ControlNumberID for each document
doc.topics.IDnumbered <- data.frame(documents$ControlNumberID,doc.topics)

## Create a vector containing short names for the topics to use as variable names
topics.var.labels <- rep("", n.topics)
for (topic in 1:n.topics) topics.var.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=3)$words, collapse=".")

# have a look at these var names for each topic, note that they are in the non-substantive order
topics.var.labels

##  [1] "swap.requir.risk"         "ca.ny.fl"                
##  [3] "swap.report.requir"       "swap.energi.commiss"     
##  [5] "silver.limit.contract"    "clear.requir.custom"     
##  [7] "risk.liquid.price"        "posit.limit.commod"      
##  [9] "cio.occ.jpmorgan"         "commiss.requir.complianc"
## [11] "fund.bank.entiti"         "bank.doddfrank.state"    
## [13] "swap.dealer.foreign"      "specul.commod.price"

# label the variables of this new dataframe based on a short version of the topic names
colnames(doc.topics.IDnumbered) <- c("ControlNumberID", topics.var.labels)

# check the initial (raw) order and first couple values
names(doc.topics.IDnumbered)

##  [1] "ControlNumberID"          "swap.requir.risk"        
##  [3] "ca.ny.fl"                 "swap.report.requir"      
##  [5] "swap.energi.commiss"      "silver.limit.contract"   
##  [7] "clear.requir.custom"      "risk.liquid.price"       
##  [9] "posit.limit.commod"       "cio.occ.jpmorgan"        
## [11] "commiss.requir.complianc" "fund.bank.entiti"        
## [13] "bank.doddfrank.state"     "swap.dealer.foreign"     
## [15] "specul.commod.price"

head(doc.topics.IDnumbered[,1:7], 5)

##   ControlNumberID swap.requir.risk     ca.ny.fl swap.report.requir
## 1            1024     0.0010566891 3.620512e-04       0.0010046525
## 2            1032     0.0003459797 1.185423e-04       0.0003289419
## 3            1035     0.0005628735 1.928562e-04       0.0005351548
## 4            1059     0.0007081652 2.426372e-04       0.0006732916
## 5            1081     0.0001844480 6.319705e-05       0.0001753649
##   swap.energi.commiss silver.limit.contract clear.requir.custom
## 1        0.0007478465          0.0015061863        0.0010528454
## 2        0.0002448589          0.0004931534        0.0003447212
## 3        0.0003983603          0.0463948474        0.0005608260
## 4        0.0005011870          0.0870510729        0.2301500335
## 5        0.0001305387          0.0562887379        0.0001837771

############# To be updated after 2019-06-12 changes ##############
# re-order topics into subtantive order. Note that the order is based on the order of the raw topic columns +1 (because of ControlNumberID) rather than what the new roman numeral will be (the new substantive order) since we are reorganizing the columns by column index.  This is different than above in 2.1 when we asigned new row numbers then sorted by them.

doc.topics.IDnumbered <- doc.topics.IDnumbered[c(1, 14, 12, 2,  11, 7,  4,  8,  3,  10, 13, 5,  6,  9,  15)]

###################################################################

#check that this worked (first should be ControlNumberID, second is swap.dealer.foreign, last should be specul.commod.price) [2019-08-07 consistent]
names(doc.topics.IDnumbered)

##  [1] "ControlNumberID"          "swap.dealer.foreign"     
##  [3] "fund.bank.entiti"         "swap.requir.risk"        
##  [5] "commiss.requir.complianc" "clear.requir.custom"     
##  [7] "swap.report.requir"       "risk.liquid.price"       
##  [9] "ca.ny.fl"                 "cio.occ.jpmorgan"        
## [11] "bank.doddfrank.state"     "swap.energi.commiss"     
## [13] "silver.limit.contract"    "posit.limit.commod"      
## [15] "specul.commod.price"

head(doc.topics.IDnumbered[,1:7], 5)

##   ControlNumberID swap.dealer.foreign fund.bank.entiti swap.requir.risk
## 1            1024        0.0008010986     0.0008754913     0.0010566891
## 2            1032        0.0072683507     0.0002866522     0.0003459797
## 3            1035        0.1144080698     0.0004663536     0.0005628735
## 4            1059        0.0005368751     0.0005867312     0.0007081652
## 5            1081        0.0001398340     0.0001528194     0.0001844480
##   commiss.requir.complianc clear.requir.custom swap.report.requir
## 1              0.001930152        0.0010528454       0.0010046525
## 2              0.175783370        0.0003447212       0.0003289419
## 3              0.023824415        0.0005608260       0.0005351548
## 4              0.001293537        0.2301500335       0.0006732916
## 5              0.056362742        0.0001837771       0.0001753649

# merge the doc.topics.IDNumbered and documents dataframe to create an analytical base for all the comparisons
documents.withTopics <- merge(documents, doc.topics.IDnumbered, "ControlNumberID")

# check it out real quick
names(documents.withTopics)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "In.NZ.and.LE.memos"               
##  [9] "Rule.Super.set..from.Memo.Title." 
## [10] "Final.Classification"             
## [11] "Final.Classification.Multivalue.2"
## [12] "Final.Classification.Multivalue.3"
## [13] "Final.Classification.Multivalue.4"
## [14] "Final.Classification.Multivalue.5"
## [15] "VA.KSC.Classification"            
## [16] "VA.KSC.Multivalue.2"              
## [17] "VA.KSC.Multivalue.3"              
## [18] "VA.KSC.Multivalue.4"              
## [19] "VA.KSC.Multivalue.5"              
## [20] "VA.KSC.Non.US...Non.US.or.Blank." 
## [21] "LE.Non.US...Non.US.or.Blank."     
## [22] "LE.Classification"                
## [23] "CommentText"                      
## [24] "ExtractedText"                    
## [25] "UnifiedText"                      
## [26] "PreppedText"                      
## [27] "swap.dealer.foreign"              
## [28] "fund.bank.entiti"                 
## [29] "swap.requir.risk"                 
## [30] "commiss.requir.complianc"         
## [31] "clear.requir.custom"              
## [32] "swap.report.requir"               
## [33] "risk.liquid.price"                
## [34] "ca.ny.fl"                         
## [35] "cio.occ.jpmorgan"                 
## [36] "bank.doddfrank.state"             
## [37] "swap.energi.commiss"              
## [38] "silver.limit.contract"            
## [39] "posit.limit.commod"               
## [40] "specul.commod.price"

#create a factor for the typology
typology <- factor(
  c(
  "Major Wall Street Sell-Side Bank",
  "Other Sell-Side Bank",
  "Core Financial Service Trade Association",
  "Financial Sector Association",
  "Public Asset Manager",
  "Private Asset Manager",
  "U.S. Chamber of Commerce or Affiliate",
  "Market Infrastructure Firm",
  "Law Firms, Consultants, and Related Advisors",
  "Non-Financial Private-Sector Association",
  "Non-Financial Firm",
  "Government",
  "Academic or Other Expert",
  "Consumer Advocacy or other Citizens Group",
  "Trade Union or other Formal Labor Organization",
  "Market Advocacy or other Anti-Regulation Group",
  "Unaffiliated Individual",
  "Un-Coded")
  )

2.2.1 Deal with Comment Letters Signed by Multiple Types of Commentors

A number of Comment letters had multiple organization types listed in the Organization value (161 with two, 27 with three, 1 with four). We have coded each of these types so that each letter will count as one contribution to each type of commenter. This means that the total number of comments in the topic proportions below will exceed the total number of actual comment letters (n=n.topics).

We chose this approach because we are interested in who is saying what and how that differs between different types of commenters. This means it is less important to maintain exact numberical equivalency since we are comparing proportions.

We chose 1 contribution to each category (rather than fractional based on some arbitrary formula) for the same reason that every signatory to the Declaration of Independence was equally guilty of treason even though only some (e.g. Thomas Jefferson) did the majority of writing.

If you signed a letter, you are jointly, separably, wholly, and equally liable for it’s contents even if you were (hypothetically) less part of the writing process.

## Add new variable for Unified Classifications with multivalues
documents.withTopics$Classifications.with.Multivalue <- NA

## Copy out all comments with non-empty multivalue columns into new dataframes with rows that are only those comments with that number of multivalues and all columns.  Then, assign their respective multivalue to their value for Classifications.with.Multivalue.  Finally, change their ControlNumberID to be OriginalID_<MultivalueIndex> so that ControlNUmberID remains a Globally Unique Identifier (GUID).

# Multivalue 2, 161 rows expected
multivalue2.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.2!="",]

multivalue2.documents$Classifications.with.Multivalue <- multivalue2.documents$Final.Classification.Multivalue.2

multivalue2.documents <- multivalue2.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"2", sep = "_"))


# Multivalue 3, 27 rows expected
multivalue3.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.3!="",]

multivalue3.documents$Classifications.with.Multivalue <- multivalue3.documents$Final.Classification.Multivalue.3

multivalue3.documents <- multivalue3.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"3", sep = "_"))

# Multivalue 4, 1 rows expected
multivalue4.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.4!="",]

multivalue4.documents$Classifications.with.Multivalue <- multivalue4.documents$Final.Classification.Multivalue.4

multivalue4.documents <- multivalue4.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"4", sep = "_"))

# Multivalue 5, 0 row expected
multivalue5.documents <- documents.withTopics[documents.withTopics$Final.Classification.Multivalue.5!="",]

multivalue5.documents$Classifications.with.Multivalue <- multivalue5.documents$Final.Classification.Multivalue.5

multivalue5.documents <- multivalue5.documents %>% as_tibble() %>% mutate(ControlNumberID = paste(ControlNumberID,"5", sep = "_"))

## Assign all original cases their Final.Classification value in the new Classifications.with.Multivalue.  This needed to come after the copy out in order to prevent possible innaccuracy in the data
documents.withTopics$Classifications.with.Multivalue <- documents.withTopics$Final.Classification

## Combine the rows from all of the dataframes to a new documents.withTopicsAndMultivalues
documents.withTopicsAndMultivalues <- 
  rbind(documents.withTopics, 
        multivalue2.documents, 
        multivalue3.documents, 
        multivalue4.documents, 
        multivalue5.documents
        )

## Verify that there are now the expected number of rows, which is 8262 + 161 + 27 + 1 + 0 = 8451
nrow(documents.withTopicsAndMultivalues)

## [1] 8451

nrow(documents.withTopicsAndMultivalues) == 8451

## [1] TRUE

## Confirm that all cases have a Classifications.with.Multivalues value
sum(1*(is.na(documents.withTopicsAndMultivalues$Classifications.with.Multivalue))) #should be 0

## [1] 0

#check that columns still named correctly (first should be ControlNumberID, second is posit.limit.commod, last should be loan.bank.cooper)
names(documents.withTopicsAndMultivalues)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "In.NZ.and.LE.memos"               
##  [9] "Rule.Super.set..from.Memo.Title." 
## [10] "Final.Classification"             
## [11] "Final.Classification.Multivalue.2"
## [12] "Final.Classification.Multivalue.3"
## [13] "Final.Classification.Multivalue.4"
## [14] "Final.Classification.Multivalue.5"
## [15] "VA.KSC.Classification"            
## [16] "VA.KSC.Multivalue.2"              
## [17] "VA.KSC.Multivalue.3"              
## [18] "VA.KSC.Multivalue.4"              
## [19] "VA.KSC.Multivalue.5"              
## [20] "VA.KSC.Non.US...Non.US.or.Blank." 
## [21] "LE.Non.US...Non.US.or.Blank."     
## [22] "LE.Classification"                
## [23] "CommentText"                      
## [24] "ExtractedText"                    
## [25] "UnifiedText"                      
## [26] "PreppedText"                      
## [27] "swap.dealer.foreign"              
## [28] "fund.bank.entiti"                 
## [29] "swap.requir.risk"                 
## [30] "commiss.requir.complianc"         
## [31] "clear.requir.custom"              
## [32] "swap.report.requir"               
## [33] "risk.liquid.price"                
## [34] "ca.ny.fl"                         
## [35] "cio.occ.jpmorgan"                 
## [36] "bank.doddfrank.state"             
## [37] "swap.energi.commiss"              
## [38] "silver.limit.contract"            
## [39] "posit.limit.commod"               
## [40] "specul.commod.price"              
## [41] "Classifications.with.Multivalue"

2.2.2 Backup the Analysis Dataframe

Because the topic model takes several hours to complete, create a copy of the analysis working dataframe just in case so each analysis starts with guranteed fresh data.

#create a backup version of the database at this state to save processing time if we alter the corpus below in any of the analyses.
documents.withTopicsAndMultivaluesBKUP_2.2 <- documents.withTopicsAndMultivalues

2.3 Generate and Display Topic Proportions for each of the 18 groups

This section first generates and then displays different representations of the topic proportions for each of the 18 groups in they commenter typology

2.3.1 Calculate the 18 Type Typology Topic Proportions

To calculate this, R must first calculate the individual Topic Proportions for each document, and then sum over the 18 typologies. Stores this in typologyTopicProportions.

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# variables in documents.withTopicsAndMultivalues
names(documents.withTopicsAndMultivalues)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "In.NZ.and.LE.memos"               
##  [9] "Rule.Super.set..from.Memo.Title." 
## [10] "Final.Classification"             
## [11] "Final.Classification.Multivalue.2"
## [12] "Final.Classification.Multivalue.3"
## [13] "Final.Classification.Multivalue.4"
## [14] "Final.Classification.Multivalue.5"
## [15] "VA.KSC.Classification"            
## [16] "VA.KSC.Multivalue.2"              
## [17] "VA.KSC.Multivalue.3"              
## [18] "VA.KSC.Multivalue.4"              
## [19] "VA.KSC.Multivalue.5"              
## [20] "VA.KSC.Non.US...Non.US.or.Blank." 
## [21] "LE.Non.US...Non.US.or.Blank."     
## [22] "LE.Classification"                
## [23] "CommentText"                      
## [24] "ExtractedText"                    
## [25] "UnifiedText"                      
## [26] "PreppedText"                      
## [27] "swap.dealer.foreign"              
## [28] "fund.bank.entiti"                 
## [29] "swap.requir.risk"                 
## [30] "commiss.requir.complianc"         
## [31] "clear.requir.custom"              
## [32] "swap.report.requir"               
## [33] "risk.liquid.price"                
## [34] "ca.ny.fl"                         
## [35] "cio.occ.jpmorgan"                 
## [36] "bank.doddfrank.state"             
## [37] "swap.energi.commiss"              
## [38] "silver.limit.contract"            
## [39] "posit.limit.commod"               
## [40] "specul.commod.price"              
## [41] "Classifications.with.Multivalue"

# average all the documents down to the 18 part typology in Classifications.with.Multivalue

typologyTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[2:length(names(documents.withTopicsAndMultivalues))] ### omit ID
                        ,"Classifications.with.Multivalue"
                        ,numcolwise(mean))

#to count the number of comments in each type:
sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Major Wall Street Sell-Side Bank"))

## [1] 124

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Core Financial Service Trade Association"))

## [1] 278

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Other Sell-Side Bank"))

## [1] 159

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Public Asset Manager"))

## [1] 23

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Private Asset Manager"))

## [1] 680

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="U.S. Chamber of Commerce or Affiliate"))

## [1] 59

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Market Infrastructure Firm"))

## [1] 659

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Law Firms, Consultants, and Related Advisors"))

## [1] 316

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Non-Financial Firm"))

## [1] 931

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Financial Sector Association"))

## [1] 576

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Non-Financial Private-Sector Association"))

## [1] 677

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Government"))

## [1] 317

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Academic or Other Expert"))

## [1] 127

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Consumer Advocacy or other Citizens Group"))

## [1] 442

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Trade Union or other Formal Labor Organization"))

## [1] 38

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Market Advocacy or other Anti-Regulation Group"))

## [1] 29

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Unaffiliated Individual"))

## [1] 2916

sum(1*(documents.withTopicsAndMultivalues$Classifications.with.Multivalue=="Un-Coded"))

## [1] 100

# report the dataframe where each row is one of the commenter types from Classifications.with.Multivalue, and each column is a topic.  Each cell will then be the average Topic Proportions of the commenter type to mention a particular topic.  (The rows will always adds up to 1 in theory but may be slightly off due to rounding in the output below)

typologyTopicProportions

##                   Classifications.with.Multivalue swap.dealer.foreign
## 1                Major Wall Street Sell-Side Bank          0.27125393
## 2                            Other Sell-Side Bank          0.42939115
## 3        Core Financial Service Trade Association          0.10977561
## 4                    Financial Sector Association          0.06894585
## 5                            Public Asset Manager          0.04321758
## 6                           Private Asset Manager          0.04290791
## 7           U.S. Chamber of Commerce or Affiliate          0.03496904
## 8                      Market Infrastructure Firm          0.02762622
## 9    Law Firms, Consultants, and Related Advisors          0.06223273
## 10       Non-Financial Private-Sector Association          0.01479069
## 11                             Non-Financial Firm          0.01223069
## 12                                     Government          0.12077104
## 13                       Academic or Other Expert          0.03660041
## 14      Consumer Advocacy or other Citizens Group          0.02987776
## 15 Trade Union or other Formal Labor Organization          0.05014513
## 16 Market Advocacy or other Anti-Regulation Group          0.05582802
## 17                        Unaffiliated Individual          0.00761514
## 18                                       Un-Coded          0.01111662
##    fund.bank.entiti swap.requir.risk commiss.requir.complianc
## 1       0.182152866      0.055261419               0.06526539
## 2       0.172785448      0.057591188               0.06871371
## 3       0.095574748      0.138650946               0.22397905
## 4       0.150370659      0.174003557               0.19011441
## 5       0.039841535      0.536473324               0.13884104
## 6       0.097781220      0.159360054               0.10576564
## 7       0.202695129      0.142923851               0.28637685
## 8       0.014837954      0.022757306               0.11876734
## 9       0.082823967      0.079189388               0.16979593
## 10      0.021004110      0.122767807               0.12201036
## 11      0.011802151      0.044828378               0.04682064
## 12      0.098108385      0.089230290               0.11355606
## 13      0.028154515      0.055557050               0.09321066
## 14      0.021404494      0.036070675               0.07130382
## 15      0.026576153      0.037835928               0.08833894
## 16      0.036555546      0.096885330               0.22394928
## 17      0.007662605      0.005295777               0.02570274
## 18      0.008819201      0.007434799               0.04160458
##    clear.requir.custom swap.report.requir risk.liquid.price     ca.ny.fl
## 1          0.141462149        0.088782695        0.08700791 0.0037791177
## 2          0.104898270        0.072156176        0.05221871 0.0005682826
## 3          0.116620436        0.200497764        0.05511965 0.0010013839
## 4          0.094161092        0.186088139        0.05028473 0.0017995947
## 5          0.031716907        0.018387654        0.06081134 0.0041419015
## 6          0.141171083        0.068331389        0.10010175 0.0072355473
## 7          0.010238079        0.108052828        0.06119348 0.0007308395
## 8          0.310485466        0.205422280        0.09433105 0.0086407201
## 9          0.042214675        0.068458432        0.08728723 0.0081717515
## 10         0.039552279        0.065062254        0.04438245 0.0048780267
## 11         0.014060126        0.024645824        0.07404155 0.0075746556
## 12         0.095497379        0.047901773        0.09875625 0.0174727890
## 13         0.033156639        0.016013372        0.16130822 0.0039026071
## 14         0.038629233        0.029314653        0.11607530 0.0117663293
## 15         0.034333310        0.004847072        0.06782249 0.0218252307
## 16         0.072315414        0.123926174        0.06525446 0.0006720449
## 17         0.007175437        0.006097648        0.08087038 0.0069946313
## 18         0.028956886        0.009786669        0.13316206 0.0063991578
##    cio.occ.jpmorgan bank.doddfrank.state swap.energi.commiss
## 1      6.074353e-04          0.022525087        0.0165415260
## 2      1.317570e-03          0.009139515        0.0076812243
## 3      8.810289e-04          0.014896962        0.0159078726
## 4      1.041906e-03          0.022130793        0.0232727266
## 5      1.887336e-05          0.057198160        0.0063732417
## 6      2.037075e-03          0.109682796        0.0147806690
## 7      3.719676e-03          0.057158270        0.0248818766
## 8      2.684083e-03          0.073103270        0.0189898970
## 9      3.465068e-04          0.124082555        0.0723104788
## 10     6.448907e-04          0.037720890        0.2736787445
## 11     1.739143e-03          0.153877123        0.0739756643
## 12     1.699957e-02          0.118480804        0.0483419509
## 13     3.137881e-03          0.190561536        0.0151037497
## 14     2.231176e-03          0.276509972        0.0268778699
## 15     3.424001e-03          0.314571660        0.0045921169
## 16     1.067412e-04          0.046714295        0.0003523865
## 17     3.016382e-03          0.186793025        0.0046453953
## 18     4.391291e-03          0.229102496        0.0243360976
##    silver.limit.contract posit.limit.commod specul.commod.price
## 1           0.0207017176        0.044013516        0.0006452482
## 2           0.0138125156        0.009250486        0.0004757500
## 3           0.0009369734        0.024367285        0.0017903005
## 4           0.0101595526        0.025249345        0.0023776430
## 5           0.0340070225        0.026331271        0.0026401613
## 6           0.0931194565        0.042717939        0.0150074729
## 7           0.0030365959        0.043827341        0.0201961428
## 8           0.0360492625        0.059120215        0.0071849472
## 9           0.1158188089        0.052875597        0.0343919528
## 10          0.0256757666        0.124291665        0.1035400664
## 11          0.1824884112        0.077750023        0.2741656128
## 12          0.0717141572        0.030809473        0.0323600704
## 13          0.0998560357        0.088940692        0.1744966335
## 14          0.1009942598        0.035749810        0.2031946436
## 15          0.0953573530        0.028168151        0.2221624587
## 16          0.1090928276        0.054458383        0.1138891055
## 17          0.3887012904        0.017809693        0.2516198556
## 18          0.3419092481        0.028436245        0.1245446515

## Write the table to a | delimited file for easy copy into excel
filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(typologyTopicProportions, row.names=FALSE, col.names=TRUE, file = filename, sep = "|")

2.3.2 Create a Table of Aggregated Topic Proportions for Plotting

This dataframe is used in 2.3.3 and 2.3.4.

# This creates a 252 obs x 3 var table where each is the average topic porportions 
#   for a topic-classifcation pair

tempPlot <- melt(typologyTopicProportions, id.vars="Classifications.with.Multivalue")

## give the columns the proper descriptive names
colnames(tempPlot)<- c("Classifications.with.Multivalue","topic","Proportions")

## look at the topic names
unique(tempPlot$topic)

##  [1] swap.dealer.foreign      fund.bank.entiti        
##  [3] swap.requir.risk         commiss.requir.complianc
##  [5] clear.requir.custom      swap.report.requir      
##  [7] risk.liquid.price        ca.ny.fl                
##  [9] cio.occ.jpmorgan         bank.doddfrank.state    
## [11] swap.energi.commiss      silver.limit.contract   
## [13] posit.limit.commod       specul.commod.price     
## 14 Levels: swap.dealer.foreign fund.bank.entiti ... specul.commod.price

levels(tempPlot$topic)

##  [1] "swap.dealer.foreign"      "fund.bank.entiti"        
##  [3] "swap.requir.risk"         "commiss.requir.complianc"
##  [5] "clear.requir.custom"      "swap.report.requir"      
##  [7] "risk.liquid.price"        "ca.ny.fl"                
##  [9] "cio.occ.jpmorgan"         "bank.doddfrank.state"    
## [11] "swap.energi.commiss"      "silver.limit.contract"   
## [13] "posit.limit.commod"       "specul.commod.price"

## Rename the topics from variable names to descriptive ones with roman numeral indexes
tempPlot$topic <- mapvalues(tempPlot$topic, from = levels(tempPlot$topic), to = c(
                              "Cross-Border Transactions (I)",
                              "Volcker Rule (II)",
                              "End-User Rule (III)",
                              "Compliance Rules (IV)",
                              "Swap Clearing Rule (V)",
                              "Swap Reporting Rules (VI)",
                              "Macro Risk Reporting & Monitoring (VII)",
                              "Addresses & Names (VIII)",
                              "Firm-Level Risk Models (IX)",
                              "Community Banks & Derivative Users  (X)",
                              "End Users & Public Utilities (XI)",
                              "Speculation in Precious Metals (XII)",
                              "Speculation in Agricultural & Energy Commodities (XIII)",
                              "Speculation in Household Commodities (XIV)"
                            ))

## look at the topic names
unique(tempPlot$topic)

##  [1] Cross-Border Transactions (I)                          
##  [2] Volcker Rule (II)                                      
##  [3] End-User Rule (III)                                    
##  [4] Compliance Rules (IV)                                  
##  [5] Swap Clearing Rule (V)                                 
##  [6] Swap Reporting Rules (VI)                              
##  [7] Macro Risk Reporting & Monitoring (VII)                
##  [8] Addresses & Names (VIII)                               
##  [9] Firm-Level Risk Models (IX)                            
## [10] Community Banks & Derivative Users  (X)                
## [11] End Users & Public Utilities (XI)                      
## [12] Speculation in Precious Metals (XII)                   
## [13] Speculation in Agricultural & Energy Commodities (XIII)
## [14] Speculation in Household Commodities (XIV)             
## 14 Levels: Cross-Border Transactions (I) ... Speculation in Household Commodities (XIV)

2.3.3 Stacked Bar Chart of Proportions for Visual Comparison

This generates a stacked bar chart which compares the proportions of topics across all 18 commenter types. Note that they are all scaled to [0,1] which means that they do not indicate the relative number of commenters in each category.

The output is saved to a png file with a descriptive name.

## Stacked Bar Chart of Proportions for Visual Comparison
#Consider using scales library to make the colors more visually pleasing

chartTitle = "Topic Proportions in Comments from 18 types of Commenters"
filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

#creation of png file
png(filename = paste("",filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)



stackedBar <- ggplot(tempPlot, aes(Classifications.with.Multivalue, Proportions, fill=topic)) +
  geom_bar(stat = "identity") + 
  theme(axis.text.x=element_text(angle=-35,hjust=0,vjust=1)) + 
  scale_fill_viridis(name="Topic", discrete=TRUE) +
  xlab("Commenter Types") + ylab("Average Topic Proportions") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(documents.withTopicsAndMultivalues),")", sep=""))

stackedBar

dev.off()

## png 
##   2

## Attached image to output as well in order to enrich Rmd online appendix

include_graphics(filename)

Topic_Proportions_by_Commenter_Type_14_topics_2019-12-01_13.35.17

2.3.4 Generate a Heat-Map Table of Topic Proportions

The heat map greatly assists in noticing patterns in the propensity table.

The output is saved to a png file with a descriptive name.

2.3.4.1 Color Landscape Heatmap

## make a nice heatmap of the proportions
# this is the table thing where cells are colored white low to red high.  It is useful for noticing patterns in topics

filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_heatmap_",n.topics,"_topics_","landscape_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# Note that default resolution (res) in ggplot is 72 dpi. So, to get a perfect reproduction of debug dimensions but at arbitrarily higher resolution, we simply divide the px size by 72, change unit to in, and set the res.
png(filename = paste("",filename,sep=""), width = 1100/72, height = 578/72, units = "in", res = 600)

heatmapLandscape <- ggplot(tempPlot, aes(Classifications.with.Multivalue, topic)) + 
  geom_tile(aes(fill = Proportions), color= "white") + 
  theme_bw() + 
  xlab("") + #blank to avoid conflict with labels below
  ylab("") + #blank to avoid conflict with labels below
  scale_fill_gradient2(low = "#1A3C71", #cal Poster Template Blue
                       mid = "white",
                       high = "#FFB800", #Cal Poster Template Gold
                       midpoint = mean(tempPlot$Proportions)) + # set the scale mean at the data mean
  theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1)) + 
  geom_text(aes(label = round(Proportions, 4)), size=4) + 
  scale_y_discrete(limits = rev(levels(tempPlot$topic)))

heatmapLandscape

dev.off()

## png 
##   2

## Attached image to output as well in order to enrich Rmd online appendix

include_graphics(filename)

Topic_Proportions_by_Commenter_Type_heatmap_14_topics_landscape_2019-12-01_13.35.19

2.3.4.2 Color Portrait Heatmap

## make a nice heatmap of the proportions
# this is the table thing where cells are colored white low to red high.  It is useful for noticing patterns in topics

filename = paste(iterations," iterations/","Topic_Proportions_by_Commenter_Type_heatmap_",n.topics,"_topics_","portrait_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# Note that default resolution (res) in ggplot is 72 dpi. So, to get a perfect reproduction of debug dimensions but at arbitrarily higher resolution, we simply divide the px size by 72, change unit to in, and set the res.
png(filename = paste("",filename,sep=""), width = 1000/72, height = 578/72, units = "in", res = 600)

heatmapPortrait <- ggplot(tempPlot, aes(Classifications.with.Multivalue, topic)) + 
  geom_tile(aes(fill = Proportions), color= "white") + 
  theme_bw() + 
  xlab("") + #blank to avoid conflict with labels below
  ylab("") + #blank to avoid conflict with labels below
  scale_fill_gradient2(low = "#1A3C71", #cal Poster Template Blue
                       mid = "white",
                       high = "#FFB800", #Cal Poster Template Gold
                       midpoint = mean(tempPlot$Proportions) # set the scale mean at the data mean
                       ) + 
  theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1)) + 
  geom_text(aes(label = round(Proportions, 4)), size=4) + 
  scale_x_discrete(limits = rev(levels(tempPlot$Classifications.with.Multivalue))) +
  coord_flip()

heatmapPortrait

dev.off()

## png 
##   2

## Attached image to output as well in order to enrich Rmd online appendix

include_graphics(filename)

Topic_Proportions_by_Commenter_Type_heatmap_14_topics_portrait_2019-12-01_13.35.21

2.4 Sample of Representative Comment Letters from Each Topic

To supplement and verify the computational topic model’s analysis of the comment corpus, this section selects comment letters which exemplify each of the topics. Several strategies of selecting “exemplary” comments are used in the sections below.

2.4.1 Select ALL Comment Letters which are {1,2} standard deviations above the mean topic proportions

This first section selects all comment letters which are at least N standard deviations above the mean topic proportions for each topic, where is N is {1,2}. These letters should contain the strongest representation of each topic to display what that topic looks like in its pureset form.

It seems like 1 and 2 deviations give a very large number of letters (hundreds). numDeviationsAbove = 5 seems to give about 100 which is still more than we are interested in. This is based on testing with the topic swap.foreign.regul and may be a different pattern for other topics. 5 is too high for some topics though since there are some topics which do not have a comment which is that far away from mean. 2 creates too big a set with duplicates.

With this in mind, I will be deprecating this approach and shifting to the “top 5 approach” in 2.4.2.

# set the number of standard deviations above mean which will be used
numDeviationsAbove = 2

# create output dataframe
sampleCommentLetters_DeviationsAbove <- documents.withTopicsAndMultivalues[FALSE,]



# get the list of topic variables names in substantive (and thus df consistent) order from section 2.2
topicNamesOrdered <- names(doc.topics.IDnumbered)[2:length(names(doc.topics.IDnumbered))] #we drop the first variable name because it is ControlNumberID which is not relevant here

#### Start Loop Here ####
for (currentTopicName in topicNamesOrdered){ # wrap the entire algorithm in a loop to execute through all of the the topics

#get the mean and standard deviation
meanOfTopicProportion <- mean(documents.withTopicsAndMultivalues[,currentTopicName])
stdevOfTopicProportion <- sd(documents.withTopicsAndMultivalues[,currentTopicName])

# Start by creating a new dataframe which selects out docs which are numDeivationsAbove above the mean
currentTopicSample <- documents.withTopicsAndMultivalues[documents.withTopicsAndMultivalues[,currentTopicName]>=(meanOfTopicProportion+(numDeviationsAbove*stdevOfTopicProportion)),]

# Add a variable to CurrentTopicSample to capture which topic the comment is selected for
currentTopicSample$relevantTopic <- currentTopicName

## Add these selected documents to the sampling dataframe
sampleCommentLetters_DeviationsAbove <- 
  rbind(sampleCommentLetters_DeviationsAbove, currentTopicSample)

#
}
## END OF LOOP ###

## Check if Duplicates have been added
length(unique(sampleCommentLetters_DeviationsAbove$ControlNumberID))


## Write the table to a | delimited file for easy copy into excel
filename = paste(iterations," iterations/","Sample_Comment_Letters_",numDeviationsAbove,"_Deviations_Above_Mean_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(typologyTopicProportions, row.names=FALSE, col.names=TRUE, file = filename, sep = "|")

2.4.2 Select Top XX highest topic proportion Comment Letters

As compared to 2.4.1, this is an even more extreme selection criteria although it may be susceptible to highly clusters topic proportions since it does not verify how extreme.

This was the analysis that ultimately was most useful for understanding what letters were in topics. However, 2.4.1 and 2.4.3 are retained in specification and partial implementation for posterity in case a future analyst wishes to explore alternative meanings of “typical” or “representative” when evaluating topics.

# set the number of comments to be selected from the top of the distribution
topNumSelected = 30

#create working copy of documents.withTopics dropping unnneded metadata
#   Note: we do NOT need to use multivalues for this strict "top ranking" selection criteria since we are not selecting in comparison to the entire corpus as we would be with the stdev or porportion models.
names(documents.withTopics)

##  [1] "ControlNumberID"                  
##  [2] "SubmitDate"                       
##  [3] "UniqueName"                       
##  [4] "FirstName"                        
##  [5] "LastName"                         
##  [6] "Organization"                     
##  [7] "Title"                            
##  [8] "In.NZ.and.LE.memos"               
##  [9] "Rule.Super.set..from.Memo.Title." 
## [10] "Final.Classification"             
## [11] "Final.Classification.Multivalue.2"
## [12] "Final.Classification.Multivalue.3"
## [13] "Final.Classification.Multivalue.4"
## [14] "Final.Classification.Multivalue.5"
## [15] "VA.KSC.Classification"            
## [16] "VA.KSC.Multivalue.2"              
## [17] "VA.KSC.Multivalue.3"              
## [18] "VA.KSC.Multivalue.4"              
## [19] "VA.KSC.Multivalue.5"              
## [20] "VA.KSC.Non.US...Non.US.or.Blank." 
## [21] "LE.Non.US...Non.US.or.Blank."     
## [22] "LE.Classification"                
## [23] "CommentText"                      
## [24] "ExtractedText"                    
## [25] "UnifiedText"                      
## [26] "PreppedText"                      
## [27] "swap.dealer.foreign"              
## [28] "fund.bank.entiti"                 
## [29] "swap.requir.risk"                 
## [30] "commiss.requir.complianc"         
## [31] "clear.requir.custom"              
## [32] "swap.report.requir"               
## [33] "risk.liquid.price"                
## [34] "ca.ny.fl"                         
## [35] "cio.occ.jpmorgan"                 
## [36] "bank.doddfrank.state"             
## [37] "swap.energi.commiss"              
## [38] "silver.limit.contract"            
## [39] "posit.limit.commod"               
## [40] "specul.commod.price"              
## [41] "Classifications.with.Multivalue"

documents.ToSample <- documents.withTopics[,c(
  1:6,
  25,
  27:40,
  10
)]

# create output dataframe, on re-runs, note that this also blanks out previous sample data which is desireable behavior to prevent accidents.
sampleCommentLetters_topNumber <- documents.withTopics[FALSE,]


# get the list of topic variables names in substantive (and thus df consistent) order from section 2.2
topicNamesOrdered <- names(doc.topics.IDnumbered)[2:length(names(doc.topics.IDnumbered))] #we drop the first variable name because it is ControlNumberID which is not relevant here

#### Start Loop Here ####
for (currentTopicName in topicNamesOrdered){ # wrap the entire algorithm in a loop to execute through all of the the topics

  #create temp working dataframe from topics without multivalues
  tempCommentsByTopics <- documents.ToSample[order(-documents.ToSample[,currentTopicName]),]
  
  # Create a new dataframe which selects out docs which are in the topNumSelected
  currentTopicSample <- tempCommentsByTopics[1:topNumSelected,]
  
  # Add a variable to CurrentTopicSample to capture which topic the comment is selected for
  currentTopicSample$relevantTopic <- currentTopicName
  
  ## Add these selected documents to the sampling dataframe
  sampleCommentLetters_topNumber <- 
    rbind(sampleCommentLetters_topNumber, currentTopicSample)

#
}
## END OF LOOP ###

## Are there duplicate Comment Letters? (False means no because we are comparing the number of unique comment IDs to the total number of comments)
length(unique(sampleCommentLetters_topNumber$ControlNumberID)) != nrow(sampleCommentLetters_topNumber)

## [1] FALSE

## Change relevantTopic to factor and rename levels to descriptive names
sampleCommentLetters_topNumber$relevantTopic <- factor(sampleCommentLetters_topNumber$relevantTopic)

sampleCommentLetters_topNumber$relevantTopic <- mapvalues(sampleCommentLetters_topNumber$relevantTopic, 
                                                          from = topicNamesOrdered, #
                                                          to = c(
                              "Cross-Border Transactions (I)",
                              "Volcker Rule (II)",
                              "End-User Rule (III)",
                              "Compliance Rules (IV)",
                              "Swap Clearing Rule (V)",
                              "Swap Reporting Rules (VI)",
                              "Macro Risk Reporting & Monitoring (VII)",
                              "Addresses & Names (VIII)",
                              "Firm-Level Risk Models (IX)",
                              "Community Banks & Derivative Users  (X)",
                              "End Users & Public Utilities (XI)",
                              "Speculation in Precious Metals (XII)",
                              "Speculation in Agricultural & Energy Commodities (XIII)",
                              "Speculation in Household Commodities (XIV)"
                            ))

## Put the relevant topic variables at the front and relevant comment letter text at the end
sampleCommentLetters_topNumber <- sampleCommentLetters_topNumber[,c(23,22,1:6,8:21,7)]

## Create a directory to hold this large number of letters
letterDirectory <- paste(wd,"/",iterations," iterations/Letters/", sep="")

dir.create(letterDirectory)

## Warning in dir.create(letterDirectory): 'D:\Dropbox\The Limits of
## Interest\12000 iterations\Letters' already exists

## Write each selected letter to a SEPERATE file with it's metadata

for ( commentNum in 1:nrow(sampleCommentLetters_topNumber)){
  
  currentLetter <- sampleCommentLetters_topNumber[commentNum,]
  
  if (commentNum%%topNumSelected == 0)
  {
    currentRank <- topNumSelected
  } else {
    currentRank <- commentNum%%topNumSelected    
    }
  
  filename = paste(letterDirectory, "/",
                   currentLetter[1,1],"_Letter_#",
                   currentRank,
                   "_",
                   format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),
                   ".txt",
                   sep="")
  
  for ( varNum in 1:ncol(currentLetter) ){
    write(c(paste(as.character(colnames(currentLetter)[varNum]), ## This is the variable name
                 as.character(currentLetter[1,varNum]), ## This is the variable value
                 sep = " : ")
            ), 
          file = filename,
          append = TRUE
          )
  }
}

#NOTE: Due to the excel cell character limit, example letters need to be outputted as individual files 
#       and then copied manually into word files for full text.  But, a tabular comparison file is useful
#
## Write the table to a | delimited file for easy copy into excel
filename = paste(letterDirectory,"/","Sample_Comment_Letters_TRUNCATED_Top_",topNumSelected,"_by_Proportions_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".txt", sep="")

write.table(sampleCommentLetters_topNumber, row.names=FALSE, col.names=TRUE, file = filename, sep = "|")

2.4.3 Select Median 5 Comment letters closest to the median Topic Proportion

This selection emphasizes the “typical” usage of the topic in the corpus but will likely, for that reason, be more difficult to isolate the essence of the specific topic. However, it may be useful should there be concerns about misrepresentativeness of extreme proportion selections in 2.4.1 and 2.4.2

3. Who Commented on the 10 most commented rules in the Coded (8k) dataset

Now we want to see who was commenting on the 10 most commented on rules. This data primarily discussed in Supporting Information B and E.

3.1 The top 10 Rules by number of comments

First, we need that top 10 list. Note: This will be based on the coded comments only, which is the set of 8568 comments not the set of 37,232 comments.

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# get the unique list of rules with counts of relevant comments
cftc.Comments.coded.counts <- as.data.frame(table(documents.withTopicsAndMultivalues$UniqueName))

colnames(cftc.Comments.coded.counts)[1]<- "UniqueName"
colnames(cftc.Comments.coded.counts)[2]<- "Comments"

## Sort in decending order of frequency to see top values
cftc.Comments.coded.counts <- cftc.Comments.coded.counts[order(cftc.Comments.coded.counts$Comments, decreasing = TRUE),]

# examine the top 15 rules to get a sense for patterns
head(cftc.Comments.coded.counts, 15)

##      UniqueName Comments
## 67   76 FR 4752     3335
## 38  75 FR 80747      659
## 1    75 FR 3281      432
## 101  77 FR 8332      377
## 35  75 FR 80174      261
## 6   75 FR 63732      185
## 93  77 FR 41213      168
## 107 78 FR 75680      153
## 42   76 FR 1214      129
## 82  77 FR 15460      113
## 2   75 FR 51429      106
## 27  75 FR 76139      102
## 37  75 FR 80638       92
## 56  76 FR 29818       92
## 52  76 FR 23732       82

## get the top 10
top.commented.rules <- head(cftc.Comments.coded.counts,n = 10)

#simple print
top.commented.rules

##      UniqueName Comments
## 67   76 FR 4752     3335
## 38  75 FR 80747      659
## 1    75 FR 3281      432
## 101  77 FR 8332      377
## 35  75 FR 80174      261
## 6   75 FR 63732      185
## 93  77 FR 41213      168
## 107 78 FR 75680      153
## 42   76 FR 1214      129
## 82  77 FR 15460      113

# write all rule counts to csv for simple excel manipulation/copy paste
write.csv(cftc.Comments.coded.counts, file=paste(iterations," iterations/","Comments_Top10Rules_n",sum(top.commented.rules$NumberOfComments),format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".csv", sep=""), row.names = FALSE)

# latex table for publishing etc.
print(xtable(top.commented.rules), include.rownames = FALSE)

## % latex table generated in R 3.6.1 by xtable 1.8-4 package
## % Sun Dec 01 12:44:09 2019
## \begin{table}[ht]
## \centering
## \begin{tabular}{lr}
##   \hline
## UniqueName & Comments \\ 
##   \hline
## 76 FR 4752 & 3335 \\ 
##   75 FR 80747 & 659 \\ 
##   75 FR 3281 & 432 \\ 
##   77 FR 8332 & 377 \\ 
##   75 FR 80174 & 261 \\ 
##   75 FR 63732 & 185 \\ 
##   77 FR 41213 & 168 \\ 
##   78 FR 75680 & 153 \\ 
##   76 FR 1214 & 129 \\ 
##   77 FR 15460 & 113 \\ 
##    \hline
## \end{tabular}
## \end{table}

3.2 Getting the comments from the coded dataset

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# create the top 10 rule comment dataset
cftc.Comments.coded.top10 <- subset(documents.withTopicsAndMultivalues, UniqueName %in% top.commented.rules$UniqueName)

# create the inverse as well
cftc.Comments.coded.11toEnd <- subset(documents.withTopicsAndMultivalues, !(UniqueName %in% top.commented.rules$UniqueName))


# check that the number of comments pulled is the sum total of the number in the top 10 list
nrow(cftc.Comments.coded.top10)

## [1] 5812

sum(top.commented.rules$NumberOfComments)

## [1] 0

3.3 Stacked Bar Charts of Commenter type

Now that we have the comments pulled out, we’ll use this dataset to generate some stacked bar charts.

3.3.1 Top 10 Rules Combined

First, the pool of comments for all 10 rules combined.

all.comments = x=factor("All Rules") ##dummy factor to produce a stacked bar chart with only one bar.

chartTitle = "Most Frequently Commented Ten CFTC Proposed Rules:\nComments by Type of Organization"
filename=paste(iterations," iterations/","Comments_by_OrgType_Top10Rules_n",nrow(cftc.Comments.coded.top10),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

topTenCombinedBar <- ggplot(cftc.Comments.coded.top10,
       aes(all.comments, 
           fill=Classifications.with.Multivalue,
           order=-as.numeric(Classifications.with.Multivalue))) + 
  geom_bar() + 
  scale_fill_viridis(name="Commenter Type", discrete = TRUE,
                      labels=paste(levels(cftc.Comments.coded.top10$Classifications.with.Multivalue),
                                   " (",
                                   table(cftc.Comments.coded.top10$Classifications.with.Multivalue),
                                   ")",sep=""))  +
  xlab("") + 
  ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.top10),")", sep=""))

topTenCombinedBar

dev.off()

## png 
##   2

## Attach image to output as well in order to enrich Rmd online appendix

include_graphics(filename)

Comments_by_OrgType_Top10Rules_n5812_2019-12-01_13.35.52

3.3.2 11 to N Rules Combined for Comparison

For comparison, let’s also look at the inverse: the 11th through end rules

all.comments = x=factor("All Rules") ##dummy factor to produce a stacked bar chart with only one bar.

chartTitle = "Less Frequently Commented (11 to 114) CFTC Proposed Rules:\nComments by Type of Organization"
filename=paste(iterations," iterations/","Comments_by_OrgType_11toEnd_n",nrow(cftc.Comments.coded.11toEnd),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

elevenToEndCombinedBar <- ggplot(cftc.Comments.coded.11toEnd, 
                                 aes(all.comments, 
                                     fill=Classifications.with.Multivalue,
                                     order=-as.numeric(Classifications.with.Multivalue))) +
  geom_bar() + 
  scale_fill_viridis(name="Commenter Type", discrete = TRUE,
                      labels=paste(levels(cftc.Comments.coded.top10$Classifications.with.Multivalue),
                                   " (",
                                   table(cftc.Comments.coded.top10$Classifications.with.Multivalue),
                                   ")",sep="")) + 
  xlab("") + ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.11toEnd),")", sep=""))

elevenToEndCombinedBar

dev.off()

## png 
##   2

## Attach image to output as well in order to enrich Rmd online appendix

include_graphics(filename)

Comments_by_OrgType_11toEnd_n2639_2019-12-01_13.35.54

3.4 Top 10 Commented Rules split out by Rule

Then, we also want to see each of the 10 rules split out.

3.4.1 Top 1 to 10 Rules Split Out

chartTitle = "Most Frequently Commented Ten CFTC Proposed Rules:\nComments by Type of Organization per Rule"
filename=paste(iterations," iterations/","Comments_by_OrgType_Top10Rules_splitout_n",nrow(cftc.Comments.coded.top10),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# NOTE: in order to get the rules sorted by descending order of volume of comments, we on-the-fly cast  UniqueName into a factor with levels from the top.commented.rules table created earlier which has the names of the rules sorted in descending order of comment volume.  This happens in the aes argument of the ggplot function

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

topTenSplitBar <- ggplot(cftc.Comments.coded.top10,
                         aes(factor(UniqueName,levels = top.commented.rules$UniqueName),
                             fill=Classifications.with.Multivalue, 
                             order=-as.numeric(Classifications.with.Multivalue))) + 
  geom_bar()+ 
  theme(axis.text.x=element_text(angle=20,
                                 hjust=.75,
                                 vjust=1)) + 
  scale_fill_viridis(name="Commenter Type", discrete=TRUE) + 
  xlab("Top 10 Commented Rules") + 
  ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.top10),")", sep=""))

topTenSplitBar

dev.off()

## png 
##   2

## Attach image to output as well in order to enrich Rmd online appendix

include_graphics(filename)

Comments_by_OrgType_Top10Rules_splitout_n5812_2019-12-01_13.35.55

### This lets you try it facetted instead, but the results are poor.  Maintained for archival purposes
# png(filename = "../Results/Comments_top10_by_Rule_facets.png", width = 753, height = 578, units = "px")
# ggplot(cftc.Comments.coded.top10, aes(all.comments, fill=Classifications.with.Multivalue, order=-as.numeric(Classifications.with.Multivalue))) + geom_bar() + scale_fill_discrete()  + xlab("") + ylab("Number of Comments") +labs(title=paste("All Comments on the Top 10 Commented CFTC Proposed Rules to Implement the Dodd-Frank Act\n (n =",nrow(cftc.Comments.coded.top10),")")) + facet_wrap(~UniqueName)
# dev.off()

3.4.2 Top 2 to 10 Rules (For easier comparison)

Now, let’s drop the first rule (76 FR 4752) since it is so many more comments than the other rules Then, we also want to see each of the 10 rules split out.

# create the top 2to10 rule comment dataset
cftc.Comments.coded.top2thru10 <- subset(cftc.Comments.coded.top10, UniqueName !="76 FR 4752")


chartTitle = "(2nd to 10th) Most Frequently Commented CFTC Proposed Rules:\nComments by Type of Organization per Rule"
filename=paste(iterations," iterations/","Comments_by_OrgType_Top2thru10Rules_splitout_n",nrow(cftc.Comments.coded.top2thru10),"_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

# NOTE: in order to get the rules sorted by descending order of volume of comments, we on-the-fly cast  UniqueName into a factor with levels from the top.commented.rules table created earlier which has the names of the rules sorted in descending order of comment volume.  This happens in the aes argument of the ggplot function

png(filename = paste(filename,sep=""), width = 753/72, height = 578/72, units = "in", res=600)

twoToTenSplitBar <- ggplot(cftc.Comments.coded.top2thru10, 
                           aes(factor(UniqueName,
                                      levels = top.commented.rules$UniqueName),
                               fill=Classifications.with.Multivalue,
                               order=-as.numeric(Classifications.with.Multivalue))) + 
  geom_bar()+ 
  theme(axis.text.x=element_text(angle=20,
                                 hjust=.75,
                                 vjust=1)) + 
  scale_fill_viridis(name="Commenter Type", discrete=TRUE) + 
  xlab("Top 2 to 10 Commented Rules") + 
  ylab("Number of Comments") +
  labs(title=paste(chartTitle,"\n (n = ",nrow(cftc.Comments.coded.top2thru10),")", sep=""))

twoToTenSplitBar

dev.off()

## png 
##   2

## Attach image to output as well in order to enrich Rmd online appendix

include_graphics(filename)

Comments_by_OrgType_Top2thru10Rules_splitout_n2477_2019-12-01_13.35.57

3.5 Create a Table of the number of comments by type for each of the top 10 rules

because we cannot include the totals in a clear way in the legend of the 10-bar stacked chart, we need a quick cross-tabs table to give us the totals for each rule. We’ll use the table function to accomplish this.

cftc.Comments.coded.top10.counts <- table(cftc.Comments.coded.top10$Classifications.with.Multivalue,factor(cftc.Comments.coded.top10$UniqueName,levels = top.commented.rules$UniqueName))

#Quick print to console, but this is not useful for copy/paste because window is too narrow
cftc.Comments.coded.top10.counts

##                                                 
##                                                  76 FR 4752 75 FR 80747
##   Major Wall Street Sell-Side Bank                        7           2
##   Other Sell-Side Bank                                    4           1
##   Core Financial Service Trade Association                4           5
##   Financial Sector Association                           15          11
##   Public Asset Manager                                    2           1
##   Private Asset Manager                                  86          14
##   U.S. Chamber of Commerce or Affiliate                   2           2
##   Market Infrastructure Firm                             46           2
##   Law Firms, Consultants, and Related Advisors           66          17
##   Non-Financial Private-Sector Association              115          37
##   Non-Financial Firm                                    465          47
##   Government                                             45           6
##   Academic or Other Expert                               42           6
##   Consumer Advocacy or other Citizens Group             160          81
##   Trade Union or other Formal Labor Organization         13           6
##   Market Advocacy or other Anti-Regulation Group          8           1
##   Unaffiliated Individual                              2197         415
##   Un-Coded                                               58           5
##                                                 
##                                                  75 FR 3281 77 FR 8332
##   Major Wall Street Sell-Side Bank                        3         25
##   Other Sell-Side Bank                                    0         27
##   Core Financial Service Trade Association                1         24
##   Financial Sector Association                            0         89
##   Public Asset Manager                                    0          0
##   Private Asset Manager                                  90         53
##   U.S. Chamber of Commerce or Affiliate                   0         18
##   Market Infrastructure Firm                             77          8
##   Law Firms, Consultants, and Related Advisors           42         30
##   Non-Financial Private-Sector Association                4         10
##   Non-Financial Firm                                    136          7
##   Government                                              6         49
##   Academic or Other Expert                               21         10
##   Consumer Advocacy or other Citizens Group               4         11
##   Trade Union or other Formal Labor Organization          0          0
##   Market Advocacy or other Anti-Regulation Group          0          1
##   Unaffiliated Individual                                24         15
##   Un-Coded                                               24          0
##                                                 
##                                                  75 FR 80174 75 FR 63732
##   Major Wall Street Sell-Side Bank                         7           6
##   Other Sell-Side Bank                                    11           4
##   Core Financial Service Trade Association                18           7
##   Financial Sector Association                            33           5
##   Public Asset Manager                                     3           1
##   Private Asset Manager                                   50          19
##   U.S. Chamber of Commerce or Affiliate                    6           0
##   Market Infrastructure Firm                               3          30
##   Law Firms, Consultants, and Related Advisors            25           7
##   Non-Financial Private-Sector Association                43           8
##   Non-Financial Firm                                      28          15
##   Government                                              17          33
##   Academic or Other Expert                                 3           1
##   Consumer Advocacy or other Citizens Group                5          32
##   Trade Union or other Formal Labor Organization           1           5
##   Market Advocacy or other Anti-Regulation Group           1           1
##   Unaffiliated Individual                                  7          11
##   Un-Coded                                                 0           0
##                                                 
##                                                  77 FR 41213 78 FR 75680
##   Major Wall Street Sell-Side Bank                         6           2
##   Other Sell-Side Bank                                     8           0
##   Core Financial Service Trade Association                 9           2
##   Financial Sector Association                            19           4
##   Public Asset Manager                                     0           0
##   Private Asset Manager                                    5          14
##   U.S. Chamber of Commerce or Affiliate                    1           1
##   Market Infrastructure Firm                               8           8
##   Law Firms, Consultants, and Related Advisors             4           2
##   Non-Financial Private-Sector Association                 5          44
##   Non-Financial Firm                                       8          57
##   Government                                              11           2
##   Academic or Other Expert                                 6           1
##   Consumer Advocacy or other Citizens Group               26           8
##   Trade Union or other Formal Labor Organization           4           0
##   Market Advocacy or other Anti-Regulation Group           1           1
##   Unaffiliated Individual                                 46           6
##   Un-Coded                                                 1           1
##                                                 
##                                                  76 FR 1214 77 FR 15460
##   Major Wall Street Sell-Side Bank                        6           5
##   Other Sell-Side Bank                                    6           4
##   Core Financial Service Trade Association               13           7
##   Financial Sector Association                           18          22
##   Public Asset Manager                                    0           0
##   Private Asset Manager                                  15          18
##   U.S. Chamber of Commerce or Affiliate                   2           1
##   Market Infrastructure Firm                             41          25
##   Law Firms, Consultants, and Related Advisors            6           4
##   Non-Financial Private-Sector Association                9          10
##   Non-Financial Firm                                      2           5
##   Government                                              5           3
##   Academic or Other Expert                                0           0
##   Consumer Advocacy or other Citizens Group               5           5
##   Trade Union or other Formal Labor Organization          0           0
##   Market Advocacy or other Anti-Regulation Group          0           2
##   Unaffiliated Individual                                 1           2
##   Un-Coded                                                0           0

#write to CSV for simple data export to excel or word
write.csv(cftc.Comments.coded.top10.counts, 
          file=paste(iterations," iterations/", "Comments_by_OrgType_Top10Rules_splitoutTable_n",nrow(cftc.Comments.coded.top10),".csv", sep=""))

# latex type setting
xtable(cftc.Comments.coded.top10.counts)

## % latex table generated in R 3.6.1 by xtable 1.8-4 package
## % Sun Dec 01 12:44:16 2019
## \begin{table}[ht]
## \centering
## \begin{tabular}{rrrrrrrrrrr}
##   \hline
##  & 76 FR 4752 & 75 FR 80747 & 75 FR 3281 & 77 FR 8332 & 75 FR 80174 & 75 FR 63732 & 77 FR 41213 & 78 FR 75680 & 76 FR 1214 & 77 FR 15460 \\ 
##   \hline
## Major Wall Street Sell-Side Bank &   7 &   2 &   3 &  25 &   7 &   6 &   6 &   2 &   6 &   5 \\ 
##   Other Sell-Side Bank &   4 &   1 &   0 &  27 &  11 &   4 &   8 &   0 &   6 &   4 \\ 
##   Core Financial Service Trade Association &   4 &   5 &   1 &  24 &  18 &   7 &   9 &   2 &  13 &   7 \\ 
##   Financial Sector Association &  15 &  11 &   0 &  89 &  33 &   5 &  19 &   4 &  18 &  22 \\ 
##   Public Asset Manager &   2 &   1 &   0 &   0 &   3 &   1 &   0 &   0 &   0 &   0 \\ 
##   Private Asset Manager &  86 &  14 &  90 &  53 &  50 &  19 &   5 &  14 &  15 &  18 \\ 
##   U.S. Chamber of Commerce or Affiliate &   2 &   2 &   0 &  18 &   6 &   0 &   1 &   1 &   2 &   1 \\ 
##   Market Infrastructure Firm &  46 &   2 &  77 &   8 &   3 &  30 &   8 &   8 &  41 &  25 \\ 
##   Law Firms, Consultants, and Related Advisors &  66 &  17 &  42 &  30 &  25 &   7 &   4 &   2 &   6 &   4 \\ 
##   Non-Financial Private-Sector Association & 115 &  37 &   4 &  10 &  43 &   8 &   5 &  44 &   9 &  10 \\ 
##   Non-Financial Firm & 465 &  47 & 136 &   7 &  28 &  15 &   8 &  57 &   2 &   5 \\ 
##   Government &  45 &   6 &   6 &  49 &  17 &  33 &  11 &   2 &   5 &   3 \\ 
##   Academic or Other Expert &  42 &   6 &  21 &  10 &   3 &   1 &   6 &   1 &   0 &   0 \\ 
##   Consumer Advocacy or other Citizens Group & 160 &  81 &   4 &  11 &   5 &  32 &  26 &   8 &   5 &   5 \\ 
##   Trade Union or other Formal Labor Organization &  13 &   6 &   0 &   0 &   1 &   5 &   4 &   0 &   0 &   0 \\ 
##   Market Advocacy or other Anti-Regulation Group &   8 &   1 &   0 &   1 &   1 &   1 &   1 &   1 &   0 &   2 \\ 
##   Unaffiliated Individual & 2197 & 415 &  24 &  15 &   7 &  11 &  46 &   6 &   1 &   2 \\ 
##   Un-Coded &  58 &   5 &  24 &   0 &   0 &   0 &   1 &   1 &   0 &   0 \\ 
##    \hline
## \end{tabular}
## \end{table}

4. Deprecated Additional Analyses

In our initial exploration of the topic models and topic proportions, these analyses were helpful to characterize the data. They are not used in the paper as they ultimately did not provide the insights which answered our research questions. They are retained here for future analysts looking to explore and characterize topic model data based on our approach.

4.1 Easily adjustable two-type comparison

Based on the 17 part proportions in 2.3, this section allows us to easily explore different commenter type combinations. It works by setting two types at the begining of the code block. To set the types, you must copy the exact name from the list below into the typeA and typeB variables.

“Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”, “Public Asset Manager”, “Private Asset Manager”, “U.S. Chamber of Commerce or Affiliate”, “Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”, “Non-Financial Firm”, “Private-Sector Association”, “Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”, “Un-Coded” <———————————————————-

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

## Set the commenter types you want to compare below
groupTypes = c(
  "industryGroup"
 ,"nonIndustryGroup"
  )

# average all the documents down to the 17 part typology in Classifications.with.Multivalue

typologyTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[2:length(names(documents.withTopicsAndMultivalues))] ### omit ID
                        ,"Classifications.with.Multivalue"
                        ,numcolwise(mean))

# report the dataframe where each row is one of the commenter types from VA.KSC, and each column is a topic.  Each cell will then be the average Proportions of the commenter type to mention a particular topic.  (Does this add up to 1 always?, probably not due to averaging)

typologyTopicProportions

# now, drop all lines from typologyTopicProportions which are not typeA and typeB
typeABtopicProportions <- typologyTopicProportions[typologyTopicProportions$Classifications.with.Multivalue %in% groupTypes,]

# confirm that the correct types were retained
typeABtopicProportions





###########################
#   creation of png file  #
#   (be sure to turn on   #
#   dev.off() line after  #
#    the chart)           #
###########################
#filename = paste("Topic_Proportions_",paste(groupTypes, collapse ="&"),"_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
#png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")



# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = paste("Topic Proportions in Comments from ",paste(groupTypes, collapse =" & ")," type Commenters", sep="")


#create a "tall" file which lists each of the variables vertically rather than a wide file where each variable is a column.
tempPlot <- melt(typeABtopicProportions, id.vars="Classifications.with.Multivalue")
colnames(tempPlot)<- c("Classifications.with.Multivalue","topic","Proportions") #make the names descriptive

ggplot(tempPlot, aes(Classifications.with.Multivalue, Proportions, fill=topic)) +geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=0,hjust=.5,vjust=1)) + xlab("Commenter Types") + ylab("Average Topic Proportions") +labs(title=paste(chartTitle,"\n (#topics = ",n.topics,")", sep=""))

###############
#dev.off()
###############

## make a nice pseudo-table of the Proportions
# this is the weird green circle table-type thing that we came up with since R doesn't do tables well.

#filename = paste("Topic_Proportions_",paste(groupTypes, collapse ="&"),"_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
#png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")

#g <- ggplot(tempPlot, aes(Classifications.with.Multivalue, topic)) + geom_point(aes(size = Proportions), color= "green") + theme_bw() + xlab("") + ylab("")
#g + scale_size_continuous(range=c(1,10)) + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1))  + geom_text(aes(label = round(Proportions, 4)), size=4)

#dev.off()

4.2 Basic Binary Industry vs. Non-Industry comparison

Following the data presentation in (Levy & Franklin 2014), we begin by presenting a comparison between two halves of the data: those commenter types who can be considered Industry and those which are non-Industry organization, academics, and private individuals.

Based on the 17 part typology Industry “Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”, “Public Asset Manager”, “Private Asset Manager”, “U.S. Chamber of Commerce or Affiliate”, “Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”, “Non-Financial Firm”, “Financial Sector Association”, “Non-Financial Private-Sector Association”,

Non-Industry “Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”, “Un-Coded” <———————————————————-

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# add in the Industry binary as a variable to our documents.withTopic dataframe

industryClassification<-c("Industry","Industry","Industry","Industry","Industry","Industry","Industry","Industry","Industry","Industry","Industry","Non-Industry","Non-Industry","Non-Industry","Non-Industry","Non-Industry","Non-Industry","")
### Last classification is blank and represents the Uncoded category.
### Omit blank line from furter analysis

indusClassificationDF <- data.frame(typology,industryClassification)

indusClassificationDF

temp <- merge(documents.withTopicsAndMultivalues, indusClassificationDF, by.x = "Classifications.with.Multivalue", by.y="typology") 


# average documents to two lines and N topics columns based on the list above
industryBinaryTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[3:length(names(documents.withTopicsAndMultivalues))] ### omit ID
                        ,"industryClassification"
                        ,numcolwise(mean))


# report the dataframe where each row is one of the commenter types from VA.KSC, and each column is a topic.  

industryBinaryTopicProportions

#to count the number of comments in each super type:
sum(1*(temp$industryClassification=="Industry"))
sum(1*(temp$industryClassification=="Non-Industry"))

#drop the uncoded
industryBinaryTopicProportions <- industryBinaryTopicProportions[2:nrow(industryBinaryTopicProportions),]

# report as a table where each cell is the Proportions of group [row] to mention topic [column]. Each cell will then be the average Proportions of the commenter super-type to mention a particular topic.  (Does this add up to 1 always?, probably not due to averaging, but it's pretty close)

# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = "Topic Proportions in Comments from Industry and Non-Industry"
filename = paste("Topic_Proportions_by_Industry_NonIndustry_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

tempPlot <- melt(industryBinaryTopicProportions, id.vars="industryClassification")
colnames(tempPlot)<- c("industryClassification","topic","Proportions")

png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")

ggplot(tempPlot, aes(industryClassification, Proportions, fill=topic)) +geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1)) + xlab("Commenter Types") + ylab("Average Topic Proportions") +labs(title=paste(chartTitle,"\n (n = ",nrow(documents.withTopicsAndMultivalues),")", sep=""))

dev.off()

# make a nice pseudo-table of the Proportions

filename = paste("Topic_Proportions_by_Industry_NonIndustry_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")
g <- ggplot(tempPlot, aes(industryClassification, topic)) + geom_point(aes(size = Proportions), color= "green") + theme_bw() + xlab("") + ylab("")
g + scale_size_continuous(range=c(1,10)) + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1))  + geom_text(aes(label = round(Proportions, 4)), size=4)
dev.off()

4.3 Four part comparison of Supertypes

Based on the 17 part typology

Industry Sell-Side

“Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”,

Industry Buy-Side (Derivative and Commodity Users)

“Public Asset Manager”, “Private Asset Manager”, “Non-Financial Firm”, “Financial Sector Association”, “Non-Financial Private-Sector Association”,

Industry Joint Sell-Side and Buy-Side

“U.S. Chamber of Commerce or Affiliate”, “Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”,

Non-Industry Organizations and Individuals

“Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”,

This group is dropped as it is categorically not categorizable

“Un-Coded” <———————————————————-

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# add in the Industry binary as a variable to our documents.withTopic dataframe

superType<-c(
   "Industry Sell-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Sell-Side"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"")

superTypeDF <- data.frame(typology,superType)

superTypeDF

rm(temp)
temp <- merge(documents.withTopicsAndMultivalues, superTypeDF, by.x = "Classifications.with.Multivalue", by.y="typology" ) 


##Note: there are 96 blanks due to the "Uncoded"" classification
sum(1*(temp$superType==""))

#to count the number of comments in each super type:
sum(1*(temp$superType=="Industry Sell-Side"))
sum(1*(temp$superType=="Industry Joint Sell-Side and Buy-Side"))
sum(1*(temp$superType=="Industry Buy-Side (Derivative and Commodity Users)"))
sum(1*(temp$superType=="Non-Industry"))

# average documents to two lines and N topics columns based on the list above
superTypeTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[3:length(names(documents.withTopicsAndMultivalues))] ### omit ID
                        ,"superType"
                        ,numcolwise(mean))


# report the dataframe where each row is one of the commenter superTypes and each column is a topic.  

superTypeTopicProportions

#drop the uncoded
superTypeTopicProportions <- superTypeTopicProportions[2:nrow(superTypeTopicProportions),]

# report as a table where each cell is the Proportions of group [row] to mention topic [column]. Each cell will then be the average Proportions of the commenter super-type to mention a particular topic.  (Does this add up to 1 always?, probably not due to averaging, but it's pretty close)

# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = "Topic Proportions in Comments from 4 Super Types"
filename = paste("Topic_Proportions_by_SuperType_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

tempPlot <- melt(superTypeTopicProportions, id.vars="superType")
colnames(tempPlot)<- c("superType","topic","Proportions")

png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")

ggplot(tempPlot, aes(superType, Proportions, fill=topic)) +geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1)) + xlab("Commenter Types") + ylab("Average Topic Proportions") +labs(title=paste(chartTitle,"\n (n = ",nrow(documents.withTopicsAndMultivalues),")", sep=""))

dev.off()

# make a nice pseudo-table of the Proportions

filename = paste("Topic_Proportions_by_SuperType_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")
g <- ggplot(tempPlot, aes(superType, topic)) + geom_point(aes(size = Proportions), color= "green") + theme_bw() + xlab("") + ylab("")
g + scale_size_continuous(range=c(1,10)) + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1))  + geom_text(aes(label = round(Proportions, 4)), size=5)
dev.off()

4.4 Five part comparison of Supertypes

Based on the 17 part typology

Industry Sell-Side “Major Wall Street Sell-Side Bank”, “Core Financial Service Trade Association”, “Other Sell-Side Bank”,

** Industry Buy-Side (Derivative and Commodity Users)** “Public Asset Manager”, “Private Asset Manager”, “Non-Financial Firm”, “Financial Sector Association”, “Non-Financial Private-Sector Association”,

Industry Joint Sell-Side and Buy-Side “U.S. Chamber of Commerce or Affiliate”,

Infrastructure and Market Services “Market Infrastructure Firm”, “Law Firms, Consultants, and Related Advisors”,

Non-Industry Organizations and Individuals “Government”, “Academic or Other Expert”, “Consumer Advocacy or other Citizens Group”, “Trade Union or other Formal Labor Organization”, “Market Advocacy or other Anti-Regulation Group”, “Unaffiliated Individual”,

This group is dropped as it is categorically not categorizable “Un-Coded” <———————————————————-

# reset the working dataframe to the backup to ensure clean data for every step
documents.withTopicsAndMultivalues <- documents.withTopicsAndMultivaluesBKUP_2.2

# add in the Industry binary as a variable to our documents.withTopic dataframe

FiveType<-c(
   "Industry Sell-Side"
  ,"Industry Sell-Side"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Joint Sell-Side and Buy-Side"
  ,"Infrastructure and Market Services"
  ,"Infrastructure and Market Services"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Industry Buy-Side (Derivative and Commodity Users)"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"Non-Industry"
  ,"")

FiveTypeDF <- data.frame(typology,FiveType)

FiveTypeDF

rm(temp)
temp <- merge(documents.withTopicsAndMultivalues, FiveTypeDF, by.x = "Classifications.with.Multivalue", by.y="typology" ) 


##Note: there are 96 blanks due to the "Uncoded"" classification
sum(1*(temp$FiveType==""))
##counts number of documents with a blank

sum(1*(temp$FiveType=="Industry Sell-Side"))
sum(1*(temp$FiveType=="Industry Joint Sell-Side and Buy-Side"))
sum(1*(temp$FiveType=="Industry Buy-Side (Derivative and Commodity Users)"))
sum(1*(temp$FiveType=="Infrastructure and Market Services"))
sum(1*(temp$FiveType=="Non-Industry"))
# average documents to two lines and N topics columns based on the list above
fiveTypeTopicProportions <- ddply(
                        documents.withTopicsAndMultivalues[3:length(names(documents.withTopicsAndMultivalues))] ### omit ID
                        ,"FiveType"
                        ,numcolwise(mean))


# report the dataframe where each row is one of the commenter superTypes and each column is a topic.  

fiveTypeTopicProportions

#drop the uncoded
fiveTypeTopicProportions <- fiveTypeTopicProportions[2:nrow(fiveTypeTopicProportions),]

# report as a table where each cell is the Proportions of group [row] to mention topic [column]. Each cell will then be the average Proportions of the commenter super-type to mention a particular topic.  (Does this add up to 1 always?, probably not due to averaging, but it's pretty close)

# Stacked Bar Chart of Proportions for Visual Comparison
chartTitle = "Topic Proportions in Comments In Five Super Types"
filename = paste("Topic_Proportions_by_Five_Types_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")

tempPlot <- melt(fiveTypeTopicProportions, id.vars="FiveType")
colnames(tempPlot)<- c("FiveType","topic","Proportions")

png(filename = paste("",filename,sep=""), width = 753, height = 578, units = "px")

ggplot(tempPlot, aes(FiveType, Proportions, fill=topic)) +geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1)) + xlab("Commenter Types") + ylab("Average Topic Proportions") +labs(title=paste(chartTitle,"\n (n = ",nrow(documents.withTopicsAndMultivalues),")", sep=""))

dev.off()

# make a nice pseudo-table of the Proportions
filename = paste("Topic_Proportions_by_Five_Types_dotsplot_",n.topics,"_topics_",format(Sys.time(), "%Y-%m-%d_%H.%M.%S"),".png", sep="")
png(filename = paste("",filename,sep=""), width = 1000, height = 578, units = "px")
g <- ggplot(tempPlot, aes(FiveType, topic)) + geom_point(aes(size = Proportions), color= "green") + theme_bw() + xlab("") + ylab("")
g + scale_size_continuous(range=c(1,10)) + theme(axis.text.x=element_text(angle=35,hjust=1,vjust=1))  + geom_text(aes(label = round(Proportions, 4)), size=5)
dev.off()

last revision: December 1, 2019

Setup Environment

0. Prepare Corpus

0.0 Combine the CFTC raw data file with the organization codings in this file from (2017-05-11) syntax file

0.1 Load the Data

0.1.1 Drop Ex Parte Meetings and Comments “None” type values in Organization

0.1.2 Get Unique Organization Values from the Corpus for Data Coding Purposes

0.2 Created a unified text column to solve the gaps in CommentText and ExtractedText

0.3 Make the 18-part typology a factor

1. Estimate Mallet Topics

1.1 Clean the comment database

1.1.1 Clean the database

1.1.2 Save a Clean Copy of the Database

1.2 Generate the mallet topic model

1.2.1 Run the topic model

2. Substantive Analyses of Groups of Comments (sub-Corpora)

2.1 Initial Topic Model Descriptives

2.2 Create a dataframe with the documents, meta-data, and topic Proportions

2.2.1 Deal with Comment Letters Signed by Multiple Types of Commentors

2.2.2 Backup the Analysis Dataframe

2.3 Generate and Display Topic Proportions for each of the 18 groups

2.3.1 Calculate the 18 Type Typology Topic Proportions

2.3.2 Create a Table of Aggregated Topic Proportions for Plotting

2.3.3 Stacked Bar Chart of Proportions for Visual Comparison

2.3.4 Generate a Heat-Map Table of Topic Proportions

2.3.4.1 Color Landscape Heatmap

2.3.4.2 Color Portrait Heatmap

2.4 Sample of Representative Comment Letters from Each Topic

2.4.1 Select ALL Comment Letters which are {1,2} standard deviations above the mean topic proportions

2.4.2 Select Top XX highest topic proportion Comment Letters

2.4.3 Select Median 5 Comment letters closest to the median Topic Proportion

3. Who Commented on the 10 most commented rules in the Coded (8k) dataset

3.1 The top 10 Rules by number of comments

3.2 Getting the comments from the coded dataset

3.3 Stacked Bar Charts of Commenter type

3.3.1 Top 10 Rules Combined

3.3.2 11 to N Rules Combined for Comparison

3.4 Top 10 Commented Rules split out by Rule

3.4.1 Top 1 to 10 Rules Split Out

3.4.2 Top 2 to 10 Rules (For easier comparison)

3.5 Create a Table of the number of comments by type for each of the top 10 rules

4. Deprecated Additional Analyses

4.1 Easily adjustable two-type comparison

4.2 Basic Binary Industry vs. Non-Industry comparison

4.3 Four part comparison of Supertypes

4.4 Five part comparison of Supertypes