Deeply Trivial: computers

Showing posts with label computers. Show all posts

Monday, March 11, 2019

Statistics Sunday: Scatterplots and Correlations with ggpairs

As I conduct some analysis for a content validation study, I wanted to quickly blog about a fun plot I discovered today: ggpairs, which displays scatterplots and correlations in a grid for a set of variables.

To demonstrate, I'll return to my Facebook dataset, which I used for some of last year's R analysis demonstrations. You can find the dataset, a minicodebook, and code on importing into R here. Then use the code from this post to compute the following variables: RRS, CESD, Extraversion, Agree, Consc, EmoSt, Openness. These correspond to measures of rumination, depression, and the Big Five personality traits. We could easily request correlations for these 7 variables. But if I wanted scatterplots plus correlations for all 7, I can easily request it with ggpairs then listing out the columns from my dataset I want included on the plot:

library(ggplot2)
ggpairs(Facebook[,c(112,116,122:126)]

(Note: I also computed the 3 RRS subscales, which is why the column numbers above skip from 112 (RRS) to 116 (CESD). You might need to adjust the column numbers when you run the analysis yourself.)

The results look like this:

Since the grid is the number of variables squared, I wouldn't recommend this type of plot for a large number of variables.

Thursday, February 28, 2019

A New Trauma Population for the Social Media Age

Even if you aren't a Facebook use, you're probably aware that there are rules about what you can and cannot post. Images or videos that depict violence or illegal behavior would of course be taken down. But who decides that? You as a user can always report an image or video (or person or group) if you think it violates community standards. But obviously, Facebook doesn't want to traumatize its users if it can be avoided.

That's where the employees of companies like Cognizant come in. It's their job to watch some of the most disturbing content on the internet - and it's even worse than it sounds. In this fascinating article for The Verge, Casey Newton describes just how traumatic doing such a job can be. (Content warning - this post has lots of references to violence, suicide, and mental illness.)

The problem with the way these companies do business is that, not only do employees see violent and disturbing content; they also don't have the opportunity to talk about what they see with their support networks:

Over the past three months, I interviewed a dozen current and former employees of Cognizant in Phoenix. All had signed non-disclosure agreements with Cognizant in which they pledged not to discuss their work for Facebook — or even acknowledge that Facebook is Cognizant’s client. The shroud of secrecy is meant to protect employees from users who may be angry about a content moderation decision and seek to resolve it with a known Facebook contractor. The NDAs are also meant to prevent contractors from sharing Facebook users’ personal information with the outside world, at a time of intense scrutiny over data privacy issues.

But the secrecy also insulates Cognizant and Facebook from criticism about their working conditions, moderators told me. They are pressured not to discuss the emotional toll that their job takes on them, even with loved ones, leading to increased feelings of isolation and anxiety.

The moderators told me it’s a place where the conspiracy videos and memes that they see each day gradually lead them to embrace fringe views. One auditor walks the floor promoting the idea that the Earth is flat. A former employee told me he has begun to question certain aspects of the Holocaust. Another former employee, who told me he has mapped every escape route out of his house and sleeps with a gun at his side, said: “I no longer believe 9/11 was a terrorist attack.”

It's a fascinating read on an industry I really wasn't aware existed, and a population that could be diagnosed with PTSD and other responses to trauma.

Thursday, October 4, 2018

Resistance is Futile

In yet another instance of science imitating science fiction, scientists figured out how to create a human hive mind:

A team from the University of Washington (UW) and Carnegie Mellon University has developed a system, known as BrainNet, which allows three people to communicate with one another using only the power of their brain, according to a paper published on the pre-print server arXiv.

In the experiments, two participants (the senders) were fitted with electrodes on the scalp to detect and record their own brainwaves—patterns of electrical activity in the brain—using a method known as electroencephalography (EEG). The third participant (the receiver) was fitted with electrodes which enabled them to receive and read brainwaves from the two senders via a technique called transcranial magnetic stimulation (TMS).

The trio were asked to collaborate using brain-to-brain interactions to solve a task that each of them individually would not be able to complete. The task involved a simplified Tetris-style game in which the players had to decide whether or not to rotate a shape by 180 degrees in order to correctly fill a gap in a line at the bottom of the computer screen.

All of the participants watched the game, although the receiver was in charge of executing the action. The catch is that the receiver was not able to see the bottom half of their screen, so they had to rely on information sent by the two senders using only their minds in order to play.

This system is the first successful demonstration of a “multi-person, non-invasive, direct, brain-to-brain interaction for solving a task,” according to the researchers. There is no reason, they argue, that BrainNet could not be expanded to include as many people as desired, opening up a raft of possibilities for the future.

Pretty cool, but...

Tuesday, September 11, 2018

No Take Bachs

About a week ago, Boing Boing published a story with a shocking claim: you can't post performance videos of Bach's music because Sony owns the compositions.

Wait, what?

James Rhodes, a pianist, performed a Bach composition for his Facebook account, but it didn't go up -- Facebook's copyright filtering system pulled it down and accused him of copyright infringement because Sony Music Global had claimed that they owned 47 seconds' worth of his personal performance of a song whose composer has been dead for 300 years.

You don't need to be good at math to know that this claim must be false. Sony can't possibly own compositions that are clearly in the public domain. What this highlights though is that something, while untrue in theory, can be true in practice. Free Beacon explains:

As it happens, the company genuinely does hold the copyright for several major Bach recordings, a collection crowned by Glenn Gould's performances. The YouTube claim was not that Sony owned Bach's music in itself. Rather, YouTube conveyed Sony's claim that Rhodes had recycled portions of a particular performance of Bach from a Sony recording.

The fact that James Rhodes was actually playing should have been enough to halt any sane person from filing the complaint. But that's the real point of the story. No sane person was involved, because no actual person was involved. It all happened mechanically, from the application of the algorithms in Youtube's Content ID system. A crawling bot obtained a complex digital signature for the sound in Rhodes's YouTube posting. The system compared that signature to its database of registered recordings and found a partial match of 47 seconds. The system then automatically deleted the video and sent a "dispute claim" to Rhodes's YouTube channel. It was a process entirely lacking in human eyes or human ears. Human sanity, for that matter.

Does Sony own copyright on Bach in theory? No, absolutely not. But this system, which scans for similarity in the audio, is making this claim true in practice: performers of Bach's music will be flagged automatically by the system as using copyrighted content, and attacked with take-down notices and/or having their videos deleted altogether. There's only so much one can do with interpretation and tempo to change the sound, and while skill of the performer will also impact the audio, to a computer, the same notes and same tempo will sound the same.

Automation is being added to this and many related cases to take out the bias of human judgment. This leads to a variety of problems with technology running rampant and affecting lives, as has been highlighted in recent books like Weapons of Math Destruction and Technically Wrong.

A human being watching Rhodes's video would be able to tell right away that no copyright infringement took place. Rhodes was playing the same composition played in a performance owned by Sony - it's the same source material, which is clearly in the public domain, rather than the same recording, which is not public domain.

This situation is also being twisted into a way to make money:

[T]he German music professor Ulrich Kaiser wanted to develop a YouTube channel with free performances for teaching classical music. The first posting "explained my project, while examples of the music played in the background. Less than three minutes after uploading, I received a notification that there was a Content ID claim against my video." So he opened a different YouTube account called "Labeltest" to explore why he was receiving claims against public-domain music. Notices from YouTube quickly arrived for works by Bartok, Schubert, Puccini, Wagner, and Beethoven. Typically, they read, "Copyrighted content was found in your video. The claimant allows its content to be used in your YouTube video. However, advertisements may be displayed."

And that "advertisements may be displayed" is the key. Professor Kaiser wanted an ad-free channel, but his attempts to take advantage of copyright-free music quickly found someone trying to impose advertising on him—and thereby to claim some of the small sums that advertising on a minor YouTube channel would generate.

Last January, an Australian music teacher named Sebastian Tomczak had a similar experience. He posted on YouTube a 10-hour recording of white noise as an experiment. "I was interested in listening to continuous sounds of various types, and how our perception of these kinds of sounds and our attention changes over longer periods," he wrote of his project. Most listeners would probably wonder how white noise, chaotic and random by its nature, could qualify as a copyrightable composition (and wonder as well how anyone could get through 10 hours of it). But within days, the upload had five different copyright claims filed against it. All five would allow continued use of the material, the notices explained, if Tomczak allowed the upload to be "monetized," meaning accompanied by advertisements from which the claimants would get a share.

Wednesday, June 6, 2018

The Importance of Training Data

In the movie Arrival, 12 alien ships visit Earth, landing at various locations around the world. Dr. Louise Banks, a linguist, is brought in to help make contact with the aliens who have landed in Montana.

With no experience with their language, Louise instead teaches the aliens, which they call heptapods, English while they teach her their own language. Other countries around the world follow suit. But China seems to view the heptapods with suspicion that turns into outright hostility later on.

Dr. Banks learns that China was using Mahjong to teach/communicate with the heptapods. She points out that using a game in this way changes the nature of the communication - everything becomes a competition with winners and loser. As they said in the movie, paraphrasing something said by Abraham Maslow, "If all I ever gave you was a hammer, everything is a nail."

Training material matters and can drastically affect the outcome. Just look at Norman, the psychopathic AI developed at MIT. As described in an article from BBC:

The psychopathic algorithm was created by a team at the Massachusetts Institute of Technology, as part of an experiment to see what training AI on data from "the dark corners of the net" would do to its world view.

The software was shown images of people dying in gruesome circumstances, culled from a group on the website Reddit.

Then the AI, which can interpret pictures and describe what it sees in text form, was shown inkblot drawings and asked what it saw in them.

Norman's view was unremittingly bleak - it saw dead bodies, blood and destruction in every image.

Alongside Norman, another AI was trained on more normal images of cats, birds and people.

It saw far more cheerful images in the same abstract blots.

The fact that Norman's responses were so much darker illustrates a harsh reality in the new world of machine learning, said Prof Iyad Rahwan, part of the three-person team from MIT's Media Lab which developed Norman.

"Data matters more than the algorithm."

This finding is especially important when you consider how AI has and can be used - for instance, in assessing risk of reoffending among parolees or determining which credit card applications to approve. If the training data is flawed, the outcome will be too. In fact, a book I read last year, Weapons of Math Destruction, discusses how decisions made by AI can reflect the biases of their creators. I highly recommend reading it!

Tuesday, May 8, 2018

"GIF" or "JIF"?

Now we can put the debate to rest:

Monday, May 7, 2018

New Database, Who Dis?

Next week, my company is launching a new database to maintain our many records: anyone who has taken one of our exams, certificants (people who have completed sets of exams that entitle them to a certificate), schools, companies, and so on. Our old system, which I found out today is 20 years old, goes down Wednesday evening, and then we'll begin transferring over 4 million records into the new system. (We've done some test runs of transferring data and extensive mapping for many months now. And I've been scoring and QCing exams in a Dev environment for almost that long.)

It's been a long, busy process to get to this point, working with multiple vendors, and I've been spending the last several months preparing us to score our exams in the new system. This project has been good practice for, among other things, understanding how we convert Rasch scores to our standard scaling, how we create item strings for scoring computer adaptive exams, and keeping my SQL skills fresh. And since I'm currently a one-person department, as a coworker is out on maternity leave and another left to take another job, I'm completing job tasks for 2.5 to 3 people every day, on top of database stuff.

Blogging will likely be on hold this week and I may not have a Statistics Sunday post, since I'll be working all-day Saturday and most/all of Sunday. But I'll see if I can squeeze some time in for something.

Tuesday, April 24, 2018

U is for (Data From) URLs

Working with URLs in R Up to now, we've been working with files saved locally on your computer. But that limits you to files that can be easily saved to your computer and, up to now, structured data. As we move from pure statistics to a data science approach, more and more, we'll be working with data stored somewhere in the cloud. Today, I'd like to introduce how to begin working with data available from URLs. There are many directions you can take, depending on the type of data you're working with and how you plan to analyze it, so we'll only be scratching the surface of what could lead to multiple more advanced topics.

URL stands for Uniform Resource Locator. It can refer to a webpage (http), file transfer protocol (ftp), email (mailto), and any number of other resources. We'll focus today on collecting data from webpages.

Let's start with the simplest way to access data at a URL - using the same syntax we've used to read delimited files saved locally. For this example, I'm using Chris Albon's War of the Five Kings dataset, which contains data on battles from George R.R. Martin's A Song of Ice and Fire series (Game of Thrones for those of you familiar with the HBO show based on the series.)

The data is in a CSV file, and contains data on 25 variables for 38 battles. Unfortunately, the dataset hasn't been updated since 2014, but it's based on the books rather than the show. It's fine for our present example. Since it's a CSV file, we can use a read CSV function to read the data into an R data frame. (I selected read_csv from the tidyverse, to create a tibble.) Where we would normally put the path for our locally saved file, we use the path to the CSV file saved on GitHub.

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.4.4

## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.4
## v tidyr   0.8.0     v stringr 1.2.0
## v readr   1.1.1     v forcats 0.2.0

## Warning: package 'tibble' was built under R version 3.4.4

## Warning: package 'tidyr' was built under R version 3.4.4

## Warning: package 'purrr' was built under R version 3.4.4

## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Battles<-read_csv('https://raw.githubusercontent.com/chrisalbon/war_of_the_five_kings_dataset/master/5kings_battles_v1.csv')

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   year = col_integer(),
##   battle_number = col_integer(),
##   major_death = col_integer(),
##   major_capture = col_integer(),
##   attacker_size = col_integer(),
##   defender_size = col_integer(),
##   summer = col_integer()
## )

## See spec(...) for full column specifications.

You should now have a data frame called Battles, containing 38 observations on 25 variables. You could run some simple descriptives on the dataset but a fun analysis might be adopt a network analysis approach, and map out the attacker and defender kings, to show who is attacking whom and demonstrate the insanity that is ASoIaF. Because that kind of nerdy analysis sounded fun, I went ahead and did it, basing my code on code from this post. To keep this post from getting overly long, I made this a separate post, available here.

Another way to access data from URLs is to read in an HTML table. ESPN provides tons of sports-related data on their site, often in the form of HTML tables. Let's practice pulling some of the NCAA Division I men's basketball data, specifically their team scoring per game. This table also provides a few other variables, like percentages of field goals, 3-pointers, and free throws made. One of the best ways to cleanly scrape these data is with the XML package.

library(XML)
scoring_game<-readHTMLTable('http://www.espn.com/mens-college-basketball/statistics/team/_/stat/scoring-per-game/sort/avgPoints/count/', which=1)

The which argument tells which HTML table to read in. There's only 1 on the page, but without specifying, readHTMLTable will read the entire page for all tables and create a list object that contains each table as a separate data frame, embedded within the list (even if it only finds 1 table). But since I wanted to go straight to the data frame without going through the list first, I specified the table to read. Now we have a data frame called scoring_game, with columns named after the column headings from the HTML tables. There are a few quirks to this data frame we want to fix. First, every 10 rows, ESPN adds an additional header row, which R read in as a data row. I want to drop those, since they don't contain any data. Second, each team is ranked, but when there are ties, the rank cell is blank, which R filled in with a special character (Â). Rather than manually cleaning up these data, I'll just have R generate new ranks for this data frame, allowing for ties, and drop the RK variable.

scoring_game<-scoring_game[-c(11,22,33),]
library(tidyverse)
scoring_game %>% mutate(rank = min_rank(desc(PTS)))

##    RK              TEAM GP  PTS   FGM-FGA  FG%   3PM-3PA  3P%   FTM-FTA
## 1   1     UNC Asheville  1 98.0 35.0-75.0 .467 14.0-31.0 .452 14.0-18.0
## 2   2            Oregon  2 95.5 31.0-55.0 .564  9.0-19.5 .462 24.5-32.5
## 3   3           Wofford  1 94.0 34.0-63.0 .540 15.0-31.0 .484 11.0-16.0
## 4   4           Seattle  1 90.0 33.0-72.0 .458 12.0-31.0 .387 12.0-21.0
## 5   5               USC  2 89.0 34.5-70.0 .493 10.0-25.5 .392 10.0-21.0
## 6   Â        Fort Wayne  1 89.0 33.0-60.0 .550 11.0-33.0 .333 12.0-15.0
## 7   7  Central Michigan  3 87.7 30.7-63.7 .482 14.7-36.3 .404 11.7-14.3
## 8   8        Miami (OH)  1 87.0 33.0-60.0 .550 13.0-26.0 .500   8.0-9.0
## 9   9        Seton Hall  2 86.5 28.5-61.0 .467  8.5-22.5 .378 21.0-27.5
## 10 10            Xavier  2 86.0 29.0-56.5 .513  8.0-18.5 .432 20.0-31.0
## 11  Â             Rider  1 86.0 35.0-76.0 .461  6.0-22.0 .273 10.0-16.0
## 12 12     West Virginia  3 85.7 30.7-66.0 .465  7.7-21.3 .359 16.7-21.7
## 13 13 Northern Colorado  4 85.5 30.3-60.8 .498 10.0-26.0 .385 15.0-23.3
## 14 14         Villanova  6 83.8 27.5-58.0 .474 12.7-30.5 .415 16.2-19.8
## 15 15          NC State  1 83.0 28.0-61.0 .459 11.0-30.0 .367 16.0-27.0
## 16  Â             Texas  1 83.0 31.0-67.0 .463 11.0-24.0 .458 10.0-18.0
## 17  Â               BYU  1 83.0 34.0-76.0 .447  6.0-27.0 .222  9.0-15.0
## 18  Â     Virginia Tech  1 83.0 30.0-54.0 .556  9.0-18.0 .500 14.0-20.0
## 19 19         Marquette  3 82.7 27.7-56.0 .494 10.3-23.7 .437 17.0-19.0
## 20 20       North Texas  6 82.5 28.7-60.3 .475  8.7-22.8 .380 16.5-22.8
## 21  Â        Ohio State  2 82.5 28.5-69.0 .413 11.5-33.0 .348 14.0-17.5
## 22 22           Buffalo  2 82.0 30.0-64.5 .465 11.0-30.5 .361 11.0-14.5
## 23 23              Duke  4 81.5 29.3-61.0 .480  8.8-26.5 .330 14.3-19.5
## 24 24            Kansas  5 80.6 28.2-61.6 .458  9.2-23.4 .393 15.0-20.0
## 25 25       Austin Peay  2 80.5 31.5-68.0 .463  4.5-19.5 .231 13.0-21.5
## 26 26       Utah Valley  2 80.0 28.5-56.5 .504  7.5-17.0 .441 15.5-18.5
## 27 27           Clemson  3 79.7 30.0-61.7 .486  7.3-20.0 .367 12.3-17.0
## 28 28  Middle Tennessee  2 79.5 32.0-55.5 .577  7.5-18.0 .417  8.0-11.0
## 29 29        Washington  2 79.0 29.5-57.5 .513  8.5-20.0 .425 11.5-17.5
## 30 30  Western Kentucky  4 78.5 29.0-59.8 .485  5.3-16.5 .318 15.3-19.3
## 31  Â            Baylor  2 78.5 30.5-59.0 .517  6.5-16.0 .406 11.0-17.5
## 32 32    Oklahoma State  3 78.3 24.7-67.0 .368  8.3-22.7 .368 20.7-28.7
## 33 33          Oklahoma  1 78.0 29.0-69.0 .420  4.0-20.0 .200 16.0-24.0
## 34  Â          Bucknell  1 78.0 23.0-55.0 .418 11.0-20.0 .550 21.0-28.0
## 35  Â          Canisius  1 78.0 23.0-61.0 .377  7.0-29.0 .241 25.0-36.0
## 36 36               LSU  2 77.5 27.5-57.0 .482  6.0-24.0 .250 16.5-23.0
## 37 37      Saint Mary's  3 77.3 29.7-57.0 .520  9.7-20.3 .475  8.3-12.3
## 38 38          Kentucky  3 77.0 26.0-52.3 .497  3.3-11.0 .303 21.7-30.7
## 39  Â         Texas A&M  3 77.0 29.7-59.7 .497  6.3-18.3 .345 11.3-19.0
## 40  Â      South Dakota  1 77.0 26.0-70.0 .371  6.0-33.0 .182 19.0-29.0
##     FT% rank
## 1  .778    1
## 2  .754    2
## 3  .688    3
## 4  .571    4
## 5  .476    5
## 6  .800    5
## 7  .814    7
## 8  .889    8
## 9  .764    9
## 10 .645   10
## 11 .625   10
## 12 .769   12
## 13 .645   13
## 14 .815   14
## 15 .593   15
## 16 .556   15
## 17 .600   15
## 18 .700   15
## 19 .895   19
## 20 .723   20
## 21 .800   20
## 22 .759   22
## 23 .731   23
## 24 .750   24
## 25 .605   25
## 26 .838   26
## 27 .725   27
## 28 .727   28
## 29 .657   29
## 30 .792   30
## 31 .629   30
## 32 .721   32
## 33 .667   33
## 34 .750   33
## 35 .694   33
## 36 .717   36
## 37 .676   37
## 38 .707   38
## 39 .596   38
## 40 .655   38

scoring_game<-scoring_game[,-1]

This isn't the full dataset, since ESPN spreads their tables across multiple pages. So we'd need to write additional code to scrape data from all of these tables and combine them. But this should be enough to get you started with scraping data from HTML tables. (Look for future posts digging into these different topics.)

Last, let's lightly scratch the surface of reading in some unstructured data. First, we could read in a text file from a URL. Project Gutenberg includes tens of thousands of free books.¹ We can read in the text of Mary Shelley's Frankenstein, using the readLines function.

Frankenstein<-readLines('http://www.gutenberg.org/files/84/84-0.txt')

Now we have an R object with every line from the text file. The beginning and end of the file contain standard text from Project Gutenberg, which we probably want to remove before we started any analysis of this file. We can find beginning and end with grep.

grep("START", Frankenstein)

## [1]   22 7493

grep("END", Frankenstein)

## [1] 7460 7781

We can use those row numbers to delete all of the standard text, leaving only the manuscript itself in a data frame (which will require some additional cleaning before we begin analysis).

Frankensteintext<-data.frame(Frankenstein[22:7460])

We can also use readLines to read in HTML files, though the results are a bit messy, and an HTML parser is needed. To demonstrate, the last example involves a classical psychology text, On the Witness Stand by Hugo Munsterberg, considered by many (including me) to be the founder of psychology and law as a subfield. His work, which is a collection of essays on the role psychologists can play in the courtroom, is available full-text online. It's a quick, fascinating read for anyone interested in applied psychology and/or law. Let's read in the first part of his book, the introduction.

Munsterberg1<-readLines("http://psychclassics.yorku.ca/Munster/Witness/introduction.htm")

This object contains his text, but also HTML code, which we want to remove. We can do this with the htmlParse function.

library(XML)
library(RCurl)

## Loading required package: bitops

## 
## Attaching package: 'RCurl'

## The following object is masked from 'package:tidyr':
## 
##     complete

Mdoc<-htmlParse(Munsterberg1, asText=TRUE)
plain.text<-xpathSApply(Mdoc, "//p", xmlValue)
Munsterberg<-data.frame(plain.text)

This separates the text between opening and closing paragraph tags into cells in a data frame. We would need to do more cleaning and work on this file to use it for analysis. Another post for another day!

¹There's an R package called gutenbergr that gives you direct access to Project Gutenberg books. If you'd like to analyze some of those works, it's best to start there. As usual, this was purely for demonstration.

Sunday, April 15, 2018

Statistics Sunday: Fit Statistics in Structural Equation Modeling

Fit Measures In my video on interpreting confirmatory factor analysis output, I promised a post on the various fit statistics. And here we are! As I said in the video, when you conduct structural equation modeling, the program is comparing the observed data - specifically the observed covariance matrix - to the model-specified covariance matrix. Fit refers to how well those two things match up. Any metric examining fit is using those pieces of information.

There are two types of fit statistics in structural equation modeling: absolute fit and relative fit. When assessing model fit, you should use a combination of both, though nearly all of them are derived from chi-square, which is neither a measure of absolute or relative fit, in some way. So let's start there.

Chi-Square

Chi-square is an exception to the absolute versus relative fit dichotomy. It's a measure of exact fit: does your model fit the data? Any deviations between the observed covariance matrix and the model-specified covariance matrix are tallied up, giving an overall metric of the difference between observed and model-specified. If the chi-square is not significant, the model fits your data. If it is significant, the model does not fit your data.

The problem is that chi-square is biased to be significant with large sample sizes and/or large correlations between variables. So for many models, your chi-square will indicate the model does not fit the data, even if it's actually a good model. One way to correct for this is with the normed chi-square I mentioned in the video: divide chi-square by your degrees of freedom. There is no agreed upon cutoff value for normed chi-square. Personally, I use the critical value for a chi-square with 1 degree of freedom, 3.841. I've been told that's too liberal and also too conservative. Like I said: no agreed upon cutoff value.

But chi-square is still very useful for two reasons. First, we use it to compute other fit indices. I'll talk about that next. Second, we can use it to compare nested models. You can find out more about that a little farther down in this post.

You may ask, then - if chi-square is biased to be significant, why do we use it for all of our other fit indices? The calculations conducted to create these different fit indices are meant to correct for these biases in different ways, factoring in things like sample size or model complexity. That underlying bias is there, though, and there are many different ways to try to correct for it, each way with its own flaws. This is why you should look at a range of fit indices.

Because your fit indices are based on chi-square, which is given to you by whatever statistical program you use to conduct your SEM, you can compute any fit index, even if your program doesn't give them to you.

Measures of Absolute Fit

These measures are based on the assumption that the perfect model has a fit of 0 - or rather, no deviation between observed and model-specified covariance matrices. As a result, these measures tell you how much worse your model is than the theoretically perfect model, and are sometimes called badness of fit measures. For these measures, smaller is better.

Root Mean Square Error of Approximation (RMSEA)

Chi-square is a little like ANOVA in how it deals with variance. This is why it's chi-square; we measure deviations from central tendency by squaring them, to keep them from adding up to 0. The same thing is done in ANOVA: squared deviations are added up, which produce the sum of squares. This value is divided by degrees of freedom, to produce the mean square, which is then used in the calculation of the F statistic. RMSEA is calculated in a very similar way as this process of creating sum of squares then mean square:

√(χ² - df) / √[df(N-1)]

where df is degrees of freedom and N is total sample size.

Chi-square is biased to be significant, so the higher the degrees of freedom, the higher the chi-square will likely be. In fact, the expected value of chi-square is equal to its degrees of freedom. The expected value of RMSEA for a perfectly fitting model, then, is 0, since in the equation above, degrees of freedom is subtracted from chi-square. There is not one single agreed upon cutoff for RMSEA, though 0.05 and 0.07 are commonly used.

Let's look once again at the fit measures from the Satisfaction with Life confirmatory factor analysis. In fact, here's a trick I didn't introduce previously - while including fit.measures=TRUE in the summary function will give you only a small number of fit measures, you can access more information with fitMeasures:

Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
SWL_Model<-'SWL =~ LS1 + LS2 + LS3 + LS4 + LS5'
library(lavaan)

## This is lavaan 0.5-23.1097

## lavaan is BETA software! Please report any bugs.

SWL_Fit<-cfa(SWL_Model, data=Facebook)
fitMeasures(SWL_Fit)

##                npar                fmin               chisq 
##              10.000               0.052              26.760 
##                  df              pvalue      baseline.chisq 
##               5.000               0.000             635.988 
##         baseline.df     baseline.pvalue                 cfi 
##              10.000               0.000               0.965 
##                 tli                nnfi                 rfi 
##               0.930               0.930               0.916 
##                 nfi                pnfi                 ifi 
##               0.958               0.479               0.966 
##                 rni                logl   unrestricted.logl 
##               0.965           -2111.647           -2098.267 
##                 aic                 bic              ntotal 
##            4243.294            4278.785             257.000 
##                bic2               rmsea      rmsea.ci.lower 
##            4247.082               0.130               0.084 
##      rmsea.ci.upper        rmsea.pvalue                 rmr 
##               0.181               0.003               0.106 
##          rmr_nomean                srmr        srmr_bentler 
##               0.106               0.040               0.040 
## srmr_bentler_nomean         srmr_bollen  srmr_bollen_nomean 
##               0.040               0.040               0.040 
##          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.040               0.040             107.321 
##               cn_01                 gfi                agfi 
##             145.888               0.959               0.876 
##                pgfi                 mfi                ecvi 
##               0.320               0.959               0.182

The RMSEA is 0.13. We can recreate this using the model chi-square (called chisq above), degrees of freedom (df), and sample size (ntotal):

chi.sq=26.76
df = 5
N = 257
sqrt(chi.sq-df)/sqrt(df*(N-1))

## [1] 0.130384

Standardized Root Mean Square Residual

The standardized root mean square residual (SRMR) is the average square root of the residual between the observed covariance matrix and the model-specified covariance matrix, which has been standardized to range between 0 and 1. Unlike some of the other fit indices I discuss here, SRMR is biased to be larger for models with few degrees of freedom or small sample size. This means SRMR has the unusual characteristic of being smaller (i.e., showing better fit) for more complex models. If you remember from the CFA post and video, both models showed poor fit for many of the fit indices but showed good fit based on SRMR. In essence, SRMR rewards something that is penalized with other fit indices. Also unlike the other fit indices discussed here, SRMR is not based on chi-square; you can read more about its calculation here.

Measures of Relative Fit

In addition to measures of absolute fit, which deal with deviations of the observed covariance matrix from the model-specified covariance matrix, we have measures of relative fit, which compare our model to another theoretical model, the null model, sometimes called the independence model. This model assumes that all variables included are independent, or uncorrelated with each other. This is basically the worst possible model, and fit measures using this model can be thought of as goodness of fit measures - how much better does your model fit than the worst possible model you could have? In the fit measures output, this value is called the Baseline Chi-Square. So let's create a new variable to use in our calculations called "null", which uses this baseline chi-square value.

null=634.988
null.df=10

In general, closer to 1 is better. Anything lower than 0.9 would be considered poor fit. If any of these formulas produce a value higher than 1, the fit measure is set at 1.

Normed Fit Index (NFI)

According to David Kenny, this was the first fit measure proposed in the literature. It's computed as the difference between the null and observed model chi-squares, divided by the null chi-square.

(null-chi.sq)/null

## [1] 0.9578575

This measure doesn't provide any kind of correction for more complex models, so it isn't recommended for use. (Although, when I was in grad school, which wasn't that long ago, it was one of the recommended measures in my SEM course. How quickly things change...)

Tucker-Lewis Index (TLI)

This measure is also sometimes call the Non-Normed Fit Index (NNFI). It is similar to the NFI but corrects for more complex models, by taking a ratio of each chi-square and its corresponding degrees of freedom.

((null/null.df)-(chi.sq/df))/((null/null.df)-1)

## [1] 0.9303667

Comparative Fit Index (CFI)

CFI provides a very similar, and slightly elevated, estimate as the NNFI/TLI. The penalty for complexity is smaller than for the TLI. Instead of taking a ratio of chi-square to degrees of freedom, CFI uses the difference between chi-square and the corresponding degrees of freedom.

((null-null.df)-(chi.sq-df))/(null-null.df)

## [1] 0.9651833

There are many other fit indices you'll see listed in the fit measures output. GFI and AGFI (which are actually absolute fit measures) were developed by the creators of the LISREL software and are automatically computed by that program. However, pretty much everything else I've read said not to use these fit indices. (Again, different from what I heard in grad school.) I prefer to use CFI and TLI. CFI is always going to be higher than TLI, because it penalizes you less for model complexity than the TLI. So using both gives you a sort of range of goodness of fit, with lower end of the continuum (TLI) being more conservative than the other (CFI). They're similar, so they'll often tell you the same thing, but you can run into the situation of having a TLI just below your cutoff and CFI just above it.

Comparing Nested Models

I mentioned in the video the idea of nested versus non-nested models. First, let's talk about nested models. A nested model is another model you specify that has the same structure but adds or drops paths. For instance, I conducted two three-factor models using the Rumination Response Scale: one in which the 3 factors were allowed to correlate with each other and another where they were considered orthogonal (uncorrelated). If I drew out these two models, they would look the same except that one would have curved arrows between the 3 factors to reflect the correlations and the other would not. Because I'm comparing two models with the same structure, I can test the impact of that change with my chi-square values.

RRS_Model<- '
  Depression =~ Rum1 + Rum2 + Rum3 + Rum4 + Rum6 + Rum8 + 
    Rum9 + Rum14 + Rum17 + Rum18 + Rum19 + Rum22
  Reflecting =~ Rum7 + Rum11 + Rum12 + Rum20 + Rum21
  Brooding =~ Rum5 + Rum10 + Rum13 + Rum15 + Rum16
'
RRS_Fit<-cfa(RRS_Model, data=Facebook)
RRS_Fit2<-cfa(RRS_Model, data=Facebook, orthogonal=TRUE)
summary(RRS_Fit)

## lavaan (0.5-23.1097) converged normally after  40 iterations
## 
##   Number of observations                           257
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              600.311
##   Degrees of freedom                               206
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~                                       
##     Rum1              1.000                           
##     Rum2              0.867    0.124    6.965    0.000
##     Rum3              0.840    0.124    6.797    0.000
##     Rum4              0.976    0.126    7.732    0.000
##     Rum6              1.167    0.140    8.357    0.000
##     Rum8              1.147    0.141    8.132    0.000
##     Rum9              1.095    0.136    8.061    0.000
##     Rum14             1.191    0.135    8.845    0.000
##     Rum17             1.261    0.141    8.965    0.000
##     Rum18             1.265    0.142    8.887    0.000
##     Rum19             1.216    0.135    8.992    0.000
##     Rum22             1.257    0.142    8.870    0.000
##   Reflecting =~                                       
##     Rum7              1.000                           
##     Rum11             0.906    0.089   10.138    0.000
##     Rum12             0.549    0.083    6.603    0.000
##     Rum20             1.073    0.090   11.862    0.000
##     Rum21             0.871    0.088    9.929    0.000
##   Brooding =~                                         
##     Rum5              1.000                           
##     Rum10             1.092    0.133    8.216    0.000
##     Rum13             0.708    0.104    6.823    0.000
##     Rum15             1.230    0.143    8.617    0.000
##     Rum16             1.338    0.145    9.213    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression ~~                                       
##     Reflecting        0.400    0.061    6.577    0.000
##     Brooding          0.373    0.060    6.187    0.000
##   Reflecting ~~                                       
##     Brooding          0.419    0.068    6.203    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.687    0.063   10.828    0.000
##    .Rum2              0.796    0.072   11.007    0.000
##    .Rum3              0.809    0.073   11.033    0.000
##    .Rum4              0.694    0.064   10.857    0.000
##    .Rum6              0.712    0.067   10.668    0.000
##    .Rum8              0.778    0.072   10.746    0.000
##    .Rum9              0.736    0.068   10.768    0.000
##    .Rum14             0.556    0.053   10.442    0.000
##    .Rum17             0.576    0.056   10.370    0.000
##    .Rum18             0.611    0.059   10.418    0.000
##    .Rum19             0.526    0.051   10.352    0.000
##    .Rum22             0.609    0.058   10.428    0.000
##    .Rum7              0.616    0.067    9.200    0.000
##    .Rum11             0.674    0.069    9.746    0.000
##    .Rum12             0.876    0.080   10.894    0.000
##    .Rum20             0.438    0.056    7.861    0.000
##    .Rum21             0.673    0.068    9.867    0.000
##    .Rum5              0.955    0.090   10.657    0.000
##    .Rum10             0.663    0.065   10.154    0.000
##    .Rum13             0.626    0.058   10.819    0.000
##    .Rum15             0.627    0.064    9.731    0.000
##    .Rum16             0.417    0.050    8.368    0.000
##     Depression        0.360    0.072    4.987    0.000
##     Reflecting        0.708    0.111    6.408    0.000
##     Brooding          0.455    0.096    4.715    0.000

summary(RRS_Fit2)

## lavaan (0.5-23.1097) converged normally after  31 iterations
## 
##   Number of observations                           257
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic             1007.349
##   Degrees of freedom                               209
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression =~                                       
##     Rum1              1.000                           
##     Rum2              0.903    0.129    6.985    0.000
##     Rum3              0.915    0.129    7.065    0.000
##     Rum4              1.071    0.134    8.023    0.000
##     Rum6              1.245    0.147    8.462    0.000
##     Rum8              1.142    0.145    7.849    0.000
##     Rum9              1.124    0.141    7.961    0.000
##     Rum14             1.219    0.140    8.686    0.000
##     Rum17             1.198    0.143    8.374    0.000
##     Rum18             1.189    0.144    8.235    0.000
##     Rum19             1.240    0.141    8.806    0.000
##     Rum22             1.215    0.145    8.380    0.000
##   Reflecting =~                                       
##     Rum7              1.000                           
##     Rum11             0.999    0.100    9.952    0.000
##     Rum12             0.614    0.090    6.842    0.000
##     Rum20             1.002    0.100    9.979    0.000
##     Rum21             0.971    0.098    9.875    0.000
##   Brooding =~                                         
##     Rum5              1.000                           
##     Rum10             1.132    0.150    7.536    0.000
##     Rum13             0.662    0.112    5.901    0.000
##     Rum15             1.295    0.164    7.914    0.000
##     Rum16             1.461    0.176    8.292    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Depression ~~                                       
##     Reflecting        0.000                           
##     Brooding          0.000                           
##   Reflecting ~~                                       
##     Brooding          0.000                           
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Rum1              0.692    0.065   10.637    0.000
##    .Rum2              0.777    0.072   10.829    0.000
##    .Rum3              0.766    0.071   10.808    0.000
##    .Rum4              0.630    0.060   10.454    0.000
##    .Rum6              0.653    0.064   10.184    0.000
##    .Rum8              0.790    0.075   10.537    0.000
##    .Rum9              0.719    0.069   10.485    0.000
##    .Rum14             0.540    0.054    9.999    0.000
##    .Rum17             0.640    0.062   10.247    0.000
##    .Rum18             0.686    0.066   10.337    0.000
##    .Rum19             0.513    0.052    9.881    0.000
##    .Rum22             0.655    0.064   10.243    0.000
##    .Rum7              0.656    0.075    8.790    0.000
##    .Rum11             0.588    0.069    8.491    0.000
##    .Rum12             0.838    0.079   10.604    0.000
##    .Rum20             0.582    0.069    8.446    0.000
##    .Rum21             0.580    0.067    8.613    0.000
##    .Rum5              0.993    0.096   10.386    0.000
##    .Rum10             0.671    0.071    9.454    0.000
##    .Rum13             0.671    0.063   10.729    0.000
##    .Rum15             0.616    0.072    8.530    0.000
##    .Rum16             0.342    0.064    5.368    0.000
##     Depression        0.354    0.073    4.867    0.000
##     Reflecting        0.668    0.112    5.972    0.000
##     Brooding          0.417    0.096    4.332    0.000

The first model, where the 3 factors are allowed to correlate, produces a chi-square of 600.311, with 206 degrees of freedom. The second model, where the 3 factors are forced to be orthogonal, produces a chi-square of 1007.349, with 209 degrees of freedom. I can compare these two models by looking at the difference in chi-square between them. That produces a chi-square with degrees of freedom equal to the difference between df for model 1 and df for model 2.

1007.349-600.311

## [1] 407.038

This gives me a change in chi-square (Δχ²) of 407.038, with 3 degrees of freedom. I don't even need to check a chi-square table to tell you that value is significant. (I looked it up and was informed my p-value is less than 0.00001.) So forcing the 3 factors to be orthogonal significantly worsens model fit. This provides further evidence that the 3 subscales are highly correlated with each other.

Information Criterion Measures

There are a few other fit indices that don't really fall within absolute or relative. These are the information criterion measures: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Sample-Size Adjusted BIC. These fit indices are only meaningful when comparing two different models using the same data. That is, the models should be non-nested. For instance, let's say that in addition to examining a single factor analysis of the Satisfaction with Life Scale, I also tested a two-factor model. These two models have a different structure, so they would be non-nested models. I can't look at difference in chi-square to figure out which model is better. Instead, I can compare my information criterion measures. I prefer to use AIC. In this case, the model with the lowest AIC is the superior model.

Fit measures are a hotly debated topic in structural equation modeling, with disagreement on which ones to use, which cutoffs to apply, and even whether we should be using them at all. (What can I say? We statisticians don't get out much.) Regardless of where you fall on the debate, if you're testing a structural equation model, chances are someone is going to ask to see fit measures, so it's best to provide them even if you hate them with a fiery passion. And though people will likely disagree with which ones you selected and which cutoffs you use, the best things you can do are 1) pick your fit measures before conducting your analysis and stick to them - do not cherry-pick fit measures that make your model look good, and 2) provide sources to back up which ones you used and which cutoffs you selected. My recommendations for sources are:

1. Hooper, D., Coughlan, J., & Mullen, M.R. (2008). Structural equation modelling: Guidelines for determining model fit. Electronic Journal of Business Research, 6, 53-60.
2. Hu, L., & Bentler, P.M. (1998). Fit indices in covariance structure modeling: Sensitivity to underparameterized model misspecification. Psychological Methods, 3, 424-453.

Saturday, April 14, 2018

M is for R Markdown Files

Today's A to Z of R will be a bit different from previous ones in that the focus is not on how to code something in R, but how to use a feature in R Studio to create documents, such as HTML and PDF. Either of these types of documents - and others - can be easily created thanks to R Markdown files. In this post, I'll show you how to set up your computer and R Studio so that you can create an R Markdown file and "knit" it into a PDF or HTML document. In the future, I'd love to dig into some of the other types of documents you can create with Markdown - from Shiny apps to ebooks. And as an added example, almost all of my posts this month have been created with an R Markdown file knitted into HTML.

Here's what you need to install to get started: if you haven't been using R Studio, you'll want to install it now. Don't worry - it's free for individual users. You'll also want to install pandoc, and a TeX system, such as MiKTeX or MacTeX. (Also free to install.) In R, you'll want to install the rmarkdown package; you'll also need the knitr package, which should have come with your install of R Studio. You can easily verify whether a package is installed in R Studio by clicking on the Packages tab in the lower right corner.

If you need to install a package, you can do so with install.packages("packagename"). Note that package names are case sensitive.

Once you're finished installing everything, open R Studio and create a new R Markdown document. On the far left, just below the menu, is a button that looks like a blank page with a plus sign on it and a black down arrow next to it. Click on the arrow and select R Markdown.

You'll see this dialogue box pop up.

We'll start by creating an HTML document - note that you can easily switch to another file type. At the top of the new R Markdown file, you can update title, author, and date. It's also currently set for an output of HTML document. But we could switch that to PDF if we'd like. Let's stick with HTML for now.

Code chunks appear as grey boxes that begin with ```{r. This where you type in any R code, at the point in the document you want the code and/or output to appear. Each R code chunk can be given a descriptive name. Leave the first R code chunk alone - this is setup for when you knit the final document together.

Under that, you can begin typing the information for your document. For plain text, just type. Text with # signs in front of it will appear as headers. The level of the header is equal to the number of # signs, so a level 1 header (the largest) is created with 1 #, level 2 with ##, and so on up to 6.

Bulleted lists are created with asterisks * for level 1 items and pluses + for level 2. Create an ordered list by simply typing in the numbers and adding line breaks at the end of each item.

Add inline equations by putting a $ on each side and equation blocks with $$. R Markdown will also replace text names of mathematical symbols, such as pi, with the actual symbol.

You can create tables with ASCII characters | and - to create the borders, and the tables will be formatted automatically, and much more prettily. Or you can create your tables with R code and add in commands to format it.

But really, let's get into the best part about R Markdown, which is that you can include your code and output automatically. Add new code chunks where you want them to go by placing your cursor at that point in the document, then clicking Insert just above the R Markdown document.

Any code you type in the code chunk will automatically appear along with any output it produces. If you only want output, but no code, add , echo = FALSE after the name of the R code chunk. And if you want code but no output, type it in as plain text but add ` on either side of the code.

Some R packages automatically give you warnings, such as when a function of one package is being masked from another, which happened when I loaded the QuantPsyc package. I can hide those from my final R Markdown file by adding , warning=FALSE after the name of the R code chunk.

These are the features I use most frequently in R Markdown, but you can get all of this and more on the R Markdown Cheatsheet.

I've created an example R Markdown file and a version rendered into HTML, which recreates the analyses I performed in the B is for Beta post. After I finished creating my document, I rendered it into an HTML file by clicking the Knit button.

Then published it by clicking on the publish button, which looks a little like a blue eyeball. Note that you'll need to create an account on rpubs.com and that anything you publish will be public.

But I could very easily render it into a PDF as well, by clicking on the black down arrow next to knit and selecting PDF document. If you'd like to see what that looks like, check it out here.

Wednesday, April 11, 2018

J is for jsonlite Package

J is for jsonlite package Today I'm going to introduce a new method of storing and exchanging data: JSON or JavaScript Object Notation. Up to now, we've been working with delimited text files and R data frames. But JSON (pronounced "Jason") is another way we can store data that can be read by different software packages. In my previous job, some of our tests arrived in Research as JSON files, which we then parsed using Python, SAS, or R. In fact, JSON can be parsed by any programming language. JSON allowed the transfer of large amounts of data, organized into name and value pairs. Each name and value pair is separated by commas, with each object (case) enclosed in curly brackets {}, and also separated by commas. The full dataset is then enclosed in square brackets []. JSON files are human readable; they are self-describing, because you get to pick whatever name to assign to a value, and hierarchical.

Here's what a JSON file might look like for my Blogging A to Z posts:

{"posts": [
  {"postname": "A is for (Cronbach's) Alpha", "date": "20180401", "shorturl": "a-is-for-cronbachs-alpha.html", "posted": true},
  {"postname": "B is for Betas (Standardized Regression Coefficients)", "date": "20180402", "shorturl": "b-is-for-betas-standardized-regression.html", "posted": true},
  {"postname": "C is for Cross Tabs Analysis", "date": "20180403", "shorturl": "c-is-for-cross-tabs-analysis.html", "posted": true},
  {"postname": "D is for Data Frame", "date": "20180404", "shorturl": "d-is-for-data-frame.html", "posted": true},
  {"postname": "E is for Effect Sizes", "date": "20180405", "shorturl": "e-is-for-effect-sizes.html", "posted": true},
  {"postname": "F is for (Confirmatory) Factor Analysis", "date": "20180406", "shorturl": "f-is-for-confirmatory-factor-analysis.html", "posted": true},
  {"postname": "G is for glm Function", "date": "20180407", "shorturl": "g-is-for-glm-function.html", "posted": true},
  {"postname": "H is for Help with R", "date": "20180409", "shorturl": "h-is-for-help-with-r.html", "posted": true},
  {"postname": "I is for (Classical) Item Analysis or I Must Be Flexible", "date": "20180410", "shorturl": "i-is-for-classical-item-analysis-or-i.html", "posted": true},
  {"postname": "J is for jsonlite Package", "date": "20180411", "shorturl": null, "posted": false}
]}

As you can see, the structure is readable and you can make sense out of what information it is communicating. JSON allows many kinds of data, including numeric, string, logical, and null values. This was one reason it was so useful for our test data; because some of our tests were adaptive, each examinee only received certain items, so they would have null values for most of the items in the item bank. We could read in their responses to the items they saw, along with the item ID, and fill in null values for other items they didn't see. We can then put all examinees in a single file, with those who saw an item having values in that column, and those who didn't with null values. Then we can analyze all examinees together and generate item statistics and/or person ability estimates.

JSON does not allow functions or dates, though, so I've create my date variable as a string, enclosed in quotes. To read that information as a date, I would need to do an extra step once I parse it into R, but that's only necessary if you plan on doing any kind of analysis or calculations with dates. For instance, you might have a date of birth variable and want to calculate exact age, using current date, for everyone in your sample. In that case, you'd want to make certain that whatever statistical package you're using knows the variable is a date so it can handle it properly in calculations.

White space is ignored in JSON, with brackets dictating hierarchy and structure, so I could space this file out more if I wanted to, to make it even more readable:

{"posts": [

  {"postname": "A is for (Cronbach's) Alpha",

  "date": "20180401",

  "shorturl": "a-is-for-cronbachs-alpha.html",

  "posted": true}

]}

I saved the object created above, using a simple text editor, as a .json file, which I can then read into R with the jsonlite package. Though the jsonlite package has a way of coercing an object into a data frame, I found it a bit finicky, so I just read the object in then converted it to a data frame.

install.packages("jsonlite")

library(jsonlite)
posts<-fromJSON("posts.json")
posts<-as.data.frame(posts)

Now I have a data frame called "posts", containing all of the information from my JSON file. Let's take a look at how the data were read in, in particular the data types.

str(posts)

## 'data.frame': 10 obs. of  4 variables:
##  $ posts.postname: chr  "A is for (Cronbach's) Alpha" "B is for Betas (Standardized Regression Coefficients)" "C is for Cross Tabs Analysis" "D is for Data Frame" ...
##  $ posts.date    : chr  "20180401" "20180402" "20180403" "20180404" ...
##  $ posts.shorturl: chr  "a-is-for-cronbachs-alpha.html" "b-is-for-betas-standardized-regression.html" "c-is-for-cross-tabs-analysis.html" "d-is-for-data-frame.html" ...
##  $ posts.posted  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

If I want to do any kind of date math, I need to convert my post.date column into a date variable. I just need to tell R to turn it into a date and provide the format of the string. (Nerdy note: date variables are actually represented as the number of seconds since January 1, 1970, known as the Unix epoch. This is then converted into a date, formatted in whatever way you specify.)

posts$posts.date <- as.Date(posts$posts.date, "%Y%m%d")
str(posts$posts.date)

##  Date[1:10], format: "2018-04-01" "2018-04-02" "2018-04-03" "2018-04-04" "2018-04-05" ...

Now I can use that variable to compute a new variable - days since posted.

posts$days.since.post <- Sys.Date() - posts$posts.date
str(posts)

## 'data.frame': 10 obs. of  5 variables:
##  $ posts.postname : chr  "A is for (Cronbach's) Alpha" "B is for Betas (Standardized Regression Coefficients)" "C is for Cross Tabs Analysis" "D is for Data Frame" ...
##  $ posts.date     : Date, format: "2018-04-01" "2018-04-02" ...
##  $ posts.shorturl : chr  "a-is-for-cronbachs-alpha.html" "b-is-for-betas-standardized-regression.html" "c-is-for-cross-tabs-analysis.html" "d-is-for-data-frame.html" ...
##  $ posts.posted   : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ days.since.post:Class 'difftime'  atomic [1:10] 8 7 6 5 4 3 2 0 -1 -2
##   .. ..- attr(*, "units")= chr "days"

But the jsonlite package will not only parse JSON; it can also create a JSON file, for easy sharing. Let's convert the Facebook data file into a JSON file. We'll also add an additional argument, pretty, which adds whitespace to make the file more readable.

Facebook<-read.delim(file="small_facebook_set.txt", header=TRUE)
Facebook_js<-toJSON(Facebook, dataframe=c("rows","columns","values"), pretty=TRUE)
save(Facebook_js, file="FB_JS.JSON")

If you're interested in learning more about JSON files, check out the tutorial on W3Schools.com.

Thursday, March 22, 2018

Science Fiction Meets Science Fact Meets Legal Standards

Any fan of science fiction is probably familiar with the Three Laws of Robotics developed by prolific science fiction author, Isaac Asimov:

A robot may not injure a human being or, through inaction, allow a human being to come to harm.
A robot must obey orders given it by human beings except where such orders would conflict with the First Law.
A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

It's an interesting thought experiment on how we would handle artificial intelligence that could potentially hurt people. But now, with increased capability and use of AI, it's no longer a thought experiment - it's something we need to consider seriously:

Here’s a curious question: Imagine it is the year 2023 and self-driving cars are finally navigating our city streets. For the first time one of them has hit and killed a pedestrian, with huge media coverage. A high-profile lawsuit is likely, but what laws should apply?

At the heart of this debate is whether an AI system could be held criminally liable for its actions.

[Gabriel] Hallevy [at Ono Academic College in Israel] explores three scenarios that could apply to AI systems.

The first, known as perpetrator via another, applies when an offense has been committed by a mentally deficient person or animal, who is therefore deemed to be innocent. But anybody who has instructed the mentally deficient person or animal can be held criminally liable. For example, a dog owner who instructed the animal to attack another individual.

The second scenario, known as natural probable consequence, occurs when the ordinary actions of an AI system might be used inappropriately to perform a criminal act. The key question here is whether the programmer of the machine knew that this outcome was a probable consequence of its use.

The third scenario is direct liability, and this requires both an action and an intent. An action is straightforward to prove if the AI system takes an action that results in a criminal act or fails to take an action when there is a duty to act.

Then there is the issue of defense. If an AI system can be criminally liable, what defense might it use? Could a program that is malfunctioning claim a defense similar to the human defense of insanity? Could an AI infected by an electronic virus claim defenses similar to coercion or intoxication?

Finally, there is the issue of punishment. Who or what would be punished for an offense for which an AI system was directly liable, and what form would this punishment take? For the moment, there are no answers to these questions.

But criminal liability may not apply, in which case the matter would have to be settled with civil law. Then a crucial question will be whether an AI system is a service or a product. If it is a product, then product design legislation would apply based on a warranty, for example. If it is a service, then the tort of negligence applies.

Here's the problem with those 3 laws: in order to follow them, the AI must recognize someone as human and be able to differentiate between human and not human. In the article, they discuss a case in which a robot killed a man in a factory, because he was in the way. As far as the AI was concerned, something was in the way and kept it from doing its job. It removed that barrier. It didn't know that barrier was human, because it wasn't programmed to do that. So it isn't as easy as putting a three-laws strait jacket on our AI.

Wednesday, March 21, 2018

Statistical Sins: The Myth of Widespread Division

Recently, many people, including myself, have commented on how divided things have become, especially for any topic that is even tangentially political. In fact, I briefly deactivated my Facebook account, and have been spending much less time on Facebook, because of the conflicts I was witnessing among friends and acquaintances. But a recent study of community interactions on Reddit suggests that only a small number of people are responsible for conflicts and attacks:

User-defined communities are an essential component of many web platforms, where users express their ideas, opinions, and share information. However, despite their positive benefits, online communities also have the potential to be breeding grounds for conflict and anti-social behavior.

Here we used 40 months of Reddit comments and posts (from January 2014 to April 2017) to examine cases of intercommunity conflict ('wars' or 'raids'), where members of one Reddit community, called "subreddit", collectively mobilize to participate in or attack another community.

We discovered these conflict events by searching for cases where one community posted a hyperlink to another community, focusing on cases where these hyperlinks were associated with negative sentiment (e.g., "come look at all the idiots in community X") and led to increased antisocial activity in the target community. We analyzed a total of 137,113 cross-links between 36,000 communities.

A small number of communities initiate most conflicts, with 1% of communities initiating 74% of all conflicts. The image above shows a 2-dimensional map of the various Reddit communities. The red nodes/communities in this map initiate a large amount of conflict, and we can see that these conflict intiating nodes are rare and clustered together in certain social regions. These communities attack other communities that are similar in topic but different in point of view.

Conflicts are initiated by active community members but are carried out by less active users. It is usually highly active users that post hyperlinks to target communities, but it is more peripheral users who actually follow these links and particpate in conflicts.

Conflicts are marked by the formation of "echo-chambers", where users in the discussion thread primarily interact with other members of their own community (i.e., "attackers" interact with "attackers" and "defenders" with "defenders").

So even though the conflict may appear to be a widespread problem, it really isn't, at least not on Reddit. Instead, it's only a handful of users (trolls) and communities. Here's the map they reference in their summary:

The researchers will be presenting their results at a conference next month. And they also make all of their code and data available.

Thursday, March 8, 2018

The Art of Conversation

There are many human capabilities we take for granted, until we try to create artificial intelligence intended to mimic these human capabilities. Even an unbelievably simple conversation requires attention to context and nuance, and an ability to improvise, that is almost inherently human. For more on this fascinating topic, check out this article from The Paris Review, in which Mariana Lin, writer and poet, discusses creative writing for AI:

If the highest goal in crafting dialogue for a fictional character is to capture the character’s truth, then the highest goal in crafting dialogue for AI is to capture not just the robot’s truth but also the truth of every human conversation.

Absurdity and non sequiturs fill our lives, and our speech. They’re multiplied when people from different backgrounds and perspectives converse. So perhaps we should reconsider the hard logic behind most machine intelligence for dialogue. There is something quintessentially human about nonsensical conversations.

Of course, it is very satisfying to have a statement understood and a task completed by AI (thanks, Siri/Alexa/cyber-bot, for saying good morning, turning on my lamp, and scheduling my appointment). But this is a known-needs-met satisfaction. After initial delight, it will take on the shallow comfort of a latte on repeat order every morning. These functional conversations don’t inspire us in the way unusual conversations might. The unexpected, illumed speech of poetry, literature, these otherworldly universes, bring us an unknown-needs-met satisfaction. And an unknown-needs-met satisfaction is the miracle of art at its best.

Not only does she question how we can use the essence of human conversation to reshape AI, she questions how AI could reshape our use of language:

The reality is most human communication these days occurs via technology, and with it comes a fiber-optic reduction, a binary flattening. A five-dimensional conversation and its undulating, ethereal pacing is reduced to something functional, driven, impatient. The American poet Richard Hugo said, in the midcentury, “Once language exists only to convey information, it is dying.”

I wonder if meandering, gentle, odd human-to-human conversations will fall by the wayside as transactional human-to-machine conversations advance. As we continue to interact with technological personalities, will these types of conversations rewire the way our minds hold conversation and eventually shape the way we speak with each other?