Why do companies with unbounded resources still have terrible moderation?

Disclaimer: I work at Google, but none of this reflects the opinion of my employer, my experiences at work, or describes any product I’ve worked on or even have indirect knowledge about. I’m speaking purely in hypotheticals based on my experiences with human labelled data and machine learning, and am not describing any specific event that has happened to me. Do not interpret this as insider information about YouTube, I have never worked on YouTube and have no knowledge of its policies or systems. Also, if you don’t want to see some really revolting racism, don’t click the links.

Social media companies have moderation SNAFUs on a near daily basis. People get suspended for joking with their friends. People get suspended for clap backs at Nazis. And the Nazis keep on Nazi-ing and no one seems to care. Inevitably, the public and the tech press reacts with incredulity, “This company has so much money. They’ve hired thousands of PhDs. How can they still be so bad at this?”

I’m going to tell you how.

Pretend you work at Twitter.

Twitter disallows threats of violence. You want to enforce this as uniformly as you can, so you want to build an automated system that detects threats of violence and suspends users.

First step, you hire a team of PhDs that can build state of the art transformer-based language models. This is what the tech press thinks is the hard part, and that after you’ve done that, it should be a cakewalk right?

This is the easiest part of the process, because all you have to do is spend money to do it. Language models are a commodity now. You can download the best in the world from GitHub. The expertise to adapt them to a particular task is still rare and expensive, but rapidly becoming more common. But models by themselves are useless. What you need is training data. How do you get training data for what is, and what isn’t, a threat of violence?

Typically, you’d write up some guidelines for what is and what isn’t a threat, iterate on those with the legal, policy, and PR departments, and then contract with some external company to have human beings read those guidelines and rate lots of examples that you send them.

Writing guidelines for what is and isn’t a threat is really hard. The people who write these guidelines often have backgrounds in Law or Philosophy and lots of experience with the subtleties, but governments have hundreds of pages of case law covering this specific thing, and experts still disagree all the time. When governments need to figure out whether something is or isn’t a threat, they take months of adjudication and 10s of thousands of dollars. You have to be able to do it at the rate of QPS.

Unsurprisingly, when you get your ratings back from the contractor, they’ve labelled tons of sarcasm and jokes as threats, and tons of outright threats that are slightly hidden behind euphemisms, memes, or slang as benign. So in the next round of guideline revisions you realize you have to teach people about memes.

Imagine that the archetypal rater you’re trying to teach is someone in rural India. I’m using rural India as an example not because they’re actually from rural India, but if I describe them in a way that seems less foreign to you, you will instinctively overestimate how much you have in common with them. They are a random English speaking person interested in a flexible part time job. They are from somewhere that is not where you’re from and probably have very little cultural context in common with you. If you are reading this, they are probably “less online” than you. They might be a stay at home mom in Minnesota who needs some extra cash while the kids are at school. They might be a security guard working the night shift with a lot of time to stare at a screen. They might be someone in India. These people are different from you, and their understanding of the world and the meme wars is different from yours. Your sub-culture is deeply aware of how 8chan white supremacists talk about the people they’re going to swat because you see that shit every day. But theirs isn’t. Imagine how well you’d do if you were asked to rate hate speech directed at the Rohingya. There are slurs out there that you couldn’t even imagine, and they change from moment to moment.

So now, as you rewrite the guidelines, you’re providing a lot of links to urban dictionary and knowyourmeme. You’re explaining what 4chan is. You’re explaining what 8chan is. Your guidelines are becoming less and less general and more and more extremely online. You realize that they’ll all be obsolete in 6 months. 4chan finds new ways to be racist way faster than your legal department can approve new documents, likely faster than you can write them or the raters can read them. Rival communities are now weaponizing your “Report Abuse” button, so you can’t even get any indirect signal from what your users think in aggregate without becoming somebody’s tool.

Your users expect the system to behave perfectly, or at least non-embarrassingly. Your users also expect everything to be interpreted in a cultural context that they’re deeply steeped in, but that even a random person from the US can’t figure out without a ton of research and training, let alone someone from the Philippines.

You’re realizing that your ideal rater is a multilingual extremely-online expert in first amendment law who takes a morbid interest in the dregs of the internet, but there’s only one Popehat and he’s already gainfully employed.


Let’s say you haven’t had enough suffering from trying to do something as seemingly unambiguous as getting rid of outright threats of violence. Now you want to go even further to do what lots of Twitter users have been begging for and ban the Nazis. So you hold your nose, roll up your sleeves, and write some guidelines about Nazis.

You get your first ratings back and you realize you need to describe the difference between 1488 and 1619, that one is a white supremacist slogan and the other is an important date in American history. You update the guidelines with policy, PR, and legal, and then you try again.

On the next round you realize you need to teach them the difference between “Le 56%” and “the 1%.” That one of them is how European white nationalists make fun of American white nationalists, and the other is how the left talks about rich people in America. So you update the guidelines with policy, PR, and legal, and try again.

On the next round you realize that they haven’t quite figured out what antisemitism looks like, and you need to explain the difference between “Jews have infiltrated the government” and “Russians have infiltrated the Trump campaign,” that one is a justification for mass murder, and the other is something that kinda, maybe, happened. So you update the guidelines. Legal and policy are all increasingly concerned that if any of the contractors leak the guidelines that you’re going to be hauled before congress. PR is worried about angry presidential tweets. You try again.

But there’s still a lot of stuff they’re getting wrong, so you try to explain the difference between “flashing an OK sign upside down at a gun range” vs “flashing an OK sign right-side up on a ski slope,” that one might be a white supremacist showing a “white power” sign, maybe, and the other is a guy enjoying his hobby. Most of your raters have never been to a gun range or to a ski slope, and may have never seen the OK sign. Now PR is sure that if any of this leaks not only will you get hauled before congress, but the general public will think you’re a bunch of loons.

This process can go on literally forever, and the ratings may never materially improve. The world of racism is full of euphemism specifically designed to fool onlookers, to allow people to be racist in public without detection, to be joking and not joking at the same time for plausible deniability. Racists have already inoculated their movement against exactly the type of social/technical system you are trying to design.

How do you know it sucks?

You’ve built the best system you can. You may know it sucks, but you need a solution today, there isn’t any obvious way to improve it, so you deploy it and hope for the best. Despite investing 10s of millions of dollars and years of many people’s lives, you have a system that still causes public relations nightmares on a continual basis.

It’s difficult at an aggregate level to even figure out what the system is getting wrong. Sure, you can find individual examples, they’re in your inbox every day, usually as a flood of vitriol and accusations, or worse, in the press. You can try to cut around them, to add them as examples to your guidelines the next time you do a round of rating, but what you really need are aggregate statistics. Maybe these are all just outliers and things are fine on average? But how do you get aggregate statistics? You need humans to rate things again. And then you’re back in the same morass of teaching rural Indians about 4chan.

You’d love to explain to the public how difficult this all is, how hopeless it feels, and what an impossible standard they’re holding you to. But PR knows that making any comment less tepid than “we’re working to continually improve,” will kick off another press cycle, and so the train wreck rolls on.

What can be done?

If I had a real solution, I’d be taking it to the bank instead of writing this, but I’ll take a stab at back-seat product design anyways.

The bottom line is that people cannot effectively moderate a culture and a context that they are not a part of. Maybe that means that being “extremely online” should be part of the hiring criteria. Maybe that means hiring people specifically from each subculture in your network. Or maybe it means turning the problem around, and giving the power to the users.

Give communities the tools to moderate themselves at scale. Give users (or advertisers) the tools to control their own experience at scale. It may be impossible to teach someone with specificity what is and isn’t doxxing or what the early signs of getting targeted by the alt-right are, but once you’ve experienced them, or watched it happen to someone, you’ll never forget. This means that users may need to see more data about other users than you are comfortable exporting. If it’s a signal you think your spam or abuse team might need, show it to your users. Let them use it themselves, or even script it themselves, so they can share protection techniques the same way they share everything else. Know that perfect content moderation is unattainable, and your users cannot wait for you to improve. They need the tools now, and they will be better protecting themselves than anything you develop as an outsider.

Because if you’re still imagining that if only you had better ratings you could fix this, imagine what would it look like to have a system that could do content moderation perfectly, that understood the nuances of every culture on earth, that could see the intention behind everything that everyone could write and would know whether it was meant in jest or in hate, that could really see into our hearts, that would know if we deserve forgiveness.

It would look like a lot like a God, and I think that’s a bit much to expect out of any group of people, let alone a big pile of linear algebra.

1 Comment

  1. Ryan, you hit the nail on the head with the problem of machine-based moderation. It’s nearly impossible, especially for video containing hate speech. I worked at YouTube 2007-2009, starting shortly after Google acquired YT, while YT was allowed to operate independently with ~200 employees at the time. (I don’t believe any of the information I provide here is confidential, and I have no actual knowledge of current YT moderation practices, although I can guess.)

    At YT in the early days, every new employee had to spend one week moderating flagged videos. When users flag videos, they are asked what type of violation: porn, animal cruelty, hate speech, copyright violation, etc. For violations that could be visually confirmed, the moderator would be shown a screen of 20 or 30 keyframe video stills spanning the length of the video. It’s pretty easy to tell at a glance if the video is porn or animal cruelty. If so, the video would be removed, the user who uploaded it warned, and a digital “fingerprint” of the video would be made to prevent that video from being uploaded again.

    Fortunately, I never had to go through that moderation “hazing” — YT had a dedicated review department of a few people by the time I started. (I was told that the worst videos that stuck in people’s minds were the animal cruelty videos. “You can’t unsee what you saw.”) During my time at YT, that review department grew quickly, and expanded from staff at YT HQ in in San Bruno, CA, to staff in Dublin, Ireland, and an office in India for 24-hour coverage. User-flagged videos were reviewed by a human within 1-2 minutes in most cases, an impressive statistic.

    One interesting data point: videos containing pornographic images were viewed on average 100 times before someone someone flagged the video. Videos containing suspected child pornography were viewed on average 10 times before someone flagged the video. (We attributed this to 1) the fact that you have to be logged in to YT to flag videos and many viewers at the time did not have accounts, and 2) many people simply clicked away not knowing about the “report this video” feature.)

    I imagine the moderation of visually-identifiable violations, such as pornographic images, has been mechanized, but recognizing animal cruelty is probably still tough to teach a computer.

    Hate speech was much, much more difficult then and now. You have to 1) watch the video for visual cues, 2) listen to the words, 3) know the language (what if it’s in Polish and you speak only English?), and 4) as you pointed out, you must know the latest lingo. Reviewing these flagged videos was much more labor-intensive. challenging, and often required a subjective decision.

    At the time, YouTube was officially a US-only service, even though anyone around the world could go to youtube.com and watch videos. The “US-only” policy was for liability protection — YT’s lawyers understood US law, DMCA requirements for take-downs, etc. This meant that YT was not legally bound by the laws of foreign countries. For example, holocaust denial is universally bad, and a violation of YT policies. Those videos are removed. However, in Germany and some other countries, holocaust denial is a crime, and a website could be fined, or worse, if they failed to remove videos about holocaust denial. The “We’re a US-only website” protected YT from legal ramifications in other countries where laws might be different from the US.

    For hate speech, YT moderators were required to watch the video in real time, review attorney-approved guidelines, and make their judgement call, sometimes escalating to management if they weren’t sure. But obviously there were many foreign language videos, even on the “US-only” site. Initially YT moderators had a list of YT employees who spoke foreign languages and would enlist their help if necessary, but that wasn’t always possible, and often took days to review the flagged video, rather than the minutes (from user flag to review) for pornographic videos.

    Later, around 2008 or 2009, YouTube decided to officially launch foreign country versions of the website (around 20 countries at the time, e.g., http://www.youtube.de, with the home page and menus defaulting to German, rather than English). This involved 1) localization of the website (language, featured videos, etc.), 2) attorneys reviewing the relevant laws of each country to ensure compliance with local laws, and 3) ensuring the moderation team had native speakers of the language(s) used in those new, officially supported countries. But even then, a hate-speech video still might take a couple of days to be reviewed, for example if a German-language video was flagged on a Friday evening and the “official” German-speaking moderation staff (in Dublin, or where ever) would not be back in the office until Monday morning.

    Hate speech moderation is no doubt still a complex, vexing problem for YT. My conjecture is that they now use speech-to-text conversion as the main input to a machine-based moderation tool. The moderation tool probably produces a score for the likelihood of the video being hate speech. Above a certain threshold, that score causes the video to be removed automatically. Below the threshold, it is probably forwarded to a human moderator for confirmation. No doubt they err on the side of “when in doubt, take it down.” As you pointed out, training the algorithm is still very difficult and a constantly moving target.

    As far as I know, YT still relies on users to flag videos for review, in accordance with DMCA guidelines (or the applicable laws in “supported” countries). The one exception is copyright violations, which are handled automatically at the time of upload. (Music companies, movie studios, etc., allow YT to create digital “fingerprints” of audio and video for copyright-protected music and videos. Those fingerprints are used to identify uploads containing that music or video. The owner of the copyright can then specify whether to allow the video to be uploaded and monetized (show ads and share revenue), or to block the video from being uploaded (one example: any video containing music by Prince is automatically rejected).

    For copyright violations, there is an appeal process for users who upload videos that fall within the definition of “fair use”, such as criticism, news reporting, satire, etc. One classic example is the warning that the NFL shows before each football game on TV, saying “the entire contents of the broadcast is protected by copyright.” One legal professor uploaded the NFL warning to YT years ago, it was taken down, and they went back and forth before it was returned to YT, and is still there today. (Blog post: https://preview.tinyurl.com/nflcopyright)

    Anyway, thought you’d be interested in some real world experience with YouTube, even if it is 12 years out of date! Lol


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s