How big is YouTube?
I got interested in this question a few years ago, when I started writing about the “denominator problem”. A great deal of social media research focuses on finding unwanted behavior – mis/disinformation, hate speech – on platforms. This isn’t that hard to do: search for “white genocide” or “ivermectin” and count the results. Indeed, a lot of eye-catching research does just this – consider Avaaz’s August 2020 report about COVID misinformation, which reports 3.8 billion views of COVID misinfo in a year. That sounds like a very big number, but it’s a numerator without a denominator: Facebook generates dozens or hundreds of views a day for each of its 3 billion users, so once you supply the denominator, 3.8 billion views over a year is actually a very small number.
A few social media platforms have made it possible to calculate denominators. Reddit, for many years, permitted Pushshift to collect all Reddit posts, which meant we could calculate how small a fraction of Reddit is focused on meme stocks or crypto, versus conversations about mental health or board gaming. Our Redditmap.social platform – primarily built by Virginia Partridge and Jasmine Mangat – is based around the idea of looking at the platform as a whole and understanding how big or small each community is compared to that whole. Alas, Reddit cut off public access to Pushshift this summer, so Redditmap.social can only use data generated early this year.
Twitter was also a good platform for studying denominators, because it created a research API that took a statistical sample of all tweets and gave researchers access to every 10th or 100th one. If you found 2,500 tweets about ivermectin a day and saw 100m tweets through the decahose (which gave researchers 1/10th of tweet volume), you could calculate an accurate denominator: 100m × 10, or a billion tweets a day. (All these numbers are completely made up.) Twitter has cut off access to these excellent academic APIs and now charges massive amounts of money for far less access, which means denominator-based work is no longer possible for most researchers.
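To make that extrapolation concrete, here’s a minimal sketch in Python, using the same made-up numbers:

```python
# Scaling a 1/10th "decahose" sample up to the full platform.
# All numbers are the made-up examples from the paragraph above.
SAMPLE_FRACTION = 1 / 10          # the decahose delivered 1 in 10 tweets
ivermectin_per_day = 2_500        # matching tweets seen in the sample
sample_per_day = 100_000_000      # all tweets seen in the sample

denominator = sample_per_day / SAMPLE_FRACTION      # 1 billion tweets/day
numerator = ivermectin_per_day / SAMPLE_FRACTION    # 25,000 tweets/day
print(f"Ivermectin share of all tweets: {numerator / denominator:.4%}")  # 0.0025%
```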
Interesting as Reddit and Twitter are, they are much less widely used than YouTube, which is used by virtually all internet users. Pew reports that 93% of teens use YouTube – the closest services in terms of usage are TikTok with 63% and Snapchat with 60%. While YouTube has a good, well-documented API, there’s no good way to get a random, representative sample of YouTube. Instead, most research on YouTube either studies a collection of videos (all videos on the channels of a selected set of users) or videos discovered via recommendation (start with Never Gonna Give You Up, objectively the center of the internet, and collect recommended videos). You can do excellent research with either method, but you won’t get a sample of all YouTube videos and you won’t be able to calculate the size of YouTube.
I brought this problem to Jason Baumgartner, creator of Pushshift and prince of the dark arts of data collection. One of Jason’s skills is a deep knowledge of undocumented APIs, ways of collecting data outside of official means. Most platforms have one or more undocumented APIs, widely used by the platform’s own programmers to build internal tools. In the case of YouTube, that API is called “InnerTube”, and its existence is an open secret in programmer communities. Using InnerTube, Jason suggested we do something that’s both really smart and really stupid: guess random URLs and see if there are videos there.
Here’s how this works. YouTube URLs look like this: https://www.youtube.com/watch?v=vXPJVwwEmiM
That bit after “watch?v=” is an 11-character string. The first ten characters can each be a–z, A–Z, 0–9, underscore or hyphen – 64 possibilities; the last character is special and can take only one of 16 values. That works out to 64^10 × 16 = 2^64 possible YouTube addresses, an enormous number: 18.4 quintillion. There are lots of YouTube videos, but not that many. Let’s guess for a moment that there are 1 billion YouTube videos – if you picked URLs at random, you’d only get a valid address roughly once every 18.4 billion tries.
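The arithmetic is easy to check in a couple of lines of Python:

```python
# 10 characters from a 64-character alphabet (a-z, A-Z, 0-9, "_", "-"),
# followed by a final character that can take only 16 values.
ADDRESS_SPACE = 64 ** 10 * 16
assert ADDRESS_SPACE == 2 ** 64   # 18,446,744,073,709,551,616 addresses

videos = 1_000_000_000            # the guess above: 1 billion videos
print(f"One valid address per {ADDRESS_SPACE / videos:,.0f} tries")  # ~18.4 billion
```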
We refer to this method as “drunk dialing”, as it’s basically as sophisticated as taking swigs from a bottle of bourbon and mashing digits on a telephone, hoping to find a human being to speak to. Jason found a couple of cheats that make the method roughly 32,000 times as efficient, meaning our “phone call” connects far more often. Kevin Zheng wrote a whole bunch of scripts to do the dialing, and over the course of several months, we collected more than 10,000 truly random YouTube videos.
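This post doesn’t spell out Jason’s InnerTube cheats, so the sketch below shows only the easy half: generating well-formed random IDs. An ID is effectively a 64-bit number written in URL-safe base64, so encoding eight random bytes produces an 11-character ID that obeys exactly the constraints described above; the 32,000-per-batch figure is borrowed from the efficiency claim.

```python
import base64
import secrets

def random_video_id() -> str:
    """One random ID from the 2**64 possible YouTube addresses."""
    # Eight random bytes -> 12 base64 characters with one "=" of padding;
    # stripping it leaves 11 characters, the last of which can only take
    # 16 values -- matching the URL format described above.
    return base64.urlsafe_b64encode(secrets.token_bytes(8)).decode().rstrip("=")

# One "drunk dial": a batch of candidate IDs checked together.
batch = [random_video_id() for _ in range(32_000)]
print(batch[:3])
```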
There’s lots you can do once you’ve got those videos. Ryan McGrady is lead author on our paper in the Journal of Quantitative Description, and he led the process of watching a thousand of these videos and hand-coding them, a massive and fascinating task. Kevin wired together his retrieval scripts with a variety of language detection systems, and we now have a defensible – if far from perfect – estimate of what languages are represented on YouTube. We’re starting some experiments to understand how the videos YouTube recommends differ from the “average” YouTube video – YouTube likes recommending videos with at least ten thousand views, while the median YouTube video has 39 views.
I’ll write at some length in the future about what we can learn from a true random sample of YouTube videos. I’ve been doing a lot of thinking about the idea of “the quotidian web”, learning from the bottom half of the long tail of user-generated media so we can understand what most creators are doing with these tools, not just from the most successful influencers. But I’m going to limit myself to the question that started this blog post: how big is YouTube?
Consider drunk dialing again. Let’s assume you only dial numbers in the 413 area code: 413-000-0000 through 413-999-9999. That’s 10,000,000 possible numbers. If one in 100 phone calls connects, you can estimate that 100,000 people have numbers in the 413 area code. In our case, each drunk dial tried roughly 32,000 numbers at the same time, and we got a “hit” every 50,000 dials or so. Our current estimate for the size of YouTube is 13.325 billion videos – we now update this number every few weeks at tubestats.org.
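Plugging those rough numbers into the phone-book logic reproduces the estimate to within rounding (both inputs are approximate, so the result lands a little under 13.325 billion):

```python
ADDRESS_SPACE = 64 ** 10 * 16     # 2**64 possible video IDs
IDS_PER_DIAL = 32_000             # IDs checked per "drunk dial"
DIALS_PER_HIT = 50_000            # roughly one valid video per 50k dials

density = 1 / (IDS_PER_DIAL * DIALS_PER_HIT)   # valid IDs per address
print(f"Estimated videos: {ADDRESS_SPACE * density:,.0f}")  # ~11.5 billion
```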
Once you’re collecting these random videos, other statistics are easy to calculate. We can look at how old our random videos are and calculate how fast YouTube is growing: we estimate that over 4 billion videos were posted to YouTube in 2023 alone. We can calculate the mean and median views per video and show just how long the “long tail” is: videos with 10,000 or more views make up roughly 4% of our data set, though they represent the lion’s share of views on the platform.
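Each of these platform-wide figures is a simple extrapolation from the sample. A sketch, using hypothetical sample counts chosen to be consistent with the estimates above:

```python
TOTAL_VIDEOS = 13_325_000_000     # estimated size of YouTube

# Hypothetical sample counts, for illustration only.
sample_size = 10_000              # random videos collected
posted_2023 = 3_100               # of those, posted in 2023
views_10k_plus = 400              # of those, with 10,000+ views

print(f"Posted in 2023: ~{TOTAL_VIDEOS * posted_2023 / sample_size:,.0f}")  # ~4.1B
print(f"Share with 10k+ views: {views_10k_plus / sample_size:.0%}")          # 4%
```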
Perhaps the most important thing we did with our set of random videos is to demonstrate a vastly better way of studying YouTube than drunk dialing. We know our method is random because it samples uniformly from the entire possible address space. By comparing our results to other ways of generating lists of YouTube videos, we can declare them “plausibly random” if they generate similar results. Fortunately, one method does – it was discovered by Jia Zhou et al. in 2011, and it’s far more efficient than our naïve method. (You generate a five character string where one character is a dash – YouTube will autocomplete those URLs and spit out a matching video if one exists.) Kevin now polls YouTube using the “dash method” and uses the results to maintain our dashboard at Tubestats.
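Here’s a minimal sketch of generating those five-character probes; how Kevin actually submits them to YouTube and parses the autocompleted matches isn’t described here, so only the generation step is shown, and the sample output is illustrative.

```python
import random
import string

ID_CHARS = string.ascii_letters + string.digits + "_"

def dash_probe() -> str:
    """Five ID characters with a dash at one random position (Zhou et al., 2011)."""
    chars = [random.choice(ID_CHARS) for _ in range(4)]
    chars.insert(random.randrange(5), "-")
    return "".join(chars)

print([dash_probe() for _ in range(3)])  # e.g. ['x-Tq2', '9fK_-', '-pW3z']
```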
We have lots more research coming out from this data set, both about what we’re discovering and about some complex ethical questions about how to handle this data. (Most of the videos we’re discovering were only seen by a few dozen people. If we publish those URLs, we run the risk of exposing to public scrutiny videos that are “public” but whose authors could reasonably expect obscurity. Thus our paper does not include the list of videos discovered.) Ryan has a great introduction to the main takeaways from our hand-coding. He and I are both working on longer writing about the weird world of random videos – what can we learn from spending time deep in the long tail?
Perhaps most importantly, we plan to maintain Tubestats for as long as we can. It’s possible that YouTube will object to the existence of this resource or the methods we used to create it. Counterpoint: I believe that high-level data like this should be published regularly for all large user-generated media platforms. These platforms are some of the most important parts of our digital public sphere, and we need far more information about what’s on them, who creates this content and who it reaches.
Many thanks to the Journal of Quantitative Description for publishing such a large and unwieldy paper – it’s 85 pages! Thanks and congratulations to all authors: Ryan McGrady, Kevin Zheng, Rebecca Curran, Jason Baumgartner and myself. And thank you to everyone who’s funded our work: the Knight Foundation has been supporting a wide range of our work on studying extreme speech on social media, and other work in our lab is supported by the Ford Foundation and the MacArthur Foundation.
Finally – I’ve got COVID, so if this post is less coherent than normal, that’s to be expected. Feel free to use the comments to tell me what didn’t make sense and I will try to clear it up when my brain is less foggy.
NOT big enough; that it can’t be wiped away from the surface earth; with a relatively cyber attack from the Alliance. In fact; Google/YouTube will just be one; of 63 communist controlled platforms; that will be completely erased from existence once the order is given.
too cool. in the age of the algorithm, randomness is the key to authenticity
Fascinating!
Fyi on a typo: “videos with 10,000 videos” should be views
You have COVID again? I find it odd that the people always pearl clutching about COVID misinformation always seem to be getting sick.
“videos with 10,000 or more videos” should be “videos with 10,000 or more views”
Turns out comments about YouTube are YouTube comments too.
I can confirm the theory proposed by BIGJYMN’s comment – as a communist myself I do indeed own YouTube.
“videos with 10,000 or more videos” presumably should be ‘or more views’? Doing similar research at the moment but thankfully my target website uses sequential IDs.
There is no cold reason to not share low-views public videos. It literally can’t be more obvious that they are public. No law is being broken.
Very interesting work! Congrats to you all! I will read all the papers and references you posted. I understand that with this dataset you would be able to select a sample of the best-rated videos in a given category. Is that correct? Thanks in advance!
I would love to visit a site that drunk dials randomly for the user and watch.
Why not use the world’s best indexer? You can make Google surface tail videos with searches like:
https://www.google.com/search?q=inurl:www.youtube.com/watch+“10+views”
The challenge then shifts to evading Google’s anti-scraping measures (assuming you’re searching at scale) and avoiding English-only results.
To find yearly statistics you can filter by date range and look at the “About N results” count at the top. I know, naive, but… why not?
https://www.google.com/search?q=inurl:www.youtube.com/watch&tbs=cdr:1,cd_min:1/1/2018,cd_max:31/12/2018
Results I’ve extracted this way:
2022 About 997,000,000 results
2021 About 867,000,000 results
2020 About 601,000,000 results
2019 About 608,000,000 results
2018 About 437,000,000 results
2017 About 284,000,000 results
They differ quite a lot from those at tubestats.org, unsure who’s right.
Some tangentially related prior “art” for your perusal/amusement: https://justinsomnia.org/2016/04/eighteen-quintillion-youtube-videos/
Yes, that’s an excellent method to find videos Google indexes. But it’s a small subset of all YouTube videos – Google doesn’t index everything. We can make some guesses: I suspect the set of videos you’ve found is significantly more popular than those we’re able to find. It would be interesting to compare – we may try generating a set of videos using the method you suggest and comparing it to our set. But our set is much larger, which suggests that yours is a subset of those Google has decided – for whatever reasons – to index, whereas ours is (an estimate of) the entire public set.
It’s less interesting than you think. In an earlier comment, Juan mentions that you can just search for “https://www.google.com/search?q=inurl:www.youtube.com/watch+” You’ll find that some are wonderful, and most are terribly, terribly boring.
Yuri, we are unlikely to get the sort of sample you are looking for. Our method gets only a few videos and uses math to extrapolate from there. I think to find highly rated videos, you’d do better to follow YouTube recommendations for promising videos.
Thanks. As mentioned previously: COVID, brain fog. :-)
Take a look at the story of Justine Sacco – she made a “public” tweet that was intended to be seen by a few friends who followed her on Twitter. It went viral and destroyed her life. Since then, researchers have been very careful about revealing low-popularity public content, whether or not doing so would violate any laws. So… despite your opinion, we’re going to be very careful about how or whether we share those videos.
Hmm. Twice in three years doesn’t seem all that often. And I don’t think you’ve actually seen much pearl clutching from me – my observation was actually about people doing weak research on COVID misinfo.
This is fascinating stuff, congratulations on the publication! I’m curious about your opinion on how YouTube has acted as a research resource these last few years. We’re not in a terribly transparent place with platforms right now, with the aforementioned API winter visiting some platforms, Facebook’s fight against the NYU Ad Observatory still echoing in some ears, and the whittling down of regular platform transparency reports (which has affected my own research). So how has YouTube been? From a research-access perspective, have APIs been kept up or, dare I say, /improved/ for greater transparency? Is the company taking feedback? Or are they still products of the late-2010s era, when social media companies felt pressure to provide such tools?
My nerbs are slightly in the wrong camp here because I saw the ban on numbers. But you have to admit, there’s a lot of hate in this world.
I want to eat this one. And forget we’re such a species. Haha. :)
But maybe such a number is a representation of how big our sins are. We are made of meat. I’m just saying maybe that’s why YouTube is so big. Our meat, out on display. Like puppets held up by the strings of the web.
It’s night and in my mind I’m sleeping. That’s why I’m so wise.
Roddy Piper, confirmation bias, it’s real!
How long can YouTube, X, FB, IG keep adding data? How many data centers will be required taking up how much space and energy? There has to be a limit. Seems like we’ll reach it in the not too distant future.
This is called random-digit dialing (RDD) in the polling industry.
Prior to cellular phones, when everyone had a landline, area codes were matched with demographic data from the census, and the desired number of phone numbers was generated for the remaining digits. Trained staff then convinced the phone owners to cooperate, and as long as 95 percent did, a mere 5,000 people could represent the entire population of the United States. Unlike today, polls were quite accurate.
Several companies independently index YouTube on behalf of content owners to find copyright violations. I believe Pex.com is one of them.
You might want to reach out to them (as well as YouTube) and see if they will officially or unofficially confirm your estimates, as these companies likely figured out how to crawl the entire dataset long ago.
Ethan, in your estimated history of YouTube’s growth, what is the growth rate? And is it accelerating or decelerating?
When you said 2023 alone had over 4 billion videos uploaded versus the 13-point-something billion videos overall, that took me by surprise initially. Of course, YT has grown rapidly since the pandemic, but the graph seems curvy enough. That was an interesting pointer.
Also, I agree with the point about platforms publishing this kind of data. Netflix recently surprised us with a summary, but it’s not as granular as we would want it to be. It would be interesting to see if and how YouTube responds to your work. In any case, this is amazing stuff. Take care.
Love the methodology and this ability to see into this very important dataset. I’d be interested to see some of the dimensions collected on Tubestats plotted over time, such as video length. It is clear that YT is prioritizing short-form video, and that increasingly its algorithmic systems are reinforcing that business/product decision. But is that actually changing what accounts publish to the platform? Does short-form video account for a larger % of views on the platform?
Author Ethan Zuckerman discusses the question “How big is YouTube?” by examining the actual number of YouTube videos. He begins by pointing out a problem often encountered when studying social media content: researchers tend to focus on bad content, such as misinformation and hate speech, while ignoring the vast amount of ordinary content on the platform. It’s like focusing on sharks while ignoring other sea creatures when studying the ocean.
To solve this problem, Ethan Zuckerman and his team developed a new way to measure the size of YouTube by randomly sampling YouTube videos. They used a method akin to “drunk dialing” to find random videos by guessing video URLs. By analyzing these randomly selected videos, they estimated that there are about 13.3 billion videos on YouTube, with over 4 billion added in 2023 alone. They also found that the median YouTube video is viewed only 39 times, which suggests that most of the content on YouTube is not widely watched.
Ethan Zuckerman believes that it is important to understand the actual size of YouTube and the distribution of content, which can help us better understand the platform’s impact on society. He calls on platform owners to provide more data on their platform’s content and encourages researchers to use methods such as random sampling to study social media content in order to obtain more accurate and comprehensive results.
Overall, this article explores the methodology and importance of studying social media content through an interesting case study. It reminds us not to overlook the existence and value of ordinary content while focusing on problematic content on the platform.
Wow, ‘drunk dialing’ YouTube URLs for data collection? That’s like trying to call every number in the phone book to find one person who’ll listen to your drunken ramblings! Jokes aside, it’s fascinating how this method gives us a glimpse into the enormity of YouTube. Makes you wonder what else is hiding in those billions of videos :-))
DR. BUGGSTEIN TEINNEN, guessing URLs will sweep up many videos that have been uploaded with “only people with the link” permissions. The intent is clearly that the video is private to a small number of people.