Estimating G+ Communities by size
I've been kicking at the problem of trying to estimate total G+ activity for a few years, and with the Plexodus, the question of how many active Communities there are is particularly pressing. Setting an arbitrary cutoff, consider communities of > 1,000 members.
Based on some preliminary results -- 6,000+ completed of a larger randomly-selected sample of 12,000 communities -- I'm going to suggest there are about 55k - 65k communities of > 1,000 members on Google+, from a total Communities population of 7.974 million.
I'm using my old familiar method of working from sitemaps for Google+, with a running commentary as results come in posted to Diaspora here:
https://joindiaspora.com/posts/dbee1250d0680136d1dc0218b70db60d
Briefly, you can grab the top-level sitemaps via Google+'s robots.txt file, at https://plus.google.com/robots.txt
One line of that file reads:
Sitemap: http://www.gstatic.com/communities/sitemap/communities-sitemap.xml
That's not actually the Communities list itself, but a listing of another 100 files containing all 7,974,281 individual communities.
Rather than look at each of those, I've randomly selected a set of 12,000 (after a first run of 300) to look at. Generally, random sampling allows you to work with very small sample sizes, but given the highly-skewed Poisson / power-curve distribution of Community membership populations, we need more numbers.
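For anyone wanting to repeat the sampling step, here is a minimal sketch in Python. It assumes the sitemap files follow the standard sitemaps.org XML layout, and it is not the script actually used for this post. The function names and the community IDs in the example are made up for illustration.

```python
import random
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text):
    """Return every <loc> URL in a sitemap index or URL-set document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", NS)]


def draw_sample(urls, k, seed=None):
    """Pick k distinct URLs uniformly at random, without replacement."""
    return random.Random(seed).sample(urls, k)


# Hypothetical miniature URL set; the community IDs are invented.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://plus.google.com/communities/100000000000000000001</loc></url>
  <url><loc>https://plus.google.com/communities/100000000000000000002</loc></url>
  <url><loc>https://plus.google.com/communities/100000000000000000003</loc></url>
</urlset>"""

urls = sitemap_locs(EXAMPLE)
print(draw_sample(urls, 2, seed=42))
```

In a real run the same `sitemap_locs` call would first be applied to the index file to list the child sitemaps, then to each child file to collect the community URLs before sampling.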
The pull is running now, at about a query a second, and I'd crossed the 6,000 community mark a few minutes back.
There are two sets of statistics of interest: univariate moments, which give a general sense of the dataset, and a ranked-item report, listing the largest of the communities.
The first gives an overall sense of the data; the second reveals what the top end of the data shows, much of which is far outside the reach of the first report, and where the vast majority of community memberships reside.
The univariate data: count, sum, min, max, mean, etc:
n: 5655, sum: 597008, min: 1, max: 217297, mean: 105.571706, median: 2, sd: 2979.245712
%-ile: 5: 1, 10: 1, 15: 1, 20: 1, 25: 1, 30: 1, 35: 1, 40: 1, 45: 1, 55: 2, 60: 3, 65: 4, 70: 5, 75: 8, 80: 13, 85: 22, 90: 45, 95: 129
What this tells us is that the typical Community size is very nearly 1. The median is 2. And it's not until we get to the 95%ile that populations are over 100. The maximum (of the sample so far) is nearly a quarter million, at 217,297 members (a Spanish-speaking religious community, "La Palabra de Dios tiene Poder Comunidad", "The Word of God has Power Community" -- https://plus.google.com/communities/118362296060634412141), which is over 8x larger than the 2nd largest in the sample.
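A summary like the one above can be produced along these lines. This is a rough Python sketch rather than the reporting script itself. The nearest-rank percentile rule and the sample standard deviation are assumptions, since the post doesn't say which definitions were used, and the input list is toy data.

```python
import statistics


def summarize(sizes, percentiles=(5, 25, 75, 90, 95)):
    """Univariate summary of a list of community member counts."""
    data = sorted(sizes)
    n = len(data)

    def nearest_rank(p):
        # Nearest-rank percentile: the value at position ceil(p * n / 100).
        rank = max(1, -(-p * n // 100))  # integer ceiling division
        return data[rank - 1]

    return {
        "n": n,
        "sum": sum(data),
        "min": data[0],
        "max": data[-1],
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "sd": statistics.stdev(data),  # sample SD (assumption)
        "percentiles": {p: nearest_rank(p) for p in percentiles},
    }


# Toy data with the same flavour: mostly tiny, one large outlier.
report = summarize([1, 1, 1, 2, 3, 8, 100])
print(report)
```

Even on seven values the pattern shows up: the mean is dragged far above the median by a single large community.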
The ranked listing shows (at this moment) 56 communities in the sample of > 1,000 members, each one representing about 1,086 other communities, or giving us roughly 60,000 communities of > 1,000 members.
Similarly, there are likely about 9,800 communities with 10k+ members, and about 2,000 with 26,000+ members.
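The scale-up arithmetic is just "hits in the sample times communities represented by each sampled one". A small Python sketch using the figures quoted above (56 hits, a weight of about 1,086); the function name is mine.

```python
def estimate_total(sample_hits, communities_per_sampled):
    """Scale a count observed in the sample up to the full population."""
    return sample_hits * communities_per_sampled


# 56 sampled communities over 1,000 members, each standing for ~1,086.
over_1k = estimate_total(56, 1086)
print(f"{over_1k:,} communities over 1,000 members")
```

The same call with a different hit count gives the estimate for any other size cutoff.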
(There's a report of the top 60 ranked communities below).
*Keep in mind that community counts and populations have little to do with actual quality, and there may be some exceptionally vibrant, small communities of only a dozen or so members. That's not what I'm analysing here.*
(I've ... heard variants of this argument for going on four years based on earlier analysis.)
But for a sense of what potential bounds are, this should be useful information.
(Needless to say, I'd really like to obtain confirmation of this from, oh, say, Google, just to pick a random authoritative source out of the air, but if it's got to be Space Alien Cats, let it be Space Alien Cats.)
Other potentially interesting bits:
Current sample (of 12,000): 6831
Total public: 6190 (90.61%)
Member distribution:
n: 5914, sum: 602661, min: 1, max: 217297, mean: 101.904126, median: 2, sd: 2913.446902
%-ile: 5: 1, 10: 1, 15: 1, 20: 1, 25: 1, 30: 1, 35: 1, 40: 1, 45: 1, 55: 2, 60: 3, 65: 4, 70: 5, 75: 8, 80: 13, 85: 22, 90: 44, 95: 125
Communities report
Total communities: 6827
Total public: 6183 (90.57%)
Total private: 644 ( 9.43%)
Total open membership: 3773 (55.27%)
Total closed membership: 3054 (44.73%)
Total membership (public only): 602661
Mean membership (public only): 101.90
(And if you're noticing that the counts are creeping up, that's because, as I've said, the script's running now, though the numbers are sufficiently solid I'm confident in leaking a set.)
(Further update: a 2nd large community with 126,123 members has turned up.)
In other news, I've been looking at present G+ public participation and the fortunes of the 4,214 active profiles I found in 2015, versus the attrition rate of Google's last-updated profiles sitemap from 2017-3-1. Of the latter set, 1.6% were unreachable, giving 404 errors when I attempted to scrape them. Of the 2015 sample, the 404 rate is 13.24%. That is, having an active profile in 2015 gives an 8.275x higher likelihood of having a dead account in 2018 than having any account in March of 2017. That's ... an interesting result.
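For the record, the 8.275x figure is simply the ratio of the two 404 rates. A trivial Python check, with a function name of my own choosing:

```python
def relative_rate(rate_a, rate_b):
    """How many times larger rate_a is than rate_b."""
    return rate_a / rate_b


# 404 rate of the 2015 active-profile sample vs. the March 2017 sitemap set.
ratio = relative_rate(13.24, 1.6)
print(f"{ratio:.3f}x")
```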
The constructive, active, public participants in G+ are too few to be seen in my current 3,000-profile sample. That corresponds roughly to < 1 million such users, and given some other data (3,248 members of G+MM as I type this, 7,061 users on the Pluspora Diaspora pod, 33,066 signatures on the "Don't Shut Down Google Plus" Change.org petition), it seems that somewhere in the 10k - 100k range, possibly bumping toward 1m with lurkers, etc., is the likely solid core. A value I've suggested often.
(And yes, agreeing on definitions, and finding accessible and trustworthy metrics, is difficult.)
1 217297 members
2 26389 members
3 23412 members
4 22953 members
5 16902 members
6 12763 members
7 11490 members
8 10321 members
9 8782 members
10 7406 members
11 6819 members
12 6704 members
13 5344 members
14 5148 members
15 4972 members
16 4624 members
17 4590 members
18 4142 members
19 3518 members
20 3368 members
21 3243 members
22 3177 members
23 3164 members
24 3158 members
25 3035 members
26 2973 members
27 2764 members
28 2507 members
29 2477 members
30 2474 members
31 2334 members
32 2226 members
33 2070 members
34 2029 members
35 1964 members
36 1745 members
37 1744 members
38 1720 members
39 1620 members
40 1614 members
41 1593 members
42 1563 members
43 1501 members
44 1458 members
45 1343 members
46 1266 members
47 1230 members
48 1196 members
49 1160 members
50 1155 members
51 1133 members
52 1084 members
53 1035 members
54 1025 members
55 1009 members
56 1008 members
57 978 members
58 964 members
59 957 members
60 924 members
Reminder: these are partial results, values will vary a bit, though should be reasonably reliable.
https://joindiaspora.com/posts/dbee1250d0680136d1dc0218b70db60d
Of all the posts of this, where would you like commentary? ;)
I take it you're scraping front pages to get the membership figure? Shame there's no easy way to get total posts, owner or moderators. Shame there's no API for communities. Shame Takeout for communities is so minimal.
Julian Bond Here for general stuff, if you really want to nerd out, the P:TBiN community.
And yes, this is just a basic web scraper. I've got a list of 12,000 Community URLs in a file, let's call that "sample", and what's running is:
cat sample | while read -r url; do w3m -dump "$url"; done | tee community-stats
I find basic web scraping is often easier to deal with than APIs (particularly rate-limited or query-limited ones) anyway. Basic, but effective.
The reporting scripts largely just grep for certain strings (that's ... occasionally inaccurate) or look for stuff and then read additional lines of input until something else likely-looking comes along, then report what was found at the end of the run (this in awk). Again, crude but useful.
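To give a flavour of the "grep for certain strings" step, here is a hedged Python sketch that pulls a member count out of dumped text. It is not the awk script described above. The line format is assumed from the "Photography - 4,320,172 members - Public" example quoted elsewhere in this thread, and real w3m output may look different.

```python
import re

# A run of digits and commas followed by "member" or "members".
MEMBERS_RE = re.compile(r"(\d[\d,]*)\s+members?\b", re.IGNORECASE)


def parse_members(text):
    """Return the first member count found in text, or None if absent."""
    match = MEMBERS_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(parse_members("Photography - 4,320,172 members - Public"))
```

Like any string match, this will occasionally grab the wrong number if a page happens to say "N members" somewhere else first.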
m'kay. This looks like a classic power law. So expect a very long tail. The interesting bit is going to be “The Fat Middle”. That’s the ~250,000 (wild guess) of communities with >250 people.
Automating it is going to be hard but hand sampling of things might be interesting. Like “date of most recent non-pinned post” and the profile ID of the owners and moderators.
As sites go from nerdy to massive the UI function gets smaller. When you start it seems really obvious to provide a list of communities like a large spreadsheet with numerous columns and sort orders. All communities sorted by most recent activity, or by owner name, or by total number of posts or whatever. By the time Google has applied its army of managers to it, you get a page of pretty pictures in response to a search and that's it. "Communities 'Suggested for you' ".
https://plus.google.com/communities/recommended seems particularly fond of very large communities high up in the "Short Head".
Wow! https://plus.google.com/communities/101740425670472889181
Photography - 4,320,172 members - Public
Julian Bond Looks as if the 250+ set is going to be closer to 175,000 communities.
At 10,246 sampled communities, 250 members is ranked 249th (coincidence!), and we're looking at about 669 actual communities per sampled one, for an actual rank of 175,478. That works out to about the 97.8%ile.
The 95%ile is holding at about 118 members, and represents 398,000 Communities.
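Converting between percentiles and community counts is simple arithmetic against the 7,974,281 total from the sitemap. A small Python sketch, with function names that are mine:

```python
TOTAL_COMMUNITIES = 7_974_281  # total from the communities sitemap


def communities_above_percentile(pct, total=TOTAL_COMMUNITIES):
    """Number of communities ranked above the given percentile."""
    return round(total * (100 - pct) / 100)


def percentile_of_rank(rank, total=TOTAL_COMMUNITIES):
    """Percentile corresponding to a rank, where rank 1 is the largest."""
    return 100 * (1 - rank / total)


print(f"{communities_above_percentile(95):,} communities above the 95%ile")
```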
Here's a thought: communities with more than 10 posts a day are something different. I think I'd call them channels. Something you watch, not something you engage with or belong to.
The onboarding process of Android and G+ is going to lead to a winner-takes-all of promoted communities with low-effort content. Which is fine, for that demographic that just wants some distraction when they have a spare moment.
Having a community promoted, if it is more like a club, could be a kiss of death. Once that started happening, membership numbers became meaningless.
Possible metrics which might be very difficult to extract: posts per week, posts per week with engagement, ratio of the two, number of posting accounts per week, number of engaging accounts per week.
As a community moderator, I have access to takeout data which gives me URLs of all of a community's posts. I haven't taken the step of fetching or even sampling from those lists. But I could share the data (suggest a method if you want it). Only one of the communities in question was thriving and most were moribund. But data from their first year of existence might yet be interesting.
Code Poetry/Members.vcf:66
Computer History Book Club/Members.vcf:377
Computer History/Members.vcf:308
Computer Science, Seriously/Members.vcf:498
Computer security _ lockpicking/Members.vcf:1830
EthnoComputing/Members.vcf:182
HP Calculators/Members.vcf:504
Mighty Moe_s Scholastic Adventures/Members.vcf:29
Postmortems/Members.vcf:921
QuackeryLand/Members.vcf:35
Retro Computing/Members.vcf:515
Software Engineering/Members.vcf:107
Twilight Machines/Members.vcf:109
Unix Retrocomputing/Members.vcf:391
VLSI Design _ Electronics/Members.vcf:20
Code Poetry/Posts-1.json:52
Computer History Book Club/Posts-1.json:166
Computer History/Posts-1.json:1000
Computer History/Posts-2.json:403
Computer Science, Seriously/Posts-1.json:232
Computer security _ lockpicking/Posts-1.json:1000
Computer security _ lockpicking/Posts-2.json:243
EthnoComputing/Posts-1.json:90
HP Calculators/Posts-1.json:247
Mighty Moe_s Scholastic Adventures/Posts-1.json:1000
Mighty Moe_s Scholastic Adventures/Posts-2.json:227
Postmortems/Posts-1.json:290
QuackeryLand/Posts-1.json:37
Retro Computing/Posts-1.json:1000
Retro Computing/Posts-2.json:1000
Retro Computing/Posts-3.json:1000
Retro Computing/Posts-4.json:1000
Retro Computing/Posts-5.json:32
Software Engineering/Posts-1.json:342
Twilight Machines/Posts-1.json:28
Unix Retrocomputing/Posts-1.json:336
VLSI Design _ Electronics/Posts-1.json:162
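Listings like the one above ("folder/file:count") are easy to roll up per community. A minimal Python sketch; the function name is made up, and the reading that each Posts file holds at most 1,000 entries is an inference from the listing, not something Takeout documents here.

```python
from collections import defaultdict


def totals_by_community(lines, prefix):
    """Sum the counts of files whose name starts with prefix, per community."""
    totals = defaultdict(int)
    for line in lines:
        path, sep, count = line.strip().rpartition(":")
        if not sep:
            continue  # skip lines without a "path:count" shape
        community, _, filename = path.rpartition("/")
        if filename.startswith(prefix):
            totals[community] += int(count)
    return dict(totals)


listing = [
    "Retro Computing/Members.vcf:515",
    "Computer History/Posts-1.json:1000",
    "Computer History/Posts-2.json:403",
    "Retro Computing/Posts-1.json:1000",
    "Retro Computing/Posts-2.json:1000",
    "Retro Computing/Posts-3.json:1000",
    "Retro Computing/Posts-4.json:1000",
    "Retro Computing/Posts-5.json:32",
]
posts = totals_by_community(listing, "Posts-")
members = totals_by_community(listing, "Members")
print(posts, members)
```

Dividing posts by members per community would give a crude activity ratio, for whatever that's worth on mostly moribund groups.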
Ed S You'd shared an earlier list of forums (I don't see those here) as well. Those were interesting to look at.
ReplyDeleteFrom a publisher and advertiser point of view (Google are both) what's sought is a large audience, with little regard to quality other than the costs of maintaining or rebuilding that audience. There are numerous tricks to quickly "goose" eyeball count, though most tend to be long-term counterproductive.
For content/day counts, I've done a bit of research and have spotted a number of patterns. Just off the top of my head:
The top-of-the-hour news rundown on a typical public broadcast network is about 5-7 items. An "hour-long" (nominally ~43 - 54 minutes in most cases) news programme typically covers about ten items, with interruptions limiting those to about 4-5 minutes each.
A national daily news site (NYTimes, WashPo, WSJ) typically produces about 100-500 original articles/day, generally more on weekends. Actual news coverage is generally a small fraction of that -- various features (style, fashion, sport, health, etc.) are a substantial share. The news wires, AP, UPI, Reuters, and AFP, produce 1,000 - 5,000 items/day.
(Sources, various, I've cited most of this on the Dreddit: https://reddit.com/r/dredmorbius Article for the newspaper bits, Vanderbilt Television News Archive for broadcast news, personal observation for radio, newswire sites / annual reports for their production.)
Daily Social Media usage is about 45 minutes, or the content-portion of a light news programme. If ten items are read, the average is 4.5 minutes each, and more likely a Poisson / power curve relation is seen.
The best communities for me tend to be somewhat like G+MM is proving to be: have a strong focus, have involved and engaged members, post a mix of relevant (and if possible actionable) content, and stimulate real thought and discussion. Not just a "marketing platform" (though that's been somewhat tried here as well -- it's a part I don't much care for.) (And yes, I suspect I've been somewhat guilty of it myself: REQUEST JSON FORMAT ;-)
The popularity kiss-of-death has been much noted at Reddit. It's a perennial topic of Theory of Reddit (https://reddit.com/r/TheoryOfReddit), and comes up in certain subreddits, notably several which have refused status as default subreddits because of the inevitable nosedive in quality and focus which follows that.
My go-to example for reddit - also happens to be the only subreddit I frequent - is the SpaceX subreddit. It's strongly moderated and refuses to be populist. That wasn't universally popular, and they created SpaceXLounge for a less curated experience. Of course people don't like to be sent to a ghetto, but it might well now be a viable alternate place. Much smaller in member count, but almost as active, perhaps?
It's rare to see a focus on quality rather than quantity, especially in at-scale capitalism. But that's where the better life experience is to be had.
Ed S Yeah, there are some very highly-curated / moderated subs, a few are even popular (AskHistorians, AskScience), but it's a constant struggle.
Buffer Social Media Examiner David Amerland Ade Oshineye Christine DeGraff John Skeats Matt Cutts
Update: The run's finished and I've sorted some discrepancies due to several now-deleted (or otherwise 404'd) Communities.
Full 12,000 sample results:
(lined-out values here are wrong by data-artifacts.)
Current sample (of 12,000): 12000
Total public: 10822 (90.18%)
Open membership ('Join'): 6590 (54.91%)
Closed membership ('Ask to join'): 7745 (64.54%)
Member distribution:
n: 10358, sum: 1331992, min: 1, max: 217297, mean: 128.595482, median: 2, sd: 3301.906970
%-ile: 5: 1, 10: 1, 15: 1, 20: 1, 25: 1, 30: 1, 35: 1, 40: 1, 45: 1, 55: 2, 60: 3, 65: 4, 70: 5, 75: 8, 80: 13, 85: 22, 90: 43, 95: 118.5
Communities report
Total communities: 12000
Null (missing) communities: 8
Total public: 10814 (90.12%)
Total private: 1170 ( 9.75%)
Total open membership: 6586 (54.88%)
Total closed membership: 5414 (45.12%)
Total membership (public only): 1331812
Mean membership (public only): 128.67
I'm working on a referenceable set of URLs with member counts and such for individual assessments. So. Much. Pr0n. Spam. A small handful of constructive communities, maybe four.
Strong log-log linear relationship of size to rank.
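One way to check a log-log relationship is an ordinary least-squares fit of log(size) against log(rank). The Python sketch below is an illustration of that technique, not the analysis actually run. The function name is mine, and the example input is just the top ten of the partial ranked list from the post.

```python
import math


def loglog_slope(sizes):
    """Least-squares slope of log(size) vs. log(rank), rank 1 = largest.

    Needs at least two sizes, all greater than zero.
    """
    ranked = sorted(sizes, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Top ten member counts from the partial ranked list in the post.
top_ten = [217297, 26389, 23412, 22953, 16902, 12763, 11490, 10321, 8782, 7406]
print(loglog_slope(top_ten))
```

A slope near -1 would be the classic Zipf case; a steeper one means memberships are even more concentrated at the top.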
50% of memberships in top 5 sampled communities. Probably means top 3k or so of actual communities, or fewer, account for half of all memberships in G+ communities.
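The concentration figure can be sanity-checked against the partial-run numbers in the post. A quick Python sketch, with the caveat that the top-five list and the membership sum (602,661) were captured at slightly different points in the run, so this is only approximate; the function name is mine.

```python
def top_share(sizes, total_members, k):
    """Fraction of all memberships held by the k largest communities."""
    largest = sorted(sizes, reverse=True)[:k]
    return sum(largest) / total_members


# Five largest communities from the partial ranked list, and the
# partial-run membership sum quoted in the post.
top_five = [217297, 26389, 23412, 22953, 16902]
share = top_share(top_five, 602661, 5)
print(f"{share:.1%} of sampled memberships")
```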