#Google+ #GooglePlus #Takeout #Shutdown #JSON #PostgreSQL

I rarely do public posts this technical... I'm not sure I ever have, but since the reality is that the G+ shutdown is happening, I wanted to share my thoughts on using Google Takeout to get my data. I'm also posting this to the Google+ Mass Migration Community in the hope it might help someone else.

~ ~ ~

Let me give you a quick reason for this post. What I think pains many of us who really embraced G+ and, to be honest, are very upset about this shutdown, is that we spent a LOT of time posting here. As of my last takeout archive on 2019-02-13, I have made 4237 posts since 2011-07-10 (you'll see how I know that later). That may seem like a lot, but it really isn't compared to other people I know. The number isn't what's important; it's my time, and that is something that can't be quantified because time is invaluable.

For that reason, the prospect of losing the results of nearly 8 years of my life spent here was really getting to me. More so than the shutdown itself and more than considering what comes next (and I still don't know that one either). The one thing I needed from Google was a way to give me something so that this wasn't all for naught.

Enter Google Takeout...

First things first: it DOES work. Mine came to about 103 GB split over 3 files. What was happening to me and other people was that an error was showing, but it did not affect overall data fidelity. Once you get all the files, you'll see that there is an index.html that shows you what Google Takeout could not retrieve for you. There is a very high chance that it is just a media file of some sort.

I don't know about you, but that was not critical to me. Anything I produced is part of Photos, and truth be told the archive would have been smaller if that data wasn't included, but having duplicates is fine (since who knows whether Photos might be taken from us too).

Ok, so now that I have my data, now what?

This is where things are going to get technical, but that's because it has to be, and for those of us who are technically inclined, I hope this gets you started.

Having my history is one thing, but being able to refer to it in a meaningful way is another matter. In my case, what I wanted was something searchable so that regardless of what I do next, I can look up my past writings, if for no other reason than to see how I've personally evolved or to understand the context of something from years past.

Enter PostgreSQL... a database system

Just about every major application I've ever written uses PostgreSQL (I could go on for hours about how much I love this system, but that's out of scope and you don't care). One of its more recent features is the ability to work with JSON, including the ability to query elements directly, index them, and do pretty much anything else you would do with normalized or "proper" data structures. Basically, PostgreSQL can handle SQL and "NoSQL" workloads in the same system.
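
As a tiny illustration of what that looks like (a minimal sketch against the postjson table we'll create in step 3 below, so the table and field names are just the ones I ended up with), pulling individual elements out of the raw JSON is as simple as:

-- ->> extracts a JSON field as text; creationTime sorts correctly as text because it's ISO-8601
SELECT data ->> 'url' AS url,
       data ->> 'creationTime' AS creationtime
FROM postjson
ORDER BY data ->> 'creationTime'
LIMIT 5;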

Furthermore, PostgreSQL can also do full text indexing, which means I can efficiently search my posts in a more "web search" style.
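
If you haven't seen PostgreSQL's full text search before, the whole mechanism boils down to a text-to-vector function, a query function and a match operator. Here's a standalone example you can paste into psql (no table required):

-- 'foxes' is stemmed to 'fox', so this returns true
SELECT to_tsvector('english', 'The quick brown foxes jumped') @@ to_tsquery('english', 'fox');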

The trick is getting the JSON files you get via Google Takeout into a table in the database system. I say "trick" because Google provides you "pretty" JSON files (aka human readable, if you want to call it that), and those multi-line files won't load directly with PostgreSQL's COPY, which expects one JSON document per line, so you have to do some basic manipulation of the files first. Fortunately, in Linux this is trivial.

To recap so far...
1) Use Google Takeout to export your Google Plus data to JSON files
2) Download and extract all the files to the same path

You'll have a path that starts with Takeout/Google+ Stream, and under that you'll see Posts, Photos, Events and ActivityLog.
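
If you want to double-check that layout before importing (the exact folder names could differ slightly depending on your Takeout settings, so treat this as a sketch), a quick listing does it:

# list the top-level folders of the extracted G+ stream
find "Takeout/Google+ Stream" -maxdepth 1 -type d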

3) Create a database with a table similar to this...

CREATE TABLE public.postjson
(
    id serial NOT NULL,
    data jsonb,
    CONSTRAINT "Posts_pkey" PRIMARY KEY (id)
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;
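
(That assumes the database itself already exists. If it doesn't, create it first; the name below is just the one I used, the date of the takeout, and the binary path matches my install, so adjust both to taste.)

/usr/local/pgsql/bin/createdb "G+20190213"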

4) Import your data, correcting for JSON formatting issues (i.e. remove the newlines) and also escaping special characters. Linux / Unix people, you know this drill. Here's what I did initially from the Posts level...

for f in 2011*; do cat "$f" | tr -d '\n' | sed 's/\\/\\\\/g' | /usr/local/pgsql/bin/psql G+20190213 -c "COPY postjson (data) from STDIN"; done

What this does is take each file in Posts that starts with 2011, remove the \n or "newline" characters, and then convert every \ character to \\. That doubling is needed because COPY's text format treats a lone backslash as the start of an escape sequence, so it has to be escaped to come through as a literal backslash.

The psql command line tool then imports the data. In my case, this succeeded without error and all 95 of my 2011 posts were ingested. I just went year by year after that.
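
A quick sanity check after each import pass, and incidentally how I got the post count and first-post date I quoted at the top of this post, is just:

-- count the imported posts and find the earliest creationTime
SELECT count(*) AS total_posts,
       min(data ->> 'creationTime') AS first_post
FROM postjson;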

5) Create views and indexes to "see" and "search" your posts

As you can imagine, not every post you have made has the same information, but there are some things that are common. So as a basic starting point you can do something like this:

CREATE OR REPLACE VIEW public.posts AS
SELECT postjson.id,
       postjson.data ->> 'creationTime'::text AS creationtime,
       postjson.data ->> 'updateTime'::text AS updatetime,
       postjson.data ->> 'url'::text AS url,
       postjson.data ->> 'content'::text AS content,
       postjson.data ->> 'comments'::text AS comments
FROM postjson
ORDER BY postjson.id, (postjson.data ->> 'creationTime'::text);

That will give you a view of your table with those common fields. Your post text is the "content" column and comments is, you guessed it, the "comments" column. What each of those two fields contains is a combination of HTML and other JSON fields, but at least now you have something that you can query against with typical SQL. For example, let's say I was looking for a post I did about my track time at Monticello (this was my second post on G+ so I knew it was there)...

select * from posts
where content ilike '%monticello track%'

The above is a basic SQL query doing a case-insensitive search for the phrase "monticello track" in my post text. It doesn't return anything because "monticello" and "track" don't occur next to each other. I could search for each word individually, but that would produce more false positives. I could also write a more complicated SQL query, but by using full text indexing I don't have to, and FTS is well suited to this type of query. Let's create indexes for content and comments (just in case I ever want to search those too):

CREATE INDEX postcontent_idx
ON public.postjson USING gin
(to_tsvector('english'::regconfig, data ->> 'content'::text))
TABLESPACE pg_default;

and

CREATE INDEX postcomment_idx
ON public.postjson USING gin
(to_tsvector('english'::regconfig, data ->> 'comments'::text))
TABLESPACE pg_default;

That same query would now be:

select * from posts
where to_tsvector('english',content) @@ websearch_to_tsquery('monticello track')

and properly returns the post I was expecting. You don't strictly have to create the indexes, but on my system the indexed query takes about a third of the time compared to doing a sequential scan of all the records.
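
If you're curious whether the planner is actually using the GIN index, EXPLAIN will tell you. A minimal sketch, run against the underlying table with the same expression the index was built on (on a very small table the planner may still prefer a sequential scan):

EXPLAIN ANALYZE
SELECT data ->> 'url'
FROM postjson
WHERE to_tsvector('english', data ->> 'content') @@ websearch_to_tsquery('monticello track');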

That's pretty much it! Now I have a basic ability to search and retrieve my posts (and comments). There are other things I can do with this, but now I have a solid starting point.

6) What about a subsequent dump?

One thing that I'm going to do is another dump close to the end of February; I'm still not clear on what is actually going to happen in March. Since you can't select date ranges and have to download everything again, you can create a unique index against your JSON content so that running the same import steps won't create duplicates.

CREATE UNIQUE INDEX postdata_idx
ON public.postjson USING btree
(encode(digest(data::text, 'sha1'::text), 'hex'::text) COLLATE pg_catalog."default")
TABLESPACE pg_default;

You'll need to add the pgcrypto extension to your database for this, but what it gives you is a SHA-1 hash of each post's JSON, which is how you can fingerprint something digitally.
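
Enabling pgcrypto is a one-liner run inside the target database (it needs to be done by a role with permission to create extensions):

CREATE EXTENSION IF NOT EXISTS pgcrypto;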

Finally, I know I've omitted quite a bit, so here's a link to the complete schema you can use to build what I'm using now in your own database. It was designed around PostgreSQL, but you might be able to port it to something more to your liking. Again, I hope this serves as a starting point to ease someone else's anxiety.

Good Luck!

https://drive.google.com/file/d/1yvzowNUIhHL1ffNX6ShzdPU9L9PS2Bwm/view?usp=sharing

Comments

  1. Sarah Rosen Thanks... I'm definitely sleeping better now LOL

  2. Keith C. Perry Have you tried converting Google takeout json to ActivityPub compliant json?
    w3.org - ActivityPub

  3. Mike Noyes I didn't know about this but it doesn't really apply in terms of getting the data into my own database.

    This is probably a good thing to know for whomever I end up going to. Being able to automatically ingest JSON streams will make it easier to make sure nothing else is at risk in the future.

  4. Good information that I fortunately understand thanks to my previous employment supervising web development and the website for the agency. But I do know when something is way out of my skills and, perhaps more importantly, when to decide what not to do. I'll not go further than using my JSON export with a text editor to extract in text what I want from posts. I'll gladly receive what will be a load of other stuff to delete right away.

    Thanks for your report!

  5. Hey, this is good stuff. Thanks for sharing.

  6. Keith C. Perry You may be able to use jq as a more robust JSON reformatter, in step 4.

  7. Edward Morbius funny, you mentioned jq. I did look at that first since I don't normally work with JSON or NoSQL workloads. What I didn't realize at the time was that what Google produces is what I've been calling "pretty" JSON, which is essentially what jq does by default (and if you look at my schema, PostgreSQL can do that too, so I included a view that lets me look at the JSON data in a manner close to what Google sent).

    Once you realize that newlines are the problem, this becomes a standard issue we see in the Linux / Unix world (having to remove single characters from a file or stream). There are about 3 or 4 tools included on every system that can do this, so even if jq did work, I would still go with native tools if they get the job done.

    Plus, as you can see, I can chain the native tools together and probably create a nice little one-liner to combine a number of tasks.

  8. cobalt please For what it's worth, I'm an OCP DBA. This isn't webdev stuff. It's database. What makes this particularly interesting to me is that Keith C. Perry's code looks to be compatible with Oracle PL/SQL with minimal hammering and sawing.

  9. Edward Morbius I'm looking at that and wondering how hard it would be to make a SQL*Loader control file.

  10. Sarah Rosen I was referring to the software and DB we had to work with, but I see I wasn't clear enough. Thanks.

  11. Keith C. Perry I think Edward Morbius meant this option:
    --compact-output / -c:
    By default, jq pretty-prints JSON output. Using this option will result in more compact output by instead putting each JSON object on a single line.

    That way you can let an actual JSON parser handle it, rather than having to use sed :)

  12. Keith C. Perry jq is a mature tool for dealing with JSON. And as Filip H.F. Slagter notes, you're getting a domain-specific tool built for parsing JSON rather than trying to mimic that via regexes.

    (I may have actually mimicked that using regexes at some point in the past ... week.)

  13. Filip H.F. Slagter and Edward Morbius the issue isn't about parsing JSON per se. The issue is loading the database system. In that regard a little bit of regex goes a long way :)

    For giggles, I cloned my database and tried to reload my 2011 JSON files (a total of 95) with

    for f in $(ls -1 2011*); do cat $f | jq -c '.' | /usr/local/pgsql/bin/psql G+test -c "COPY postjson (data) from STDIN"; done

    and only 35 didn't flag some sort of error, because you still need to handle the escape character issue (piping the jq output through sed worked).

    So either tr or jq can be used as long as you still run it through something like sed.
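
    In other words, something along these lines (the same loop as above with the sed step added back in):

    for f in 2011*; do jq -c '.' "$f" | sed 's/\\/\\\\/g' | /usr/local/pgsql/bin/psql G+test -c "COPY postjson (data) from STDIN"; done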

  14. Mike Noyes Yes, something similar to that would work. The main issue is temporal brain buffer space for processing a script. I have an approximate temporal brain buffer space of 15 minutes per sitting, at least until I can reach a flow-state (see https://en.wikipedia.org/wiki/Flow_(psychology)).

    I'm not kidding. I once reached a flow-state while writing Schemetris (Tetris in Scheme) in college, and was able to code for approximately 6 hours straight during midnight hours, but that was because it was a game project (Tetris) using a favorite programming language (Scheme) with someone else I was able to get along with. A flow state occurs much more easily if I can feel relaxed and concentrate in doing something that I like.

    That just doesn't occur for UNIX shell-scripting. First off, I don't like writing non-game applications in unfamiliar territory. Second, I don't like using a programming language that isn't similar to Scheme, Lisp, or some functional programming language (such as Haskell) (with the exception of Ruby--but not Python). Third, I don't like complicated, tedious procedures with many steps that can go wrong, and this is exactly that sort of procedure.

    After making a few mistakes, I wind up starting to get a headache. It reminds me of that Lempel-Ziv-Welch encode/decode nightmare assignment in C that I was forced to do for my systems programming class, and that I never got right.

    Some programs can be fun to write. Towers of Hanoi in Scheme can be fun. Fibonacci and factorial are very fun. Generally speaking, anything that is short, simple, and recursive in Scheme or any Lisp-like language qualifies as "fun" to me. But writing a complicated shell script with many options as part of a convoluted procedure with many steps that can go wrong just doesn't qualify as "fun" to me.
    en.wikipedia.org - Flow (psychology) - Wikipedia

  15. Mike Noyes I initially tried some things like that and it didn't work. YMMV so I'd give it a try and see if it works for you.

  16. Benjamin Russell This brings back memories of nightly builds of DocBook source into html and pdf on the SourceForge shell. Cron, bash, cvs, sed, ant, fop, cpan, etc.

  17. Keith C. Perry It has been a long time since I did anything like that. I started getting a headache looking at your regex. I hope someone younger will create an ActivityPub parser/converter for Google+ takeout. If push comes to shove I may take a stab at it.

  18. Benjamin Russell the "natural high" of flow state is why I play guitar.

  19. Mike Noyes heheh, I hear ya. I do a regular amount of database work, so using regex to massage data so it can be imported is something I've become rather adept at. The goal is always to do just enough to have high data fidelity once it's in the database - i.e. keep it as raw as possible so I can use the database to do the real work.

    I'm not a fan of JSON, but younger and less experienced "programmers" (a term I quote these days) seem to be, which is a good thing. NoSQL data and asymmetric data structures are a thing and always have been; we're just coming to terms with that in recent years.

    The thing that I really like about pgSQL is that now I have this capability in the same system I'm used to so this was not only a first for me but a chance to see how usable the system is.

    So far, I couldn't be happier. I've been able to find and pull up my posts easily, which has been necessary with a couple of more recent posts I've done (just to refer back and see what I've written before).

    This is way easier than having to search for my stuff on G+. I've never been able to get that to work properly.

  20. Sarah Rosen

    "the "natural high" of flow state is why I play guitar. "

    Funny, I'm a musician too. As we all know though, embracing your artistic side keeps you balanced ☺


  21. > "Thou may have, in return,
    > Such means necessary
    >
    > "For in that Empire State
    > A Big Apple to see."
    >
    > This contract we both vow,
    > and onward we journey.
    >
    > Until one cannot miss
    > This sight, amid its steeps,
    >
    > Looming, high above, this
    > City that Never Sleeps.
    >
    > Such land where Chaos reigns
    > Never before have seen!
    >
    > Inhabitants around
    > Push others all about,
    >
    > 'Til not a soul is stout,
    > Still standing on the ground.
    >
    > Each warily about,
    > Gazes, always to doubt,
    >
    > Whether 'tis safe to trust,
    > Or else be made to bust.

  22. Keith C. Perry I'm getting the feeling that I should have encountered you on G+ long before this. Where will you be found when the lights go out?

  23. It looks like a lot of the initial hard work is done. Bernhard Suter completed his Diaspora* conversion. Now all we need is someone to convert to ActivityPub for the rest of the federated fediverse/federation.
    blog.kugelfish.com - Google+ Migration - Part VIII: Export to Diaspora*

  24. Benjamin Russell "It's rather funny how you quote "programmers" similarly to how I quote "poets.""

    Yea, I can't help myself sometimes but I'm trying to do better. LOL

    I like to keep a variety of people in my orbit to keep me on my toes, but I've been doing this for a long time, and "programmer", like "engineer" (something that I actually have degrees in), is overused and gets under my skin.

    Sarah Rosen same to you! Always happy to meet another jedi ;)

    I'm not sure yet... MeWe is probably tops now, but someone also told me about Liker. I just circled you, so within the next week or so I think I'll have something figured out, but I'm going to at least provide my email address to people in my circles.

  25. Yea, Diaspora is up on my list too, but just thinking about picking where to go next is stressful.

    I really wish Google would give the whole thing to the Apache Foundation. They get most of the Google IP that gets open sourced.

    I honestly would pay a couple bucks a year to keep the family together, if for no other reason than that it would support Apache's work.

    If we low-ball it to 200 million active users and then say only 1/3 of the people here would be willing to pay $1 USD / year, that would be $66 million / year.

    That would actually be a good funding level. It might even allow a community board to interface with Apache so we could help direct development.

  26. Keith C. Perry Maybe Rich Bowen can facilitate action on the Apache front.

  27. Mike Noyes I've mentioned this before publicly elsewhere, but I think I might try to see who I might know over there, or maybe someone 1 or 2 degrees from me.

    Might be a good idea to start some sort of email campaign.

  28. Keith C. Perry Hum, ActivityPub uses JSON-LD.

