
Continuing the takeout data migration process with a first look at the data that is in the archive.

https://blog.kugelfish.com/2018/10/google-migration-part-ii-understanding.html

Comments

  1. What I hate for automatic processing: They localize the takeout directory names. I get something like “Takeout/Stream in Google+/Beiträge” instead of “Takeout/Google+ Stream/Posts”.

  2. Bernd Paysan - thanks for pointing this out! This is indeed rather unhelpful for developing easily reusable tools...

  3. Bernhard Suter I would probably deal with that by collecting their localization database (probably requires setting up a test user, and changing the language of this test user), and use an i18n package to localize the takeout directory, too… but yes, this will be more error-prone, and require user interactions, if it doesn't quite work.
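    The mapping idea above could be sketched roughly as follows. This is only an illustration: the `LOCALIZED_TO_CANONICAL` table and the `canonical_path` helper are hypothetical names, and the only entries in the table are the German pair quoted in the first comment. Other locales would have to be collected separately.

```python
import unicodedata
from pathlib import PurePosixPath

# Hypothetical lookup table: localized path component -> canonical English one.
# Only the German names quoted in this thread are included.
LOCALIZED_TO_CANONICAL = {
    "Stream in Google+": "Google+ Stream",
    "Beiträge": "Posts",
}


def canonical_path(path):
    """Replace every known localized component of an archive path."""
    parts = []
    for part in PurePosixPath(path).parts:
        # Some systems store "ä" as "a" + combining diaeresis; normalize first.
        part = unicodedata.normalize("NFC", part)
        parts.append(LOCALIZED_TO_CANONICAL.get(part, part))
    return str(PurePosixPath(*parts))


print(canonical_path("Takeout/Stream in Google+/Beiträge/example.json"))
# prints: Takeout/Google+ Stream/Posts/example.json
```

    Unknown components pass through unchanged, so a tool built on this would still need to report paths it could not map.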

  4. "Once we the takeout archive" - the first five words.

  5. An example of Collections JSON info for a post in a single collection.
    "postAcl": {
      "collectionAcl": {
        "collection": {
          "resourceName": "collections/AB2YX",
          "displayName": "Politics"
        }
      }
    }
    Here's the HTML
    https://voidstar.com/Takeout/Google+/Posts/20160216%20-%20I%20wonder%20how%20much%20easier%20travelling%20by_.html

    Shared to the collection Politics (https://plus.google.com/collection/AB2YX) - Private

  6. Nice start already! :)
    You might also find my analysis of the JSON activity files useful as reference: https://social.antefriguserat.de/index.php/Data_Migration_Process_and_Considerations#Takeout_Data_Structure (location within the wiki will change, but I'll make sure to leave a reference to the new location once I've extracted it to a page of its own).
    The detailed example there might still be missing some data (I noticed for instance that I haven't included Location example data in it), but I do think the flat structure of hash keys is quite complete. Perhaps you can run the jq-command as well on your json files, and diff it against mine to see if there are any more keys I'm missing?

    In the comments of https://plus.google.com/104092656004159577193/posts/VXFuh7kJFyd?fscid=z13mffthxm3jghkpe04cjjgi3ti3zp1gz0c.1540739963224711 you'll find some additional jq library method definitions to quickly filter down the contents of the JSON files to public posts, with or without comments, media (or even narrowed down to images, video or audio), and interactions with specific people. I'll release this as well as a public git repo on Github and/or Gitlab once I've finished writing documentation for it.
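    For anyone without jq at hand, a flat key listing in the same notation as the wiki page can be approximated in Python. This is a sketch and not the jq command referred to above; `key_paths` and `flat_keys` are hypothetical helper names, and the sample post is made up for illustration.

```python
def key_paths(value, prefix=""):
    """Yield every key path in parsed JSON, writing list items as 'name[]'."""
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from key_paths(child, path)
    elif isinstance(value, list) and value:
        path = prefix + "[]"
        yield path
        for item in value:
            yield from key_paths(item, path)


def flat_keys(posts):
    """Return the sorted, de-duplicated key paths found across all posts."""
    found = set()
    for post in posts:
        found.update(key_paths(post))
    return sorted(found)


# Made-up sample data; real input would come from json.load() on each post file.
sample = [{"author": {"displayName": "A"}, "comments": [{"content": "x"}]}]
mine = flat_keys(sample)
print("\n".join(mine))

# Keys present in a reference listing but absent from my own files.
reference = {"author", "author.displayName", "location"}
print(sorted(reference - set(mine)))
```

    Comparing two such listings as sets gives the same information as diffing the text output, without depending on sort order.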

  7. The takeout data structures seem different, but related to the API structures.
    developers.google.com - Activities | Google+ Platform for Web | Google Developers

    Have you matched them off? Is there data available in the API that doesn't appear in the Takeout?

  8. Julian Bond years ago, at least back in 2013, the JSON files actually matched the API's Activity resource structure. The google-api-client models were actually what I initially used to try to load my JSON files, as I was developing against an old Takeout backup. That is also why I was quite disappointed to find out they had changed part of their Takeout data structure (without updating their api-client or providing docs).

    As for changes: the Access Control List (ACL) structure has changed significantly. Access is no longer decided by a single type identifier. Instead it is split up into visibleToStandardAcl (which controls circle and individual user visibility), communityAcl (which community it is posted to), eventAcl (for Events, and who has been invited to them), and collectionAcl (giving access to those following a collection). There is also an 'isLegacyAcl' key, whose purpose I'm not quite sure of yet.
    Another significant change is that the current format no longer contains an originalContent key. It used to hold the contents of the Activity unformatted, that is, the exact same text you'd submit, complete with asterisks, underscores and dashes, rather than their interpretation as HTML formatting. What is left is just the 'content' keys, which contain the HTML-formatted content.
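    As a rough illustration of the split ACL structure, the sketch below reports which of the ACL sub-keys named above are present on a post and pulls out the collection name. The helper names `acl_kinds` and `collection_name` are hypothetical, and the sample post reuses the collection example from earlier in this thread. It deliberately does not try to decide whether a post is public, since the values that would signal this are not described here.

```python
# ACL sub-keys as named in the comment above.
ACL_KEYS = ("visibleToStandardAcl", "communityAcl", "eventAcl", "collectionAcl")


def acl_kinds(post):
    """Return the ACL sub-keys present under the post's 'postAcl' key."""
    acl = post.get("postAcl", {})
    return [key for key in ACL_KEYS if key in acl]


def collection_name(post):
    """Return the collection's displayName, or None if not shared to one."""
    collection = post.get("postAcl", {}).get("collectionAcl", {}).get("collection", {})
    return collection.get("displayName")


# Sample built from the collection example quoted earlier in the comments.
post = {
    "postAcl": {
        "collectionAcl": {
            "collection": {
                "resourceName": "collections/AB2YX",
                "displayName": "Politics",
            }
        }
    }
}

print(acl_kinds(post))        # prints: ['collectionAcl']
print(collection_name(post))  # prints: Politics
```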

  9. Julian Bond as a reference, compare this flat structure of all the possible keys (at least as found in my own json files) of the old format:
    access
    access.description
    access.items
    access.items[]
    access.items[].type
    access.kind
    actor
    actor.displayName
    actor.id
    actor.image
    actor.image.url
    actor.url
    annotation
    etag
    id
    kind
    location
    location.address
    location.address.formatted
    location.displayName
    location.kind
    location.position
    location.position.latitude
    location.position.longitude
    object
    object.actor
    object.actor.displayName
    object.actor.id
    object.actor.image
    object.actor.image.url
    object.actor.url
    object.attachments
    object.attachments[]
    object.attachments[].categories
    object.attachments[].categories[]
    object.attachments[].categories[].schema
    object.attachments[].categories[].term
    object.attachments[].content
    object.attachments[].displayName
    object.attachments[].embed
    object.attachments[].embed.type
    object.attachments[].embed.url
    object.attachments[].fullImage
    object.attachments[].fullImage.height
    object.attachments[].fullImage.type
    object.attachments[].fullImage.url
    object.attachments[].fullImage.width
    object.attachments[].id
    object.attachments[].image
    object.attachments[].image.height
    object.attachments[].image.type
    object.attachments[].image.url
    object.attachments[].image.width
    object.attachments[].objectType
    object.attachments[].thumbnails
    object.attachments[].thumbnails[]
    object.attachments[].thumbnails[].description
    object.attachments[].thumbnails[].image
    object.attachments[].thumbnails[].image.height
    object.attachments[].thumbnails[].image.type
    object.attachments[].thumbnails[].image.url
    object.attachments[].thumbnails[].image.width
    object.attachments[].thumbnails[].url
    object.attachments[].url
    object.content
    object.id
    object.objectType
    object.originalContent
    object.plusoners
    object.plusoners.items
    object.plusoners.items[]
    object.plusoners.items[].displayName
    object.plusoners.items[].etag
    object.plusoners.items[].id
    object.plusoners.items[].image
    object.plusoners.items[].image.url
    object.plusoners.items[].kind
    object.plusoners.items[].url
    object.plusoners.totalItems
    object.replies
    object.replies.items
    object.replies.items[]
    object.replies.items[].actor
    object.replies.items[].actor.displayName
    object.replies.items[].actor.id
    object.replies.items[].actor.image
    object.replies.items[].actor.image.url
    object.replies.items[].actor.url
    object.replies.items[].etag
    object.replies.items[].id
    object.replies.items[].kind
    object.replies.items[].object
    object.replies.items[].object.content
    object.replies.items[].object.objectType
    object.replies.items[].object.originalContent
    object.replies.items[].plusoners
    object.replies.items[].plusoners.totalItems
    object.replies.items[].published
    object.replies.items[].updated
    object.replies.items[].verb
    object.replies.totalItems
    object.resharers
    object.resharers.items
    object.resharers.items[]
    object.resharers.items[].displayName
    object.resharers.items[].etag
    object.resharers.items[].id
    object.resharers.items[].image
    object.resharers.items[].image.url
    object.resharers.items[].kind
    object.resharers.items[].url
    object.resharers.totalItems
    object.statusForViewer
    object.statusForViewer.canComment
    object.statusForViewer.canPlusone
    object.statusForViewer.isPlusOned
    object.statusForViewer.resharingDisabled

  10. object.url
    provider
    provider.title
    published
    title
    updated
    url
    verb



    to the current format:
    album
    album.media
    album.media[]
    album.media[].contentType
    album.media[].description
    album.media[].height
    album.media[].resourceName
    album.media[].url
    album.media[].width
    author
    author.avatarImageUrl
    author.displayName
    author.profilePageUrl
    author.resourceName
    comments
    comments[]
    comments[].author
    comments[].author.avatarImageUrl
    comments[].author.displayName
    comments[].author.profilePageUrl
    comments[].author.resourceName
    comments[].content
    comments[].creationTime
    comments[].link
    comments[].link.imageUrl
    comments[].link.title
    comments[].link.url
    comments[].media
    comments[].media.contentType
    comments[].media.height
    comments[].media.resourceName
    comments[].media.url
    comments[].media.width
    comments[].postUrl
    comments[].resourceName
    comments[].updateTime
    communityAttachment
    communityAttachment.coverPhotoUrl
    communityAttachment.displayName
    communityAttachment.resourceName
    content
    creationTime
    link
    link.imageUrl
    link.title
    link.url
    location
    location.displayName
    location.latitude
    location.longitude
    location.physicalAddress
    media
    media.contentType
    media.description
    media.height
    media.resourceName
    media.url
    media.width
    plusOnes
    plusOnes[]
    plusOnes[].plusOner
    plusOnes[].plusOner.avatarImageUrl
    plusOnes[].plusOner.displayName
    plusOnes[].plusOner.profilePageUrl
    plusOnes[].plusOner.resourceName
    postAcl
    postAcl.communityAcl
    postAcl.communityAcl.community
    postAcl.communityAcl.community.displayName
    postAcl.communityAcl.community.resourceName
    postAcl.communityAcl.users
    postAcl.communityAcl.users[]
    postAcl.communityAcl.users[].displayName
    postAcl.communityAcl.users[].resourceName
    postAcl.eventAcl
    postAcl.eventAcl.event
    postAcl.eventAcl.event.resourceName
    postAcl.isLegacyAcl
    postAcl.visibleToStandardAcl
    postAcl.visibleToStandardAcl.circles
    postAcl.visibleToStandardAcl.circles[]
    postAcl.visibleToStandardAcl.circles[].displayName
    postAcl.visibleToStandardAcl.circles[].resourceName
    postAcl.visibleToStandardAcl.circles[].type
    postAcl.visibleToStandardAcl.users
    postAcl.visibleToStandardAcl.users[]
    postAcl.visibleToStandardAcl.users[].displayName
    postAcl.visibleToStandardAcl.users[].resourceName
    resharedPost
    resharedPost.album
    resharedPost.album.media
    resharedPost.album.media[]
    resharedPost.album.media[].contentType
    resharedPost.album.media[].description
    resharedPost.album.media[].height
    resharedPost.album.media[].resourceName
    resharedPost.album.media[].url
    resharedPost.album.media[].width
    resharedPost.author
    resharedPost.author.avatarImageUrl
    resharedPost.author.displayName
    resharedPost.author.profilePageUrl
    resharedPost.author.resourceName
    resharedPost.content
    resharedPost.link
    resharedPost.link.imageUrl
    resharedPost.link.title
    resharedPost.link.url
    resharedPost.media
    resharedPost.media.contentType
    resharedPost.media.description
    resharedPost.media.height
    resharedPost.media.resourceName
    resharedPost.media.url
    resharedPost.media.width
    resharedPost.resourceName
    resharedPost.url
    reshares
    reshares[]
    reshares[].resharer
    reshares[].resharer.avatarImageUrl
    reshares[].resharer.displayName
    reshares[].resharer.profilePageUrl
    reshares[].resharer.resourceName

  11. Julian Bond - thanks for providing the collection example. This seems to be another ACL, which is not how I would have expected collections to work...

  12. Filip H.F. Slagter - I have a subset of the keys listed in the wiki. In particular no communityAttachment/... and postAcl.communityAcl.users/...

    What is a communityAttachment? Sharing a community through a post? What is communityAcl.users? Limiting a community post to only a few users? Can this be done for circles as well?

  13. The communityAcl.users[] key is probably populated when the post, shared to a community, also mentions other users, thus automatically including them in the audience. I'd have to double check the actual JSON file from which it was grabbed to be sure though.

    As for communityAttachment, that's exactly what it is. See https://plus.google.com/112064652966583500522/posts/5YqznvvKu7c as an example of such a post.

