Reconstructing post to image references in the takeout archive is unfortunately not reliably possible in all cases.
As a workaround download the images again from the provided URLs.
https://blog.kugelfish.com/2018/11/google-migration-part-v-image.html
Thanks, this is an excellent series.
Earlier, for those who've missed it:
https://blog.kugelfish.com/2018/10/google-migration-part-i-takeout.html
https://blog.kugelfish.com/2018/10/google-migration-part-ii-understanding.html
https://blog.kugelfish.com/2018/11/google-migration-part-iii-content.html
https://blog.kugelfish.com/2018/11/google-migration-part-iv-visibility.html
Highlighted on #PlexodusWiki:
https://social.antefriguserat.de/index.php/Data_Migration_Process_and_Considerations#References
blog.kugelfish.com - Google+ Migration - Part I: Takeout
John Skeats Care to share this to G+H?
Bernhard Suter check your nav links on the blog, some don't link forward correctly.
404:
https://blog.kugelfish.com/2018/10/google-migration-part-ii-understanding.html
No link to V:
https://blog.kugelfish.com/2018/11/google-migration-part-iv-visibility.html
Bernhard Suter It may simply make sense to grab the image by URL, create a distinct hash, name the file by that, and update references to point to that hashed name. I'm surprised Google don't already do this.
Retaining original name and URL somewhere as attributes w/in the processed archive might also be useful.
Disambiguation is hard.
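A minimal sketch of that idea in Python, assuming SHA-256 as the hash and a JSON sidecar file for the retained attributes (both are illustrative choices, not something the archive or the blog series prescribes). The names `fetch` and `store_by_hash` are hypothetical.

```python
import hashlib
import json
import urllib.request
from pathlib import Path


def fetch(url: str) -> bytes:
    """Download the raw bytes behind an image URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def store_by_hash(data: bytes, original_name: str, url: str, dest_dir: Path) -> Path:
    """Save image bytes under a content-hash name and record where they came from."""
    digest = hashlib.sha256(data).hexdigest()
    suffix = Path(original_name).suffix.lower()  # keep the extension, if any
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"{digest}{suffix}"
    if not target.exists():  # identical content is stored only once
        target.write_bytes(data)

    # Sidecar file (assumed layout): every original name and URL seen for this content.
    sidecar = dest_dir / f"{digest}.json"
    sources = json.loads(sidecar.read_text()) if sidecar.exists() else []
    entry = {"original_name": original_name, "url": url}
    if entry not in sources:
        sources.append(entry)
    sidecar.write_text(json.dumps(sources, indent=2))
    return target
```

Calling `store_by_hash(fetch(url), name, url, Path("images"))` for each image reference would then yield one file per distinct image, regardless of how many posts or filenames point at it.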
Have you tried comparing the file creation times from the images with the json files / the post's creation timestamp? AFAIK the files in the archive should have retained their original upload date/time.
Also worth noting perhaps is that in rare cases the file extensions will be cropped (.jp or .j rather than .jpg or .jpeg, and also .metadata might have a variety of croppings), rather than the filenames, especially with filenames of a certain long length. This is a bug I've reported through Feedback myself, but if you also run into it, it wouldn't hurt to report it yourself as well.
The issues you've encountered with the deduplication and lack of matching post identifier to match images up to their posts is imho also worth reporting through Feedback.
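Until that is fixed, one way to cope with cropped image extensions is to ignore the extension and look at the first bytes of the file instead. The sketch below is only an illustration: it knows three common formats, assumes the cropped piece is the final dot-suffix of the name, and does not handle the cropped `.metadata` files. The names `sniff_extension` and `repaired_name` are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Leading "magic" bytes of a few common image formats.
SIGNATURES = {
    b"\xff\xd8\xff": ".jpg",
    b"\x89PNG\r\n\x1a\n": ".png",
    b"GIF87a": ".gif",
    b"GIF89a": ".gif",
}


def sniff_extension(header: bytes) -> Optional[str]:
    """Guess the extension from the first bytes of a file, or None if unknown."""
    for magic, ext in SIGNATURES.items():
        if header.startswith(magic):
            return ext
    return None


def repaired_name(path: Path) -> Path:
    """Return a path whose extension matches the file content, when it can be identified."""
    with path.open("rb") as handle:
        ext = sniff_extension(handle.read(16))
    if ext is None:
        return path  # unknown content: leave the name alone
    accepted = {ext, ".jpeg"} if ext == ".jpg" else {ext}
    if path.suffix.lower() in accepted:
        return path  # extension is already intact
    return path.with_suffix(ext)
```

The function only proposes a name; renaming with `path.rename(repaired_name(path))` is left to the caller so the result can be reviewed first.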
I've already sent feedback suggesting an additional metadata file mapping any changed filenames to their original filenames, but that was mostly in the context of cropped filenames. An additional Feedback report to request for metadata to map to an Activity#id that matches the API and a Google+ Stream Post resourceName would probably be a good idea too. I'll do so myself once I run into the issue too, but the more relevant, individual, Feedback reports they get, the higher the probability of them fixing it (I hope).
I'm curious btw if the files in the archives are at original quality, or if they have been compressed or stripped of EXIF/PICT metadata. If they are completely original, then that would be an argument for finding ways to match up the Takeout files rather than redownload them (as those available from Google's usercontent servers are likely stripped of metainfo and probably also converted to png or webp).
Need to remember to do a few tests with this myself.
Filip H.F. Slagter Hah, I was going to mention EXIF.
My recollection is that some of that (device identification, possibly time/location) is stripped, but technical (exposure, aperture, ISO) are not.
The main problem is that there's no correlation of the EXIF data to the extract AFAIR, though you've looked at that more than I have in the past five years.
Filip H.F. Slagter - at this point, I have a very low tolerance threshold for crazy heuristics like relying on timestamps or EXIF which might or might not work in all cases. We are not yet doing forensic reconstruction... For those with a fast network, re-downloading becomes the most pragmatic and reliable solution. This also points to some flaws in the takeout archive format, which hopefully will be addressed before the sunset date. Part of my goal is to beat up on the takeout archive with some semi-realistic use-cases to discover problems while there is still a lot of time.
Bernhard Suter "Part of my goal is to beat up on the takeout archive with some semi-realistic use-cases to discover problems while there is still a lot of time."
For which I thank you, and specifically why I've been encouraging people to do takeouts now.
(And no, I've not yet launched one. Sigh.)
Edward Morbius - (links should be fixed now, thanks). This cache is narrowly focused on one purpose: generating new posts with correct image attachments, not creating a general-purpose archive. To keep things simple, I am using the one unique ID that is already in the archive, even if it's a bit unwieldy.
For this I am trying to stick to the simplest way of organizing data using flat files. For anything more complicated it might make sense to throw the other IT Swiss-army knife at the problem: SQL based relational databases ;-) SQLite is so easy to use in Python and other languages, that it's often easier to use it than not. I am thinking of sticking exclusively to the flat-file model for this particular workflow until the end (slow sync to target system), but maybe we can sketch out a SQL importer for people who want to do more sophisticated slicing (e.g. for community migration with identity mapping).
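As a rough idea of what such an importer could look like with Python's built-in `sqlite3` module: the two-table schema and the post dictionary keys (`id`, `created`, `images`) below are assumptions made for illustration and do not reflect the actual field names in the takeout JSON.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical two-table layout: posts, and the image URLs they reference."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS posts (
            post_id TEXT PRIMARY KEY,
            created TEXT
        );
        CREATE TABLE IF NOT EXISTS images (
            url     TEXT NOT NULL,
            post_id TEXT NOT NULL REFERENCES posts(post_id),
            PRIMARY KEY (url, post_id)
        );
    """)


def import_posts(conn: sqlite3.Connection, posts: list) -> None:
    """Insert already-parsed posts; re-running the import does not create duplicates."""
    with conn:  # commits on success, rolls back on error
        for post in posts:
            conn.execute(
                "INSERT OR REPLACE INTO posts (post_id, created) VALUES (?, ?)",
                (post["id"], post["created"]),
            )
            conn.executemany(
                "INSERT OR IGNORE INTO images (url, post_id) VALUES (?, ?)",
                [(url, post["id"]) for url in post.get("images", [])],
            )


def posts_for_image(conn: sqlite3.Connection, url: str) -> list:
    """Return the IDs of all posts that reference a given image URL."""
    rows = conn.execute(
        "SELECT post_id FROM images WHERE url = ? ORDER BY post_id", (url,)
    )
    return [row[0] for row in rows]
```

With the data in this shape, questions such as "which posts share this image?" become a single query rather than a scan over the flat files.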
Bernhard Suter I hear you on flatfiles, though that's where hashing may be useful. The hash is the file identity, and if you work with your archive by post-processing it, looking for image links, holding the URL in a variable, re-downloading the file, computing the hash, updating the link (sed 's///g'), and dumping a ": " record to some additional indexing file in case you need to determine correspondences later, you may still be able to skate by with a flatfile-based system.
(This is the sort of brute-force-hackish approach I'd be inclined to do myself, YMMV.)
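The post-processing loop described above might be sketched in Python roughly as follows, with `re.sub` standing in for the `sed` substitution. Several things here are assumptions: the regular expression only recognises URLs that end in a common image extension (which many image URLs do not), the index record layout `name: url` is a guess, and `relink` is a hypothetical name. The download step is passed in as a function so the rewriting logic can be tried without network access.

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple

# Assumption: image links can be recognised by a trailing image extension.
IMAGE_URL = re.compile(r"https?://[^\s\"'<>]+\.(?:jpe?g|png|gif)", re.IGNORECASE)


def relink(text: str, fetch: Callable[[str], bytes]) -> Tuple[str, List[str]]:
    """Replace image URLs in text with hash-based names; return new text and index records."""
    names: Dict[str, str] = {}  # URL -> hashed file name, so each URL is fetched once

    def replace(match):
        url = match.group(0)
        if url not in names:
            digest = hashlib.sha256(fetch(url)).hexdigest()
            ext = url.rsplit(".", 1)[1].lower()
            names[url] = f"{digest}.{ext}"
        return names[url]

    new_text = IMAGE_URL.sub(replace, text)
    # One record per distinct URL, for looking up correspondences later.
    index = [f"{name}: {url}" for url, name in names.items()]
    return new_text, index
```

The returned index lines can be appended to a plain text file next to the archive, which keeps the whole arrangement flat-file based.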
Edward Morbius Nope, I don't have time to read through and validate all of the content at this time. I won't share something I haven't confirmed to be 100% accurate and appropriate for me to share.
Looks like they've made a change to the filenames of the images and image metadata in the archive.
As previously suggested in feedback, they now seem to be using hashes as filenames. This does seem to fix the cropped file extension issue I was running into, though I haven't fully verified that yet.
I've not yet looked at the actual data in those files either, so I don't know yet if anything in there has changed.
It has seriously reduced the ability to identify individual files just by their filename, but hopefully it will make matching images with their posts easier.
Filip H.F. Slagter - interesting, there seems to be some progress in the right direction to make image files reliably unique and unambiguous. Is the hash the same as in the image resourceName or has some additional ID been added to the JSON data?
Bernhard Suter haven't looked too closely at it yet, but it looked different.
Unfortunately I'd forgotten to switch from HTML to JSON on that dump, so I didn't have all details either. I've requested a new dump, and will download and analyse that later.
(hope the different filenames aren't a result of choosing HTML rather than JSON)