Some notes regarding filenames in Takeout exports
Originally shared by Filip H.F. “FiXato” Slagter
Feedback provided regarding Google Takeout archive issues:
Because I am still running into some issues with cropped filenames, even when exporting to a zip64 archive, I've filed two tickets through Google Feedback (https://www.google.com/tools/feedback/reports?hl=en or rather the Send Feedback form on takeout.google.com).
I'm sharing them here to make them public record, so others are aware these issues can occur, and perhaps can also provide feedback to Google in case they're also affected.
Regarding Contacts with periods in their name:
When a Contact's (first?) name contains (ends with?) a period, the exported VCF tends to lack the .vcf file extension.
For instance, I had a contact named:
First name: 'wolf.'
Last name: 'nanaki'
The exported vcf however was named: "Takeout/Contacts/Chat4All/wolf.nanaki", no .vcf file extension at all.
This makes extracting only specific file types from an archive a bit trickier.
Regarding cropped file extensions:
In certain cases the archives will contain filenames with cropped file extensions. Especially image files from the Google+ Stream Photos type tend to be affected.
Here's a list of unique file extensions extracted from one of my last Takeout archives (and the command used to extract them):
`7z l -an -ai'!takeout-20181111T153533Z-GooglePlus-00*.zip' | ggrep -Ev '^Path = |Listing archive: ' | ggrep -E -o '\.([^. ]+)$' |sort -u`
.3gp
.CR2
.JPG
.PNG
.csv
.gif
.html
.ics
.j
.jp
.jpeg
.jpg
.json
.m4v
.mp4
.mpg
.nanaki
.net
.p
.pn
.png
.tiff
.vcf
As you can see, several png and jpg files got their file extensions cropped to `.pn` (or even `.p`), `.jp` (or even `.j`).
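For anyone without 7z or GNU grep at hand, the same listing can be approximated in Python. This is only a sketch: the function names are mine, and unlike the pipeline above it applies the pattern to the base name of each archive member rather than to a line of 7z output.

```python
import posixpath
import re
import zipfile
from typing import Iterable, List

# Same idea as the grep pattern above: a final dot followed by characters
# that are neither dots nor spaces.
EXTENSION_RE = re.compile(r"\.([^. ]+)$")


def unique_extensions(names: Iterable[str]) -> List[str]:
    """Return the sorted set of trailing extensions found in the given paths."""
    found = set()
    for name in names:
        # Only look at the last path component, so a directory such as
        # "Esper.net/" does not count as an extension.
        match = EXTENSION_RE.search(posixpath.basename(name))
        if match:
            found.add(match.group(0))
    return sorted(found)


def archive_extensions(zip_path: str) -> List[str]:
    """List the unique extensions of all members of one zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return unique_extensions(archive.namelist())
```

Calling `archive_extensions()` once per downloaded zip part and merging the results should give a list comparable to the one above.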
Example files:
`Takeout/Google+ Stream/Posts/FiXato with Pancake Helmet - By Jessica aka Maki.j`
`Takeout/Google+ Stream/Posts/GoogleSearch-AutocompleteHorror-IAccidentallyAte(5).p`
This again makes extracting just specific filetypes from the archive needlessly difficult, and makes it harder for operating systems to recognise file types without looking at the file header.
If filenames need to be cropped, I would suggest making sure the cropping doesn't affect the file extension, and is applied to the file name instead.
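To illustrate that suggestion, here is a minimal sketch of cropping that leaves the extension alone. The function name and the `max_length` parameter are hypothetical, and I don't know what limit Takeout actually applies; this is just what I would expect the behaviour to look like.

```python
import os.path


def crop_preserving_extension(filename: str, max_length: int) -> str:
    """Shorten a file name to max_length characters without touching its extension."""
    if len(filename) <= max_length:
        return filename
    stem, extension = os.path.splitext(filename)
    keep = max_length - len(extension)
    if keep < 1:
        raise ValueError("max_length is too small to keep the extension")
    # Crop characters from the end of the stem, then re-attach the extension.
    return stem[:keep] + extension
```

With an assumed limit of 50, a name like `FiXato with Pancake Helmet - By Jessica aka Maki.jpg` (my guess at the original name) would lose two characters from the stem rather than from `.jpg`.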
Also, I would very much appreciate it if archives included an index file that lists all the filenames that had to be cropped, along with a mapping between each cropped filename and its original filename, so I can programmatically restore the original filenames after expanding the archive.
(It also shows the `.nanaki` extension from my earlier Feedback regarding Contacts with a period at the end of their name. The `.net` extension apparently is also the result of the same flaw, namely a contact named "Esper.net")
UPDATE: Note, the Esper.net issue is actually not because of the name of a contact, but rather because of the name of a Group Label. The composite VCF file for all contacts in that group will lack the `.vcf` file extension.
#GoogleFeedback #Feedback #Bugreport #GoogleTakeout #Plexodus #GooglePlusExodus #PlexodusTools
Do you know the hard limit count for the size? I'm assuming it might be possible that some ended up losing their file extensions completely.
I also wonder how hard it would be to write a simple script (bash/Python) to replace them if they are missing... either using `file` to discover and then fix, or by guessing based on your learnings here.
John Lewis github.com - Plexodus-Tools has some more details on file naming structure, at least of the json files, and what I think is the max length of each segment. Of course, without actual documentation this is based on analysis of my own exported files.
Let's look at one of the other files that got cropped:
`Takeout/Google+ Stream/Photos/Miscellaneous Photos/Through the Mirror - Behind the scenes/Through the Mirror - Kaleidorose Voodoo - Layers.p`
Its file path is much longer, but its cropped filename is again 51 characters. So, there might indeed be an upper limit for image files of 51 characters. But why crop the file extension, rather than part of the filename? I wonder more whether it has something to do with a wrong length range they've specified to try to account for file extensions.
Let's look for instance at one of the files that didn't get cropped by its extension, but rather by its name:
Takeout/Google+ Stream/Posts/GooglePlus-20130516-Whitespace-WideSpace-single_co(15).png
That filename, including the auto-generated auto-increment counter suffix (15), is actually 58 characters. That's longer than the previously listed ones, but it didn't get its file extension cropped.
This again makes me think that file extensions only get cropped when the name itself sits right on the limit, where it would cause problems once the .metadata.csv extension is added for the metadata files.
Actually looking at the filename length of all the affected files:
https://gist.github.com/FiXato/26992ec8fa273c9194555f7e23cc7628#file-filename-lengths-for-filenames-md
It becomes clear that the filenames (including file extensions) of all the affected files are either 50 characters long, or 53 when they include the (i) suffix, where i is an auto-incrementing integer.
Looking at the files with full file extensions however, for instance `.png`, I find ones that are actually 54 characters long (without (i) suffix):
`54 : steam-wintersale-2012-DarksidersII-why_buy_just_th.png`
and those with (i) suffix can be up to 58 characters long:
`58 : GoogleSearch2013-10-27@11-51-19-Showing_GPlusYouTu(15).png`
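The length check above can be repeated on your own archive listing with something like the following sketch. The list of "known" image extensions and the helper names are my own assumptions; an extension counts as cropped here when it is a strict prefix of one of the known ones.

```python
import posixpath
import re

# Assumption: the image extensions worth checking for. Extend as needed.
KNOWN_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")

# Matches the "(i)" counter directly in front of the final extension.
COUNTER_RE = re.compile(r"\(\d+\)(?=\.[^.]+$)")


def has_cropped_extension(path: str) -> bool:
    """True if the extension looks like a cut-off version of a known one."""
    extension = posixpath.splitext(posixpath.basename(path))[1].lower()
    if len(extension) < 2 or extension in KNOWN_EXTENSIONS:
        return False
    return any(known.startswith(extension) for known in KNOWN_EXTENSIONS)


def length_without_counter(path: str) -> int:
    """Length of the file name (no directories), ignoring any "(i)" counter."""
    return len(COUNTER_RE.sub("", posixpath.basename(path)))
```

For the two example files from the original post, `has_cropped_extension()` returns `True` and `length_without_counter()` returns 50, while the `steam-wintersale` file above comes out as not cropped with a length of 54.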
Changing filenames to reflect their actual mimetype is an option, though since the csv files can contain html, file sometimes thinks those are text/html files.
Also, changing those filenames might lead to issues with the references inside the .metadata.csv and .json files, which would also need fixing.
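As an alternative to `file`, a sketch that only looks at the well-known leading bytes of PNG, JPEG and GIF files avoids the csv/html confusion, because it simply returns nothing for anything that isn't one of those image types. The names below are hypothetical, and it deliberately does not rename anything, given the reference problem just mentioned.

```python
from typing import Optional

# Leading "magic" bytes of the image formats affected in my archive.
SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
)


def guess_image_extension(header: bytes) -> Optional[str]:
    """Map the first bytes of a file to an image extension, or None if unknown."""
    for signature, extension in SIGNATURES:
        if header.startswith(signature):
            return extension
    return None


def guess_extension_for_file(path: str) -> Optional[str]:
    """Read just enough of the file to compare against the signatures."""
    with open(path, "rb") as handle:
        return guess_image_extension(handle.read(16))
```

The result could be compared with the cropped extension (`.p` versus `.png`, for instance) to decide whether a rename is safe.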
As I indicated in my Feedback ticket though, the best solution would be for Google to include a 'cropped_filenames.json' or even a simple .txt, which maps CroppedFilename: OriginalFilename.
(Had to extract the list of files to a Gist, or else the post wouldn't get submitted)
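If Google did ship such a mapping file, restoring the names could be as simple as the sketch below. Everything about the file is an assumption on my part: that it is called `cropped_filenames.json`, that it is a flat JSON object of cropped path to original path, and that both are relative to the directory the archive was extracted into.

```python
import json
import os
from typing import Dict, List, Tuple


def plan_renames(mapping: Dict[str, str], root: str) -> List[Tuple[str, str]]:
    """Return (current, restored) path pairs that can be renamed safely."""
    plan = []
    for cropped, original in mapping.items():
        source = os.path.join(root, cropped)
        target = os.path.join(root, original)
        # Skip entries that are missing or that would overwrite another file.
        if os.path.exists(source) and not os.path.exists(target):
            plan.append((source, target))
    return plan


def restore_filenames(index_path: str, root: str) -> List[Tuple[str, str]]:
    """Rename cropped files back to their original names using the index file."""
    with open(index_path, encoding="utf-8") as handle:
        mapping = json.load(handle)
    plan = plan_renames(mapping, root)
    for source, target in plan:
        os.rename(source, target)
    return plan
```

Note that this only renames the files themselves; any references to the cropped names inside the .metadata.csv and .json files would still need updating.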
Filip H.F. Slagter Thanks, this was well beyond the call of duty, and I and others appreciate your efforts.
For completeness' sake, I've submitted 2 additional follow-up Feedback reports:
Regarding Cropped File Extensions:
Follow-up to my previous feedback regarding cropped file extensions, in particular for photos in the Google+ Stream/Photos Takeout data:
Upon further analysis all the affected filenames have a specific length in common:
The filename (excluding path/parent directory names, but including the file extension) is always 50 characters (e.g.: "FiXato with Pancake Helmet - By Jessica aka Maki.j"), or 53 characters if the filename includes the "(i)" auto-incrementing integer suffix.
However, in the archive I do find longer filenames that don't have their file extensions cropped, such as: Takeout/Google+ Stream/Photos/Photos from posts/2013-04-23/Ministerie van Veiligheid en Justitie - Informatie(1).png, of which the filename, excluding the path leading up to the file but including the increment suffix and file extension, is 57 characters long.
Or, an example which doesn't include the increment suffix:
Takeout/Google+ Stream/Photos/Photos from posts/2012-12-28/steam-wintersale-2012-DarksidersII-why_buy_just_th.png
which is 54 characters long.
This leads me to believe there is an edge case concerning filenames that are 50 (or 50 + length of unique increment suffix) characters long, which incorrectly causes the file extension to be cropped, rather than the filename.
I hope the above is useful extra information to find the cause of this issue so it can be fixed.
Again though, I would really also appreciate an index metadata file that provides mappings between any altered (cropped, suffixed or otherwise transformed) filename and their original filename.
If you need further details, feel free to contact me.
Regarding Contacts
Follow-up to another previously submitted Feedback ticket, regarding the issue with missing file extensions for Contacts that have a period in their name.
This not only affects contact names with periods in them, but also group labels.
I have a Contacts group labeled 'Esper.net'. The composite VCF file for all contacts in this group is stored as "Takeout/Contacts/Esper.net/Esper.net" rather than "Takeout/Contacts/Esper.net/Esper.net.vcf".
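Until this is fixed, the affected contact files can be found by content rather than by name, since a vCard file starts with a `BEGIN:VCARD` line. A rough sketch, with function names of my own choosing:

```python
import os
from typing import Iterator

UTF8_BOM = b"\xef\xbb\xbf"


def is_vcard(path: str) -> bool:
    """True if the file content starts with a BEGIN:VCARD line."""
    with open(path, "rb") as handle:
        head = handle.read(64)
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    return head.lstrip().upper().startswith(b"BEGIN:VCARD")


def find_unlabelled_vcards(contacts_dir: str) -> Iterator[str]:
    """Yield vCard files below contacts_dir that lack the .vcf extension."""
    for directory, _subdirs, filenames in os.walk(contacts_dir):
        for filename in filenames:
            path = os.path.join(directory, filename)
            if not filename.lower().endswith(".vcf") and is_vcard(path):
                yield path
```

Pointed at the extracted `Takeout/Contacts` directory, this should list files such as `Esper.net/Esper.net` and `Chat4All/wolf.nanaki`.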
I hope the above is useful in case anyone else has similar issues and wishes to report them.
Updated the original post at plus.google.com - Feedback provided regarding Google Takeout archive issues: Because I am still... with more details and an additional ticket submitted today about issues with displayNames not actually including the nickname, even though set as display name through Google+ Profile / AboutMe.google.com.
Updated the original post at plus.google.com - Feedback provided regarding Google Takeout archive issues: Because I am still... with the latest feature request feedback I sent today:
====================================================
ADDENDUM 3: New ticket submitted on November 16th:
Add Incremental and Metadata-only Exports
A very welcome addition would be the ability to set a date from which you want the latest changes, defaulting to the date you last did a backup.
Currently when you want a backup of your latest Google+ Stream posts for instance, you'd have to take a full backup again, which in my case means downloading close to 50GB again. It would be nice if instead I could get only the data that was changed since a certain date, for instance to get just the data that was created or modified in the past year.
Bonus points if you can set date ranges, so you can export per year for instance.
Additionally, a way to only get the metadata (.json, .csv, .vcf, .ics, etc) files, rather than also the larger media (.png, .jpg, .mp4, etc), would also help reduce the payload needed to be downloaded just to keep an up-to-date backup.
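This doesn't help with the download size, but once an archive has been downloaded, unpacking just the metadata files is possible locally. A minimal sketch, assuming the extension list from the request above and hypothetical function names:

```python
import posixpath
import zipfile
from typing import Iterable, List

# Assumption: the extensions I consider "metadata" in a Takeout archive.
METADATA_EXTENSIONS = {".json", ".csv", ".vcf", ".ics"}


def metadata_members(names: Iterable[str]) -> List[str]:
    """Keep only the archive members whose extension marks them as metadata."""
    return [
        name
        for name in names
        if posixpath.splitext(name)[1].lower() in METADATA_EXTENSIONS
    ]


def extract_metadata(zip_path: str, destination: str) -> List[str]:
    """Extract only the metadata files from one zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        members = metadata_members(archive.namelist())
        archive.extractall(destination, members=members)
    return members
```

Because it filters on file extensions, contact files that lost their `.vcf` extension will be skipped, which is exactly why the missing extensions are a nuisance.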