This update adds option forcedecompress when using options -f and -s.
More info: Analyzing “Zombie Zip” Files (CVE-2026-0866).
zipdump_v0_0_35.zip (http)
MD5: F4A48AE14C1B258D688BF61D9ACF5E54
SHA256: 8DF7B3EBA282A0391AD619AD33A5F77CD25CC0FDA760E116934DD953714A27C5
Good afternoon. Great work on a great tool, I must say :) But one thing surprises me.
As we know, a ZIP archive does not strictly require declaring the encoding of the file and folder names it stores. Because of this, many archivers work on the principle of “unpack according to the OS locale”, and most of the time that works. But when we analyze a file, we are trying to find out WHAT is broken in the archive so that it can be fixed, and for that we need to display information about the objects in the console ;). That is where your script falls short: it does not seem to be able to change the encoding of file/folder names when writing to the console. All output is in raw bytes.
Such output clearly has its uses, but first of all I would like to have an option like --name-encoding CP1200 among the options your script supports. Would that be possible?
Comment by Anonymous — Saturday 6 June 2026 @ 16:53
Are you doing this on Windows and are you seeing squares with question marks inside? Then it means the shell’s font can’t render the glyphs. Try changing the font in cmd.exe, or use a shell that has more fonts, like Cmder.
Comment by Didier Stevens — Sunday 7 June 2026 @ 11:59
Thanks for the feedback! I appreciate it ;)
1) Yes, I’m on a Windows workstation. 2) And NO: I redirect the output to a file, like “bla-bla > output.log”, and in it I see Python raw byte literals such as b’C:\\Users\\xc4\xec\xe8\xf2\xf0\xe8\xe9\\AppData’. I see the same in a regular cmd.exe console. So the font is not the problem here.
The problem is that the ZIP record for the name of a compressed file or folder DOES NOT have a corresponding bit that indicates the encoding. As a result, your code directly, Python’s code indirectly through its standard objects, or any other ZIP-processing command all take the simple route, IMHO: they use the predefined CP437 table, try the OS locale, or do nothing and output raw bytes. The programmer who consumes these results then has to write additional logic to process such bytes. Right now there is NO such logic, which is why I suggest this new option: --name-encoding
This option should apply always and everywhere in the code where the name of a file or folder stored in the processed archive is output to the user (I don’t think it’s worth distinguishing whether the output goes to the console or to a file after a redirect). As I understand it, you currently have the “-t TRANSLATE” option. But I have to admit that using it to view file and folder names correctly did not help, and it was not clear to me where it took effect (it must have done something, since there were no errors). Can you clarify? Well, since the existing option did not help, I am proposing a new one :)
Comment by Anonymous — Sunday 7 June 2026 @ 14:35
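As an illustration of the request above, here is a minimal sketch of what a name-decoding step could look like. It is not zipdump's code: the function name `decode_name` is hypothetical, and the sample bytes are the Cyrillic bytes quoted in the comment, assumed here to be CP1251.

```python
def decode_name(raw: bytes, encoding: str = "cp437") -> str:
    """Decode a raw ZIP filename with a caller-supplied code page.

    Bytes that the code page cannot decode are kept visible as \\xNN
    escapes instead of raising an exception.
    """
    return raw.decode(encoding, errors="backslashreplace")


# Raw name bytes as they might be read from the archive (assumed CP1251).
raw = b"C:\\Users\\\xc4\xec\xe8\xf2\xf0\xe8\xe9\\AppData"

# Human-readable form of the same name.
readable = decode_name(raw, "cp1251")
```

Using `errors="backslashreplace"` keeps the output lossless for analysis: a wrong guess at the code page still shows which bytes failed to decode.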
The --translate option works for the content of the file, not for the filename.
But I’m curious to know when you see raw byte literals?
They should not appear when you use zipdump, because the zipfile module returns a string for fileinfo.filename.
Are you maybe using option -f ? Because with option -f, the zipfile module is not used, and with that option, filenames are displayed as bytes.
Comment by Didier Stevens — Sunday 7 June 2026 @ 19:44
That’s right, this option was used. As I wrote above, it is important for us to understand WHERE (on which elements) unpacking fails, and first of all we should at least be able to see the listing of files and folders present in the archive. And yes, I understand that the current byte-by-byte output is useful too. BUT in ADDITION to this “complex” output, a simple one is also needed, for a human to read the listing, and for that the bytes need to be converted. The current version of the utility has no option for this.
So, going back to the initial request: can we get this feature in the utility?
And let me make it clear again: if the option is implemented, it must be applied everywhere file and folder names are processed and output. You answered above: “the zipfile module returns a string for fileinfo.filename”. So in all places with this code, you have to be sure that the module can convert the names correctly before it returns a string with the name. And at the moment, IMHO, the module has no external information about which name encoding it should work with. It seems that the --metadata-encoding <encoding> parameter is responsible for this when the zipfile module is called with the -l, -e and -t options, which seems to be what we need.
Here is what I’ve googled so far: gist[.]githubusercontent[.]com/ElusiveSpirit/d441aae1f52f2d63530bdb255da3f64e/raw/4c35ebaec6f18b562169aa6065bbd681f9a2ec22/windows_zipfile[.]py – just remove [ ]
By the way, another question arose while analyzing the listing of files and folders. Does the ZIP standard support absolute (full) paths to a file or folder stored in the archive? Our listing contains full Windows paths:
C:\Users\<username>\AppData\Roaming\<folder>\<folder>\filename.ext
Of course, such a path cannot be used directly even for unpacking, if only because of the forbidden character “:” in the full path. But I am puzzled to find such a path in the archive data. Is such a thing possible? And how should unpacking work? If the file location is hard-coded, should filename.ext be unpacked to my C: drive, into my system folder?
Comment by Anonymous — Monday 8 June 2026 @ 7:18
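For names that do come through Python's zipfile module, a common workaround can be sketched as follows. When general purpose bit 11 is not set, zipfile decodes the stored name as CP437, so the original bytes can be recovered and decoded again with another code page. The helper names `redecode` and `readable_names` are hypothetical. Python 3.11 and later also accept a `metadata_encoding` argument when opening an archive for reading, which should make this round trip unnecessary there.

```python
import io
import zipfile

UTF8_FLAG = 0x0800  # general purpose bit 11: name is stored as UTF-8


def redecode(name: str, encoding: str) -> str:
    """Recover the bytes zipfile decoded as CP437 and decode them again."""
    return name.encode("cp437").decode(encoding, errors="replace")


def readable_names(zf: zipfile.ZipFile, encoding: str) -> list:
    """Return member names, re-decoding only those not flagged as UTF-8."""
    names = []
    for info in zf.infolist():
        if info.flag_bits & UTF8_FLAG:
            names.append(info.filename)  # already decoded as UTF-8
        else:
            names.append(redecode(info.filename, encoding))
    return names


# Demo on an in-memory archive written by zipfile itself.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"data")
with zipfile.ZipFile(buf) as zf:
    names = readable_names(zf, "cp1251")
```

The round trip is safe because Python's CP437 codec maps all 256 byte values, so no information is lost when zipfile applies it first.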
Ah OK, you are using option -f.
I’ll implement a feature so that you can specify the string encoding.
Regarding full paths: it all depends on the application.
For example, 7-zip command-line has 2 commands to extract files:
e : Extract files from archive (without using directory names)
x : eXtract files with full paths
Comment by Didier Stevens — Monday 8 June 2026 @ 10:59
Sorry for my insistence, but I am NOT ONLY asking about the -f option. I repeat once again: EVERYWHERE the code outputs file and folder names for the user to read, if the --name-encoding option is enabled, the names read from the archive body must be converted. It doesn’t matter which of the utility’s options produces that output. Whenever names are handled, if the option is enabled, use it.
That programs can handle such full paths in some special way is only half the point. The main question is different: CAN they be stored in the archive itself? Are you saying that, having studied the standards for this type of archive, you confirm that archives are allowed to store FULL paths that have only local meaning (and can even lead to fatal consequences when unpacked on another OS)?
And one more clarifying question. You have the -E option to output additional information, and there is the bit-flag output #flags:…#. Do we understand correctly that this is where the information about UTF-8 encoding should be stored when the default defined by the standard, CP437, is not used? Could the -E option output an encoding attribute, so that the file listing immediately shows whether UTF-8 was specified when the archive under analysis was packed? And so that this bit is described in human-readable form in the output:
C:Demo>zipdump.py -f l -E encoding,version,crc double-suffix
Comment by Anonymous — Monday 8 June 2026 @ 12:43
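On the flags question: in the ZIP format, general purpose bit 11 (0x0800) marks a name as UTF-8, and when it is clear the name is nominally CP437. The following is a minimal sketch, independent of zipdump, that reads a local file header with `struct` and reports the raw name bytes together with the declared encoding. The function name `read_local_header_name` is an assumption for illustration.

```python
import struct

# Local file header: signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
UTF8_FLAG = 0x0800  # general purpose bit 11


def read_local_header_name(data: bytes, offset: int = 0):
    """Return (raw_name_bytes, declared_encoding) for one local header."""
    (sig, _version, flags, _method, _time, _date,
     _crc, _csize, _usize, name_len, _extra_len) = LOCAL_HEADER.unpack_from(data, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("not a local file header")
    start = offset + LOCAL_HEADER.size
    raw_name = data[start:start + name_len]
    declared = "utf-8" if flags & UTF8_FLAG else "cp437"
    return raw_name, declared
```

Note that a clear bit only says what the writer declared. An archiver that ignores the specification can still store names in a local code page such as CP1251.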
I don’t understand: I posted a new comment yesterday and still don’t see it. What is this weird blog engine? HOW am I supposed to know whether my text was sent at all? I am forced to repeat yesterday’s text, sorry. I apologize if this causes any inconvenience on your side, but as I see it we are still discussing important changes.
Again, sorry for my insistence, but I am voting NOT ONLY for the “-f” option. I repeat once again: EVERYWHERE the code outputs file and folder names for the user to read, if the --name-encoding option is enabled, the names read from the archive body must be converted. It doesn’t matter which of the utility’s options produces that output. Whenever names are handled, if the option is enabled ==> use it.
That programs can handle such full paths in some special way is only half the point. The main question is different: CAN they be stored in the archive itself? Are you saying that, having studied the standards for this type of archive, you confirm that ZIP archives are simply allowed to store FULL paths that have only local meaning (and can even lead to fatal consequences when unpacked on another OS)?
And one more clarifying question. You have the -E option to output additional information, and there is the bit-flag output flags: Do we understand correctly that this is where the information about UTF-8 encoding should be stored when the default defined by the standard, CP437, is not used? If yes, CAN the -E option output an encoding attribute, so that the file listing immediately shows whether UTF-8 was specified when the archive under analysis was packed? And so that this bit is described in human-readable form in the output:
C:Demo>zipdump.py -f l -E encoding,version,crc double-suffix
Comment by Anonymous — Tuesday 9 June 2026 @ 9:17
I will reply later, I need to schedule time to look into this.
Comment by Didier Stevens — Tuesday 9 June 2026 @ 18:56
OK, I see, no problem. And please remove comment #8: the blog engine was very unfriendly and I was forced to post twice.
Comment by Anonymous — Tuesday 9 June 2026 @ 19:15
In one of your comments, you talk about a ZIP file that uses cp1200. That’s UTF-16LE.
Can you tell me what archive tool you used to create such a ZIP file, so that I can test it?
Because the tools I know do CP437 and UTF-8, not cp1200/UTF-16.
Comment by Didier Stevens — Monday 15 June 2026 @ 19:45
That is a logical and good question. The problem is that this situation (see below) is exactly why we had to look for a utility that could show us at least the names of the objects in the archive, even without unpacking it, because the archive came to us as is. And no, unfortunately we can’t send it to you. And no, we don’t know which utility created it. Since it looks like this will happen again, we were looking for a utility that can at least give extended information about the objects inside the archive. That is why I submitted all three of my suggestions for your consideration and implementation. With them, your script would definitely be able to help us.
1. Add a new option “--name-encoding” to convert archive object names
2. Use it in all places in the code where data about those objects is output to the user
3. Introduce a new extended-information field, “-E encoding”, so that we can immediately see whether the objects were stored in the archive as UTF-8 or CP437.
Thus, using the “-E ..” flag, we can quickly decide how to proceed. If it’s CP437, we first print the list of names in raw form, get an array of bytes, try transcoding tables on them until the names are readable, and so learn which code page we need. We then pass it via the “--name-encoding ..” option and get complete, readable output. We refine it and then use it in our unpacker.
P.S. As for CP1200: I gave it only as an example of WHAT the command line would look like with this parameter. Yes, we will have other code pages to choose from. Perhaps 1200 will come up too, who knows. Right now there are definitely 1250, 1251 and 1255.
P.P.S. And you haven’t deleted comment #8. It’s inconvenient to scroll up through the conversation history and stumble over the forced duplicate post.
Comment by Anonymous — Tuesday 16 June 2026 @ 9:11
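The trial-and-error step in the workflow described above (raw bytes in, candidate code pages tried until the names are readable) could be sketched like this. The candidate list reuses the code pages named in the comment. The function name `try_encodings` is hypothetical, and choosing the "readable" result is still left to a human.

```python
CANDIDATES = ("cp437", "cp1250", "cp1251", "cp1255", "utf-8")


def try_encodings(raw: bytes, candidates=CANDIDATES) -> dict:
    """Decode raw name bytes with each candidate code page.

    Returns a mapping of encoding name to decoded text, or None when
    the bytes are not valid in that encoding.
    """
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Sample bytes from the discussion (assumed to be a CP1251 user name).
sample = b"\xc4\xec\xe8\xf2\xf0\xe8\xe9"
decoded = try_encodings(sample)
```

Single-byte code pages rarely fail outright, so most candidates return some text. A `None` result is mainly useful for ruling out UTF-8.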
Ah! We forgot one more thing:
Can you really say that the standard allows, in principle, storing full paths (including DiskLetter:\ ?) that are RIGIDLY tied to the OS where the archive was made in the body of the archive itself? So that later, when unpacking, we could get unexpected behavior?
It’s just that our research on this topic shows that the standard seems to PROHIBIT absolute paths:
And if our observations turn out to be correct, your script could also help by analyzing and reporting this: if the paths to objects in the archive are NOT specified ACCORDING to the STANDARD (on any of its points), output information about it.
Comment by Anonymous — Tuesday 16 June 2026 @ 10:18
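For reference, the file name field description in PKWARE's APPNOTE (section 4.4.17) says that the stored path must not contain a drive or device letter or a leading slash, and that slashes must be forward slashes. A conformance check along those lines might look like the sketch below. The function `path_issues` is hypothetical, and the parent-directory check is an extra safety test added here, not a rule taken from that section.

```python
import re


def path_issues(name: str) -> list:
    """List the ways a stored ZIP member name deviates from the rules above."""
    issues = []
    if re.match(r"^[A-Za-z]:", name):
        issues.append("drive letter")
    if name.startswith(("/", "\\")):
        issues.append("leading slash")
    if "\\" in name:
        issues.append("backslash separator")
    # Extra check (assumption, not from APPNOTE 4.4.17): ".." components
    # can make an extractor write outside the target directory.
    if ".." in re.split(r"[\\/]", name):
        issues.append("parent directory component")
    return issues


# Path shaped like the one from the listing discussed in the comments.
report = path_issues("C:\\Users\\user\\AppData\\Roaming\\folder\\filename.ext")
```

An empty list means the name passed all four checks. It does not mean the name is safe to extract on every platform.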
I understand that you can’t share the file.
But you have to understand that I don’t like to release new code that I can’t test.
Would it be possible to share a hexdump of the first 48 bytes of the sample?
Mind you that this would reveal the date, time and (partial) filename.
But it would help me.
Comment by Didier Stevens — Tuesday 16 June 2026 @ 17:26
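A hexdump like the one requested can be produced with a few lines of Python. This is a generic sketch, not a zipdump feature, and the names `hexdump` and `first_bytes` are hypothetical. Since a local file header is 30 bytes long, the first 48 bytes of a regular ZIP file cover the header, including the date and time fields, plus up to 18 bytes of the first filename.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex column, and printable-ASCII column."""
    pad = width * 3 - 1  # width of a full hex column
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk).ljust(pad)
        ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part}  {ascii_part}")
    return "\n".join(lines)


def first_bytes(path: str, count: int = 48) -> bytes:
    """Read the first `count` bytes of a file."""
    with open(path, "rb") as handle:
        return handle.read(count)
```

Calling `print(hexdump(first_bytes("sample.zip")))` would print three lines of 16 bytes each, where `sample.zip` stands in for the file under analysis.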
48 bytes – easy.
But this will only show you that we are dealing with an ordinary ZIP file; the “inconvenient” nuances simply live inside the body of an ordinary ZIP archive. It seems that someone once couldn’t use the usual archivers and wrote his own from the descriptions of the ZIP standard available at the time, and deviated slightly from the standard, as we understand it. Once again, I understand that you may simply decline to change anything in the code right now. I understand this and accept the risk. But can we at least bring the general discussion to a conclusion, BEFORE any reference to code and real examples?
1) IF you did decide to go ahead, would the work strategy I described above suit you for coding?
2) Can you comment on the question of full file names starting from the root of the disk? At least theoretically. It’s a very important detail of all the files we have…
P.S. This blog has a very bad engine. It always cuts the “name-encoding” option down to one dash in front. According to the logic of your command-line option parser, there should be two dashes, and that is how I wrote them when preparing the text. But the blog engine cuts one dash out, and that is infuriating, to be honest….
Comment by Anonymous — Tuesday 16 June 2026 @ 20:55
Conversations are certainly not the final software product, but under the circumstances I would prefer to first identify all the possible and impossible pitfalls of such changes, and perhaps agree on previously unknown circumstances that could shift the focus of the new development in a different direction. And to do all this BEFORE you start changing anything in the code.
Comment by Anonymous — Thursday 18 June 2026 @ 8:33