Sorting and Repairing
This article comes with no warranty. Everything shown here is done at your own risk. It may be possible to sort, repair and recover your data; treat this article as a help, not as a guarantee.
I wrote a little article (here) about data recovery on my blog. As you may have read, it was possible to recover a lot of stuff. The problem you run into after recovering is: what to do now with the one million unsorted files? What to do with recovered but corrupted files? This article tries to help a bit with that.
First of all we have to understand that some files are easier to sort than others: every file that contains metadata is easier to sort than a file without it. Take plain txt files. How would you sort them? All we have is a date (maybe, this depends) and the text itself, so we could sort them by size or by date and that’s all. What about mp3s or images? mp3s contain tags which we can read out, and most images contain EXIF and other tags which we can read out, too, so such files are much easier to sort.
Anyway, my restore/recovery is finished. I now have more than 100 GB of data without useful filenames and without a useful directory hierarchy. Sorting those files manually by checking every file or looking at every picture would take me years. What I need are some tools. I will write them from scratch in bash, using external tools like exiftool, ImageMagick, Ghostscript, etc. You could do the same in C or another language :)
Images
Let’s look at pictures first. Not all pictures contain useful tags, but it’s possible to get the width and height of nearly every picture. Together with the extension of a graphic and the year it was created (if the tag is there) we have 4 criteria to sort by: year, width, height, extension.
My bash script first finds all graphics which are supported by exiftool and ImageMagick (I have probably forgotten some). The most important are probably jpg, png and gif; other file formats are bmp, tif/tiff and eps. After finding these files the script moves them to /tmp/ImageSorting and runs ImageMagick on them to convert them (this removes invisible garbage at the end of some files if there is any, at least that is what I read), which in turn runs Ghostscript on some of them. Finally it reads out some data for the sorting process and copies each file to Images/Year/small|large|medium|unknown/Extension, for example Images/2008/large/PNG/.
How do I know whether it’s a large or a small picture? The script puts everything more than 800 wide and 800 high into “large”, everything more than 400 wide and 400 high into “medium”, and everything below 400 wide and 400 high into “small”. Within small we will most likely find some thumbnails.
Files which couldn’t be processed correctly are marked as .corrupted. We also make sure that we don’t copy over existing files: I run the tool on several directories because I have several backups in several different directories, and it would be bad if some images got overwritten due to the same filename. Such files are renamed to filename.X.extension instead, where X is a number from 1 to 999.
If this script was helpful for you, leave me a comment. You can get the script here: Useful Stuff. The script accepts arguments; just try imageSorting.sh -h.
Let’s take a quick look at the result:
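To make the size rule concrete, here is a minimal sketch of how the target directory could be computed. It is not the author's imageSorting.sh: the function names size_class and image_target are made up for illustration, and two details are assumptions, namely that both dimensions must exceed a threshold (so a 900 x 300 banner lands in small) and that images without a readable date tag go to a year folder called unknown.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not taken from imageSorting.sh.

# size_class WIDTH HEIGHT -> prints large|medium|small|unknown
size_class() {
    local w=$1 h=$2
    # Missing or non-numeric dimensions (e.g. identify failed) -> unknown
    if ! [[ $w =~ ^[0-9]+$ && $h =~ ^[0-9]+$ ]]; then
        echo unknown
    elif (( w > 800 && h > 800 )); then
        echo large
    elif (( w > 400 && h > 400 )); then
        echo medium
    else
        echo small
    fi
}

# image_target FILE -> prints Images/YEAR/CLASS/FORMAT for one file
image_target() {
    local file=$1 fmt w h year
    # identify is ImageMagick's inspection tool; [0] selects the first frame
    read -r fmt w h < <(identify -format '%m %w %h\n' "${file}[0]" 2>/dev/null)
    # exiftool -s3 prints only the tag value; its first 4 characters are the year
    year=$(exiftool -s3 -DateTimeOriginal "$file" 2>/dev/null | cut -c1-4)
    echo "Images/${year:-unknown}/$(size_class "$w" "$h")/${fmt:-unknown}"
}
```

Calling image_target on a recovered file prints the directory it would be copied to; the actual copy and the .corrupted marking are left out to keep the sketch short.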
276K Images/2005/small/JPEG
280K Images/2005/small
60K Images/2005/medium/JPEG
64K Images/2005/medium
60M Images/2008/unknown/GIF
8.0K Images/2008/unknown/PC
5.8M Images/2008/unknown/JPEG
5.4M Images/2008/unknown/PostScript
236M Images/2008/unknown/bzip2
12M Images/2008/unknown/PNG
That should be usable. Now we can continue sorting the images by using gimp’s preview or some other tool like a file browser to see which graphics are needed and which can be deleted. The next thing, when you have more time, is giving them real names, though as we have them in some nice directories I don’t think this is needed. Anyway, keep an eye on “Sha1sum/File Copies”: if you used several backups like I did, this will save you some manual deleting time and some disk space.
Sha1sum/File Copies
As I recovered data from several discs which will most likely contain some of the same files, and as I used different tools to get as much data back as possible (foremost, photorec, jpegrecovery), I will probably have some files two or more times on my system. As this will probably eat a lot of space, I wrote another little bash script which collects the sha1sum of every file. The script moves all copies to the folder “copies/format”, where format is the file type, for example png. It does this for every file, not just for graphics. After it has finished you can check the contents of that folder to make sure that you don’t delete important stuff, and once you are sure these really are copies, you can delete the folder “copies”. You can obtain the script here: Useful Stuff. You need to set the directory within the script, so open it up in a text editor.
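The idea behind the copy detection can be sketched in a few lines. This is not getCopies.sh itself: move_copies is a hypothetical name, the sketch keeps the sums in memory instead of /tmp/getCopies.txt, and for brevity it moves copies into one flat folder rather than into copies/format. It needs bash 4 or newer for the associative array.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, not the author's getCopies.sh.

# move_copies SRC_DIR COPIES_DIR
# Keeps the first file seen for each SHA-1 sum and moves later ones to COPIES_DIR.
move_copies() {
    local src=$1 dest=$2 file sum moved=0
    declare -A seen            # sum -> first path with that content
    mkdir -p "$dest"
    while IFS= read -r -d '' file; do
        sum=$(sha1sum "$file" | cut -d' ' -f1)
        if [[ -n ${seen[$sum]:-} ]]; then
            echo "+ copy of ${seen[$sum]}: $file"
            # mv -n never overwrites; a copy whose name is already taken
            # in COPIES_DIR simply stays where it is
            mv -n "$file" "$dest/" && moved=$((moved + 1))
        else
            seen[$sum]=$file
        fi
    done < <(find "$src" -type f -not -path "$dest/*" -print0 | sort -z)
    echo "+ moved $moved copies"
}
```

Sorting the file list first makes the choice of which file counts as the original predictable. Renaming clashing copies to name.1.ext, as the real script's output suggests it does, is left out here.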
The Script in Action
How many files am I running this script on?
wdp@yulivee ~/Images $ find . -type f | wc -l
131744
Adding files and sums:
+ added /home/wdp/Images/2008/small/JPEG/25867234.jpg:cdaaaa0441ce330769ec5d100ea116eb23ab858a
+ added /home/wdp/Images/2008/small/JPEG/292753146.jpg:618a8cd417cd4d376e432e02851f29aaba25f86b
+ added /home/wdp/Images/2008/small/JPEG/116818362.jpg:1ac82df7453f925675607c141898b57868ead74f
+ added /home/wdp/Images/2008/small/JPEG/292669178.jpg:b571301d07492125bddf2cb47ac3a6a843a2587d
Finding, verifying and moving copies:
+ detected 2 copies of image12840.jpg
+ moving image12840.jpg to /home/wdp/copies/image12840.jpg
+ detected 3 copies of 25856362.jpg
+ moving 25856362.jpg to /home/wdp/copies/25856362.jpg
+ detected 2 copies of 292750762.jpg
+ moving 292750762.jpg to /home/wdp/copies/292750762.jpg
+ detected 11 copies of image08035.jpg
+ moving image08035.jpg to /home/wdp/copies/image08035.jpg
Filename exists already
+ Copy detected
+ Renamed to: image05922.1.jpg
Finished:
+ moved 37186 copies
That means we got 37 186 copies. Did you really want to check all 131 744 graphics manually? :-)
3.5G Images/
558M copies
Just to make sure, I ran the script a second time on the same directory. For that just do:
rm -rf /tmp/getCopies.txt
./getCopies.sh
And wait :)
+ processed 95864 files
mkdir: cannot create directory `/home/wdp/copies': File exists
The mkdir error is not important here – we don’t want to delete the copies now, and the script will continue even with this error.
+ moved 0 copies
And now?
Now you can go make coffee (or tea), open up your favorite graphics program or file browser and check manually for corrupted graphics and graphics with glitches. Also remove unwanted graphics (I mean, it seems I found graphics from every website I visited in the past. I don’t need them, and you probably don’t either). After this, you’re finished with the images.
Video Files
I was not able to repair any of the recovered videos (well, I wasn’t really trying, because it’s just not worth the work for me). Most of them are split into more than one file, nearly all came without sound, and some contain a great many artifacts. To repair those I would:
1. Try to re-encode the broken videos (to remove some artifacts by skipping frames; no filtering at this stage, we just want to remove broken stuff from the files).
2. Try to add the sound files if available (this is much more difficult than you may imagine because all audio files are split as well, so we need to merge them first so that they fit the split video. Happy searching! And believe me, you will most likely end up with messed-up a/v sync).
3. Try to merge the split video files, which now contain sound, into one big video file.
4. Once all these steps are done (finding audio files, merging them into files as long as the re-encoded split videos need, adding them to the re-encoded split videos, merging all re-encoded split videos with added sound into one big video file) we can finally re-encode the video once more and add some useful filters to improve quality and remove artifacts and whatever else might be in there.
As you can see, it would be a very, very time-intensive task. Well, at least for me; maybe someone else has a better idea :-) Tools which you could use? “lame”, “mencoder”, “mplayer”, “kino” and probably a few others.
Sorting videos? Well, again something where I don’t have any idea yet. There are probably tools to read out the header and other information.
Audio Files
Sorting audio files is “normally” as easy as sorting images, because ogg and mp3 files both have tags. As these tags are in my humble opinion much better than the tags in graphics, we can sort audio files even better. Anyway: using PhotoRec I got some split mp3s. I hope I can put them together, though I don’t think this will be as easy as I’d like, and to be honest, in my case it’s just not worth the work. That’s why I will only explain how to sort audio files. But hell, what does “explain” mean? Of course I wrote a bash script for this: Useful Stuff. This script sorts your audio into the following directory hierarchy: Artist/Year_ALBUM/(TRACK)TITLE.mp3|ogg. If there are no tags we sort to unknown../unknown../unknown.mp3|ogg. This should help you to get your audio files sorted, too. I do this only for mp3 and ogg because wav doesn’t provide such information and I don’t know how to read such stuff out of wma files. Plus, I don’t have any wma files.
Maybe I’ll write soon about how to merge split audio files, but I don’t think so :) How many audio files did I get back? Well, a lot. And 40% of them are corrupted (split), and even if I merged them somehow, it seems some “parts” would still be missing. Anyway. After processing them with my audio sorting script, I ran the getCopies script on the created directory (don’t forget to edit the paths in getCopies.sh). It found a few copies which I can delete now.
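As an illustration of the Artist/Year_ALBUM/(TRACK)TITLE layout, here is a small sketch that builds the target path from tag values. The names audio_dest, read_tag and sort_audio_file are hypothetical and this is not the author's audio sorting script. Reading the tags with exiftool is an assumption, and so are the tag names: they fit ID3 tags in mp3 files, while Ogg Vorbis files may expose the track and year under other names.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not the author's audio sorting script.

# audio_dest ARTIST YEAR ALBUM TRACK TITLE EXT -> prints the relative target path
audio_dest() {
    local artist=${1:-unknown} year=${2:-unknown} album=${3:-unknown}
    local track=${4:-} title=${5:-unknown} ext=$6
    # A slash inside a tag would create extra directories, so replace it
    artist=${artist//\//_}; year=${year//\//_}; album=${album//\//_}
    track=${track//\//_}; title=${title//\//_}
    local name=$title
    if [[ -n $track ]]; then
        name="(${track})${title}"
    fi
    echo "${artist}/${year}_${album}/${name}.${ext}"
}

# read_tag FILE TAG -> prints the first value of TAG, or nothing
read_tag() {
    exiftool -s3 "-$2" "$1" 2>/dev/null | head -n 1
}

# sort_audio_file FILE DEST_ROOT: copies FILE below DEST_ROOT, never overwriting
sort_audio_file() {
    local file=$1 root=$2 target
    target="$root/$(audio_dest "$(read_tag "$file" Artist)" "$(read_tag "$file" Year)" \
        "$(read_tag "$file" Album)" "$(read_tag "$file" Track)" \
        "$(read_tag "$file" Title)" "${file##*.}")"
    mkdir -p "$(dirname "$target")"
    cp -n "$file" "$target"
}
```

Untagged files all map to unknown/unknown_unknown/unknown.ext in this sketch, so only the first one would be copied; a real script needs a numbering scheme for those.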
Archives
Archives are a different story. From my experience (of course this could be wrong) your best chance is when you have something like a plain .tar archive. Let me explain: if you have a .tar archive and some parts of it are broken, you can still get the working parts out of it. If you have a tar.bz2 or tar.gz, the bz2 or gz contains only “one big file”, so if the bz2 or gz is really broken, it’s much more difficult to get to the .tar within it. Anyway, none of this gets our data back, so let’s start:
First of all we want to move every archive into a folder named after its format (archives/FORMAT/blablubb.extension). Of course, I wrote a little bash script for this, too (did you expect something else?). This script is not doing much: it just fishes out some archive formats, moves them into another directory and makes sure not to overwrite already existing archives. After that we first make sure we have no copies (our getCopies script will help us with that) to reduce our work. So, let’s run copyArchives.sh.
Anyway, I won’t go through all archives here; it’s too much work to write it all down. Just try your luck with unpacking them manually.
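Moving files without overwriting existing ones could look roughly like this. It is a sketch under assumptions, not copyArchives.sh: the function names are invented, the format is taken from the file extension alone, and the numbering borrows the name.X.extension scheme (X from 1 to 999) described for the image script.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not the author's copyArchives.sh.

# unique_target DIR NAME -> prints a free path in DIR, trying NAME, then .1 to .999
unique_target() {
    local dir=$1 name=$2 base ext n
    if [[ ! -e $dir/$name ]]; then
        echo "$dir/$name"
        return 0
    fi
    if [[ $name == *.* ]]; then
        base=${name%.*}
        ext=.${name##*.}
    else
        base=$name
        ext=
    fi
    for (( n = 1; n <= 999; n++ )); do
        if [[ ! -e $dir/$base.$n$ext ]]; then
            echo "$dir/$base.$n$ext"
            return 0
        fi
    done
    return 1   # all 999 slots are taken
}

# stash_archive FILE ROOT: moves FILE to ROOT/FORMAT/, FORMAT = lowercased extension
stash_archive() {
    local file=$1 root=$2 ext target
    ext=${file##*.}
    ext=${ext,,}
    mkdir -p "$root/$ext"
    target=$(unique_target "$root/$ext" "$(basename "$file")") || return 1
    mv -n "$file" "$target"
}
```

A loop such as find . -type f -iname '*.gz' piped into stash_archive calls would then fill archives/gz; two different files called f1.gz end up as f1.gz and f1.1.gz.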
.gz
The next step is a bit more difficult, because you will see that we got a lot of files in some of the directories, for example .gz. Just for fun I unpacked one manually and found an XML file in it. It was some config, very useful *cough* (I hope you get my sarcasm). Anyway: gunzip “just” removes the .gz at the end of the name. That means we can easily unpack everything in place without overwriting anything. My script is NOT doing this for you. After that you can run “file” on every file (just use *) and use grep and grep -v to filter out unneeded stuff like text or xml files. Just an example:
root@yulivee /home/wdp/archives # du -h gz
5.0G gz
# example (yes.. there are nicer ways!)
# we are inside the gz directory, so only pick up .gz files
FILES=$(find . -type f -iname '*.gz')
IFS=$'\n'
for FILENAME in $FILES; do gunzip "$FILENAME"; done
root@yulivee /home/wdp/archives/gz # ls -la | grep -v ".gz" | wc -l
1154
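One of the nicer ways could be to test each archive first and unpack only the good ones into a separate directory, so the .gz files stay untouched and the broken ones end up in a list for gzrecover. The sketch below is only an assumption about how one might do that; unpack_gz and its arguments are hypothetical names.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the author's scripts.

# unpack_gz SRC_DIR OUT_DIR BAD_LIST
# Unpacks intact .gz files from SRC_DIR into OUT_DIR and writes the paths of
# broken ones to BAD_LIST. OUT_DIR should live outside SRC_DIR.
unpack_gz() {
    local src=$1 out=$2 bad=$3 file target
    mkdir -p "$out"
    : > "$bad"                       # start with an empty list
    while IFS= read -r -d '' file; do
        target=$out/$(basename "${file%.*}")
        if [[ -e $target ]]; then
            echo "skip (exists): $target" >&2
        elif gzip -t "$file" 2>/dev/null; then
            # -d decompress, -c write to stdout so the .gz itself is kept
            gzip -dc "$file" > "$target"
        else
            echo "$file" >> "$bad"
        fi
    done < <(find "$src" -type f -iname '*.gz' -print0)
}
```

For example, unpack_gz /home/wdp/archives/gz /home/wdp/unpacked /tmp/broken-gz.txt would leave the candidates for gzrecover in /tmp/broken-gz.txt (both output paths are made up for this example).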
The best error messages on some gz archives:
gzip: ./f101073943.gz: invalid compressed data--format violated
gzip: ./f101046863.gz: invalid compressed data--crc error
gzip: ./f101046863.gz: invalid compressed data--length error
gzip: ./f3342105.gz: unexpected end of file
gzip: ./f202679045.gz is a a multi-part gzip file -- not supported
I get ’em all! :-) Still nothing to worry about, we can try gzrecover later. Anyway, now let’s take a quick look at the unpacked files using file:
root@yulivee /home/wdp/archives/gz # file * | grep -v "gzip compressed data" | grep -v "troff or preprocessor" | grep -v "ASCII" | grep -v "XML document text" | grep -v "ISO-8859" | grep -v "GNU message catalog" | grep -v "libtool library file" | grep -v "exported SGML document text" | grep -v "empty" | grep -v "Emacs" | grep -v "UTF-8 Unicode" | grep -v "PPD file" | grep -v "ELF 32-bit" | grep -v "ELF 64-bit"
Yes, yes, there are nicer lines, I know. Anyway, just to give you an overview of what that will list:
f7202073: X pixmap image text
f7128443: data
f15078855: HTML document text
f123665445: PNG image, 16 x 16, 8-bit colormap, non-interlaced
f121121467: SVG Scalable Vector Graphics image
f121121195: bzip2 compressed data, block size = 900k
f121120903: Ogg data, Vorbis audio, mono, 44100 Hz, ~96000 bps,
f121119451: POSIX shell script text executable
f109653207: GIF image data, version 89a, 216 x 37
f109654929: current ar archive
f109208833: PDF document, version 1.3
And that’s just “a few” lines of the output. You see, we get some useful stuff out of it. At least we found some audio files and some images in there, so it’s time to run imageSorting and audioSorting again on this new directory. But first, let’s take a closer look at the files reported as “data” and at the corrupted gz files.
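The long grep -v chain can be collapsed into a single extended regular expression, and a count per type gives a quicker overview than scrolling through thousands of lines. A minimal sketch; BORING, interesting_files and type_summary are names invented for this example, and the pattern list is simply the one from the chain.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for reading the output of file(1).

# Types that are not worth a manual look (same list as the grep -v chain)
BORING='gzip compressed data|troff or preprocessor|ASCII|XML document text|ISO-8859|GNU message catalog|libtool library file|exported SGML document text|empty|Emacs|UTF-8 Unicode|PPD file|ELF (32|64)-bit'

# interesting_files: reads "name: type" lines on stdin and
# prints only the lines that match none of the BORING patterns
interesting_files() {
    grep -Ev "$BORING"
}

# type_summary: reads the same lines and prints a count per type,
# where the type is the text between the colon and the first comma
type_summary() {
    cut -d: -f2- | cut -d, -f1 | sed 's/^ *//' | sort | uniq -c | sort -rn
}

# Usage, inside the directory with the unpacked files:
#   file -- * | interesting_files
#   file -- * | type_summary
```

Like the original chain, the filter matches against the whole line, so a file whose name happens to contain one of the patterns is dropped as well.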
data
As “file” does not recognize any known format in such a file, I cannot tell you WHAT the hell that file is. Most likely some special unrecognized format, or a corrupted piece.
the corrupted gz’s
You’ve seen above a few error messages I got while unpacking the gz files. Now I will run gzrecover on the corrupted files in the hope that I can unpack them. For this we just run our “for” loop again, but instead of gunzip we use the path to gzrecover:
root@yulivee /home/wdp/archives/gz # FILES=$(find . -type f -iregex '.*\.\(gz\)')
root@yulivee /home/wdp/archives/gz # IFS=$'\n'
root@yulivee /home/wdp/archives/gz # for FILENAME in $FILES; do /home/wdp/Desktop/gzrt-0.5/gzrecover "$FILENAME"; done
Now we have some .recovered files in our directory. Let’s check them using “file”.
Now rename some of the files manually; if there are many you can write a little bash script for that. I just look out for useful stuff like tar, bz2, zip and rar archives, images, sound files, PDFs and office documents. After renaming them I use imageSorting and audioSorting to get the images and audio files out, then I copy all office documents and PDFs manually, and last but not least I try to unpack the archives.
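Such a little renaming script might look like the sketch below. It is an assumption, not something from the original scripts: ext_for_mime and rename_recovered are hypothetical names, and the MIME strings are the ones commonly printed by file --mime-type, which can differ between versions of file, so check what your version reports before relying on the table.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for renaming gzrecover output by detected type.

# ext_for_mime MIME -> prints an extension, or nothing for types not listed
ext_for_mime() {
    case $1 in
        application/x-tar)                     echo tar ;;
        application/x-bzip2)                   echo bz2 ;;
        application/zip)                       echo zip ;;
        application/x-rar|application/vnd.rar) echo rar ;;
        application/pdf)                       echo pdf ;;
        image/png)                             echo png ;;
        image/jpeg)                            echo jpg ;;
        image/gif)                             echo gif ;;
        audio/mpeg)                            echo mp3 ;;
        audio/ogg|application/ogg)             echo ogg ;;
    esac
}

# rename_recovered DIR: gives every *.recovered file in DIR a matching extension
rename_recovered() {
    local dir=$1 file ext
    for file in "$dir"/*.recovered; do
        [[ -e $file ]] || continue           # no match: the glob stays literal
        ext=$(ext_for_mime "$(file --brief --mime-type -- "$file")")
        [[ -n $ext ]] || continue            # unknown type: leave it alone
        mv -n -- "$file" "${file%.recovered}.$ext"
    done
}
```

Files of a type that is not in the table keep their .recovered name, so whatever is left afterwards is the pile that still needs a manual look.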
Well.. you know what to do, go luke!
rar
With rar files it’s quite easy: either you can unpack them or you simply can’t. Use “unrar e filename” and try your luck. If it’s not working, well, I know no way to recover/repair those.
bz2
root@yulivee /home/wdp/archives/bz2 # bunzip2 f5464729.bz2
bunzip2: Data integrity error when decompressing.
Input file = f5464729.bz2, output file = f5464729
It is possible that the compressed file(s) have become corrupted.
You can use the -tvv option to test integrity of such files.
You can use the `bzip2recover' program to attempt to recover
data from undamaged sections of corrupted files.
bunzip2: Deleting output file f5464729, if it exists.
root@yulivee /home/wdp/archives/bz2 # bzip2recover f5464729.bz2
bzip2recover 1.0.5: extracts blocks from damaged .bz2 files.
bzip2recover: searching for block boundaries ...
block 1 runs from 80 to 1730802
block 2 runs from 1730851 to 1736704 (incomplete)
bzip2recover: splitting into blocks
writing block 1 to `rec00001f5464729.bz2' ...
bzip2recover: finished
root@yulivee /home/wdp/archives/bz2 #
Remember
Even if we have unpacked some archives now, the files in them may be broken. So if you find images or audio files in them, repeat the imageSorting, audioSorting and probably getCopies scripts.
Clean Up
So, I have now processed every image, every audio file and every archive, and I deleted the copies. I checked that my new folder contains all important audio and image files and all data from the recovered archives. Time to delete those from our backup directory. For this we will use a simple find command. Make sure that you run it in the correct directory, so you don’t lose any important data!
As I had no use for the video files recovered by photorec, I delete them using the following line within the recovery folders:
find . -type f -iregex '.*\.\(avi\|mpg\)' -exec rm -f '{}' \;
As I have already processed all images, I delete them from the recovery folders using the following line within a recovery folder:
find . -type f -iregex '.*\.\(png\|jpg\|gif\|bmp\|svg\|tiff\|tif\|jpeg\|eps\|psd\)' -exec rm -f '{}' \;
As I have already processed the ogg and mp3 files, I delete them from the recovery folders using the following line within a recovery folder:
find . -type f -iregex '.*\.\(mp3\|ogg\)' -exec rm -f '{}' \;
As I have already processed all archives, I delete them from the recovery folders using:
find . -type f -iregex '.*\.\(gz\|zip\|bz2\|tar\|rar\)' -exec rm -f '{}' \;
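Since a wrong directory here means losing data, a dry run before the real deletion is cheap insurance. The following wrapper is a sketch only: purge_by_ext and the DELETE variable are hypothetical, and it relies on the -delete action, which GNU and BSD find offer but POSIX does not.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the author's scripts.

# purge_by_ext DIR EXT...
# Lists the files in DIR with one of the given extensions (case-insensitive).
# Deletes them only when called with DELETE=yes in the environment.
purge_by_ext() {
    local dir=$1
    shift
    local -a expr=()
    local ext
    for ext in "$@"; do
        # builds: -iname '*.avi' -o -iname '*.mpg' ...
        if (( ${#expr[@]} > 0 )); then
            expr+=(-o)
        fi
        expr+=(-iname "*.${ext}")
    done
    if (( ${#expr[@]} == 0 )); then
        return 1                     # refuse to run without any extension
    fi
    if [[ ${DELETE:-no} == yes ]]; then
        find "$dir" -type f \( "${expr[@]}" \) -print -delete
    else
        find "$dir" -type f \( "${expr[@]}" \) -print
    fi
}
```

First run purge_by_ext . avi mpg and read the list; if it looks right, repeat it as DELETE=yes purge_by_ext . avi mpg.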
The other files
Still a lot of other files around, hm? txt, sxw, etc. What to do with them? Let’s copy all files into one directory per extension so that we can process them manually and get useful stuff out of them. That’s all we can do. There are probably ways to get more data out of such files. Go ahead, have fun :-)
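Copying everything into one directory per extension takes only a few lines. A minimal sketch with invented names: by_extension is hypothetical, and the folder noext for files without an extension is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the author's scripts.

# by_extension SRC DEST: copies every file below SRC to DEST/<lowercased extension>/
by_extension() {
    local src=$1 dest=$2 file name ext
    while IFS= read -r -d '' file; do
        name=$(basename "$file")
        if [[ $name == ?*.* ]]; then
            ext=${name##*.}
            ext=${ext,,}
        else
            ext=noext                # no dot, or only a leading dot
        fi
        if [[ -z $ext ]]; then
            ext=noext                # name ends with a dot
        fi
        mkdir -p "$dest/$ext"
        # cp -n never overwrites: a second file with the same name is skipped
        cp -n -- "$file" "$dest/$ext/"
    done < <(find "$src" -type f -print0)
}
```

Because recovered files often share names across directories, it may be worth running the copy detection first, or giving clashing files a number as in the other scripts, so nothing is silently skipped.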
Blubb
And? Did this help you a bit to get an overview of your recovered data? It helped me a lot, though it’s still work to sort all these files. If you recovered files from Windows discs or similar, you could also run clamscan on every file (probably in the sha1sum script: add :clean|infected to the file’s entry) to check for viruses if you really care. Let’s just say it this way: every really corrupted file and every deleted file reduces your work, so your best bet is to automatically find those files which aren’t worth repairing/recovering/sorting and which simply can’t be used anymore. Good luck!
