Your SSD is dying!

Written by Geri, 4 years ago

The SSD in your computer, which you have been bragging about to your friends for years, is about to die. Have you just checked its health in the tool supplied by the manufacturer, or in a generic tool like CrystalDiskInfo, and are you now shaking your head in disbelief because the drive is reported to be in good condition? Then I have very bad news for you: you are probably about to lose all of the data on your SSD.

What is an SSD

An SSD (solid state drive) is the next step in the evolution of data storage. Unlike hard disks, where the data is recorded on spinning platters, an SSD has no moving parts. It works similarly to a pen drive or an SD card: the bits and bytes are represented by electrons in flash memory chips. Because of this, the computer can access the data much faster, as no mechanical read/write head is involved.

How the SSD works

An SSD works similarly to a memory card, but it is designed as a replacement for hard disks. Pen drives have a hard time writing small files, while SSDs offer good performance even when an operating system runs from them. The problem is that flash chips degrade with every write. After a certain number of rewrites the SSD dies, and it takes the data with it into the abyss. Once the SSD is dead, the data is no longer accessible.

Why the SSD dies

Modern SSDs (and modern pen drives and memory cards) are very complex. The information is stored in a very small space: terabytes are squeezed into an area the size of a penny. Various techniques are used to reach this data density, such as MLC and TLC, which store multiple bits in every memory cell. The more modern the SSD or pen drive, the fewer writes it can endure. Modern SSDs can barely tolerate more than 100 rewrites per cell. SSD memory blocks are allocated in 4 KByte chunks, rather than the usual 512-byte chunks of HDD sectors.

What your SSD does

A modern SSD has a very complex controller that shuffles writes around the flash cells, so that data lands in the less-used cells and the SSD stays alive longer. Every SSD has a TBW (terabytes written) rating that indicates how many TBytes can be written to it before it dies. After you reach this number, the warranty on the SSD is void, and some SSDs switch into a read-only state. Intel SSDs switch to this read-only state, and Crucial MX100 SSDs usually do as well. Samsung drives usually just die silently after exceeding the TBW (except the professional EVO Pro series).
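As a rough illustration of the wear-leveling idea described above, here is a toy sketch in Python. It assumes the controller simply sends every write to the block with the lowest erase count; real controllers are far more elaborate, and the names pick_block and write_many are made up for this example.

```python
def pick_block(erase_counts):
    """Return the index of the least-worn block (lowest erase count)."""
    return min(range(len(erase_counts)), key=erase_counts.__getitem__)


def write_many(erase_counts, writes):
    """Apply a number of block writes, always wearing the least-used block."""
    for _ in range(writes):
        erase_counts[pick_block(erase_counts)] += 1
    return erase_counts


# Eight writes spread over four fresh blocks wear every block equally.
fresh = write_many([0, 0, 0, 0], 8)      # [2, 2, 2, 2]

# A block that is already worn (5 erases) is left alone while others catch up.
uneven = write_many([5, 0, 0], 4)        # [5, 2, 2]
```

The point of the sketch is only that no single block gets hammered while others sit idle; it does not change the total amount of wear the flash has to absorb.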

Checking your SSD health with CrystalDiskInfo

CrystalDiskInfo is a disk info tool which uses sexy anime girls to cover its professional inadequacy. It will flag perfectly good hard disks as bad (indicating 0% health remaining), usually because of the reallocated sector count, which confuses amateur buyers when they shop for hard disks. Reallocated sectors are normal: hard disks (and SSDs) replace dying sectors themselves, so you don't have to deal with bad sectors at the system level through fsck or chkdsk. The other problem is that it doesn't understand what an LBA write is, so it will report an SSD as being in perfectly good condition despite its imminent death.

screenshot: CrystalDiskInfo

Crucial Storage Executive and the tools of other manufacturers

For the Crucial MX100 and MX500 series of SSDs, you can use the Crucial Storage Executive software. This tool lets you monitor the health of your SSD, and it comes from the manufacturer itself. Of course, other manufacturers have tools for their own SSDs: Samsung has its own, Intel has its own, and so on. Just like CrystalDiskInfo, this software will mislead you and report that the drive is fine even when it is already on the verge of death.

How to actually find out

You will have to download a tool that can read the SMART data of the drive. You need a tool that shows raw values, not some shitlordic tool that calculates things arbitrarily for you and shows shiny graphs and charts. If you have Debian Linux, you can just apt-get install smartmontools to get smartctl, and then use the smartctl -a /dev/sdx command to get the SMART information of an HDD or SSD.
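If you would rather collect the report from a script, the smartctl call can be wrapped in a few lines of Python. This is only a sketch: it assumes smartctl is installed and on the PATH, that you have the privileges to query the drive (usually root), and the function names smartctl_command and read_smart_report are invented for this example.

```python
import subprocess


def smartctl_command(device):
    """Build the argument list for `smartctl -a <device>`."""
    return ["smartctl", "-a", device]


def read_smart_report(device):
    """Run smartctl on the device and return its text output.

    smartctl uses its exit status as a bit mask of warnings, so a non-zero
    status does not necessarily mean the command failed. For that reason
    this sketch does not raise on a non-zero exit status.
    """
    result = subprocess.run(
        smartctl_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


# Example call (the device name is a placeholder for your own drive):
# print(read_smart_report("/dev/sdx"))
```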

Check the writes

The simplest way to get a picture of the health of your SSD is to check the number of writes. On these SSDs you can find it as attribute #246, called Total LBAs Written (other drives may report it under a different ID, such as #241). Below are the parameters of a dead Crucial MX100 drive that died after just a few years of active use:

=== START OF INFORMATION SECTION ===

Model Family: Crucial/Micron BX/MX1/2/3/500, M5/600, 1100 SSDs
Device Model: Crucial_CT512MX100SSD1
Sector Sizes: 512 bytes logical, 4096 bytes physical
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 933
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 3573
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 099 099 000 Old_age Always - 48
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 382
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 4403
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 066 047 000 Old_age Always - 34 (Min/Max 8/53)
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
202 Percent_Lifetime_Remain 0x0031 099 099 000 Pre-fail Offline - 1
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 19361152390

You can observe the following line:

246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 19361152390

This means the SSD endured a total of 19361152390 writes. This number, however, is not in bytes. It is the number of logical blocks (LBAs) the drive has been told to write over its lifetime, and each logical block is 512 bytes.
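Picking the raw value out of the attribute table can also be scripted. The sketch below assumes the ten-column row layout shown in the smartctl output above (ID, name, flag, value, worst, threshold, type, updated, when failed, raw value); the function name parse_smart_attributes is made up for this example, and rows whose raw value does not start with a plain number are simply skipped.

```python
def parse_smart_attributes(report):
    """Map attribute ID -> (attribute name, raw value) from smartctl rows."""
    attributes = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        # The raw value is the 10th column; extras such as "(Min/Max 8/53)"
        # come after it and are ignored here.
        if not fields[9].isdigit():
            continue
        attributes[int(fields[0])] = (fields[1], int(fields[9]))
    return attributes


sample = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
173 Ave_Block-Erase_Count 0x0032 099 099 000 Old_age Always - 48
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 19361152390
"""

attrs = parse_smart_attributes(sample)
lbas_written = attrs[246][1]   # 19361152390
```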

But 512 bytes actually means 4096 bytes!

If you multiply 19361152390 by 512, you get the number of bytes the drive was asked to write. Divide that by 1024 and you get KBytes, divide again by 1024 and you get MBytes, and so on. (((((19361152390*512)/1024)/1024)/1024)/1024) comes to about 9 TBytes of data issued to the drive. The Crucial MX100 SSD is rated to survive 72 TBytes written. It is, however, dead. How so? Let's change our math a bit. The hardware sector size of the SSD is 4096 bytes, not 512. If we multiply the number by 4096 instead, (((((19361152390*4096)/1024)/1024)/1024)/1024) comes to about 72 TBytes written. And the drive died right after reaching this point.
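The same arithmetic as a small Python sketch, in case you want to plug in your own numbers. The helper name lbas_to_tbytes is invented for this example; "TByte" here follows the calculation above, that is, bytes divided by 1024 four times.

```python
TBYTE = 1024 ** 4  # bytes per TByte, matching the four divisions by 1024


def lbas_to_tbytes(lba_count, bytes_per_lba):
    """Convert a Total LBAs Written raw value into TBytes."""
    return lba_count * bytes_per_lba / TBYTE


mx100_lbas = 19361152390

host_view = lbas_to_tbytes(mx100_lbas, 512)     # about 9 TBytes
worst_case = lbas_to_tbytes(mx100_lbas, 4096)   # about 72 TBytes

print(f"512-byte view:  {host_view:.1f} TBytes")
print(f"4096-byte view: {worst_case:.1f} TBytes")
```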

What is going on?

The system issues writes in 512-byte blocks, but the SSD cannot write them as such. The SSD has 4096-byte blocks, so it can only write 4096-byte chunks to the cells. If a 512-byte write command comes in, the SSD has to read the whole 4096-byte block, change 512 bytes in it, and write the whole 4096-byte block back. In the worst case, every 512-byte write results in a 4096-byte write to your SSD.

The file system and the controller chip try to control this

When you write your data, the controller chip on the SSD tries to reorganize the writes: it caches these 512-byte writes, especially when they go to contiguous blocks. The file system on your computer also typically works in 4 KByte chunks. This would suggest that most writes get combined into 4096-byte writes, but sadly that is not true, because the operating system's disk driver also hammers the file allocation table after every few blocks written, to record where the file contents end up on the disk. Multiple programs run at the same time, files get fragmented in the file system, files are accessed randomly, and the SSD also has to shuffle blocks around to avoid exhausting individual blocks.
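To make the difference between the two cases concrete, here is a toy model in Python. It is a deliberate simplification and not how any real controller works: it assumes that a 512-byte write forces a rewrite of the whole 4096-byte block, and that "caching" only merges consecutive writes that land in the same physical block. The function name physical_bytes_written is made up for the example.

```python
LOGICAL_SIZE = 512     # bytes per logical block (LBA)
PHYSICAL_SIZE = 4096   # bytes per physical block


def physical_bytes_written(lbas, coalesce):
    """Count bytes rewritten on flash for a sequence of 512-byte LBA writes.

    coalesce=False: every logical write rewrites its whole physical block.
    coalesce=True:  consecutive writes to the same physical block are merged
                    into a single rewrite.
    """
    lbas_per_block = PHYSICAL_SIZE // LOGICAL_SIZE
    rewrites = 0
    previous_block = None
    for lba in lbas:
        block = lba // lbas_per_block
        if coalesce and block == previous_block:
            # Merged into the rewrite already counted for this block.
            continue
        rewrites += 1
        previous_block = block
    return rewrites * PHYSICAL_SIZE


sequential = list(range(8))                 # 8 LBAs inside one physical block
scattered = [0, 8, 16, 24, 32, 40, 48, 56]  # 8 LBAs in 8 different blocks

best = physical_bytes_written(sequential, coalesce=True)    # 4096
worst = physical_bytes_written(sequential, coalesce=False)  # 32768
random_io = physical_bytes_written(scattered, coalesce=True)  # 32768
```

In all three cases the system asked for the same 4096 bytes to be written; merging only helps when the writes are contiguous, which is exactly what fragmentation and file table updates prevent.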

So what is the actual number of data being written out?

There is another indication of this in the SMART table, called Average Block Erase Count. Let's look at our previous drive:

173 Ave_Block-Erase_Count 0x0032 099 099 000 Old_age Always - 48

This means the blocks were overwritten 48 times on average. On a half-TByte Crucial MX100 drive, that means at least 24 TBytes of data were written to the flash. So the 9 TBytes of LBA writes on this drive caused 24 TBytes of actual writes. In other words, on average 8x512 bytes of writes caused 3x4096-byte blocks to be overwritten, so every 1 GByte of typical data written to the SSD meant about 3 GBytes of actual writes. And of course this is just an average, so some of the blocks were hammered even harder, hitting the worst case (where 8x512-byte writes meant 8x4096-byte writes).
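The estimate above boils down to two multiplications, sketched here in Python. The function names are invented for this example, and the result is only a lower bound under the assumption that every erase cycle rewrites the whole drive capacity once.

```python
def flash_writes_tbytes(avg_erase_count, capacity_tbytes):
    """Lower-bound estimate of data written to flash: erase count x capacity."""
    return avg_erase_count * capacity_tbytes


def write_multiplier(flash_tbytes, host_tbytes):
    """How many bytes hit the flash for every byte the system wrote."""
    return flash_tbytes / host_tbytes


# Dead MX100: 48 average erases, half a TByte of capacity, 9 TBytes of LBA writes.
flash_written = flash_writes_tbytes(48, 0.5)        # 24.0 TBytes
multiplier = write_multiplier(flash_written, 9)     # roughly 2.7, so about 3x

print(f"{flash_written:.0f} TBytes on flash, about {multiplier:.1f}x the LBA writes")
```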

What was this SSD able to endure in reality?

As we can see, we should multiply the LBA write count by 4096, not by 512, to get the worst case of what some sectors had to endure. This is what you must take into consideration when you try to figure out your SSD's remaining life. Look up your SSD's TBW rating and compare it with this number to find out how much of its life remains. If the average block erase count multiplied by your SSD's size reaches a third of the TBW, and your LBA writes multiplied by 4096 are also close to the TBW, then you can basically start digging a grave for your SSD.
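This rule of thumb can be written down as a small check. The sketch below is one possible reading of it, not a definitive test: it treats "near the TBW" as 90% or more (the near parameter is an assumption of this example, not a number from the article), counts an erase-based estimate of exactly one third of the TBW as reaching the limit, and the function name looks_worn_out is made up.

```python
TBYTE = 1024 ** 4


def looks_worn_out(lba_count, avg_erase_count, capacity_tbytes, rated_tbw, near=0.9):
    """Apply the two-part rule of thumb to a drive's SMART numbers."""
    # Worst case: every 512-byte LBA write cost a full 4096-byte block.
    worst_case_tbytes = lba_count * 4096 / TBYTE
    # Lower bound from the average block erase count.
    erased_tbytes = avg_erase_count * capacity_tbytes
    return (erased_tbytes >= rated_tbw / 3
            and worst_case_tbytes >= near * rated_tbw)


# Dead 512 GB MX100 from the article (72 TBW rating).
mx100 = looks_worn_out(19361152390, 48, 0.5, 72)   # True

# Healthy 1 TB MX500 from the article (360 TBW rating).
mx500 = looks_worn_out(3329062386, 6, 1, 360)      # False
```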

What is the number of data written out on this example drive?

From the point of view of the user and the operating system, the drive in the previous example was fed a total of 9 TBytes of writes over its lifespan. Because of the way SSDs are built, these 9 TBytes resulted in 24-72 TBytes of data being written to the cells, which caused the death of the SSD (the MX100 is rated for a maximum of 72 TBytes written). You can actually kill an older SSD just by downloading a few 30-40 GB games from the internet, installing them to try them out, then deleting them and trying new ones. If you do this for a few months, you have already killed your SSD.

What's up with the newer SSDs?

Let's look at another drive. This is a fully working 1 TB Crucial MX500 SSD:

=== START OF INFORMATION SECTION ===
Device Model: CT1000MX500SSD1
Sector Sizes: 512 bytes logical, 4096 bytes physical
=== START OF READ SMART DATA SECTION ===

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

5 Reallocated_Sector_Ct 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 478
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 210
173 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 6
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 1
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033 000 000 000 Pre-fail Always - 49
194 Temperature_Celsius 0x0022 066 045 000 Old_age Always - 34 (Min/Max 0/55)
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
246 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 3329062386

We can see the following attributes to indicate the drive's health:

246 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 3329062386

173 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 6

Using the formula from above, the number of bytes written in the worst case is ((((3329062386*4096)/1024)/1024)/1024)/1024 = 12 TBytes, which the drive reached within 2 years of fairly light use. Taking the average block erase count (6) into consideration, this SSD has endured 6-12 TBytes of writes over its lifespan so far. The Crucial MX500 1 TB SSD is rated at 360 TBW, so this drive will survive more than a decade under this type of usage.
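The "more than a decade" estimate can be checked with a quick projection. This sketch assumes the drive keeps being written at the same steady rate as in its first 2 years, which is a simplification; the function name years_until_tbw is invented for the example.

```python
def years_until_tbw(written_tbytes, years_in_use, rated_tbw):
    """Years left until the TBW rating is reached, at a constant write rate."""
    rate_per_year = written_tbytes / years_in_use
    return (rated_tbw - written_tbytes) / rate_per_year


# MX500 1 TB: worst case 12 TBytes written in 2 years, rated at 360 TBW.
remaining = years_until_tbw(12, 2, 360)   # 58.0 years

print(f"About {remaining:.0f} more years at this rate")
```

Even in the worst-case reading of the write counter, this drive is nowhere near its rating, which matches the conclusion above.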

To summarize

You really should not expect reliability from a first-generation SSD. You should not trust HDD/SSD monitoring programs: always read the raw numbers yourself and do the calculations yourself. You could just get an HDD of a few TBytes and get decades out of it; only the newest generation of SSDs can offer similar reliability. Always keep backups of your important data somewhere! Special thanks to Conker for the SMART information of the dead MX100 SSD.


Comments

I have no idea about those knowledge

4 years ago

I was worried about weekly Manjaro gigabyte updates,
but apparently not as heavy an impact as guesstimated.


context:
https://www.reddit.com/r/ManjaroLinux/comments/hfypd8/updates_are_huge_even_for_minor_updates/?utm_source=share&utm_medium=web2x&context=3

4 years ago

You can look at the SMART table before applying the update, then do the update, reboot the computer, and look at the SMART table once again. Take the difference in attribute 246 and multiply it by 4096, and you will know exactly how much data was written to the SSD in the worst case.

(please note, smart table is not updated in real time on all devices, it could take several minutes or a reboot)

oh and also, as the commenters pointed out in that thread, Debian is the way to go. i am really having a hard time to take other linux distributions seriously nowadays.

4 years ago

Debian first certainly for everything mission-critical.
... and most of anything else. Manjaro is just for bleeding-edge stuff I need from time to time.
The table says I have not done nearly as much as I thought I had, but the ssd is not in the database, and it misses att 246.

234 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 8535
235 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 12144
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 11718
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 5849
250 Read_Error_Retry_Rate 0x0032 100 100 000 Old_age Always - 5306345

Guess I will need to dig deeper but LBA written seems okay for now.

4 years ago

Is there any way to recover files from a dead or dying SSD?

4 years ago

If you can still see the file system, then there is no problem, you can copy your data.

However, once the SSD is damaged, usually everything is gone. On hard disks, when the file system dies, you can still access the raw data and use specialized software to recover it. This is usually not possible on an SSD: once it becomes unreadable, the chip usually ends up with a garbled data relocation table, which makes it impossible to read the contents of the data sectors.

4 years ago