MCELOG Analysis

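Introduction

A Machine Check Exception (MCE), also called a Machine Check Error, is how the CPU reports a hardware fault it has detected, most often in the CPU itself or in memory. On Linux, mcelog is the tool used to collect and decode these events.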

Case Study

Case 1

Hardware event. This is not a software error.
MCE 0
mcelog: Family 6 Model 4d CPU: only decoding architectural errors
Hardware event. This is not a software error.
CPU 2 BANK 0 TSC 22366c36e58bc0
TIME 1520300677 Tue Mar 6 09:44:37 2018
MCG status:MCIP
MCi status:
Corrected error
Error enabled
MCA: External error
STATUS 9000000020000003 MCGSTATUS 4
MCGCAP 806 APICID 4 SOCKETID 0
CPUID Vendor Intel Family 6 Model 77
Unknown CPU type vendor 21 family 0 model
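
The flags mcelog prints for Case 1 can be cross-checked against the raw STATUS and MCGSTATUS values. The following is a minimal sketch, assuming the architectural bit layout of the IA32_MCi_STATUS and IA32_MCG_STATUS registers; the table and function names are illustrative and not part of mcelog.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit position -> name).
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# Flag bits of IA32_MCG_STATUS.
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, table):
    """Return the names of all flags set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_status(status):
    """Split a raw MCi_STATUS value into flags and error codes."""
    flags = set_flags(status, MCI_STATUS_FLAGS)
    return {
        "flags": flags,
        "corrected": "UC" not in flags,
        "mca_code": status & 0xFFFF,                  # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


case1 = decode_status(0x9000000020000003)
print(case1["flags"])                      # ['VAL', 'EN']
print(case1["corrected"])                  # True
print(set_flags(0x4, MCG_STATUS_FLAGS))    # ['MCIP']
```

The result agrees with the decoded text: EN set and UC clear correspond to "Error enabled" and "Corrected error", and MCGSTATUS 4 corresponds to "MCG status:MCIP".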

Case 2

Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 0
ADDR fef80000
TIME 1520309887 Tue Mar 6 04:18:07 2018
MCG status:
MCi status:
Uncorrected error
MCi_ADDR register valid
Processor context corrupt
MCA: Internal unclassified error: 410
Running trigger `unknown-error-trigger'
STATUS a600000007600410 MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 77
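
The "MCA:" line in each record comes from the low 16 bits of STATUS. As a rough sketch of that mapping, assuming the simple error code table from the Intel architecture manual (only a few entries are listed, and the function name is hypothetical):

```python
# A subset of the architectural "simple" MCA error codes (STATUS bits 15:0).
SIMPLE_MCA_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
}


def describe_mca_code(status):
    """Describe the MCA error code held in bits 15:0 of MCi_STATUS."""
    code = status & 0xFFFF
    if code in SIMPLE_MCA_CODES:
        return SIMPLE_MCA_CODES[code]
    # Bit pattern 0000 01xx xxxx xxxx marks an internal unclassified error.
    if (code & 0xFC00) == 0x0400:
        return f"Internal unclassified error: {code:x}"
    # Compound codes (memory, cache, bus, ...) are not handled in this sketch.
    return f"Unhandled code: {code:#06x}"


print(describe_mca_code(0x9000000020000003))  # External error
print(describe_mca_code(0xa600000007600410))  # Internal unclassified error: 410
```

Both results match what mcelog printed for Case 1 and Case 2.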

Case 3

mcelog: failed to prefill DIMM database from DMI data
Hardware event. This is not a software error.
MCE 0
CPU 0 BANK 0
ADDR fef80000
TIME 1521087195 Thu Mar 15 04:13:15 2018
MCG status:
MCi status:
Uncorrected error
MCi_ADDR register valid
Processor context corrupt
MCA: Internal unclassified error: 410
Running trigger `unknown-error-trigger'
STATUS a600000007600410 MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 77
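  • Root cause
    • The first line of the output, mcelog: failed to prefill DIMM database from DMI data, indicates that mcelog could not load DIMM information from the DMI data
  • Solution
    • This is a harmless warning message and can be ignored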
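
Apart from the warning, the Case 3 record reports the same CPU, bank, ADDR, and STATUS as Case 2, nine days later. A small sketch of how such recurrences could be detected follows; it assumes the text layout shown in the logs above, and the helper names and the choice of signature fields are illustrative only.

```python
import re
from datetime import datetime, timezone


def parse_record(text):
    """Extract a few fields from one decoded mcelog record."""
    cpu_bank = re.search(r"^CPU (\d+) BANK (\d+)", text, re.M)
    addr = re.search(r"^ADDR ([0-9a-f]+)", text, re.M)
    time = re.search(r"^TIME (\d+)", text, re.M)
    status = re.search(r"^STATUS ([0-9a-f]+)", text, re.M)
    return {
        "cpu": int(cpu_bank.group(1)),
        "bank": int(cpu_bank.group(2)),
        "addr": int(addr.group(1), 16) if addr else None,
        # TIME is a Unix timestamp; convert it to an aware UTC datetime.
        "time_utc": datetime.fromtimestamp(int(time.group(1)), tz=timezone.utc),
        "status": int(status.group(1), 16),
    }


def signature(record):
    """Fields that identify the same fault across records (an assumption)."""
    return (record["cpu"], record["bank"], record["addr"], record["status"])


CASE_2 = """CPU 0 BANK 0
ADDR fef80000
TIME 1520309887 Tue Mar 6 04:18:07 2018
STATUS a600000007600410 MCGSTATUS 0
"""
CASE_3 = """CPU 0 BANK 0
ADDR fef80000
TIME 1521087195 Thu Mar 15 04:13:15 2018
STATUS a600000007600410 MCGSTATUS 0
"""

r2, r3 = parse_record(CASE_2), parse_record(CASE_3)
print(signature(r2) == signature(r3))   # True
print(r3["time_utc"] - r2["time_utc"])  # time elapsed between the two events
```

Comparing signatures rather than whole records ignores the timestamp and any extra warning lines, so repeated faults on the same bank and address stand out.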

Useful commands

  • Check whether mcelog supports the CPU type

    • Not supported (only architectural errors are decoded)

      root@localhost:# mcelog --is-cpu-supported
      mcelog: Family 6 Model 4d CPU: only decoding architectural errors
    • Supported

      root@localhost:# mcelog --is-cpu-supported
      root@localhost:#
  • Start the mcelog daemon
    mcelog --daemon

    • You need to enable the service so that it starts at boot
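
The CPU support check above can also be scripted. The sketch below classifies the command's output using only the two outcomes shown above (an empty response, or the "only decoding architectural errors" message). It does not rely on the exit code, and the function names are hypothetical.

```python
import subprocess

# Phrase taken from the output shown for the unsupported CPU.
PARTIAL_SUPPORT_HINT = "only decoding architectural errors"


def describe_cpu_support(output):
    """Classify the text printed by `mcelog --is-cpu-supported`."""
    if PARTIAL_SUPPORT_HINT in output:
        return "architectural errors only"
    if output.strip() == "":
        return "fully supported"
    return "unrecognized output"


def check_cpu_support():
    """Run mcelog and classify its output (needs mcelog installed, usually as root)."""
    result = subprocess.run(
        ["mcelog", "--is-cpu-supported"],
        capture_output=True, text=True, check=False,
    )
    # The message may go to stdout or stderr, so inspect both.
    return describe_cpu_support(result.stdout + result.stderr)
```

Calling check_cpu_support() on the Family 6 Model 4d machine from the cases above would be expected to return "architectural errors only".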

Terminology

  • MCE: Machine Check Exception (Error)
  • MCA: Machine Check Architecture
  • NMI: Non-Maskable Interrupt, used to notify the system of ECC errors
  • MSRs: Model-Specific Registers, which hold the machine check error state
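
One of these registers appears in every record above as MCGCAP 806. As an illustrative sketch, assuming the architectural layout of IA32_MCG_CAP (bank count in bits 7:0, capability flags above it; only some flags are listed, and the names in the code are illustrative):

```python
# Selected capability flag bits of IA32_MCG_CAP.
MCG_CAP_FLAGS = {
    8: "MCG_CTL_P",    # IA32_MCG_CTL register present
    9: "MCG_EXT_P",    # extended machine check state registers present
    10: "MCG_CMCI_P",  # corrected machine check interrupt supported
    11: "MCG_TES_P",   # threshold-based error status supported
    24: "MCG_SER_P",   # software error recovery supported
}


def decode_mcg_cap(value):
    """Return the number of reporting banks and the capability flags set."""
    return {
        "banks": value & 0xFF,  # bits 7:0
        "features": [name for bit, name in sorted(MCG_CAP_FLAGS.items())
                     if (value >> bit) & 1],
    }


print(decode_mcg_cap(0x806))  # {'banks': 6, 'features': ['MCG_TES_P']}
```

Under that layout, the value 806 describes a CPU with six error-reporting banks.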
