Server running AIX with Oracle RAC reboots itself


    • #7019
      andrewk
      Member

      I tried, without success, to prove this to our Oracle DBAs more than a year ago. IBM has finally woken up and written:

      Problem(Abstract)
      Server running AIX with Oracle RAC reboots itself with no warning

      Symptom
      AIX server shuts down and/or reboots.

      A REBOOT_ID is logged in /var/adm/ras/errlog indicating “SYSTEM SHUTDOWN BY USER” although no shutdown or reboot command was issued by any user.

      example error message…

      LABEL: REBOOT_ID
      IDENTIFIER: 2BFA76F6

      Date/Time: Wed Dec 3 08:19:09 2008
      Sequence Number: 1447
      Machine Id: 0000ABCD1234
      Node Id: nodeA
      Class: S
      Type: TEMP
      Resource Name: SYSPROC

      Description
      SYSTEM SHUTDOWN BY USER

      Probable Causes
      SYSTEM SHUTDOWN

      Detail Data
      USER ID
      0
      0=SOFT IPL 1=HALT 2=TIME REBOOT
      0
      TIME TO REBOOT (FOR TIMED REBOOT ONLY)
      0

      Cause
      Oracle Real Application Clusters (RAC) is known to reboot the operating system with no warning due to configuration of the oprocd daemon

      Environment
      AIX with Oracle RAC

      Diagnosing the problem
      Oracle Real Application Clusters (RAC) typically runs a process called oprocd.
      The idea behind oprocd is straightforward: its goal is to provide I/O fencing. oprocd sets a timer and then sleeps. If, when it wakes up and is scheduled onto a CPU again, it sees that more time has passed than the acceptable margin allows, oprocd decides to reboot the node.
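      The check described above can be sketched as a small shell function. This is only an illustration of the decision, not oprocd's actual code; the function name oprocd_verdict and its three arguments are hypothetical, and all times are in milliseconds.

```shell
# Sketch of the decision oprocd is described as making after each sleep.
# Hypothetical helper, not part of Oracle Clusterware.
oprocd_verdict() {
    interval_ms=$1   # how long the process asked to sleep
    margin_ms=$2     # how much oversleep is tolerated
    elapsed_ms=$3    # how long it really took to get back onto a CPU

    if [ "$elapsed_ms" -gt $(( $interval_ms + $margin_ms )) ]; then
        echo "reboot"
    else
        echo "ok"
    fi
}

oprocd_verdict 1000 500 1200   # 200 ms late: within the margin
oprocd_verdict 1000 500 1800   # 800 ms late: margin exceeded
```

      The point of the sketch is that nothing has to be wrong with the cluster itself: any stall that keeps the process off the CPU for longer than the interval plus the margin is enough to produce the "reboot" verdict.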

      You can check for the oprocd process with the ps command…

      # ps -ef | grep oprocd
      root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f

      These options tell oprocd to wake up every 1000 ms (-t 1000) and to tolerate up to 500 ms of delay in waking up (-m 500) before rebooting. In other words, if oprocd wakes up more than 1.5 seconds after it went to sleep, it forces a reboot.
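      To read the effective limit off a running system, the -t and -m values can be pulled out of the command line and added together. A minimal sketch, assuming the command line has the same shape as in the ps output above; the helper name oprocd_threshold_ms is hypothetical.

```shell
# Hypothetical helper: given an oprocd command line, print the wake-up
# delay in milliseconds (interval + margin) beyond which a reboot is forced.
oprocd_threshold_ms() {
    interval_ms=
    margin_ms=
    # Intentionally unquoted so the command line is split into words.
    set -- $1
    while [ $# -gt 1 ]; do
        case $1 in
            -t) interval_ms=$2 ;;
            -m) margin_ms=$2 ;;
        esac
        shift
    done
    if [ -z "$interval_ms" ] || [ -z "$margin_ms" ]; then
        echo "could not find both -t and -m" >&2
        return 1
    fi
    echo $(( $interval_ms + $margin_ms ))
}

oprocd_threshold_ms "/u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f"
```

      For the command line shown above this prints 1500, which matches the 1.5 seconds in the text.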

      Resolving the problem
      The timeout and margin are computed from the diagwait and reboot time settings. Changing them by editing the init.cssd file is not recommended; use the command ‘crsctl set css diagwait’ instead.
      There is a formula involved in the calculation of the times. For example, if the reboot time is 3 and you submit a diagwait setting of 13 you will get -t 1000 -m 10000.

      # crsctl set css diagwait 13 -force

      # ps -ef | grep oprocd
      root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 10000 -f

      You can see that the margin has changed to 10000 ms, that is, 10 seconds instead of the default 0.5 seconds. This is a 20-fold increase and makes oprocd far less trigger-happy about rebooting the node.
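      The technote does not spell the formula out. The single example it gives (reboot time 3, diagwait 13, margin 10000 ms) is consistent with margin = (diagwait - reboot time) * 1000, so the sketch below assumes that relationship; treat it as a guess to be confirmed with Oracle Support rather than a documented rule. The function name oprocd_margin_ms is hypothetical.

```shell
# Assumed relationship, inferred from the one example in the text:
#   margin (ms) = (diagwait - reboot time) * 1000
# with both inputs taken to be in seconds.
oprocd_margin_ms() {
    diagwait_s=$1
    reboottime_s=$2
    echo $(( ($diagwait_s - $reboottime_s) * 1000 ))
}

oprocd_margin_ms 13 3   # the example from the text: prints 10000
```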

      The AIX recommended diagwait value is 30 seconds (30000 milliseconds).
      Please advise the customer to change the diagwait value to the AIX-recommended value or greater.

      IBM recommends the customer contact Oracle Support before modifying this value.

    • #7084
      Дмитрий
      Member

      1. And not only on AIX.
      2. They still don't believe it.
