Server running AIX with Oracle RAC reboots itself


This topic has 1 reply and 2 participants, and was last updated by Дмитрий 9 years ago.

  • Author
    Posts
  • #7019

    andrewk
    Member

    I tried, without success, to prove this to our Oracle people more than a year ago. IBM has finally woken up and written:

    Problem(Abstract)
    Server running AIX with Oracle RAC reboots itself with no warning

    Symptom
    AIX server shuts down and/or reboots.

    A REBOOT_ID is logged in /var/adm/ras/errlog indicating “SYSTEM SHUTDOWN BY USER” although no shutdown or reboot command was issued by any user.

    Example error message:

    LABEL: REBOOT_ID
    IDENTIFIER: 2BFA76F6

    Date/Time: Wed Dec 3 08:19:09 2008
    Sequence Number: 1447
    Machine Id: 0000ABCD1234
    Node Id: nodeA
    Class: S
    Type: TEMP
    Resource Name: SYSPROC

    Description
    SYSTEM SHUTDOWN BY USER

    Probable Causes
    SYSTEM SHUTDOWN

    Detail Data
    USER ID
    0
    0=SOFT IPL 1=HALT 2=TIME REBOOT
    0
    TIME TO REBOOT (FOR TIMED REBOOT ONLY)
    0

    Cause
    Oracle Real Application Clusters (RAC) is known to reboot the operating system with no warning because of how the oprocd daemon is configured.

    Environment
    AIX with Oracle RAC

    Diagnosing the problem
    Oracle Real Application Clusters (RAC) typically runs a process called oprocd.
    The idea of oprocd is quite straightforward: its goal is to provide I/O fencing. oprocd sets a timer and then sleeps. If, when it wakes up and is scheduled onto a CPU again, it finds that more time has passed than the acceptable margin allows, it reboots the node.
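The timer-then-sleep check described above can be sketched in a few lines. This is only an illustration of the idea, not Oracle's implementation: the function names `overslept` and `watchdog_tick` are hypothetical, and the sketch returns a boolean where the real daemon would reboot the node.

```python
import time


def overslept(interval_ms: float, margin_ms: float, elapsed_ms: float) -> bool:
    """Return True if the wake-up came later than interval + margin."""
    return elapsed_ms > interval_ms + margin_ms


def watchdog_tick(interval_ms, margin_ms, sleep=time.sleep, clock=time.monotonic):
    """Sleep for one interval and report whether the wake-up was too late.

    `sleep` and `clock` are injectable so the check can be exercised
    without actually stalling the machine (an assumption of this sketch).
    """
    start = clock()
    sleep(interval_ms / 1000.0)
    elapsed_ms = (clock() - start) * 1000.0
    return overslept(interval_ms, margin_ms, elapsed_ms)
```

On a healthy system `watchdog_tick(1000, 500)` should return False. If the process is starved of CPU for long enough (heavy paging, a hung I/O path, a frozen scheduler), the measured sleep exceeds the limit and the check fires.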

    You can check for the oprocd process with the ps command…

    # ps -ef | grep oprocd
    root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f

    These options tell oprocd to wake up every 1000 ms (-t 1000) and to tolerate a wake-up delay of up to 500 ms (-m 500) before rebooting. In other words, if oprocd wakes up more than 1.5 seconds after going to sleep, it forces a reboot.
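To read the effective limit from a running node, the -t and -m values can be pulled out of the command line shown by ps and added together. A minimal sketch, assuming the command line has the same shape as the example above; the helper names are hypothetical.

```python
import re


def parse_oprocd_args(cmdline: str) -> tuple[int, int]:
    """Extract the -t (interval) and -m (margin) values, in milliseconds."""
    t = re.search(r"-t\s+(\d+)", cmdline)
    m = re.search(r"-m\s+(\d+)", cmdline)
    if t is None or m is None:
        raise ValueError("no -t/-m options found in: " + cmdline)
    return int(t.group(1)), int(m.group(1))


def reboot_threshold_ms(cmdline: str) -> int:
    """Longest wake-up delay oprocd tolerates: interval plus margin."""
    interval_ms, margin_ms = parse_oprocd_args(cmdline)
    return interval_ms + margin_ms


cmd = "/u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 500 -f"
print(reboot_threshold_ms(cmd))  # 1500
```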

    Resolving the problem
    The timeout and margin times are computed from diagwait and the reboot time. Changing them via the init.cssd file is not recommended; use the command ‘crsctl set css diagwait’ instead.
    There is a formula involved in the calculation of the times. For example, if the reboot time is 3 and you submit a diagwait setting of 13 you will get -t 1000 -m 10000.

    # crsctl set css diagwait 13 -force

    # ps -ef | grep oprocd
    root 221672 1 0 08:27:44 - 0:00 /u01/crs/oracle/product/10.2.0/crs_1/bin/oprocd run -t 1000 -m 10000 -f

    You can see that the margin has changed to 10000 ms, that is, 10 seconds instead of the default 0.5 seconds. This is a 20-fold increase and makes oprocd much less trigger-happy about rebooting the node.
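The technote does not spell out the formula, but its example (reboot time 3, diagwait 13, resulting margin 10000 ms) is consistent with margin = (diagwait - reboot time) x 1000 ms. The sketch below works under that assumption, with both inputs taken to be in seconds; the function name is hypothetical, and the result should be confirmed with Oracle Support before relying on it.

```python
def oprocd_margin_ms(diagwait_s: int, reboottime_s: int) -> int:
    """Assumed relation: margin = (diagwait - reboot time), in milliseconds."""
    if diagwait_s <= reboottime_s:
        raise ValueError("diagwait must be greater than the reboot time")
    return (diagwait_s - reboottime_s) * 1000


margin = oprocd_margin_ms(13, 3)
print(margin)         # 10000
print(margin // 500)  # 20, the increase over the default 500 ms margin
```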

    The AIX recommended diagwait value is 30 seconds (30000 milliseconds).
    Please advise customer to change this diagwait value to the AIX recommended value or greater.

    IBM recommends the customer contact Oracle Support before modifying this value.

  • #7084

    Дмитрий
    Member

    1. And not only on AIX.
    2. They still don't believe it.
