2005-11-21 10:58 beginner-bj
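HACMP: after unplugging both network cables, the resource group stays in the Releasing state for a long time.

As the title says, could someone take a look at what is going on here?
HACMP 5.2, AIX 5.3

1. Initially, clstat shows the same result on both machines: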


                clstat - HACMP Cluster Status Monitor
                -------------------------------------

Cluster: testdb_ha       (1130856062)
Mon Nov 21 09:05:06 2005
                State: UP               Nodes: 2
                SubState: STABLE


        Node: TESTDB1            State: UP
           Interface: testdb1_boot1 (0)          Address: 10.68.68.1
                                                State:   UP
           Interface: testdb1_boot2 (0)          Address: 10.68.168.1
                                                State:   UP
           Interface: TESTDB1_tty0_01 (1)        Address: 0.0.0.0
                                                State:   UP
           Interface: testdb_svc (0)            Address: 10.181.168.105
                                                State:   UP
           Resource Group: testdb                        State:  On line

        Node: TESTDB2            State: UP
           Interface: testdb2_boot1 (0)          Address: 10.68.68.2
                                                State:   UP
           Interface: testdb2_boot2 (0)          Address: 10.68.168.2
                                                State:   UP
           Interface: TESTDB2_tty0_01 (1)        Address: 0.0.0.0
                                                State:   UP           



2. I then unplugged both network cables from TESTDB1 and watched clstat on TESTDB2. For a long time (more than 10 minutes) it showed the following:

                clstat - HACMP Cluster Status Monitor
                -------------------------------------

Cluster: testdb_ha       (1130856062)
Mon Nov 21 09:36:07 2005
                State: UP               Nodes: 2
                SubState: UNSTABLE


        Node: TESTDB1            State: UP
           Interface: testdb1_boot1 (0)          Address: 10.68.68.1
                                                State:   DOWN
           Interface: testdb1_boot2 (0)          Address: 10.68.168.1
                                                State:   DOWN
           Interface: TESTDB1_tty0_01 (1)        Address: 0.0.0.0
                                                State:   UP
           Interface: testdb_svc (0)            Address: 10.181.168.105
                                                State:   DOWN
           Resource Group: testdb                        State:  Releasing

        Node: TESTDB2            State: UP
           Interface: testdb2_boot1 (0)          Address: 10.68.68.2
                                                State:   UP
           Interface: testdb2_boot2 (0)          Address: 10.68.168.2
                                                State:   UP
           Interface: TESTDB2_tty0_01 (1)        Address: 0.0.0.0
                                                State:   UP

At the same time I ran tail -f /tmp/hacmp.out on TESTDB2, with this result:
testdb:rg_move[310] [ -f /tmp/.NFSSTOPPED ]
testdb:rg_move[316] [ -f /tmp/.RPCLOCKDSTOPPED ]
testdb:rg_move[334] exit 0
Nov 21 09:36:03 EVENT COMPLETED: rg_move TESTDB1 1 RELEASE 0

+ exit 0
Nov 21 09:36:03 EVENT COMPLETED: rg_move_release TESTDB1 1 0


Nov 21 09:42:03 EVENT START: config_too_long 360 /usr/es/sbin/cluster/events/rg_move.rp

:config_too_long[64] [[ high = high ]]
:config_too_long[64] version=1.11
:config_too_long[65] :config_too_long[65] cl_get_path
HA_DIR=es
:config_too_long[67] NUM_SECS=360
:config_too_long[68] EVENT=/usr/es/sbin/cluster/events/rg_move.rp
:config_too_long[70] HOUR=3600
:config_too_long[71] THRESHOLD=5
:config_too_long[72] SLEEP_INTERVAL=1
:config_too_long[78] PERIOD=30
:config_too_long[81] set -u
:config_too_long[86] LOOPCNT=0
:config_too_long[87] MESSAGECNT=0
:config_too_long[88] :config_too_long[88] cllsclstr -c
:config_too_long[88] cut -d : -f2
:config_too_long[88] grep -v cname
CLUSTER=testdb_ha
:config_too_long[89] TIME=360
:config_too_long[90] sleep_cntr=0
:config_too_long[95] [ -x /usr/lpp/ssp/bin/spget_syspar ]
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 360 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 390 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 420 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 450 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 480 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 540 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 600 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 660 seconds. Please check cluster status.
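
To see at a glance where the time goes in output like the above, the event boundaries and the most recent config_too_long warning can be pulled out of hacmp.out. This is only a minimal sketch: the function name summarize_hacmp_log is made up for illustration, and the field positions assume the line layout shown in this post.

```shell
#!/bin/sh
# Hypothetical helper (not part of HACMP): read an hacmp.out-style log on
# stdin and print each EVENT START / EVENT COMPLETED line as
# "<kind> <time>  <event name>", followed by the seconds value from the
# last "has been running recovery program" warning, if any.
summarize_hacmp_log() {
    awk '
        # Assumed layout: Mon DD HH:MM:SS EVENT START: <name> ...
        /EVENT START:/     { print "START     " $3 "  " $6 }
        /EVENT COMPLETED:/ { print "COMPLETED " $3 "  " $6 }
        /^WARNING: Cluster .* has been running recovery program/ {
            # The number sits just before the word "seconds."
            for (i = 2; i <= NF; i++)
                if ($i == "seconds.") last = $(i - 1)
        }
        END { if (last != "") print "LAST_WARNING " last "s" }
    '
}

# Usage (log path as used in this thread):
#   summarize_hacmp_log < /tmp/hacmp.out
```

In the log above this would show rg_move_release completing at 09:36:03 and nothing further until config_too_long starts at 09:42:03, which is the gap in question.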

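3. More than 10 minutes after unplugging, I plugged both of TESTDB1's cables back in. Only then did tail -f /tmp/hacmp.out on TESTDB2 produce a large amount of output, and about one minute later clstat showed: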

                clstat - HACMP Cluster Status Monitor
                -------------------------------------

Cluster: testdb_ha       (1130856062)
Mon Nov 21 09:48:31 2005
                State: UP               Nodes: 2
                SubState: STABLE


        Node: TESTDB1            State: UP
           Interface: testdb1_boot1 (0)          Address: 10.68.68.1
                                                State:   UP
           Interface: testdb1_boot2 (0)          Address: 10.68.168.1
                                                State:   UP
           Interface: TESTDB1_tty0_01 (1)        Address: 0.0.0.0
                                                State:   UP

        Node: TESTDB2            State: UP
           Interface: testdb2_boot1 (0)          Address: 10.68.68.2
                                                State:   UP
           Interface: testdb2_boot2 (0)          Address: 10.68.168.2
                                                State:   UP
           Interface: TESTDB2_tty0_01 (1)        Address: 0.0.0.0
                                                State:   UP
           Interface: testdb_svc (0)            Address: 10.181.168.105
                                                State:   UP
           Resource Group: testdb                        State:  On line

2005-11-21 12:52 老农
Your application is not stopping. Is there a problem with the script?

2005-11-21 13:03 beginner-bj
Here is the script:
TESTDB2#/ha52> cat /ha52/stop
date >> /ha52/stop.log
banner "stop app1 " >> /tmp/hacmp.out
su - db2admin -c /home/db2admin/sqllib/adm/db2stop force
TESTDB2#/ha52>

One more thing: why does the resource group get taken over right away once the cables are plugged back in?
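
The stop script above only logs the time it starts. To tell whether the stop command itself is what blocks, it may help to log when the command returns and with what exit status. A minimal sketch, where the function name run_logged and the default log path are assumptions made for illustration:

```shell
#!/bin/sh
# Hypothetical wrapper: record when a command starts, when it returns,
# and its exit status. LOG can be overridden by the caller.
LOG=${LOG:-/tmp/stop_wrapper.log}

run_logged() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') START: $*" >> "$LOG"
    rc=0
    "$@" || rc=$?          # keep the real exit status without aborting
    echo "$(date '+%Y-%m-%d %H:%M:%S') END rc=$rc: $*" >> "$LOG"
    return $rc
}

# Usage inside a stop script (command taken from the script above):
#   run_logged su - db2admin -c "/home/db2admin/sqllib/adm/db2stop force"
```

A START line with no matching END line for several minutes would point at the wrapped command rather than at the cluster event scripts.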

2005-11-21 18:41 beginner-bj
HACMP5.2 IY58496
AIX5.3  5300-03

It probably is not a patch problem either. I really cannot figure this out.

2005-11-21 19:54 lzolder
beginner-bj has an environment for playing with HACMP now, nice :lu2:

2005-11-21 20:00 beginner-bj
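[quote]Originally posted by [i]lzolder[/i] on 2005-11-21 19:54
beginner-bj has an environment for playing with HACMP now, nice :lu2: [/quote]
Yes, haha.
I have been at it on and off for a month and still keep getting stuck on simple problems :L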

2005-11-21 22:12 charly
Does a manual HACMP takeover work normally?

2005-11-21 22:38 beginner-bj
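[quote]Originally posted by [i]charly[/i] on 2005-11-21 22:12
Does a manual HACMP takeover work normally? [/quote]

A manual takeover definitely works.
I just went through hacmp.out again and have a small idea. I will try this at the office tomorrow and see whether it helps: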

nohup su - db2admin -c /home/db2admin/sqllib/adm/db2stop force

Hopefully it works, haha.

2005-11-21 23:00 charly
There is a patch for AIX 5.3, number IY71815.
Have you installed it?

Also, I suggest checking the TTY heartbeat. A heartbeat problem can cause this symptom too.

2005-11-22 21:36 逍遥160
I have no environment, so I can only read books.

2005-11-22 22:01 beginner-bj
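[quote]Originally posted by [i]charly[/i] on 2005-11-21 23:00
There is a patch for AIX 5.3, number IY71815.
Have you installed it?

Also, I suggest checking the TTY heartbeat. A heartbeat problem can cause this symptom too. [/quote]

This thread got bumped again, haha.

I did not install IY71815 specifically, but it is on the machine. Presumably installing IY58496 brought IY71815 along with it.

The TTY heartbeat is definitely fine; the clstat output I posted should show that as well.

I was busy with other things today and had no time for more testing, so the problem is still unresolved. But I did notice something interesting. With nothing else changed, running ifconfig en1 down; ifconfig en0 down lets the resources switch over smoothly, in about 2-3 minutes. Physically pulling both cables gives the behavior described above. And as soon as the cables are reconnected, the switchover completes quickly.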

2005-11-22 22:03 beginner-bj
[quote]Originally posted by [i]逍遥160[/i] on 2005-11-22 21:36
I have no environment, so I can only read books. [/quote]
Without an environment it is all just theory on paper, haha.

2005-11-23 09:07 beginner-bj
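I really cannot figure this out. No matter how I modify the /ha52/stop script, this is what happens on the machine whose cables were unplugged:
testdb:stop_server[104][color=Red] /ha52/stop The application stop script starts running here; events such as network_down can be seen before this point[/color]testdb:stop_server[104] ODMDIR=/etc/objrepos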

Nov 23 08:41:39 EVENT START: [color=Red]config_too_long[/color] 360 /usr/es/sbin/cluster/events/rg_move.rp

:config_too_long[64] [[ high = high ]]
:config_too_long[64] version=1.11
:config_too_long[65] :config_too_long[65] cl_get_path
HA_DIR=es
:config_too_long[67] NUM_SECS=360
:config_too_long[68] EVENT=/usr/es/sbin/cluster/events/rg_move.rp
:config_too_long[70] HOUR=3600
:config_too_long[71] THRESHOLD=5
:config_too_long[72] SLEEP_INTERVAL=1
:config_too_long[78] PERIOD=30
:config_too_long[81] set -u
:config_too_long[86] LOOPCNT=0
:config_too_long[87] MESSAGECNT=0
:config_too_long[88] :config_too_long[88] cllsclstr -c
:config_too_long[88] cut -d : -f2
:config_too_long[88] grep -v cname
CLUSTER=testdb_ha
:config_too_long[89] TIME=360
:config_too_long[90] sleep_cntr=0
:config_too_long[95] [ -x /usr/lpp/ssp/bin/spget_syspar ]
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 360 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 390 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 420 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 450 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 480 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 540 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 600 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 660 seconds. Please check cluster status.
WARNING: Cluster testdb_ha has been running recovery program '/usr/es/sbin/cluster/events/rg_move.rp' for 720 seconds. Please check cluster status.
[color=Red]At this point I started plugging the cables back in. The next two lines are the output of db2stop
11/23/2005 08:48:19     0   0   SQL1064N  DB2STOP processing was successful.
SQL1064N  DB2STOP processing was successful. [/color]
testdb:stop_server[106] [ 0 -ne 0 ]
testdb:stop_server[134] ALLNOERRSERV=All_nonerror_servers
testdb:stop_server[135] [ REAL = EMUL ]
testdb:stop_server[140] cl_RMupdate resource_down All_nonerror_servers stop_server
Reference string: Wed.Nov.23.08:48:19.BEIST.2005.stop_server.All_nonerror_servers.testdb.ref
testdb:stop_server[143] exit 0
Nov 23 08:48:19 EVENT COMPLETED: stop_server app1 0

testdb:node_down_local[208] server_release_lpar_resources app1
testdb:server_release_lpar_resources[380] [[ high == high ]]
testdb:server_release_lpar_resources[380] version=1.1
testdb:server_release_lpar_resources[382] typeset HOSTNAME
testdb:server_release_lpar_resources[383] typeset MANAGED_SYSTEM
testdb:server_release_lpar_resources[384] typeset HMC_IP

It seems db2stop is just too clever: it can tell whether a cable is plugged into the network port and refuses to run without one. :L:L

2005-11-23 09:12 beginner-bj
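Is there really a difference between ifconfig en1 down; ifconfig en0 down and physically pulling both cables?

Also, both machines are 9113-550s, and all the network adapters are integrated ones.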

2005-11-23 09:30 beginner-bj
The HACMP for AIX 5L V5.2 Planning and Installation Guide has an Oracle example, and what it says is baffling too.


Example 1: Oracle Database
The Oracle Database, like many databases, functions very well under HACMP. It is a robust application that handles failures well. It can roll back uncommitted transactions after a fallover and return to service in a timely manner. There are, however, a few things to keep in mind when using Oracle Database under HACMP.

Starting Oracle
Oracle must be started by the Oracle user ID. Thus, the start script should contain an su - oracleuser. The dash (-) is important since the su needs to take on all characteristics of the Oracle user and reside in the Oracle user’s home directory. The command would look something like this:

su - oracleuser -c "/apps/oracle/startup/dbstart"

Commands like dbstart and dbshut read the /etc/oratabs file for instructions on which database instances are known and should be started. In certain cases it is inappropriate to start all of the instances, because they might be owned by another node. This would be the case in the mutual takeover of two Oracle instances. The oratabs file typically resides on the internal disk and thus cannot be shared. If appropriate, consider other ways of starting different Oracle instances.

Stopping Oracle
The stopping of Oracle is a process of special interest. There are several different ways to ensure Oracle has completely stopped. The recommended sequence is this: first, implement a graceful shutdown; second, call a shutdown immediate, which is a bit more forceful method; finally, create a loop to check the process table to ensure all Oracle processes have exited.
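
The sequence the guide describes (graceful stop, then a more forceful stop, then a loop that checks whether anything is still running) could be sketched roughly as below. This is not taken from the guide: the function name stop_with_escalation, its arguments, and the commands in the usage comment are all hypothetical, and this version only runs the forceful command if the check says something is still running.

```shell
#!/bin/sh
# Hypothetical sketch of a three-step stop:
#   $1  graceful stop command (string, run via sh -c)
#   $2  forceful stop command (string, run via sh -c)
#   $3  check command that exits 0 while processes are still running
#   $4  maximum number of checks, one second apart (default 30)
# Returns 0 once the check reports nothing running, 1 if it gives up.
stop_with_escalation() {
    graceful=$1
    forceful=$2
    still_running=$3
    tries=${4:-30}

    # Step 1: graceful shutdown.
    sh -c "$graceful" || echo "graceful stop returned non-zero" >&2

    # Step 2: escalate only if something is still there.
    if sh -c "$still_running"; then
        sh -c "$forceful" || echo "forceful stop returned non-zero" >&2
    fi

    # Step 3: poll until the processes are gone or we run out of tries.
    i=0
    while sh -c "$still_running"; do
        i=$((i + 1))
        [ "$i" -ge "$tries" ] && return 1
        sleep 1
    done
    return 0
}

# Usage (paths and process pattern are assumptions, not from the guide):
#   stop_with_escalation \
#       "su - oracleuser -c /apps/oracle/startup/dbshut" \
#       "su - oracleuser -c /apps/oracle/startup/shutdown_immediate.sh" \
#       "ps -ef | grep '[o]ra_pmon' > /dev/null" \
#       60
```

Returning non-zero when the loop gives up lets the calling stop script decide what to do next instead of blocking the cluster event indefinitely.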

2008-5-31 03:17 ChaosLegion
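Right, it is a script problem: the application did not stop. Be careful to stop the application before the database. If you stop the database first, the application will hang right there, and naturally HA will not be able to release the resources.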

2008-5-31 15:29 beginner-bj
This thread is too old; I do not remember what happened. These days I suspect the patches simply were not all applied.

2008-6-3 16:15 xuandhe
The poster two posts up likes digging up old threads.
