Wednesday, March 7, 2012

Checkpoint SPID Blocked

I have a nifty problem with SQL 7.0 (build 1063). Perhaps someone here could
shed some light on it. We have a situation where the checkpoint process is
hung waiting on a PAGEIOLATCH_SH to release for a page that doesn't exist.
About midnight last night it appears that an I/O was attempted on this
non-existent page. The I/O processor of course threw a tantrum about this,
as specified in these SQL log records:
2004-12-13 00:12:43.95 spid363 Error: 5180, Severity: 22, State: 1
2004-12-13 00:12:43.95 spid363 Could not open FCB for invalid file
ID 116 in database 'DGFE3x00Prod'. Table or database may be corrupted..
SQL just kept going about its business up until the point where another spid
attempted to execute a read on this same page. These messages started
sprouting up in the log:
2004-12-13 00:26:49.45 spid243 Time out occurred while waiting for
buffer latch type 2, bp 0x14e54f80, page (116:16), stat 0x40d, object ID
29:1690181632:0, waittime 500. Continuing to wait.
The problem is that file #116 doesn't exist for this database. There are two
files: the data and the log. I checked sysfiles and the file ID's are 1 and
2. Things really went sideways when the checkpoint process came along and
attempted to flush the page. Now this process is blocked because of the
outstanding share latch on this page:
2004-12-13 02:09:07.19 spid5 Time out occurred while waiting for
buffer latch type 3, bp 0x14e54f80, page (116:16), stat 0x40d, object ID
29:1690181632:0, waittime 500. Continuing to wait.
2004-12-13 02:10:47.19 spid5 Error: 845, Severity: 17, State: 2
With the checkpoint process blocked, we have no ability to back up this
database now. This is the second time that this has happened, the first
occurring about three weeks ago. The first time we just restarted SQL Server
and the problem went away. I chalked it up to an anomaly until it recurred
today. The database is still running and the website that it stores data for
is doing fine, but this outstanding latch is blocking our ability to back
up. I'd like the problem to go away for good without requiring a restart of
SQL Server when it happens.
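(For reference, the sysfiles check mentioned above is a one-liner; this is a sketch rather than the exact statement that was run:)
USE DGFE3x00Prod
SELECT fileid, name, filename FROM sysfiles
-- returns only fileid 1 (data) and 2 (log), so file ID 116 shouldn't exist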
Here's the output from nonzero sysprocesses:
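(The listing presumably came from something along these lines; the exact query wasn't posted, and the waittime filter is an assumption based on "nonzero":)
SELECT spid, kpid, blocked, waittype, waittime, lastwaittype, waitresource, dbid
FROM master..sysprocesses
WHERE waittime > 0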
spid  kpid  blocked  waittype  waittime  lastwaittype    waitresource  dbid
----  ----  -------  --------  --------  --------------  ------------  ----
2     0     0        0x0000    3157      MISCELLANEOUS                 0
3     0     0        0x0000    954       MISCELLANEOUS                 0
4     0     0        0x0000    110       MISCELLANEOUS                 0
5     0     0        0x0423    26141     PAGEIOLATCH_UP  29:116:16     29
243   3452  0        0x0422    62719     PAGEIOLATCH_SH  29:116:16     29
We have a case open with Microsoft on this as I'm pretty sure that it's a
bug somewhere, probably in the free page list maintenance. I'm guessing that
the fileID and the pageID got concatenated somehow and the free list was
updated to reference page 116:16 when it should have been page 1:16. DBCC
CheckDB and DBCC Checkalloc didn't turn up any allocation errors. I ran a
dbcc page on page 1:16 and nothing out of the ordinary was there. Obviously
running a dbcc page on page 116:16 threw an error. Has anyone experienced
this before?
Kevin Stark
Director, Technology Services
Sigma Micro
You really, really need to run DBCC CHECKDB on this database.
Sincerely,
Anthony Thomas

"Kevin Stark" <SENDkevo97NO@.POTTEDhotMEATmailHERE.com> wrote in message
news:eNDe0eV4EHA.2876@.TK2MSFTNGP12.phx.gbl...
|||My original post said that I ran that already. Nothing turned up. I'll
repost the last paragraph since Outlook Express apparently mangled it:
"We have a case open with Microsoft on this as I'm pretty sure that it's a
bug somewhere, probably in the free page list maintenance. I'm guessing that
the fileID and the pageID got concatenated somehow and the free list was
updated to reference page 116:16 when it should have been page 1:16. DBCC
CheckDB and DBCC Checkalloc didn't turn up any allocation errors. I ran a
dbcc page on page 1:16 and nothing out of the ordinary was there. Obviously
running a dbcc page on page 116:16 threw an error. Has anyone experienced
this before?"
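(For anyone following along, the DBCC PAGE calls referenced above look roughly like this; DBCC PAGE is undocumented, so the argument order here -- database, file ID, page number, print option -- is an assumption rather than anything from Books Online:)
DBCC TRACEON (3604)                    -- route DBCC output to the client session
DBCC PAGE ('DGFE3x00Prod', 1, 16, 0)   -- page 1:16, comes back clean
DBCC PAGE ('DGFE3x00Prod', 116, 16, 0) -- page 116:16, errors out because file 116 doesn't exist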
"AnthonyThomas" <Anthony.Thomas@.CommerceBank.com> wrote in message
news:OgXvuKY4EHA.3092@.TK2MSFTNGP10.phx.gbl...
|||Yes, I have run into mangled page links before. An index rebuild and
DBCC CHECKDB found them, and I actually had to use the REPAIR_ALLOW_DATA_LOSS
option, but that fixed it.
Can you back up and then verify the backup? If so, get one; then I would
try to detach and reattach. Then rerun the checks and index rebuilds.
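(A rough sketch of that sequence; the backup and file paths below are placeholders, not anything taken from the thread:)
BACKUP DATABASE DGFE3x00Prod TO DISK = 'E:\backup\DGFE3x00Prod.bak'   -- placeholder path
RESTORE VERIFYONLY FROM DISK = 'E:\backup\DGFE3x00Prod.bak'
EXEC sp_detach_db 'DGFE3x00Prod'
EXEC sp_attach_db 'DGFE3x00Prod',
    'E:\data\DGFE3x00Prod.mdf',        -- placeholder data file path
    'E:\data\DGFE3x00Prod_log.ldf'     -- placeholder log file path
DBCC CHECKDB ('DGFE3x00Prod')
DBCC CHECKALLOC ('DGFE3x00Prod')
-- then DBCC DBREINDEX on each suspect table for the index rebuilds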
If you have an MS PSS case open, which way are they leaning?
Sincerely,
Anthony Thomas

"Kevin Stark" <SENDkevo97NO@.POTTEDhotMEATmailHERE.com> wrote in message
news:ONai5kY4EHA.4092@.TK2MSFTNGP14.phx.gbl...
|||No clear idea right now from PSS. I'm going to have to restart the service
tonight, so that will probably kill our debugging efforts. We haven't had a
backup now for two days on this particular database and I'm getting quite
nervous about it. I will take your advice and try detaching and reattaching
and then check and rebuild the indexes after the backup. Thanks.
"AnthonyThomas" <Anthony.Thomas@.CommerceBank.com> wrote in message
news:eHhc9Ge4EHA.2964@.TK2MSFTNGP15.phx.gbl...
|||Silly thought -
Could you run a DTS export to another database and then back that up? Of
course, this would probably blow up when the export tried to read from the
non-existent page. Some variation on this (a table-by-table SELECT INTO
SomeOtherDatabase?) would at least let you isolate any bad tables, and then
you could work on recovering those.
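(Something along these lines per table; the table name is made up and SomeOtherDatabase is just a scratch database:)
SELECT *
INTO SomeOtherDatabase.dbo.Orders   -- hypothetical target table
FROM DGFE3x00Prod.dbo.Orders        -- hypothetical source table; repeat for each table
-- a table whose page chain runs through the phantom 116:16 page would presumably fail here,
-- which is what would isolate the bad table(s)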
Yeah, I know. Too weird for serious consideration.
:p
Bob Castleman
SuccessWare Software
"Kevin Stark" <SENDkevo97NO@.POTTEDhotMEATmailHERE.com> wrote in message
news:ueArmre4EHA.1392@.tk2msftngp13.phx.gbl...
|||That could potentially work, since the blocking latch is a share latch. I
don't have anything that would lead me to believe that I have corrupted
data, though. The DBCC's don't report any errors. The website that this
supports is humming along just fine, too.
When this happened last month, we just restarted SQL and things went back to
normal. We'll run some tests when we restart the service tonight, but I
haven't turned up anything that would lead me to believe that the data
itself is corrupted. The only item of note is that there is a mixed extent
in tempdb and the one free page in it is not included in the SGAM. Other
than that, there are no errors reported by DBCC's.
My suspicion is that somehow a fargled page ID was inserted into the free
list. The page in question is 29:116:16. Given that there are only two files
associated with this database, something is out of whack. SQL behaved
correctly in that it spewed an error message when attempting to access that
page. When it encountered this error, however, I think what happened was
that it did not remove this page from the list of available pages and it did
not remove it from the list of dirty pages. The checkpoint process came
along and is now waiting to flush this page to disk. It's waiting on a spid
that attempted an I/O against this 29:116:16 page, acquired a PAGEIOLATCH_SH latch
to do the I/O, and is still waiting for the I/O to complete because the page
doesn't exist. (We can't kill the blocking spid...it was one of the first
things we tried.)
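(For completeness, the kill attempt amounts to the following; the spid comes from the sysprocesses listing earlier, and as noted it doesn't clear anything because the phantom I/O never completes:)
KILL 243   -- the spid holding the PAGEIOLATCH_SH on 29:116:16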
I really wish that I had the query that caused the initial FCB error message
in the error log. Since the page referenced is an invalid page, I can't look
at the contents of the page in question, so I don't know which objects were
being accessed. Are there any hidden trace flags available that will
dump the input buffer to the error log when an error like 5180 happens?
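(For what it's worth, while a suspect spid is still connected its current statement can be pulled by hand with DBCC INPUTBUFFER; that doesn't answer the trace-flag question, since it only works on demand and after the fact:)
DBCC INPUTBUFFER (243)   -- last statement submitted by spid 243, if it's still connected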
"Bob Castleman" <nomail@.here> wrote in message
news:u0G9pDf4EHA.2196@.TK2MSFTNGP14.phx.gbl...
|||I offered using DTS only as a suggestion since flying without a backup gives
me the willies and I figured it might offer an immediate way out.
Unfortunately, the root of the problem is beyond me. This is one for the MS
sorcerers.
And what kind of word is "fargled"? I love it!
Bob.
"Kevin Stark" <SENDkevo97NO@.POTTEDhotMEATmailHERE.com> wrote in message
news:u8FZ$lf4EHA.1188@.tk2msftngp13.phx.gbl...
|||Yes, thanks for the suggestion.
"Fargled" is a nice substitute for the f-word that I heard one time. Somehow
it worked its way into my vernacular.
"Bob Castleman" <nomail@.here> wrote in message
news:Ow3WfPh4EHA.3504@.TK2MSFTNGP12.phx.gbl...
|||fwiw...Restarting the SQL service cleared the problem. Nothing showed up in
the DBCC's after backing up, verifying, detaching, and reattaching. We now know
for certain the initial markers to look for, so I think we'll have a bigger
window to involve PSS in the debugging the next time this occurs.
Thanks Anthony and Bob for your input.
"Kevin Stark" <SENDkevo97NO@.POTTEDhotMEATmailHERE.com> wrote in message
news:O1tPNdh4EHA.4008@.TK2MSFTNGP15.phx.gbl...
