 2014-04-26  cgf-000026

 Forgot to clear to the end of screen when restoring a screen buffer.
 That worked, for some reason, with Take Command but not with normal
 consoles.  I don't remember why I didn't resize the screen like a Linux
 X terminal emulator but that might have made things work a little
 better.  Right now, there is a scroll bar for apps like less or vi and
 that doesn't feel right.

 2014-03-29  cgf-000025

 Reorganized _cygtls::signal_debugger to avoid sending anything to the
 debugger if we've seen an exception.  I think it used to work that way
 and I changed it without noting why.  It sure seems like, if we don't do
 this, gdb will see two signals (and it really does) when there has been
 a Windows-recognized exception.

 2014-02-15  cgf-000024

 Wow.  It's hard getting the screen handling stuff working correctly when
 there is a screen buffer larger than screen size and vice versa.  These
 changes attempt to use SetConsoleWindowInfo whenever possible so that
 the contents of the screen buffer are never wiped out.  They also fix
 some previously misbehaving "scroll the screen" commands.

 2013-06-07  cgf-000023

 Given the fact that the signal thread never exits there is no need
 for exit_thread to ever block.  So, nuke this code.

 2013-01-31  cgf-000022

 While researching the lftp behavior reported here:

 http://cygwin.com/ml/cygwin/2013-01/msg00390.html

 after a frenzy of rewriting sigflush handling to avoid blocking in the
 signal thread (which is now and should always have been illegal), it
 dawned on me that we're not supposed to be flushing the tty input buffer
 every time a signal is received.  We're supposed to do this only when
 the user hits a character (e.g., CTRL-C) which initiates a signal
 action.  So, I removed sigflush from sigpacket::process and moved it to
 tc ()->kill_pgrp ().  This function should only be called to send
 signals related to the tty so this should have the desired effect.
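
 A minimal sketch of the split described above.  The tty_sketch type is
 hypothetical and only models the two delivery paths; it is not Cygwin's
 actual class layout.

```cpp
#include <cassert>
#include <csignal>
#include <string>
#include <vector>

// Hypothetical model of a tty; the member names mirror the note only.
struct tty_sketch
{
  std::string input_buffer;     // typed-ahead characters not yet read
  std::vector<int> delivered;   // signals handed to the process group

  // Generic delivery path (sigpacket::process in the note): no flush.
  void process (int sig) { delivered.push_back (sig); }

  // tty-originated path (kill_pgrp in the note): the user typed a
  // signal-generating character, so pending input is discarded first.
  void kill_pgrp (int sig)
  {
    input_buffer.clear ();
    delivered.push_back (sig);
  }
};
```

 With this shape, a signal sent by another process leaves typed-ahead
 input alone, while a CTRL-C style signal discards it.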

 2013-01-11  cgf-000021

 Apparently I got the signal handling semantics of select() wrong again
 even though I would have sworn that I tested this on Linux and Windows.

 select() is apparently *always* interrupted by a signal and *never*
 restarts.  Hopefully, between the comment added to the code and this
 note, I'll not make this mistake again.
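
 The behavior can be checked on Linux with a sketch like the following.
 The helper name select_errno_after_signal is made up for illustration.
 It installs a SIGALRM handler with SA_RESTART, which would restart most
 blocking calls, and reports what select() does instead.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <sys/select.h>
#include <sys/time.h>

static volatile sig_atomic_t got_alarm = 0;
static void on_alarm (int) { got_alarm = 1; }

// Returns the errno left by select() when SIGALRM arrives during the
// wait, or 0 if select() returned without an error.
int select_errno_after_signal ()
{
  struct sigaction sa = {};
  sa.sa_handler = on_alarm;
  sa.sa_flags = SA_RESTART;       // ask for restart; select() ignores it
  sigemptyset (&sa.sa_mask);
  sigaction (SIGALRM, &sa, nullptr);

  struct itimerval it = {};
  it.it_value.tv_usec = 50000;    // fire once after 50 ms
  setitimer (ITIMER_REAL, &it, nullptr);

  struct timeval tv = {2, 0};     // wait far longer than the timer
  int rc = select (0, nullptr, nullptr, nullptr, &tv);
  return rc == -1 ? errno : 0;
}
```

 On Linux the call is expected to fail with EINTR even though
 SA_RESTART was requested.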

 2013-01-02  cgf-000020

 (This entry should have been checked in with the changes but... I forgot)

 This is a fairly big revamp of the way that windows signals are handled.
 The intent is that all signal decisions should be made by the signal
 thread; not by the exception handler.

 This required the ability to pass information from the exception handler
 to the signal thread so, a si_cyg field was added to siginfo_t.  This
 contains information needed to generate a "core dump".  Hmm.  Haven't
 checked to see if this breaks Cygwin's hardly-ever-used real core dump
 facility.

 Anyway, I moved signal_exit back into exceptions.cc and removed it from
 the sigpacket class.  This function is now treated like a signal handler
 function - Cygwin will cause it to be dispatched in the context of
 whatever thread caught the signal.  signal_exit also makes the
 determination about when to write a stackdump.

 The signal-handler thread will no longer ever attempt to exit.  It will
 just keep processing signals (it will not process real signals after
 Cygwin starts shutting down, however).  This should make it impossible
 for the signal thread to ever block waiting for the process lock since
 it now never grabs the process lock.  The signal-handler thread will
 notify gdb when it gets a signal now but, in theory, gdb should see the
 context of the thread which received the signal, not the signal-handler
 thread.

 2012-12-28  cgf-000019

 (I forgot to mention that cgf-000018 was reverted.  Although I never saw
 a hang from this, I couldn't convince myself that one wasn't possible.)

 This fix attempts to correct a deadlock where, when a true Windows
 signal arrives, Windows creates a thread which "does stuff" and attempts
 to exit.  In the process of exiting Cygwin grabs the process lock.  If
 the signal thread has seen the signal and wants to exit, it can't
 because the newly-created thread now holds it.  But, since the new
 thread is relying on the signal thread to release its process lock,
 it exits and the process lock is never released.

 To fix this, I removed calls to _cygtls::signal_exit in favor of
 flagging that we were exiting by setting signal_exit_code (almost forgot
 to mark that NO_COPY: that would have been fun).  The new function
 setup_signal_exit() now handles setting things up so that the ReadFile
 loop in wait_sig will do the right thing when it terminates.  This
 function may just Sleep indefinitely if a signal is being sent from a
 thread other than the signal thread.  wait_sig() was changed so that it
 will essentially drop into asynchronous-read-mode when a signal which
 exits has been detected.  The ReadFile loop is exited when we know that
 the process is supposed to be exiting and there is nothing else in the
 signal queue.

 Although I never actually saw this happen, exit_thread() was also
 changed to release the process lock and just sleep indefinitely if it is
 detected that we are exiting.

 2012-12-21  cgf-000018

 Re: cgf-000017

 It occurred to me that just getting the process lock during
 DLL_THREAD_DETACH in dll_entry() might be adequate to fix this
 problem.  It's certainly much less intrusive.

 There are potential deadlock problems with grabbing a lock in
 this code, though, so this check-in will be experimental.

 2012-12-21  cgf-000017

 The changes in this set are to work around the issue noted here:

 http://cygwin.com/ml/cygwin/2012-12/threads.html#00140

 The problem is, apparently, that the return value of an ExitThread()
 will take precedence over the return value of TerminateProcess/ExitProcess
 if the thread is the last one exiting.  That's rather amazing...

 For the fix, I replaced all calls to ExitThread with exit_thread().  The
 exit_thread function creates a handle to the current thread and sends
 it in a packet via sig_send(__SIGTHREADEXIT).  Then it acquires the
 process lock and calls ExitThread.

 wait_sig will then wait for the handle, indicating that the thread has
 exited, and, when that has happened, removes the process lock on behalf
 of the now-defunct thread.  wait_sig will now also avoid actually
 exiting since it could trigger the same problem.

 Holding process_lock should prevent threads from exiting while a Cygwin
 process is shutting down.  They will just block forever in that case -
 just like wait_sig.

 2012-08-17  cgf-000016

 While debugging another problem I finally noticed that
 sigpacket::process was unconditionally calling tls->set_siginfo prior to
 calling setup_handler even though setup_handler could fail.  In the
 event of two successive signals, that would cause the second signal's
 info to overwrite the first even though the signal handler for the first
 would eventually be called.  Doh.

 Fixing this required passing the sigpacket si field into setup_handler.
 Making setup_handler part of the sigpacket class seemed to make a lot of
 sense so that's what I did.  Then I passed the si element into
 interrupt_setup so that the infodata structure could be filled out prior
 to arming the signal.
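
 A rough sketch of the ordering fix, using made-up stand-in types
 (siginfo_sketch, tls_state_sketch) rather than the real sigpacket and
 _cygtls classes.  The point is only that the info is recorded inside the
 path that can fail, never before it.

```cpp
#include <cassert>

struct siginfo_sketch { int signo; int code; };

struct tls_state_sketch
{
  bool handler_armed = false;          // a handler is set up but has not run
  siginfo_sketch infodata = {0, 0};    // info the pending handler will see
};

// Stand-in for interrupt_setup: record the info, then arm the signal.
static void interrupt_setup_sketch (tls_state_sketch &tls,
                                    const siginfo_sketch &si)
{
  tls.infodata = si;
  tls.handler_armed = true;
}

// Stand-in for sigpacket::setup_handler: it may fail, and it must not
// touch infodata when it does.
bool setup_handler_sketch (tls_state_sketch &tls, const siginfo_sketch &si)
{
  if (tls.handler_armed)
    return false;   // earlier signal still pending; caller retries later
  interrupt_setup_sketch (tls, si);
  return true;
}
```

 A second signal that cannot be set up yet then leaves the first
 signal's info intact.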

 The other changes checked in here eliminate the ResetEvent for
 signal_arrived since previous changes to cygwait should handle the
 case of spurious signal_arrived detection.  Since signal_arrived is
 not a manual-reset event, we really should just let the appropriate
 WFMO handle it.  Otherwise, there is a race where a signal comes in
 a "split second" after WFMO responds to some other event.  Resetting
 the signal_arrived would cause any subsequent WFMO to never be
 triggered.  My current theory is that this is what is causing:

 http://cygwin.com/ml/cygwin/2012-08/msg00310.html

 2012-08-15  cgf-000015

 RIP cancelable_wait.  Yay.

 2012-08-09  cgf-000014

 So, apparently I got it somewhat right before wrt signal handling.
 Checking on Linux, it appears that signals will be sent to a thread
 which can accept the signal.  So resurrecting and extending the
 "find_tls" function is in order.  This function will return the tls
 of any thread which 1) is waiting for a signal with sigwait*() or
 2) has the signal unmasked.
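
 The selection rule might be sketched as below.  thread_sketch and its
 bitmask fields are hypothetical, and preferring a sigwait*() thread over
 a merely unmasked one is an assumption of this sketch; the note only
 says that either kind of thread qualifies.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct thread_sketch
{
  int id;
  std::uint64_t blocked;      // bit (sig - 1) set: signal is masked
  std::uint64_t sigwaiting;   // bit (sig - 1) set: thread is in sigwait*()
};

static std::uint64_t sigbit (int sig)
{
  return std::uint64_t (1) << (sig - 1);
}

// Return a thread able to accept sig, or nullptr if none can.
const thread_sketch *find_tls_sketch (const std::vector<thread_sketch> &threads,
                                      int sig)
{
  for (const thread_sketch &t : threads)
    if (t.sigwaiting & sigbit (sig))
      return &t;              // 1) waiting for the signal in sigwait*()
  for (const thread_sketch &t : threads)
    if (!(t.blocked & sigbit (sig)))
      return &t;              // 2) has the signal unmasked
  return nullptr;             // every thread blocks it: leave it pending
}
```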

 In redoing this it became obvious that I had the class designation wrong
 for the threadlist handling so I moved the manipulation of the global
 threadlist into the cygheap where it logically belongs.

 2012-07-21  cgf-000013

 These changes reflect a revamp of the "wait for signal" functionality
 which has existed in Cygwin through several signal massages.

 We now create a signal event only when a thread is waiting for a signal
 and arm it only for that thread.  The "set_signal_arrived" function is
 used to establish the event and set it in a location referenceable by
 the caller.

 I still do not handle all of the race conditions.  What happens when
 a signal comes in just after a WF?O succeeds for some other purpose?  I
 suspect that it will arm the next WF?O call and the subsequent call to
 call_signal_handler could cause a function to get an EINTR when possibly
 it shouldn't have.

 I haven't yet checked all of the test cases for the URL listed in the
 previous entry.

 Baby steps.

 2012-06-12  cgf-000012

 These changes are the preliminary for redoing the way threads wait for
 signals.  The problems are shown by the test case mentioned here:

 http://cygwin.com/ml/cygwin/2012-05/msg00434.html

 I've known that the signal handling in threads wasn't quite right for
 some time.  I lost all of my thread signal tests in the great "rm -r"
 debacle of a few years ago and have been less than enthusiastic about
 redoing everything (I had PCTS tests and everything).  But it really is
 time to redo this signal handling to make it more like it is supposed to
 be.

 This change should not introduce any new behavior.  Things should
 continue to behave as before.  The major differences are a change in the
 arguments to cancelable_wait and the fact that cygwait now uses
 cancelable_wait, so the returns from cygwait now mirror cancelable_wait.

 The next change will consolidate cygwait and cancelable_wait into one
 cygwait function.

 2012-06-02  cgf-000011

 The refcnt handling was tricky to get right but I had convinced myself
 that the refcnt's were always incremented/decremented under a lock.
 Corinna's 2012-05-23 change to refcnt exposed a potential problem with
 dup handling where the fdtab could be updated while not locked.

 That should be fixed by this change but, on closer examination, it seems
 like there are many places where it is possible for the refcnt to be
 updated while the fdtab is not locked since the default for
 cygheap_fdget is to not lock the fdtab (and that should be the default -
 you can't have read holding a lock).

 Since refcnt was only ever called with 1 or -1, I broke it up into two
 functions but kept the Interlocked* operation.  Incrementing a variable
 should not be as racy as adding an arbitrary number to it but we have
 InterlockedIncrement/InterlockedDecrement for a reason so I kept the
 Interlocked operation here.

 In the meantime, I'll be mulling over whether the refcnt operations are
 actually safe as they are.  Maybe just ensuring that they are atomically
 updated is enough since they control the destruction of an fh.  If I got
 the ordering right with incrementing and decrementing then that should
 be adequate.
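
 As an illustration of the inc/dec split, here is a portable sketch that
 swaps in std::atomic for the Interlocked* calls.  fh_sketch and release
 are hypothetical names, not the fhandler code itself.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for an fhandler.
struct fh_sketch
{
  std::atomic<long> refcnt{0};
  bool destroyed = false;

  // Each operation is one atomic read-modify-write that returns the new
  // count, much as InterlockedIncrement/InterlockedDecrement would.
  long inc_refcnt () { return ++refcnt; }
  long dec_refcnt () { return --refcnt; }
};

// Drop one reference; whoever brings the count to zero does the teardown.
bool release (fh_sketch &fh)
{
  if (fh.dec_refcnt () > 0)
    return false;
  fh.destroyed = true;
  return true;
}
```

 Atomic updates keep the count itself consistent.  Whether that is
 enough still depends on the increment/decrement ordering discussed
 above.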

 2012-06-02  cgf-000010

 <1.7.16>
 - Fix emacs problem which exposed an issue with Cygwin's select() function.
   If a signal arrives while select is blocking and the program longjmps
   out of the signal handler then threads and memory may be left hanging.
   Fixes: http://cygwin.com/ml/cygwin/2012-05/threads.html#00275
 </1.7.16>

 This was try #4 or #5 to get select() signal handling working right.
 It's still not there but it should now at least not leak memory or
 threads.

 I mucked with the interface between cygwin_select and select_stuff::wait
 so that the "new" loop in select_stuff::wait() was essentially moved
 into the caller.  cygwin_select now uses various enum states to decide
 what to do.  It builds the select linked list at the beginning of the
 loop, allowing wait() to tear everything down and restart.  This is
 necessary before calling a signal handler because the signal handler may
 longjmp away.
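
 The control flow can be modelled roughly as follows.  The wait_result
 states and the scripted results are invented for this sketch; the real
 code uses its own enum and a real wait.  What matters is that the list
 is built at the top of each pass and torn down before any handler runs.

```cpp
#include <cassert>
#include <vector>

enum class wait_result { ready, signalled, error };

struct loop_trace
{
  int builds = 0;          // times the select list was constructed
  int teardowns = 0;       // times it was destroyed
  int handler_calls = 0;   // times a signal handler was run
  int rc = -1;             // 1 = descriptors ready, -1 = error
};

// script supplies the result of each successive wait.
loop_trace select_loop_sketch (const std::vector<wait_result> &script)
{
  loop_trace tr;
  for (wait_result r : script)
    {
      ++tr.builds;      // build the list at the top of every pass
      // ... the real wait would block here and produce r ...
      ++tr.teardowns;   // tear down before acting on the result
      if (r == wait_result::signalled)
        {
          ++tr.handler_calls;   // a longjmp from here leaves nothing behind
          continue;             // handler returned: rebuild and wait again
        }
      tr.rc = (r == wait_result::ready) ? 1 : -1;
      break;
    }
  return tr;
}
```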

 I initially had this all coded up to use a special signal_cleanup
 callback which could be called when a longjmp is called in a signal
 handler.  And cygwin_select() set up and tore down this callback.  Once
 I got everything compiling, it of course dawned on me that just because
 you call a longjmp in a signal handler it doesn't mean that you are
 jumping *out* of the signal handler.  So, if the signal handler invokes
 the callback and returns it will be very bad for select().  Hence, this
 slower, but hopefully more correct implementation.

 (I still wonder if some sort of signal cleanup callback might still
 be useful in the future)

 TODO: I need to do an audit of other places where this problem could be
 occurring.

 As alluded to above, select's signal handling is still not right.  It
 still acts as if it could call a signal handler from something other
 than the main thread but, AFAICT, from my STC, this doesn't seem to be
 the case.  It might be worthwhile to extend cygwait to just magically
 figure this out and not even bother using w4[0] for scenarios like this.

 2012-05-16  cgf-000009

 <1.7.16>
 - Fix broken console mouse handling.  Reported here:
   http://cygwin.com/ml/cygwin/2012-05/msg00360.html
 </1.7.16>

 I did a cvs annotate on smallprint.cc and see that the code to translate
 %characters > 127 to 0x notation was in the 1.1 revision.  Then I
 checked the smallprint.c predecessor.  It was in the 1.1 version of that
 program too, which means that this odd change has probably been around
 since <= 2000.

 Since __small_sprintf is supposed to emulate sprintf, I got rid of the
 special case handling.  This may affect fhandler_socket::bind.  If so, we
 should work around this problem there rather than keeping this strange
 hack in __small_sprintf.

 2012-05-14  cgf-000008

 <1.7.16>
 - Fix hang when zero bytes are written to a pty using
   Windows WriteFile or equivalent.  Fixes:
   http://cygwin.com/ml/cygwin/2012-05/msg00323.html
 </1.7.16>

 cgf-000002, as usual, fixed one thing while breaking another.  See
 Larry's predicament in: http://goo.gl/oGEr2 .

 The problem is that zero byte writes to the pty pipe caused the dread
 end-of-the-world-as-we-know-it problem reported on the mailing list
 where ReadFile reads zero bytes even though there is still more to read
 on the pipe.  This is because that change caused a 'record' to be read
 and a record can be zero bytes.

 I was never really keen about using a throwaway buffer just to get a
 count of the number of characters available to be read in the pty pipe.
 On closer reading of the documentation for PeekNamedPipe it seemed like
 the sixth argument to PeekNamedPipe should return what I needed without
 using a buffer.  And, amazingly, it did, except that the problem still
 remained - a zero byte message still screwed things up.

 So, we now detect the case where there are zero bytes available as a
 message but there are bytes available in the pipe.  In that scenario,
 return the bytes available in the pipe rather than the message length of
 zero.  This could conceivably cause problems with pty pipe handling in
 this scenario but since the only way this scenario could possibly happen
 is when someone is writing zero bytes using WriteFile to a pty pipe, I'm
 ok with that.
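
 The decision reduces to a few lines.  This sketch takes the two counts
 as plain arguments, modelling the lpTotalBytesAvail and
 lpBytesLeftThisMessage outputs of PeekNamedPipe, so that it compiles
 without Windows headers.  The function name is hypothetical.

```cpp
#include <cassert>

// total_avail: bytes currently in the pipe.
// left_this_message: bytes remaining in the message at the head of the pipe.
unsigned long bytes_available_sketch (unsigned long total_avail,
                                      unsigned long left_this_message)
{
  // A zero-byte message with data queued behind it: report the pipe
  // total so the reader does not mistake it for "nothing to read".
  if (left_this_message == 0 && total_avail > 0)
    return total_avail;
  return left_this_message;
}
```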

 2012-05-14  cgf-000007

 <1.7.16>
 - Fix invocation of strace from a cygwin process.  Fixes:
   http://cygwin.com/ml/cygwin/2012-05/msg00292.html
 </1.7.16>

 The change in cgf-000004 introduced a problem for processes which load
 cygwin1.dll dynamically.  strace.exe is the most prominent example of
 this.

 Since the parent handle is now closed for "non-Cygwin" processes, when
 strace.exe tried to dynamically load cygwin1.dll, the handle was invalid
 and child_info_spawn::handle_spawn couldn't retrieve information
 from the parent.  This eventually led to a strace_printf error due to an
 attempt to dereference an unavailable cygheap.  Probably have to fix
 this someday.  You shouldn't use the cygheap while attempting to print
 an error about the unavailability of said cygheap.

 This was fixed by saving the parent pid in child_info_spawn and calling
 OpenProcess for the parent pid and using that handle iff a process is
 dynamically loaded.

 2012-05-12  cgf-000006

 <1.7.16>
 - Fix hang when calling pthread_testcancel in a canceled thread.
   Fixes some of: http://cygwin.com/ml/cygwin/2012-05/msg00186.html
 </1.7.16>

 This should fix the first part of the reported problem in the above
 message.  The cancel seemed to actually be working but, the fprintf
 eventually ended up calling pthread_testcancel.  Since we'd gotten here
 via a cancel, it tried to recursively call the cancel handler causing a
 recursive loop.

 2012-05-12  cgf-000005

 <1.7.16>
 - Fix pipe creation problem which manifested as a problem creating a
   fifo.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00253.html
 </1.7.16>

 My change on 2012-04-28 introduced a problem with fifos.  The passed
 in name was overwritten.  This was because I wasn't properly keeping
 track of the length of the generated pipe name when there was a
 name passed in to fhandler_pipe::create.

 There was also another problem in fhandler_pipe::create.  Since fifos
 use PIPE_ACCESS_DUPLEX and PIPE_ACCESS_DUPLEX is an or'ing of
 PIPE_ACCESS_INBOUND and PIPE_ACCESS_OUTBOUND, using PIPE_ACCESS_OUTBOUND
 as a "never-used" option for PIPE_ADD_PID in fhandler.h was wrong.  So,
 fifo creation attempted to add the pid of a pipe to the name which is
 wrong for fifos.
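
 The collision is easy to see with the flag values written out.  The
 three access constants below repeat the Windows SDK values so that the
 sketch compiles without <windows.h>.  add_pid_distinct is a
 hypothetical replacement bit, not necessarily the value Cygwin chose.

```cpp
#include <cassert>

constexpr unsigned long pipe_access_inbound = 0x1;    // PIPE_ACCESS_INBOUND
constexpr unsigned long pipe_access_outbound = 0x2;   // PIPE_ACCESS_OUTBOUND
constexpr unsigned long pipe_access_duplex = 0x3;     // PIPE_ACCESS_DUPLEX

static_assert (pipe_access_duplex
               == (pipe_access_inbound | pipe_access_outbound),
               "duplex is the OR of both directions");

// The broken choice reuses a bit that duplex pipes always have set.
constexpr unsigned long add_pid_colliding = pipe_access_outbound;
// Hypothetical fix: any bit outside the access mask.
constexpr unsigned long add_pid_distinct = 0x80000000UL;

// Does this open mode ask for the pid to be added to the pipe name?
constexpr bool wants_pid_suffix (unsigned long open_mode,
                                 unsigned long add_pid_flag)
{
  return (open_mode & add_pid_flag) != 0;
}
```

 With the colliding flag every duplex open looks like a request to add
 the pid.  With a distinct bit it only does so when asked.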

 2012-05-08  cgf-000004

 The change for cgf-000003 introduced a new problem:
 http://cygwin.com/ml/cygwin/2012-05/msg00154.html
 http://cygwin.com/ml/cygwin/2012-05/msg00157.html

 Since a handle associated with the parent is no longer being duplicated
 into a non-cygwin "execed child", Windows is free to reuse the pid of
 the parent when the parent exits.  However, since we *did* duplicate a
 handle pointing to the pid's shared memory area into the "execed child",
 the shared memory for the pid was still active.

 Since the shared memory was still available, if a new process reuses the
 previous pid, Cygwin would detect that the shared memory was not created
 and had a "PID_REAPED" flag.  That was considered an error, and, so, it
 would set procinfo to NULL and pinfo::thisproc would die since this
 situation is not supposed to occur.

 I fixed this in two ways:

 1) If a shared memory region has a PID_REAPED flag then zero it and
 reuse it.  This should be safe since you are not really supposed to be
 querying the shared memory region for anything after PID_REAPED has been
 set.

 2) Forego duping a copy of myself_pinfo if we're starting a non-cygwin
 child for exec.
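
 Fix 1) might look roughly like this.  pinfo_region_sketch, the bit
 value of pid_reaped_sketch and the helper name are all hypothetical;
 only the zero-and-reuse rule comes from the description above.

```cpp
#include <cassert>

constexpr unsigned pid_reaped_sketch = 0x1;   // hypothetical bit value

struct pinfo_region_sketch
{
  unsigned flags;
  int pid;
};

// created is true when the shared region was freshly created for this
// pid and false when an already existing mapping was opened instead.
bool claim_region_sketch (pinfo_region_sketch &r, bool created, int pid)
{
  if (!created)
    {
      if (!(r.flags & pid_reaped_sketch))
        return false;               // still owned by a live process
      r = pinfo_region_sketch ();   // stale leftovers: zero and reuse
    }
  r.pid = pid;
  return true;
}
```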

 It seems like 2) is a common theme and an audit of all of the handles
 that are being passed to non-cygwin children is in order for 1.7.16.

 The other minor modification that was made in this change was to add the
 pid of the failing process to fork error output.  This helps slightly
 when looking at strace output, even though in this case it was easy to
 find what was failing by looking for '^---' when running the "stv"
 strace dumper.  That found the offending exception quickly.

 2012-05-07  cgf-000003

 <1.7.15>
 Don't make Cygwin wait for all children of a non-cygwin child program.
 Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00063.html,
        http://cygwin.com/ml/cygwin/2012-05/msg00075.html
 </1.7.15>

 This problem is due to a recent change which added some robustness and
 speed to Cygwin's exec/spawn handling by not trying to force inheritance
 every time a process is started.  See ChangeLog entries starting on
 2012-03-20, and multiple on 2012-03-21.

 Making the handle inheritable meant that, as usual, there were problems
 with non-Cygwin processes.  When Cygwin "execs" a non-Cygwin process N,
 all of its N + 1, N + 2, ...  children will also inherit the handle.
 That means that Cygwin will wait until all subprocesses have exited
 before it returns.

 I was willing to make this a restriction of starting non-Cygwin
 processes but the problem with allowing that is that it can cause the
 creation of a "limbo" pid when N exits and N + 1 and friends are still
 around.  In this scenario, Cygwin dutifully notices that process N has
 died and sets the exit code to indicate that but N's parent will wait on
 rd_proc_pipe and will only return when every N + ...  windows process
 has exited.

 The removal of cygheap::pid_handle was not related to the initial
 problem that I set out to fix.  The change came from the realization
 that we were duping the current process handle into the child twice and
 only needed to do it once.  The current process handle is used by exec
 to keep the Windows pid "alive" so that it will not be reused.  So, now
 we just close parent in child_info_spawn::handle_spawn iff we're not
 execing.

 In debugging this it bothered me that 'ps' identified a nonactive pid as
 active.  Part of the reason for this was the 'parent' handle in
 child_info was opened in non-Cygwin processes, keeping the pid alive.
 That has been kluged around (more changes after 1.7.15) but that didn't
 fix the problem.  On further investigation, this seems to be caused by
 the fact that the shared memory region pid handles were still being
 passed to non-cygwin children, keeping the pid alive in a limbo-like
 fashion.  This was easily fixed by having pinfo::init() consider a
 memory region with PID_REAPED as not available.  A more robust fix
 should be considered for 1.7.15+ where these handles are not passed
 to non-cygwin processes.

 This fixed the problem where a pid showed up in the list after a user
 does something like: "bash$ cmd /c start notepad" but, for some reason,
 it does not fix the problem where "bash$ setsid cmd /c start notepad".
 That bears investigation after 1.7.15 is released but it is not a
 regression and so is not a blocker for the release.

 2012-05-03  cgf-000002

 <1.7.15>
 Fix problem where too much input was attempted to be read from a
 pty slave.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00049.html
 </1.7.15>

 My change on 2012/04/05 reintroduced the problem first described by:
 http://cygwin.com/ml/cygwin/2011-10/threads.html#00445

 The problem then was, IIRC, due to the fact that bytes sent to the pty
 pipe were not written as records.  Changing pipe to PIPE_TYPE_MESSAGE in
 pipe.cc fixed the problem since writing lines to one side of the pipe
 caused exactly that number of characters to be read on the other
 even if there were more characters in the pipe.

 To debug this, I first replaced fhandler_tty.cc with the 1.258,
 2012/04/05 version.  The test case started working when I did that.

 So, then, I replaced individual functions, one at a time, in
 fhandler_tty.cc with their previous versions.  I'd expected this to be a
 problem with fhandler_pty_master::process_slave_output since that had
 seen the most changes but was surprised to see that the culprit was
 fhandler_pty_slave::read().

 The reason was that I really needed the bytes_available() function to
 return the number of bytes which would be read in the next operation
 rather than the number of bytes available in the pipe.  That's because
 there may be a number of lines available to be read but the number of
 bytes which will be read by ReadFile should reflect the mode of the pty
 and, if there is a line to read, only the number of bytes in the line
 should be seen as available for the next read.

 Having bytes_available() return the number of bytes which would be read
 seemed to fix the problem but it could subtly change the behavior of
 other callers of this function.  However, I actually think this is
 probably a good thing since they probably should have been seeing the
 line behavior.
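
 A toy model of a message-mode pipe shows the distinction.
 message_pipe_sketch is invented for illustration and stores each write
 as one record; it does not reproduce the pty code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

struct message_pipe_sketch
{
  std::deque<std::string> records;   // one entry per write

  void write (const std::string &rec) { records.push_back (rec); }

  // Everything queued in the pipe.
  std::size_t total_bytes () const
  {
    std::size_t n = 0;
    for (const std::string &r : records)
      n += r.size ();
    return n;
  }

  // What the next read will actually return: just the first record.
  std::size_t bytes_available () const
  {
    return records.empty () ? 0 : records.front ().size ();
  }

  std::string read ()
  {
    if (records.empty ())
      return std::string ();
    std::string rec = records.front ();
    records.pop_front ();
    return rec;
  }
};
```

 With two lines queued, bytes_available() reports the length of the
 first line only, which is what a line-mode reader should see.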

 2012-05-02  cgf-000001

 <1.7.15>
 Fix problem setting parent pid to 1 when process with children execs
 itself.  Fixes: http://cygwin.com/ml/cygwin/2012-05/msg00009.html
 </1.7.15>

 Investigating this problem with strace showed that ssh-agent was
 checking the parent pid and getting a 1 when it shouldn't have.  Other
 stuff looked ok so I chose to consider this a smoking gun.

 Going back to the version that the OP said did not have the problem, I
 worked forward until I found where the problem first occurred -
 somewhere around 2012-03-19.  And, indeed, the getppid call returned the
 correct value in the working version.  That means that this stopped
 working when I redid the way the process pipe was inherited around
 this time period.

 It isn't clear why (and I suspect I may have to debug this further at
 some point) this hasn't always been a problem but I made the obvious fix.
 We shouldn't have been setting ppid = 1 when we're about to pass off to
 an execed process.

 As I was writing this, I realized that it was necessary to add some
 additional checks.  Just checking for "have_execed" isn't enough.  If
 we've execed a non-cygwin process then it won't know how to deal with
 any inherited children.  So, always set ppid = 1 if we've execed a
 non-cygwin process.
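
 The resulting rule fits in a small predicate.  The function and
 parameter names are made up for this sketch; have_execed is the only
 name taken from the note.

```cpp
#include <cassert>

// Should children be handed off to pid 1 when this process goes away?
bool should_set_ppid_to_1 (bool have_execed, bool execed_cygwin_process)
{
  if (!have_execed)
    return true;                   // really exiting: orphan the children
  return !execed_cygwin_process;   // a non-cygwin image can't manage them
}
```

 Only an exec of another Cygwin process keeps the children attached to
 their original parent pid.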