Robust processing of spoken situated dialogue