Analytics white-paper

Stage Gating for Robust FX Strategy Research

A Purged Walk-Forward + Bootstrap Framework, with Microstructure Entry Refinement as Stage 2

Lucitech Computer Solutions — Quant Research (White Paper, Part 1)
Author: Sean Plows
Date: 3 February 2026
Version: 1.1 (living document)


Disclaimer

This document describes a personal research programme and engineering methodology. It is not investment advice, not a recommendation to trade, and not a solicitation. Any examples are illustrative and may rely on assumptions (transaction costs, spreads, slippage, data quality) that materially impact results. Markets are non-stationary; relationships observed historically may not persist.


Abstract

Retail trading research often fails for one reason: it confuses found patterns with robust evidence. This paper outlines a framework for systematic FX research designed to avoid “holy grail” hunting by enforcing: (i) decision-time feature constraints, (ii) purged + embargoed walk-forward validation, (iii) block bootstrap uncertainty, and (iv) safeguards against multiple-testing bias.

The research programme is structured as a two-stage funnel:

  1. Gating (Stage 1): discover and validate interpretable “veto” rules that remove low-quality regions of feature space and improve conditional outcome probabilities out-of-sample.
  2. Microstructure entry refinement (Stage 2): only after Stage 1 is stable, test whether tick-level microstructure variables can bring entry forward and improve trade quality without increasing adverse excursion.

Part 1 focuses on the methodology, notation, and research stack. Later parts will publish empirical findings as the programme progresses.


1. Motivation and research goal

FX markets are noisy and adaptive. Any research pipeline that tests enough ideas will eventually discover something that looks good in-sample. The goal here is not to find a single magic indicator, but to build a repeatable process that answers:

“Is there a robust conditional edge that survives realistic time-series validation and costs?”

The most reliable lever I’ve found in systematic research is gating: selectively avoiding trades in conditions that consistently degrade expectancy. This is a conservative approach: it does not require precise forecasting; it only requires identifying where not to trade.

Only once gating is robust do I consider Stage 2: whether entry timing can be improved using microstructure variables.


2. Data and instrumentation

2.1 Event sources

The strategy generates:

  • Submitted orders and filled trades (timestamps, direction, prices, and lifecycle events).
  • Market data: bar data (e.g., 1-minute) and tick data (bid/ask, and volumes where available).

2.2 Trade representation

Each trade i is defined by:

  • Decision time t_i (the time a trade is committed / filled; the decision boundary is defined consistently per experiment).
  • Direction d_i \in \{+1,-1\} (LONG/SHORT).
  • Entry price p_i.
  • A vector of features available at decision time, x_i \in \mathbb{R}^k.

A key discipline: no feature may use information after t_i.


3. Notation and core definitions

Let P_t be a price process (bid, ask, mid, or another consistent convention).

3.1 Returns and excursions

For trade i entered at t_i with entry price p_i, define an evaluation window [t_i, t_i + H].

Directional signed move:

\Delta P_i(t) = d_i \cdot (P_t - p_i)

Maximum favourable excursion (MFE) over horizon H:

\mathrm{MFE}_i(H) = \max_{t \in [t_i,\, t_i+H]} \Delta P_i(t)

Maximum adverse excursion (MAE) over horizon H:

\mathrm{MAE}_i(H) = \min_{t \in [t_i,\, t_i+H]} \Delta P_i(t)

(Excursions are measured consistently in pips or price units.)
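
To make the definitions concrete, a minimal sketch of the excursion computation (function and variable names are illustrative, not part of the notation):

```python
import numpy as np

def excursions(prices: np.ndarray, entry_price: float, direction: int):
    """MFE and MAE of one trade over an evaluation window [t_i, t_i + H].

    prices      : path of P_t over the window (any consistent convention)
    entry_price : p_i
    direction   : d_i, +1 for LONG, -1 for SHORT
    """
    signed_move = direction * (prices - entry_price)  # \Delta P_i(t)
    return signed_move.max(), signed_move.min()       # (MFE, MAE)

# Example: a LONG entered at 1.1000 over a short window of mid prices.
path = np.array([1.1000, 1.1004, 1.0995, 1.1010, 1.0990])
mfe, mae = excursions(path, 1.1000, +1)               # MFE ≈ +10 pips, MAE ≈ -10 pips
```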

3.2 Event labels

A common trap is using “MFE is positive” as a tradability claim. Many losing trades briefly go positive. Instead, I define event labels that reflect tradability:

Hit-first event label (recommended):

y_i = \mathbb{1}\{\text{price hits } +X \text{ before } -Y \text{ within } H\}

where X and Y are thresholds in pips, typically tied to volatility/spread constraints such as:

X = \max(\kappa \cdot \text{spread},\; \mu \cdot \mathrm{ATR})

This makes the label resistant to “predicting noise”.
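
A minimal sketch of the hit-first labelling rule (the pip size and the conservative tie-break are assumptions that must be fixed per experiment):

```python
import numpy as np

def hit_first_label(prices, entry_price, direction, x_pips, y_pips, pip=1e-4):
    """y_i = 1 if the signed move reaches +X before -Y within the window."""
    signed_pips = direction * (np.asarray(prices) - entry_price) / pip
    for move in signed_pips:          # prices ordered in time over [t_i, t_i + H]
        if move <= -y_pips:           # adverse threshold checked first:
            return 0                  # a tie within one observation counts as a loss
        if move >= x_pips:
            return 1
    return 0                          # neither threshold reached within H
```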

3.3 Gates

A gate is a boolean function of decision-time features:
g(x_i) \in \{0,1\}
where g(x_i)=1 means “trade allowed”, and g(x_i)=0 means “veto”.

The key quantity of interest is uplift in a target metric, for example hit probability:

\Delta = \mathbb{E}[y \mid g(x)=1] - \mathbb{E}[y]

Or lift:

\mathrm{Lift} = \frac{\mathbb{E}[y \mid g(x)=1]}{\mathbb{E}[y]}
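
Both quantities reduce to a few lines; a sketch (the retention figure anticipates the screening criteria in Section 5):

```python
import numpy as np

def gate_metrics(y: np.ndarray, allowed: np.ndarray) -> dict:
    """Uplift and lift of a gate g on binary labels y, given allowed = (g(x) == 1)."""
    base = y.mean()                        # E[y]
    conditional = y[allowed].mean()        # E[y | g(x) = 1]
    return {
        "uplift": conditional - base,      # \Delta
        "lift": conditional / base,        # Lift
        "retention": allowed.mean(),       # fraction of trades the gate keeps
    }
```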

4. Validation protocol (time-series first)

4.1 Purged and embargoed walk-forward

Time-series validation must avoid leakage from temporal dependence and overlapping trades. I use walk-forward validation with:

  • Training window: T_{\text{train}} days
  • Test window: T_{\text{test}} days
  • Purging: training trades whose label windows [t_i, t_i+H] overlap the test window are removed, so no training label is resolved using test-period prices.
  • Embargo: samples within an embargo period around fold boundaries (e.g., 24 hours) are additionally removed to reduce contamination from adjacent time segments.

This ensures I’m estimating behaviour under a realistic “train on past, test on future” regime.
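
A minimal fold generator, under simplifying assumptions (fixed calendar windows, trades addressed by position; the argument names are mine):

```python
import numpy as np
import pandas as pd

def walk_forward_folds(t, label_end, train_days, test_days, embargo_hours=24):
    """Yield (train_idx, test_idx) pairs with purging and an embargo.

    t         : pd.Series of decision times t_i, sorted ascending
    label_end : pd.Series of t_i + H, when each label is fully resolved
    """
    t, label_end = pd.to_datetime(t), pd.to_datetime(label_end)
    train_w, test_w = pd.Timedelta(days=train_days), pd.Timedelta(days=test_days)
    embargo = pd.Timedelta(hours=embargo_hours)
    cursor = t.min() + train_w
    while cursor + test_w <= t.max():
        test_idx = np.where((t >= cursor) & (t < cursor + test_w))[0]
        in_window = (t >= cursor - train_w) & (t < cursor)
        purged    = label_end < cursor          # label resolved before the test window
        embargoed = t < cursor - embargo        # buffer before the fold boundary
        yield np.where(in_window & purged & embargoed)[0], test_idx
        cursor += test_w                        # roll forward, never look back
```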

4.2 Block bootstrap for uncertainty

Financial outcomes are autocorrelated and heavy-tailed. I use block bootstrap (e.g., by day) to estimate uncertainty for fold outcomes and uplift metrics.

If Z is a statistic (lift, win-rate uplift, profit-factor proxy, mean return), I estimate a distribution:
\{Z^{(b)}\}_{b=1}^{B}
from which confidence intervals and stability measures are derived.
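
A sketch of the day-block bootstrap (block choice, B, and the statistic are per-experiment decisions):

```python
import numpy as np
import pandas as pd

def day_block_bootstrap(values, days, stat=np.mean, n_boot=2000, seed=0):
    """Resample whole days with replacement and recompute a statistic.

    values : per-trade outcomes (labels, returns, ...)
    days   : calendar day of each trade (same length as values)
    Resampling days rather than individual trades preserves within-day dependence.
    """
    rng = np.random.default_rng(seed)
    blocks = [g.to_numpy() for _, g in pd.Series(values).groupby(pd.Series(days))]
    draws = np.empty(n_boot)
    for b in range(n_boot):
        picks = rng.integers(0, len(blocks), size=len(blocks))
        draws[b] = stat(np.concatenate([blocks[i] for i in picks]))
    return draws   # e.g., np.percentile(draws, [2.5, 97.5]) for a 95% interval
```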

4.3 Multiple testing controls

If I test hundreds of candidate gates, some will appear significant by chance. To reduce “selection by noise”, I apply multiple-testing discipline and require walk-forward stability rather than single-period wins.
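
One concrete form of that discipline is false-discovery-rate control across the candidate set; a minimal Benjamini–Hochberg sketch (BH specifically is illustrative here, not a claim that it is the only control applied):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of 'discoveries' under BH FDR control at level alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    critical = alpha * np.arange(1, m + 1) / m      # step-up critical values
    passed = p[order] <= critical
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.where(passed)[0])             # largest rank that passes
        keep[order[: k + 1]] = True                 # every smaller p-value passes too
    return keep
```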

Practical promotion standards:

  • Evidence must appear in multiple folds, not one.
  • Gates should remain interpretable and operationally simple.

5. Stage gating research design

This programme uses a funnel:

Stage A — Candidate screening (cheap)

Goal: rapidly identify candidate gates that show uplift.

Candidate generation from:

  • simple thresholds (spread bucket, ATR bucket, session/hour, regime flags, confidence bins),
  • small combinations of conditions,
  • “pocket” discovery (bins and intersections).

Evaluation includes:

  • uplift and retention (how many trades remain),
  • minimum trade count,
  • minimum hits in test segments.

Stage B — Walk-forward validation (proper)

Goal: ensure uplift persists out-of-sample.

  • Apply purged/embargoed walk-forward.
  • Estimate uncertainty via block bootstrap.
  • Promote only gates that show consistent benefit across folds.

Stage C — Stress and realism checks

Goal: ensure robustness is not an artefact.

  • Costs/spread/slippage stress (sensitivity analysis).
  • Regime drift checks (by year/session/volatility regimes).
  • Failure mode analysis: where does the gate break?

The output of Stage 1 is not “the strategy”. It is a policy constraint: a compact set of veto rules that reduces exposure to adverse conditions.


6. Interpretable machine learning in the research loop

In parts of the pipeline I use machine learning as a research tool, not as a deployable “black-box trading model”.

The principle is simple:

ML outputs are treated as hypotheses (candidate gates or quantified relationships).
Promotion depends entirely on out-of-sample stability under the validation protocol above.

6.1 ML as a candidate generator (shallow trees)

To efficiently search for compact, interpretable veto rules, I use shallow decision trees as a rule generator. The tree is trained on decision-time features x_i with an appropriate label (e.g., y_i or a loss-event proxy), and then distilled into human-readable conditions.

A typical distilled gate looks like:

  • “If (spread bucket high) AND (hour in weak session) AND (volatility regime unfavourable) → veto”

Importantly:

  • the tree itself is not the final model,
  • rules are extracted and then re-tested independently via purged walk-forward + bootstrap,
  • only compact rules that remain stable are promoted.

This approach provides a practical compromise: algorithmic discovery with human-auditable outputs.
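
A sketch of the rule-generation step on synthetic stand-in data (feature names, depth, and leaf-size constraints are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-ins for the decision-time features x_i and label y_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
y = (rng.random(5000) < 0.40 + 0.10 * (X[:, 0] < 0)).astype(int)

# Deliberately shallow: every root-to-leaf path stays human-readable, and
# min_samples_leaf rejects rules supported by only a handful of trades.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=200, random_state=0)
tree.fit(X, y)

# Distil into text; each low-probability leaf is a candidate veto rule,
# re-tested independently via purged walk-forward + bootstrap before promotion.
print(export_text(tree, feature_names=["spread_bucket", "hour", "atr_regime"]))
```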

6.2 ML as quantification (regularised logistic regression)

Where useful, I use regularised logistic regression to quantify associations and monitor stability/drift-like behaviour under feature constraints. Regularisation reduces degrees of freedom and helps avoid fitting noise, especially when multiple correlated features exist.
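
A minimal example of such a fit (synthetic data again; the penalty, C, and feature names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))                 # decision-time features x_i
y = (rng.random(5000) < 0.40 + 0.10 * (X[:, 0] < 0)).astype(int)

# Standardising first makes the penalised coefficients comparable across
# features; smaller C means stronger shrinkage towards zero.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l2", C=0.1, max_iter=1000))
model.fit(X, y)

# Coefficients are read as association strengths and drift monitors,
# not as deployable alpha.
coefs = model[-1].coef_.ravel()
print(dict(zip(["spread", "hour", "atr"], coefs.round(3))))
```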

Again, these fits are used to inform hypotheses and prioritise investigations; they are not treated as “alpha” unless supported by out-of-sample evidence.

6.3 Why this ML usage is deliberately constrained

The primary risk with ML in finance is not implementation complexity; it is overfitting under multiple testing. By limiting ML to interpretability-first roles (candidate generation and quantification), and enforcing strict out-of-sample validation, the pipeline stays aligned with the goal: robust conditional probability rather than fragile curve-fit.


7. Stage 2: microstructure entry refinement (overview)

Once gating is stable, I test a separate hypothesis:

Conditional on “good” gated conditions, can tick-level variables reliably improve entry timing and trade quality?

7.1 Theoretical earlier entry

For each trade, define a local bar-based “cycle window” around the actual entry time. Over that window, compute a range:

R = P_{\text{high}} - P_{\text{low}}

Define a theoretical entry level using an entry fraction \alpha \in (0,1):

  • For a LONG:
p^\star = P_{\text{low}} + \alpha R

  • For a SHORT:
p^\star = P_{\text{high}} - \alpha R

Then use tick data to find the earliest timestamp \tau where the mid price crosses p^\star.
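
A sketch of the crossing search (the tick schema, and the crossing convention of first touch at or beyond p^\star in the entry-favourable direction, are assumptions to be fixed per experiment):

```python
import pandas as pd

def earliest_cross(ticks: pd.DataFrame, p_low: float, p_high: float,
                   direction: int, alpha: float):
    """Earliest timestamp tau at which the tick mid price crosses p*.

    ticks          : DataFrame with 'time', 'bid', 'ask' over the cycle window
    p_low, p_high  : bar-based window extremes defining the range R
    """
    r = p_high - p_low                             # R = P_high - P_low
    mid = (ticks["bid"] + ticks["ask"]) / 2
    if direction == +1:                            # LONG: p* = P_low + alpha * R
        p_star = p_low + alpha * r
        crossed = mid <= p_star                    # first touch at/below p*
    else:                                          # SHORT: p* = P_high - alpha * R
        p_star = p_high - alpha * r
        crossed = mid >= p_star                    # first touch at/above p*
    if not crossed.any():
        return None, p_star                        # level never reached: no earlier entry
    return ticks.loc[crossed.idxmax(), "time"], p_star
```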

7.2 Microstructure features around \tau

At/around \tau, compute microstructure features such as:

  • spread statistics in short windows pre/post \tau,
  • short-horizon signed move statistics,
  • volume imbalance proxies (when volume is available),
  • measures of choppiness vs directional persistence,
  • time-to-cross and subsequent short-horizon MFE/MAE.

7.3 Stage 2 evaluation rule

Stage 2 is only “successful” if earlier entry improves outcomes without increasing risk:

  • improves probability of reaching +X before -Y within H,
  • improves MFE distribution and does not worsen MAE beyond tolerance,
  • remains stable across walk-forward folds.

This is deliberately conservative: fragile signals are not promoted.


8. Engineering stack and reproducibility

8.1 Pipeline principles

  • Reproducible runs: each experiment tagged with a run identifier (e.g., run_id).
  • Decision-time enforcement: features must be computable at t_i without future leakage (see the leakage-check sketch after this list).
  • Separation of concerns:
    • strategy execution & telemetry capture,
    • dataset construction,
    • validation harness,
    • reporting and artifact generation.
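
As one example of decision-time enforcement, a sketch of a leakage check; the `_asof` companion-column convention is an illustrative assumption about how the dataset builder stamps features:

```python
import pandas as pd

def assert_decision_time_valid(dataset: pd.DataFrame, decision_col: str = "t_i"):
    """Fail loudly if any feature was computed after its trade's decision time.

    Assumes each feature column has a companion '<feature>_asof' timestamp
    recording when the value became available.
    """
    t_i = pd.to_datetime(dataset[decision_col])
    for col in (c for c in dataset.columns if c.endswith("_asof")):
        late = pd.to_datetime(dataset[col]) > t_i
        if late.any():
            raise ValueError(f"{col}: {int(late.sum())} rows stamped after decision time")
```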

8.2 Architecture sketch (conceptual)

Strategy execution (orders/trades/events)
        |
        v
PostgreSQL telemetry + market data
        |
        v
Dataset builder (labels + features)
        |
        +--> Stage 1: gating / WF / bootstrap / promotion
        |
        +--> Stage 2: theoretical entry + tick microstructure features
        |
        v
Reports: notes + charts + CSV summaries

8.3 Why this matters

The stack is not “the edge”. The stack is what prevents self-deception:

  • it makes it easy to test hypotheses quickly,
  • it makes it hard to leak future information accidentally,
  • it makes negative results useful (they close doors).

9. Reporting policy: what I publish (and what I don’t)

I publish:

  • methodology,
  • validation discipline,
  • stability evidence and limitations,
  • engineering approach and research notes.

I do not publish:

  • live deployable parameters or rule sets,
  • security-sensitive operational details,
  • anything presented as guaranteed performance.

10. Current status (at time of writing)

  • Stage gating research is ongoing, with emphasis on out-of-sample stability.
  • Microstructure entry refinement is tested only after gating reduces the hypothesis space.
  • The stack evolves to improve iteration speed, auditability, and repeatability.

11. Roadmap for Part 2

Part 2 will focus on empirical results from Stage 2, including:

  • whether earlier entry opportunities exist conditionally,
  • which microstructure variables (if any) survive walk-forward,
  • how entry improvements interact with risk controls (MAE and tail behaviour).

Appendix A: Checklist for each experiment

  1. Define label y_i (horizon H, thresholds X and Y, cost assumptions).
  2. Define features x_i and confirm they are decision-time valid.
  3. Stage A screening with conservative minimum sample sizes.
  4. Stage B walk-forward with embargo and block bootstrap.
  5. Apply multiple-testing discipline; promote only compact gates.
  6. Stress test costs, regime segmentation, and drift.
  7. Write a short research note: what worked, what failed, what next.