Presenting BoundaryNet - a resizing-free approach for high-precision weakly supervised document layout parsing. BoundaryNet will be an ORAL presentation (Oral Session 3) today at @icdar2021 . Project page: ihdia.iiit.ac.in/BoundaryNet/ . Details 👇
Precise boundary annotations can be crucial for downstream applications that rely on region-class semantics. Some document collections contain irregular and overlapping region instances. Fully automatic approaches require resizing and often produce suboptimal parsing results on such collections.
Our semi-automatic approach takes a region bounding box as input and predicts a boundary polygon as output. Importantly, BoundaryNet can handle variable-sized images without any need for resizing.
In the first stage, the variable-sized input image is processed by an attention-based fully convolutional network, the Mask-CNN (MCNN), to obtain a region mask and a class label.
The first part of the backbone contains a series of residual blocks that produce progressively refined feature representations. The second part contains Skip Attentional Guidance (SAG) blocks, each of which produces an increasingly compressed feature representation of its input.
The output from the immediately preceding SAG block is fused with skip features originating from a lower-level residual block. This fusion is modulated via an attention mechanism. No spatial downsampling or upsampling is performed, which enables crucial boundary information to be preserved.
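A minimal PyTorch sketch of this attention-modulated skip fusion idea; the layer shapes and the exact gating form are illustrative assumptions, not the paper's precise SAG block:

```python
import torch
import torch.nn as nn

class SkipAttentionFusion(nn.Module):
    """Sketch of attention-gated skip fusion (hypothetical layer sizes)."""
    def __init__(self, channels):
        super().__init__()
        # 1x1 convolutions keep spatial resolution intact: no down/upsampling,
        # so fine boundary detail in the feature maps is preserved.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, prev_sag, skip):
        # prev_sag: output of the immediately preceding SAG block
        # skip: features from the corresponding lower-level residual block
        attn = self.gate(torch.cat([prev_sag, skip], dim=1))
        return self.fuse(torch.cat([prev_sag, attn * skip], dim=1))
```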
Features from the last residual block are fed to the ‘Region Classifier’ sub-network, which predicts the associated region class. An adaptive average pooling block within the sub-network ensures a fixed-dimensional output despite varying input dimensions.
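A small sketch of how adaptive average pooling yields a fixed-size classifier input for variable-sized images; the channel width and class count below are placeholders, not the paper's exact values:

```python
import torch
import torch.nn as nn

class RegionClassifier(nn.Module):
    """Sketch of the region-classifier head (placeholder sizes)."""
    def __init__(self, in_channels=256, num_classes=9):
        super().__init__()
        # AdaptiveAvgPool2d maps any HxW feature map to a fixed 1x1 grid,
        # so the linear layer sees a constant-dimensional vector regardless
        # of the (un-resized) input image size.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feats):                 # feats: (B, C, H, W), H/W vary
        pooled = self.pool(feats).flatten(1)  # (B, C)
        return self.fc(pooled)                # (B, num_classes) class logits
```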
The final set of features generated by the skip-connection based attentional guidance is provided to the ‘Mask Decoder’ network, which outputs a binary region mask.
A fast marching distance map is used to guide region mask optimization towards boundary awareness. The map is combined with a per-pixel, class-weighted binary focal loss to improve robustness.
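A hedged sketch of what such a loss could look like; the exact way the distance map, class weights and focal term are combined here is an assumption, not the paper's formulation:

```python
import torch
import torch.nn.functional as F

def boundary_weighted_focal_loss(logits, target, dist_map, class_weight,
                                 gamma=2.0):
    """Illustrative combination (assumed form) of a per-pixel focal term
    with a fast-marching distance map.
    logits, target: (B, 1, H, W) float tensors;
    dist_map: per-pixel weights, higher near the region boundary;
    class_weight: scalar or per-pixel class weighting."""
    p = torch.sigmoid(logits)
    pt = torch.where(target > 0.5, p, 1 - p)          # prob. of true class
    bce = F.binary_cross_entropy_with_logits(logits, target,
                                             reduction="none")
    focal = ((1 - pt) ** gamma) * bce                  # down-weight easy pixels
    return (class_weight * dist_map * focal).mean()    # emphasize boundary
```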
Overall, our design choices within the MCNN encourage the generation of good initial boundary estimates. Such estimates reduce the complexity of the subsequent boundary refinement task.
A series of morphological operations is applied to the region mask output by the MCNN to obtain an initial estimate of the boundary polygon.
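For illustration, a possible OpenCV recipe; the exact morphological schedule and sampling density are assumptions:

```python
import cv2
import numpy as np

def initial_boundary(mask, num_points=200):
    """Sketch: extract an initial boundary polygon from the MCNN mask."""
    mask = (mask > 0.5).astype(np.uint8)
    kernel = np.ones((3, 3), np.uint8)
    # Close small holes, then open to remove speckle before contour tracing.
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).squeeze(1)  # (N, 2) points
    idx = np.linspace(0, len(contour) - 1, num_points).astype(int)
    return contour[idx]   # uniformly sampled initial boundary estimate
```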
Our Anchor Graph Convolutional Network (Anchor GCN) processes the boundary estimate points as a graph and iteratively refines their locations.
The nodes of the graph input to Anchor GCN are points sampled on the mask contour. Each node's features are its 2D position and the corresponding skip-attention backbone feature. Each node is connected to its 10-hop mask contour neighbours.
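A sketch of the resulting cyclic contour graph; `contour_adjacency` is a hypothetical helper, and the backbone feature lookup is omitted:

```python
import numpy as np

def contour_adjacency(num_nodes, hops=10):
    """Each contour point is linked to its k-hop neighbours along the
    (cyclic) mask contour."""
    adj = np.zeros((num_nodes, num_nodes), dtype=np.float32)
    for i in range(num_nodes):
        for h in range(1, hops + 1):
            adj[i, (i + h) % num_nodes] = 1.0   # neighbours ahead on contour
            adj[i, (i - h) % num_nodes] = 1.0   # neighbours behind on contour
    return adj

# Node features: 2D position concatenated with the skip-attention backbone
# feature sampled at that position.
```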
The boundary estimates are refined via two GCN blocks and six Res-GCN blocks. A fully connected layer at the end predicts shifts in the x-y locations of the initial boundary points.
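An illustrative PyTorch sketch of this refinement head, assuming a plain adjacency-based graph convolution and placeholder feature sizes (the paper's Res-GCN internals may differ):

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Plain graph convolution: aggregate neighbour features via a
    (row-normalized) adjacency, then transform."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):          # x: (N, F), adj: (N, N)
        return torch.relu(self.lin(adj @ x))

class AnchorGCN(nn.Module):
    """Sketch: 2 GCN blocks, 6 residual GCN blocks, and a final layer
    predicting per-point (dx, dy) shifts. Feature sizes are placeholders."""
    def __init__(self, feat_dim=66, hidden=128):
        super().__init__()
        self.gcn1 = GCNLayer(feat_dim, hidden)
        self.gcn2 = GCNLayer(hidden, hidden)
        self.res_blocks = nn.ModuleList(GCNLayer(hidden, hidden)
                                        for _ in range(6))
        self.shift = nn.Linear(hidden, 2)

    def forward(self, x, adj):
        x = self.gcn2(self.gcn1(x, adj), adj)
        for block in self.res_blocks:
            x = x + block(x, adj)       # residual GCN update
        return self.shift(x)            # (N, 2) x-y shifts per boundary point
```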
We optimize the Anchor GCN parameters with a boundary-centric Hausdorff distance loss, and also use the Hausdorff distance as the evaluation metric.
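For reference, the symmetric Hausdorff distance between two point sets; this sketch shows the evaluation form, and training would use a suitably smoothed, boundary-centric variant:

```python
import torch

def hausdorff_distance(pred, gt):
    """Symmetric Hausdorff distance between point sets pred: (P, 2)
    and gt: (Q, 2)."""
    d = torch.cdist(pred, gt)                       # (P, Q) pairwise distances
    return torch.max(d.min(dim=1).values.max(),     # farthest pred from gt
                     d.min(dim=0).values.max())     # farthest gt from pred
```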
We source the region data and annotations from Indiscapes - a large-scale #historical #manuscripts dataset.
BoundaryNet outperforms strong semi-automatic baselines. In particular, it performs extremely well on the most common region class - Character Line Segment.
A visual illustration of BoundaryNet’s superior-quality boundaries compared to baseline approaches.
BoundaryNet’s boundary predictions are more accurate than those of fully automatic methods. In particular, note that BoundaryNet’s predictions enclose text lines properly and completely.
When deployed for annotation, timing analysis reveals that BoundaryNet reduces overall annotation time, including correction time. BoundaryNet’s effective annotation time is lower than even that of fully automatic approaches, thanks to the high quality of the generated boundaries.
Code, pre-trained models and an interactive viewer for data and predictions are available at ihdia.iiit.ac.in/BoundaryNet/
BoundaryNet was made possible by the efforts of @Abhishe53242750 👏👏 .