Config Router


How to extract data from a PDF file while keeping track of its structure?

August 20, 2021 by James Palmer

There is essentially no easy cut-and-paste solution, because the PDF format isn’t really concerned with structure. Many other answers on this site cover this in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the most control over the process), you’ll have to loop over all the text on each page and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, and so on).
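As a rough illustration of the font-size heuristic, here is a minimal Python sketch. It assumes some PDF library has already produced a list of text fragments with their font properties. The `Fragment` type, the `find_headers` helper and the 1.2 size ratio are all hypothetical choices made for this example, not part of any real library.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment as a PDF library might report it."""
    text: str
    font: str
    size: float


def body_font_size(fragments):
    """Guess the body text size: the size that covers the most characters."""
    weights = Counter()
    for frag in fragments:
        weights[frag.size] += len(frag.text)
    return weights.most_common(1)[0][0]


def find_headers(fragments, ratio=1.2):
    """Return fragments whose font size is clearly larger than the body size.

    The ratio is an arbitrary threshold (an assumption) and needs tuning
    per document family.
    """
    body = body_font_size(fragments)
    return [frag for frag in fragments if frag.size >= body * ratio]


sample = [
    Fragment("1 Introduction", "Helvetica-Bold", 16.0),
    Fragment("PDF stores text as positioned glyph runs.", "Times-Roman", 10.0),
    Fragment("There is no paragraph object.", "Times-Roman", 10.0),
]
print([frag.text for frag in find_headers(sample)])
```

A real document would also need checks on the font name or weight, since many headers are bold text at the body size.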
On top of that, you’ll also have to identify paragraphs by looking at the positioning of text fragments, the white space on the page, and the closeness of certain letters, words and lines. PDF by itself doesn’t even have a concept of a “word”, let alone “lines” or “paragraphs”.
To complicate things even more, the order in which text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn’t have to be the proper reading order (or what we humans would consider the proper reading order).
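The positional grouping described above can be sketched in Python as follows. This is only a simplified illustration that assumes a single-column page and fragments that already carry coordinates. The `Fragment` type, both helper functions and the tolerance values are hypothetical. Fragments are sorted by position first, so the order in which they were drawn does not matter.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text fragment (PDF coordinates, y grows upward)."""
    text: str
    x: float
    y: float
    size: float


def group_into_lines(fragments, y_tolerance=2.0):
    """Group fragments with nearly equal baselines, top to bottom, left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    for line in lines:
        line.sort(key=lambda f: f.x)
    return lines


def group_into_paragraphs(lines, gap_factor=1.5):
    """Start a new paragraph when the vertical gap exceeds gap_factor * font size."""
    paragraphs = []
    previous = None
    for line in lines:
        text = " ".join(frag.text for frag in line)
        first = line[0]
        if previous is not None and (previous.y - first.y) <= gap_factor * first.size:
            paragraphs[-1] += " " + text
        else:
            paragraphs.append(text)
        previous = first
    return paragraphs


# Fragments deliberately listed out of reading order, as a PDF might draw them.
drawn = [
    Fragment("world", 110.0, 700.0, 10.0),
    Fragment("New paragraph.", 72.0, 660.0, 10.0),
    Fragment("Hello", 72.0, 700.0, 10.0),
    Fragment("again.", 72.0, 688.0, 10.0),
]
print(group_into_paragraphs(group_into_lines(drawn)))
```

Multi-column layouts, tables and rotated text break these assumptions, which is a large part of why general-purpose extraction is hard.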

Parsing a PDF for headers and their sub-contents is really very difficult (which doesn’t mean it’s impossible), as PDFs come in various formats. I recently came across a tool named GROBID which can help in this scenario. I know it’s not perfect, but with proper training it can accomplish our goals.
GROBID is available as open source on GitHub:
https://github.com/kermitt2/grobid
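
GROBID emits its results as TEI XML, in which sections appear as `div` elements with a `head` child. The sketch below assumes that such a TEI document has already been obtained from GROBID and only shows how headers and their paragraphs might be pulled out with Python's standard library. The `sections_from_tei` helper and the sample XML are hypothetical and heavily simplified; real GROBID output is much richer.

```python
import xml.etree.ElementTree as ET

# Namespace used by TEI documents.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}


def sections_from_tei(tei_xml):
    """Return a list of (heading, [paragraph, ...]) tuples from a TEI string."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", NS)
    if body is None:
        return []
    sections = []
    for div in body.iter("{%s}div" % TEI):
        head = div.find("tei:head", NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        paragraphs = [
            "".join(p.itertext()).strip() for p in div.findall("tei:p", NS)
        ]
        sections.append((title, paragraphs))
    return sections


# Hand-written sample in the general shape of a TEI body (an assumption).
sample_tei = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div><head n="1">Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
      <div><head n="2">Method</head><p>Details.</p></div>
    </body>
  </text>
</TEI>
"""

for title, paragraphs in sections_from_tei(sample_tei):
    print(title, len(paragraphs))
```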
