docxlatex

docxlatex is a lightweight Python package for extracting text and mathematical equations from .docx files.
Influenced by python-docx and python-docx2txt, it extends what those two libraries offer by adding support for extracting equations from .docx documents.
docxlatex does not convert the entire .docx file into a TeX/LaTeX source file. Instead, it extracts the equations present in the document, converts them into valid LaTeX syntax, and wraps them in delimiters.
To extract equations inserted into the document, you must first convert them into linear format.

To convert all your equations into linear format, click on any equation, go to the Equation tab, make sure LaTeX is selected, and click on Convert → All - Linear.

Installation

Install using pip:

pip install docxlatex

Usage

Usage is straightforward. For standard usage you will probably only need one method on the Document class.

Import the Document class from docxlatex

from docxlatex import Document

Instantiate an object of the Document class, giving it either the path to the .docx file or a file-like object.

docx = Document('/path/to/document')

# OR

# Open in binary mode: a .docx file is a ZIP archive, not plain text
f = open('/path/to/document', 'rb')
docx = Document(f)
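
When passing a file-like object, the file has to be read as bytes, because a .docx file is a ZIP archive. The following is a minimal sketch of one way to prepare such an object in memory. The helper name load_docx_stream is hypothetical and not part of docxlatex, and the commented-out last line assumes that Document accepts an io.BytesIO stream like any other file-like object.

```python
import io
import zipfile


def load_docx_stream(path):
    """Read a .docx file into an in-memory binary stream (hypothetical helper)."""
    # Binary mode is required: text mode would try to decode the ZIP bytes.
    with open(path, "rb") as f:
        data = f.read()
    stream = io.BytesIO(data)
    # A .docx file is a ZIP archive, so reject anything that is not one.
    if not zipfile.is_zipfile(stream):
        raise ValueError(f"{path!r} is not a valid .docx (ZIP) file")
    stream.seek(0)  # rewind so the consumer reads from the start
    return stream


# Assumed usage with docxlatex (not executed here):
# docx = Document(load_docx_stream('/path/to/document'))
```

Reading the whole file into memory also means the file handle on disk is closed before extraction starts.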

Call the get_text() method on the Document object to extract all text from the document.

text = docx.get_text()

API Reference

class Document

A class representing a .docx document. A thin wrapper with methods to extract text from the document.

Document.document

The .docx document from which text is to be extracted. Stored as either a file-like object or a path to a file, and only read when the Document.get_text() method is called.

Document.inline_delimiter

The delimiter to wrap inline equations in. Defaults to '$'. Every equation that appears within a continuous line of text is wrapped in this delimiter.

The slope of a straight line is given by $ \frac{y_2 - y_1}{x_2 - x_1} $.
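
Because inline equations come back wrapped in a known delimiter, they can be pulled out of the extracted text with a regular expression. The sketch below is illustrative only: find_inline_equations is a hypothetical helper, not part of docxlatex, and it assumes the text contains only inline delimiters (no '$$' blocks).

```python
import re


def find_inline_equations(text, delimiter="$"):
    """Return the LaTeX bodies of equations wrapped in `delimiter`."""
    d = re.escape(delimiter)  # '$' is a regex metacharacter, so escape it
    # Non-greedy match of everything between one pair of delimiters,
    # ignoring the padding spaces just inside them.
    pattern = re.compile(rf"{d}\s*(.+?)\s*{d}", re.DOTALL)
    return [m.group(1) for m in pattern.finditer(text)]


sample = r"The slope of a straight line is given by $ \frac{y_2 - y_1}{x_2 - x_1} $."
print(find_inline_equations(sample))
```

The same helper works with a custom delimiter, for example find_inline_equations(text, "@@").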

Document.block_delimiter

The delimiter to wrap equation blocks in. Defaults to '$$'. Every equation that sits on its own line, preceded and/or followed by a newline, is wrapped in this delimiter.

Markov's theorem states that the probability that a random variable R is greater than or equal to some value x is at most the expected value of the random variable divided by x:

$$ Pr\left(R \geq x\right) \leq \frac{Ex\left(R\right)}{x} $$
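
With the default delimiters, '$$' contains '$', so a naive search for inline equations would misread a block delimiter as two inline ones. A minimal sketch of one way around this is to match block equations first and search for inline equations in what remains. The function split_equations is hypothetical and not part of docxlatex, and the sample string simply combines the two examples above.

```python
import re


def split_equations(text, inline="$", block="$$"):
    """Return (block_equations, inline_equations) found in `text`."""
    b, i = re.escape(block), re.escape(inline)
    block_re = re.compile(rf"{b}\s*(.+?)\s*{b}", re.DOTALL)
    blocks = block_re.findall(text)
    # Remove block equations first so '$$' is not misread as two '$'.
    remainder = block_re.sub(" ", text)
    inlines = re.findall(rf"{i}\s*(.+?)\s*{i}", remainder, re.DOTALL)
    return blocks, inlines


sample = "\n".join([
    r"The slope of a straight line is given by $ \frac{y_2 - y_1}{x_2 - x_1} $.",
    "",
    r"$$ Pr\left(R \geq x\right) \leq \frac{Ex\left(R\right)}{x} $$",
])
blocks, inlines = split_equations(sample)
print(blocks)
print(inlines)
```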

Document.get_text(self, get_header_text=False, get_footer_text=False, image_dir=None, extensions=None)