pdf2chem package

Submodules

pdf2chem.pdf2chem module

Module contents

pdf2chem.aggregate_csv_files()
pdf2chem.curate_folder(pdf_dir='/home/docs/checkouts/readthedocs.org/user_builds/pdf2chem/checkouts/latest/docs')

Extract known chemicals from a folder of PDF files. Export a .csv file of SMILES strings (a machine-readable chemical format) for each file, plus a combined .csv for all the PDF files.

Extract text from a PDF file. Use chemdataextractor’s NLP to identify chemical entities. Attempt to resolve each entity at NIH’s CACTVS service. Organize the chemicals recognized by PubChem into a dataframe. Export the chemical names and SMILES strings as a .csv file. Repeat for each PDF file in the folder.

Parameters

pdf_dir (string, optional) – path to a folder of PDF files (the default is the current working directory; the path shown in the signature is the working directory at the time the documentation was built)
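
The per-file and combined export described above can be sketched with the standard library alone. This is a minimal illustration, not pdf2chem's implementation: `curate_folder_sketch`, `write_chemicals_csv`, the `extract_chemicals` callback, the `combined.csv` file name, and the column names are all hypothetical. The callback stands in for the text extraction, entity recognition, and name resolution steps, which this sketch does not attempt.

```python
import csv
from pathlib import Path


def write_chemicals_csv(rows, csv_path):
    """Write (name, smiles) pairs to csv_path under a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["name", "smiles"])  # hypothetical column names
        writer.writerows(rows)


def curate_folder_sketch(pdf_dir, extract_chemicals):
    """Hypothetical stand-in for the documented folder workflow.

    extract_chemicals is a caller-supplied function that maps a PDF path
    to a list of (name, smiles) pairs. It replaces the real extraction
    and resolution steps, which are outside the scope of this sketch.
    """
    pdf_dir = Path(pdf_dir)
    combined = []
    # Process the PDFs in sorted order so the combined file is reproducible.
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        rows = extract_chemicals(pdf_path)
        # One .csv per PDF, written next to the PDF itself.
        write_chemicals_csv(rows, pdf_path.with_suffix(".csv"))
        combined.extend((pdf_path.name, name, smiles) for name, smiles in rows)

    # One combined .csv for the whole folder (file name is an assumption).
    combined_path = pdf_dir / "combined.csv"
    with open(combined_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["source_pdf", "name", "smiles"])
        writer.writerows(combined)
    return combined_path
```

With the package itself installed, the documented entry point takes only the folder, for example `pdf2chem.curate_folder(pdf_dir='papers')`, where `papers` is a placeholder folder name.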

pdf2chem.quick_curate(pdf_path, pdf_method, false_positives, regex_number)