pdf2chem package
Submodules
pdf2chem.pdf2chem module
Module contents
-
pdf2chem.curate_folder(pdf_dir='/home/docs/checkouts/readthedocs.org/user_builds/pdf2chem/checkouts/latest/docs')
Extract known chemicals from a folder of PDF files. For each PDF, export a .csv file of SMILES strings (a machine-readable chemical format), and also export a combined .csv covering all the PDF files.
For each PDF file in the folder, the function extracts the text, uses chemdataextractor’s NLP to identify chemical entities, attempts to resolve each entity at NIH’s CACTVS service, organizes the chemicals recognized by PubChem into a dataframe, and exports the chemical names and SMILES strings as a .csv file.
- Parameters
pdf_dir (string, optional) – path to a folder of pdf files (the default is the current working directory)
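The per-file and combined export described above can be sketched with the standard library alone. This is not pdf2chem's implementation: the function names curate_folder_sketch and write_rows, the extract_names and resolve_smiles callables, the column names, and the combined.csv file name are all hypothetical, chosen only to illustrate the loop over a folder.

```python
import csv
from pathlib import Path


def write_rows(csv_path, rows, fieldnames):
    """Write a list of dicts to csv_path with a header row."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def curate_folder_sketch(pdf_dir, extract_names, resolve_smiles):
    """Hypothetical stand-in for the documented workflow, not pdf2chem's code.

    extract_names(path) -> iterable of chemical names found in one PDF
    resolve_smiles(name) -> SMILES string, or None if the name is unresolved
    """
    pdf_dir = Path(pdf_dir)
    combined = []
    for pdf_path in sorted(pdf_dir.glob("*.pdf")):
        rows = []
        for name in extract_names(pdf_path):
            smiles = resolve_smiles(name)
            if smiles is not None:  # keep only names that resolved
                rows.append({"name": name, "smiles": smiles})
        # One .csv per PDF, written next to the PDF (assumed location).
        write_rows(pdf_path.with_suffix(".csv"), rows, ["name", "smiles"])
        combined.extend({"source": pdf_path.name, **row} for row in rows)
    # One combined .csv for the whole folder (assumed file name).
    write_rows(pdf_dir / "combined.csv", combined, ["source", "name", "smiles"])
    return combined
```

In this sketch the text extraction and name resolution are passed in as callables, so the loop can be tried without network access or PDF parsing. With the package installed, the documented entry point is pdf2chem.curate_folder, which takes only the optional pdf_dir argument.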