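To extract text from a PDF, you can use either the PyPDF2 library or the pdfminer library in Python.
First, install the library you want to use. To install PyPDF2, run the following command in a terminal: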
pip install PyPDF2
Alternatively, run the following command to install pdfminer (published on PyPI as pdfminer.six):
pip install pdfminer.six
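To confirm that the installation worked before going further, you can check which of the two libraries is importable. The following is a minimal sketch using only the standard library; the names `PDF_PACKAGES` and `installed_pdf_packages` are hypothetical helpers introduced here for illustration, not part of either library.

```python
from importlib.util import find_spec

# Maps each pip package name from the commands above to its import name.
# Note that pdfminer.six is imported as "pdfminer".
PDF_PACKAGES = {"PyPDF2": "PyPDF2", "pdfminer.six": "pdfminer"}


def installed_pdf_packages():
    """Return the pip names of the PDF libraries importable in this environment."""
    return [
        pip_name
        for pip_name, module_name in PDF_PACKAGES.items()
        if find_spec(module_name) is not None
    ]


print(installed_pdf_packages())
```

An empty list means neither library is visible to the Python interpreter you are running, which usually indicates that pip installed into a different environment.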
Then use the example code below that matches the library you chose.
Example using PyPDF2:
import PyPDF2


def extract_text_from_pdf(file_path):
    """Return the text of every page in the PDF, concatenated."""
    text = ""
    with open(file_path, "rb") as file:
        # PdfReader and .pages replace PdfFileReader, numPages and getPage,
        # which PyPDF2 3.0 no longer supports.
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            # Pages without a text layer (e.g. scans) yield no text.
            text += page.extract_text() or ""
    return text


file_path = "path_to_your_pdf_file"
text = extract_text_from_pdf(file_path)
print(text)
Example using pdfminer:
import io

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def extract_text_from_pdf(file_path):
    """Return the text of every page in the PDF, concatenated."""
    resource_manager = PDFResourceManager()
    string_io = io.StringIO()
    # Passing LAParams enables layout analysis, which groups characters
    # into lines so the output keeps its spaces and line breaks.
    converter = TextConverter(resource_manager, string_io, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    with open(file_path, "rb") as file:
        for page in PDFPage.get_pages(file):
            page_interpreter.process_page(page)
    # Read the buffer once, after all pages have been processed.
    text = string_io.getvalue()
    converter.close()
    string_io.close()
    return text


file_path = "path_to_your_pdf_file"
text = extract_text_from_pdf(file_path)
print(text)
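Note that both examples read the PDF path from the variable file_path. The value "path_to_your_pdf_file" is only a placeholder, so replace it with the actual path to your PDF file.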
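Since both examples open whatever path they are given, it can help to validate the path first so that a typo produces a clear error message. The following is a minimal sketch using only the standard library; `resolve_pdf_path` is a hypothetical helper name, and the `.pdf` extension check is an assumption about how your files are named, not a requirement of either library.

```python
from pathlib import Path


def resolve_pdf_path(file_path):
    """Return file_path as a Path, or raise if it is not an existing .pdf file."""
    path = Path(file_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No such file: {path}")
    # Assumption: PDF files carry a .pdf extension (checked case-insensitively).
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Expected a .pdf file, got: {path.name}")
    return path
```

You could then call, for example, extract_text_from_pdf(resolve_pdf_path(file_path)), because open() accepts a Path object as well as a string.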