@uiux This is a bit like comparing McLaren and Hyundai
Hi UiUx,
Espacenet lists 20 patent applications that refer to DeepSeek, all of them Chinese.
The earliest is from 2020-03-25.
CN118132969A — Table-data-processing large language model training method and device, medium and equipment (2020-03-25)
With the advent of the digital age, a large amount of data is structured in tabular form and is widely used across industries and fields. From transaction monitoring in banking to human resource management to government censuses, tabular data plays an irreplaceable role in modern society. Although tabular data is ubiquitous, specific table-related tasks are usually time-consuming, labor-intensive, and demand professional skills. Automating these tabular tasks is of great significance to both academia and industry, and has attracted the attention of many researchers.
...
The present invention holds that expanding the short answers in the training data into full reasoning processes can enhance the reasoning ability of large language models. Since WikiTQ and FeTaQA provide only raw table data and no accompanying text, the present invention also uses GPT-3.5 to supplement table description information, in order to simulate real-world office scenarios that mix text and tables.
In some embodiments, the length of the document-embedded table is less than 500 tokens.
Since the answers in the training data are presented in text form, question-answer pairs are solved through the model's internal, parameter-driven approach. In line with the table tasks found in office scenarios, a constraint of fewer than 500 tokens is imposed on the table input so that the model can more accurately build the connection between questions and answers.
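As a rough illustration (not taken from the patent text), such a length constraint could be enforced by serializing each table to text and dropping any table whose token count exceeds the limit. The serialization format and the whitespace-split tokenizer below are assumptions standing in for whatever tokenizer the model actually uses:

```python
import pandas as pd

MAX_TOKENS = 500  # the token budget mentioned in the patent text


def serialize_table(df: pd.DataFrame) -> str:
    """Flatten a table into one text string: a header row plus data rows."""
    header = " | ".join(df.columns)
    rows = [" | ".join(str(v) for v in row) for row in df.itertuples(index=False)]
    return "\n".join([header] + rows)


def within_token_budget(df: pd.DataFrame, limit: int = MAX_TOKENS) -> bool:
    """Whitespace splitting as a crude stand-in for a real tokenizer."""
    return len(serialize_table(df).split()) <= limit


small = pd.DataFrame({"city": ["Paris", "Rome"], "pop_millions": [2.1, 2.8]})
print(within_token_budget(small))  # True: a two-row table is far under 500 tokens
```

In a real pipeline the filter would use the model's own tokenizer, since token counts differ substantially between tokenizers.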
Since data operations on tables embedded in spreadsheets mainly involve simple table inputs but complex table operations, such tasks are better suited to a code-driven approach. Given that the WikiSQL and Spider training sets pair questions with SQL queries, this example selects training data from them and optionally expands the reasoning process of each question-answer pair into Pandas code, for example via the DeepSeek large language model, to ensure flexibility in data operations. The quality of the generated code is checked by requiring that executing the generated Pandas code yields results consistent with the original answers provided in the benchmark. For the Spider dataset, in order to stay consistent with the single-table operations needed in office scenarios, multi-table queries and SQL queries that return empty results are excluded, to better reflect actual applications.
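The execution-consistency check described above can be sketched as follows. The helper names, the `result` variable convention, and the use of `exec` are illustrative assumptions, not the patent's actual implementation:

```python
import pandas as pd


def run_generated_code(code: str, table: pd.DataFrame):
    """Execute model-generated Pandas code against a table.
    By convention (assumed here), the code stores its output in `result`."""
    scope = {"df": table.copy(), "pd": pd}
    exec(code, scope)  # NOTE: only acceptable in a trusted, sandboxed pipeline
    return scope.get("result")


def keep_example(code: str, table: pd.DataFrame, gold_answer) -> bool:
    """Retain a training example only if executing the generated code
    reproduces the original benchmark answer."""
    try:
        result = run_generated_code(code, table)
    except Exception:
        return False  # code that crashes is discarded outright
    return str(result) == str(gold_answer)


table = pd.DataFrame({"name": ["Alice", "Bob"], "score": [90, 75]})
generated = "result = df.loc[df['score'].idxmax(), 'name']"
print(keep_example(generated, table, "Alice"))  # True: execution matches the gold answer
```

A production version would normalize answers more carefully (numeric tolerance, ordering of multi-row results) before comparing.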
Step 1: Obtain table question-and-answer data, where the table question-and-answer data includes a table, questions, and answers.
In some embodiments, the table question-and-answer data is derived from one or more of the WikiTQ, FeTaQA, or TAT-QA datasets; the table question-and-answer data also comes from the WikiSQL and/or Spider datasets.
In a specific implementation, the initial table question-answering data comes from existing table question-answering benchmark datasets. Specifically, for queries over document-embedded tables, the training data is collected mainly from widely used table question-answering benchmarks, including WikiTQ, FeTaQA, and TAT-QA. For data operations on spreadsheet-embedded tables, the training data is obtained from the two Text2SQL benchmarks, WikiSQL and Spider.
Step 2: Input the table and question into the intelligent question-answering model to generate an extended answer from the original answer, and form extended training data from the table, the question, and the extended answer.
In some embodiments, the intelligent question-answering model includes one or more of GPT-3.5, GPT-4, and deepseek-coder-33B.
Step 3: Generate a new question based on the table, input the table and the new question into the intelligent question-answering model, and generate a first answer and a second answer using different methods.
Step 4: Compare the first answer and the second answer for cross-validation. If the cross-validation passes, generate training data from the table, the new question, and the first answer and/or the second answer.
…