Enhancing Excel File Support in Apache SeaTunnel 2.3.5: Streamlining Data Integration for the Modern Enterprise

Apache SeaTunnel
5 min readJan 8, 2025

--

Overview

In data integration scenarios, Excel files are a common data source. Their diverse formats and complex functionalities often pose challenges for developers.

This modification is based on the SeaTunnel 2.3.4 version and includes:

  • Automatic recognition of .xlsx and .xls file types, no longer relying on file extensions.
  • New support for parsing the values of formula cells.
  • Enhanced fault tolerance in data type conversions.

With these improvements, SeaTunnel now better supports reading Excel files, automatically detecting .xlsx and .xls formats, reading formula cell values, and improving usability and stability for Excel data handling scenarios.

Relevant Modules

This update to Excel support is located in the connector-file-base module.

Enhanced File Type Detection for .xls and .xlsx

SeaTunnel’s handling of Excel files is located within the ExcelReadStrategy class. In the original version, file type detection was based on the file extension. This approach failed when the file name was randomly generated or lacked an extension, leading to parsing failures. The original code looked like this:

// org.apache.seatunnel.connectors.seatunnel.file.source.reader.ExcelReadStrategy

@SneakyThrows
@Override
public void read(String path, String tableId, Collector<SeaTunnelRow> output) {
Map<String, String> partitionsMap = parsePartitionsByPath(path);
FSDataInputStream file = hadoopFileSystemProxy.getInputStream(path);
Workbook workbook;

if (path.endsWith(".xls")) {
workbook = new HSSFWorkbook(file);
} else if (path.endsWith(".xlsx")) {
workbook = new XSSFWorkbook(file);
} else {
throw new FileConnectorException(
CommonErrorCodeDeprecated.UNSUPPORTED_OPERATION,
"Only support read excel file");
}

Sheet sheet =
pluginConfig.hasPath(BaseSourceConfigOptions.SHEET_NAME.key())
? workbook.getSheet(
pluginConfig.getString(BaseSourceConfigOptions.SHEET_NAME.key()))
: workbook.getSheetAt(0);
cellCount = seaTunnelRowType.getTotalFields();
cellCount = partitionsMap.isEmpty() ? cellCount : cellCount + partitionsMap.size();
......
......
......
}

Using Apache POI’s WorkbookFactory.create for Workbook Creation

By introducing the WorkbookFactory.create method from Apache POI to create the workbook, POI can automatically determine the file type based on its content, without relying on the file extension. This approach improves compatibility and makes the solution less sensitive to file name formats.

// org.apache.seatunnel.connectors.seatunnel.file.source.reader.ExcelReadStrategy

@SneakyThrows
@Override
public void read(String path, String tableId, Collector<SeaTunnelRow> output) {
Map<String, String> partitionsMap = parsePartitionsByPath(path);
FSDataInputStream file = hadoopFileSystemProxy.getInputStream(path);
Workbook workbook;

workbook = WorkbookFactory.create(file);

Sheet sheet =
pluginConfig.hasPath(BaseSourceConfigOptions.SHEET_NAME.key())
? workbook.getSheet(
pluginConfig.getString(BaseSourceConfigOptions.SHEET_NAME.key()))
: workbook.getSheetAt(0);
cellCount = seaTunnelRowType.getTotalFields();
cellCount = partitionsMap.isEmpty() ? cellCount : cellCount + partitionsMap.size();
......
......
......
}

Supporting Formula Cell Value Reading

The method for reading cell values is located in the ExcelReadStrategy class's getCellValue method. The original implementation handles String, Boolean, Numeric, and Error types, but it doesn't support formulas (FORMULA type). To address this, we can also enhance the method to handle formula cells.

By introducing POI’s FormulaEvaluator, we can read the evaluated result of a formula cell, ensuring that even formula cells return their computed values correctly.

// org.apache.seatunnel.connectors.seatunnel.file.source.reader.ExcelReadStrategy

private Object getCellValue(CellType cellType, Cell cell) {
switch (cellType) {
case STRING:
return cell.getStringCellValue();
case BOOLEAN:
return cell.getBooleanCellValue();
case NUMERIC:
if (DateUtil.isCellDateFormatted(cell)) {
DataFormatter formatter = new DataFormatter();
return formatter.formatCellValue(cell);
}
return cell.getNumericCellValue();
case ERROR:
break;
default:
throw new FileConnectorException(
CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE,
String.format("[%s] type not support ", cellType));
}
return null;
}

Full Modification of the read Method

To support formula handling, the getCellValue method needs to accept a FormulaEvaluator parameter. This requires creating a FormulaEvaluator instance in the read method and passing it to getCellValue.

// org.apache.seatunnel.connectors.seatunnel.file.source.reader.ExcelReadStrategy

@SneakyThrows
@Override
public void read(String path, String tableId, Collector<SeaTunnelRow> output) {
Map<String, String> partitionsMap = parsePartitionsByPath(path);
FSDataInputStream file = hadoopFileSystemProxy.getInputStream(path);
Workbook workbook;

// Let POI determine the file type and create the corresponding workbook
workbook = WorkbookFactory.create(file);

// Create FormulaEvaluator instance
FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();

Sheet sheet = pluginConfig.hasPath(BaseSourceConfigOptions.SHEET_NAME.key())
? workbook.getSheet(pluginConfig.getString(BaseSourceConfigOptions.SHEET_NAME.key()))
: workbook.getSheetAt(0);
cellCount = seaTunnelRowType.getTotalFields();
cellCount = partitionsMap.isEmpty() ? cellCount : cellCount + partitionsMap.size();
......
......

IntStream.range((int) skipHeaderNumber, rowCount)
.mapToObj(sheet::getRow)
.filter(Objects::nonNull)
.forEach(
rowData -> {
int[] cellIndexes = indexes == null ? IntStream.range(0, cellCount).toArray() : indexes;
int z = 0;
SeaTunnelRow seaTunnelRow = new SeaTunnelRow(cellCount);
for (int j : cellIndexes) {
Cell cell = rowData.getCell(j);
seaTunnelRow.setField(z++, cell == null
? null
: convert(getCellValue(cell.getCellType(), cell, evaluator), fieldTypes[z - 1]));
}
if (isMergePartition) {
int index = seaTunnelRowType.getTotalFields();
for (String value : partitionsMap.values()) {
seaTunnelRow.setField(index++, value);
}
}
seaTunnelRow.setTableId(tableId);
output.collect(seaTunnelRow);
});
}

Enhancing Fault Tolerance in Type Conversion

The original convert method directly used parse methods for type conversion (e.g., Double.parseDouble), which didn't provide much fault tolerance. We need to improve this to handle errors more gracefully.

Here’s the original convert method:

// org.apache.seatunnel.connectors.seatunnel.file.source.reader.ExcelReadStrategy

@SneakyThrows
private Object convert(Object field, SeaTunnelDataType<?> fieldType) {
if (field == null) {
return "";
}
SqlType sqlType = fieldType.getSqlType();
switch (sqlType) {
case MAP:
case ARRAY:
return objectMapper.readValue((String) field, fieldType.getTypeClass());
case STRING:
return field.toString();
case DOUBLE:
return Double.parseDouble(field.toString());
case BOOLEAN:
return Boolean.parseBoolean(field.toString());
case FLOAT:
return (float) Double.parseDouble(field.toString());
case BIGINT:
return (long) Double.parseDouble(field.toString());
case INT:
return (int) Double.parseDouble(field.toString());
case TINYINT:
return (byte) Double.parseDouble(field.toString());
case SMALLINT:
return (short) Double.parseDouble(field.toString());
case DECIMAL:
return BigDecimal.valueOf(Double.parseDouble(field.toString()));
...
}
}

We’ve introduced a utility class StrToNumberUtil to provide safer, more tolerant type conversions:

package org.apache.seatunnel.connectors.seatunnel.file.utils;

import org.apache.commons.lang3.StringUtils;

import java.math.BigDecimal;
import java.util.Optional;

public class StrToNumberUtil {
public static Double str2Double(String str) {
if (StringUtils.isBlank(str)) {
return null;
}
try {
return Double.parseDouble(str.trim());
} catch (Exception e) {
return null;
}
}

public static Long str2Long(String str) {
if (StringUtils.isBlank(str)) {
return null;
}
str = str.trim();
if (str.indexOf('.') != str.lastIndexOf('.')) {
return null;
}
String sub = str.indexOf('.') >= 0 ? str.substring(0, str.indexOf('.')) : str;
try {
return Long.parseLong(sub);
} catch (Exception e) {
return null;
}
}

// Add similar methods for other types...
}

Then, in the convert method, we use these utility methods for type conversions:

// org.apache.seatunnel.connectors.seatunnel.file.source.reader.ExcelReadStrategy

@SneakyThrows
private Object convert(Object field, SeaTunnelDataType<?> fieldType) {
if (field == null) {
return "";
}
SqlType sqlType = fieldType.getSqlType();
switch (sqlType) {
case MAP:
case ARRAY:
return objectMapper.readValue((String) field, fieldType.getTypeClass());
case STRING:
return field.toString();
case DOUBLE:
return StrToNumberUtil.str2Double(field.toString());
case BOOLEAN:
return Boolean.parseBoolean(field.toString());
case FLOAT:
return StrToNumberUtil.str2Float(field.toString());
case BIGINT:
return StrToNumberUtil.str2Long(field.toString());
case INT:
return StrToNumberUtil.str2Int(field.toString());
case TINYINT:
return StrToNumberUtil.str2Byte(field.toString());
case SMALLINT:
return StrToNumberUtil.str2Short(field.toString());
case DECIMAL:
return new BigDecimal(field.toString());
...
}
}

Conclusion

The improvements introduced in SeaTunnel 2.3.5 for Excel file handling significantly enhance the robustness and ease of use. By incorporating file type auto-detection, formula value reading, and improved fault tolerance in type conversion, developers can now work more seamlessly with Excel data in SeaTunnel. These changes also contribute to better overall stability and reliability, making SeaTunnel an even more powerful tool for data integration tasks.

--

--

Apache SeaTunnel
Apache SeaTunnel

Written by Apache SeaTunnel

The next-generation high-performance, distributed, massive data integration tool.

No responses yet