Commit db0333b9 authored by bake.snn

Support reader and writer plugins for Phoenix 5.x

parent d4d1ea6a
# hbase20xsqlreader Plugin Documentation
___
## 1 Quick Introduction
The hbase20xsqlreader plugin reads data from Phoenix (SQL on HBase); the supported versions are HBase 2.X and Phoenix 5.X.
## 2 How It Works
In short, hbase20xsqlreader connects to the Phoenix QueryServer through the Phoenix thin client, generates a SELECT statement from the user configuration, sends it to the QueryServer to read data from HBase, assembles the returned rows into DataX's abstract dataset using DataX's own data types, and finally hands them to the downstream Writer.
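At the JDBC level this amounts to loading the thin-client driver and opening a `jdbc:phoenix:thin:` connection. The snippet below is only a minimal sketch of that mechanism, assuming a QueryServer at `http://127.0.0.1:8765` and a `TEST` table with `ID` and `NAME` columns as in the sample configuration below; it is an illustration, not the plugin's actual code.

```java
// Minimal sketch of what the reader does under the hood: connect to the
// Phoenix QueryServer through the thin client and run a generated SELECT.
// Address, table, and columns follow the sample configuration below.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ThinClientReadSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.phoenix.queryserver.client.Driver");
        String url = "jdbc:phoenix:thin:url=http://127.0.0.1:8765;serialization=PROTOBUF";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT ID, NAME FROM TEST")) {
            while (rs.next()) {
                // each row would be converted into a DataX Record and sent downstream
                System.out.println(rs.getObject("ID") + "\t" + rs.getString("NAME"));
            }
        }
    }
}
```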
## 3 Features
### 3.1 Sample Configuration
* Configure a job that extracts data from Phoenix and prints it locally:
```
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "hbase20xsqlreader",    //set the plugin to hbase20xsqlreader
                    "parameter": {
                        "queryServerAddress": "http://127.0.0.1:8765",  //address of the Phoenix QueryServer to connect to
                        "serialization": "PROTOBUF",    //serialization format used by the QueryServer
                        "table": "TEST",    //name of the table to read
                        "column": ["ID", "NAME"],   //columns to read
                        "splitKey": "ID"    //split column; must be the table's primary key
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {
                        "encoding": "UTF-8",
                        "print": true
                    }
                }
            }
        ],
        "setting": {
            "speed": {
                "channel": "3"
            }
        }
    }
}
```
### 3.2 Parameter Description
* **queryServerAddress**
    * Description: hbase20xsqlreader connects to the Phoenix QueryServer through the Phoenix thin client, so the address of the QueryServer must be provided here.
    * Required: yes <br />
    * Default: none <br />
* **serialization**
    * Description: serialization protocol used by the QueryServer.
    * Required: no <br />
    * Default: PROTOBUF <br />
* **table**
    * Description: name of the table to read.
    * Required: yes <br />
    * Default: none <br />
* **schema**
    * Description: schema the table belongs to.
    * Required: no <br />
    * Default: none <br />
* **column**
    * Description: set of column names to read from the Phoenix table, described as a JSON array. An empty value means all columns are read.
    * Required: no <br />
    * Default: all columns <br />
* **splitKey**
    * Description: column used to split the table for parallel reading. Splitting works in one of two ways: 1. divide the range between the column's minimum and maximum values evenly across the configured number of channels (only integer and string split columns are supported); 2. split at the points given in splitPoints. A sketch of the range-based split follows this parameter list.
    * Required: yes <br />
    * Default: none <br />
* **splitPoints**
    * Description: because splitting on the split column's min/max values cannot always avoid data hot spots, splitPoints lets you specify split points explicitly based on the characteristics of your data. It is recommended to choose split points from the Regions' start and end keys so that each query maps to a single Region.
    * Required: no <br />
    * Default: none <br />
* **where**
    * Description: additional filter condition applied to the query; every split carries this condition.
    * Required: no <br />
    * Default: none <br />
* **querySql**
    * Description: one or more query statements may be specified, but they must all return the same number and types of columns. You can write a single-table query or a multi-table join by hand. When this parameter is set, only queryServerAddress still needs to be configured; the remaining parameters are ignored and may be omitted.
    * Required: no <br />
    * Default: none <br />
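For the range-based split described under `splitKey`, the per-split WHERE clauses can be pictured roughly as follows. This is only a sketch under simplified assumptions (an integer split column whose MIN/MAX were already fetched, e.g. with `SELECT MIN(ID), MAX(ID) FROM TEST`); the `buildRanges` helper and its names are hypothetical and are not part of the plugin.

```java
// Illustrative only: derive per-split WHERE clauses for an integer split column.
import java.util.ArrayList;
import java.util.List;

public class SplitRangeSketch {
    static List<String> buildRanges(String splitKey, long min, long max, int channels) {
        List<String> conditions = new ArrayList<String>();
        long step = (long) Math.ceil((max - min + 1) / (double) channels); // rows per split
        for (long start = min; start <= max; start += step) {
            long end = Math.min(start + step, max + 1);
            // each split reads the half-open range [start, end)
            conditions.add(String.format("%s >= %d AND %s < %d", splitKey, start, splitKey, end));
        }
        return conditions;
    }

    public static void main(String[] args) {
        // e.g. ID spans [0, 999] and the job is configured with 3 channels
        for (String where : buildRanges("ID", 0, 999, 3)) {
            System.out.println("SELECT ID, NAME FROM TEST WHERE (" + where + ")");
        }
    }
}
```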
### 3.3 Type Conversion
hbase20xsqlreader currently supports most Phoenix types, but a few individual types are not supported yet, so please check your column types.
Below is the type conversion list hbase20xsqlreader uses for Phoenix types:
| DataX internal type | Phoenix data type |
| -------- | ----- |
| String | CHAR, VARCHAR |
| Bytes | BINARY, VARBINARY |
| Bool | BOOLEAN |
| Long | INTEGER, TINYINT, SMALLINT, BIGINT |
| Double | FLOAT, DECIMAL, DOUBLE |
| Date | DATE, TIME, TIMESTAMP |
## 4 Performance Report
## 5 Constraints and Limitations
* Only a single column can be used as the split column, and it must be the table's primary key
* When splitPoints is not set, automatic splitting is used; in that case the split column must be an integer or string type
* Table, schema, and column names are case sensitive and must match the actual Phoenix table exactly
* Data can only be read through the Phoenix QueryServer, so the QueryServer service must be running on your Phoenix cluster before this plugin can be used
## 6 FAQ
***
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>datax-all</artifactId>
<groupId>com.alibaba.datax</groupId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>hbase20xsqlreader</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<phoenix.version>5.0.0-HBase-2.0</phoenix.version>
</properties>
<dependencies>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-log4j12</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-queryserver</artifactId>
<version>${phoenix.version}</version>
<exclusions>
<exclusion>
<artifactId>servlet-api</artifactId>
<groupId>javax.servlet</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<version>2.0.44-beta</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-core</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-service-face</artifactId>
</exclusion>
</exclusions>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>plugin-rdbms-util</artifactId>
<version>0.0.1-SNAPSHOT</version>
<scope>compile</scope>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.properties</include>
</includes>
</resource>
</resources>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.6</source>
<target>1.6</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
<include>plugin_job_template.json</include>
</includes>
<outputDirectory>plugin/reader/hbase20xsqlreader</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>hbase20xsqlreader-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/reader/hbase20xsqlreader</outputDirectory>
</fileSet>
</fileSets>
<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/reader/hbase20xsqlreader/libs</outputDirectory>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>
package com.alibaba.datax.plugin.reader.hbase20xsqlreader;
public class Constant {
public static final String PK_TYPE = "pkType";
public static final Object PK_TYPE_STRING = "pkTypeString";
public static final Object PK_TYPE_LONG = "pkTypeLong";
public static final String DEFAULT_SERIALIZATION = "PROTOBUF";
public static final String CONNECT_STRING_TEMPLATE = "jdbc:phoenix:thin:url=%s;serialization=%s";
public static final String CONNECT_DRIVER_STRING = "org.apache.phoenix.queryserver.client.Driver";
public static final String SELECT_COLUMNS_TEMPLATE = "SELECT COLUMN_NAME, COLUMN_FAMILY FROM SYSTEM.CATALOG WHERE TABLE_NAME='%s' AND COLUMN_NAME IS NOT NULL";
public static String QUERY_SQL_TEMPLATE_WITHOUT_WHERE = "select %s from %s ";
public static String QUERY_SQL_TEMPLATE = "select %s from %s where (%s)";
public static String QUERY_MIN_MAX_TEMPLATE = "SELECT MIN(%s),MAX(%s) FROM %s";
public static String QUERY_COLUMN_TYPE_TEMPLATE = "SELECT %s FROM %s LIMIT 1";
public static String QUERY_SQL_PER_SPLIT = "querySqlPerSplit";
}
package com.alibaba.datax.plugin.reader.hbase20xsqlreader;
import com.alibaba.datax.common.plugin.RecordSender;
import com.alibaba.datax.common.spi.Reader;
import com.alibaba.datax.common.util.Configuration;
import java.util.List;
public class HBase20xSQLReader extends Reader {
public static class Job extends Reader.Job {
private Configuration originalConfig;
private HBase20SQLReaderHelper readerHelper;
@Override
public void init() {
this.originalConfig = this.getPluginJobConf();
this.readerHelper = new HBase20SQLReaderHelper(this.originalConfig);
readerHelper.validateParameter();
}
@Override
public List<Configuration> split(int adviceNumber) {
return readerHelper.doSplit(adviceNumber);
}
@Override
public void destroy() {
// do nothing
}
}
public static class Task extends Reader.Task {
private Configuration readerConfig;
private HBase20xSQLReaderTask hbase20xSQLReaderTask;
@Override
public void init() {
this.readerConfig = super.getPluginJobConf();
hbase20xSQLReaderTask = new HBase20xSQLReaderTask(readerConfig, super.getTaskGroupId(), super.getTaskId());
}
@Override
public void startRead(RecordSender recordSender) {
hbase20xSQLReaderTask.readRecord(recordSender);
}
@Override
public void destroy() {
// do nothing
}
}
}
package com.alibaba.datax.plugin.reader.hbase20xsqlreader;
import com.alibaba.datax.common.spi.ErrorCode;
public enum HBase20xSQLReaderErrorCode implements ErrorCode {
REQUIRED_VALUE("Hbasewriter-00", "您缺失了必须填写的参数值."),
ILLEGAL_VALUE("Hbasewriter-01", "您填写的参数值不合法."),
GET_QUERYSERVER_CONNECTION_ERROR("Hbasewriter-02", "获取QueryServer连接时出错."),
GET_PHOENIX_TABLE_ERROR("Hbasewriter-03", "获取 Phoenix table时出错."),
GET_TABLE_COLUMNTYPE_ERROR("Hbasewriter-05", "获取表列类型时出错."),
CLOSE_PHOENIX_CONNECTION_ERROR("Hbasewriter-06", "关闭JDBC连接时时出错."),
ILLEGAL_SPLIT_PK("Hbasewriter-07", "非法splitKey配置."),
PHOENIX_COLUMN_TYPE_CONVERT_ERROR("Hbasewriter-08", "phoenix的列类型转换错误."),
QUERY_DATA_ERROR("Hbasewriter-09", "truncate hbase表时发生异常."),
;
private final String code;
private final String description;
private HBase20xSQLReaderErrorCode(String code, String description) {
this.code = code;
this.description = description;
}
@Override
public String getCode() {
return this.code;
}
@Override
public String getDescription() {
return this.description;
}
@Override
public String toString() {
return String.format("Code:[%s], Description:[%s].", this.code, this.description);
}
}
package com.alibaba.datax.plugin.reader.hbase20xsqlreader;
import com.alibaba.datax.common.element.*;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.plugin.RecordSender;
import com.alibaba.datax.common.statistics.PerfRecord;
import com.alibaba.datax.common.util.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.math.BigDecimal;
import java.sql.*;
public class HBase20xSQLReaderTask {
private static final Logger LOG = LoggerFactory.getLogger(HBase20xSQLReaderTask.class);
private Configuration readerConfig;
private int taskGroupId = -1;
private int taskId=-1;
public HBase20xSQLReaderTask(Configuration config, int taskGroupId, int taskId) {
this.readerConfig = config;
this.taskGroupId = taskGroupId;
this.taskId = taskId;
}
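/**
 * Executes the SELECT assigned to this split against the QueryServer and
 * forwards every returned row to the framework through the RecordSender.
 */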
public void readRecord(RecordSender recordSender) {
String querySql = readerConfig.getString(Constant.QUERY_SQL_PER_SPLIT);
LOG.info("Begin to read record by Sql: [{}\n] {}.", querySql);
HBase20SQLReaderHelper helper = new HBase20SQLReaderHelper(readerConfig);
Connection conn = helper.getConnection(readerConfig.getString(Key.QUERYSERVER_ADDRESS),
readerConfig.getString(Key.SERIALIZATION_NAME, Constant.DEFAULT_SERIALIZATION));
Statement statement = null;
ResultSet resultSet = null;
try {
long rsNextUsedTime = 0;
long lastTime = System.nanoTime();
statement = conn.createStatement();
// Record time spent executing the query
PerfRecord queryPerfRecord = new PerfRecord(taskGroupId, taskId, PerfRecord.PHASE.SQL_QUERY);
queryPerfRecord.start();
resultSet = statement.executeQuery(querySql);
queryPerfRecord.end();
ResultSetMetaData meta = resultSet.getMetaData();
int columnNum = meta.getColumnCount();
// Record total time spent in ResultSet.next()
PerfRecord allResultPerfRecord = new PerfRecord(taskGroupId, taskId, PerfRecord.PHASE.RESULT_NEXT_ALL);
allResultPerfRecord.start();
while (resultSet.next()) {
Record record = recordSender.createRecord();
rsNextUsedTime += (System.nanoTime() - lastTime);
for (int i = 1; i <= columnNum; i++) {
Column column = this.convertPhoenixValueToDataxColumn(meta.getColumnType(i), resultSet.getObject(i));
record.addColumn(column);
}
lastTime = System.nanoTime();
recordSender.sendToWriter(record);
}
allResultPerfRecord.end(rsNextUsedTime);
LOG.info("Finished read record by Sql: [{}\n] {}.", querySql);
} catch (SQLException e) {
throw DataXException.asDataXException(
HBase20xSQLReaderErrorCode.QUERY_DATA_ERROR, "查询Phoenix数据出现异常,请检查服务状态或与HBase管理员联系!", e);
} finally {
helper.closeJdbc(conn, statement, resultSet);
}
}
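/**
 * Converts a JDBC value returned by Phoenix into the corresponding DataX Column;
 * SQL types not listed here raise PHOENIX_COLUMN_TYPE_CONVERT_ERROR.
 */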
private Column convertPhoenixValueToDataxColumn(int sqlType, Object value) {
Column column;
switch (sqlType) {
case Types.CHAR:
case Types.VARCHAR:
column = new StringColumn((String) value);
break;
case Types.BINARY:
case Types.VARBINARY:
column = new BytesColumn((byte[]) value);
break;
case Types.BOOLEAN:
column = new BoolColumn((Boolean) value);
break;
case Types.INTEGER:
column = new LongColumn((Integer) value);
break;
case Types.TINYINT:
column = new LongColumn(((Byte) value).longValue());
break;
case Types.SMALLINT:
column = new LongColumn(((Short) value).longValue());
break;
case Types.BIGINT:
column = new LongColumn((Long) value);
break;
case Types.FLOAT:
column = new DoubleColumn((Float.valueOf(value.toString())));
break;
case Types.DECIMAL:
column = new DoubleColumn((BigDecimal)value);
break;
case Types.DOUBLE:
column = new DoubleColumn((Double) value);
break;
case Types.DATE:
column = new DateColumn((Date) value);
break;
case Types.TIME:
column = new DateColumn((Time) value);
break;
case Types.TIMESTAMP:
column = new DateColumn((Timestamp) value);
break;
default:
throw DataXException.asDataXException(
HBase20xSQLReaderErrorCode.PHOENIX_COLUMN_TYPE_CONVERT_ERROR, "遇到不可识别的phoenix类型," + "sqlType :" + sqlType);
}
return column;
}
}
package com.alibaba.datax.plugin.reader.hbase20xsqlreader;
public class Key {
/**
* [Required] Name of the table the reader reads from.
*/
public final static String TABLE = "table";
/**
* [Required] Columns the reader reads.
*/
public final static String COLUMN = "column";
/**
* [Required] Phoenix QueryServer address.
*/
public final static String QUERYSERVER_ADDRESS = "queryServerAddress";
/**
* [Optional] Serialization format, defaults to PROTOBUF.
*/
public static final String SERIALIZATION_NAME = "serialization";
/**
* [Optional] Schema the Phoenix table belongs to, empty by default.
*/
public static final String SCHEMA = "schema";
/**
* [Optional] Split column used when reading data.
*/
public static final String SPLIT_KEY = "splitKey";
/**
* [Optional] Split points used when reading data.
*/
public static final String SPLIT_POINT = "splitPoint";
/**
* [Optional] Filter condition applied when reading data.
*/
public static final String WHERE = "where";
/**
* [Optional] User-provided query statement(s).
*/
public static final String QUERY_SQL = "querySql";
}
{
"name": "hbase20xsqlreader",
"class": "com.alibaba.datax.plugin.reader.hbase20xsqlreader.HBase20xSQLReader",
"description": "useScene: prod. mechanism: read data from phoenix through queryserver.",
"developer": "bake"
}
{
"name": "hbase20xsqlreader",
"parameter": {
"queryserverAddress": "",
"serialization": "PROTOBUF",
"schema": "",
"table": "TABLE1",
"column": ["ID", "NAME"],
"splitKey": "rowkey",
"splitPoint":[],
"where": ""
}
}
# HBase20xsqlwriter Plugin Documentation
## 1. Quick Introduction
HBase20xsqlwriter bulk-imports data into SQL tables on HBase (Phoenix). Because Phoenix encodes data into the rowkey, writing directly through the HBase API would require manual data conversion, which is tedious and error-prone. This plugin writes data into Phoenix tables directly via SQL.
Under the hood, it executes UPSERT statements against Phoenix through the Phoenix QueryServer thin-client driver.
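For reference, a minimal sketch of this mechanism is shown below. It assumes a QueryServer at `http://127.0.0.1:8765` and a hypothetical Phoenix table `TEST` whose columns match the sample job further down; it only illustrates the thin-client UPSERT path and is not the writer's actual implementation.

```java
// Minimal sketch of the write path: open a thin-client connection to the
// QueryServer and issue UPSERTs, committing explicitly (auto-commit is off).
// The table name TEST is hypothetical; the columns follow the sample job below.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ThinClientUpsertSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.phoenix.queryserver.client.Driver");
        String url = "jdbc:phoenix:thin:url=http://127.0.0.1:8765;serialization=PROTOBUF";
        Connection conn = DriverManager.getConnection(url);
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
                "UPSERT INTO TEST (UID, TS, EVENTID, CONTENT) VALUES (?, ?, ?, ?)");
        try {
            ps.setString(1, "u001");
            ps.setString(2, "1546272000");
            ps.setString(3, "login");
            ps.setString(4, "hello");
            ps.executeUpdate();
            conn.commit();   // Phoenix maintains secondary indexes on commit
        } finally {
            ps.close();
            conn.close();
        }
    }
}
```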
### 1.1 Supported Features
* Tables with secondary indexes are supported, and all index tables are updated in sync
### 1.2 Limitations
* Requires Phoenix 5.x and HBase 2.x
* Data can only be imported through the Phoenix QueryServer, so the QueryServer service must be running on your Phoenix cluster before this plugin can be used
* Truncating existing table data is not supported
* Only tables created through Phoenix are supported; native HBase tables are not
* Importing data with explicit timestamps is not supported
## 2. How It Works
The plugin connects to the Phoenix QueryServer through the Phoenix thin client and executes UPSERT statements to write data into the table in batches. Because the write goes through the upper-level SQL interface, index tables are updated in sync.
## 3. Configuration
### 3.1 Sample Configuration
```json
{
"job": {
"entry": {
"jvm": "-Xms2048m -Xmx2048m"
},
"content": [
{
"reader": {
"name": "txtfilereader",
"parameter": {
"path": "/Users/shf/workplace/datax_test/hbase20xsqlwriter/txt/normal.txt",
"charset": "UTF-8",
"column": [
{
"index": 0,
"type": "String"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
}
],
"fieldDelimiter": ","
}
},
"writer": {
"name": "hbase20xsqlwriter",
"parameter": {
"batchSize": "100",
"column": [
"UID",
"TS",
"EVENTID",
"CONTENT"
],
"queryServerAddress": "http://127.0.0.1:8765",
"nullMode": "skip",
"table": "目标hbase表名,大小写有关"
}
}
}
],
"setting": {
"speed": {
"channel": 5
}
}
}
}
```
### 3.2 Parameter Description
* **name**
    * Description: plugin name; must be `hbase20xsqlwriter`
    * Required: yes
    * Default: none
* **table**
    * Description: name of the table to import into, case sensitive. Phoenix table names are normally in **upper case**
    * Required: yes
    * Default: none
* **column**
    * Description: column names, case sensitive. Phoenix column names are normally in **upper case**
        * Note that the column order must match the order of the columns output by the reader, one to one.
        * Data types do not need to be specified; column metadata is fetched from Phoenix automatically
    * Required: yes
    * Default: none
* **queryServerAddress**
    * Description: Phoenix QueryServer address; required. Format: http://${hostName}:${port}, e.g. http://172.16.34.58:8765
    * Required: yes
    * Default: none
* **serialization**
    * Description: serialization protocol used by the QueryServer
    * Required: no
    * Default: PROTOBUF
* **batchSize**
    * Description: maximum number of rows per batch write
    * Required: no
    * Default: 256
* **nullMode**
    * Description: how to handle a column whose source value is null. Two modes are currently supported (a small sketch of both follows this list):
        * skip: skip the column, i.e. do not insert it (if the column already had a value for that row, the value is deleted)
        * empty: insert an empty value; for numeric types the empty value is 0, for varchar it is the empty string
    * Required: no
    * Default: skip
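A minimal sketch of the two nullMode behaviors, assuming an integer column. The `bindNullable` helper is purely illustrative and not the writer's actual code; the real writer derives column types from Phoenix metadata.

```java
// Illustrative handling of a null source value, following the nullMode
// semantics described above.
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Types;

public class NullModeSketch {
    static void bindNullable(PreparedStatement ps, int index, Integer value,
                             String nullMode) throws SQLException {
        if (value != null) {
            ps.setInt(index, value);
        } else if ("skip".equalsIgnoreCase(nullMode)) {
            // "skip": write SQL NULL, which removes any previously stored cell for this row
            ps.setNull(index, Types.INTEGER);
        } else {
            // "empty": numeric columns get 0 (VARCHAR columns would get "")
            ps.setInt(index, 0);
        }
    }
}
```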
## 4. Performance Report
## 5. Constraints and Limitations
The column order defined in the writer must match the column order of the reader. The reader's column order defines how the columns are arranged in each output row, while the writer's column order defines the order in which the writer expects to receive them. For example:
reader column order: c1, c2, c3, c4
writer column order: x1, x2, x3, x4
Then reader column c1 is assigned to writer column x1. If the writer's column order were x1, x2, x4, x3, then c3 would be assigned to x4 and c4 to x3.
## 6. FAQ
1. How much concurrency is appropriate? Does increasing concurrency help when the job is slow?
The import process runs with a default JVM heap of 2GB. Concurrency (the number of channels) is implemented with multiple threads, and opening too many threads does not necessarily speed up the import; it may even hurt performance because of overly frequent GC. A concurrency (channel count) of 5-10 is generally recommended.
2. What is a reasonable value for batchSize?
The default is 256, but the best batchSize should be calculated from the size of each row. A single operation usually carries around 2MB-4MB of data; dividing that by the row size gives the batchSize. For example, if an average row is about 4KB, a batchSize of roughly 512-1024 keeps each batch within that 2MB-4MB range.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>datax-all</artifactId>
<groupId>com.alibaba.datax</groupId>
<version>0.0.1-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>
<artifactId>hbase20xsqlwriter</artifactId>
<version>0.0.1-SNAPSHOT</version>
<packaging>jar</packaging>
<properties>
<phoenix.version>5.0.0-HBase-2.0</phoenix.version>
<avatica.version>1.12.0</avatica.version>
<commons-codec.version>1.8</commons-codec.version>
</properties>
<dependencies>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-common</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-log4j12</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.phoenix</groupId>
<artifactId>phoenix-queryserver</artifactId>
<version>${phoenix.version}</version>
</dependency>
<!-- for test -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-core</artifactId>
<version>${datax-project-version}</version>
<exclusions>
<exclusion>
<groupId>com.alibaba.datax</groupId>
<artifactId>datax-service-face</artifactId>
</exclusion>
</exclusions>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-all</artifactId>
<version>1.9.5</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<resources>
<resource>
<directory>src/main/java</directory>
<includes>
<include>**/*.properties</include>
</includes>
</resource>
</resources>
<plugins>
<!-- compiler plugin -->
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.6</source>
<target>1.6</target>
<encoding>${project-sourceEncoding}</encoding>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptors>
<descriptor>src/main/assembly/package.xml</descriptor>
</descriptors>
<finalName>datax</finalName>
</configuration>
<executions>
<execution>
<id>dwzip</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
\ No newline at end of file
<assembly
xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
<id></id>
<formats>
<format>dir</format>
</formats>
<includeBaseDirectory>false</includeBaseDirectory>
<fileSets>
<fileSet>
<directory>src/main/resources</directory>
<includes>
<include>plugin.json</include>
<include>plugin_job_template.json</include>
</includes>
<outputDirectory>plugin/writer/hbase20xsqlwriter</outputDirectory>
</fileSet>
<fileSet>
<directory>target/</directory>
<includes>
<include>hbase20xsqlwriter-0.0.1-SNAPSHOT.jar</include>
</includes>
<outputDirectory>plugin/writer/hbase20xsqlwriter</outputDirectory>
</fileSet>
</fileSets>
<dependencySets>
<dependencySet>
<useProjectArtifact>false</useProjectArtifact>
<outputDirectory>plugin/writer/hbase20xsqlwriter/libs</outputDirectory>
<scope>runtime</scope>
</dependencySet>
</dependencySets>
</assembly>
package com.alibaba.datax.plugin.writer.hbase20xsqlwriter;
public final class Constant {
public static final String DEFAULT_NULL_MODE = "skip";
public static final String DEFAULT_SERIALIZATION = "PROTOBUF";
public static final int DEFAULT_BATCH_ROW_COUNT = 256; // write 256 rows per batch by default
public static final int TYPE_UNSIGNED_TINYINT = 11;
public static final int TYPE_UNSIGNED_SMALLINT = 13;
public static final int TYPE_UNSIGNED_INTEGER = 9;
public static final int TYPE_UNSIGNED_LONG = 10;
public static final int TYPE_UNSIGNED_FLOAT = 14;
public static final int TYPE_UNSIGNED_DOUBLE = 15;
public static final int TYPE_UNSIGNED_DATE = 19;
public static final int TYPE_UNSIGNED_TIME = 18;
public static final int TYPE_UNSIGNED_TIMESTAMP = 20;
}
package com.alibaba.datax.plugin.writer.hbase20xsqlwriter;
import com.alibaba.datax.common.exception.DataXException;
import com.alibaba.datax.common.util.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;
public class HBase20xSQLHelper {
private static final Logger LOG = LoggerFactory.getLogger(HBase20xSQLHelper.class);
/**
* Connection URL prefix for the Phoenix thin client
*/
public static final String CONNECT_STRING_PREFIX = "jdbc:phoenix:thin:";
/**
* Phoenix thin-client JDBC driver class name
*/
public static final String CONNECT_DRIVER_STRING = "org.apache.phoenix.queryserver.client.Driver";
/**
* Query against the system catalog to look up the configured table's columns
*/
public static final String SELECT_CATALOG_TABLE_STRING = "SELECT COLUMN_NAME FROM SYSTEM.CATALOG WHERE TABLE_NAME='%s' AND COLUMN_NAME IS NOT NULL";
/**
* Validates that the configured parameters are correct
*/
public static void validateParameter(com.alibaba.datax.common.util.Configuration originalConfig) {
// Table name and QueryServer address are required; throw if either is missing
String tableName = originalConfig.getNecessaryValue(Key.TABLE, HBase20xSQLWriterErrorCode.REQUIRED_VALUE);
String queryServerAddress = originalConfig.getNecessaryValue(Key.QUERYSERVER_ADDRESS, HBase20xSQLWriterErrorCode.REQUIRED_VALUE);
// Serialization format is optional and defaults to PROTOBUF
String serialization = originalConfig.getString(Key.SERIALIZATION_NAME, Constant.DEFAULT_SERIALIZATION);
String connStr = getConnectionUrl(queryServerAddress, serialization);
// Verify that the JDBC connection can be established
Connection conn = getThinClientConnection(connStr);
List<String> columnNames = originalConfig.getList(Key.COLUMN, String.class);
if (columnNames == null || columnNames.isEmpty()) {
throw DataXException.asDataXException(
HBase20xSQLWriterErrorCode.ILLEGAL_VALUE, "HBase的columns配置不能为空,请添加目标表的列名配置.");
}
String schema = originalConfig.getString(Key.SCHEMA);
// Check that the table and the configured columns exist
checkTable(conn, schema, tableName, columnNames);
}
/**
* Obtains a lightweight thin-client JDBC connection; callers must close it explicitly after use
*/
public static Connection getThinClientConnection(String connStr) {
LOG.debug("Connecting to QueryServer [" + connStr + "] ...");
Connection conn;
try {
Class.forName(CONNECT_DRIVER_STRING);
conn = DriverManager.getConnection(connStr);
conn.setAutoCommit(false);
} catch (Throwable e) {
throw DataXException.asDataXException(HBase20xSQLWriterErrorCode.GET_QUERYSERVER_CONNECTION_ERROR,
"无法连接QueryServer,配置不正确或服务未启动,请检查配置和服务状态或者联系HBase管理员.", e);
}
LOG.debug("Connected to QueryServer successfully.");
return conn;
}
public static Connection getJdbcConnection(Configuration conf) {
String queryServerAddress = conf.getNecessaryValue(Key.QUERYSERVER_ADDRESS, HBase20xSQLWriterErrorCode.REQUIRED_VALUE);
// Serialization format is optional and defaults to PROTOBUF
String serialization = conf.getString(Key.SERIALIZATION_NAME, "PROTOBUF");
String connStr = getConnectionUrl(queryServerAddress, serialization);
return getThinClientConnection(connStr);
}
public static String getConnectionUrl(String queryServerAddress, String serialization) {
String urlFmt = CONNECT_STRING_PREFIX + "url=%s;serialization=%s";
return String.format(urlFmt, queryServerAddress, serialization);
}
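/**
 * Checks, via SYSTEM.CATALOG, that the target table exists and that every
 * configured column is present in its metadata.
 */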
public static void checkTable(Connection conn, String schema, String tableName, List<String> columnNames) throws DataXException {
String selectSystemTable = getSelectSystemSQL(schema, tableName);
Statement st = null;
ResultSet rs = null;
try {
st = conn.createStatement();
rs = st.executeQuery(selectSystemTable);
List<String> allColumns = new ArrayList<String>();
if (rs.next()) {
allColumns.add(rs.getString(1));
} else {
LOG.error(tableName + "表不存在,请检查表名是否正确或是否已创建.", HBase20xSQLWriterErrorCode.GET_HBASE_TABLE_ERROR);
throw DataXException.asDataXException(HBase20xSQLWriterErrorCode.GET_HBASE_TABLE_ERROR,
tableName + "表不存在,请检查表名是否正确或是否已创建.");
}
while (rs.next()) {
allColumns.add(rs.getString(1));
}
for (String columnName : columnNames) {
if (!allColumns.contains(columnName)) {
// A user-configured column does not exist in the table metadata
throw DataXException.asDataXException(HBase20xSQLWriterErrorCode.ILLEGAL_VALUE,
"您配置的列" + columnName + "在目的表" + tableName + "的元数据中不存在,请检查您的配置或者联系HBase管理员.");
}
}
} catch (SQLException t) {
throw DataXException.asDataXException(HBase20xSQLWriterErrorCode.GET_HBASE_TABLE_ERROR,
"获取表" + tableName + "信息失败,请检查您的集群和表状态或者联系HBase管理员.", t);
} finally {
closeJdbc(conn, st, rs);
}
}
private static String getSelectSystemSQL(String schema, String tableName) {
String sql = String.format(SELECT_CATALOG_TABLE_STRING, tableName);
if (schema != null) {
sql = sql + " AND TABLE_SCHEM = '" + schema + "'";
}
return sql;
}
public static void closeJdbc(Connection connection, Statement statement, ResultSet resultSet) {
try {
if (resultSet != null) {
resultSet.close();
}
if (statement != null) {
statement.close();
}
if (connection != null) {
connection.close();
}
} catch (SQLException e) {
LOG.warn("数据库连接关闭异常.", HBase20xSQLWriterErrorCode.CLOSE_HBASE_CONNECTION_ERROR);
}
}
}
package com.alibaba.datax.plugin.writer.hbase20xsqlwriter;
import com.alibaba.datax.common.plugin.RecordReceiver;
import com.alibaba.datax.common.spi.Writer;
import com.alibaba.datax.common.util.Configuration;
import java.util.ArrayList;
import java.util.List;
public class HBase20xSQLWriter extends Writer {
public static class Job extends Writer.Job {
private Configuration config = null;
@Override
public void init() {
this.config = this.getPluginJobConf();
HBase20xSQLHelper.validateParameter(this.config);
}
@Override
public List<Configuration> split(int mandatoryNumber) {
List<Configuration> splitResultConfigs = new ArrayList<Configuration>();
for (int j = 0; j < mandatoryNumber; j++) {
splitResultConfigs.add(config.clone());
}
return splitResultConfigs;
}
@Override
public void destroy() {
//doNothing
}
}
public static class Task extends Writer.Task {
private Configuration taskConfig;
private HBase20xSQLWriterTask writerTask;
@Override
public void init() {
this.taskConfig = super.getPluginJobConf();
this.writerTask = new HBase20xSQLWriterTask(this.taskConfig);
}
@Override
public void startWrite(RecordReceiver lineReceiver) {
this.writerTask.startWriter(lineReceiver, super.getTaskPluginCollector());
}
@Override
public void destroy() {
// nothing to close
}
}
}
\ No newline at end of file
package com.alibaba.datax.plugin.writer.hbase20xsqlwriter;
import com.alibaba.datax.common.spi.ErrorCode;
public enum HBase20xSQLWriterErrorCode implements ErrorCode {
REQUIRED_VALUE("Hbasewriter-00", "您缺失了必须填写的参数值."),
ILLEGAL_VALUE("Hbasewriter-01", "您填写的参数值不合法."),
GET_QUERYSERVER_CONNECTION_ERROR("Hbasewriter-02", "获取QueryServer连接时出错."),
GET_HBASE_TABLE_ERROR("Hbasewriter-03", "获取 Hbase table时出错."),
CLOSE_HBASE_CONNECTION_ERROR("Hbasewriter-04", "关闭Hbase连接时出错."),
GET_TABLE_COLUMNTYPE_ERROR("Hbasewriter-05", "获取表列类型时出错."),
PUT_HBASE_ERROR("Hbasewriter-07", "写入hbase时发生IO异常."),
;
private final String code;
private final String description;
private HBase20xSQLWriterErrorCode(String code, String description) {
this.code = code;
this.description = description;
}
@Override
public String getCode() {
return this.code;
}
@Override
public String getDescription() {
return this.description;
}
@Override
public String toString() {
return String.format("Code:[%s], Description:[%s].", this.code, this.description);
}
}
package com.alibaba.datax.plugin.writer.hbase20xsqlwriter;
public class Key {
/**
* [Required] Name of the table the writer writes to.
*/
public final static String TABLE = "table";
/**
* [Required] Columns the writer writes.
*/
public final static String COLUMN = "column";
/**
* [Required] Phoenix QueryServer address.
*/
public final static String QUERYSERVER_ADDRESS = "queryServerAddress";
/**
* [Optional] Serialization format, defaults to PROTOBUF.
*/
public static final String SERIALIZATION_NAME = "serialization";
/**
* [Optional] Maximum number of rows per batch write, defaults to 256.
*/
public static final String BATCHSIZE = "batchSize";
/**
* [Optional] How to handle null source values, defaults to skip.
*/
public static final String NULLMODE = "nullMode";
/**
* [Optional] Schema the Phoenix table belongs to, empty by default.
*/
public static final String SCHEMA = "schema";
}
package com.alibaba.datax.plugin.writer.hbase20xsqlwriter;
import com.alibaba.datax.common.exception.DataXException;
import java.util.Arrays;
public enum NullModeType {
Skip("skip"),
Empty("empty")
;
private String mode;
NullModeType(String mode) {
this.mode = mode.toLowerCase();
}
public String getMode() {
return mode;
}
public static NullModeType getByTypeName(String modeName) {
for (NullModeType modeType : values()) {
if (modeType.mode.equalsIgnoreCase(modeName)) {
return modeType;
}
}
throw DataXException.asDataXException(HBase20xSQLWriterErrorCode.ILLEGAL_VALUE,
"Hbasewriter 不支持该 nullMode 类型:" + modeName + ", 目前支持的 nullMode 类型是:" + Arrays.asList(values()));
}
}
{
"name": "hbase20xsqlwriter",
"class": "com.alibaba.datax.plugin.writer.hbase20xsqlwriter.HBase20xSQLWriter",
"description": "useScene: prod. mechanism: use hbase sql UPSERT to put data, index tables will be updated too.",
"developer": "bake"
}
{
"name": "hbase20xsqlwriter",
"parameter": {
"queryServerAddress": "",
"table": "",
"serialization": "PROTOBUF",
"column": [
],
"batchSize": "100",
"nullMode": "skip",
"schema": ""
}
}
\ No newline at end of file
@@ -308,5 +308,19 @@
</includes>
<outputDirectory>datax</outputDirectory>
</fileSet>
<fileSet>
<directory>hbase20xsqlreader/target/datax/</directory>
<includes>
<include>**/*.*</include>
</includes>
<outputDirectory>datax</outputDirectory>
</fileSet>
<fileSet>
<directory>hbase20xsqlwriter/target/datax/</directory>
<includes>
<include>**/*.*</include>
</includes>
<outputDirectory>datax</outputDirectory>
</fileSet>
</fileSets>
</assembly>
@@ -89,6 +89,8 @@
<!-- common support module -->
<module>plugin-rdbms-util</module>
<module>plugin-unstructured-storage-util</module>
<module>hbase20xsqlreader</module>
<module>hbase20xsqlwriter</module>
</modules>
<dependencyManagement>
...