BR cannot back up to Alibaba Cloud OSS

Note:
This topic has been translated from a Chinese forum by GPT and might contain errors.

Original topic: BR 无法备份到阿里云OSS

| username: TiDBer_RywnG56h

[TiDB Usage Environment] Production Environment / Test / Poc
[TiDB Version] 7.5.0
[BR Version] 7.5.0 (7.5.1 and even 8.0 give the same error)
[Reproduction Path]

  1. Create an Alibaba Cloud ECS instance that can connect to the TiDB cluster, and install BR on it.
  2. In the Alibaba Cloud console, create a RAM role (e.g., ops-backup-runner) and grant it the AliyunOSSFullAccess and AliyunSTSAssumeRoleAccess permissions.
  3. Attach this role (ops-backup-runner) to the ECS instance created in step 1.
  4. Create an OSS bucket in the Alibaba Cloud console.
  5. Run the following command to back up data to the OSS bucket:
br backup full \
  --pd "192.168.6.15:2379,192.168.6.16:2379,192.168.6.12:2379" \
  --s3.endpoint "https://oss-cn-hangzhou.aliyuncs.com" \
  --s3.provider "alibaba" \
  --s3.region "oss-cn-hangzhou" \
  --log-level debug \
  --storage "s3://${YOUR_BUCKET_NAME}/tidb/test"
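
Because the endpoint and the region in the command above both repeat the region string, it is easy for the two to drift apart. A minimal bash sketch, assuming you wrap BR in a script: the hypothetical helper `build_br_backup_args` derives `--s3.endpoint` from `--s3.region` and collects the same flags as the command above into an array. It does not change how BR authenticates, so it does not address the error below.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of BR): assemble the `br backup full`
# arguments used in this post, deriving --s3.endpoint from --s3.region
# so both values always name the same OSS region.
build_br_backup_args() {
  local pd="$1"      # comma-separated PD addresses
  local region="$2"  # OSS region ID in the form used above, e.g. oss-cn-hangzhou
  local bucket="$3"  # bucket name
  local prefix="$4"  # path inside the bucket, e.g. tidb/test

  # Global array so the caller can expand it as "${BR_ARGS[@]}".
  BR_ARGS=(
    backup full
    --pd "$pd"
    --s3.endpoint "https://${region}.aliyuncs.com"
    --s3.provider alibaba
    --s3.region "$region"
    --log-level debug
    --storage "s3://${bucket}/${prefix}"
  )
}
```

Usage: call `build_br_backup_args "$PD" oss-cn-hangzhou your-bucket-name tidb/test`, then run `br "${BR_ARGS[@]}"`. Keeping the flags in an array preserves quoting for values that contain commas or special characters.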

[Encountered Problem: Problem Phenomenon and Impact]
BR cannot complete the backup. The RecommendDoc link in the error leads to an Alibaba Cloud page, "OpenAPI Self-Service Diagnosis - Alibaba Cloud OpenAPI Developer Portal", which says:
You are using an STS-type AccessKey ID but did not send the request with STS authentication.
Your request used an STS-type AccessKey ID but did not include the SecurityToken field to indicate that it uses STS authentication.

Detailed error log is as follows:

[2024/04/04 18:05:18.071 +08:00] [ERROR] [main.go:60] ["br failed"] [error="error happen in store 2 at 192.168.6.18:20160: Io(Custom { kind: Other, error: \"failed to put object rusoto error Request ID: None Body: <?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<Error>\\n  <Code>InvalidAccessKeyId</Code>\\n  <Message>The OSS Access Key Id you provided does not exist in our records. The Security Token may be lost to specify that it is a STS Access Id.</Message>\\n  <RequestId>660E7B5CF947FB36323C0335</RequestId>\\n  <HostId>17study-ops-backup.oss-cn-hangzhou.aliyuncs.com</HostId>\\n  <AWSAccessKeyId>STS.NUqcKhq7qeE7i4nKLu9KeaJbx</AWSAccessKeyId>\\n  <EC>0002-00000003</EC>\\n  <RecommendDoc>https://api.aliyun.com/troubleshoot?q=0002-00000003</RecommendDoc>\\n</Error>\\n\" }): [BR:KV:ErrKVStorage]tikv storage occur I/O error"] [errorVerbose="[BR:KV:ErrKVStorage]tikv storage occur I/O error\nerror happen in store 2 at 192.168.6.18:20160: Io(Custom { kind: Other, error: \"failed to put object rusoto error Request ID: None Body: <?xml version=\\\"1.0\\\" encoding=\\\"UTF-8\\\"?>\\n<Error>\\n  <Code>InvalidAccessKeyId</Code>\\n  <Message>The OSS Access Key Id you provided does not exist in our records. The Security Token may be lost to specify that it is a STS Access Id.</Message>\\n  <RequestId>660E7B5CF947FB36323C0335</RequestId>\\n  <HostId>17study-ops-backup.oss-cn-hangzhou.aliyuncs.com</HostId>\\n  <AWSAccessKeyId>STS.NUqcKhq7qeE7i4nKLu9KeaJbx</AWSAccessKeyId>\\n  <EC>0002-00000003</EC>\\n  <RecommendDoc>https://api.aliyun.com/troubleshoot?q=0002-00000003</RecommendDoc>\\n</Error>\\n\" })\ngithub.com/pingcap/tidb/br/pkg/backup.(*pushDown).pushBackup\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/push.go:218\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRange\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:962\ngithub.com/pingcap/tidb/br/pkg/backup.(*Client).BackupRanges.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/backup/client.go:876\ngithub.com/pingcap/tidb/br/pkg/utils.(*WorkerPool).ApplyOnErrorGroup.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/worker.go:76\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"] [stack="main.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/cmd/br/main.go:60\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"
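
The error text states the rule OSS applies here: an AccessKey ID with the `STS.` prefix is only accepted when the request also carries the matching security token. A minimal bash sketch of that rule; the function name `oss_sts_request_ok` is hypothetical, and the check only restates the error message, not OSS's actual server-side logic.

```shell
#!/usr/bin/env bash
# Hypothetical check (not an OSS or BR API): restate the rule from the
# InvalidAccessKeyId error above. Returns 0 if the request would carry the
# token that an STS AccessKey ID needs, 1 otherwise.
oss_sts_request_ok() {
  local access_key_id="$1"
  local security_token="${2-}"   # empty when the request omits the token

  if [[ "$access_key_id" == STS.* ]]; then
    # Temporary (STS) keys are only valid together with a SecurityToken.
    [[ -n "$security_token" ]]
  else
    # Long-term AccessKey IDs are sent without a token.
    return 0
  fi
}
```

For the key in the log, `oss_sts_request_ok "STS.NUqcKhq7qeE7i4nKLu9KeaJbx" ""` fails, which matches the situation the error describes.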

| username: DBAER | Original post link

You didn't add an access key.

| username: TiDBer_QYr0vohO | Original post link

AWS seems to work the same way: pass the AK and SK to BR directly when making requests.

| username: Daniel-W | Original post link

To provide the credentials, use the following format:

tiup br backup full \
    --pd ${PD} \
    --storage "${S3PREFIX}?access-key=${ACCESSKEY}&secret-access-key=${SECRETKEY}" \
    --s3.endpoint ${S3ENDPOINT} \
    --send-credentials-to-tikv=true
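
When the keys are embedded in the `--storage` URL like this, characters such as `+`, `&` or `%` in a secret could be misread, assuming BR parses the query string with standard URL decoding. A minimal bash sketch: the hypothetical helpers `urlencode` and `build_storage_url` percent-encode the keys before assembling the URL. This only changes how the URL is built, not which credentials BR uses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent-encode a string for use as a URL query value.
# Unreserved characters (RFC 3986) are kept; every other byte becomes %XX.
urlencode() {
  local LC_ALL=C            # iterate bytes, not multibyte characters
  local s="$1" out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      [A-Za-z0-9.~_-]) out+="$c" ;;
      *) printf -v hex '%%%02X' "'$c"; out+="$hex" ;;
    esac
  done
  printf '%s' "$out"
}

# Hypothetical helper: build the storage URL from the values that the
# command above calls ${S3PREFIX}, ${ACCESSKEY} and ${SECRETKEY}.
build_storage_url() {
  local prefix="$1" ak="$2" sk="$3"
  printf '%s?access-key=%s&secret-access-key=%s' \
    "$prefix" "$(urlencode "$ak")" "$(urlencode "$sk")"
}
```

Usage: `--storage "$(build_storage_url "$S3PREFIX" "$ACCESSKEY" "$SECRETKEY")"` in place of the hand-written URL.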
| username: Kamner | Original post link

The OSS Access Key Id you provided does not exist in our records. The Security Token may be lost to specify that it is a STS Access Id

| username: 友利奈绪 | Original post link

It looks like the authentication failed because the access key was not added.

| username: TiDBer_RywnG56h | Original post link

Alibaba Cloud OSS only supports STS mode. BR's source code uses the Alibaba Cloud SDK to obtain temporary credentials through the RAM role, so there is no way for us to supply those credentials manually.

| username: TiDBer_RywnG56h | Original post link

In the BR source code, if the endpoint contains aliyuncs.com, BR ignores any externally provided AK/SK and instead calls the Alibaba Cloud SDK to obtain STS credentials through the RAM role. That part is fine, but when BR puts files to OSS, the request does not carry the STS token. I think this is a bug.
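
The behavior described in this post can be modeled as a simple branch. The bash sketch below is a hypothetical model of the poster's description, not BR's actual Go code: an endpoint containing `aliyuncs.com` selects RAM-role STS credentials (AccessKeyId, AccessKeySecret and SecurityToken), and the reported bug is that the upload then sends only the first two. The function names and the `ram-role-sts` / `static-ak-sk` labels are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical model of the credential selection described in this post.
# It is NOT BR's source; it only restates the reported behavior.
credential_source() {
  local endpoint="$1"
  if [[ "$endpoint" == *aliyuncs.com* ]]; then
    # Reported: any AK/SK passed in is ignored and the Alibaba Cloud SDK
    # fetches temporary STS credentials for the instance's RAM role.
    echo "ram-role-sts"
  else
    # Simplified: treat every other endpoint as using the AK/SK given to BR.
    echo "static-ak-sk"
  fi
}

# Credential fields a request needs for each source.
required_fields() {
  case "$1" in
    ram-role-sts) echo "AccessKeyId AccessKeySecret SecurityToken" ;;
    static-ak-sk) echo "AccessKeyId AccessKeySecret" ;;
  esac
}

# Print every field in $1 (required) that is absent from $2 (actually sent).
missing_fields() {
  local f out=()
  for f in $1; do
    [[ " $2 " == *" $f "* ]] || out+=("$f")
  done
  echo "${out[*]}"
}
```

With the reported upload sending only `AccessKeyId AccessKeySecret`, `missing_fields "$(required_fields ram-role-sts)" "AccessKeyId AccessKeySecret"` prints `SecurityToken`, the field the OSS error complains about.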

| username: TiDBer_RywnG56h | Original post link

I modified the source code to add debug output and found that BR does successfully retrieve the STS token through the Alibaba Cloud SDK. However, the token is not passed along when the files are uploaded, and Alibaba Cloud requires it, so the uploads keep failing.

| username: TiDBer_RywnG56h | Original post link

Another observation: BR can create the backup directory and the backup lock file in OSS, but it cannot push the backup files themselves. I suspect two different code paths for calling the storage interface are involved.
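
The log in the first post fits this suspicion: the failing `put object` is reported from `store 2 at 192.168.6.18:20160` and is a `rusoto` error (rusoto is the Rust S3 client used by TiKV), so the rejected upload appears to come from a TiKV node, while the directory and lock file are presumably written by the BR process itself. A small bash sketch for pulling those facts out of a `br failed` line; the function name and its output format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize a BR "br failed" log line like the one above.
# Prints which store reported the error, the S3-compatible error code, and
# whether the rejected AccessKey ID was an STS (temporary) key.
summarize_br_error() {
  local line="$1" store code akid
  store=$(grep -o 'store [0-9][0-9]* at [0-9.]*:[0-9]*' <<<"$line" | head -n 1)
  code=$(grep -o '<Code>[^<]*</Code>' <<<"$line" | head -n 1 | sed 's/<[^>]*>//g')
  akid=$(grep -o '<AWSAccessKeyId>[^<]*</AWSAccessKeyId>' <<<"$line" \
    | head -n 1 | sed 's/<[^>]*>//g')

  echo "reported-by: ${store:-unknown}"
  echo "error-code: ${code:-unknown}"
  if [[ "$akid" == STS.* ]]; then
    echo "access-key: STS (needs SecurityToken)"
  else
    echo "access-key: ${akid:-unknown}"
  fi
  # The error text comes from rusoto, the Rust S3 client TiKV uses,
  # so a match marks the failed upload as TiKV-side.
  if grep -q 'rusoto' <<<"$line"; then
    echo "uploader: tikv"
  fi
}
```

Pass the whole `[ERROR] ... ["br failed"]` line as a single argument, for example `summarize_br_error "$(grep 'br failed' br.log)"`, where `br.log` stands for wherever the BR log was written.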

| username: TiDBer_RywnG56h | Original post link

Ordinary AK/SK doesn't work either, especially when backing up through a BackupSchedule in Kubernetes. Providing AK/SK directly results in a permission error, because BR queries an Alibaba Cloud IP address to obtain the RAM role's credentials and uses them to access OSS.
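
The "Alibaba Cloud IP address" here is presumably the ECS instance metadata service at `100.100.100.200`, which hands out the temporary credentials of the RAM role attached to the instance. Querying it yourself from the ECS host is a quick way to confirm that the role from the reproduction steps is attached and that the returned credentials include a `SecurityToken`. A minimal bash sketch: the role name `ops-backup-runner` comes from the first post, the helper names are hypothetical, and the JSON field names (`AccessKeyId`, `AccessKeySecret`, `SecurityToken`, `Expiration`) are an assumption based on Alibaba Cloud's metadata documentation.

```shell
#!/usr/bin/env bash
# Assumed metadata endpoint for RAM role credentials on an Alibaba Cloud ECS instance.
ECS_RAM_CREDS_URL="http://100.100.100.200/latest/meta-data/ram/security-credentials"

# Hypothetical helper: fetch the STS credentials JSON for a RAM role.
# Only works when run on the ECS instance the role is attached to.
fetch_ram_role_credentials() {
  local role="$1"
  curl -fsS --max-time 5 "${ECS_RAM_CREDS_URL}/${role}"
}

# Hypothetical helper: read JSON on stdin and print one string field,
# using sed only (assumes simple "Key" : "value" pairs).
json_string_field() {
  local field="$1"
  sed -n "s/.*\"${field}\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p" | head -n 1
}
```

Usage on the ECS host: `creds=$(fetch_ram_role_credentials ops-backup-runner)`, then `json_string_field SecurityToken <<<"$creds"` should print a non-empty token, and `json_string_field AccessKeyId <<<"$creds"` should print an `STS.`-prefixed ID like the one in the error log.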

| username: Daniel-W | Original post link

Alright, got it. This issue never occurred when I was using Tencent Cloud’s COS before. :sweat_smile:

| username: Swan | Original post link

Encountered the same problem, thanks for sharing.

| username: TiDBer_RywnG56h | Original post link

I also raised an issue on GitHub, but no one has responded yet :sob:

| username: 呢莫不爱吃鱼 | Original post link

Did you forget to add the access key?

| username: TiDBer_RywnG56h | Original post link

When BR accesses Alibaba Cloud OSS, it does not use AK/SK; it automatically obtains an STS token through the Alibaba Cloud SDK.

| username: zhang_2023 | Original post link

You need to add the AK/SK.

| username: 呢莫不爱吃鱼 | Original post link

Oh, then I'm not sure. We store our backups in a self-hosted MinIO.

| username: TiDBer_21wZg5fm | Original post link

It looks like a connection configuration issue.

| username: 像风一样的男子 | Original post link

So BR currently can't back up directly to Alibaba Cloud OSS, right?